[RFC,10/18] drm/scheduler: Add can_run_job callback

Message ID 20230307-rust-drm-v1-10-917ff5bc80a8@asahilina.net
State New
Series Rust DRM subsystem abstractions (& preview AGX driver)

Commit Message

Asahi Lina March 7, 2023, 2:25 p.m. UTC
  Some hardware may require more complex resource utilization accounting
than the simple job count supported by drm_sched internally. Add a
can_run_job callback to allow drivers to implement more logic before
deciding whether to run a GPU job.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
 include/drm/gpu_scheduler.h            |  8 ++++++++
 2 files changed, 18 insertions(+)
  

Comments

Christian König March 8, 2023, 8:46 a.m. UTC | #1
Am 07.03.23 um 15:25 schrieb Asahi Lina:
> Some hardware may require more complex resource utilization accounting
> than the simple job count supported by drm_sched internally. Add a
> can_run_job callback to allow drivers to implement more logic before
> deciding whether to run a GPU job.

Well complete NAK.

This is clearly going against the idea of having jobs only depend on 
fences and nothing else which is mandatory for correct memory management.

If the hw is busy with something you need to return the fence for this 
from the prepare_job callback so that the scheduler can be notified when 
the hw is available again.

Regards,
Christian.
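
As a rough driver-side sketch of the pattern being suggested here (the
mydrv_* names and helpers are hypothetical, not taken from any existing
driver): prepare_job returns an unsignaled fence whenever the hardware
cannot take the job yet, and NULL once nothing blocks it anymore.

  static struct dma_fence *
  mydrv_prepare_job(struct drm_sched_job *sched_job,
                    struct drm_sched_entity *s_entity)
  {
          struct mydrv_job *job = to_mydrv_job(sched_job);

          /* Hardware busy: hand the scheduler a fence that signals once
           * the hardware frees up.  The reference is consumed by the
           * scheduler, which waits for it and then calls prepare_job
           * again before finally invoking run_job. */
          if (mydrv_hw_busy(job->dev))
                  return mydrv_hw_get_idle_fence(job->dev);

          /* Nothing left to wait for. */
          return NULL;
  }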

>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>   include/drm/gpu_scheduler.h            |  8 ++++++++
>   2 files changed, 18 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4e6ad6e122bc..5c0add2c7546 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>   		if (!entity)
>   			continue;
>   
> +		if (sched->ops->can_run_job) {
> +			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> +			if (!sched_job) {
> +				complete_all(&entity->entity_idle);
> +				continue;
> +			}
> +			if (!sched->ops->can_run_job(sched_job))
> +				continue;
> +		}
> +
>   		sched_job = drm_sched_entity_pop_job(entity);
>   
>   		if (!sched_job) {
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 9db9e5e504ee..bd89ea9507b9 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>   	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>   					 struct drm_sched_entity *s_entity);
>   
> +	/**
> +	 * @can_run_job: Called before job execution to check whether the
> +	 * hardware is free enough to run the job.  This can be used to
> +	 * implement more complex hardware resource policies than the
> +	 * hw_submission limit.
> +	 */
> +	bool (*can_run_job)(struct drm_sched_job *sched_job);
> +
>   	/**
>            * @run_job: Called to execute the job once all of the dependencies
>            * have been resolved.  This may be called multiple times, if
>
  
Asahi Lina March 8, 2023, 9:41 a.m. UTC | #2
On 08/03/2023 17.46, Christian König wrote:
> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>> Some hardware may require more complex resource utilization accounting
>> than the simple job count supported by drm_sched internally. Add a
>> can_run_job callback to allow drivers to implement more logic before
>> deciding whether to run a GPU job.
> 
> Well complete NAK.
> 
> This is clearly going against the idea of having jobs only depend on 
> fences and nothing else which is mandatory for correct memory management.
> 
> If the hw is busy with something you need to return the fence for this 
> from the prepare_job callback so that the scheduler can be notified when 
> the hw is available again.

I think you misunderstood the intent here... This isn't about job
dependencies, it's about in-flight resource limits.

drm_sched already has a hw_submission_limit that specifies the number of
submissions that can be in flight, but that doesn't work for us because
each job from drm_sched's point of view consists of multiple commands
split among 3 firmware queues. The firmware can only support up to 128
work commands in flight per queue (barriers don't count), otherwise it
overflows a fixed-size buffer.

So we need more complex accounting of how many underlying commands are
in flight per queue to determine whether it is safe to run a new job,
and that is what this callback accomplishes. This has to happen even
when individual jobs have no buffer/resource dependencies between them
(which is what the fences would express).

You can see the driver implementation of that callback in
drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
does the actual available slot count checks.
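
A rough C sketch of what that check amounts to (structure and helper names
here are hypothetical; the 128-command limit and the up-to-three firmware
queues per job are taken from the description above):

  #define MYFW_MAX_COMMANDS 128   /* per firmware queue, barriers excluded */

  static bool mydrv_can_run_job(struct drm_sched_job *sched_job)
  {
          struct mydrv_job *job = to_mydrv_job(sched_job);
          int i;

          /* One drm_sched job expands into several firmware commands
           * split across up to three firmware queues; only run it if
           * every affected queue still has room for its share. */
          for (i = 0; i < job->num_fw_queues; i++) {
                  struct mydrv_fw_queue *q = job->fw_queue[i];

                  if (atomic_read(&q->inflight_commands) +
                      job->commands_for_queue[i] > MYFW_MAX_COMMANDS)
                          return false;
          }

          return true;
  }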

The can_run_job logic is written to mirror the hw_submission_limit logic
(just a bit later in the sched main loop since we need to actually pick
a job to do the check), and just like for that case, completion of any
job in the same scheduler will cause another run of the main loop and
another check (which is exactly what we want here).

This case (potentially scheduling more than the FW job limit) is rare
but handling it is necessary, since otherwise the entire job
completion/tracking logic gets screwed up on the firmware end and queues
end up stuck (I've managed to trigger this before).

~~ Lina
  
Christian König March 8, 2023, 10 a.m. UTC | #3
Am 08.03.23 um 10:41 schrieb Asahi Lina:
> On 08/03/2023 17.46, Christian König wrote:
>> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>>> Some hardware may require more complex resource utilization accounting
>>> than the simple job count supported by drm_sched internally. Add a
>>> can_run_job callback to allow drivers to implement more logic before
>>> deciding whether to run a GPU job.
>> Well complete NAK.
>>
>> This is clearly going against the idea of having jobs only depend on
>> fences and nothing else which is mandatory for correct memory management.
>>
>> If the hw is busy with something you need to return the fence for this
>> from the prepare_job callback so that the scheduler can be notified when
>> the hw is available again.
> I think you misunderstood the intent here... This isn't about job
> dependencies, it's about in-flight resource limits.
>
> drm_sched already has a hw_submission_limit that specifies the number of
> submissions that can be in flight, but that doesn't work for us because
> each job from drm_sched's point of view consists of multiple commands
> split among 3 firmware queues. The firmware can only support up to 128
> work commands in flight per queue (barriers don't count), otherwise it
> overflows a fixed-size buffer.
>
> So we need more complex accounting of how many underlying commands are
> in flight per queue to determine whether it is safe to run a new job,
> and that is what this callback accomplishes. This has to happen even
> when individual jobs have no buffer/resource dependencies between them
> (which is what the fences would express).

Yeah, I already assumed that you have something like this.

And to make it clear this is unfortunately a complete NAK to this 
approach! You can't do this!

The background is that core memory management requires that signaling a 
fence only depends on signaling other fences and hardware progress and 
nothing else. Otherwise you immediately run into problems because of 
circular dependencies or what we call infinite fences.

Jason Ekstrand gave a great presentation on that problem a few years
ago on LPC. I strongly suggest you google that one up.

> You can see the driver implementation of that callback in
> drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
> calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
> does the actual available slot count checks.
>
> The can_run_job logic is written to mirror the hw_submission_limit logic
> (just a bit later in the sched main loop since we need to actually pick
> a job to do the check), and just like for that case, completion of any
> job in the same scheduler will cause another run of the main loop and
> another check (which is exactly what we want here).

Yeah and that hw_submission_limit is based on a fence signaling again.

When you have some firmware limitation that a job needs resources which 
are currently in use by other submissions then those other submissions 
have fences as well and you can return those in the prepare_job callback.

If those other submissions don't have fences, then you have a major 
design problem inside your driver and we need to get back to square one 
and talk about that dependency handling.

> This case (potentially scheduling more than the FW job limit) is rare
> but handling it is necessary, since otherwise the entire job
> completion/tracking logic gets screwed up on the firmware end and queues
> end up stuck (I've managed to trigger this before).

Actually that's a pretty normal use case. I have rejected similar
requirements like this before as well.

For an example how this can work see amdgpu_job_prepare_job(): 
https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L251

The gang submit gives an example of a global fence lock and the VMIDs
are an example of a global shared firmware resource.

Regards,
Christian.

>
> ~~ Lina
  
Karol Herbst March 8, 2023, 12:39 p.m. UTC | #4
On Wed, Mar 8, 2023 at 9:46 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 07.03.23 um 15:25 schrieb Asahi Lina:
> > Some hardware may require more complex resource utilization accounting
> > than the simple job count supported by drm_sched internally. Add a
> > can_run_job callback to allow drivers to implement more logic before
> > deciding whether to run a GPU job.
>
> Well complete NAK.
>

There hasn't even been any kind of discussion yet you already come
around with a "Well complete NAK"

First, this can be seen as rude behavior and me being part of the drm
community I don't want to have to see this kind of thing.

Obviously, any kind of strong "technical" review point is a nak until
people settle with an agreement on what to land, there is no point in
pointing out a "NAK", especially if that's the first thing you say. If
you want to express your strong disagreement with the proposed
solution, then state what your pain points are directly.

If there is a long discussion and a maintainer feels it's going
nowhere and no conclusion will be reached it might be this kind of
"speaking with authority" point has to be made. But not as the starter
into a discussion. This is unnecessarily hostile towards the
contributor. And I wished we wouldn't have to see this kind of
behavior here.

Yes, some kernel maintainers do this a lot, but kernel maintainers
also have this kind of reputation and people don't want to have to
deal with this nonsense and decide to not contribute at all. So please
just drop this attitude.

> This is clearly going against the idea of having jobs only depend on
> fences and nothing else which is mandatory for correct memory management.
>

I'm sure it's all documented and there is a design document on how
things have to look like you can point out? Might help to get a better
understanding on how things should be.

> If the hw is busy with something you need to return the fence for this
> from the prepare_job callback so that the scheduler can be notified when
> the hw is available again.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Asahi Lina <lina@asahilina.net>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >   include/drm/gpu_scheduler.h            |  8 ++++++++
> >   2 files changed, 18 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 4e6ad6e122bc..5c0add2c7546 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> >               if (!entity)
> >                       continue;
> >
> > +             if (sched->ops->can_run_job) {
> > +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > +                     if (!sched_job) {
> > +                             complete_all(&entity->entity_idle);
> > +                             continue;
> > +                     }
> > +                     if (!sched->ops->can_run_job(sched_job))
> > +                             continue;
> > +             }
> > +
> >               sched_job = drm_sched_entity_pop_job(entity);
> >
> >               if (!sched_job) {
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 9db9e5e504ee..bd89ea9507b9 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> >       struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> >                                        struct drm_sched_entity *s_entity);
> >
> > +     /**
> > +      * @can_run_job: Called before job execution to check whether the
> > +      * hardware is free enough to run the job.  This can be used to
> > +      * implement more complex hardware resource policies than the
> > +      * hw_submission limit.
> > +      */
> > +     bool (*can_run_job)(struct drm_sched_job *sched_job);
> > +
> >       /**
> >            * @run_job: Called to execute the job once all of the dependencies
> >            * have been resolved.  This may be called multiple times, if
> >
>
  
Christian König March 8, 2023, 1:47 p.m. UTC | #5
Am 08.03.23 um 13:39 schrieb Karol Herbst:
> On Wed, Mar 8, 2023 at 9:46 AM Christian König <christian.koenig@amd.com> wrote:
>> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>>> Some hardware may require more complex resource utilization accounting
>>> than the simple job count supported by drm_sched internally. Add a
>>> can_run_job callback to allow drivers to implement more logic before
>>> deciding whether to run a GPU job.
>> Well complete NAK.
>>
> There hasn't even been any kind of discussion yet you already come
> around with a "Well complete NAK"
>
> First, this can be seen as rude behavior and me being part of the drm
> community I don't want to have to see this kind of thing.
>
> Obviously, any kind of strong "technical" review point is a nak until
> people settle with an agreement on what to land, there is no point in
> pointing out a "NAK", especially if that's the first thing you say. If
> you want to express your strong disagreement with the proposed
> solution, then state what your pain points are directly.
>
> If there is a long discussion and a maintainer feels it's going
> nowhere and no conclusion will be reached it might be this kind of
> "speaking with authority" point has to be made. But not as the starter
> into a discussion. This is unnecessarily hostile towards the
> contributor. And I wished we wouldn't have to see this kind of
> behavior here.
>
> Yes, some kernel maintainers do this a lot, but kernel maintainers
> also have this kind of reputation and people don't want to have to
> deal with this nonsense and decide to not contribute at all. So please
> just drop this attitude.

Yes, you are completely right with that, but getting this message to the 
recipient is intentional on my side.

I give complete NAKs when the author of a patch has missed such a
fundamental technical connection that further discussion doesn't make sense.

It's not meant to be in any way rude or offending. I can put a smiley 
behind it if it somehow helps, but we still need a way to raise this big 
red stop sign.

>> This is clearly going against the idea of having jobs only depend on
>> fences and nothing else which is mandatory for correct memory management.
>>
> I'm sure it's all documented and there is a design document on how
> things have to look like you can point out? Might help to get a better
> understanding on how things should be.

Yeah, that's the problematic part. We have documented this very 
extensively: 
https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences

And both Jason and Daniel gave talks about the underlying problem and 
try to come up with patches to raise warnings when that happens, but 
people still keep coming up with the same idea over and over again.

It's just that the technical relationship between preventing jobs from 
running and with that preventing dma_fences from signaling and the core 
memory management with page faults and shrinkers waiting for those 
fences is absolutely not obvious.

We had at least 10 different teams from different companies falling into 
the same trap already and either the patches were rejected out of hand or
had to be painfully reverted or mitigated later on.

Regards,
Christian.

>
>> If the hw is busy with something you need to return the fence for this
>> from the prepare_job callback so that the scheduler can be notified when
>> the hw is available again.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>>>    include/drm/gpu_scheduler.h            |  8 ++++++++
>>>    2 files changed, 18 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4e6ad6e122bc..5c0add2c7546 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>>>                if (!entity)
>>>                        continue;
>>>
>>> +             if (sched->ops->can_run_job) {
>>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>>> +                     if (!sched_job) {
>>> +                             complete_all(&entity->entity_idle);
>>> +                             continue;
>>> +                     }
>>> +                     if (!sched->ops->can_run_job(sched_job))
>>> +                             continue;
>>> +             }
>>> +
>>>                sched_job = drm_sched_entity_pop_job(entity);
>>>
>>>                if (!sched_job) {
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 9db9e5e504ee..bd89ea9507b9 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>>>        struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>>>                                         struct drm_sched_entity *s_entity);
>>>
>>> +     /**
>>> +      * @can_run_job: Called before job execution to check whether the
>>> +      * hardware is free enough to run the job.  This can be used to
>>> +      * implement more complex hardware resource policies than the
>>> +      * hw_submission limit.
>>> +      */
>>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
>>> +
>>>        /**
>>>             * @run_job: Called to execute the job once all of the dependencies
>>>             * have been resolved.  This may be called multiple times, if
>>>
  
Karol Herbst March 8, 2023, 2:43 p.m. UTC | #6
On Wed, Mar 8, 2023 at 2:47 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 08.03.23 um 13:39 schrieb Karol Herbst:
> > On Wed, Mar 8, 2023 at 9:46 AM Christian König <christian.koenig@amd.com> wrote:
> >> Am 07.03.23 um 15:25 schrieb Asahi Lina:
> >>> Some hardware may require more complex resource utilization accounting
> >>> than the simple job count supported by drm_sched internally. Add a
> >>> can_run_job callback to allow drivers to implement more logic before
> >>> deciding whether to run a GPU job.
> >> Well complete NAK.
> >>
> > There hasn't even been any kind of discussion yet you already come
> > around with a "Well complete NAK"
> >
> > First, this can be seen as rude behavior and me being part of the drm
> > community I don't want to have to see this kind of thing.
> >
> > Obviously, any kind of strong "technical" review point is a nak until
> > people settle with an agreement on what to land, there is no point in
> > pointing out a "NAK", especially if that's the first thing you say. If
> > you want to express your strong disagreement with the proposed
> > solution, then state what your pain points are directly.
> >
> > If there is a long discussion and a maintainer feels it's going
> > nowhere and no conclusion will be reached it might be this kind of
> > "speaking with authority" point has to be made. But not as the starter
> > into a discussion. This is unnecessarily hostile towards the
> > contributor. And I wished we wouldn't have to see this kind of
> > behavior here.
> >
> > Yes, some kernel maintainers do this a lot, but kernel maintainers
> > also have this kind of reputation and people don't want to have to
> > deal with this nonsense and decide to not contribute at all. So please
> > just drop this attitude.
>
> Yes, you are completely right with that, but getting this message to the
> recipient is intentional on my side.
>
> I give complete NAKs when the author of a patch has missed such a
> fundamental technical connection that further discussion doesn't make sense.
>
> It's not meant to be in any way rude or offending. I can put a smiley
> behind it if it somehow helps, but we still need a way to raise this big
> red stop sign.
>

"further"? There was no discussion at all, you just started off like
that. If you think somebody misses that connection, you can point out
to documentation/videos whatever so the contributor can understand
what's wrong with an approach. You did that, so that's fine. It's just
starting off _any_ discussion with a "Well complete NAK" is terrible
style. I'd feel uncomfortable if that happened to me and I'm sure
there are enough people like that that we should be more reasonable
with our replies. Just.. don't.

We are all humans here and people react negatively to such things. And
if people do it on purpose it just makes it worse.

> >> This is clearly going against the idea of having jobs only depend on
> >> fences and nothing else which is mandatory for correct memory management.
> >>
> > I'm sure it's all documented and there is a design document on how
> > things have to look like you can point out? Might help to get a better
> > understanding on how things should be.
>
> Yeah, that's the problematic part. We have documented this very
> extensively:
> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
>
> And both Jason and Daniel gave talks about the underlying problem and

fyi:
s/Jason/Faith/g

> try to come up with patches to raise warnings when that happens, but
> people still keep coming up with the same idea over and over again.
>

Yes, and we'll have to tell them over and over again. Nothing wrong
with that. That's just part of maintaining such a big subsystem. And
that's definitely not a valid reason to phrase things like above.

> It's just that the technical relationship between preventing jobs from
> running and with that preventing dma_fences from signaling and the core
> memory management with page faults and shrinkers waiting for those
> fences is absolutely not obvious.
>
> We had at least 10 different teams from different companies falling into
> the same trap already and either the patches were rejected out of hand or
> had to be painfully reverted or mitigated later on.
>

Sure, but that's just part of the job. And pointing out fundamental
mistakes early on is important, but the situation won't get any better
by being like that. Yes, we'll have to repeat the same words over and
over again, and yes that might be annoying, but that's just how it is.

> Regards,
> Christian.
>
> >
> >> If the hw is busy with something you need to return the fence for this
> >> from the prepare_job callback so that the scheduler can be notified when
> >> the hw is available again.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >>> ---
> >>>    drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >>>    include/drm/gpu_scheduler.h            |  8 ++++++++
> >>>    2 files changed, 18 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index 4e6ad6e122bc..5c0add2c7546 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> >>>                if (!entity)
> >>>                        continue;
> >>>
> >>> +             if (sched->ops->can_run_job) {
> >>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> >>> +                     if (!sched_job) {
> >>> +                             complete_all(&entity->entity_idle);
> >>> +                             continue;
> >>> +                     }
> >>> +                     if (!sched->ops->can_run_job(sched_job))
> >>> +                             continue;
> >>> +             }
> >>> +
> >>>                sched_job = drm_sched_entity_pop_job(entity);
> >>>
> >>>                if (!sched_job) {
> >>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>> index 9db9e5e504ee..bd89ea9507b9 100644
> >>> --- a/include/drm/gpu_scheduler.h
> >>> +++ b/include/drm/gpu_scheduler.h
> >>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> >>>        struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> >>>                                         struct drm_sched_entity *s_entity);
> >>>
> >>> +     /**
> >>> +      * @can_run_job: Called before job execution to check whether the
> >>> +      * hardware is free enough to run the job.  This can be used to
> >>> +      * implement more complex hardware resource policies than the
> >>> +      * hw_submission limit.
> >>> +      */
> >>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
> >>> +
> >>>        /**
> >>>             * @run_job: Called to execute the job once all of the dependencies
> >>>             * have been resolved.  This may be called multiple times, if
> >>>
>
  
Asahi Lina March 8, 2023, 2:53 p.m. UTC | #7
On 08/03/2023 19.00, Christian König wrote:
> Am 08.03.23 um 10:41 schrieb Asahi Lina:
>> On 08/03/2023 17.46, Christian König wrote:
>>> Am 07.03.23 um 15:25 schrieb Asahi Lina:
>>>> Some hardware may require more complex resource utilization accounting
>>>> than the simple job count supported by drm_sched internally. Add a
>>>> can_run_job callback to allow drivers to implement more logic before
>>>> deciding whether to run a GPU job.
>>> Well complete NAK.
>>>
>>> This is clearly going against the idea of having jobs only depend on
>>> fences and nothing else which is mandatory for correct memory management.
>>>
>>> If the hw is busy with something you need to return the fence for this
>>> from the prepare_job callback so that the scheduler can be notified when
>>> the hw is available again.
>> I think you misunderstood the intent here... This isn't about job
>> dependencies, it's about in-flight resource limits.
>>
>> drm_sched already has a hw_submission_limit that specifies the number of
>> submissions that can be in flight, but that doesn't work for us because
>> each job from drm_sched's point of view consists of multiple commands
>> split among 3 firmware queues. The firmware can only support up to 128
>> work commands in flight per queue (barriers don't count), otherwise it
>> overflows a fixed-size buffer.
>>
>> So we need more complex accounting of how many underlying commands are
>> in flight per queue to determine whether it is safe to run a new job,
>> and that is what this callback accomplishes. This has to happen even
>> when individual jobs have no buffer/resource dependencies between them
>> (which is what the fences would express).
> 
> Yeah, I already assumed that you have something like this.
> 
> And to make it clear this is unfortunately a complete NAK to this 
> approach! You can't do this!

I think you still have some significant misconception about how this
driver works and uses drm_sched... I would appreciate it if you listen
and try to understand the design before giving hard NAKs... (this isn't
a Radeon)

> The background is that core memory management requires that signaling a 
> fence only depends on signaling other fences and hardware progress and 
> nothing else. Otherwise you immediately run into problems because of 
>> circular dependencies or what we call infinite fences.

And hardware progress is exactly the only dependency here...

>> Jason Ekstrand gave a great presentation on that problem a few years
> ago on LPC. I strongly suggest you google that one up.

Faith Ekstrand (it looks like you mistyped that name...) is the person
who proposed that I should use drm_sched in this way (see below), we've
had a few private meetings about this design ^^

>> You can see the driver implementation of that callback in
>> drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
>> calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
>> does the actual available slot count checks.
>>
>> The can_run_job logic is written to mirror the hw_submission_limit logic
>> (just a bit later in the sched main loop since we need to actually pick
>> a job to do the check), and just like for that case, completion of any
>> job in the same scheduler will cause another run of the main loop and
>> another check (which is exactly what we want here).
> 
> Yeah and that hw_submission_limit is based on a fence signaling again.

I don't think so...? It's just an atomic that gets checked in
drm_sched_ready(). There are no extra fences involved (other than the
job completion fences that trigger another scheduler run). The idea is
that when the hardware queue makes forward progress you check against
the limit again and submit more jobs as needed. I'm doing the same exact
thing, I'm just using more complex logic for the notion of in-flight
queue limits!
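
For reference, the hw_submission_limit check being mirrored is roughly the
following (paraphrased from the scheduler around v6.2/v6.3, so details may
differ slightly):

  /* drivers/gpu/drm/scheduler/sched_main.c (approximate) */
  static bool drm_sched_ready(struct drm_gpu_scheduler *sched)
  {
          return atomic_read(&sched->hw_rq_count) <
                 sched->hw_submission_limit;
  }

hw_rq_count drops when a job's hardware fence signals, which also wakes the
scheduler thread, so the check is re-run purely on hardware progress.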

> When you have some firmware limitation that a job needs resources which 
> are currently in use by other submissions then those other submissions 
> have fences as well and you can return those in the prepare_job callback.
> 
> If those other submissions don't have fences, then you have a major 
> design problem inside your driver and we need to get back to square one 
> and talk about that dependency handling.

I think we have a disconnect in our views of what is going on here...

This hardware has firmware-side scheduling with an arbitrary (as far as
I know) number of queues. There is one scheduler instance and one entity
per userspace queue (not global!). These queues process jobs in some
logical sequence, though at the firmware level they get split into up to
three queues each (and there is some parallelism allowed). The
limitation here is in the number of in-flight jobs per firmware queue,
not global.

There is no way for things to deadlock. If jobs have been submitted to
the firmware queue, that means their dependencies were signaled already.
Jobs have intra-job dependencies via driver barriers (which drm_sched
knows nothing about), but the submission code in the driver guarantees
that they are deadlock-free since you can only barrier on past commands,
which by definition submit first.

If a firmware queue is full, drm_sched blocks. Since it is full, that
means it will run those commands (since they have no outside
dependencies and they are already queued and ready to run by the
firmware), eventually space will be freed, and each time a job completes
drm_sched will do the can_run_job check again and decide whether to run
a new job.

Since the firmware queues contain commands which only have past-facing
barriers on other already submitted commands, by definition they will
become empty at some point as long as the firmware is making forward
progress. And therefore, by definition, can_run_job will eventually
return true at some point after a job completion fence is signaled (the
one for the last job submitted prior). There is a check in the driver to
ensure that we do not allow submissions which, by themselves, would
exceed the queued command limit (we actually just limit to 64 commands
overall right now, which is conservative but seems reasonable given the
128-per-firmware-queue limit).

I get the feeling that you are conflating pending jobs with submitted
jobs. This isn't about how many jobs you can have pending in drm_sched
before running them or anything like that. Of course, at that point,
arbitrary dependencies come into play and you can end up with deadlocks
on dependency fences. But that's not the case here. What can_run_job is
waiting on is guaranteed to make forward progress.
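
A hedged sketch of the completion side under that design (hypothetical
mydrv_* names; the point is that the wake-up path is the same one the
hw_submission_limit check relies on):

  /* Driver completion handler, called when the firmware reports that a
   * command has finished executing. */
  static void mydrv_fw_command_complete(struct mydrv_fw_queue *q,
                                        struct mydrv_cmd *cmd)
  {
          /* A slot in this firmware queue is free again. */
          atomic_dec(&q->inflight_commands);

          /* Signal the hardware fence that run_job handed to drm_sched;
           * the scheduler's fence callback wakes its main loop, which
           * re-evaluates can_run_job for the next pending job. */
          dma_fence_signal(cmd->job_hw_fence);
  }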

>> This case (potentially scheduling more than the FW job limit) is rare
>> but handling it is necessary, since otherwise the entire job
>> completion/tracking logic gets screwed up on the firmware end and queues
>> end up stuck (I've managed to trigger this before).
> 
> Actually that's a pretty normal use case. I have rejected similar
> requirements like this before as well.
> 
> For an example how this can work see amdgpu_job_prepare_job(): 
> https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L251
> 
> The gang submit gives an example of a global fence lock and the VMIDs
> are an example of a global shared firmware resource.

But the resource can_run_job is checking on isn't globally shared! It's
specific to this scheduler instance, just like hw_submission_limit is,
so as long as the firmware behind the scheduler is making forward
progress, the resource will be guaranteed to be freed until another job
can run.

I actually know I have a different theoretical deadlock issue along
these lines in the driver because right now we grab actually global
resources (including a VMID) before job submission to drm_sched. This is
a known issue, and to fix it without reducing performance I need to
introduce some kind of "patching/fixup" system for firmware commands
(because we need to inject those identifiers in dozens of places, but we
don't want to construct those commands from scratch at job run time
because that introduces latency at the wrong time and makes error
handling/validation more complicated and error-prone), and that is
exactly what should happen in prepare_job, as you say. And yes, at that
point that should use fences to block when those resources are
exhausted. But that's a different discussion we should have when
reviewing the driver, it has nothing to do with the DRM abstractions nor
the can_run_job callback I'm adding here nor the firmware queue length
limit issue! (And also the global hardware devices are plentiful enough
that I would be very surprised if anyone ever deadlocks it in practice
even with the current code, so I honestly don't think that should be a
blocker for driver submission either, I can and will fix it later...)

~~ Lina
  
Christian König March 8, 2023, 3:02 p.m. UTC | #8
Am 08.03.23 um 15:43 schrieb Karol Herbst:
> [SNIP]
> "further"? There was no discussion at all,

Yeah, well that is exactly what I wanted to achieve.

>   you just started off like
> that. If you think somebody misses that connection, you can point out
> to documentation/videos whatever so the contributor can understand
> what's wrong with an approach. You did that, so that's fine. It's just
> starting off _any_ discussion with a "Well complete NAK" is terrible
> style. I'd feel uncomfortable if that happened to me and I'm sure
> there are enough people like that that we should be more reasonable
> with our replies. Just.. don't.
>
> We are all humans here and people react negatively to such things. And
> if people do it on purpose it just makes it worse.

I completely see your point, I just don't know how to improve it.

I don't stop people like this because I want to make them uncomfortable 
but because I want to prevent further discussions on that topic.

In other words how can I make people notice that this is something 
fundamental while still being polite?

>>>> This is clearly going against the idea of having jobs only depend on
>>>> fences and nothing else which is mandatory for correct memory management.
>>>>
>>> I'm sure it's all documented and there is a design document on how
>>> things have to look like you can point out? Might help to get a better
>>> understanding on how things should be.
>> Yeah, that's the problematic part. We have documented this very
>> extensively:
>> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
>>
>> And both Jason and Daniel gave talks about the underlying problem and
> fyi:
> s/Jason/Faith/g

+1. I wasn't aware of that.

>> try to come up with patches to raise warnings when that happens, but
>> people still keep coming up with the same idea over and over again.
>>
> Yes, and we'll have to tell them over and over again. Nothing wrong
> with that. That's just part of maintaining such a big subsystem. And
> that's definitely not a valid reason to phrase things like above.
>
>> It's just that the technical relationship between preventing jobs from
>> running and with that preventing dma_fences from signaling and the core
>> memory management with page faults and shrinkers waiting for those
>> fences is absolutely not obvious.
>>
>> We had at least 10 different teams from different companies falling into
>> the same trap already and either the patches were rejected out of hand or
>> had to be painfully reverted or mitigated later on.
>>
> Sure, but that's just part of the job. And pointing out fundamental
> mistakes early on is important, but the situation won't get any better
> by being like that. Yes, we'll have to repeat the same words over and
> over again, and yes that might be annoying, but that's just how it is.

Well I have no problem explaining people why a solution doesn't work.

But what usually happens is that people don't realize that they need to 
back off from a design and completely start over.

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>>> If the hw is busy with something you need to return the fence for this
>>>> from the prepare_job callback so that the scheduler can be notified when
>>>> the hw is available again.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>>>>>     include/drm/gpu_scheduler.h            |  8 ++++++++
>>>>>     2 files changed, 18 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 4e6ad6e122bc..5c0add2c7546 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>>>>>                 if (!entity)
>>>>>                         continue;
>>>>>
>>>>> +             if (sched->ops->can_run_job) {
>>>>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>>>>> +                     if (!sched_job) {
>>>>> +                             complete_all(&entity->entity_idle);
>>>>> +                             continue;
>>>>> +                     }
>>>>> +                     if (!sched->ops->can_run_job(sched_job))
>>>>> +                             continue;
>>>>> +             }
>>>>> +
>>>>>                 sched_job = drm_sched_entity_pop_job(entity);
>>>>>
>>>>>                 if (!sched_job) {
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index 9db9e5e504ee..bd89ea9507b9 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>>>>>         struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>>>>>                                          struct drm_sched_entity *s_entity);
>>>>>
>>>>> +     /**
>>>>> +      * @can_run_job: Called before job execution to check whether the
>>>>> +      * hardware is free enough to run the job.  This can be used to
>>>>> +      * implement more complex hardware resource policies than the
>>>>> +      * hw_submission limit.
>>>>> +      */
>>>>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
>>>>> +
>>>>>         /**
>>>>>              * @run_job: Called to execute the job once all of the dependencies
>>>>>              * have been resolved.  This may be called multiple times, if
>>>>>
  
Karol Herbst March 8, 2023, 3:19 p.m. UTC | #9
On Wed, Mar 8, 2023 at 4:09 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 08.03.23 um 15:43 schrieb Karol Herbst:
> > [SNIP]
> > "further"? There was no discussion at all,
>
> Yeah, well that is exactly what I wanted to achieve.
>
> >   you just started off like
> > that. If you think somebody misses that connection, you can point out
> > to documentation/videos whatever so the contributor can understand
> > what's wrong with an approach. You did that, so that's fine. It's just
> > starting off _any_ discussion with a "Well complete NAK" is terrible
> > style. I'd feel uncomfortable if that happened to me and I'm sure
> > there are enough people like that that we should be more reasonable
> > with our replies. Just.. don't.
> >
> > We are all humans here and people react negatively to such things. And
> > if people do it on purpose it just makes it worse.
>
> I completely see your point, I just don't know how to improve it.
>
> I don't stop people like this because I want to make them uncomfortable
> but because I want to prevent further discussions on that topic.
>
> In other words how can I make people notice that this is something
> fundamental while still being polite?
>

I think a little improvement over this would be to at least wait a few
replies before resorting to those strong statements, at least until it
starts to become a risk of just wasting time.

> >>>> This is clearly going against the idea of having jobs only depend on
> >>>> fences and nothing else which is mandatory for correct memory management.
> >>>>
> >>> I'm sure it's all documented and there is a design document on how
> >>> things have to look like you can point out? Might help to get a better
> >>> understanding on how things should be.
> >> Yeah, that's the problematic part. We have documented this very
> >> extensively:
> >> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
> >>
> >> And both Jason and Daniel gave talks about the underlying problem and
> > fyi:
> > s/Jason/Faith/g
>
> +1. I wasn't aware of that.
>
> >> try to come up with patches to raise warnings when that happens, but
> >> people still keep coming up with the same idea over and over again.
> >>
> > Yes, and we'll have to tell them over and over again. Nothing wrong
> > with that. That's just part of maintaining such a big subsystem. And
> > that's definitely not a valid reason to phrase things like above.
> >
> >> It's just that the technical relationship between preventing jobs from
> >> running and with that preventing dma_fences from signaling and the core
> >> memory management with page faults and shrinkers waiting for those
> >> fences is absolutely not obvious.
> >>
> >> We had at least 10 different teams from different companies falling into
> >> the same trap already and either the patches were rejected out of hand or
> >> had to be painfully reverted or mitigated later on.
> >>
> > Sure, but that's just part of the job. And pointing out fundamental
> > mistakes early on is important, but the situation won't get any better
> > by being like that. Yes, we'll have to repeat the same words over and
> > over again, and yes that might be annoying, but that's just how it is.
>
> Well I have no problem explaining people why a solution doesn't work.
>
> But what usually happens is that people don't realize that they need to
> back off from a design and completely start over.
>
> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
> >>
> >>>> If the hw is busy with something you need to return the fence for this
> >>>> from the prepare_job callback so that the scheduler can be notified when
> >>>> the hw is available again.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >>>>> ---
> >>>>>     drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> >>>>>     include/drm/gpu_scheduler.h            |  8 ++++++++
> >>>>>     2 files changed, 18 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> index 4e6ad6e122bc..5c0add2c7546 100644
> >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> >>>>>                 if (!entity)
> >>>>>                         continue;
> >>>>>
> >>>>> +             if (sched->ops->can_run_job) {
> >>>>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> >>>>> +                     if (!sched_job) {
> >>>>> +                             complete_all(&entity->entity_idle);
> >>>>> +                             continue;
> >>>>> +                     }
> >>>>> +                     if (!sched->ops->can_run_job(sched_job))
> >>>>> +                             continue;
> >>>>> +             }
> >>>>> +
> >>>>>                 sched_job = drm_sched_entity_pop_job(entity);
> >>>>>
> >>>>>                 if (!sched_job) {
> >>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> >>>>> index 9db9e5e504ee..bd89ea9507b9 100644
> >>>>> --- a/include/drm/gpu_scheduler.h
> >>>>> +++ b/include/drm/gpu_scheduler.h
> >>>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> >>>>>         struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> >>>>>                                          struct drm_sched_entity *s_entity);
> >>>>>
> >>>>> +     /**
> >>>>> +      * @can_run_job: Called before job execution to check whether the
> >>>>> +      * hardware is free enough to run the job.  This can be used to
> >>>>> +      * implement more complex hardware resource policies than the
> >>>>> +      * hw_submission limit.
> >>>>> +      */
> >>>>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
> >>>>> +
> >>>>>         /**
> >>>>>              * @run_job: Called to execute the job once all of the dependencies
> >>>>>              * have been resolved.  This may be called multiple times, if
> >>>>>
>
  
Christian König March 8, 2023, 3:30 p.m. UTC | #10
Am 08.03.23 um 15:53 schrieb Asahi Lina:
> [SNIP]
>> The background is that core memory management requires that signaling a
>> fence only depends on signaling other fences and hardware progress and
>> nothing else. Otherwise you immediately run into problems because of
>> circular dependencies or what we call infinite fences.
> And hardware progress is exactly the only dependency here...

Well then you should have a fence for that hardware progress.

>> Jason Ekstrand gave a great presentation on that problem a few years
>> ago on LPC. I strongly suggest you google that one up.
> Faith Ekstrand (it looks like you mistyped that name...)

My fault I was really just mistyping that :)

>   is the person
> who proposed that I should use drm_sched in this way (see below), we've
> had a few private meetings about this design ^^
>
>>> You can see the driver implementation of that callback in
>>> drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then
>>> calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that
>>> does the actual available slot count checks.
>>>
>>> The can_run_job logic is written to mirror the hw_submission_limit logic
>>> (just a bit later in the sched main loop since we need to actually pick
>>> a job to do the check), and just like for that case, completion of any
>>> job in the same scheduler will cause another run of the main loop and
>>> another check (which is exactly what we want here).
>> Yeah and that hw_submission_limit is based on a fence signaling again.
> I don't think so...? It's just an atomic that gets checked in
> drm_sched_ready(). There are no extra fences involved (other than the
> job completion fences that trigger another scheduler run). The idea is
> that when the hardware queue makes forward progress you check against
> the limit again and submit more jobs as needed. I'm doing the same exact
> thing, I'm just using more complex logic for the notion of in-flight
> queue limits!

Then why can't you express that logic in a dependency fence?

>> When you have some firmware limitation that a job needs resources which
>> are currently in use by other submissions then those other submissions
>> have fences as well and you can return those in the prepare_job callback.
>>
>> If those other submissions don't have fences, then you have a major
>> design problem inside your driver and we need to get back to square one
>> and talk about that dependency handling.
> I think we have a disconnect in our views of what is going on here...
>
> This hardware has firmware-side scheduling with an arbitrary (as far as
> I know) number of queues. There is one scheduler instance and one entity
> per userspace queue (not global!). These queues process jobs in some
> logical sequence, though at the firmware level they get split into up to
> three queues each (and there is some parallelism allowed). The
> limitation here is in the number of in-flight jobs per firmware queue,
> not global.

So far I'm familiar with that design.

> There is no way for things to deadlock. If jobs have been submitted to
> the firmware queue, that means their dependencies were signaled already.
> Jobs have intra-job dependencies via driver barriers (which drm_sched
> knows nothing about), but the submission code in the driver guarantees
> that they are deadlock-free since you can only barrier on past commands,
> which by definition submit first.
>
> If a firmware queue is full, drm_sched blocks. Since it is full, that
> means it will run those commands (since they have no outside
> dependencies and they are already queued and ready to run by the
> firmware), eventually space will be freed, and each time a job completes
> drm_sched will do the can_run_job check again and decide whether to run
> a new job.
>
> Since the firmware queues contain commands which only have past-facing
> barriers on other already submitted commands, by definition they will
> become empty at some point as long as the firmware is making forward
> progress. And therefore, by definition, can_run_job will eventually
> return true at some point after a job completion fence is signaled (the
> one for the last job submitted prior). There is a check in the driver to
> ensure that we do not allow submissions which, by themselves, would
> exceed the queued command limit (we actually just limit to 64 commands
> overall right now, which is conservative but seems reasonable given the
> 128-per-firmware-queue limit).

Well then again why don't you give that fence out as dependency? Is it 
because the scheduler tries to optimize those away?

> I get the feeling that you are conflating pending jobs with submitted
> jobs. This isn't about how many jobs you can have pending in drm_sched
> before running them or anything like that. Of course, at that point,
> arbitrary dependencies come into play and you can end up with deadlocks
> on dependency fences. But that's not the case here. What can_run_job is
> waiting on is guaranteed to make forward progress.

I see that we have a disconnection here. As far as I can see you can use 
the can_run callback in only three ways:

1. To check for some userspace dependency (We don't need to discuss 
that, it's evil and we both know it).

2. You check for some hw resource availability. Similar to VMID on 
amdgpu hw.

     This is what I think you do here (but I might be wrong). But this 
would be extremely problematic because you can then live lock.
     E.g. queue A keeps submitting jobs which take only a few resources 
and by doing so delays submitting jobs from queue B indefinitely.

3. You have an intra queue dependency. E.g. you have jobs which take X 
amount of resources, you can submit only to a specific limit.
     But in this case you should be able to return fences from the same 
queue as dependency and won't need that callback.

     We would just need to adjust drm_sched_entity_add_dependency_cb() a 
bit because dependencies from the same queue are currently filtered out 
because it assumes a pipeline nature of submission (e.g. previous 
submissions are finished before new submissions start).
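
For reference, the filtering in question looks roughly like this
(paraphrased from drm_sched_entity_add_dependency_cb() in the scheduler of
that era; exact details may differ):

  /* drivers/gpu/drm/scheduler/sched_entity.c (approximate) */
  if (fence->context == entity->fence_context ||
      fence->context == entity->fence_context + 1) {
          /* Scheduled/finished fence from a job on the same entity:
           * assumed to be pipelined behind us, so it is dropped
           * instead of being waited on as a dependency. */
          dma_fence_put(entity->dependency);
          return false;
  }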

>>> This case (potentially scheduling more than the FW job limit) is rare
>>> but handling it is necessary, since otherwise the entire job
>>> completion/tracking logic gets screwed up on the firmware end and queues
>>> end up stuck (I've managed to trigger this before).
>> Actually that's a pretty normal use case. I have rejected similar
>> requirements like this before as well.
>>
>> For an example how this can work see amdgpu_job_prepare_job():
>> https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L251
>>
>> The gang submit gives an example of a global fence lock and the VMIDs
>> are an example of a global shared firmware resource.
> But the resource can_run_job is checking on isn't globally shared! It's
> specific to this scheduler instance, just like hw_submission_limit is,
> so as long as the firmware behind the scheduler is making forward
> progress, the resource will be guaranteed to be freed until another job
> can run.

Well either it should be globally shared because it is a shared resource 
(similar to our VMID or gangs) or it is an intra queue limitation in 
which case you could just use the fences previously submitted on the 
queue as dependency.

> I actually know I have a different theoretical deadlock issue along
> these lines in the driver because right now we grab actually global
> resources (including a VMID) before job submission to drm_sched. This is
> a known issue, and to fix it without reducing performance I need to
> introduce some kind of "patching/fixup" system for firmware commands
> (because we need to inject those identifiers in dozens of places, but we
> don't want to construct those commands from scratch at job run time
> because that introduces latency at the wrong time and makes error
> handling/validation more complicated and error-prone), and that is
> exactly what should happen in prepare_job, as you say. And yes, at that
> point that should use fences to block when those resources are
> exhausted. But that's a different discussion we should have when
> reviewing the driver, it has nothing to do with the DRM abstractions nor
> the can_run_job callback I'm adding here nor the firmware queue length
> limit issue! (And also the global hardware devices are plentiful enough
> that I would be very surprised if anyone ever deadlocks it in practice
> even with the current code, so I honestly don't think that should be a
> blocker for driver submission either, I can and will fix it later...)

Well this is what I thought about those problems in amdgpu as well and 
it totally shipwrecked.

We still have memory allocations in the VMID code path which I'm still 
not sure how to remove.

Regards,
Christian.

>
> ~~ Lina
  
Asahi Lina March 8, 2023, 4:44 p.m. UTC | #11
On 09/03/2023 00.30, Christian König wrote:
> Am 08.03.23 um 15:53 schrieb Asahi Lina:
>> [SNIP]
>>> The background is that core memory management requires that signaling a
>>> fence only depends on signaling other fences and hardware progress and
>>> nothing else. Otherwise you immediately run into problems because of
>>> circular dependencies or what we call infinite fences.
>> And hardware progress is exactly the only dependency here...
> 
> Well then you should have a fence for that hardware progress.

I do, it's the prior job hardware completion fences that drm_sched
already knows about!

Yes, I could return those in the prepare callback, it just means I need
to start stashing fence references in the underlying firmware job queue
command objects so I can find out what the oldest pending fence is,
and return it when a queue is full. As long as drm_sched doesn't mind if
I keep giving it fences (since multiple commands can have to complete
before there is space) or the occasional already signaled fence (because
this process is inherently racy), it should work fine.

If you think this is the better way, I'll do it that way and drop this
patch. It just seemed simpler to do it with another callback, since
drm_sched is already tracking those fences and doing a hardware queue
limit check anyway, and that way I can avoid tracking the fences down
into the hardware queue code... *

(But I still maintain what I'm trying to do here is entirely correct and
deadlock-free! If you prefer I use prepare_job and return prior job
fences from that instead, that's very different from NAKing the patch
saying it's broken...)

* If you're wondering how the fences get signaled at all then: callback
closures that capture a reference to the fence when firmware commands
are constructed and submitted. I know, I know, fancy Rust stuff... ^^
If you'd rather have me use the fences for the blocking, I'll probably
just drop the signaling bit from the closures so we don't need to keep
two redundant fence references in different places per command. I still
need the closures for command completion processing though, since I use
them to process statistics too...

>>> Jason Ekstrand gave a create presentation on that problem a few years
>>> ago on LPC. I strongly suggest you google that one up.
>> Faith Ekstrand (it looks like you mistyped that name...)
> 
> My fault I was really just mistyping that :)

It's all good ^^

> 
> I see that we have a disconnection here. As far as I can see you can use 
> the can_run callback in only three ways:
> 
> 1. To check for some userspace dependency (We don't need to discuss 
> that, it's evil and we both know it).
> 
> 2. You check for some hw resource availability. Similar to VMID on 
> amdgpu hw.
> 
>      This is what I think you do here (but I might be wrong).

It isn't... I agree, it would be problematic. It doesn't make any sense
to check for global resources this way, not just because you might
deadlock but also because there might be nothing to signal to the
scheduler that a resource was freed at all once it is!

> But this 
> would be extremely problematic because you can then live lock.
>      E.g. queue A keeps submitting jobs which take only a few resources 
> and by doing so delays submitting jobs from queue B indefinitely.

This particular issue aside, fairness in global resource allocation is a
conversation I'd love to have! Right now the driver doesn't try to
ensure that, a queue can easily monopolize certain hardware resources
(though one queue can only monopolize one of each, so you'd need
something like 63 queues with 63 distinct VMs all submitting
free-running jobs back to back in order to starve other queues of
resources forever). For starters, one thing I'm thinking of doing is
reserving certain subsets of hardware resources for queues with a given
priority, so you can at least guarantee forward progress of
higher-priority queues when faced with misbehaving lower-priority
queues. But if we want to guarantee proper fairness, I think I'll have
to start doing things like switching to a CPU-roundtrip submission model
when resources become scarce (to guarantee that queues actually release
the resources once in a while) and then figure out how to add fairness
to the allocation code...

But let's have that conversation when we talk about the driver (or maybe
on IRC or something?), right now I'm more interested in getting the
abstractions reviewed ^^

> 3. You have an intra queue dependency. E.g. you have jobs which take X 
> amount of resources, you can submit only to a specific limit.
>      But in this case you should be able to return fences from the same 
> queue as dependency and won't need that callback.

Yes, I can do this. I can just do the same check can_run_job() does and
if it fails, pick the oldest job in the full firmware queue and return
its fence (it just means I need to keep track of those fences there, as
I said above).

>      We would just need to adjust drm_sched_entity_add_dependency_cb() a 
> bit because dependencies from the same queue are currently filtered out 
> because it assumes a pipeline nature of submission (e.g. previous 
> submissions are finished before new submissions start).

Actually that should be fine, because I'd be returning the underlying
hardware completion fences (what the run() callback returns) which the
driver owns, and wouldn't be recognized as belonging to the sched.

>> I actually know I have a different theoretical deadlock issue along
>> these lines in the driver because right now we grab actually global
>> resources (including a VMID) before job submission to drm_sched. This is
>> a known issue, and to fix it without reducing performance I need to
>> introduce some kind of "patching/fixup" system for firmware commands
>> (because we need to inject those identifiers in dozens of places, but we
>> don't want to construct those commands from scratch at job run time
>> because that introduces latency at the wrong time and makes error
>> handling/validation more complicated and error-prone), and that is
>> exactly what should happen in prepare_job, as you say. And yes, at that
>> point that should use fences to block when those resources are
>> exhausted. But that's a different discussion we should have when
>> reviewing the driver, it has nothing to do with the DRM abstractions nor
>> the can_run_job callback I'm adding here nor the firmware queue length
>> limit issue! (And also the global hardware devices are plentiful enough
>> that I would be very surprised if anyone ever deadlocks it in practice
>> even with the current code, so I honestly don't think that should be a
>> blocker for driver submission either, I can and will fix it later...)
> 
> Well this is what I thought about those problems in amdgpu as well and 
> it totally shipwrecked.
> 
> We still have memory allocations in the VMID code path which I'm still 
> not sure how to remove.

We don't even have a shrinker yet, and I'm sure that's going to be a lot
of fun when we add it too... but yes, if we can't do any memory
allocations in some of these callbacks (is this documented anywhere?),
that's going to be interesting...

It's not all bad news though! All memory allocations are fallible in
kernel Rust (and therefore explicit, and also failures have to be
explicitly handled or propagated), so it's pretty easy to point out
where they are, and there are already discussions of higher-level
tooling to enforce rules like that (and things like wait contexts).
Also, Rust makes it a lot easier to refactor code in general and not be
scared that you're going to regress everything, so I'm not really
worried if I need to turn a chunk of the driver on its head to solve
some of these problems in the future ^^ (I already did that when I
switched it from the "demo" synchronous submission model to the proper
explicit sync + fences one.)

~~ Lina
  
Christian König March 8, 2023, 5:57 p.m. UTC | #12
Am 08.03.23 um 17:44 schrieb Asahi Lina:
> On 09/03/2023 00.30, Christian König wrote:
>> Am 08.03.23 um 15:53 schrieb Asahi Lina:
>>> [SNIP]
>>>> The background is that core memory management requires that signaling a
>>>> fence only depends on signaling other fences and hardware progress and
>>>> nothing else. Otherwise you immediately run into problems because of
>>>> circle dependencies or what we call infinite fences.
>>> And hardware progress is exactly the only dependency here...
>> Well then you should have a fence for that hardware progress.
> I do, it's the prior job hardware completion fences that drm_sched
> already knows about!
>
> Yes, I could return those in the prepare callback, it just means I need
> to start stashing fence references in the underlying firmware job queue
> command objects so I can find out what the oldest pending fence is,
> and return it when a queue is full. As long as drm_sched doesn't mind if
> I keep giving it fences (since multiple commands may have to complete
> before there is space) or the occasional already signaled fence (because
> this process is inherently racy), it should work fine.

Well this handling is intentional and necessary, but see below for a 
more in-depth explanation.

> If you think this is the better way, I'll do it that way and drop this
> patch. It just seemed simpler to do it with another callback, since
> drm_sched is already tracking those fences and doing a hardware queue
> limit check anyway, and that way I can avoid tracking the fences down
> into the hardware queue code... *

Well it's not the better way, it's the only way that works.

I have to admit that my bet on your intentions was wrong, but even that 
use case doesn't work correctly.

See when your callback returns false it is perfectly possible that all 
hw fences are signaled between returning that information and processing it.

The result would be that the scheduler goes to sleep and never wakes up 
again.

That's why we have that rule that all dependencies need to be expressed 
by those dma_fence objects, because those are designed with such races in mind.

> (But I still maintain what I'm trying to do here is entirely correct and
> deadlock-free! If you prefer I use prepare_job and return prior job
> fences from that instead, that's very different from NAKing the patch
> saying it's broken...)

As I said we exercised those ideas before and yes this approach here 
came up before as well and no it doesn't work.

> * If you're wondering how the fences get signaled at all then: callback
> closures that capture a reference to the fence when firmware commands
> are constructed and submitted. I know, I know, fancy Rust stuff... ^^
> If you'd rather have me use the fences for the blocking, I'll probably
> just drop the signaling bit from the closures so we don't need to keep
> two redundant fence references in different places per command. I still
> need the closures for command completion processing though, since I use
> them to process statistics too...
>
>> I see that we have a disconnection here. As far as I can see you can use
>> the can_run callback in only three ways:
>>
>> 1. To check for some userspace dependency (We don't need to discuss
>> that, it's evil and we both know it).
>>
>> 2. You check for some hw resource availability. Similar to VMID on
>> amdgpu hw.
>>
>>       This is what I think you do here (but I might be wrong).
> It isn't... I agree, it would be problematic. It doesn't make any sense
> to check for global resources this way, not just because you might
> deadlock but also because there might be nothing to signal to the
> scheduler that a resource was freed at all once it is!
>
>> But this
>> would be extremely problematic because you can then live lock.
>>       E.g. queue A keeps submitting jobs which take only a few resources
>> and by doing so delays submitting jobs from queue B indefinitely.
> This particular issue aside, fairness in global resource allocation is a
> conversation I'd love to have! Right now the driver doesn't try to
> ensure that, a queue can easily monopolize certain hardware resources
> (though one queue can only monopolize one of each, so you'd need
> something like 63 queues with 63 distinct VMs all submitting
> free-running jobs back to back in order to starve other queues of
> resources forever). For starters, one thing I'm thinking of doing is
> reserving certain subsets of hardware resources for queues with a given
> priority, so you can at least guarantee forward progress of
> higher-priority queues when faced with misbehaving lower-priority
> queues. But if we want to guarantee proper fairness, I think I'll have
> to start doing things like switching to a CPU-roundtrip submission model
> when resources become scarce (to guarantee that queues actually release
> the resources once in a while) and then figure out how to add fairness
> to the allocation code...
>
> But let's have that conversation when we talk about the driver (or maybe
> on IRC or something?), right now I'm more interested in getting the
> abstractions reviewed ^^

Well that stuff is highly problematic as well. Fairness aside, you 
risk starvation, which in turn breaks the guarantee of forward progress.

In this particular case you can catch this with a timeout for the hw 
operation, but you should consider blocking that from the sw side as well.

>> 3. You have an intra queue dependency. E.g. you have jobs which take X
>> amount of resources, you can submit only to a specific limit.
>>       But in this case you should be able to return fences from the same
>> queue as dependency and won't need that callback.
> Yes, I can do this. I can just do the same check can_run_job() does and
> if it fails, pick the oldest job in the full firmware queue and return
> its fence (it just means I need to keep track of those fences there, as
> I said above).
>
>>       We would just need to adjust drm_sched_entity_add_dependency_cb() a
>> bit because dependencies from the same queue are currently filtered out
>> because it assumes a pipeline nature of submission (e.g. previous
>> submissions are finished before new submissions start).
> Actually that should be fine, because I'd be returning the underlying
> hardware completion fences (what the run() callback returns) which the
> driver owns, and wouldn't be recognized as belonging to the sched.
>
>>> I actually know I have a different theoretical deadlock issue along
>>> these lines in the driver because right now we grab actually global
>>> resources (including a VMID) before job submission to drm_sched. This is
>>> a known issue, and to fix it without reducing performance I need to
>>> introduce some kind of "patching/fixup" system for firmware commands
>>> (because we need to inject those identifiers in dozens of places, but we
>>> don't want to construct those commands from scratch at job run time
>>> because that introduces latency at the wrong time and makes error
>>> handling/validation more complicated and error-prone), and that is
>>> exactly what should happen in prepare_job, as you say. And yes, at that
>>> point that should use fences to block when those resources are
>>> exhausted. But that's a different discussion we should have when
>>> reviewing the driver, it has nothing to do with the DRM abstractions nor
>>> the can_run_job callback I'm adding here nor the firmware queue length
>>> limit issue! (And also the global hardware devices are plentiful enough
>>> that I would be very surprised if anyone ever deadlocks it in practice
>>> even with the current code, so I honestly don't think that should be a
>>> blocker for driver submission either, I can and will fix it later...)
>> Well this is what I thought about those problems in amdgpu as well and
>> it totally shipwrecked.
>>
>> We still have memory allocations in the VMID code path which I'm still
>> not sure how to remove.
> We don't even have a shrinker yet, and I'm sure that's going to be a lot
> of fun when we add it too... but yes, if we can't do any memory
> allocations in some of these callbacks (is this documented anywhere?),
> that's going to be interesting...

Yes, that is all part of the dma_fence documentation. It's just 
absolutely not obvious what all this means.

> It's not all bad news though! All memory allocations are fallible in
> kernel Rust (and therefore explicit, and also failures have to be
> explicitly handled or propagated), so it's pretty easy to point out
> where they are, and there are already discussions of higher-level
> tooling to enforce rules like that (and things like wait contexts).
> Also, Rust makes it a lot easier to refactor code in general and not be
> scared that you're going to regress everything, so I'm not really
> worried if I need to turn a chunk of the driver on its head to solve
> some of these problems in the future ^^ (I already did that when I
> switched it from the "demo" synchronous submission model to the proper
> explicit sync + fences one.)

Yeah, well the problem isn't that you run into memory allocation failure.

The problem is rather something like this:
1. You try to allocate memory to signal your fence.
2. This memory allocation can't be fulfilled and goes to sleep to wait 
for reclaim.
3. On another CPU reclaim is running and through the general purpose 
shrinker, page fault or MMU notifier ends up waiting for your dma_fence.

You don't even need to implement the shrinker for this to go boom 
extremely easily.

So everything involved with signaling the fence can allocate memory only 
with GFP_ATOMIC and only if you absolutely have to.
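
I.e. something like this in any path that ends up calling 
dma_fence_signal() (names made up, just to illustrate the constraint):

static void my_record_stats(struct my_cmd *cmd)
{
	struct my_stats *stats;

	/* This can run between hw completion and dma_fence_signal(),
	 * so it must not sleep waiting for reclaim. */
	stats = kzalloc(sizeof(*stats), GFP_ATOMIC);
	if (!stats)
		return;		/* and it has to tolerate failure */

	stats->duration_ns = cmd->end_ns - cmd->start_ns;
	/* (locking of the stats list omitted for brevity) */
	list_add_tail(&stats->node, &cmd->queue->stats_list);
}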

Christian.


>
> ~~ Lina
  
Asahi Lina March 8, 2023, 7:05 p.m. UTC | #13
On 09/03/2023 02.57, Christian König wrote:
> Am 08.03.23 um 17:44 schrieb Asahi Lina:
>> On 09/03/2023 00.30, Christian König wrote:
>>> Am 08.03.23 um 15:53 schrieb Asahi Lina:
>>>> [SNIP]
>>>>> The background is that core memory management requires that signaling a
>>>>> fence only depends on signaling other fences and hardware progress and
>>>>> nothing else. Otherwise you immediately run into problems because of
>>>>> circle dependencies or what we call infinite fences.
>>>> And hardware progress is exactly the only dependency here...
>>> Well then you should have a fence for that hardware progress.
>> I do, it's the prior job hardware completion fences that drm_sched
>> already knows about!
>>
>> Yes, I could return those in the prepare callback, it just means I need
>> to start stashing fence references in the underlying firmware job queue
>> command objects so I can find out what the oldest pending fence is,
>> and return it when a queue is full. As long as drm_sched doesn't mind if
>> I keep giving it fences (since multiple commands may have to complete
>> before there is space) or the occasional already signaled fence (because
>> this process is inherently racy), it should work fine.
> 
> Well this handling is intentional and necessary, but see below for a 
> more in-depth explanation.
> 
>> If you think this is the better way, I'll do it that way and drop this
>> patch. It just seemed simpler to do it with another callback, since
>> drm_sched is already tracking those fences and doing a hardware queue
>> limit check anyway, and that way I can avoid tracking the fences down
>> into the hardware queue code... *
> 
> Well it's not the better way, it's the only way that works.
> 
> I have to admit that my bet on your intentions was wrong, but even that 
> use case doesn't work correctly.
> 
> See when your callback returns false it is perfectly possible that all 
> hw fences are signaled between returning that information and processing it.
> 
> The result would be that the scheduler goes to sleep and never wakes up 
> again.

That can't happen, because it will just go into another iteration of the
drm_sched main loop since there is an entity available still.

Rather there is probably the opposite bug in this patch: the can_run_job
logic should be moved into the wait_event_interruptible() condition
check, otherwise I think it can end up busy-looping since the condition
itself can be true even when the can_run_job check blocks it.

But there is no risk of it going to sleep and never waking up because
job completions will wake up the waitqueue by definition, and that
happens after the driver-side queues are popped. If this problem could
happen, then the existing hw_submission_limit logic would be broken in
the same way. It is logically equivalent in how it works.

Basically, if properly done in wait_event_interruptible, it is exactly
the logic of that macro that prevents this race condition and makes
everything work at all. Without it, drm_sched would be completely broken.
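
Concretely, something like this (sketch only, not a tested diff; 
drm_sched_can_run_head_job() would be a new helper that peeks at the 
entity's head job and calls the driver's can_run_job() on it, the rest is 
roughly what the existing wait already looks like):

	wait_event_interruptible(sched->wake_up_worker,
		(cleanup_job = drm_sched_get_cleanup_job(sched)) ||
		(!drm_sched_blocked(sched) &&
		 (entity = drm_sched_select_entity(sched)) &&
		 /* new: only treat the entity as runnable if its head job
		  * also passes the driver's resource check */
		 drm_sched_can_run_head_job(sched, entity)) ||
		kthread_should_stop());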

> As I said we exercised those ideas before and yes this approach here 
> came up before as well and no it doesn't work.

It can never deadlock with this patch as it stands (though it could busy
loop), and if properly moved into the wait_event_interruptible(), it
would also never busy loop and work entirely as intended. The actual API
change is sound.

I don't know why you're trying so hard to convince everyone that this
approach is fundamentally broken... It might be a bad idea for other
reasons, it might encourage incorrect usage, it might not be the best
option, there are plenty of arguments you can make... but you just keep
trying to make an argument that it just can't work at all for some
reason. Why? I already said I'm happy dropping it in favor of the fences...

It's intended to mirror the hw_submission_limit logic. If you think this
is broken, then that's broken too. They are equivalent mechanisms.

>> This particular issue aside, fairness in global resource allocation is a
>> conversation I'd love to have! Right now the driver doesn't try to
>> ensure that, a queue can easily monopolize certain hardware resources
>> (though one queue can only monopolize one of each, so you'd need
>> something like 63 queues with 63 distinct VMs all submitting
>> free-running jobs back to back in order to starve other queues of
>> resources forever). For starters, one thing I'm thinking of doing is
>> reserving certain subsets of hardware resources for queues with a given
>> priority, so you can at least guarantee forward progress of
>> higher-priority queues when faced with misbehaving lower-priority
>> queues. But if we want to guarantee proper fairness, I think I'll have
>> to start doing things like switching to a CPU-roundtrip submission model
>> when resources become scarce (to guarantee that queues actually release
>> the resources once in a while) and then figure out how to add fairness
>> to the allocation code...
>>
>> But let's have that conversation when we talk about the driver (or maybe
>> on IRC or something?), right now I'm more interested in getting the
>> abstractions reviewed ^^
> 
> Well that stuff is highly problematic as well. The fairness aside you 
> risk starvation which in turn breaks the guarantee of forward progress.
> 
> In this particular case you can catch this with a timeout for the hw 
> operation, but you should consider blocking that from the sw side as well.

In the current state I actually think it's not really that problematic,
because the resources are acquired directly in the ioctl path. So that
can block if starved, but if that can cause overall forward progress to
stop because some fence doesn't get signaled, then so can just not doing
the ioctl in the first place, so there's not much point (userspace can
always misbehave with its fence usage...). By the time anything gets
submitted to drm_sched, the resources are already guaranteed to be
acquired, we never block in the run callback.

It needs to be fixed of course, but if the threat model is a malicious
GPU process, well, there are many other ways to DoS your system... and I
don't think it's very likely that 63+ queues (which usually means 63+
processes with OpenGL) will end up accidentally starving the GPU in a
tight loop at the same time. I'd love to hear about real-world scenarios
where this kind of thing has been a real problem and not just a
theoretical one though... maybe I'm missing something?

Basically my priorities with the driver are:

1. Make sure it never crashes
2. Make sure it works well for real users
3. Make it work smoothly for real users under reasonable load
(priorities, CPU scheduler interactions, etc.)
4. Make it handle accidental problems more gracefully (OOMs etc, I need
to look into private GEM BO accounting to processes so the OOM killer
has better data to work with)
5. Make it more robust against deliberate abuse/starvation (this should
matter more once we have some kind of paravirtualization solution...)

And right now we're somewhere between 2 and 3. So if there are cases
where this resource acquisition stuff can cause a problem for real
users, I'll want to fix it earlier. But if this is more theoretical than
anything (with the resource limits of AGX GPUs), I'd rather focus on
things like memory accounting and shrinker support first.

>> We don't even have a shrinker yet, and I'm sure that's going to be a lot
>> of fun when we add it too... but yes, if we can't do any memory
>> allocations in some of these callbacks (is this documented anywhere?),
>> that's going to be interesting...
> 
> Yes, that is all part of the dma_fence documentation. It's just 
> absolutely not obvious what all this means.

I mean is there any documentation on how this interacts with drm_sched?
Like, am I not allowed to allocate memory in prepare()? What about
run()? What about GPU interrupt work? (not a raw IRQ handler context, I
mean the execution path from GPU IRQ to drm_sched run() fences getting
signaled)

>> It's not all bad news though! All memory allocations are fallible in
>> kernel Rust (and therefore explicit, and also failures have to be
>> explicitly handled or propagated), so it's pretty easy to point out
>> where they are, and there are already discussions of higher-level
>> tooling to enforce rules like that (and things like wait contexts).
>> Also, Rust makes it a lot easier to refactor code in general and not be
>> scared that you're going to regress everything, so I'm not really
>> worried if I need to turn a chunk of the driver on its head to solve
>> some of these problems in the future ^^ (I already did that when I
>> switched it from the "demo" synchronous submission model to the proper
>> explicit sync + fences one.)
> 
> Yeah, well the problem isn't that you run into memory allocation failure.

What I mean is that the mandatory failure handling means it's relatively
easy to audit where memory allocations can actually happen.

> The problem is rather something like this:
> 1. You try to allocate memory to signal your fence.
> 2. This memory allocation can't be fulfilled and goes to sleep to wait 
> for reclaim.
> 3. On another CPU reclaim is running and through the general purpose 
> shrinker, page fault or MMU notifier ends up waiting for your dma_fence.
> 
> You don't even need to implement the shrinker for this to go boom 
> extremely easily.

Hmm, can you actually get something waiting on a dma_fence like that
today with this driver? We don't have a shrinker, we don't have
synchronous page faults or MMU notifications for the GPU, and this is
explicit sync so all in/out fences cross over into userspace so surely
they can't be trusted anyway?

I'm definitely not familiar with the intricacies of DMA fences and how
they interact with everything else yet, but it's starting to sound like
either this isn't quite broken for our simple driver yet, or it must be
broken pretty much everywhere in some way...

> So everything involved with signaling the fence can allocate memory only 
> with GFP_ATOMIC and only if you absolutely have to.

I don't think we even have a good story for passing around gfp_flags in
Rust code so that will be interesting... though I need to actually audit
the code paths and see how many allocations we really do. I know I alloc
some vectors for holding completed commands and stuff like that, but I'm
pretty sure I can fix that one with some reworking, and I'm not sure how
many other random things there really are...? Obviously most allocations
happen at command creation time, on completion you mostly get a lot of
freeing, so maybe I can just eliminate all allocs and not worry about
GFP_ATOMIC.

~~ Lina
  
Christian König March 8, 2023, 7:12 p.m. UTC | #14
Am 08.03.23 um 20:05 schrieb Asahi Lina:
> [SNIP]
>> Well it's not the better way, it's the only way that works.
>>
>> I have to admit that my bet on your intentions was wrong, but even that
>> use case doesn't work correctly.
>>
>> See when your callback returns false it is perfectly possible that all
>> hw fences are signaled between returning that information and processing it.
>>
>> The result would be that the scheduler goes to sleep and never wakes up
>> again.
> That can't happen, because it will just go into another iteration of the
> drm_sched main loop since there is an entity available still.
>
> Rather there is probably the opposite bug in this patch: the can_run_job
> logic should be moved into the wait_event_interruptible() condition
> check, otherwise I think it can end up busy-looping since the condition
> itself can be true even when the can_run_job check blocks it.
>
> But there is no risk of it going to sleep and never waking up because
> job completions will wake up the waitqueue by definition, and that
> happens after the driver-side queues are popped. If this problem could
> happen, then the existing hw_submission_limit logic would be broken in
> the same way. It is logically equivalent in how it works.
>
> Basically, if properly done in wait_event_interruptible, it is exactly
> the logic of that macro that prevents this race condition and makes
> everything work at all. Without it, drm_sched would be completely broken.
>
>> As I said we exercised those ideas before and yes this approach here
>> came up before as well and no it doesn't work.
> It can never deadlock with this patch as it stands (though it could busy
> loop), and if properly moved into the wait_event_interruptible(), it
> would also never busy loop and work entirely as intended. The actual API
> change is sound.
>
> I don't know why you're trying so hard to convince everyone that this
> approach is fundamentally broken... It might be a bad idea for other
> reasons, it might encourage incorrect usage, it might not be the best
> option, there are plenty of arguments you can make... but you just keep
> trying to make an argument that it just can't work at all for some
> reason. Why? I already said I'm happy dropping it in favor of the fences...

Well because it is broken.

When you move the check into the wait_event_interruptible condition then 
who is going to call wait_event_interruptible when the condition changes?

As I said this idea came up before and was rejected multiple times.

Regards,
Christian.

>
> It's intended to mirror the hw_submission_limit logic. If you think this
> is broken, then that's broken too. They are equivalent mechanisms.
>
>>> This particular issue aside, fairness in global resource allocation is a
>>> conversation I'd love to have! Right now the driver doesn't try to
>>> ensure that, a queue can easily monopolize certain hardware resources
>>> (though one queue can only monopolize one of each, so you'd need
>>> something like 63 queues with 63 distinct VMs all submitting
>>> free-running jobs back to back in order to starve other queues of
>>> resources forever). For starters, one thing I'm thinking of doing is
>>> reserving certain subsets of hardware resources for queues with a given
>>> priority, so you can at least guarantee forward progress of
>>> higher-priority queues when faced with misbehaving lower-priority
>>> queues. But if we want to guarantee proper fairness, I think I'll have
>>> to start doing things like switching to a CPU-roundtrip submission model
>>> when resources become scarce (to guarantee that queues actually release
>>> the resources once in a while) and then figure out how to add fairness
>>> to the allocation code...
>>>
>>> But let's have that conversation when we talk about the driver (or maybe
>>> on IRC or something?), right now I'm more interested in getting the
>>> abstractions reviewed ^^
>> Well that stuff is highly problematic as well. The fairness aside you
>> risk starvation which in turn breaks the guarantee of forward progress.
>>
>> In this particular case you can catch this with a timeout for the hw
>> operation, but you should consider blocking that from the sw side as well.
> In the current state I actually think it's not really that problematic,
> because the resources are acquired directly in the ioctl path. So that
> can block if starved, but if that can cause overall forward progress to
> stop because some fence doesn't get signaled, then so can just not doing
> the ioctl in the first place, so there's not much point (userspace can
> always misbehave with its fence usage...). By the time anything gets
> submitted to drm_sched, the resources are already guaranteed to be
> acquired, we never block in the run callback.
>
> It needs to be fixed of course, but if the threat model is a malicious
> GPU process, well, there are many other ways to DoS your system... and I
> don't think it's very likely that 63+ queues (which usually means 63+
> processes with OpenGL) will end up accidentally starving the GPU in a
> tight loop at the same time. I'd love to hear about real-world scenarios
> where this kind of thing has been a real problem and not just a
> theoretical one though... maybe I'm missing something?
>
> Basically my priorities with the driver are:
>
> 1. Make sure it never crashes
> 2. Make sure it works well for real users
> 3. Make it work smoothly for real users under reasonable load
> (priorities, CPU scheduler interactions, etc.)
> 4. Make it handle accidental problems more gracefully (OOMs etc, I need
> to look into private GEM BO accounting to processes so the OOM killer
> has better data to work with)
> 5. Make it more robust against deliberate abuse/starvation (this should
> matter more once we have some kind of paravirtualization solution...)
>
> And right now we're somewhere between 2 and 3. So if there are cases
> where this resource acquisition stuff can cause a problem for real
> users, I'll want to fix it earlier. But if this is more theoretical than
> anything (with the resource limits of AGX GPUs), I'd rather focus on
> things like memory accounting and shrinker support first.
>
>>> We don't even have a shrinker yet, and I'm sure that's going to be a lot
>>> of fun when we add it too... but yes, if we can't do any memory
>>> allocations in some of these callbacks (is this documented anywhere?),
>>> that's going to be interesting...
>> Yes, that is all part of the dma_fence documentation. It's just
>> absolutely not obvious what all this means.
> I mean is there any documentation on how this interacts with drm_sched?
> Like, am I not allowed to allocate memory in prepare()? What about
> run()? What about GPU interrupt work? (not a raw IRQ handler context, I
> mean the execution path from GPU IRQ to drm_sched run() fences getting
> signaled)
>
>>> It's not all bad news though! All memory allocations are fallible in
>>> kernel Rust (and therefore explicit, and also failures have to be
>>> explicitly handled or propagated), so it's pretty easy to point out
>>> where they are, and there are already discussions of higher-level
>>> tooling to enforce rules like that (and things like wait contexts).
>>> Also, Rust makes it a lot easier to refactor code in general and not be
>>> scared that you're going to regress everything, so I'm not really
>>> worried if I need to turn a chunk of the driver on its head to solve
>>> some of these problems in the future ^^ (I already did that when I
>>> switched it from the "demo" synchronous submission model to the proper
>>> explicit sync + fences one.)
>> Yeah, well the problem isn't that you run into memory allocation failure.
> What I mean is that the mandatory failure handling means it's relatively
> easy to audit where memory allocations can actually happen.
>
>> The problem is rather something like this:
>> 1. You try to allocate memory to signal your fence.
>> 2. This memory allocation can't be fulfilled and goes to sleep to wait
>> for reclaim.
>> 3. On another CPU reclaim is running and through the general purpose
>> shrinker, page fault or MMU notifier ends up waiting for your dma_fence.
>>
>> You don't even need to implement the shrinker for this to go boom
>> extremely easily.
> Hmm, can you actually get something waiting on a dma_fence like that
> today with this driver? We don't have a shrinker, we don't have
> synchronous page faults or MMU notifications for the GPU, and this is
> explicit sync so all in/out fences cross over into userspace so surely
> they can't be trusted anyway?
>
> I'm definitely not familiar with the intricacies of DMA fences and how
> they interact with everything else yet, but it's starting to sound like
> either this isn't quite broken for our simple driver yet, or it must be
> broken pretty much everywhere in some way...
>
>> So everything involved with signaling the fence can allocate memory only
>> with GFP_ATOMIC and only if you absolutely have to.
> I don't think we even have a good story for passing around gfp_flags in
> Rust code so that will be interesting... though I need to actually audit
> the code paths and see how many allocations we really do. I know I alloc
> some vectors for holding completed commands and stuff like that, but I'm
> pretty sure I can fix that one with some reworking, and I'm not sure how
> many other random things there really are...? Obviously most allocations
> happen at command creation time, on completion you mostly get a lot of
> freeing, so maybe I can just eliminate all allocs and not worry about
> GFP_ATOMIC.
>
> ~~ Lina
  
Asahi Lina March 8, 2023, 7:45 p.m. UTC | #15
On 09/03/2023 04.12, Christian König wrote:
> Am 08.03.23 um 20:05 schrieb Asahi Lina:
>> [SNIP]
>>> Well it's not the better way, it's the only way that works.
>>>
>>> I have to admit that my bet on your intentions was wrong, but even that
>>> use case doesn't work correctly.
>>>
>>> See when your callback returns false it is perfectly possible that all
>>> hw fences are signaled between returning that information and processing it.
>>>
>>> The result would be that the scheduler goes to sleep and never wakes up
>>> again.
>> That can't happen, because it will just go into another iteration of the
>> drm_sched main loop since there is an entity available still.
>>
>> Rather there is probably the opposite bug in this patch: the can_run_job
>> logic should be moved into the wait_event_interruptible() condition
>> check, otherwise I think it can end up busy-looping since the condition
>> itself can be true even when the can_run_job check blocks it.
>>
>> But there is no risk of it going to sleep and never waking up because
>> job completions will wake up the waitqueue by definition, and that
>> happens after the driver-side queues are popped. If this problem could
>> happen, then the existing hw_submission_limit logic would be broken in
>> the same way. It is logically equivalent in how it works.
>>
>> Basically, if properly done in wait_event_interruptible, it is exactly
>> the logic of that macro that prevents this race condition and makes
>> everything work at all. Without it, drm_sched would be completely broken.
>>
>>> As I said we exercised those ideas before and yes this approach here
>>> came up before as well and no it doesn't work.
>> It can never deadlock with this patch as it stands (though it could busy
>> loop), and if properly moved into the wait_event_interruptible(), it
>> would also never busy loop and work entirely as intended. The actual API
>> change is sound.
>>
>> I don't know why you're trying so hard to convince everyone that this
>> approach is fundamentally broken... It might be a bad idea for other
>> reasons, it might encourage incorrect usage, it might not be the best
>> option, there are plenty of arguments you can make... but you just keep
>> trying to make an argument that it just can't work at all for some
>> reason. Why? I already said I'm happy dropping it in favor of the fences...
> 
> Well because it is broken.
> 
> When you move the check into the wait_event_interruptible condition then 
> who is going to call wait_event_interruptible when the condition changes?

I think you mean wake_up_interruptible(). That would be
drm_sched_job_done(), on the fence callback when a job completes, which
as I keep saying is the same logic used for
hw_rq_count/hw_submission_limit tracking.

Please think about it for a second, it's really not that complicated to
see why it works:

- Driver pops off completed commands <-- can_run_job condition satisfied
- Driver signals fence
 - drm_sched_job_done_cb()
  - drm_sched_job_done()
   - atomic_dec(&sched->hw_rq_count); <-- hw_submission_limit satisfied
   - ...
   - wake_up_interruptible(&sched->wake_up_worker);
      ^- happens after both conditions are potentially satisfied

It really is completely equivalent to just making the hw_rq_count logic
customizable by the driver. The actual flow is the same. As long as the
driver guarantees it satisfies the can_run_job() condition before
signaling the completion fence that triggered that change, it works fine.
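
On the driver side that ordering is just (made-up names, but this is the 
invariant being relied on):

static void my_queue_retire_cmd(struct my_queue *q, struct my_cmd *cmd)
{
	spin_lock(&q->lock);
	q->free_slots++;		/* 1. update what can_run_job() reads */
	spin_unlock(&q->lock);

	/* 2. only then signal: this invokes the scheduler's fence callback
	 * (drm_sched_job_done()), which does the wake_up_interruptible()
	 * shown above, so the condition is always re-checked afterwards. */
	dma_fence_signal(cmd->hw_fence);
}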

> As I said this idea came up before and was rejected multiple times.

Maybe it was a different idea, or maybe it was rejected for other
reasons, or maybe it was wrongly rejected for being broken when it isn't ^^

~~ Lina
  
Christian König March 8, 2023, 8:14 p.m. UTC | #16
Am 08.03.23 um 20:45 schrieb Asahi Lina:
> On 09/03/2023 04.12, Christian König wrote:
>> Am 08.03.23 um 20:05 schrieb Asahi Lina:
>>> [SNIP]
>>>> Well it's not the better way, it's the only way that works.
>>>>
>>>> I have to admit that my bet on your intentions was wrong, but even that
>>>> use case doesn't work correctly.
>>>>
>>>> See when your callback returns false it is perfectly possible that all
>>>> hw fences are signaled between returning that information and processing it.
>>>>
>>>> The result would be that the scheduler goes to sleep and never wakes up
>>>> again.
>>> That can't happen, because it will just go into another iteration of the
>>> drm_sched main loop since there is an entity available still.
>>>
>>> Rather there is probably the opposite bug in this patch: the can_run_job
>>> logic should be moved into the wait_event_interruptible() condition
>>> check, otherwise I think it can end up busy-looping since the condition
>>> itself can be true even when the can_run_job check blocks it.
>>>
>>> But there is no risk of it going to sleep and never waking up because
>>> job completions will wake up the waitqueue by definition, and that
>>> happens after the driver-side queues are popped. If this problem could
>>> happen, then the existing hw_submission_limit logic would be broken in
>>> the same way. It is logically equivalent in how it works.
>>>
>>> Basically, if properly done in wait_event_interruptible, it is exactly
>>> the logic of that macro that prevents this race condition and makes
>>> everything work at all. Without it, drm_sched would be completely broken.
>>>
>>>> As I said we exercised those ideas before and yes this approach here
>>>> came up before as well and no it doesn't work.
>>> It can never deadlock with this patch as it stands (though it could busy
>>> loop), and if properly moved into the wait_event_interruptible(), it
>>> would also never busy loop and work entirely as intended. The actual API
>>> change is sound.
>>>
>>> I don't know why you're trying so hard to convince everyone that this
>>> approach is fundamentally broken... It might be a bad idea for other
>>> reasons, it might encourage incorrect usage, it might not be the best
>>> option, there are plenty of arguments you can make... but you just keep
>>> trying to make an argument that it just can't work at all for some
>>> reason. Why? I already said I'm happy dropping it in favor of the fences...
>> Well because it is broken.
>>
>> When you move the check into the wait_event_interruptible condition then
>> who is going to call wait_event_interruptible when the condition changes?
> I think you mean wake_up_interruptible(). That would be
> drm_sched_job_done(), on the fence callback when a job completes, which
> as I keep saying is the same logic used for
> hw_rq_count/hw_submission_limit tracking.

As the documentation to wait_event says:

  * wake_up() has to be called after changing any variable that could
  * change the result of the wait condition.

So what you essentially try to do here is to skip that and say 
drm_sched_job_done() would call that anyway, but when you read any 
variable to determine that state then as far as I can see nothing is 
guaranteeing that order.
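
The pattern that documentation describes is, roughly:

	/* waker: change the state first, then wake up the waiters */
	WRITE_ONCE(done, true);
	wake_up(&wq);

	/* waiter: the condition is re-checked by the waitqueue machinery */
	wait_event(wq, READ_ONCE(done));

and the open question is what guarantees that ordering for whatever state 
your can_run_job() implementation reads.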

The only other possibility how you could use the callback correctly 
would be to call drm_fence_is_signaled() to query the state of your hw 
submission from the same fence which is then signaled. But then the 
question is once more why you don't give that fence directly to the 
scheduler?

> Please think about it for a second,

Yeah, I'm trying to really follow your intentions here. But that doesn't 
really make sense.

Either you are trying to do something invalid or you are trying to 
circumvent the object model somehow and add a shortcut for the signaling 
API. Both would be more than fishy.

Regards,
Christian.

>   it's really not that complicated to
> see why it works:
>
> - Driver pops off completed commands <-- can_run_job condition satisfied
> - Driver signals fence
>   - drm_sched_job_done_cb()
>    - drm_sched_job_done()
>     - atomic_dec(&sched->hw_rq_count); <-- hw_submission_limit satisfied
>     - ...
>     - wake_up_interruptible(&sched->wake_up_worker);
>        ^- happens after both conditions are potentially satisfied
>
> It really is completely equivalent to just making the hw_rq_count logic
> customizable by the driver. The actual flow is the same. As long as the
> driver guarantees it satisfies the can_run_job() condition before
> signaling the completion fence that triggered that change, it works fine.
>
>> As I said this idea came up before and was rejected multiple times.
> Maybe it was a different idea, or maybe it was rejected for other
> reasons, or maybe it was wrongly rejected for being broken when it isn't ^^
>
> ~~ Lina
  
Asahi Lina March 9, 2023, 6:30 a.m. UTC | #17
On 09/03/2023 05.14, Christian König wrote:
>> I think you mean wake_up_interruptible(). That would be
>> drm_sched_job_done(), on the fence callback when a job completes, which
>> as I keep saying is the same logic used for
>> hw_rq_count/hw_submission_limit tracking.
> 
> As the documentation to wait_event says:
> 
>   * wake_up() has to be called after changing any variable that could
>   * change the result of the wait condition.
> 
> So what you essentially try to do here is to skip that and say 
> drm_sched_job_done() would call that anyway, but when you read any 
> variable to determine that state then as far as I can see nothing is 
>> guaranteeing that order.

The driver needs to guarantee that any changes to that state precede a
job completion fence signal of course, that's the entire idea of the
API. It's supposed to represent a check for per-scheduler (or more
specific, but not more global) resources that are released on job
completion. Of course if you misuse the API you could cause a problem,
but what I'm trying to say is that the API as designed and when used as
intended does work properly.

Put another way: job completions always need to cause the sched main
loop to run an iteration anyway (otherwise we wouldn't make forward
progress), and job completions are exactly the signal that the
can_run_job() condition may have changed.

> The only other possibility how you could use the callback correctly 
> would be to call drm_fence_is_signaled() to query the state of your hw 
> submission from the same fence which is then signaled. But then the 
> question is once more why you don't give that fence directly to the 
> scheduler?

But the driver is supposed to guarantee that the ordering is always 1.
resources freed, 2. fence signaled. So you don't need to check for the
fence, you can just check for the resource state. If the callback
returns false then by definition the fence wasn't yet signaled at some
point during its execution (because the resources weren't yet freed),
and since it would be in the wait_event_interruptible() check path, by
definition the fence signaling at any point during or after the check
would cause the thread to wake up again and re-check.

Thread 1                                          Thread 2
1. wait_event_interruptible() arms wq             1. Free resources
2. can_run_job() checks resources                 2. Signal fence
3. wait_event_interruptible() sleeps on wq        3. Fence wakes up wq
4. loop

There is no possible interleaving of those sequences that leads to a
lost event and the thread not waking up:
- If T2.3 happens before T1.1, that means T2.1 happened earlier and T1.2
must return true.
- If T2.3 happens after T1.1 but before T1.3, the wq code will ensure
the wq does not sleep (or immediately wakes up) at T1.3 since it was
signaled during the condition check, after the wq was armed. At the next
check loop, T1.2 will then return true, since T2.1 already happened
before T2.3.
- If T2.3 happens during T1.3, the wq wakes up normally and does another
check, and at that point T1.2 returns true.

QED.

~~ Lina
  
Christian König March 9, 2023, 8:05 a.m. UTC | #18
Am 09.03.23 um 07:30 schrieb Asahi Lina:
> On 09/03/2023 05.14, Christian König wrote:
>>> I think you mean wake_up_interruptible(). That would be
>>> drm_sched_job_done(), on the fence callback when a job completes, which
>>> as I keep saying is the same logic used for
>>> hw_rq_count/hw_submission_limit tracking.
>> As the documentation to wait_event says:
>>
>>    * wake_up() has to be called after changing any variable that could
>>    * change the result of the wait condition.
>>
>> So what you essentially try to do here is to skip that and say
>> drm_sched_job_done() would call that anyway, but when you read any
>> variable to determine that state then as far as I can see nothing is
>> guaranteeing that order.
> The driver needs to guarantee that any changes to that state precede a
> job completion fence signal of course, that's the entire idea of the
> API. It's supposed to represent a check for per-scheduler (or more
> specific, but not more global) resources that are released on job
> completion. Of course if you misuse the API you could cause a problem,
> but what I'm trying to say is that the API as designed and when used as
> intended does work properly.
>
> Put another way: job completions always need to cause the sched main
> loop to run an iteration anyway (otherwise we wouldn't make forward
> progress), and job completions are exactly the signal that the
> can_run_job() condition may have changed.
>
>> The only other possibility how you could use the callback correctly
>> would be to call drm_fence_is_signaled() to query the state of your hw
>> submission from the same fence which is then signaled. But then the
>> question is once more why you don't give that fence directly to the
>> scheduler?
> But the driver is supposed to guarantee that the ordering is always 1.
> resources freed, 2. fence signaled. So you don't need to check for the
> fence, you can just check for the resource state.

Yeah, but this is exactly what the dma_fence framework tried to prevent. 
We try very hard to avoid such side channel signaling :)

But putting that issue aside for a moment. What I don't get is when you 
have such intra queue dependencies, then why can't you check that at a 
much higher level?

In other words even userspace should be able to predict that its 
submissions need X amount of resources, and that running them all in 
parallel won't work.

Asking the firmware for a status is usually magnitudes slower than 
just computing it before submission.

Regards,
Christian.


> If the callback
> returns false then by definition the fence wasn't yet signaled at some
> point during its execution (because the resources weren't yet freed),
> and since it would be in the wait_event_interruptible() check path, by
> definition the fence signaling at any point during or after the check
> would cause the thread to wake up again and re-check.
>
> Thread 1                                          Thread 2
> 1. wait_event_interruptible() arms wq             1. Free resources
> 2. can_run_job() checks resources                 2. Signal fence
> 3. wait_event_interruptible() sleeps on wq        3. Fence wakes up wq
> 4. loop
>
> There is no possible interleaving of those sequences that leads to a
> lost event and the thread not waking up:
> - If T2.3 happens before T1.1, that means T2.1 happened earlier and T1.2
> must return true.
> - If T2.3 happens after T1.1 but before T1.3, the wq code will ensure
> the wq does not sleep (or immediately wakes up) at T1.3 since it was
> signaled during the condition check, after the wq was armed. At the next
> check loop, T1.2 will then return true, since T2.1 already happened
> before T2.3.
> - If T2.3 happens during T1.3, the wq wakes up normally and does another
> check, and at that point T1.2 returns true.
>
> QED.
>
> ~~ Lina
  
Asahi Lina March 9, 2023, 9:14 a.m. UTC | #19
On 09/03/2023 17.05, Christian König wrote:
> Am 09.03.23 um 07:30 schrieb Asahi Lina:
>> On 09/03/2023 05.14, Christian König wrote:
>>>> I think you mean wake_up_interruptible(). That would be
>>>> drm_sched_job_done(), on the fence callback when a job completes, which
>>>> as I keep saying is the same logic used for
>>>> hw_rq_count/hw_submission_limit tracking.
>>> As the documentation to wait_event says:
>>>
>>>    * wake_up() has to be called after changing any variable that could
>>>    * change the result of the wait condition.
>>>
>>> So what you essentially try to do here is to skip that and say
>>> drm_sched_job_done() would call that anyway, but when you read any
>>> variable to determine that state then as far as I can see nothing is
>>> guaranteeing that order.
>> The driver needs to guarantee that any changes to that state precede a
>> job completion fence signal of course, that's the entire idea of the
>> API. It's supposed to represent a check for per-scheduler (or more
>> specific, but not more global) resources that are released on job
>> completion. Of course if you misuse the API you could cause a problem,
>> but what I'm trying to say is that the API as designed and when used as
>> intended does work properly.
>>
>> Put another way: job completions always need to cause the sched main
>> loop to run an iteration anyway (otherwise we wouldn't make forward
>> progress), and job completions are exactly the signal that the
>> can_run_job() condition may have changed.
>>
>>> The only other possibility how you could use the callback correctly
>>> would be to call drm_fence_is_signaled() to query the state of your hw
>>> submission from the same fence which is then signaled. But then the
>>> question is once more why you don't give that fence directly to the
>>> scheduler?
>> But the driver is supposed to guarantee that the ordering is always 1.
>> resources freed, 2. fence signaled. So you don't need to check for the
>> fence, you can just check for the resource state.
> 
> Yeah, but this is exactly what the dma_fence framework tried to prevent. 
> We try very hard to avoid such side channel signaling :)

Right, and it's fine, I can use the fences directly easily enough. I'm
just trying to explain why my original idea works too, even if it's not
the best solution for other reasons!

Of course I don't have the context of what other drivers are doing or
did historically and what the pitfalls are, so I can't know what the
"right" solution for any of this is in that context. I did my best to
understand the drm_sched code and come up with a solution that works
(which it does) without any more info. When I saw the hw submission
limit stuff, I thought "okay, I need the same thing but with slightly
more complex logic, so let's add a callback so the driver can customize
it and do its own inflight counting".

After this discussion, I can see that this is equivalent to doing the
same check in prepare_job() followed by returning the oldest running
job's fence (as long as there's no race there... it should be fine if
the fence reference is taken first, before the resource check, or if
everything is done within the same critical section taking the firmware
queue lock), so I'm happy to switch to that and drop this patch.

But keep in mind none of this is documented, and there's no way for us
driver authors to understand what we're supposed to do without
documentation. As I said I spent a long time trying to understand
drm_sched, and then my original attempt missed the drm_sched_fini()
issue with dangling jobs and Alyssa managed to hit an oops on the test
branch, I guessed what the problem was from her trace, figured out a way
to reproduce it (the kill-loop glmark2 thing), and fixed it in the next
patch in this series. So even trying my best to figure out how to do
this, reading the code and what scarce docs there are, I managed to miss
something that caused a potential oops on the first try. If I can't even
get the API usage right after spending hours on it trying really hard
not to (because it's not just about my driver, I need the Rust
abstraction to be safe for any driver), there's no way I'm going to
divine what approaches to resource/dependency signaling are
problematic/easy to abuse... the most I can hope for is "I got the
wrapper right and the API/driver interaction is correct and guarantees
forward progress if the driver follows the rules".

So when I submit something, and you reply with "Well complete NAK",
that's just not nice. Honestly, I was kind of upset when I got that
email. It sounded as if you were saying my solution was completely
broken and couldn't work, but no matter how I looked at it I couldn't
figure out how it's broken. And then it took several emails to even
understand what you were suggesting with the prepare_job callback (and
yes, that works too and is probably harder to abuse than a new
callback). I'm trying really hard to make this all work and be correct,
and of course I make mistakes too... but then I look at the code and no
matter what I can come up with it seems to work and be correct, what am
I supposed to do? I'm happy to learn and figure out better approaches
for everything that lead to better drivers, but I need an actual
explanation of the issues, not just a NAK...

I also would appreciate it if people give me the benefit of the doubt
and let me explain what I'm doing and how I'm doing it and how this
hardware works, because the whole thing is subtle to the core and very
different to other GPUs. Honestly, I don't think any reviewer that
hasn't spent hours poring over the driver/abstraction code could
confidently say that a certain subtle sync issue exists at a first pass
(other than for really obvious bad code sequences). I'm happy to look
into issues and I definitely want to know what cases to look at and what
to check for and fix anything we find... but isn't it better if we work
together instead of shouting "this is broken" at the first hint of
possible trouble?

> But putting that issue aside for a moment. What I don't get is when you 
> have such intra queue dependencies, then why can't you check that at a 
> much higher level?
> 
> In other words even userspace should be able to predict that for it's 
> submissions X amount of resources are needed and when all of my 
> submissions run in parallel that won't work.

Technically yes, but we can't trust userspace to honor this, since
overflowing the firmware queue breaks everything, so the kernel has to
do the check... plus we're trying to insulate userspace from the details
of how work is queued at the firmware. We need to support multiple
firmware versions including future ones we can't predict yet without
breaking UAPI, so the less the UAPI depends on firmware details, the
better. That's why at the UAPI level, this is boiled down to a simpler
"max commands per submission" limit that gets passed in the params
struct, which is conservative, and then the kernel can deal with the
actual in-flight count tracking and only submit things to the hardware
when they fit.

In the future we could even support job splitting on the kernel side and
remove the max commands per submission limit altogether (though it
probably still makes sense to have for other reasons, like bounding how
much kernel/firmware memory a single queue can consume, so I'm not sure
this is even worth doing at all).

> Asking the firmware for a status is usually a magnitudes slower than 
> just computing it before submission.

I'm not asking the firmware for status, I'm just asking my own firmware
queue code how many slots are currently free in each backing queue.
That's just based on internal driver state, there is no firmware round trip!

I could technically compute this before submission and figure out how
much work has been queued and pre-populate fences that ensure we never
exceed the max, but honestly that's a lot more code to track job sizes
and I don't think it makes sense when I can just ask "Do we have space?
No? Okay, return the oldest running job fence for now and try again when
it completes" in prepare_job(). Maybe it's faster in pathological cases
to do something fancier, but let's wait until Vulkan works and we can
run real AAA games and see where the bottlenecks are before going down
the optimization road ^^

~~ Lina
  
Faith Ekstrand March 9, 2023, 6:50 p.m. UTC | #20
Jumping in here quick... (Sorry, I was out yesterday and was ignoring
my e-mail on Tuesday so I could finally type some compiler code.)

On Thu, 2023-03-09 at 18:14 +0900, Asahi Lina wrote:
> On 09/03/2023 17.05, Christian König wrote:
> > Am 09.03.23 um 07:30 schrieb Asahi Lina:
> > > On 09/03/2023 05.14, Christian König wrote:
> > > > > I think you mean wake_up_interruptible(). That would be
> > > > > drm_sched_job_done(), on the fence callback when a job
> > > > > completes, which
> > > > > as I keep saying is the same logic used for
> > > > > hw_rq_count/hw_submission_limit tracking.
> > > > As the documentation to wait_event says:
> > > > 
> > > >    * wake_up() has to be called after changing any variable
> > > > that could
> > > >    * change the result of the wait condition.
> > > > 
> > > > So what you essentially try to do here is to skip that and say
> > > > drm_sched_job_done() would call that anyway, but when you read
> > > > any
> > > > variable to determine that state then as far as I can see
> > > > nothing is
> > > > guarantying that order.
> > > The driver needs to guarantee that any changes to that state
> > > precede a
> > > job completion fence signal of course, that's the entire idea of
> > > the
> > > API. It's supposed to represent a check for per-scheduler (or
> > > more
> > > specific, but not more global) resources that are released on job
> > > completion. Of course if you misuse the API you could cause a
> > > problem,
> > > but what I'm trying to say is that the API as designed and when
> > > used as
> > > intended does work properly.
> > > 
> > > Put another way: job completions always need to cause the sched
> > > main
> > > loop to run an iteration anyway (otherwise we wouldn't make
> > > forward
> > > progress), and job completions are exactly the signal that the
> > > can_run_job() condition may have changed.
> > > 
> > > > The only other possibility how you could use the callback
> > > > correctly
> > > > would be to call drm_fence_is_signaled() to query the state of
> > > > your hw
> > > > submission from the same fence which is then signaled. But then
> > > > the
> > > > question is once more why you don't give that fence directly to
> > > > the
> > > > scheduler?
> > > But the driver is supposed to guarantee that the ordering is
> > > always 1.
> > > resources freed, 2. fence signaled. So you don't need to check
> > > for the
> > > fence, you can just check for the resource state.
> > 
> > Yeah, but this is exactly what the dma_fence framework tried to
> > prevent. 
> > We try very hard to avoid such side channel signaling :)
> 
> Right, and it's fine, I can use the fences directly easily enough.
> I'm
> just trying to explain why my original idea works too, even if it's
> not
> the best solution for other reasons!
> 
> Of course I don't have the context of what other drivers are doing or
> did historically and what the pitfalls are, so I can't know what the
> "right" solution for any of this is in that context. I did my best to
> understand the drm_sched code and come up with a solution that works
> (which it does) without any more info. When I saw the hw submission
> limit stuff, I thought "okay, I need the same thing but with slightly
> more complex logic, so let's add a callback so the driver can
> customize
> it and do its own inflight counting".

So, I think there's a difference here between "impossible to implement
correctly", "likely to be implemented correctly", and "impossible to
implement incorrectly".  It's obviously possible to implement
correctly.  You can just always return true or do exactly the same
check or do some simple thing where you can guarantee that it will only
ever return false when there's a bunch of other stuff in the queue. 
That doesn't mean that it's likely to be implemented correctly by some
other driver.  Some idiot will come along and try to take advantage of
it and cause themselves horrible problems.

And, to be clear, for the purposes of this discussion, we're ALL
idiots, myself included.  If there's one thing the DRM community has
learned over the years, it's that drivers are so complex that we all
turn into idiots at some point, relative to the complexity of the code
and hardware behavior.  That's why things like dma_fence are written so
incredibly defensively and why we're so harsh about the rules.  It's
the rules and not our individual smarts that keep us from making
mistakes.  (Kinda like Rust, in a way.)  So while I appreciate the
frustration of "I'm just trying to do something that's clearly correct
here", that doesn't mean that then next person to come by and fix a bug
by tweaking that callback isn't going to screw it up irreparably.  That
person may even be you in 6 to 12 months after this e-mail thread is a
distant memory.

So, yes, does the implementation you have today work without deadlocks
or starvation?  Maybe it does.  I've not verified.  Is the suggested
callback a giant foot-gun in the already treacherous territory of
scheduling and fencing?  Yeah, it probably is and there's another way
to implement the same behavior which is likely safer in the long run.

> After this discussion, I can see that this is equivalent to doing the
> same check in prepare_job() followed by returning the oldest running
> job's fence (as long as there's no race there... it should be fine if
> the fence reference is taken first, before the resource check, or if
> everything is done within the same critical section taking the
> firmware
> queue lock), so I'm happy to switch to that and drop this patch.
> 
> But keep in mind none of this is documented, and there's no way for
> us
> driver authors to understand what we're supposed to do without
> documentation. As I said I spent a long time trying to understand
> drm_sched, and then my original attempt missed the drm_sched_fini()
> issue with dangling jobs and Alyssa managed to hit an oops on the
> test
> branch, I guessed what the problem was from her trace, figured out a
> way
> to reproduce it (the kill-loop glmark2 thing), and fixed it in the
> next
> patch in this series. So even trying my best to figure out how to do
> this, reading the code and what scarce docs there are, I managed to
> miss
> something that caused a potential oops on the first try. If I can't
> even
> get the API usage right after spending hours on it trying really hard
> not to (because it's not just about my driver, I need the Rust
> abstraction to be safe for any driver), there's no way I'm going to
> divine what approaches to resource/dependency signaling are
> problematic/easy to abuse... the most I can hope for is "I got the
> wrapper right and the API/driver interaction is correct and
> guarantees
> forward progress if the driver follows the rules".

Your frustration with the lack of good documentation in DRM is entirely
justified.  It's a mess and there's not a whole lot of people who
understand all these subtleties.  Connecting to the hive mind via e-
mail and asking questions is the best you can do a lot of the time, I'm
afraid.  I wish we had better documentation for a lot of things and I'd
be happy to see the situation improved, but we've got a lot of
debt there and not always a lot of time.  (Yeah, I know, that's every
senior engineer's excuse...)  We really are trying to be better about
it moving forward, though.  Daniel has been pushing people to document
things a lot more in recent years.  But, yeah, lots of debt...

Also, in a weird way, I think these conversations are sometimes better
than documentation.  It took a while to get around to it all but
there's a lot of context that was brought together in this e-mail
thread that wouldn't have been in the docs no matter how good they are.
A lot of it isn't an isolated thing that should clearly be explained in
the run_job docs.  It's subtle interactions which happen when all the
pieces come together.  I see this complaint a lot about Vulkan as well.
There are behaviors which only become evident when you find the right 5
pieces of the spec and put them all together and squint.  It'd be good
to call those out sometimes but there's no way we can document all of
them.

> So when I submit something, and you reply with "Well complete NAK",
> that's just not nice. Honestly, I was kind of upset when I got that
> email. It sounded as if you were saying my solution was completely
> broken and couldn't work, but no matter how I looked at it I couldn't
> figure out how it's broken. And then it took several emails to even
> understand what you were suggesting with the prepare_job callback
> (and
> yes, that works too and is probably harder to abuse than a new
> callback). I'm trying really hard to make this all work and be
> correct,
> and of course I make mistakes too... but then I look at the code and
> no
> matter what I can come up with it seems to work and be correct, what
> am
> I supposed to do? I'm happy to learn and figure out better approaches
> for everything that lead to better drivers, but I need an actual
> explanation of the issues, not just a NAK...
> 
> I also would appreciate it if people give me the benefit of the doubt
> and let me explain what I'm doing and how I'm doing it and how this
> hardware works, because the whole thing is subtle to the core and
> very
> different to other GPUs. Honestly, I don't think any reviewer that
> hasn't spent hours poring over the driver/abstraction code could
> confidently say that a certain subtle sync issue exists at a first
> pass
> (other than for really obvious bad code sequences). I'm happy to look
> into issues and I definitely want to know what cases to look at and
> what
> to check for and fix anything we find... but isn't it better if we
> work
> together instead of shouting "this is broken" at the first hint of
> possible trouble?

Debating if I want to wade in in this one because this thread is
already getting a bit warm and I don't want to make it worse.  But, I'm
an idiot, so...

Knowing what I do of both people in this thread, I think Christian is
giving you more benefit of the doubt than you realize.  Yes, his tone
may be a bit abrupt but he continued to spend his time responding in
detail to every question you raised.  That means he was taking you
seriously, even if he wasn't yielding ground.

Communication is hard, especially with all the different personalities,
languages, and cultures involved in an international community like
this.  Sometimes the clarity of saying "no, this isn't going to work"
up-front is necessary.  Sometimes the person on the other end of the e-
mail could benefit from a gentler response.  It's hard to know from
early interactions.  Enough people have been wrong about dma_fence over
the years (Hi! It's me!) that "no" is often the right starting
position. 😭️  It doesn't always feel great to be on the receiving end
of that but Christian is pretty much guarding a dragon cave, so...

To be clear, none of that is a defense of the toxicity for which the
Linux community has gotten a reputation.  A lot of subsystem
maintainers have been known to start of with "no" to any idea they
didn't already think of.  That's bad.  Generally, you shouldn't assume
everyone but you is an idiot.  When it comes to dma_fence, though, the
assumption is that we're ALL idiots and the "No, seriously, don't go
into the dragon cave.  You won't come out alive.  You're not that
special." signs are justified. 😓️

I hope the context I'm providing here is helpful.  If not, feel free to
ignore me.  It looks like you got the technical issues sorted.

~Faith

>
  
Asahi Lina March 10, 2023, 9:16 a.m. UTC | #21
On 10/03/2023 03.50, Faith Ekstrand wrote:
> Jumping in here quick... (Sorry, I was out yesterday and was ignoring
> my e-mail on Tuesday so I could finally type some compiler code.)
> 
> On Thu, 2023-03-09 at 18:14 +0900, Asahi Lina wrote:
>> On 09/03/2023 17.05, Christian König wrote:
>>> Am 09.03.23 um 07:30 schrieb Asahi Lina:
>>>> On 09/03/2023 05.14, Christian König wrote:
>>>>>> I think you mean wake_up_interruptible(). That would be
>>>>>> drm_sched_job_done(), on the fence callback when a job
>>>>>> completes, which
>>>>>> as I keep saying is the same logic used for
>>>>>> hw_rq_count/hw_submission_limit tracking.
>>>>> As the documentation to wait_event says:
>>>>>
>>>>>    * wake_up() has to be called after changing any variable
>>>>> that could
>>>>>    * change the result of the wait condition.
>>>>>
>>>>> So what you essentially try to do here is to skip that and say
>>>>> drm_sched_job_done() would call that anyway, but when you read
>>>>> any
>>>>> variable to determine that state then as far as I can see
>>>>> nothing is
>>>>> guarantying that order.
>>>> The driver needs to guarantee that any changes to that state
>>>> precede a
>>>> job completion fence signal of course, that's the entire idea of
>>>> the
>>>> API. It's supposed to represent a check for per-scheduler (or
>>>> more
>>>> specific, but not more global) resources that are released on job
>>>> completion. Of course if you misuse the API you could cause a
>>>> problem,
>>>> but what I'm trying to say is that the API as designed and when
>>>> used as
>>>> intended does work properly.
>>>>
>>>> Put another way: job completions always need to cause the sched
>>>> main
>>>> loop to run an iteration anyway (otherwise we wouldn't make
>>>> forward
>>>> progress), and job completions are exactly the signal that the
>>>> can_run_job() condition may have changed.
>>>>
>>>>> The only other possibility how you could use the callback
>>>>> correctly
>>>>> would be to call drm_fence_is_signaled() to query the state of
>>>>> your hw
>>>>> submission from the same fence which is then signaled. But then
>>>>> the
>>>>> question is once more why you don't give that fence directly to
>>>>> the
>>>>> scheduler?
>>>> But the driver is supposed to guarantee that the ordering is
>>>> always 1.
>>>> resources freed, 2. fence signaled. So you don't need to check
>>>> for the
>>>> fence, you can just check for the resource state.
>>>
>>> Yeah, but this is exactly what the dma_fence framework tried to
>>> prevent. 
>>> We try very hard to avoid such side channel signaling :)
>>
>> Right, and it's fine, I can use the fences directly easily enough.
>> I'm
>> just trying to explain why my original idea works too, even if it's
>> not
>> the best solution for other reasons!
>>
>> Of course I don't have the context of what other drivers are doing or
>> did historically and what the pitfalls are, so I can't know what the
>> "right" solution for any of this is in that context. I did my best to
>> understand the drm_sched code and come up with a solution that works
>> (which it does) without any more info. When I saw the hw submission
>> limit stuff, I thought "okay, I need the same thing but with slightly
>> more complex logic, so let's add a callback so the driver can
>> customize
>> it and do its own inflight counting".
> 
> So, I think there's a difference here between "impossible to implement
> correctly", "likely to be implemented correctly", and "impossible to
> implement incorrectly".  It's obviously possible to implement
> correctly.  You can just always return true or do exactly the same
> check or do some simple thing where you can guarantee that it will only
> ever return false when there's a bunch of other stuff in the queue. 
> That doesn't mean that it's likely to be implemented correctly by some
> other driver.  Some idiot will come along and try to take advantage of
> it and cause themselves horrible problems.
> 
> And, to be clear, for the purposes of this discussion, we're ALL
> idiots, myself included.  If there's one thing the DRM community has
> learned over the years, it's that drivers are so complex that we all
> turn into idiots at some point, relative to the complexity of the code
> and hardware behavior.  That's why things like dma_fence are written so
> incredibly defensively and why we're so harsh about the rules.  It's
> the rules and not our individual smarts that keep us from making
> mistakes.  (Kinda like Rust, in a way.)  So while I appreciate the
> frustration of "I'm just trying to do something that's clearly correct
> here", that doesn't mean that then next person to come by and fix a bug
> by tweaking that callback isn't going to screw it up irreparably.  That
> person may even be you in 6 to 12 months after this e-mail thread is a
> distant memory.
> 
> So, yes, does the implementation you have today work without deadlocks
> or starvation?  Maybe it does.  I've not verified.  Is the suggested
> callback a giant foot-gun in the already treacherous territory of
> scheduling and fencing?  Yeah, it probably is and there's another way
> to implement the same behavior which is likely safer in the long run.

I understand that... I just wish the response had been along the lines
of "this is a huge footgun for these reasons, and you don't need it
because you can do it this other way instead", not "the concept is
completely broken, NAK".

If the discussion were phrased around how the API can be used and
abused, then I can understand what the concern is. But it was somehow
always about me and what I'm doing...

> This is clearly going against the idea of having jobs only depend on 
> fences and nothing else which is mandatory for correct memory management.

That implies what I'm doing breaks memory management (and that it is
obvious).

> And to make it clear this is unfortunately a complete NAK to this 
> approach! You can't do this!

Again that I can't do it... and then we got an argument over whether the
code is actually broken or not. But that doesn't even matter, since the
issue is how easy the API is to use or misuse, not whether I actually
misuse it...

I'll switch to prepare_job() fences for the next version, so it's not an
issue. Using that didn't even cross my mind because, knowing nothing
about the intended usage here, the prepare_job() callback docs are quite
obtuse:

> Called when the scheduler is considering scheduling this job next, to
> get another struct dma_fence for this job to block on. Once it
> returns NULL, run_job() may be called.
>
> Can be NULL if no additional preparation to the dependencies are
> necessary. Skipped when jobs are killed instead of run.

What's a "dependency"? To me that sounded like execution dependencies,
and we clearly express those in the jobs themselves ahead of time. But
it turns out the purpose of this callback is to grab resources just in
time before execution or block on them becoming available through a
fence, and then it makes a lot more sense how to use it to do in-flight
command count limiting.

Aside: now that I understand this, I'm tempted to make the Rust
signature for this return a Result<(), Fence>. Returning a fence is
essentially the "error" case here, and that means in the implementation
you can just do:

// foo_res/bar_res are per-job slots, so resources acquired on an earlier
// prepare_job() attempt are kept across retries; `?` returns the fence to
// block on when a resource isn't available yet.
if job.foo_res.is_none() {
    job.foo_res = Some(foo.get_resource()?);
}
if job.bar_res.is_none() {
    job.bar_res = Some(bar.get_resource()?);
}

As long as all the get_resource() calls return a Result<Resource, Fence>.

There's even more undocumented subtlety here though: as far as I can
tell, if the resources aren't always grabbed in the same order, or if
more than one unit of a single resource is grabbed separately, you could
deadlock or even livelock?

This is theoretical since right now I don't handle this properly at all
other than the command count limit (I need the command struct fixup
system for this to be reasonably possible), but for example, I'll need
1-3 event IDs per job, and if I grab them one by one, you could end up
deadlocking with all event IDs used by jobs waiting for more. And if I
don't store them eagerly (so drop the IDs if you can't get all of them),
then you can end up with livelocks where every scheduler is grabbing an
ID, then dropping it when we can't get another one, which signals a
fence for another blocked scheduler to grab another ID, which then drops
it because it can't get more, etc. So I probably need to grab a number
of event IDs atomically.
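
Something like this (completely hypothetical names, just to illustrate
the all-or-nothing idea) is roughly what I have in mind for the event ID
pool:

static struct dma_fence *my_reserve_event_ids(struct my_id_pool *pool,
					      struct my_job *job,
					      unsigned int count)
{
	struct dma_fence *fence = NULL;

	spin_lock(&pool->lock);
	if (pool->free_ids >= count) {
		/* Take all the IDs at once; they get released together
		 * when the job's completion fence signals. */
		pool->free_ids -= count;
		job->reserved_ids = count;
	} else {
		/* Never hold on to a partial set (that's the livelock
		 * case); wait for the oldest holder to finish instead. */
		fence = dma_fence_get(pool->oldest_holder_fence);
	}
	spin_unlock(&pool->lock);

	return fence;
}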

> Also, in a weird way, I think these conversations are sometimes better
> than documentation.  It took a while to get around to it all but
> there's a lot of context that was brought together in this e-mail
> thread that wouldn't have been in the docs no matter how good they are.
> A lot of it isn't an isolated thing that should clearly be explained in
> the run_job docs.  It's subtle interactions which happen when all the
> pieces come together.  I see this complaint a lot about Vulkan as well.
> There are behaviors which only become evident when you find the right 5
> pieces of the spec and put them all together and squint.  It'd be good
> to call those out sometimes but there's no way we can document all of
> them.

That's true, but I think we could improve things a lot even with just
better docs and more hyperlinking between docs... For example, the GEM
and DMA fence docs do have quite a bit of prose that gets you some
context (even if it's a bit outdated and not complete). But drm_sched
just has one paragraph and a list giving a high-level design, and then
goes straight into function docs. It definitely takes putting together
the sched, fence, dma_resv, etc. docs together to get the big picture,
but if those docs all at least point at each other and are individually
reasonably complete, then we'd have a chance ^^

~~ Lina
  
Daniel Vetter March 16, 2023, 1:40 p.m. UTC | #22
On Wed, Mar 08, 2023 at 04:19:17PM +0100, Karol Herbst wrote:
> On Wed, Mar 8, 2023 at 4:09 PM Christian König <christian.koenig@amd.com> wrote:
> >
> > Am 08.03.23 um 15:43 schrieb Karol Herbst:
> > > [SNIP]
> > > "further"? There was no discussion at all,
> >
> > Yeah, well that is exactly what I wanted to achieve.
> >
> > >   you just started off like
> > > that. If you think somebody misses that connection, you can point out
> > > to documentation/videos whatever so the contributor can understand
> > > what's wrong with an approach. You did that, so that's fine. It's just
> > > starting off _any_ discussion with a "Well complete NAK" is terrible
> > > style. I'd feel uncomfortable if that happened to me and I'm sure
> > > there are enough people like that that we should be more reasonable
> > > with our replies. Just.. don't.
> > >
> > > We are all humans here and people react negatively to such things. And
> > > if people do it on purpose it just makes it worse.
> >
> > I completely see your point, I just don't know how to improve it.
> >
> > I don't stop people like this because I want to make them uncomfortable
> > but because I want to prevent further discussions on that topic.
> >
> > In other words how can I make people notice that this is something
> > fundamental while still being polite?

Ask them to improve the docs. Gets them on board, and for bonus points you
- can check they actually get it when you review the doc patch
- get scheduler docs for free
- have an easily pasteable link for next time around instead of just an
  aggressive NAK that helps no one really (aside from getting people
  boiling).

It's not really about being polite but making sure that efficient
communication happens and that you don't have to repeat yourself. In rare
cases you get to type the docs yourself when people are too dense to learn
(like what I had to do with the various dma_fence docs).

> I think a little improvement over this would be to at least wait a few
> replies before resorting to those strong statements. Just before it
> becomes a risk in just wasting time.

See above for what I'm trying to do: get the message to sink in, either
as a proper doc patch or by linking to that doc patch the next time
around (because let's face it, this entire concept of "dma_fence
committed for execution" is extremely tricky, and there will be
repetitions of this question until we've sunset dma_fence, which is
probably decades away).

If the learning does not happen, then it's time to whack the big
hammer (and if people don't get it, you can escalate to Dave&me, we have
tools to make sure people get the message). But this really should be the
end, not the start of the escalation chain :-)

Cheers, Daniel

> 
> > >>>> This is clearly going against the idea of having jobs only depend on
> > >>>> fences and nothing else which is mandatory for correct memory management.
> > >>>>
> > >>> I'm sure it's all documented and there is a design document on how
> > >>> things have to look like you can point out? Might help to get a better
> > >>> understanding on how things should be.
> > >> Yeah, that's the problematic part. We have documented this very
> > >> extensively:
> > >> https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-fences
> > >>
> > >> And both Jason and Daniel gave talks about the underlying problem and
> > > fyi:
> > > s/Jason/Faith/g
> >
> > +1. I wasn't aware of that.
> >
> > >> try to come up with patches to raise warnings when that happens, but
> > >> people still keep coming up with the same idea over and over again.
> > >>
> > > Yes, and we'll have to tell them over and over again. Nothing wrong
> > > with that. That's just part of maintaining such a big subsystem. And
> > > that's definitely not a valid reason to phrase things like above.
> > >
> > >> It's just that the technical relationship between preventing jobs from
> > >> running and with that preventing dma_fences from signaling and the core
> > >> memory management with page faults and shrinkers waiting for those
> > >> fences is absolutely not obvious.
> > >>
> > >> We had at least 10 different teams from different companies falling into
> > >> the same trap already and either the patches were rejected of hand or
> > >> had to painfully reverted or mitigated later on.
> > >>
> > > Sure, but that's just part of the job. And pointing out fundamental
> > > mistakes early on is important, but the situation won't get any better
> > > by being like that. Yes, we'll have to repeat the same words over and
> > > over again, and yes that might be annoying, but that's just how it is.
> >
> > Well I have no problem explaining people why a solution doesn't work.
> >
> > But what usually happens is that people don't realize that they need to
> > back off from a design and completely start over.
> >
> > Regards,
> > Christian.
> >
> > >
> > >> Regards,
> > >> Christian.
> > >>
> > >>>> If the hw is busy with something you need to return the fence for this
> > >>>> from the prepare_job callback so that the scheduler can be notified when
> > >>>> the hw is available again.
> > >>>>
> > >>>> Regards,
> > >>>> Christian.
> > >>>>
> > >>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> > >>>>> ---
> > >>>>>     drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> > >>>>>     include/drm/gpu_scheduler.h            |  8 ++++++++
> > >>>>>     2 files changed, 18 insertions(+)
> > >>>>>
> > >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > >>>>> index 4e6ad6e122bc..5c0add2c7546 100644
> > >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> > >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > >>>>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> > >>>>>                 if (!entity)
> > >>>>>                         continue;
> > >>>>>
> > >>>>> +             if (sched->ops->can_run_job) {
> > >>>>> +                     sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > >>>>> +                     if (!sched_job) {
> > >>>>> +                             complete_all(&entity->entity_idle);
> > >>>>> +                             continue;
> > >>>>> +                     }
> > >>>>> +                     if (!sched->ops->can_run_job(sched_job))
> > >>>>> +                             continue;
> > >>>>> +             }
> > >>>>> +
> > >>>>>                 sched_job = drm_sched_entity_pop_job(entity);
> > >>>>>
> > >>>>>                 if (!sched_job) {
> > >>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > >>>>> index 9db9e5e504ee..bd89ea9507b9 100644
> > >>>>> --- a/include/drm/gpu_scheduler.h
> > >>>>> +++ b/include/drm/gpu_scheduler.h
> > >>>>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> > >>>>>         struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> > >>>>>                                          struct drm_sched_entity *s_entity);
> > >>>>>
> > >>>>> +     /**
> > >>>>> +      * @can_run_job: Called before job execution to check whether the
> > >>>>> +      * hardware is free enough to run the job.  This can be used to
> > >>>>> +      * implement more complex hardware resource policies than the
> > >>>>> +      * hw_submission limit.
> > >>>>> +      */
> > >>>>> +     bool (*can_run_job)(struct drm_sched_job *sched_job);
> > >>>>> +
> > >>>>>         /**
> > >>>>>              * @run_job: Called to execute the job once all of the dependencies
> > >>>>>              * have been resolved.  This may be called multiple times, if
> > >>>>>
> >
>
  
Daniel Vetter April 5, 2023, 1:40 p.m. UTC | #23
On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
> Some hardware may require more complex resource utilization accounting
> than the simple job count supported by drm_sched internally. Add a
> can_run_job callback to allow drivers to implement more logic before
> deciding whether to run a GPU job.
> 
> Signed-off-by: Asahi Lina <lina@asahilina.net>

Ok scheduler rules, or trying to summarize the entire discussion:

dma_fence rules are very tricky. The two main chapters in the docs are

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#dma-fence-cross-driver-contract
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences

Unfortunately I don't think it's possible to check this at compile time,
thus far all we can do is validate at runtime. I've posted two patches for
this:

https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@ffwll.ch/
https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@ffwll.ch/

Unfortunately most drivers are buggy and get this completely wrong, so
realistically we'd need to make this a per-driver opt-out and annotate all
current drivers. Well except amdgpu is correct by now I think (they'd
still need to test that). And Rob Clark is working on patches to fix up
msm.

I think best here is if you work together with Rob to make sure these
annotations are mandatory for any rust drivers (I don't want new buggy
drivers at least). Would also be great to improve the kerneldoc for all
the driver hooks to explain these restrictions and link to the relevant
kerneldocs (there's also one for the dma_fence signalling annotations
which might be worth linking too).

I don't see any way to make this explicit in rust types, it's really only
something runtime tests (using lockdep) can catch. Somewhat disappointing.
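
For reference, a rough sketch of what wrapping a fence signalling
critical section in those annotations looks like (my_hw_submit() is just
a placeholder for the actual submission path):

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	bool fence_cookie = dma_fence_begin_signalling();
	struct dma_fence *hw_fence;

	/* Everything in here is on the fence signalling critical path:
	 * no blocking memory allocations that can recurse into reclaim,
	 * no waiting on other dma_fences, or lockdep will complain. */
	hw_fence = my_hw_submit(to_my_job(sched_job));

	dma_fence_end_signalling(fence_cookie);

	return hw_fence;
}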

For the other things discussed here:

- Option<Dma_Fence> as the return value for ->prepare_job makes sense to
  me.

- I don't see any way a driver can use ->can_run_job without breaking the
  above rules, that really doesn't sound like a good idea to me.

Cheers, Daniel

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>  include/drm/gpu_scheduler.h            |  8 ++++++++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4e6ad6e122bc..5c0add2c7546 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>  		if (!entity)
>  			continue;
>  
> +		if (sched->ops->can_run_job) {
> +			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> +			if (!sched_job) {
> +				complete_all(&entity->entity_idle);
> +				continue;
> +			}
> +			if (!sched->ops->can_run_job(sched_job))
> +				continue;
> +		}
> +
>  		sched_job = drm_sched_entity_pop_job(entity);
>  
>  		if (!sched_job) {
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 9db9e5e504ee..bd89ea9507b9 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>  	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>  					 struct drm_sched_entity *s_entity);
>  
> +	/**
> +	 * @can_run_job: Called before job execution to check whether the
> +	 * hardware is free enough to run the job.  This can be used to
> +	 * implement more complex hardware resource policies than the
> +	 * hw_submission limit.
> +	 */
> +	bool (*can_run_job)(struct drm_sched_job *sched_job);
> +
>  	/**
>           * @run_job: Called to execute the job once all of the dependencies
>           * have been resolved.  This may be called multiple times, if
> 
> -- 
> 2.35.1
>
  
Christian König April 5, 2023, 2:14 p.m. UTC | #24
Am 05.04.23 um 15:40 schrieb Daniel Vetter:
> On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
>> Some hardware may require more complex resource utilization accounting
>> than the simple job count supported by drm_sched internally. Add a
>> can_run_job callback to allow drivers to implement more logic before
>> deciding whether to run a GPU job.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> Ok scheduler rules, or trying to summarize the entire discussion:
>
> dma_fence rules are very tricky. The two main chapters in the docs are
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#dma-fence-cross-driver-contract
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
>
> Unfortunately I don't think it's possible to check this at compile time,
> thus far all we can do is validate at runtime. I've posted two patches for
> this:
>
> https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@ffwll.ch/
> https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@ffwll.ch/
>
> Unfortunately most drivers are buggy and get this completely wrong, so
> realistically we'd need to make this a per-driver opt-out and annotate all
> current drivers. Well except amdgpu is correct by now I think (they'd
> still need to test that).

There is still one potential memory allocation in the run_job callback 
in amdgpu which I wasn't able to fix yet.

But that one is purely academic and could potentially be trivially 
replaced with using GFP_ATOMIC if we ever have to.

Christian.

>   And Rob Clark is working on patches to fix up
> msm.
>
> I think best here is if you work together with Rob to make sure these
> annotations are mandatory for any rust drivers (I don't want new buggy
> drivers at least). Would also be great to improve the kerneldoc for all
> the driver hooks to explain these restrictions and link to the relevant
> kerneldocs (there's also one for the dma_fence signalling annotations
> which might be worth linking too).
>
> I don't see any way to make this explicit in rust types, it's really only
> something runtime tests (using lockdep) can catch. Somewhat disappointing.
>
> For the other things discussed here:
>
> - Option<Dma_Fence> as the return value for ->prepare_job makes sense to
>    me.
>
> - I don't see any way a driver can use ->can_run_job without breaking the
>    above rules, that really doesn't sound like a good idea to me.
>
> Cheers, Daniel
>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
>>   include/drm/gpu_scheduler.h            |  8 ++++++++
>>   2 files changed, 18 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 4e6ad6e122bc..5c0add2c7546 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
>>   		if (!entity)
>>   			continue;
>>   
>> +		if (sched->ops->can_run_job) {
>> +			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>> +			if (!sched_job) {
>> +				complete_all(&entity->entity_idle);
>> +				continue;
>> +			}
>> +			if (!sched->ops->can_run_job(sched_job))
>> +				continue;
>> +		}
>> +
>>   		sched_job = drm_sched_entity_pop_job(entity);
>>   
>>   		if (!sched_job) {
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 9db9e5e504ee..bd89ea9507b9 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
>>   	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
>>   					 struct drm_sched_entity *s_entity);
>>   
>> +	/**
>> +	 * @can_run_job: Called before job execution to check whether the
>> +	 * hardware is free enough to run the job.  This can be used to
>> +	 * implement more complex hardware resource policies than the
>> +	 * hw_submission limit.
>> +	 */
>> +	bool (*can_run_job)(struct drm_sched_job *sched_job);
>> +
>>   	/**
>>            * @run_job: Called to execute the job once all of the dependencies
>>            * have been resolved.  This may be called multiple times, if
>>
>> -- 
>> 2.35.1
>>
  
Daniel Vetter April 5, 2023, 2:21 p.m. UTC | #25
On Wed, Apr 05, 2023 at 04:14:11PM +0200, Christian König wrote:
> Am 05.04.23 um 15:40 schrieb Daniel Vetter:
> > On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
> > > Some hardware may require more complex resource utilization accounting
> > > than the simple job count supported by drm_sched internally. Add a
> > > can_run_job callback to allow drivers to implement more logic before
> > > deciding whether to run a GPU job.
> > > 
> > > Signed-off-by: Asahi Lina <lina@asahilina.net>
> > Ok scheduler rules, or trying to summarize the entire discussion:
> > 
> > dma_fence rules are very tricky. The two main chapters in the docs are
> > 
> > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#dma-fence-cross-driver-contract
> > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_buf#indefinite-dma-fences
> > 
> > Unfortunately I don't think it's possible to check this at compile time,
> > thus far all we can do is validate at runtime. I've posted two patches for
> > this:
> > 
> > https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@ffwll.ch/
> > https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@ffwll.ch/
> > 
> > Unfortunately most drivers are buggy and get this completely wrong, so
> > realistically we'd need to make this a per-driver opt-out and annotate all
> > current drivers. Well except amdgpu is correct by now I think (they'd
> > still need to test that).
> 
> There is still one potential memory allocation in the run_job callback in
> amdgpu which I wasn't able to fix yet.
> 
> But that one is purely academic and could potentially be trivially replaced
> with using GFP_ATOMIC if we ever have to.

I think the modeset in the tdr code was more scary, and I'm not sure you
really managed to get rid of absolutely everything in there yet.
-Daniel

> 
> Christian.
> 
> >   And Rob Clark is working on patches to fix up
> > msm.
> > 
> > I think best here is if you work together with Rob to make sure these
> > annotations are mandatory for any rust drivers (I don't want new buggy
> > drivers at least). Would also be great to improve the kerneldoc for all
> > the driver hooks to explain these restrictions and link to the relevant
> > kerneldocs (there's also one for the dma_fence signalling annotations
> > which might be worth linking too).
> > 
> > I don't see any way to make this explicit in rust types, it's really only
> > something runtime tests (using lockdep) can catch. Somewhat disappointing.
> > 
> > For the other things discussed here:
> > 
> > - Option<Dma_Fence> as the return value for ->prepare_job makes sense to
> >    me.
> > 
> > - I don't see any way a driver can use ->can_run_job without breaking the
> >    above rules, that really doesn't sound like a good idea to me.
> > 
> > Cheers, Daniel
> > 
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
> > >   include/drm/gpu_scheduler.h            |  8 ++++++++
> > >   2 files changed, 18 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 4e6ad6e122bc..5c0add2c7546 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
> > >   		if (!entity)
> > >   			continue;
> > > +		if (sched->ops->can_run_job) {
> > > +			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > > +			if (!sched_job) {
> > > +				complete_all(&entity->entity_idle);
> > > +				continue;
> > > +			}
> > > +			if (!sched->ops->can_run_job(sched_job))
> > > +				continue;
> > > +		}
> > > +
> > >   		sched_job = drm_sched_entity_pop_job(entity);
> > >   		if (!sched_job) {
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 9db9e5e504ee..bd89ea9507b9 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
> > >   	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
> > >   					 struct drm_sched_entity *s_entity);
> > > +	/**
> > > +	 * @can_run_job: Called before job execution to check whether the
> > > +	 * hardware is free enough to run the job.  This can be used to
> > > +	 * implement more complex hardware resource policies than the
> > > +	 * hw_submission limit.
> > > +	 */
> > > +	bool (*can_run_job)(struct drm_sched_job *sched_job);
> > > +
> > >   	/**
> > >            * @run_job: Called to execute the job once all of the dependencies
> > >            * have been resolved.  This may be called multiple times, if
> > > 
> > > -- 
> > > 2.35.1
> > > 
>
  

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4e6ad6e122bc..5c0add2c7546 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1001,6 +1001,16 @@  static int drm_sched_main(void *param)
 		if (!entity)
 			continue;
 
+		if (sched->ops->can_run_job) {
+			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
+			if (!sched_job) {
+				complete_all(&entity->entity_idle);
+				continue;
+			}
+			if (!sched->ops->can_run_job(sched_job))
+				continue;
+		}
+
 		sched_job = drm_sched_entity_pop_job(entity);
 
 		if (!sched_job) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9db9e5e504ee..bd89ea9507b9 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -396,6 +396,14 @@  struct drm_sched_backend_ops {
 	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
 					 struct drm_sched_entity *s_entity);
 
+	/**
+	 * @can_run_job: Called before job execution to check whether the
+	 * hardware is free enough to run the job.  This can be used to
+	 * implement more complex hardware resource policies than the
+	 * hw_submission limit.
+	 */
+	bool (*can_run_job)(struct drm_sched_job *sched_job);
+
 	/**
          * @run_job: Called to execute the job once all of the dependencies
          * have been resolved.  This may be called multiple times, if