[RFC,11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

Message ID 20230307-rust-drm-v1-11-917ff5bc80a8@asahilina.net
State New
Headers
Series Rust DRM subsystem abstractions (& preview AGX driver) |

Commit Message

Asahi Lina March 7, 2023, 2:25 p.m. UTC
  drm_sched_fini() currently leaves any pending jobs dangling, which
causes segfaults and other badness when job completion fences are
signaled after the scheduler is torn down.

Explicitly detach all jobs from their completion callbacks and free
them. This makes it possible to write a sensible safe abstraction for
drm_sched, without having to externally duplicate the tracking of
in-flight jobs.

This shouldn't regress any existing drivers, since calling
drm_sched_fini() with any pending jobs is broken and this change should
be a no-op if there are no pending jobs.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)
  

Comments

Maarten Lankhorst March 8, 2023, 9:57 a.m. UTC | #1
On 2023-03-07 15:25, Asahi Lina wrote:
> drm_sched_fini() currently leaves any pending jobs dangling, which
> causes segfaults and other badness when job completion fences are
> signaled after the scheduler is torn down.
>
> Explicitly detach all jobs from their completion callbacks and free
> them. This makes it possible to write a sensible safe abstraction for
> drm_sched, without having to externally duplicate the tracking of
> in-flight jobs.
>
> This shouldn't regress any existing drivers, since calling
> drm_sched_fini() with any pending jobs is broken and this change should
> be a no-op if there are no pending jobs.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
>   1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5c0add2c7546..0aab1e0aebdd 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>   {
>   	struct drm_sched_entity *s_entity;
> +	struct drm_sched_job *s_job, *tmp;
>   	int i;
>   
> -	if (sched->thread)
> -		kthread_stop(sched->thread);
> +	if (!sched->thread)
> +		return;
> +
> +	/*
> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
> +	 * and cleaning up complete jobs.
> +	 */
> +	drm_sched_stop(sched, NULL);
> +
> +	/*
> +	 * Iterate through the pending job list and free all jobs.
> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
> +	 * otherwise it is responsible for keeping any necessary data structures for
> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
> +	 * putting them in its own queue or doing its own refcounting).
> +	 */
> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
> +		spin_lock(&sched->job_list_lock);
> +		list_del_init(&s_job->list);
> +		spin_unlock(&sched->job_list_lock);
> +		sched->ops->free_job(s_job);
> +	}

I would stop the kthread first, then delete all jobs without spinlock 
since nothing else can race against sched_fini?

If you do need the spinlock, It would need to guard list_for_each_entry too.

> +
> +	kthread_stop(sched->thread);
>   
>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
>
  
Christian König March 8, 2023, 10:03 a.m. UTC | #2
Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>
> On 2023-03-07 15:25, Asahi Lina wrote:
>> drm_sched_fini() currently leaves any pending jobs dangling, which
>> causes segfaults and other badness when job completion fences are
>> signaled after the scheduler is torn down.
>>
>> Explicitly detach all jobs from their completion callbacks and free
>> them. This makes it possible to write a sensible safe abstraction for
>> drm_sched, without having to externally duplicate the tracking of
>> in-flight jobs.
>>
>> This shouldn't regress any existing drivers, since calling
>> drm_sched_fini() with any pending jobs is broken and this change should
>> be a no-op if there are no pending jobs.
>>
>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 27 
>> +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 5c0add2c7546..0aab1e0aebdd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>>       struct drm_sched_entity *s_entity;
>> +    struct drm_sched_job *s_job, *tmp;
>>       int i;
>>   -    if (sched->thread)
>> -        kthread_stop(sched->thread);
>> +    if (!sched->thread)
>> +        return;
>> +
>> +    /*
>> +     * Stop the scheduler, detaching all jobs from their hardware 
>> callbacks
>> +     * and cleaning up complete jobs.
>> +     */
>> +    drm_sched_stop(sched, NULL);
>> +
>> +    /*
>> +     * Iterate through the pending job list and free all jobs.
>> +     * This assumes the driver has either guaranteed jobs are 
>> already stopped, or that
>> +     * otherwise it is responsible for keeping any necessary data 
>> structures for
>> +     * in-progress jobs alive even when the free_job() callback is 
>> called early (e.g. by
>> +     * putting them in its own queue or doing its own refcounting).
>> +     */
>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>> +        spin_lock(&sched->job_list_lock);
>> +        list_del_init(&s_job->list);
>> +        spin_unlock(&sched->job_list_lock);
>> +        sched->ops->free_job(s_job);
>> +    }
>
> I would stop the kthread first, then delete all jobs without spinlock 
> since nothing else can race against sched_fini?
>
> If you do need the spinlock, It would need to guard 
> list_for_each_entry too.

Well this case here actually should not happen in the first place.

Jobs depend on their device, so as long as there are jobs there should 
also be a reference to the scheduler.

What could be is that you have allocated a scheduler instance 
dynamically, but even then you should first tear down all entities and 
then the scheduler.

Regards,
Christian.

>
>> +
>> +    kthread_stop(sched->thread);
>>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>> DRM_SCHED_PRIORITY_MIN; i--) {
>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>
  
Asahi Lina March 8, 2023, 3:18 p.m. UTC | #3
On 08/03/2023 19.03, Christian König wrote:
> Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>>
>> On 2023-03-07 15:25, Asahi Lina wrote:
>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>> causes segfaults and other badness when job completion fences are
>>> signaled after the scheduler is torn down.
>>>
>>> Explicitly detach all jobs from their completion callbacks and free
>>> them. This makes it possible to write a sensible safe abstraction for
>>> drm_sched, without having to externally duplicate the tracking of
>>> in-flight jobs.
>>>
>>> This shouldn't regress any existing drivers, since calling
>>> drm_sched_fini() with any pending jobs is broken and this change should
>>> be a no-op if there are no pending jobs.
>>>
>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 27 
>>> +++++++++++++++++++++++++--
>>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 5c0add2c7546..0aab1e0aebdd 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>   {
>>>       struct drm_sched_entity *s_entity;
>>> +    struct drm_sched_job *s_job, *tmp;
>>>       int i;
>>>   -    if (sched->thread)
>>> -        kthread_stop(sched->thread);
>>> +    if (!sched->thread)
>>> +        return;
>>> +
>>> +    /*
>>> +     * Stop the scheduler, detaching all jobs from their hardware 
>>> callbacks
>>> +     * and cleaning up complete jobs.
>>> +     */
>>> +    drm_sched_stop(sched, NULL);
>>> +
>>> +    /*
>>> +     * Iterate through the pending job list and free all jobs.
>>> +     * This assumes the driver has either guaranteed jobs are 
>>> already stopped, or that
>>> +     * otherwise it is responsible for keeping any necessary data 
>>> structures for
>>> +     * in-progress jobs alive even when the free_job() callback is 
>>> called early (e.g. by
>>> +     * putting them in its own queue or doing its own refcounting).
>>> +     */
>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>> +        spin_lock(&sched->job_list_lock);
>>> +        list_del_init(&s_job->list);
>>> +        spin_unlock(&sched->job_list_lock);
>>> +        sched->ops->free_job(s_job);
>>> +    }
>>
>> I would stop the kthread first, then delete all jobs without spinlock 
>> since nothing else can race against sched_fini?
>>
>> If you do need the spinlock, It would need to guard 
>> list_for_each_entry too.
> 
> Well this case here actually should not happen in the first place.

"This should not happen in the first place" is how you end up with C
APIs that have corner cases that lead to kernel oopses...

The idea with Rust abstractions is that it needs to be actually
impossible to create memory safety problems for the user of the
abstraction, you can't impose arbitrary constraints like "you must wait
for all jobs to finish before destroying the scheduler"... it needs to
be intrinsically safe.

> Jobs depend on their device, so as long as there are jobs there should 
> also be a reference to the scheduler.

These schedulers are created dynamically per userspace queue. The memory
management and reference counting involved make it safe to destroy the
scheduler even when behind the scenes hardware jobs are still running,
as long as drm_sched itself doesn't crash on fences firing without a
scheduler (which is what this patch fixes).

This is the power of Rust: it forces you to architect your code in a way
that you don't have complex high-level dependencies that span the entire
driver and are difficult to prove hold. In my driver, you can kill a
process and that destroys the drm_sched, closes all GEM objects,
everything, even if the GPU is still running jobs from that process. The
worst that can happen is that the GPU faults as in-use userspace buffers
are unmapped out from under the running user job, but that's fine (GPU
faults are recoverable). The actual firmware resources, queues, etc. in
use are all kept alive until the commands finish executing (or fault,
which is just an abnormal completion), even if the userspace process
that owned them is long gone. I've tested this extensively by doing
things like large-resolution glmark runs in a loop that get `kill -9`'d
repeatedly, and it works very well! Tons of GPU faults but no firmware
crashes, no oopses, nothing. And the firmware *will* crash irrecoverably
if anything goes wrong with its shared memory structures, so that it
doesn't is pretty good evidence that all this works!

> What could be is that you have allocated a scheduler instance 
> dynamically, but even then you should first tear down all entities and 
> then the scheduler.

This is about creating a safe Rust abstraction, so we can't impose
requirements on users like that, the abstraction has to take care of it.
Unfortunately, the jobs cannot depend on the scheduler at the
abstraction level. I tried that (putting a reference counted reference
to the scheduler in the job abstraction), but it doesn't work because a
job completing can end up dropping the last reference to the scheduler,
and then you end up trying to stop and clean up the scheduler from a
callback called from the scheduler kthread itself, which deadlocks. We
could throw those cleanups into a workqueue or something, but that's
just adding bandages around the problem that the drm_sched interface
today is just not safe without this patch...

Right now, it is not possible to create a safe Rust abstraction for
drm_sched without doing something like duplicating all job tracking in
the abstraction, or the above backreference + deferred cleanup mess, or
something equally silly. So let's just fix the C side please ^^

So far, drm_sched is the only DRM API that has had such a fundamental
API safety issue that I had to make a change like this to the C to make
the Rust abstraction possible/reasonable... drm_sched has also been by
far the hardest DRM component API to understand from a safety point of
view, with the most inconsistent documentation about what the
ownership/freeing rules are, and what objects need to outlive what other
objects (I had to just read the code to figure most of this out). That's
also one nice outcome of writing Rust abstractions: it forces us to make
all these rules and invariants explicit, instead of leaving them as
unwritten assumptions (almost nobody consistently documents this in C
APIs...).

If I got it right, anyone using the Rust drm_sched abstraction doesn't
have to worry about this any more because if they do something that
would oops with it, their code won't compile. But I need this patch to
be able to make that guarantee...

~~ Lina
  
Christian König March 8, 2023, 3:42 p.m. UTC | #4
Am 08.03.23 um 16:18 schrieb Asahi Lina:
> On 08/03/2023 19.03, Christian König wrote:
>> Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>>> On 2023-03-07 15:25, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>>
>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_main.c | 27
>>>> +++++++++++++++++++++++++--
>>>>    1 file changed, 25 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 5c0add2c7546..0aab1e0aebdd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>    {
>>>>        struct drm_sched_entity *s_entity;
>>>> +    struct drm_sched_job *s_job, *tmp;
>>>>        int i;
>>>>    -    if (sched->thread)
>>>> -        kthread_stop(sched->thread);
>>>> +    if (!sched->thread)
>>>> +        return;
>>>> +
>>>> +    /*
>>>> +     * Stop the scheduler, detaching all jobs from their hardware
>>>> callbacks
>>>> +     * and cleaning up complete jobs.
>>>> +     */
>>>> +    drm_sched_stop(sched, NULL);
>>>> +
>>>> +    /*
>>>> +     * Iterate through the pending job list and free all jobs.
>>>> +     * This assumes the driver has either guaranteed jobs are
>>>> already stopped, or that
>>>> +     * otherwise it is responsible for keeping any necessary data
>>>> structures for
>>>> +     * in-progress jobs alive even when the free_job() callback is
>>>> called early (e.g. by
>>>> +     * putting them in its own queue or doing its own refcounting).
>>>> +     */
>>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>> +        spin_lock(&sched->job_list_lock);
>>>> +        list_del_init(&s_job->list);
>>>> +        spin_unlock(&sched->job_list_lock);
>>>> +        sched->ops->free_job(s_job);
>>>> +    }
>>> I would stop the kthread first, then delete all jobs without spinlock
>>> since nothing else can race against sched_fini?
>>>
>>> If you do need the spinlock, It would need to guard
>>> list_for_each_entry too.
>> Well this case here actually should not happen in the first place.
> "This should not happen in the first place" is how you end up with C
> APIs that have corner cases that lead to kernel oopses...
>
> The idea with Rust abstractions is that it needs to be actually
> impossible to create memory safety problems for the user of the
> abstraction, you can't impose arbitrary constraints like "you must wait
> for all jobs to finish before destroying the scheduler"... it needs to
> be intrinsically safe.
>
>> Jobs depend on their device, so as long as there are jobs there should
>> also be a reference to the scheduler.
> These schedulers are created dynamically per userspace queue. The memory
> management and reference counting involved make it safe to destroy the
> scheduler even when behind the scenes hardware jobs are still running,
> as long as drm_sched itself doesn't crash on fences firing without a
> scheduler (which is what this patch fixes).

We have originally rejected that approach, but I still think it might 
work if done right.

> This is the power of Rust: it forces you to architect your code in a way
> that you don't have complex high-level dependencies that span the entire
> driver and are difficult to prove hold. In my driver, you can kill a
> process and that destroys the drm_sched, closes all GEM objects,
> everything, even if the GPU is still running jobs from that process. The
> worst that can happen is that the GPU faults as in-use userspace buffers
> are unmapped out from under the running user job, but that's fine (GPU
> faults are recoverable). The actual firmware resources, queues, etc. in
> use are all kept alive until the commands finish executing (or fault,
> which is just an abnormal completion), even if the userspace process
> that owned them is long gone. I've tested this extensively by doing
> things like large-resolution glmark runs in a loop that get `kill -9`'d
> repeatedly, and it works very well! Tons of GPU faults but no firmware
> crashes, no oopses, nothing. And the firmware *will* crash irrecoverably
> if anything goes wrong with its shared memory structures, so that it
> doesn't is pretty good evidence that all this works!

Well testing is no prove at all of a correct design.

>> What could be is that you have allocated a scheduler instance
>> dynamically, but even then you should first tear down all entities and
>> then the scheduler.
> This is about creating a safe Rust abstraction, so we can't impose
> requirements on users like that, the abstraction has to take care of it.
> Unfortunately, the jobs cannot depend on the scheduler at the
> abstraction level. I tried that (putting a reference counted reference
> to the scheduler in the job abstraction), but it doesn't work because a
> job completing can end up dropping the last reference to the scheduler,
> and then you end up trying to stop and clean up the scheduler from a
> callback called from the scheduler kthread itself, which deadlocks. We
> could throw those cleanups into a workqueue or something, but that's
> just adding bandages around the problem that the drm_sched interface
> today is just not safe without this patch...

Well that won't work like this. The scheduler has a pretty clear tear 
down procedure.

And that procedure implies that all entities which might provide jobs 
are destroyed before the scheduler is destroyed.

Destroying the entities in turn cleans up the pending jobs inside of 
them. We could add a warning when users of this API doesn't do this 
correctly, but cleaning up incorrect API use is clearly something we 
don't want here.

> Right now, it is not possible to create a safe Rust abstraction for
> drm_sched without doing something like duplicating all job tracking in
> the abstraction, or the above backreference + deferred cleanup mess, or
> something equally silly. So let's just fix the C side please ^^

Nope, as far as I can see this is just not correctly tearing down the 
objects in the right order.

So you are trying to do something which is not supposed to work in the 
first place.

Regards,
Christian.

>
> So far, drm_sched is the only DRM API that has had such a fundamental
> API safety issue that I had to make a change like this to the C to make
> the Rust abstraction possible/reasonable... drm_sched has also been by
> far the hardest DRM component API to understand from a safety point of
> view, with the most inconsistent documentation about what the
> ownership/freeing rules are, and what objects need to outlive what other
> objects (I had to just read the code to figure most of this out). That's
> also one nice outcome of writing Rust abstractions: it forces us to make
> all these rules and invariants explicit, instead of leaving them as
> unwritten assumptions (almost nobody consistently documents this in C
> APIs...).
>
> If I got it right, anyone using the Rust drm_sched abstraction doesn't
> have to worry about this any more because if they do something that
> would oops with it, their code won't compile. But I need this patch to
> be able to make that guarantee...
>
> ~~ Lina
  
Asahi Lina March 8, 2023, 5:32 p.m. UTC | #5
On 09/03/2023 00.42, Christian König wrote:
> Am 08.03.23 um 16:18 schrieb Asahi Lina:
>> On 08/03/2023 19.03, Christian König wrote:
>>> Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>>>> On 2023-03-07 15:25, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>>
>>>>> Signed-off-by: Asahi Lina <lina@asahilina.net>
>>>>> ---
>>>>>    drivers/gpu/drm/scheduler/sched_main.c | 27
>>>>> +++++++++++++++++++++++++--
>>>>>    1 file changed, 25 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 5c0add2c7546..0aab1e0aebdd 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>>>>    void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>        struct drm_sched_entity *s_entity;
>>>>> +    struct drm_sched_job *s_job, *tmp;
>>>>>        int i;
>>>>>    -    if (sched->thread)
>>>>> -        kthread_stop(sched->thread);
>>>>> +    if (!sched->thread)
>>>>> +        return;
>>>>> +
>>>>> +    /*
>>>>> +     * Stop the scheduler, detaching all jobs from their hardware
>>>>> callbacks
>>>>> +     * and cleaning up complete jobs.
>>>>> +     */
>>>>> +    drm_sched_stop(sched, NULL);
>>>>> +
>>>>> +    /*
>>>>> +     * Iterate through the pending job list and free all jobs.
>>>>> +     * This assumes the driver has either guaranteed jobs are
>>>>> already stopped, or that
>>>>> +     * otherwise it is responsible for keeping any necessary data
>>>>> structures for
>>>>> +     * in-progress jobs alive even when the free_job() callback is
>>>>> called early (e.g. by
>>>>> +     * putting them in its own queue or doing its own refcounting).
>>>>> +     */
>>>>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>>>>> +        spin_lock(&sched->job_list_lock);
>>>>> +        list_del_init(&s_job->list);
>>>>> +        spin_unlock(&sched->job_list_lock);
>>>>> +        sched->ops->free_job(s_job);
>>>>> +    }
>>>> I would stop the kthread first, then delete all jobs without spinlock
>>>> since nothing else can race against sched_fini?
>>>>
>>>> If you do need the spinlock, It would need to guard
>>>> list_for_each_entry too.
>>> Well this case here actually should not happen in the first place.
>> "This should not happen in the first place" is how you end up with C
>> APIs that have corner cases that lead to kernel oopses...
>>
>> The idea with Rust abstractions is that it needs to be actually
>> impossible to create memory safety problems for the user of the
>> abstraction, you can't impose arbitrary constraints like "you must wait
>> for all jobs to finish before destroying the scheduler"... it needs to
>> be intrinsically safe.
>>
>>> Jobs depend on their device, so as long as there are jobs there should
>>> also be a reference to the scheduler.
>> These schedulers are created dynamically per userspace queue. The memory
>> management and reference counting involved make it safe to destroy the
>> scheduler even when behind the scenes hardware jobs are still running,
>> as long as drm_sched itself doesn't crash on fences firing without a
>> scheduler (which is what this patch fixes).
> 
> We have originally rejected that approach, but I still think it might 
> work if done right.
> 
>> This is the power of Rust: it forces you to architect your code in a way
>> that you don't have complex high-level dependencies that span the entire
>> driver and are difficult to prove hold. In my driver, you can kill a
>> process and that destroys the drm_sched, closes all GEM objects,
>> everything, even if the GPU is still running jobs from that process. The
>> worst that can happen is that the GPU faults as in-use userspace buffers
>> are unmapped out from under the running user job, but that's fine (GPU
>> faults are recoverable). The actual firmware resources, queues, etc. in
>> use are all kept alive until the commands finish executing (or fault,
>> which is just an abnormal completion), even if the userspace process
>> that owned them is long gone. I've tested this extensively by doing
>> things like large-resolution glmark runs in a loop that get `kill -9`'d
>> repeatedly, and it works very well! Tons of GPU faults but no firmware
>> crashes, no oopses, nothing. And the firmware *will* crash irrecoverably
>> if anything goes wrong with its shared memory structures, so that it
>> doesn't is pretty good evidence that all this works!
> 
> Well testing is no prove at all of a correct design.

Well, I'm guessing you don't have a formal correctness proof for amdgpu
either... ^^

There's actually no way to prove my design is correct, since this is a
reverse engineered driver that talks to proprietary firmware and I don't
have the benefit of both open and internal docs like you AMD people
have, never mind access to firmware source code... all I can do is try
to understand how it should work based on how macOS does things and
running tests, and then design something that should work with it. I
spent months writing a prototype Python driver before even starting on
the real DRM driver (long story...), and I keep going back to it to test
little details of the firmware interface. There's over 3300 lines of
just firmware structure definitions, it's kind of crazy...

But even with all that... this driver has no right to be as stable as it
is, considering I wrote it in just a few months. It hasn't even been a
year since I started working on AGX at all! As I mentioned in the cover
letter, we've gotten zero reports of oopses in production. I tried
fuzzing the UAPI and all I managed to do was crash the firmware after a
lot of GPU faults (that was a subtle firmware data cache coherency
issue, now fixed), the driver itself was fine. I didn't have to debug
the OOM error codepaths when we first started running Xonotic on 8GB RAM
machines with no texture compression support on high quality texture
settings (bad combination...), it all just worked even though all those
error/cleanup paths had never been tested before at all. The only memory
leaks I managed to cause were due to circular references between VMs and
GEM objects (tricky to avoid, I did manage to miss one special case
object in the first driver release...), everything else just cleans
itself up by design. And it's not because I'm a genius or anything like
that... it's because Rust just makes getting all this right *so* much
easier than C.

So I can at least say I'm quite confident that, as long as my
understanding of the firmware structure lifetimes is correct and I
encode it in the Rust object model the driver uses to represent them,
things will work without crashing without relying on high-level
invariants like "you must wait for all job completions before tearing
down the top-level scheduler for a user queue" ^^

>>> What could be is that you have allocated a scheduler instance
>>> dynamically, but even then you should first tear down all entities and
>>> then the scheduler.
>> This is about creating a safe Rust abstraction, so we can't impose
>> requirements on users like that, the abstraction has to take care of it.
>> Unfortunately, the jobs cannot depend on the scheduler at the
>> abstraction level. I tried that (putting a reference counted reference
>> to the scheduler in the job abstraction), but it doesn't work because a
>> job completing can end up dropping the last reference to the scheduler,
>> and then you end up trying to stop and clean up the scheduler from a
>> callback called from the scheduler kthread itself, which deadlocks. We
>> could throw those cleanups into a workqueue or something, but that's
>> just adding bandages around the problem that the drm_sched interface
>> today is just not safe without this patch...
> 
> Well that won't work like this. The scheduler has a pretty clear tear 
> down procedure.

Well... I wouldn't call it "clear". I had to reverse engineer this from
reading drm_sched source code, the docs don't tell you. The entire
documentation of "drm_sched_fini()" is as follows:

"Tears down and cleans up the scheduler."

That's it.

This is why I had so much trouble writing this abstraction, and I spent
hours reading the drm_sched code to understand how it worked in order to
use the API correctly... and yet...

> 
> And that procedure implies that all entities which might provide jobs 
> are destroyed before the scheduler is destroyed.

Yes, I do this: the entity abstraction holds a reference to the
scheduler for this reason, so the scheduler can only be destroyed once
all entities are destroyed. But...
> Destroying the entities in turn cleans up the pending jobs inside of 
> them.

Yes but... none of this cleans up jobs that are already submitted by the
scheduler and in its pending list, with registered completion callbacks,
which were already popped off of the entities.

*That* is the problem this patch fixes!

> We could add a warning when users of this API doesn't do this 
> correctly, but cleaning up incorrect API use is clearly something we 
> don't want here.

It is the job of the Rust abstractions to make incorrect API use that
leads to memory unsafety impossible. So even if you don't want that in
C, it's my job to do that for Rust... and right now, I just can't
because drm_sched doesn't provide an API that can be safely wrapped
without weird bits of babysitting functionality on top (like tracking
jobs outside or awkwardly making jobs hold a reference to the scheduler
and defer dropping it to another thread).

>> Right now, it is not possible to create a safe Rust abstraction for
>> drm_sched without doing something like duplicating all job tracking in
>> the abstraction, or the above backreference + deferred cleanup mess, or
>> something equally silly. So let's just fix the C side please ^^
> 
> Nope, as far as I can see this is just not correctly tearing down the 
> objects in the right order.

There's no API to clean up in-flight jobs in a drm_sched at all.
Destroying an entity won't do it. So there is no reasonable way to do
this at all...

> So you are trying to do something which is not supposed to work in the 
> first place.

I need to make things that aren't supposed to work impossible to do in
the first place, or at least fail gracefully instead of just oopsing
like drm_sched does today...

If you're convinced there's a way to do this, can you tell me exactly
what code sequence I need to run to safely shut down a scheduler
assuming all entities are already destroyed? You can't ask me for a list
of pending jobs (the scheduler knows this, it doesn't make any sense to
duplicate that outside), and you can't ask me to just not do this until
all jobs complete execution (because then we either end up with the
messy deadlock situation I described if I take a reference, or more
duplicative in-flight job count tracking and blocking in the free path
of the Rust abstraction, which doesn't make any sense either).

~~ Lina
  
Alyssa Rosenzweig March 8, 2023, 5:39 p.m. UTC | #6
> You can't ask me for a list
> of pending jobs (the scheduler knows this, it doesn't make any sense to
> duplicate that outside)

Silly question: could you add a new exported function to drm_sched to get the list of pending jobs, to be used by the Rust abstraction internally? IDK if that makes any sense.
  
Asahi Lina March 8, 2023, 5:44 p.m. UTC | #7
On 09/03/2023 02.39, alyssa@rosenzweig.io wrote:
>> You can't ask me for a list
>> of pending jobs (the scheduler knows this, it doesn't make any sense to
>> duplicate that outside)
> 
> Silly question: could you add a new exported function to drm_sched to get the list of pending jobs, to be used by the Rust abstraction internally? IDK if that makes any sense.

The drm_sched struct is public, we could just go in there and do it
anyway... but then I need to figure out how to do
`list_for_each_entry_safe` in Rust and this all makes very little sense
when it's clearly the scheduler's job to provide some form of cleanup
function users can use to do it...

I mean, I guess I can do that if Christian is adamantly against
providing a safe C API, but it's clearly not the right solution and I
hope this is not the approach maintainers take with Rust abstractions,
because that's going to make our lives a lot harder for no good reason,
and it also means C users don't get any of the benefits of Rust
abstraction work if the APIs can't be improved at all along with it.

~~ Lina
  
Christian König March 8, 2023, 6:12 p.m. UTC | #8
Am 08.03.23 um 18:32 schrieb Asahi Lina:
> [SNIP]
> Yes but... none of this cleans up jobs that are already submitted by the
> scheduler and in its pending list, with registered completion callbacks,
> which were already popped off of the entities.
>
> *That* is the problem this patch fixes!

Ah! Yes that makes more sense now.

>> We could add a warning when users of this API doesn't do this
>> correctly, but cleaning up incorrect API use is clearly something we
>> don't want here.
> It is the job of the Rust abstractions to make incorrect API use that
> leads to memory unsafety impossible. So even if you don't want that in
> C, it's my job to do that for Rust... and right now, I just can't
> because drm_sched doesn't provide an API that can be safely wrapped
> without weird bits of babysitting functionality on top (like tracking
> jobs outside or awkwardly making jobs hold a reference to the scheduler
> and defer dropping it to another thread).

Yeah, that was discussed before but rejected.

The argument was that upper layer needs to wait for the hw to become 
idle before the scheduler can be destroyed anyway.

>>> Right now, it is not possible to create a safe Rust abstraction for
>>> drm_sched without doing something like duplicating all job tracking in
>>> the abstraction, or the above backreference + deferred cleanup mess, or
>>> something equally silly. So let's just fix the C side please ^^
>> Nope, as far as I can see this is just not correctly tearing down the
>> objects in the right order.
> There's no API to clean up in-flight jobs in a drm_sched at all.
> Destroying an entity won't do it. So there is no reasonable way to do
> this at all...

Yes, this was removed.

>> So you are trying to do something which is not supposed to work in the
>> first place.
> I need to make things that aren't supposed to work impossible to do in
> the first place, or at least fail gracefully instead of just oopsing
> like drm_sched does today...
>
> If you're convinced there's a way to do this, can you tell me exactly
> what code sequence I need to run to safely shut down a scheduler
> assuming all entities are already destroyed? You can't ask me for a list
> of pending jobs (the scheduler knows this, it doesn't make any sense to
> duplicate that outside), and you can't ask me to just not do this until
> all jobs complete execution (because then we either end up with the
> messy deadlock situation I described if I take a reference, or more
> duplicative in-flight job count tracking and blocking in the free path
> of the Rust abstraction, which doesn't make any sense either).

Good question. We don't have anybody upstream which uses the scheduler 
lifetime like this.

Essentially the job list in the scheduler is something we wanted to 
remove because it causes tons of race conditions during hw recovery.

When you tear down the firmware queue how do you handle already 
submitted jobs there?

Regards,
Christian.

>
> ~~ Lina
  
Christian König March 8, 2023, 6:13 p.m. UTC | #9
Am 08.03.23 um 18:39 schrieb alyssa@rosenzweig.io:
>> You can't ask me for a list
>> of pending jobs (the scheduler knows this, it doesn't make any sense to
>> duplicate that outside)
> Silly question: could you add a new exported function to drm_sched to get the list of pending jobs, to be used by the Rust abstraction internally? IDK if that makes any sense.

I was thinking about something similar as well. The problem is that you 
could only use this function from the scheduler thread itself, e.g. from 
one of its callback functions.

Christian.
  
Asahi Lina March 8, 2023, 7:37 p.m. UTC | #10
On 09/03/2023 03.12, Christian König wrote:
> Am 08.03.23 um 18:32 schrieb Asahi Lina:
>> [SNIP]
>> Yes but... none of this cleans up jobs that are already submitted by the
>> scheduler and in its pending list, with registered completion callbacks,
>> which were already popped off of the entities.
>>
>> *That* is the problem this patch fixes!
> 
> Ah! Yes that makes more sense now.
> 
>>> We could add a warning when users of this API doesn't do this
>>> correctly, but cleaning up incorrect API use is clearly something we
>>> don't want here.
>> It is the job of the Rust abstractions to make incorrect API use that
>> leads to memory unsafety impossible. So even if you don't want that in
>> C, it's my job to do that for Rust... and right now, I just can't
>> because drm_sched doesn't provide an API that can be safely wrapped
>> without weird bits of babysitting functionality on top (like tracking
>> jobs outside or awkwardly making jobs hold a reference to the scheduler
>> and defer dropping it to another thread).
> 
> Yeah, that was discussed before but rejected.
> 
> The argument was that upper layer needs to wait for the hw to become 
> idle before the scheduler can be destroyed anyway.

Unfortunately, that's not a requirement you can encode in the Rust type
system easily as far as I know, and Rust safety rules mean we need to
make it safe even if the upper layer doesn't do this... (or else we have
to mark the entire drm_sched abstraction unsafe, but that would be a pity).

I know it's a different way of thinking, but it has pretty clear
benefits since with Rust you can actually guarantee that things are safe
overall by just auditing explicitly unsafe code. If we just mark all of
drm_sched unsafe, that means we now need to audit all details about how
the driver uses it for safety. It makes more sense to just make the
abstraction safe, which is much easier to audit.

>>>> Right now, it is not possible to create a safe Rust abstraction for
>>>> drm_sched without doing something like duplicating all job tracking in
>>>> the abstraction, or the above backreference + deferred cleanup mess, or
>>>> something equally silly. So let's just fix the C side please ^^
>>> Nope, as far as I can see this is just not correctly tearing down the
>>> objects in the right order.
>> There's no API to clean up in-flight jobs in a drm_sched at all.
>> Destroying an entity won't do it. So there is no reasonable way to do
>> this at all...
> 
> Yes, this was removed.
> 
>>> So you are trying to do something which is not supposed to work in the
>>> first place.
>> I need to make things that aren't supposed to work impossible to do in
>> the first place, or at least fail gracefully instead of just oopsing
>> like drm_sched does today...
>>
>> If you're convinced there's a way to do this, can you tell me exactly
>> what code sequence I need to run to safely shut down a scheduler
>> assuming all entities are already destroyed? You can't ask me for a list
>> of pending jobs (the scheduler knows this, it doesn't make any sense to
>> duplicate that outside), and you can't ask me to just not do this until
>> all jobs complete execution (because then we either end up with the
>> messy deadlock situation I described if I take a reference, or more
>> duplicative in-flight job count tracking and blocking in the free path
>> of the Rust abstraction, which doesn't make any sense either).
> 
> Good question. We don't have anybody upstream which uses the scheduler 
> lifetime like this.
> 
> Essentially the job list in the scheduler is something we wanted to 
> remove because it causes tons of race conditions during hw recovery.
> 
> When you tear down the firmware queue how do you handle already 
> submitted jobs there?

The firmware queue is itself reference counted and any firmware queue
that has acquired an event notification resource (that is, which is busy
with running or upcoming jobs) hands off a reference to itself into the
event subsystem, so it can get notified of job completions by the
firmware. Then once it becomes idle it unregisters itself, and at that
point if it has no owning userspace queue, that would be the last
reference and it gets dropped. So we don't tear down firmware queues
until they are idle.

(There is a subtle deadlock break in the event module to make this work
out, where we clone a reference to the queue and drop the event
subsystem lock before signaling it of completions, so it can call back
in and take the lock as it unregisters itself if needed. Then the actual
teardown happens when the signaling is complete and that reference clone
is the last one to get dropped.)

If a queue is idle at the firmware level but has upcoming jobs queued in
drm_sched, when those get deleted as part of an explicit drm_sched
teardown (free_job()) the queue notices it lost its upcoming jobs and
relinquishes the event resource if there are no running jobs. I'm not
even sure exactly what order this all happens in in practice (it depends
on structure field order in Rust!), but it doesn't really matter because
either way everything gets cleaned up one way or another.

I actually don't know of any way to actively abort jobs on the firmware,
so this is pretty much the only option I have. I've even seen
long-running compute jobs on macOS run to completion even if you kill
the submitting process, so there might be no way to do this at all.
Though in practice since we unmap everything from the VM anyway when the
userspace stuff gets torn down, almost any normal GPU work is going to
immediately fault at that point (macOS doesn't do this because macOS
effectively does implicit sync with BO tracking at the kernel level...).

By the way, I don't really use the hardware recovery stuff right now.
I'm not even sure if there is a sensible way I could use it, since as I
said we can't exactly abort jobs. I know there are ways to lock up the
firmware/GPU, but so far those have all been things the kernel driver
can prevent, and I'm not even sure if there is any way to recover from
that anyway. The firmware itself has its own timeouts and recovery for
"normal" problems. From the point of view of the driver and everything
above it, in-flight commands during a GPU fault or timeout are just
marked complete by the firmware, after a firmware recovery cycle where
the driver gets notified of the problem (that's when we mark the
commands failed so we can propagate the error). There is no
re-submission or anything, userspace just gets told of the problem but
the queue survives. In the future it might be possible to re-submit
innocent commands (it is possible for a GPU fault to break another
process running concurrently, and this is a problem macOS has too...),
which is still not perfect due to side effects but might work most of
the time, but that depends on the "command patching" stuff I mentioned,
and I'm still not even sure if it will be possible to do safely. There's
a lot of subtlety around what we can and can't do during a firmware
recovery cycle that I haven't even started to investigate yet (the
answer could be "nothing" even).

~~ Lina
  
Christian König March 9, 2023, 8:42 a.m. UTC | #11
Am 08.03.23 um 20:37 schrieb Asahi Lina:
> On 09/03/2023 03.12, Christian König wrote:
>> Am 08.03.23 um 18:32 schrieb Asahi Lina:
>>> [SNIP]
>>> Yes but... none of this cleans up jobs that are already submitted by the
>>> scheduler and in its pending list, with registered completion callbacks,
>>> which were already popped off of the entities.
>>>
>>> *That* is the problem this patch fixes!
>> Ah! Yes that makes more sense now.
>>
>>>> We could add a warning when users of this API doesn't do this
>>>> correctly, but cleaning up incorrect API use is clearly something we
>>>> don't want here.
>>> It is the job of the Rust abstractions to make incorrect API use that
>>> leads to memory unsafety impossible. So even if you don't want that in
>>> C, it's my job to do that for Rust... and right now, I just can't
>>> because drm_sched doesn't provide an API that can be safely wrapped
>>> without weird bits of babysitting functionality on top (like tracking
>>> jobs outside or awkwardly making jobs hold a reference to the scheduler
>>> and defer dropping it to another thread).
>> Yeah, that was discussed before but rejected.
>>
>> The argument was that upper layer needs to wait for the hw to become
>> idle before the scheduler can be destroyed anyway.
> Unfortunately, that's not a requirement you can encode in the Rust type
> system easily as far as I know, and Rust safety rules mean we need to
> make it safe even if the upper layer doesn't do this... (or else we have
> to mark the entire drm_sched abstraction unsafe, but that would be a pity).

Yeah, that should really not be something we should do.

But you could make the scheduler depend on your fw context object, don't 
you?

Detaching the scheduler from the underlying hw fences is certainly 
possible, but we removed that functionality because some people people 
tried to force push some Windows recovery module into Linux. We are in 
the process of reverting that and cleaning things up once more, but that 
will take a while.

Instead of detaching you could also block for the hw to become idle, but 
if you do that synchronous on process termination you run into trouble 
as well.

> I know it's a different way of thinking, but it has pretty clear
> benefits since with Rust you can actually guarantee that things are safe
> overall by just auditing explicitly unsafe code. If we just mark all of
> drm_sched unsafe, that means we now need to audit all details about how
> the driver uses it for safety. It makes more sense to just make the
> abstraction safe, which is much easier to audit.

I'm pretty familiar with that approach.

>
>>>>> Right now, it is not possible to create a safe Rust abstraction for
>>>>> drm_sched without doing something like duplicating all job tracking in
>>>>> the abstraction, or the above backreference + deferred cleanup mess, or
>>>>> something equally silly. So let's just fix the C side please ^^
>>>> Nope, as far as I can see this is just not correctly tearing down the
>>>> objects in the right order.
>>> There's no API to clean up in-flight jobs in a drm_sched at all.
>>> Destroying an entity won't do it. So there is no reasonable way to do
>>> this at all...
>> Yes, this was removed.
>>
>>>> So you are trying to do something which is not supposed to work in the
>>>> first place.
>>> I need to make things that aren't supposed to work impossible to do in
>>> the first place, or at least fail gracefully instead of just oopsing
>>> like drm_sched does today...
>>>
>>> If you're convinced there's a way to do this, can you tell me exactly
>>> what code sequence I need to run to safely shut down a scheduler
>>> assuming all entities are already destroyed? You can't ask me for a list
>>> of pending jobs (the scheduler knows this, it doesn't make any sense to
>>> duplicate that outside), and you can't ask me to just not do this until
>>> all jobs complete execution (because then we either end up with the
>>> messy deadlock situation I described if I take a reference, or more
>>> duplicative in-flight job count tracking and blocking in the free path
>>> of the Rust abstraction, which doesn't make any sense either).
>> Good question. We don't have anybody upstream which uses the scheduler
>> lifetime like this.
>>
>> Essentially the job list in the scheduler is something we wanted to
>> remove because it causes tons of race conditions during hw recovery.
>>
>> When you tear down the firmware queue how do you handle already
>> submitted jobs there?
> The firmware queue is itself reference counted and any firmware queue
> that has acquired an event notification resource (that is, which is busy
> with running or upcoming jobs) hands off a reference to itself into the
> event subsystem, so it can get notified of job completions by the
> firmware. Then once it becomes idle it unregisters itself, and at that
> point if it has no owning userspace queue, that would be the last
> reference and it gets dropped. So we don't tear down firmware queues
> until they are idle.

And could those fw queue not reference the scheduler?

>
> (There is a subtle deadlock break in the event module to make this work
> out, where we clone a reference to the queue and drop the event
> subsystem lock before signaling it of completions, so it can call back
> in and take the lock as it unregisters itself if needed. Then the actual
> teardown happens when the signaling is complete and that reference clone
> is the last one to get dropped.)
>
> If a queue is idle at the firmware level but has upcoming jobs queued in
> drm_sched, when those get deleted as part of an explicit drm_sched
> teardown (free_job()) the queue notices it lost its upcoming jobs and
> relinquishes the event resource if there are no running jobs. I'm not
> even sure exactly what order this all happens in in practice (it depends
> on structure field order in Rust!), but it doesn't really matter because
> either way everything gets cleaned up one way or another.
>
> I actually don't know of any way to actively abort jobs on the firmware,
> so this is pretty much the only option I have. I've even seen
> long-running compute jobs on macOS run to completion even if you kill
> the submitting process, so there might be no way to do this at all.
> Though in practice since we unmap everything from the VM anyway when the
> userspace stuff gets torn down, almost any normal GPU work is going to
> immediately fault at that point (macOS doesn't do this because macOS
> effectively does implicit sync with BO tracking at the kernel level...).

Oh, that is an interesting information. How does macOS do explicit sync 
then or isn't that supported at all?

> By the way, I don't really use the hardware recovery stuff right now.
> I'm not even sure if there is a sensible way I could use it, since as I
> said we can't exactly abort jobs. I know there are ways to lock up the
> firmware/GPU, but so far those have all been things the kernel driver
> can prevent, and I'm not even sure if there is any way to recover from
> that anyway. The firmware itself has its own timeouts and recovery for
> "normal" problems. From the point of view of the driver and everything
> above it, in-flight commands during a GPU fault or timeout are just
> marked complete by the firmware, after a firmware recovery cycle where
> the driver gets notified of the problem (that's when we mark the
> commands failed so we can propagate the error).

Yeah, that's exactly what we are telling our fw people for years that we 
need this as well.

> There is no re-submission or anything, userspace just gets told of the problem but
> the queue survives.

> In the future it might be possible to re-submit innocent commands

Long story short: Don't do this! This is what the Windows drivers have 
been doing and it creates tons of problems.

Just signal the problem back to userspace and let the user space driver 
decide what to do.

The background is that most graphics applications (games etc..) then 
rather start on the next frame instead of submitting the current one 
again while compute applications make sure that the abort and tell the 
user that the calculations might be corrupted and need to be redone.

Regards,
Christian.

>   (it is possible for a GPU fault to break another
> process running concurrently, and this is a problem macOS has too...),
> which is still not perfect due to side effects but might work most of
> the time, but that depends on the "command patching" stuff I mentioned,
> and I'm still not even sure if it will be possible to do safely. There's
> a lot of subtlety around what we can and can't do during a firmware
> recovery cycle that I haven't even started to investigate yet (the
> answer could be "nothing" even).
>
> ~~ Lina
  
Asahi Lina March 9, 2023, 9:43 a.m. UTC | #12
On 09/03/2023 17.42, Christian König wrote:
> Am 08.03.23 um 20:37 schrieb Asahi Lina:
>> On 09/03/2023 03.12, Christian König wrote:
>>> Am 08.03.23 um 18:32 schrieb Asahi Lina:
>>>> [SNIP]
>>>> Yes but... none of this cleans up jobs that are already submitted by the
>>>> scheduler and in its pending list, with registered completion callbacks,
>>>> which were already popped off of the entities.
>>>>
>>>> *That* is the problem this patch fixes!
>>> Ah! Yes that makes more sense now.
>>>
>>>>> We could add a warning when users of this API doesn't do this
>>>>> correctly, but cleaning up incorrect API use is clearly something we
>>>>> don't want here.
>>>> It is the job of the Rust abstractions to make incorrect API use that
>>>> leads to memory unsafety impossible. So even if you don't want that in
>>>> C, it's my job to do that for Rust... and right now, I just can't
>>>> because drm_sched doesn't provide an API that can be safely wrapped
>>>> without weird bits of babysitting functionality on top (like tracking
>>>> jobs outside or awkwardly making jobs hold a reference to the scheduler
>>>> and defer dropping it to another thread).
>>> Yeah, that was discussed before but rejected.
>>>
>>> The argument was that upper layer needs to wait for the hw to become
>>> idle before the scheduler can be destroyed anyway.
>> Unfortunately, that's not a requirement you can encode in the Rust type
>> system easily as far as I know, and Rust safety rules mean we need to
>> make it safe even if the upper layer doesn't do this... (or else we have
>> to mark the entire drm_sched abstraction unsafe, but that would be a pity).
> 
> Yeah, that should really not be something we should do.
> 
> But you could make the scheduler depend on your fw context object, don't 
> you?

Yes, and that would fix the problem for this driver, but it wouldn't
make the abstraction safe. The thing is we have to make it *impossible*
to misuse drm_sched in such a way that it crashes, at the Rust
abstraction level. If we start depending on the driver following rules
like that, that means the drm_sched abstraction has to be marked unsafe.

> Detaching the scheduler from the underlying hw fences is certainly 
> possible, but we removed that functionality because some people people 
> tried to force push some Windows recovery module into Linux. We are in 
> the process of reverting that and cleaning things up once more, but that 
> will take a while.

Okay, but I don't see why that should block the Rust abstractions... I
don't even need a new API to do that, all I need is to know that
drm_sched_fini() will do it so it won't crash when the hw fences
complete later, as this patch does.

> Instead of detaching you could also block for the hw to become idle, but 
> if you do that synchronous on process termination you run into trouble 
> as well.

Yes, but again this something that can only be done at the driver level
so it doesn't solve the safe abstraction problem...

>> The firmware queue is itself reference counted and any firmware queue
>> that has acquired an event notification resource (that is, which is busy
>> with running or upcoming jobs) hands off a reference to itself into the
>> event subsystem, so it can get notified of job completions by the
>> firmware. Then once it becomes idle it unregisters itself, and at that
>> point if it has no owning userspace queue, that would be the last
>> reference and it gets dropped. So we don't tear down firmware queues
>> until they are idle.
> 
> And could those fw queue not reference the scheduler?

Yes but again, that rule can't be encoded in the abstraction... so that
makes it unsafe. The goal is to have a safe abstraction, which means
that all the rules that you need to follow to avoid memory safety issues
are checked by the Rust compiler.

>> I actually don't know of any way to actively abort jobs on the firmware,
>> so this is pretty much the only option I have. I've even seen
>> long-running compute jobs on macOS run to completion even if you kill
>> the submitting process, so there might be no way to do this at all.
>> Though in practice since we unmap everything from the VM anyway when the
>> userspace stuff gets torn down, almost any normal GPU work is going to
>> immediately fault at that point (macOS doesn't do this because macOS
>> effectively does implicit sync with BO tracking at the kernel level...).
> 
> Oh, that is an interesting information. How does macOS do explicit sync 
> then or isn't that supported at all?

They have the equivalent of sync objects at the UAPI level, but they
also have the implicit stuff and their UAPI seems to always pass a BO
list to the kernel as far as we could tell, even though it still works
without it. I think it's a weird hybrid of explicit+implicit sync. From
the Metal docs:

> By default, Metal tracks the write hazards and synchronizes the resources
> (see Resource Fundamentals) you create from an MTLDevice and directly bind
> to a pipeline. However, Metal doesn’t, by default, track resources you
> allocate from an MTLHeap (see Memory Heaps).

So it's both, and you can override it...

At the firmware level, I've never seen Metal use queue barriers yet like
I do (other than the vertex->fragment ones), so either they always do
CPU round trips for cross-subqueue sync (render<->compute) or we just
haven't figured out the magic combination to get it to do that yet.
Honestly, I suspect they just always do it on the CPU. macOS is pretty
ugly behind the scenes and it's pretty obvious a lot of their own driver
was rushed (the firmware seems to support quite a few features the
driver doesn't... maybe it even has a job abort mechanism, we just
haven't found it yet).

Of course, our goal is to do things better than macOS (and we already do
some things better!) but getting confident enough about firmware/HW
details to diverge from what macOS does is tricky and a slow process...

>> By the way, I don't really use the hardware recovery stuff right now.
>> I'm not even sure if there is a sensible way I could use it, since as I
>> said we can't exactly abort jobs. I know there are ways to lock up the
>> firmware/GPU, but so far those have all been things the kernel driver
>> can prevent, and I'm not even sure if there is any way to recover from
>> that anyway. The firmware itself has its own timeouts and recovery for
>> "normal" problems. From the point of view of the driver and everything
>> above it, in-flight commands during a GPU fault or timeout are just
>> marked complete by the firmware, after a firmware recovery cycle where
>> the driver gets notified of the problem (that's when we mark the
>> commands failed so we can propagate the error).
> 
> Yeah, that's exactly what we are telling our fw people for years that we 
> need this as well.

Yeah, the ugly bit is that the firmware does a full GPU recovery even on
simple page faults (which could be handled more gracefully) so even
stuff like that can possibly break concurrent GPU work.

On the other hand, macOS configures things so page faults are ignored
and silently return all-00 on reads for shader accesses, which is how
they implement sparse buffers/textures... and we'll probably have to do
that to improve reliability against app faults if nothing else. But
right now the driver enables explicit page faults for everything so we
can debug Mesa (it's a kernel module param, GPU global and I haven't
found a way to change it after initial load unfortunately, but it might
be possible).

I think there's also a way to do actual page fault handling (like swap
in pages and resume the GPU), but that's one of those firmware features
Apple's driver just never uses as far as I can tell. There's so much
unexplored territory...

> 
>> There is no re-submission or anything, userspace just gets told of the problem but
>> the queue survives.
> 
>> In the future it might be possible to re-submit innocent commands
> 
> Long story short: Don't do this! This is what the Windows drivers have 
> been doing and it creates tons of problems.
> 
> Just signal the problem back to userspace and let the user space driver 
> decide what to do.
> 
> The background is that most graphics applications (games etc..) then 
> rather start on the next frame instead of submitting the current one 
> again while compute applications make sure that the abort and tell the 
> user that the calculations might be corrupted and need to be redone.

Then we're good with what we're currently doing, since we already notify
userspace like that!

Actually I wanted to ask about error notifications. Right now we have an
out-of-band mechanism to provide detailed fault info to userspace which
works fine, but in principle it's optional. However, I also mark the hw
 fences as errored when a fault happens (with an errno that describes
the overall situation), but that never makes it into the drm_sched job
complete fence. I looked at the drm_sched code and I didn't see any
error propagation. Is that supposed to work, or am I supposed to
directly mark the drm_sched side fence as complete, or did I
misunderstand all this? I get the feeling maybe existing drivers just
rely on the recovery/timeout/etc paths to mark jobs as errored (since
those do it explicitly) and never need error forwarding from the hw fence?

~~ Lina
  
Christian König March 9, 2023, 11:47 a.m. UTC | #13
Am 09.03.23 um 10:43 schrieb Asahi Lina:
> On 09/03/2023 17.42, Christian König wrote:
>> Am 08.03.23 um 20:37 schrieb Asahi Lina:
>>> On 09/03/2023 03.12, Christian König wrote:
>>>> Am 08.03.23 um 18:32 schrieb Asahi Lina:
>>>>> [SNIP]
>>>>> Yes but... none of this cleans up jobs that are already submitted by the
>>>>> scheduler and in its pending list, with registered completion callbacks,
>>>>> which were already popped off of the entities.
>>>>>
>>>>> *That* is the problem this patch fixes!
>>>> Ah! Yes that makes more sense now.
>>>>
>>>>>> We could add a warning when users of this API doesn't do this
>>>>>> correctly, but cleaning up incorrect API use is clearly something we
>>>>>> don't want here.
>>>>> It is the job of the Rust abstractions to make incorrect API use that
>>>>> leads to memory unsafety impossible. So even if you don't want that in
>>>>> C, it's my job to do that for Rust... and right now, I just can't
>>>>> because drm_sched doesn't provide an API that can be safely wrapped
>>>>> without weird bits of babysitting functionality on top (like tracking
>>>>> jobs outside or awkwardly making jobs hold a reference to the scheduler
>>>>> and defer dropping it to another thread).
>>>> Yeah, that was discussed before but rejected.
>>>>
>>>> The argument was that upper layer needs to wait for the hw to become
>>>> idle before the scheduler can be destroyed anyway.
>>> Unfortunately, that's not a requirement you can encode in the Rust type
>>> system easily as far as I know, and Rust safety rules mean we need to
>>> make it safe even if the upper layer doesn't do this... (or else we have
>>> to mark the entire drm_sched abstraction unsafe, but that would be a pity).
>> Yeah, that should really not be something we should do.
>>
>> But you could make the scheduler depend on your fw context object, don't
>> you?
> Yes, and that would fix the problem for this driver, but it wouldn't
> make the abstraction safe. The thing is we have to make it *impossible*
> to misuse drm_sched in such a way that it crashes, at the Rust
> abstraction level. If we start depending on the driver following rules
> like that, that means the drm_sched abstraction has to be marked unsafe.
>
>> Detaching the scheduler from the underlying hw fences is certainly
>> possible, but we removed that functionality because some people people
>> tried to force push some Windows recovery module into Linux. We are in
>> the process of reverting that and cleaning things up once more, but that
>> will take a while.
> Okay, but I don't see why that should block the Rust abstractions...

Because even with removing the fence callback this is inherently unsafe.

You not only need to remove the callback, but also make sure that no 
parallel timeout handling is running.

This might not matter for you driver at the moment, but it's certainly 
something you need to keep in mind when you really want save handling.

Apart from that I don't have much objections to this here as long as 
Maartens comments are addressed as well.

Regards,
Christian.

> I
> don't even need a new API to do that, all I need is to know that
> drm_sched_fini() will do it so it won't crash when the hw fences
> complete later, as this patch does.
>
>> Instead of detaching you could also block for the hw to become idle, but
>> if you do that synchronous on process termination you run into trouble
>> as well.
> Yes, but again this something that can only be done at the driver level
> so it doesn't solve the safe abstraction problem...
>
>>> The firmware queue is itself reference counted and any firmware queue
>>> that has acquired an event notification resource (that is, which is busy
>>> with running or upcoming jobs) hands off a reference to itself into the
>>> event subsystem, so it can get notified of job completions by the
>>> firmware. Then once it becomes idle it unregisters itself, and at that
>>> point if it has no owning userspace queue, that would be the last
>>> reference and it gets dropped. So we don't tear down firmware queues
>>> until they are idle.
>> And could those fw queue not reference the scheduler?
> Yes but again, that rule can't be encoded in the abstraction... so that
> makes it unsafe. The goal is to have a safe abstraction, which means
> that all the rules that you need to follow to avoid memory safety issues
> are checked by the Rust compiler.
>
>>> I actually don't know of any way to actively abort jobs on the firmware,
>>> so this is pretty much the only option I have. I've even seen
>>> long-running compute jobs on macOS run to completion even if you kill
>>> the submitting process, so there might be no way to do this at all.
>>> Though in practice since we unmap everything from the VM anyway when the
>>> userspace stuff gets torn down, almost any normal GPU work is going to
>>> immediately fault at that point (macOS doesn't do this because macOS
>>> effectively does implicit sync with BO tracking at the kernel level...).
>> Oh, that is an interesting information. How does macOS do explicit sync
>> then or isn't that supported at all?
> They have the equivalent of sync objects at the UAPI level, but they
> also have the implicit stuff and their UAPI seems to always pass a BO
> list to the kernel as far as we could tell, even though it still works
> without it. I think it's a weird hybrid of explicit+implicit sync. From
> the Metal docs:
>
>> By default, Metal tracks the write hazards and synchronizes the resources
>> (see Resource Fundamentals) you create from an MTLDevice and directly bind
>> to a pipeline. However, Metal doesn’t, by default, track resources you
>> allocate from an MTLHeap (see Memory Heaps).
> So it's both, and you can override it...
>
> At the firmware level, I've never seen Metal use queue barriers yet like
> I do (other than the vertex->fragment ones), so either they always do
> CPU round trips for cross-subqueue sync (render<->compute) or we just
> haven't figured out the magic combination to get it to do that yet.
> Honestly, I suspect they just always do it on the CPU. macOS is pretty
> ugly behind the scenes and it's pretty obvious a lot of their own driver
> was rushed (the firmware seems to support quite a few features the
> driver doesn't... maybe it even has a job abort mechanism, we just
> haven't found it yet).
>
> Of course, our goal is to do things better than macOS (and we already do
> some things better!) but getting confident enough about firmware/HW
> details to diverge from what macOS does is tricky and a slow process...
>
>>> By the way, I don't really use the hardware recovery stuff right now.
>>> I'm not even sure if there is a sensible way I could use it, since as I
>>> said we can't exactly abort jobs. I know there are ways to lock up the
>>> firmware/GPU, but so far those have all been things the kernel driver
>>> can prevent, and I'm not even sure if there is any way to recover from
>>> that anyway. The firmware itself has its own timeouts and recovery for
>>> "normal" problems. From the point of view of the driver and everything
>>> above it, in-flight commands during a GPU fault or timeout are just
>>> marked complete by the firmware, after a firmware recovery cycle where
>>> the driver gets notified of the problem (that's when we mark the
>>> commands failed so we can propagate the error).
>> Yeah, that's exactly what we are telling our fw people for years that we
>> need this as well.
> Yeah, the ugly bit is that the firmware does a full GPU recovery even on
> simple page faults (which could be handled more gracefully) so even
> stuff like that can possibly break concurrent GPU work.
>
> On the other hand, macOS configures things so page faults are ignored
> and silently return all-00 on reads for shader accesses, which is how
> they implement sparse buffers/textures... and we'll probably have to do
> that to improve reliability against app faults if nothing else. But
> right now the driver enables explicit page faults for everything so we
> can debug Mesa (it's a kernel module param, GPU global and I haven't
> found a way to change it after initial load unfortunately, but it might
> be possible).
>
> I think there's also a way to do actual page fault handling (like swap
> in pages and resume the GPU), but that's one of those firmware features
> Apple's driver just never uses as far as I can tell. There's so much
> unexplored territory...
>
>>> There is no re-submission or anything, userspace just gets told of the problem but
>>> the queue survives.
>>> In the future it might be possible to re-submit innocent commands
>> Long story short: Don't do this! This is what the Windows drivers have
>> been doing and it creates tons of problems.
>>
>> Just signal the problem back to userspace and let the user space driver
>> decide what to do.
>>
>> The background is that most graphics applications (games etc..) then
>> rather start on the next frame instead of submitting the current one
>> again while compute applications make sure that the abort and tell the
>> user that the calculations might be corrupted and need to be redone.
> Then we're good with what we're currently doing, since we already notify
> userspace like that!
>
> Actually I wanted to ask about error notifications. Right now we have an
> out-of-band mechanism to provide detailed fault info to userspace which
> works fine, but in principle it's optional. However, I also mark the hw
>   fences as errored when a fault happens (with an errno that describes
> the overall situation), but that never makes it into the drm_sched job
> complete fence. I looked at the drm_sched code and I didn't see any
> error propagation. Is that supposed to work, or am I supposed to
> directly mark the drm_sched side fence as complete, or did I
> misunderstand all this? I get the feeling maybe existing drivers just
> rely on the recovery/timeout/etc paths to mark jobs as errored (since
> those do it explicitly) and never need error forwarding from the hw fence?
>
> ~~ Lina
  
Asahi Lina March 9, 2023, 1:48 p.m. UTC | #14
On 09/03/2023 20.47, Christian König wrote:
> Am 09.03.23 um 10:43 schrieb Asahi Lina:
>> On 09/03/2023 17.42, Christian König wrote:
>>> Am 08.03.23 um 20:37 schrieb Asahi Lina:
>>>> On 09/03/2023 03.12, Christian König wrote:
>>>>> Am 08.03.23 um 18:32 schrieb Asahi Lina:
>>>>>> [SNIP]
>>>>>> Yes but... none of this cleans up jobs that are already submitted by the
>>>>>> scheduler and in its pending list, with registered completion callbacks,
>>>>>> which were already popped off of the entities.
>>>>>>
>>>>>> *That* is the problem this patch fixes!
>>>>> Ah! Yes that makes more sense now.
>>>>>
>>>>>>> We could add a warning when users of this API doesn't do this
>>>>>>> correctly, but cleaning up incorrect API use is clearly something we
>>>>>>> don't want here.
>>>>>> It is the job of the Rust abstractions to make incorrect API use that
>>>>>> leads to memory unsafety impossible. So even if you don't want that in
>>>>>> C, it's my job to do that for Rust... and right now, I just can't
>>>>>> because drm_sched doesn't provide an API that can be safely wrapped
>>>>>> without weird bits of babysitting functionality on top (like tracking
>>>>>> jobs outside or awkwardly making jobs hold a reference to the scheduler
>>>>>> and defer dropping it to another thread).
>>>>> Yeah, that was discussed before but rejected.
>>>>>
>>>>> The argument was that upper layer needs to wait for the hw to become
>>>>> idle before the scheduler can be destroyed anyway.
>>>> Unfortunately, that's not a requirement you can encode in the Rust type
>>>> system easily as far as I know, and Rust safety rules mean we need to
>>>> make it safe even if the upper layer doesn't do this... (or else we have
>>>> to mark the entire drm_sched abstraction unsafe, but that would be a pity).
>>> Yeah, that should really not be something we should do.
>>>
>>> But you could make the scheduler depend on your fw context object, don't
>>> you?
>> Yes, and that would fix the problem for this driver, but it wouldn't
>> make the abstraction safe. The thing is we have to make it *impossible*
>> to misuse drm_sched in such a way that it crashes, at the Rust
>> abstraction level. If we start depending on the driver following rules
>> like that, that means the drm_sched abstraction has to be marked unsafe.
>>
>>> Detaching the scheduler from the underlying hw fences is certainly
>>> possible, but we removed that functionality because some people people
>>> tried to force push some Windows recovery module into Linux. We are in
>>> the process of reverting that and cleaning things up once more, but that
>>> will take a while.
>> Okay, but I don't see why that should block the Rust abstractions...
> 
> Because even with removing the fence callback this is inherently unsafe.
> 
> You not only need to remove the callback, but also make sure that no 
> parallel timeout handling is running.

If by that you mean that the timeout handling functions aren't being
called by the driver, then that's implied. If the scheduler is being
dropped, by definition there are no references left to call into the
scheduler directly from the Rust side. So we only need to worry about
what drm_sched itself does.

Right now the cleanup function tears down the timeout work at the end,
but it probably makes sense to do it at the start? Then if we do that
and stop the kthread, we can be really sure nothing else is accessing
the scheduler and we can clean up without taking any locks:

Roughly:

void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
    sched->ready = false; /* Should probably do this first? */
    kthread_stop(sched->thread);
    cancel_delayed_work_sync(&sched->work_tdr);

    /* Clean up the pending_list here */
}

I'm also not sure what the rest of the drm_sched_fini() function is
doing right now. It's going through all entities and removing them, and
then wakes up entities stuck in drm_sched_entity_flush()... but didn't
we just agree that the API requires users to tear down entities before
tearing down the scheduler anyway?

~~ Lina
  
Faith Ekstrand March 9, 2023, 7:59 p.m. UTC | #15
On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
> On 09/03/2023 17.42, Christian König wrote:
> > Am 08.03.23 um 20:37 schrieb Asahi Lina:
> > > On 09/03/2023 03.12, Christian König wrote:
> > > > Am 08.03.23 um 18:32 schrieb Asahi Lina:
> > > > > [SNIP]
> > > > > Yes but... none of this cleans up jobs that are already
> > > > > submitted by the
> > > > > scheduler and in its pending list, with registered completion
> > > > > callbacks,
> > > > > which were already popped off of the entities.
> > > > > 
> > > > > *That* is the problem this patch fixes!
> > > > Ah! Yes that makes more sense now.
> > > > 
> > > > > > We could add a warning when users of this API doesn't do
> > > > > > this
> > > > > > correctly, but cleaning up incorrect API use is clearly
> > > > > > something we
> > > > > > don't want here.
> > > > > It is the job of the Rust abstractions to make incorrect API
> > > > > use that
> > > > > leads to memory unsafety impossible. So even if you don't
> > > > > want that in
> > > > > C, it's my job to do that for Rust... and right now, I just
> > > > > can't
> > > > > because drm_sched doesn't provide an API that can be safely
> > > > > wrapped
> > > > > without weird bits of babysitting functionality on top (like
> > > > > tracking
> > > > > jobs outside or awkwardly making jobs hold a reference to the
> > > > > scheduler
> > > > > and defer dropping it to another thread).
> > > > Yeah, that was discussed before but rejected.
> > > > 
> > > > The argument was that upper layer needs to wait for the hw to
> > > > become
> > > > idle before the scheduler can be destroyed anyway.
> > > Unfortunately, that's not a requirement you can encode in the
> > > Rust type
> > > system easily as far as I know, and Rust safety rules mean we
> > > need to
> > > make it safe even if the upper layer doesn't do this... (or else
> > > we have
> > > to mark the entire drm_sched abstraction unsafe, but that would
> > > be a pity).
> > 
> > Yeah, that should really not be something we should do.
> > 
> > But you could make the scheduler depend on your fw context object,
> > don't 
> > you?
> 
> Yes, and that would fix the problem for this driver, but it wouldn't
> make the abstraction safe. The thing is we have to make it
> *impossible*
> to misuse drm_sched in such a way that it crashes, at the Rust
> abstraction level. If we start depending on the driver following
> rules
> like that, that means the drm_sched abstraction has to be marked
> unsafe.
> 
> > Detaching the scheduler from the underlying hw fences is certainly 
> > possible, but we removed that functionality because some people
> > people 
> > tried to force push some Windows recovery module into Linux. We are
> > in 
> > the process of reverting that and cleaning things up once more, but
> > that 
> > will take a while.
> 
> Okay, but I don't see why that should block the Rust abstractions...
> I
> don't even need a new API to do that, all I need is to know that
> drm_sched_fini() will do it so it won't crash when the hw fences
> complete later, as this patch does.
> 
> > Instead of detaching you could also block for the hw to become
> > idle, but 
> > if you do that synchronous on process termination you run into
> > trouble 
> > as well.
> 
> Yes, but again this something that can only be done at the driver
> level
> so it doesn't solve the safe abstraction problem...
> 
> > > The firmware queue is itself reference counted and any firmware
> > > queue
> > > that has acquired an event notification resource (that is, which
> > > is busy
> > > with running or upcoming jobs) hands off a reference to itself
> > > into the
> > > event subsystem, so it can get notified of job completions by the
> > > firmware. Then once it becomes idle it unregisters itself, and at
> > > that
> > > point if it has no owning userspace queue, that would be the last
> > > reference and it gets dropped. So we don't tear down firmware
> > > queues
> > > until they are idle.
> > 
> > And could those fw queue not reference the scheduler?
> 
> Yes but again, that rule can't be encoded in the abstraction... so
> that
> makes it unsafe. The goal is to have a safe abstraction, which means
> that all the rules that you need to follow to avoid memory safety
> issues
> are checked by the Rust compiler.
> 
> > > I actually don't know of any way to actively abort jobs on the
> > > firmware,
> > > so this is pretty much the only option I have. I've even seen
> > > long-running compute jobs on macOS run to completion even if you
> > > kill
> > > the submitting process, so there might be no way to do this at
> > > all.
> > > Though in practice since we unmap everything from the VM anyway
> > > when the
> > > userspace stuff gets torn down, almost any normal GPU work is
> > > going to
> > > immediately fault at that point (macOS doesn't do this because
> > > macOS
> > > effectively does implicit sync with BO tracking at the kernel
> > > level...).
> > 
> > Oh, that is an interesting information. How does macOS do explicit
> > sync 
> > then or isn't that supported at all?
> 
> They have the equivalent of sync objects at the UAPI level, but they
> also have the implicit stuff and their UAPI seems to always pass a BO
> list to the kernel as far as we could tell, even though it still
> works
> without it. I think it's a weird hybrid of explicit+implicit sync.
> From
> the Metal docs:
> 
> > By default, Metal tracks the write hazards and synchronizes the
> > resources
> > (see Resource Fundamentals) you create from an MTLDevice and
> > directly bind
> > to a pipeline. However, Metal doesn’t, by default, track resources
> > you
> > allocate from an MTLHeap (see Memory Heaps).
> 
> So it's both, and you can override it...
> 
> At the firmware level, I've never seen Metal use queue barriers yet
> like
> I do (other than the vertex->fragment ones), so either they always do
> CPU round trips for cross-subqueue sync (render<->compute) or we just
> haven't figured out the magic combination to get it to do that yet.
> Honestly, I suspect they just always do it on the CPU. macOS is
> pretty
> ugly behind the scenes and it's pretty obvious a lot of their own
> driver
> was rushed (the firmware seems to support quite a few features the
> driver doesn't... maybe it even has a job abort mechanism, we just
> haven't found it yet).
> 
> Of course, our goal is to do things better than macOS (and we already
> do
> some things better!) but getting confident enough about firmware/HW
> details to diverge from what macOS does is tricky and a slow
> process...
> 
> > > By the way, I don't really use the hardware recovery stuff right
> > > now.
> > > I'm not even sure if there is a sensible way I could use it,
> > > since as I
> > > said we can't exactly abort jobs. I know there are ways to lock
> > > up the
> > > firmware/GPU, but so far those have all been things the kernel
> > > driver
> > > can prevent, and I'm not even sure if there is any way to recover
> > > from
> > > that anyway. The firmware itself has its own timeouts and
> > > recovery for
> > > "normal" problems. From the point of view of the driver and
> > > everything
> > > above it, in-flight commands during a GPU fault or timeout are
> > > just
> > > marked complete by the firmware, after a firmware recovery cycle
> > > where
> > > the driver gets notified of the problem (that's when we mark the
> > > commands failed so we can propagate the error).
> > 
> > Yeah, that's exactly what we are telling our fw people for years
> > that we 
> > need this as well.
> 
> Yeah, the ugly bit is that the firmware does a full GPU recovery even
> on
> simple page faults (which could be handled more gracefully) so even
> stuff like that can possibly break concurrent GPU work.
> 
> On the other hand, macOS configures things so page faults are ignored
> and silently return all-00 on reads for shader accesses, which is how
> they implement sparse buffers/textures... and we'll probably have to
> do
> that to improve reliability against app faults if nothing else. But
> right now the driver enables explicit page faults for everything so
> we
> can debug Mesa (it's a kernel module param, GPU global and I haven't
> found a way to change it after initial load unfortunately, but it
> might
> be possible).
> 
> I think there's also a way to do actual page fault handling (like
> swap
> in pages and resume the GPU), but that's one of those firmware
> features
> Apple's driver just never uses as far as I can tell. There's so much
> unexplored territory...
> 
> > 
> > > There is no re-submission or anything, userspace just gets told
> > > of the problem but
> > > the queue survives.
> > 
> > > In the future it might be possible to re-submit innocent commands
> > 
> > Long story short: Don't do this! This is what the Windows drivers
> > have 
> > been doing and it creates tons of problems.

Yeah, we tried to do a bit of that in the GL days.  It was a bad idea.

> > Just signal the problem back to userspace and let the user space
> > driver 
> > decide what to do.
> > 
> > The background is that most graphics applications (games etc..)
> > then 
> > rather start on the next frame instead of submitting the current
> > one 
> > again while compute applications make sure that the abort and tell
> > the 
> > user that the calculations might be corrupted and need to be
> > redone.

The guarantee that Vulkan makes is that, if you idle the GPU and you
haven't gotten a DEVICE_LOST yet, your data is good.  If you get a
DEVICE_LOST, all bets are off.  The problem is that, no matter how fast
the error propagation may be in the kernel or userspace driver, errors
can still show up in strange ways.  An OOB buffer access could end up
modifying a shader binary which gets run 3 frames later and causes a
corruption.  Once you've faulted, you really have no idea how far back
is good or what memory is corrupted.  You have to assume that
everything mapped to the GPU VA space is potentially toast.

> Then we're good with what we're currently doing, since we already
> notify
> userspace like that!
> 
> Actually I wanted to ask about error notifications. Right now we have
> an
> out-of-band mechanism to provide detailed fault info to userspace
> which
> works fine, but in principle it's optional.

This is fine, in principal.  Because of the nature of errors, async is
fine as long as the error shows up eventually.  Faster is better, for
sure, but error latency doesn't really matter in practice.

> However, I also mark the hw
>  fences as errored when a fault happens (with an errno that describes
> the overall situation), but that never makes it into the drm_sched
> job
> complete fence. I looked at the drm_sched code and I didn't see any
> error propagation. Is that supposed to work, or am I supposed to
> directly mark the drm_sched side fence as complete, or did I
> misunderstand all this? I get the feeling maybe existing drivers just
> rely on the recovery/timeout/etc paths to mark jobs as errored (since
> those do it explicitly) and never need error forwarding from the hw
> fence?

The end behavior needs to be that all fences for all jobs submitted to
the queue get signaled.  That's needed to satisfy the finite time
guarantees of dma_fence.  Exactly how that happens (let the job run,
abort all the jobs, etc.) is an implementation detail for the driver to
decide.  If you want, you can also set a bit on the context (or queue)
to mark it as dead and start returning EIO or similar from any ioctls
trying to submit more work if you wanted.  Not required but you can.

~Faith
  
Asahi Lina March 10, 2023, 9:58 a.m. UTC | #16
On 10/03/2023 04.59, Faith Ekstrand wrote:
> On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
>> On 09/03/2023 17.42, Christian König wrote:
>>> Long story short: Don't do this! This is what the Windows drivers
>>> have 
>>> been doing and it creates tons of problems.
> 
> Yeah, we tried to do a bit of that in the GL days.  It was a bad idea.

I think I should clarify: I was proposing re-queueing innocent jobs from
innocent queues/VMs that were impacted by a fault. The reason is that we
may be able to tweak firmware state to force it to do that safely,
during the firmware recovery cycle, such that an aborted job restarts
and then subsequent jobs/commands continue as normal. We can't leave it
to userspace because if we do nothing, the affected job ends up
incomplete but then everything after it that is already queued still
runs, and that is definitely a recipe for a bigger mess if userspace
wants to seamlessly recover. The firmware recovery cycle is a
"stop-the-world" situation for the GPU (the firmware literally
busy-loops waiting for the driver to set a continue flag in memory...),
so that's the only real chance that the driver gets to make decisions
about what is going to happen next.

Of course, that only works if individual possibly concurrently running
commands are idempotent, but I think a lot of typical GPU work is? (E.g.
any render pass without side effects other than the render targets and
where the background shader does no loads, or even render passes that do
loads but where all draws are opaque, which are all things the current
Gallium driver is intimately familiar with since Crazy Tiler
Optimizations™ need that info to be provided anyway). So I was wondering
whether it'd make sense to have such an idempotency/restartable flag on
job submission, and then the driver would do its best to recover and
rerun it if it gets killed by an unrelated concurrent bad job.

Then again this all depends on an investigation into what we *can* do
during firmware recovery that hasn't happened at all yet. It might be
that it isn't safe to do anything really, or that doing things depends
on touching even deeper firmware state structs that we treat as opaque
right now and we really don't want to have to touch...

But maybe none of this is worth it in practice, it just sounded like it
could be useful maybe?

Now that I look at it, we have a lovely "what is this flag doing anyway"
bit already passed from Mesa through to the firmware we called
ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S which, now that I look at it, is
actually getting set when any attachment (any color, Z, S) is not being
cleared for that pass (so it's loaded). That could very well be an "is
not idempotent" flag... and maybe that means the firmware does this for
us already? Sounds like something to test... I might have some 16Kx16K
GLmark runs to do concurrent with an evil faulting job now ^^ (and then
that also means we need to set it when shaders have side effects and
stuff, which right now we don't).

>>> Just signal the problem back to userspace and let the user space
>>> driver 
>>> decide what to do.
>>>
>>> The background is that most graphics applications (games etc..)
>>> then 
>>> rather start on the next frame instead of submitting the current
>>> one 
>>> again while compute applications make sure that the abort and tell
>>> the 
>>> user that the calculations might be corrupted and need to be
>>> redone.
> 
> The guarantee that Vulkan makes is that, if you idle the GPU and you
> haven't gotten a DEVICE_LOST yet, your data is good.  If you get a
> DEVICE_LOST, all bets are off.  The problem is that, no matter how fast
> the error propagation may be in the kernel or userspace driver, errors
> can still show up in strange ways.  An OOB buffer access could end up
> modifying a shader binary which gets run 3 frames later and causes a
> corruption.  Once you've faulted, you really have no idea how far back
> is good or what memory is corrupted.  You have to assume that
> everything mapped to the GPU VA space is potentially toast.

Yes of course, for the actually faulting VM all bets are off after a
fault (though we can try a bit harder at least... I have a READ_ONLY BO
flag now, I should set it on the shader pools!).

>> Actually I wanted to ask about error notifications. Right now we have
>> an
>> out-of-band mechanism to provide detailed fault info to userspace
>> which
>> works fine, but in principle it's optional.
> 
> This is fine, in principal.  Because of the nature of errors, async is
> fine as long as the error shows up eventually.  Faster is better, for
> sure, but error latency doesn't really matter in practice.
> 
>> However, I also mark the hw
>>  fences as errored when a fault happens (with an errno that describes
>> the overall situation), but that never makes it into the drm_sched
>> job
>> complete fence. I looked at the drm_sched code and I didn't see any
>> error propagation. Is that supposed to work, or am I supposed to
>> directly mark the drm_sched side fence as complete, or did I
>> misunderstand all this? I get the feeling maybe existing drivers just
>> rely on the recovery/timeout/etc paths to mark jobs as errored (since
>> those do it explicitly) and never need error forwarding from the hw
>> fence?
> 
> The end behavior needs to be that all fences for all jobs submitted to
> the queue get signaled.  That's needed to satisfy the finite time
> guarantees of dma_fence.  Exactly how that happens (let the job run,
> abort all the jobs, etc.) is an implementation detail for the driver to
> decide.  If you want, you can also set a bit on the context (or queue)
> to mark it as dead and start returning EIO or similar from any ioctls
> trying to submit more work if you wanted.  Not required but you can.

Fences have an error flag though, does that get reported to userspace
somehow? I thought it did, but maybe not, or maybe only drm_sched not
propagating it is the issue?

In other words, absent my fancy stats reporting BO system, what is the
normal way that an explicit sync driver signals to userspace that the
job associated with a syncobj has failed?

(If there is no way, then I'll probably want to change the stats BO
system to be configurable, so if you ask for no stats/time info, you
only get overall job status and faults, which has less overhead.)

~~ Lina
  
Faith Ekstrand March 13, 2023, 8:11 p.m. UTC | #17
On Fri, 2023-03-10 at 18:58 +0900, Asahi Lina wrote:
> On 10/03/2023 04.59, Faith Ekstrand wrote:
> > On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
> > > On 09/03/2023 17.42, Christian König wrote:
> > > > Long story short: Don't do this! This is what the Windows
> > > > drivers
> > > > have 
> > > > been doing and it creates tons of problems.
> > 
> > Yeah, we tried to do a bit of that in the GL days.  It was a bad
> > idea.
> 
> I think I should clarify: I was proposing re-queueing innocent jobs
> from
> innocent queues/VMs that were impacted by a fault. The reason is that
> we
> may be able to tweak firmware state to force it to do that safely,
> during the firmware recovery cycle, such that an aborted job restarts
> and then subsequent jobs/commands continue as normal. We can't leave
> it
> to userspace because if we do nothing, the affected job ends up
> incomplete but then everything after it that is already queued still
> runs, and that is definitely a recipe for a bigger mess if userspace
> wants to seamlessly recover. The firmware recovery cycle is a
> "stop-the-world" situation for the GPU (the firmware literally
> busy-loops waiting for the driver to set a continue flag in
> memory...),
> so that's the only real chance that the driver gets to make decisions
> about what is going to happen next.

Ok, that makes sense.  Yes, if you have other jobs on other queues and
are able to recover everything that isn't in the faulting VM, that's a
good thing.  I wasn't sure how hang/fault recovery worked on AGX.  In
tat case, I don't think there's a dma_fence problem.  As long as you
keep recovering and killing off any faulting contexts, eventually the
good contexts should make progress and those fences should signal.

Of course, the firmware recovery cycle may be complex and need (or at
least appear to) memory allocation or similar and that's where
everything gets hairy.  Hopefully, though, if you've already got the
resources from the old context, you can re-use them after a bit of
clean-up work and still get deterministic and reliable recovery cycles.

> Of course, that only works if individual possibly concurrently
> running
> commands are idempotent, but I think a lot of typical GPU work is?

No, that's not a valid assumption.  For a single 3D render pass which
doesn't do any image or SSBO access, it may be possible to re-run it. 
However, that won't be true of compute work and isn't necessarily true
of back-to-back passes. Lots of modern apps do temporal stuff where one
frame depends on the previous and a re-run might screw that up. Also,
with Vulkan's memory aliasing, it's hard to tell just from which
resources are accessed whether or not a command buffer leaves its input
memory undamaged.

> (E.g.
> any render pass without side effects other than the render targets
> and
> where the background shader does no loads, or even render passes that
> do
> loads but where all draws are opaque, which are all things the
> current
> Gallium driver is intimately familiar with since Crazy Tiler
> Optimizations™ need that info to be provided anyway). So I was
> wondering
> whether it'd make sense to have such an idempotency/restartable flag
> on
> job submission, and then the driver would do its best to recover and
> rerun it if it gets killed by an unrelated concurrent bad job.
> 
> Then again this all depends on an investigation into what we *can* do
> during firmware recovery that hasn't happened at all yet. It might be
> that it isn't safe to do anything really, or that doing things
> depends
> on touching even deeper firmware state structs that we treat as
> opaque
> right now and we really don't want to have to touch...
> 
> But maybe none of this is worth it in practice, it just sounded like
> it
> could be useful maybe?

Maybe? It's not clear to me that such a flag would be useful or even
practical to provide from the Mesa side.  Ideally, you'd be able to
figure out when a fault happens, what VM it happened in and exactly
what work was in-flight when it happened and only kill the one guilty
VM.  However, it sounds like your understanding of the firmware is
currently rough enough that doing so may not be practical.  In that
case, the best thing to do is to kill any VMs which were on the GPU at
the time and hope the individual apps are able to recover.

> Now that I look at it, we have a lovely "what is this flag doing
> anyway"
> bit already passed from Mesa through to the firmware we called
> ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S which, now that I look at it,
> is
> actually getting set when any attachment (any color, Z, S) is not
> being
> cleared for that pass (so it's loaded). That could very well be an
> "is
> not idempotent" flag... and maybe that means the firmware does this
> for
> us already? Sounds like something to test... I might have some
> 16Kx16K
> GLmark runs to do concurrent with an evil faulting job now ^^ (and
> then
> that also means we need to set it when shaders have side effects and
> stuff, which right now we don't).
> 
> > > > Just signal the problem back to userspace and let the user
> > > > space
> > > > driver 
> > > > decide what to do.
> > > > 
> > > > The background is that most graphics applications (games etc..)
> > > > then 
> > > > rather start on the next frame instead of submitting the
> > > > current
> > > > one 
> > > > again while compute applications make sure that the abort and
> > > > tell
> > > > the 
> > > > user that the calculations might be corrupted and need to be
> > > > redone.
> > 
> > The guarantee that Vulkan makes is that, if you idle the GPU and
> > you
> > haven't gotten a DEVICE_LOST yet, your data is good.  If you get a
> > DEVICE_LOST, all bets are off.  The problem is that, no matter how
> > fast
> > the error propagation may be in the kernel or userspace driver,
> > errors
> > can still show up in strange ways.  An OOB buffer access could end
> > up
> > modifying a shader binary which gets run 3 frames later and causes
> > a
> > corruption.  Once you've faulted, you really have no idea how far
> > back
> > is good or what memory is corrupted.  You have to assume that
> > everything mapped to the GPU VA space is potentially toast.
> 
> Yes of course, for the actually faulting VM all bets are off after a
> fault (though we can try a bit harder at least... I have a READ_ONLY
> BO
> flag now, I should set it on the shader pools!).
> 
> > > Actually I wanted to ask about error notifications. Right now we
> > > have
> > > an
> > > out-of-band mechanism to provide detailed fault info to userspace
> > > which
> > > works fine, but in principle it's optional.
> > 
> > This is fine, in principal.  Because of the nature of errors, async
> > is
> > fine as long as the error shows up eventually.  Faster is better,
> > for
> > sure, but error latency doesn't really matter in practice.
> > 
> > > However, I also mark the hw
> > >  fences as errored when a fault happens (with an errno that
> > > describes
> > > the overall situation), but that never makes it into the
> > > drm_sched
> > > job
> > > complete fence. I looked at the drm_sched code and I didn't see
> > > any
> > > error propagation. Is that supposed to work, or am I supposed to
> > > directly mark the drm_sched side fence as complete, or did I
> > > misunderstand all this? I get the feeling maybe existing drivers
> > > just
> > > rely on the recovery/timeout/etc paths to mark jobs as errored
> > > (since
> > > those do it explicitly) and never need error forwarding from the
> > > hw
> > > fence?
> > 
> > The end behavior needs to be that all fences for all jobs submitted
> > to
> > the queue get signaled.  That's needed to satisfy the finite time
> > guarantees of dma_fence.  Exactly how that happens (let the job
> > run,
> > abort all the jobs, etc.) is an implementation detail for the
> > driver to
> > decide.  If you want, you can also set a bit on the context (or
> > queue)
> > to mark it as dead and start returning EIO or similar from any
> > ioctls
> > trying to submit more work if you wanted.  Not required but you
> > can.
> 
> Fences have an error flag though, does that get reported to userspace
> somehow? I thought it did, but maybe not, or maybe only drm_sched not
> propagating it is the issue?
> 
> In other words, absent my fancy stats reporting BO system, what is
> the
> normal way that an explicit sync driver signals to userspace that the
> job associated with a syncobj has failed?

One is via the return value from exec/submit.  Often there's also a
query mechanism for more detailed information.  It's not particularly
standard at the moment, I'm afraid.  I could point you at i915 but I
wouldn't call that uAPI something to be emulated, in general.

> (If there is no way, then I'll probably want to change the stats BO
> system to be configurable, so if you ask for no stats/time info, you
> only get overall job status and faults, which has less overhead.)

There is an error but it doesn't automatically get propagated to
userspace.  So, for instance, a SYNCOBJ_WAIT ioctl won't return an
error if it sees a fence error.  It needs to get caught by the driver
and returned through a driver ioctl somehow.

~Faith
  
Daniel Vetter April 5, 2023, 1:52 p.m. UTC | #18
On Tue, Mar 07, 2023 at 11:25:36PM +0900, Asahi Lina wrote:
> drm_sched_fini() currently leaves any pending jobs dangling, which
> causes segfaults and other badness when job completion fences are
> signaled after the scheduler is torn down.
> 
> Explicitly detach all jobs from their completion callbacks and free
> them. This makes it possible to write a sensible safe abstraction for
> drm_sched, without having to externally duplicate the tracking of
> in-flight jobs.
> 
> This shouldn't regress any existing drivers, since calling
> drm_sched_fini() with any pending jobs is broken and this change should
> be a no-op if there are no pending jobs.
> 
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5c0add2c7546..0aab1e0aebdd 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
>  	struct drm_sched_entity *s_entity;
> +	struct drm_sched_job *s_job, *tmp;
>  	int i;
>  
> -	if (sched->thread)
> -		kthread_stop(sched->thread);
> +	if (!sched->thread)
> +		return;
> +
> +	/*
> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
> +	 * and cleaning up complete jobs.
> +	 */
> +	drm_sched_stop(sched, NULL);
> +
> +	/*
> +	 * Iterate through the pending job list and free all jobs.
> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
> +	 * otherwise it is responsible for keeping any necessary data structures for
> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
> +	 * putting them in its own queue or doing its own refcounting).
> +	 */

This comment makes me wonder whether we shouldn't go one step further and
have a drm_sched_quiescent, which waits for any in-flight jobs to complete
and cancels everything else. Because even if rust guarantees that you
don't have any memory bugs, if you just leak things by sprinkling
reference-counted pointer wrappers everywhere you still have a semantic
bug.

Except now it's much harder to realize that because there's no Oops and
KASAN doesn't tell you about it either. I think it would be much better if
the scheduler code and rust abstraction provider drivers the correct
lifetimes and very strongly encourage them to only have borrowed
references and not additional refcounting of their own.

I think Christian mentioned that this would block in close() or context
destruction, which is no good at all. And with the 1:1
drm_scheduler:drm_sched_entity design for there's no other place. This is
way I've suggested in the Xe threads that we should make the current
drm_scheduler an implementation detail hidden from drivers, with a new
drm_scheduler which is always per-engine for all cases as the driver api
interface.  And the internal scheduler attached to either that (for
current drivers) or drm_sched_entity (for fw scheduling drivers) as
needed. With that
- the sched_entity cleanup could take care of this code here for the fw
  scheduler case
- the drm_sched_fini could take care of blocking appropriately before the
  driver is unloaded for any lagging in-flight jobs, without blocking
  userspace
- drivers should not end up with any need to reference-count either
  per-ctx/drm_sched_entity or per-drm_sched_job data, ever

Because any comment that's along the lines of "drivers need to refcount"
is bad business, because it either means leaks (rust) or crashes (C). I
much prefer when drivers have to put in extra effort to get things wrong
because by default the lifetimes are Just Right(tm).
-Daniel

> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
> +		spin_lock(&sched->job_list_lock);
> +		list_del_init(&s_job->list);
> +		spin_unlock(&sched->job_list_lock);
> +		sched->ops->free_job(s_job);
> +	}
> +
> +	kthread_stop(sched->thread);
>  
>  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>  		struct drm_sched_rq *rq = &sched->sched_rq[i];
> 
> -- 
> 2.35.1
>
  

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5c0add2c7546..0aab1e0aebdd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1119,10 +1119,33 @@  EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+		sched->ops->free_job(s_job);
+	}
+
+	kthread_stop(sched->thread);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];