[0/3] DRM scheduler documentation & bug fixes

Message ID 20230714-drm-sched-fixes-v1-0-c567249709f7@asahilina.net

Asahi Lina July 14, 2023, 8:21 a.m. UTC
Based on the previous discussion while I was writing the Rust
abstractions for the DRM scheduler, it looks like we're overdue for some
documentation.

This series first attempts to document what I've learned about the
scheduler and what I believe should be the *intended* lifetime
semantics, and then fixes a few bugs that result from that:

1. The DRM scheduler fences must be allowed to outlive the scheduler;
   we cannot require that the scheduler outlives them. This is
   non-negotiable. The whole point of these fences is
   to decouple the underlying hardware/driver from consumers, such as
   dma-bufs with an attached fence. If this requirement were not met,
   then we'd have to somehow keep the scheduler and all the driver
   components associated with it alive as long as a dma-buf with an
   attached drm_sched fence is alive, which could be indefinitely even
   after the hardware that produced that dma-buf is long gone. Consider,
   for example, using a hot-pluggable GPU to write to a dma-buf in main
   memory, which gets presented on an integrated display controller, and
   then the GPU is unplugged. That buffer could potentially live
   forever; we can't block GPU driver cleanup on that.

2. Make the DRM scheduler properly clean up jobs on shutdown, such that
   we can support the use case of tearing down the scheduler with
   in-flight jobs. This is important to cleanly support the firmware
   scheduling use case, where the DRM scheduler is attached to a file
   (which we want to be able to tear down quickly when userspace closes
   it) while firmware could continue to (attempt to) run in-flight jobs
   after that point. The major missing codepath to make this work is
   detaching jobs from their HW fences on scheduler shutdown, so
   implement that (see the rough sketch after this list). This also
   makes writing a safe Rust abstraction
   plausible, since otherwise we'd have to add a huge pile of complexity
   to that side in order to enforce the invariant that the scheduler
   outlives its jobs (including setting up a workqueue to handle
   scheduler teardown and other craziness, which is an unacceptable
   level of complexity for what should be a lightweight abstraction).
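
For point 2, the new cleanup ends up shaped roughly like this. This is
a simplified sketch rather than the actual patch: locking, refcounting
of the detached hw fence, and edge cases around already-completed jobs
are elided, the job->cb field name is assumed from the current code,
and the error code is just for illustration.

        /* During scheduler teardown, with the submission thread stopped */
        struct drm_sched_job *job, *tmp;

        list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
                struct drm_sched_fence *s_fence = job->s_fence;

                /* Detach the job from its hw fence. If the completion
                 * callback already fired, the normal completion path
                 * handles this job, so leave it alone. */
                if (s_fence->parent &&
                    !dma_fence_remove_callback(s_fence->parent, &job->cb))
                        continue;

                /* Explicitly signal the finished fence with an error so
                 * no consumer (e.g. a dma-buf waiter) is left hanging. */
                dma_fence_set_error(&s_fence->finished, -ESRCH);
                dma_fence_signal(&s_fence->finished);

                list_del_init(&job->list);
                sched->ops->free_job(job);
        }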

I believe there *may* still be at least one UAF-type bug related to case
2 above, but it's very hard to trigger and I wasn't able to figure out
what caused it the one time I saw it recently. Other than that, things
look quite robust on the Asahi driver with these patches, even when
trying to break things by killing GPU consumers in a tight loop and
things like that. If we agree this is a good way forward, I think this
is a good start even if there's still a bug lurking somewhere.

Aside (but related to the previous discussion): the can_run_job thing
is gone; I'm using fences returned from prepare() now, and that works
well (it actually fixes one corner case related to wait contexts I'd
missed), so hopefully that's OK with everyone ^^
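
In case it helps picture it, the driver side ends up looking roughly
like this. All the driver names here (to_my_job(), my_try_claim_slot(),
my_slot_available_fence() and the other callbacks) are made up for
illustration; only the prepare_job() shape is the point.

        #include <drm/gpu_scheduler.h>

        static struct dma_fence *
        my_sched_prepare_job(struct drm_sched_job *sched_job,
                             struct drm_sched_entity *s_entity)
        {
                struct my_job *job = to_my_job(sched_job);      /* hypothetical */

                /* Grab whatever the job needs to run (a firmware queue
                 * slot, VM ID, ...). If it isn't available yet, return a
                 * fence the scheduler can wait on instead of blocking or
                 * failing here. */
                if (!my_try_claim_slot(job))                    /* hypothetical */
                        return my_slot_available_fence(job);    /* hypothetical */

                return NULL;    /* nothing to wait for, the job can run */
        }

        static const struct drm_sched_backend_ops my_sched_ops = {
                .prepare_job    = my_sched_prepare_job,
                .run_job        = my_sched_run_job,             /* hypothetical */
                .timedout_job   = my_sched_timedout_job,        /* hypothetical */
                .free_job       = my_sched_free_job,            /* hypothetical */
        };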

Changes from the previous version of patch #2: explicitly signal
detached job fences with an error. I'd missed that and I think it's what
was causing us some rare lockups due to fences never getting signaled.

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
Asahi Lina (3):
      drm/scheduler: Add more documentation
      drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
      drm/scheduler: Clean up jobs when the scheduler is torn down.
 drivers/gpu/drm/scheduler/sched_entity.c |  7 ++-
 drivers/gpu/drm/scheduler/sched_fence.c  |  4 +-
 drivers/gpu/drm/scheduler/sched_main.c   | 90 ++++++++++++++++++++++++++++++--
 include/drm/gpu_scheduler.h              |  5 ++
 4 files changed, 99 insertions(+), 7 deletions(-)
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230714-drm-sched-fixes-94bea043bbe7

Thank you,
~~ Lina

Comments

Christian König July 14, 2023, 8:40 a.m. UTC | #1
On 14.07.23 at 10:21, Asahi Lina wrote:
> Document the implied lifetime rules of the scheduler (or at least the
> intended ones), as well as the expectations of how resource acquisition
> should be handled.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 58 ++++++++++++++++++++++++++++++++--
>   1 file changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7b2bfc10c1a5..1f3bc3606239 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -43,9 +43,61 @@
>    *
>    * The jobs in a entity are always scheduled in the order that they were pushed.
>    *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Lifetime rules
> + * --------------
> + *
> + * Getting object lifetimes right across the stack is critical to avoid UAF
> + * issues. The DRM scheduler has the following lifetime rules:
> + *
> + * - The scheduler must outlive all of its entities.
> + * - Jobs pushed to the scheduler are owned by it, and must only be freed
> + *   after the free_job() callback is called.
> + * - Scheduler fences are reference-counted and may outlive the scheduler.

> + * - The scheduler *may* be destroyed while jobs are still in flight.

That's not correct. The scheduler can only be destroyed after all the
entities serving it have been destroyed, and after all the jobs already
pushed to the hw have finished.

What might be possible to add is allowing the hw to still be working on
the already pushed jobs, but so far that was rejected as undesirable.

> + * - There is no guarantee that all jobs have been freed when all entities
> + *   and the scheduler have been destroyed. Jobs may be freed asynchronously
> + *   after this point.
> + * - Once a job is taken from the entity's queue and pushed to the hardware,
> + *   i.e. the pending queue, the entity must not be referenced any more
> + *   through the job's entity pointer. In other words, entities are not
> + *   required to outlive job execution.
> + *
> + * If the scheduler is destroyed with jobs in flight, the following
> + * happens:
> + *
> + * - Jobs that were pushed but have not yet run will be destroyed as part
> + *   of the entity cleanup (which must happen before the scheduler itself
> + *   is destroyed, per the first rule above). This signals the job
> + *   finished fence with an error flag. This process runs asynchronously
> + *   after drm_sched_entity_destroy() returns.
> + * - Jobs that are in-flight on the hardware are "detached" from their
> + *   driver fence (the fence returned from the run_job() callback). In
> + *   this case, it is up to the driver to ensure that any bookkeeping or
> + *   internal data structures have separately managed lifetimes and that
> + *   the hardware either cancels the jobs or runs them to completion.
> + *   The DRM scheduler itself will immediately signal the job complete
> + *   fence (with an error flag) and then call free_job() as part of the
> + *   cleanup process.
> + *
> + * After the scheduler is destroyed, drivers *may* (but are not required to)
> + * skip signaling their remaining driver fences, as long as they have only ever
> + * been returned to the scheduler being destroyed as the return value from
> + * run_job() and not passed anywhere else.

This is an outright NAK to this. Fences must always be cleanly signaled.

IIRC Daniel documented this as mandatory dma_fence behavior.

Regards,
Christian.

>   If these fences are used in any other
> + * context, then the driver *must* signal them, per the usual fence signaling
> + * rules.
> + *
> + * Resource management
> + * -------------------
> + *
> + * Drivers may need to acquire certain hardware resources (e.g. VM IDs) in order
> + * to run a job. This process must happen during the job's prepare() callback,
> + * not in the run() callback. If any resource is unavailable at job prepare time,
> + * the driver must return a suitable fence that can be waited on to wait for the
> + * resource to (potentially) become available.
> + *
> + * In order to avoid deadlocks, drivers must always acquire resources in the
> + * same order, and release them in opposite order when a job completes or if
> + * resource acquisition fails.
>    */
>   
>   #include <linux/kthread.h>
>
  
Christian König July 14, 2023, 8:43 a.m. UTC | #2
On 14.07.23 at 10:21, Asahi Lina wrote:
> A signaled scheduler fence can outlive its scheduler, since fences are
> independently reference counted. Therefore, we can't reference the
> scheduler in the get_timeline_name() implementation.
>
> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
> dma-bufs reference fences from GPU schedulers that no longer exist.
>
> Signed-off-by: Asahi Lina <lina@asahilina.net>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>   include/drm/gpu_scheduler.h              | 5 +++++
>   3 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index b2bbc8a68b30..17f35b0b005a 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>   
>   		/*
>   		 * Fence is from the same scheduler, only need to wait for
> -		 * it to be scheduled
> +		 * it to be scheduled.
> +		 *
> +		 * Note: s_fence->sched could have been freed and reallocated
> +		 * as another scheduler. This false positive case is okay, as if
> +		 * the old scheduler was freed all of its jobs must have
> +		 * signaled their completion fences.

This is outright nonsense. As long as an entity for a scheduler exists,
it is not allowed to free up this scheduler.

So this function can't be called like this.

>   		 */
>   		fence = dma_fence_get(&s_fence->scheduled);
>   		dma_fence_put(entity->dependency);
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> index ef120475e7c6..06a0eebcca10 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>   static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
>   {
>   	struct drm_sched_fence *fence = to_drm_sched_fence(f);
> -	return (const char *)fence->sched->name;
> +	return (const char *)fence->sched_name;
>   }
>   
>   static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>   	unsigned seq;
>   
>   	fence->sched = entity->rq->sched;
> +	strlcpy(fence->sched_name, entity->rq->sched->name,
> +		sizeof(fence->sched_name));
>   	seq = atomic_inc_return(&entity->fence_seq);
>   	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>   		       &fence->lock, entity->fence_context, seq);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index e95b4837e5a3..4fa9523bd47d 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>            * @lock: the lock used by the scheduled and the finished fences.
>            */
>   	spinlock_t			lock;
> +        /**
> +         * @sched_name: the name of the scheduler that owns this fence. We
> +	 * keep a copy here since fences can outlive their scheduler.
> +         */
> +	char sched_name[16];

This just mitigates the problem, but doesn't fix it.

The real issue is that the hw fence is kept around much longer than that.

On top of that, I'm not willing to increase the scheduler fence size
like that just to decouple the fences from the scheduler.

Regards,
Christian.

>           /**
>            * @owner: job owner for debugging
>            */
>