Message ID | 20230406-scheduler-uaf-1-v1-1-8e5662269d25@asahilina.net |
---|---|
State | New |
Series | drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
Commit Message
Asahi Lina
April 5, 2023, 4:34 p.m. UTC
A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.
Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
drivers/gpu/drm/scheduler/sched_fence.c | 4 +++-
include/drm/gpu_scheduler.h | 5 +++++
3 files changed, 14 insertions(+), 2 deletions(-)
---
base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
change-id: 20230406-scheduler-uaf-1-994ec34cac93
Thank you,
~~ Lina
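For context on how the oops is actually reached, here is a minimal sketch of the kind of long-lived dma_fence consumer the commit message refers to. The function name and output format are invented for illustration; this is not the exact in-tree debugfs code.

```c
#include <linux/dma-fence.h>
#include <linux/printk.h>

/*
 * Illustration only: a consumer such as the dma-buf debugfs dump walks
 * fences that may have been attached to a shared buffer long ago and
 * asks their ops for a timeline name.
 */
static void dump_attached_fence(struct dma_fence *f)
{
	/*
	 * For a drm_sched fence this callback used to dereference
	 * fence->sched, which may already have been freed together with
	 * the scheduler -> use-after-free.  With this patch it returns
	 * the name copied into the fence at init time instead.
	 */
	pr_info("fence %llu-%llu on timeline %s\n",
		f->context, f->seqno, f->ops->get_timeline_name(f));
}
```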
Comments
On 2023-04-05 12:34, Asahi Lina wrote: > A signaled scheduler fence can outlive its scheduler, since fences are > independently reference counted. Therefore, we can't reference the > scheduler in the get_timeline_name() implementation. > > Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared > dma-bufs reference fences from GPU schedulers that no longer exist. > > Signed-off-by: Asahi Lina <lina@asahilina.net> > --- Thanks for fixing this. If there's no objections, I'll add my tag and I'll push this tomorrow. I think this should go in through drm-misc-fixes and drm-misc-next. Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Regards, Luben > drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++- > drivers/gpu/drm/scheduler/sched_fence.c | 4 +++- > include/drm/gpu_scheduler.h | 5 +++++ > 3 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c > index 15d04a0ec623..8b3b949b2ce8 100644 > --- a/drivers/gpu/drm/scheduler/sched_entity.c > +++ b/drivers/gpu/drm/scheduler/sched_entity.c > @@ -368,7 +368,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity) > > /* > * Fence is from the same scheduler, only need to wait for > - * it to be scheduled > + * it to be scheduled. > + * > + * Note: s_fence->sched could have been freed and reallocated > + * as another scheduler. This false positive case is okay, as if > + * the old scheduler was freed all of its jobs must have > + * signaled their completion fences. > */ > fence = dma_fence_get(&s_fence->scheduled); > dma_fence_put(entity->dependency); > diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c > index 7fd869520ef2..33b145dfa38c 100644 > --- a/drivers/gpu/drm/scheduler/sched_fence.c > +++ b/drivers/gpu/drm/scheduler/sched_fence.c > @@ -66,7 +66,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence) > static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f) > { > struct drm_sched_fence *fence = to_drm_sched_fence(f); > - return (const char *)fence->sched->name; > + return (const char *)fence->sched_name; > } > > static void drm_sched_fence_free_rcu(struct rcu_head *rcu) > @@ -168,6 +168,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence, > unsigned seq; > > fence->sched = entity->rq->sched; > + strlcpy(fence->sched_name, entity->rq->sched->name, > + sizeof(fence->sched_name)); > seq = atomic_inc_return(&entity->fence_seq); > dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled, > &fence->lock, entity->fence_context, seq); > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > index 9db9e5e504ee..49f019731891 100644 > --- a/include/drm/gpu_scheduler.h > +++ b/include/drm/gpu_scheduler.h > @@ -295,6 +295,11 @@ struct drm_sched_fence { > * @lock: the lock used by the scheduled and the finished fences. > */ > spinlock_t lock; > + /** > + * @sched_name: the name of the scheduler that owns this fence. We > + * keep a copy here since fences can outlive their scheduler. > + */ > + char sched_name[16]; > /** > * @owner: job owner for debugging > */ > > --- > base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6 > change-id: 20230406-scheduler-uaf-1-994ec34cac93 > > Thank you, > ~~ Lina >
Am 05.04.23 um 18:34 schrieb Asahi Lina: > A signaled scheduler fence can outlive its scheduler, since fences are > independently reference counted. Well that is actually not correct. Schedulers are supposed to stay around until the hw they have been driving is no longer present. E.g. the reference was scheduler_fence->hw_fence->driver->scheduler. Your use case is now completely different to that and this won't work any more. This here might just be the first case where that breaks. Regards, Christian. > Therefore, we can't reference the > scheduler in the get_timeline_name() implementation. > > Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared > dma-bufs reference fences from GPU schedulers that no longer exist. > > Signed-off-by: Asahi Lina <lina@asahilina.net> > --- > drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++- > drivers/gpu/drm/scheduler/sched_fence.c | 4 +++- > include/drm/gpu_scheduler.h | 5 +++++ > 3 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c > index 15d04a0ec623..8b3b949b2ce8 100644 > --- a/drivers/gpu/drm/scheduler/sched_entity.c > +++ b/drivers/gpu/drm/scheduler/sched_entity.c > @@ -368,7 +368,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity) > > /* > * Fence is from the same scheduler, only need to wait for > - * it to be scheduled > + * it to be scheduled. > + * > + * Note: s_fence->sched could have been freed and reallocated > + * as another scheduler. This false positive case is okay, as if > + * the old scheduler was freed all of its jobs must have > + * signaled their completion fences. > */ > fence = dma_fence_get(&s_fence->scheduled); > dma_fence_put(entity->dependency); > diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c > index 7fd869520ef2..33b145dfa38c 100644 > --- a/drivers/gpu/drm/scheduler/sched_fence.c > +++ b/drivers/gpu/drm/scheduler/sched_fence.c > @@ -66,7 +66,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence) > static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f) > { > struct drm_sched_fence *fence = to_drm_sched_fence(f); > - return (const char *)fence->sched->name; > + return (const char *)fence->sched_name; > } > > static void drm_sched_fence_free_rcu(struct rcu_head *rcu) > @@ -168,6 +168,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence, > unsigned seq; > > fence->sched = entity->rq->sched; > + strlcpy(fence->sched_name, entity->rq->sched->name, > + sizeof(fence->sched_name)); > seq = atomic_inc_return(&entity->fence_seq); > dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled, > &fence->lock, entity->fence_context, seq); > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > index 9db9e5e504ee..49f019731891 100644 > --- a/include/drm/gpu_scheduler.h > +++ b/include/drm/gpu_scheduler.h > @@ -295,6 +295,11 @@ struct drm_sched_fence { > * @lock: the lock used by the scheduled and the finished fences. > */ > spinlock_t lock; > + /** > + * @sched_name: the name of the scheduler that owns this fence. We > + * keep a copy here since fences can outlive their scheduler. > + */ > + char sched_name[16]; > /** > * @owner: job owner for debugging > */ > > --- > base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6 > change-id: 20230406-scheduler-uaf-1-994ec34cac93 > > Thank you, > ~~ Lina >
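To make the lifetime chain Christian describes concrete, below is a minimal sketch of a driver whose hw fence transitively keeps the scheduler alive. All type and function names are hypothetical; this only illustrates the ownership model being argued for, not code from any existing driver.

```c
#include <linux/dma-fence.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical driver object that owns the scheduler. */
struct my_driver {
	struct kref ref;
	struct drm_gpu_scheduler sched;	/* torn down only with the last ref */
};

struct my_hw_fence {
	struct dma_fence base;
	struct my_driver *drv;		/* hw fence -> driver -> scheduler */
};

static void my_driver_free(struct kref *ref)
{
	struct my_driver *drv = container_of(ref, struct my_driver, ref);

	drm_sched_fini(&drv->sched);
	kfree(drv);
}

/* dma_fence_ops .release callback for the hw fence. */
static void my_hw_fence_release(struct dma_fence *f)
{
	struct my_hw_fence *hwf = container_of(f, struct my_hw_fence, base);

	/* Dropping the last hw fence reference is what lets the
	 * driver object, and with it the scheduler, go away. */
	kref_put(&hwf->drv->ref, my_driver_free);
	dma_fence_free(f);
}
```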
On Thu, Apr 06, 2023 at 10:29:57AM +0200, Christian König wrote: > Am 05.04.23 um 18:34 schrieb Asahi Lina: > > A signaled scheduler fence can outlive its scheduler, since fences are > > independently reference counted. > > Well that is actually not correct. Schedulers are supposed to stay around > until the hw they have been driving is no longer present. > > E.g. the reference was scheduler_fence->hw_fence->driver->scheduler. > > Your use case is now completely different to that and this won't work any > more. This is why I'm a bit a broken record suggesting that for the fw scheduler case, where we have drm_sched_entity:drm_scheduler 1:1 and created at runtime, we really should rework the interface exposed to drivers: - drm_scheduler stays the thing that's per-engine and stays around for as long as the driver - We split out a drm_sched_internal, which is either tied to drm_scheduler (ringbuffer scheduler mode) or drm_sched_entity (fw ctx scheduling mode). - drm/sched internals are updated to dtrt in all these cases. And there's a lot, stuff like drm_sched_job is quite tricky if each driver needs to protect against concurrent ctx/entity creation/destruction, and I really don't like the idea that drivers hand-roll this kind of tricky state transition code that's used in the exceptional tdr/gpu/fw-death situation all themselves. > This here might just be the first case where that breaks. Yeah I expect there's going to be a solid stream of these, and we're just going to random-walk in circles if this effort doesn't come with at least some amount of design. Thus far no one really comment on the above plan though, so I'm not sure what the consensu plan is among all the various fw-scheduling driver efforts ... -Daniel > > Regards, > Christian. > > > Therefore, we can't reference the > > scheduler in the get_timeline_name() implementation. > > > > Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared > > dma-bufs reference fences from GPU schedulers that no longer exist. > > > > Signed-off-by: Asahi Lina <lina@asahilina.net> > > --- > > drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++- > > drivers/gpu/drm/scheduler/sched_fence.c | 4 +++- > > include/drm/gpu_scheduler.h | 5 +++++ > > 3 files changed, 14 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c > > index 15d04a0ec623..8b3b949b2ce8 100644 > > --- a/drivers/gpu/drm/scheduler/sched_entity.c > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c > > @@ -368,7 +368,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity) > > /* > > * Fence is from the same scheduler, only need to wait for > > - * it to be scheduled > > + * it to be scheduled. > > + * > > + * Note: s_fence->sched could have been freed and reallocated > > + * as another scheduler. This false positive case is okay, as if > > + * the old scheduler was freed all of its jobs must have > > + * signaled their completion fences. 
> > */ > > fence = dma_fence_get(&s_fence->scheduled); > > dma_fence_put(entity->dependency); > > diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c > > index 7fd869520ef2..33b145dfa38c 100644 > > --- a/drivers/gpu/drm/scheduler/sched_fence.c > > +++ b/drivers/gpu/drm/scheduler/sched_fence.c > > @@ -66,7 +66,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence) > > static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f) > > { > > struct drm_sched_fence *fence = to_drm_sched_fence(f); > > - return (const char *)fence->sched->name; > > + return (const char *)fence->sched_name; > > } > > static void drm_sched_fence_free_rcu(struct rcu_head *rcu) > > @@ -168,6 +168,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence, > > unsigned seq; > > fence->sched = entity->rq->sched; > > + strlcpy(fence->sched_name, entity->rq->sched->name, > > + sizeof(fence->sched_name)); > > seq = atomic_inc_return(&entity->fence_seq); > > dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled, > > &fence->lock, entity->fence_context, seq); > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > index 9db9e5e504ee..49f019731891 100644 > > --- a/include/drm/gpu_scheduler.h > > +++ b/include/drm/gpu_scheduler.h > > @@ -295,6 +295,11 @@ struct drm_sched_fence { > > * @lock: the lock used by the scheduled and the finished fences. > > */ > > spinlock_t lock; > > + /** > > + * @sched_name: the name of the scheduler that owns this fence. We > > + * keep a copy here since fences can outlive their scheduler. > > + */ > > + char sched_name[16]; > > /** > > * @owner: job owner for debugging > > */ > > > > --- > > base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6 > > change-id: 20230406-scheduler-uaf-1-994ec34cac93 > > > > Thank you, > > ~~ Lina > > >
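As a purely illustrative sketch of the split Daniel proposes (none of these types exist in the tree), the idea might be organized roughly like this: a long-lived per-engine object stays with the driver, while the dynamic run-queue/job state moves behind a separately refcounted internal object that firmware-scheduling drivers tie to the entity instead.

```c
#include <linux/kref.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical sketch only; these structs are not proposed API. */
struct sketch_sched_internal {
	struct kref refcount;		/* created/destroyed at runtime */
	struct list_head pending_jobs;	/* tdr/teardown handled by the core */
	spinlock_t job_lock;
};

struct sketch_gpu_scheduler {
	const char *name;			/* per engine, lives with the driver */
	struct sketch_sched_internal *internal;	/* ring-buffer scheduling mode */
};

struct sketch_sched_entity {
	struct sketch_sched_internal *internal;	/* fw-ctx scheduling mode (1:1) */
};
```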
On 06/04/2023 17.29, Christian König wrote:
> Am 05.04.23 um 18:34 schrieb Asahi Lina:
>> A signaled scheduler fence can outlive its scheduler, since fences are
>> independently reference counted.
>
> Well that is actually not correct. Schedulers are supposed to stay
> around until the hw they have been driving is no longer present.

But the fences can outlive that. You can GPU render into an imported
buffer, which attaches a fence to it. Then the GPU goes away but the
fence is still attached to the buffer. Then you oops when you cat that
debugfs file...

My use case does this way more often (since schedulers are tied to UAPI
objects), which is how I found this, but as far as I can tell this is
already broken for all drivers on unplug/unbind/anything else that would
destroy the schedulers with fences potentially referenced on separate
scanout devices or at any other DMA-BUF consumer.

> E.g. the reference was scheduler_fence->hw_fence->driver->scheduler.

It's up to drivers not to mess that up, since the HW fence has the same
requirements that it can outlive other driver objects, just like any
other fence. That's not something the scheduler has to be concerned
with, it's a driver correctness issue.

Of course, in C you have to get it right yourself, while with correct
Rust abstractions will cause your code to fail to compile if you do it
wrong ^^

In my particular case, the hw_fence is a very dumb object that has no
references to anything, only an ID and a pending op count. Jobs hold
references to it and decrement it until it signals, not the other way
around. So that object can live forever regardless of whether the rest
of the device is gone.

> Your use case is now completely different to that and this won't work
> any more.
>
> This here might just be the first case where that breaks.

This bug already exists, it's just a lot rarer for existing use cases...
but either way Xe is doing the same thing I am, so I'm not the only one
here either.

~~ Lina
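A rough C rendering of the "very dumb" hw fence Lina describes, an ID plus a pending-op count that completing jobs decrement. The actual driver is written in Rust; the names and the completion helper below are invented for illustration.

```c
#include <linux/atomic.h>
#include <linux/dma-fence.h>

/* Hypothetical: a hw fence that references nothing else in the driver. */
struct dumb_hw_fence {
	struct dma_fence base;
	u64 fence_id;		/* firmware-visible identifier */
	atomic_t pending_ops;	/* ops still outstanding on the hardware */
};

/* Called once per completed op; signals when the count reaches zero. */
static void dumb_hw_fence_op_done(struct dumb_hw_fence *hwf)
{
	if (atomic_dec_and_test(&hwf->pending_ops))
		dma_fence_signal(&hwf->base);
}
```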
Am 06.04.23 um 10:49 schrieb Asahi Lina:
> On 06/04/2023 17.29, Christian König wrote:
>> Am 05.04.23 um 18:34 schrieb Asahi Lina:
>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>> independently reference counted.
>>
>> Well that is actually not correct. Schedulers are supposed to stay
>> around until the hw they have been driving is no longer present.
>
> But the fences can outlive that. You can GPU render into an imported
> buffer, which attaches a fence to it. Then the GPU goes away but the
> fence is still attached to the buffer. Then you oops when you cat that
> debugfs file...

No, exactly that's the point: you wouldn't oops.

>
> My use case does this way more often (since schedulers are tied to
> UAPI objects), which is how I found this, but as far as I can tell
> this is already broken for all drivers on unplug/unbind/anything else
> that would destroy the schedulers with fences potentially referenced
> on separate scanout devices or at any other DMA-BUF consumer.

Even if a GPU is hot plugged, the data structures for it should only go
away with the last reference. Since the scheduler fence is referencing
the hw fence, and the hw fence in turn is referencing the driver, this
shouldn't happen.

>
>> E.g. the reference was scheduler_fence->hw_fence->driver->scheduler.
>
> It's up to drivers not to mess that up, since the HW fence has the
> same requirements that it can outlive other driver objects, just like
> any other fence. That's not something the scheduler has to be
> concerned with, it's a driver correctness issue.
>
> Of course, in C you have to get it right yourself, while with correct
> Rust abstractions will cause your code to fail to compile if you do it
> wrong ^^
>
> In my particular case, the hw_fence is a very dumb object that has no
> references to anything, only an ID and a pending op count. Jobs hold
> references to it and decrement it until it signals, not the other way
> around. So that object can live forever regardless of whether the rest
> of the device is gone.

That is then certainly a bug. This won't work that way, and the timeline
name is just the tip of the iceberg here.

The fence reference count needs to keep both the scheduler and driver
alive. Otherwise you could for example unload the module and immediately
oops because your fence_ops go away.

>
>> Your use case is now completely different to that and this won't work
>> any more.
>>
>> This here might just be the first case where that breaks.
>
> This bug already exists, it's just a lot rarer for existing use
> cases... but either way Xe is doing the same thing I am, so I'm not
> the only one here either.

No it doesn't. You just have implemented the references differently than
they are supposed to be.

Fixing this one occasion here would mitigate that immediate oops, but
doesn't fix the fundamental problem.

Regards,
Christian.

>
> ~~ Lina
>
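The module-lifetime half of this argument can be illustrated in isolation: a fence can pin the module that provides its ops so that a signaled fence never outlives the fence_ops code. This is a hedged sketch of that one idea, not how drm_sched currently handles it, and it does not address the scheduler-object lifetime the rest of the thread is about.

```c
#include <linux/dma-fence.h>
#include <linux/module.h>

/*
 * Sketch: take a reference on the module providing the dma_fence_ops
 * for as long as the fence exists, and drop it in the release callback.
 */
static void my_fence_init(struct dma_fence *f, const struct dma_fence_ops *ops,
			  spinlock_t *lock, u64 context, u64 seqno)
{
	__module_get(THIS_MODULE);	/* dropped again in ->release */
	dma_fence_init(f, ops, lock, context, seqno);
}

static void my_fence_release(struct dma_fence *f)
{
	dma_fence_free(f);
	module_put(THIS_MODULE);
}
```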
On 06/04/2023 18.06, Christian König wrote: > Am 06.04.23 um 10:49 schrieb Asahi Lina: >> On 06/04/2023 17.29, Christian König wrote: >>> Am 05.04.23 um 18:34 schrieb Asahi Lina: >>>> A signaled scheduler fence can outlive its scheduler, since fences are >>>> independently reference counted. >>> >>> Well that is actually not correct. Schedulers are supposed to stay >>> around until the hw they have been driving is no longer present. >> >> But the fences can outlive that. You can GPU render into an imported >> buffer, which attaches a fence to it. Then the GPU goes away but the >> fence is still attached to the buffer. Then you oops when you cat that >> debugfs file... > > No, exactly that's the point you wouldn't ops. > >> >> My use case does this way more often (since schedulers are tied to >> UAPI objects), which is how I found this, but as far as I can tell >> this is already broken for all drivers on unplug/unbind/anything else >> that would destroy the schedulers with fences potentially referenced >> on separate scanout devices or at any other DMA-BUF consumer. > > Even if a GPU is hot plugged the data structures for it should only go > away with the last reference, since the scheduler fence is referencing > the hw fence and the hw fence in turn is referencing the driver this > shouldn't happen. So those fences can potentially keep half the driver data structures alive just for existing in a signaled state? That doesn't seem sensible (and it definitely doesn't work for our use case where schedulers can be created and destroyed freely, it could lead to way too much junk sticking around in memory). >> >>> E.g. the reference was scheduler_fence->hw_fence->driver->scheduler. >> >> It's up to drivers not to mess that up, since the HW fence has the >> same requirements that it can outlive other driver objects, just like >> any other fence. That's not something the scheduler has to be >> concerned with, it's a driver correctness issue. >> >> Of course, in C you have to get it right yourself, while with correct >> Rust abstractions will cause your code to fail to compile if you do it >> wrong ^^ >> >> In my particular case, the hw_fence is a very dumb object that has no >> references to anything, only an ID and a pending op count. Jobs hold >> references to it and decrement it until it signals, not the other way >> around. So that object can live forever regardless of whether the rest >> of the device is gone. > > That is then certainly a bug. This won't work that way, and the timelime > name is just the tip of the iceberg here. > > The fence reference count needs to keep both the scheduler and driver > alive. Otherwise you could for example unload the module and immediately > ops because your fence_ops go away. Yes, I realized the module issue after writing the email. But that's the *only* thing it needs to hold a reference to! Which is much more sensible than keeping a huge chunk of UAPI-adjacent data structures alive for a signaled fence that for all we know might stick around forever attached to some buffer. >>> Your use case is now completely different to that and this won't work >>> any more. >>> >>> This here might just be the first case where that breaks. >> >> This bug already exists, it's just a lot rarer for existing use >> cases... but either way Xe is doing the same thing I am, so I'm not >> the only one here either. > > No it doesn't. You just have implemented the references differently than > they are supposed to be. 
> > Fixing this one occasion here would mitigate that immediate ops, but > doesn't fix the fundamental problem. Honestly, at this point I'm starting to doubt whether we want to use drm_scheduler at all, because it clearly wasn't designed for our use case and every time I try to fix something your answer has been "you're using it wrong", and at the same time the practically nonexistent documentation makes it impossible to know how it was actually designed to be used. ~~ Lina
On 06/04/2023 18.15, Asahi Lina wrote: > On 06/04/2023 18.06, Christian König wrote: >> Am 06.04.23 um 10:49 schrieb Asahi Lina: >>> On 06/04/2023 17.29, Christian König wrote: >>>> Am 05.04.23 um 18:34 schrieb Asahi Lina: >>>>> A signaled scheduler fence can outlive its scheduler, since fences are >>>>> independently reference counted. >>>> >>>> Well that is actually not correct. Schedulers are supposed to stay >>>> around until the hw they have been driving is no longer present. >>> >>> But the fences can outlive that. You can GPU render into an imported >>> buffer, which attaches a fence to it. Then the GPU goes away but the >>> fence is still attached to the buffer. Then you oops when you cat that >>> debugfs file... >> >> No, exactly that's the point you wouldn't ops. >> >>> >>> My use case does this way more often (since schedulers are tied to >>> UAPI objects), which is how I found this, but as far as I can tell >>> this is already broken for all drivers on unplug/unbind/anything else >>> that would destroy the schedulers with fences potentially referenced >>> on separate scanout devices or at any other DMA-BUF consumer. >> >> Even if a GPU is hot plugged the data structures for it should only go >> away with the last reference, since the scheduler fence is referencing >> the hw fence and the hw fence in turn is referencing the driver this >> shouldn't happen. > > So those fences can potentially keep half the driver data structures > alive just for existing in a signaled state? That doesn't seem sensible > (and it definitely doesn't work for our use case where schedulers can be > created and destroyed freely, it could lead to way too much junk > sticking around in memory). > >>> >>>> E.g. the reference was scheduler_fence->hw_fence->driver->scheduler. >>> >>> It's up to drivers not to mess that up, since the HW fence has the >>> same requirements that it can outlive other driver objects, just like >>> any other fence. That's not something the scheduler has to be >>> concerned with, it's a driver correctness issue. >>> >>> Of course, in C you have to get it right yourself, while with correct >>> Rust abstractions will cause your code to fail to compile if you do it >>> wrong ^^ >>> >>> In my particular case, the hw_fence is a very dumb object that has no >>> references to anything, only an ID and a pending op count. Jobs hold >>> references to it and decrement it until it signals, not the other way >>> around. So that object can live forever regardless of whether the rest >>> of the device is gone. >> >> That is then certainly a bug. This won't work that way, and the timelime >> name is just the tip of the iceberg here. >> >> The fence reference count needs to keep both the scheduler and driver >> alive. Otherwise you could for example unload the module and immediately >> ops because your fence_ops go away. > > Yes, I realized the module issue after writing the email. But that's the > *only* thing it needs to hold a reference to! Which is much more > sensible than keeping a huge chunk of UAPI-adjacent data structures > alive for a signaled fence that for all we know might stick around > forever attached to some buffer. > >>>> Your use case is now completely different to that and this won't work >>>> any more. >>>> >>>> This here might just be the first case where that breaks. >>> >>> This bug already exists, it's just a lot rarer for existing use >>> cases... but either way Xe is doing the same thing I am, so I'm not >>> the only one here either. >> >> No it doesn't. 
You just have implemented the references differently than >> they are supposed to be. >> >> Fixing this one occasion here would mitigate that immediate ops, but >> doesn't fix the fundamental problem. > > Honestly, at this point I'm starting to doubt whether we want to use > drm_scheduler at all, because it clearly wasn't designed for our use > case and every time I try to fix something your answer has been "you're > using it wrong", and at the same time the practically nonexistent > documentation makes it impossible to know how it was actually designed > to be used. Also, requiring that the hw_fence keep the scheduler alive (which is documented nowhere) is a completely ridiculous lifetime requirement and layering violation across multiple subsystems. I have no idea how I'd make Rust abstractions uphold that requirement safely without doing something silly like having abstraction-specific hw_fence wrappers, and then you run into deadlocks due to the scheduler potentially being dropped from the job_done callback when the fence reference is dropped and just... no, please. This is terrible. If you don't want me to fix it we'll have to find another way because I can't work with this. ~~ Lina
On Thu, Apr 06, 2023 at 06:27:27PM +0900, Asahi Lina wrote: > On 06/04/2023 18.15, Asahi Lina wrote: > > On 06/04/2023 18.06, Christian König wrote: > > > Am 06.04.23 um 10:49 schrieb Asahi Lina: > > > > On 06/04/2023 17.29, Christian König wrote: > > > > > Am 05.04.23 um 18:34 schrieb Asahi Lina: > > > > > > A signaled scheduler fence can outlive its scheduler, since fences are > > > > > > independently reference counted. > > > > > > > > > > Well that is actually not correct. Schedulers are supposed to stay > > > > > around until the hw they have been driving is no longer present. > > > > > > > > But the fences can outlive that. You can GPU render into an imported > > > > buffer, which attaches a fence to it. Then the GPU goes away but the > > > > fence is still attached to the buffer. Then you oops when you cat that > > > > debugfs file... > > > > > > No, exactly that's the point you wouldn't ops. > > > > > > > > > > > My use case does this way more often (since schedulers are tied to > > > > UAPI objects), which is how I found this, but as far as I can tell > > > > this is already broken for all drivers on unplug/unbind/anything else > > > > that would destroy the schedulers with fences potentially referenced > > > > on separate scanout devices or at any other DMA-BUF consumer. > > > > > > Even if a GPU is hot plugged the data structures for it should only go > > > away with the last reference, since the scheduler fence is referencing > > > the hw fence and the hw fence in turn is referencing the driver this > > > shouldn't happen. > > > > So those fences can potentially keep half the driver data structures > > alive just for existing in a signaled state? That doesn't seem sensible > > (and it definitely doesn't work for our use case where schedulers can be > > created and destroyed freely, it could lead to way too much junk > > sticking around in memory). This is why the split betwen the hw fence and the public sched fence. Because once the hw fence stuff is sorted it should all be able to go away, without the public fence keeping everything alive. > > > > > E.g. the reference was scheduler_fence->hw_fence->driver->scheduler. > > > > > > > > It's up to drivers not to mess that up, since the HW fence has the > > > > same requirements that it can outlive other driver objects, just like > > > > any other fence. That's not something the scheduler has to be > > > > concerned with, it's a driver correctness issue. > > > > > > > > Of course, in C you have to get it right yourself, while with correct > > > > Rust abstractions will cause your code to fail to compile if you do it > > > > wrong ^^ > > > > > > > > In my particular case, the hw_fence is a very dumb object that has no > > > > references to anything, only an ID and a pending op count. Jobs hold > > > > references to it and decrement it until it signals, not the other way > > > > around. So that object can live forever regardless of whether the rest > > > > of the device is gone. > > > > > > That is then certainly a bug. This won't work that way, and the timelime > > > name is just the tip of the iceberg here. > > > > > > The fence reference count needs to keep both the scheduler and driver > > > alive. Otherwise you could for example unload the module and immediately > > > ops because your fence_ops go away. > > > > Yes, I realized the module issue after writing the email. But that's the > > *only* thing it needs to hold a reference to! 
Which is much more > > sensible than keeping a huge chunk of UAPI-adjacent data structures > > alive for a signaled fence that for all we know might stick around > > forever attached to some buffer. > > > > > > > Your use case is now completely different to that and this won't work > > > > > any more. > > > > > > > > > > This here might just be the first case where that breaks. > > > > > > > > This bug already exists, it's just a lot rarer for existing use > > > > cases... but either way Xe is doing the same thing I am, so I'm not > > > > the only one here either. > > > > > > No it doesn't. You just have implemented the references differently than > > > they are supposed to be. > > > > > > Fixing this one occasion here would mitigate that immediate ops, but > > > doesn't fix the fundamental problem. > > > > Honestly, at this point I'm starting to doubt whether we want to use > > drm_scheduler at all, because it clearly wasn't designed for our use > > case and every time I try to fix something your answer has been "you're > > using it wrong", and at the same time the practically nonexistent > > documentation makes it impossible to know how it was actually designed > > to be used. > > Also, requiring that the hw_fence keep the scheduler alive (which is > documented nowhere) is a completely ridiculous lifetime requirement and > layering violation across multiple subsystems. I have no idea how I'd make > Rust abstractions uphold that requirement safely without doing something > silly like having abstraction-specific hw_fence wrappers, and then you run > into deadlocks due to the scheduler potentially being dropped from the > job_done callback when the fence reference is dropped and just... no, > please. This is terrible. If you don't want me to fix it we'll have to find > another way because I can't work with this. So generally the hw fence keeps the underlying gpu ctx alive, and the gpu context keeps the gpu vm alive. Pretty much has to, or your gpu starts executing stuff that's freed. Now for fw scheduler gpu ctx isn't just drm_sched_entity, but also drm_scheduler. Plus whatever driver stuff you have lying around for a ctx. Plus ofc a reference to the vm, which might in turn keep a ton of gem_bo alive. Still I'm not seeing where the fundamental issue is in this refcount scheme, or where it's silly? Mapping this all to Rust correctly is a challenge for sure, and also untangling the assumption that drm_scheduler is suddenly a lot more dynamic object (see my other reply), but fundamentally calling this all silly and terrible I don't understand ... -Daniel
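The reference chain Daniel sketches for the firmware-scheduling case could be drawn roughly as below. The type names are hypothetical; the point is only that the hw fence pins the context, and the context pins the scheduler, entity, and VM (and through the VM, potentially many GEM objects).

```c
#include <linux/dma-fence.h>
#include <linux/kref.h>
#include <drm/gpu_scheduler.h>

struct fw_gpuvm;	/* hypothetical, refcounted elsewhere */

/* Hypothetical fw-scheduling context: entity and scheduler are 1:1. */
struct fw_ctx {
	struct kref ref;
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	struct fw_gpuvm *vm;	/* keeps gem_bo mappings alive */
};

struct fw_hw_fence {
	struct dma_fence base;
	struct fw_ctx *ctx;	/* hw fence -> ctx -> sched/entity/vm */
};
```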
On 06/04/2023 18.48, Daniel Vetter wrote: > On Thu, Apr 06, 2023 at 06:27:27PM +0900, Asahi Lina wrote: >> On 06/04/2023 18.15, Asahi Lina wrote: >>> On 06/04/2023 18.06, Christian König wrote: >>>> Am 06.04.23 um 10:49 schrieb Asahi Lina: >>>>> On 06/04/2023 17.29, Christian König wrote: >>>>>> Am 05.04.23 um 18:34 schrieb Asahi Lina: >>>>>>> A signaled scheduler fence can outlive its scheduler, since fences are >>>>>>> independently reference counted. >>>>>> >>>>>> Well that is actually not correct. Schedulers are supposed to stay >>>>>> around until the hw they have been driving is no longer present. >>>>> >>>>> But the fences can outlive that. You can GPU render into an imported >>>>> buffer, which attaches a fence to it. Then the GPU goes away but the >>>>> fence is still attached to the buffer. Then you oops when you cat that >>>>> debugfs file... >>>> >>>> No, exactly that's the point you wouldn't ops. >>>> >>>>> >>>>> My use case does this way more often (since schedulers are tied to >>>>> UAPI objects), which is how I found this, but as far as I can tell >>>>> this is already broken for all drivers on unplug/unbind/anything else >>>>> that would destroy the schedulers with fences potentially referenced >>>>> on separate scanout devices or at any other DMA-BUF consumer. >>>> >>>> Even if a GPU is hot plugged the data structures for it should only go >>>> away with the last reference, since the scheduler fence is referencing >>>> the hw fence and the hw fence in turn is referencing the driver this >>>> shouldn't happen. >>> >>> So those fences can potentially keep half the driver data structures >>> alive just for existing in a signaled state? That doesn't seem sensible >>> (and it definitely doesn't work for our use case where schedulers can be >>> created and destroyed freely, it could lead to way too much junk >>> sticking around in memory). > > This is why the split betwen the hw fence and the public sched fence. > Because once the hw fence stuff is sorted it should all be able to go > away, without the public fence keeping everything alive. That doesn't seem to be how it works though... as far as I can tell the public finished fence keeps the public scheduled fence alive, which keeps the hw fence alive. If that is how it is supposed to work, then we're back at this being a simple UAF which is fixed by this patch (and then the fence code also needs to be fixed to actually drop the parent fence reference when it signals). >> Also, requiring that the hw_fence keep the scheduler alive (which is >> documented nowhere) is a completely ridiculous lifetime requirement and >> layering violation across multiple subsystems. I have no idea how I'd make >> Rust abstractions uphold that requirement safely without doing something >> silly like having abstraction-specific hw_fence wrappers, and then you run >> into deadlocks due to the scheduler potentially being dropped from the >> job_done callback when the fence reference is dropped and just... no, >> please. This is terrible. If you don't want me to fix it we'll have to find >> another way because I can't work with this. > > So generally the hw fence keeps the underlying gpu ctx alive, and the gpu > context keeps the gpu vm alive. Pretty much has to, or your gpu starts > executing stuff that's freed. I do that using a different mechanism, since the way this GPU signals events doesn't map directly to fences, nor to the UAPI queue's (and therefore the DRM scheduler's hw fence's) idea of what job completion is. 
There's sort of a timeline mechanism instead, but the timelines are 24 bits and wrap and there's a global set of 128 of them and multiple can be required for a single drm_scheduler job. Keeping queues alive (which then keep active jobs alive which keep whatever resources they require alive) happens at the level of those 128 event slots, as I call them, which hold a reference to whatever queue is assigned to them. Once those signal the queues walk through pending jobs until the signaled value, and that's where the hw fence gets signaled. After that the jobs that were just completed are freed (This is a sticky point right now due to possible deadlocks if done on the event thread. What I'm doing right now is a bit of a hack that works well enough in practice, but I want to refactor it to be cleaner once we have more Rust abstractions to work with...) It might actually make sense to start moving some lifetimes into the hw_fence if we find we need to more tightly tie firmware lifetimes together (this is TBD, I'm still figuring out all the corner case rules), but that's only practical if the drm_sched fence doesn't hold up the entire hw_fence. For that to make sense the hw_fence needs to be able to go away soon after it signals. This whole thing is pretty messy and I'm open to refactoring ideas, I just don't want to have to keep drm_sched instances lying around just because someone has a buffer reference with one of its fences installed... those potentially long-lived fences shouldn't keep anything more than small bits of driver-global state alive (ideally just the module itself). > Now for fw scheduler gpu ctx isn't just drm_sched_entity, but also > drm_scheduler. Plus whatever driver stuff you have lying around for a ctx. > Plus ofc a reference to the vm, which might in turn keep a ton of gem_bo > alive. In this case I don't actually keep the user BOs alive (because that's a File level thing), and yes that does mean if you kill a process using the GPU chances are it's going to fault as all the userspace buffers disappear. But I do keep all the firmware and kernel-allocated objects and BOs around of course, those are the ones that have references from the pending job objects in one way or another (and which are critical to keep alive, as otherwise the firmware will crash). I don't keep the scheduler though, since that's just a higher-level thing to me, not something the GPU itself cares about. If userspace goes away and BOs are going away anyway, we might as well cancel all pending jobs and let any currently executing ones freewheel (that was one of my earlier patches) while allowing the scheduler to be torn down. If we need to though, I can figure out some way to keep the BO mapping/etc state alive when the File goes away until jobs complete. It's just sort of academic because userspace is supposed to clean up after itself and wait for all jobs, and there's no way for the kernel to guarantee the GPU won't fault anyway, so a malicious userspace could always do the wrong thing... > Still I'm not seeing where the fundamental issue is in this refcount > scheme, or where it's silly? Mapping this all to Rust correctly is a > challenge for sure, and also untangling the assumption that drm_scheduler > is suddenly a lot more dynamic object (see my other reply), but > fundamentally calling this all silly and terrible I don't understand ... Christian is saying that the hw fence needs to keep a transitive reference to the scheduler. How do we enforce that? 
Fences are one abstraction, and the scheduler is another. We can't encode that a given generic Fence implementation contains a reference to a given scheduler in the type system. Rust abstractions are easy to write for self-contained subsystems with clear lifetime rules for their components. They are hard to write for convoluted lifetime rules that span several modules (without built-in refcounting to deal with it), and it is exactly those kinds of designs that are error-prone to use and document, regardless of whether you're writing Rust or C. From what Christian says, drm_sched today has that kind of design and it is intentional. Since we can't express complex lifetime relationships in the type system in a usable way, we have to use refcounting. So first, that means the DRM scheduler needs to be refcounted (it actually already is in my abstraction, due to the Entity reference). Next, since there's a cross-subsystem lifetime requirement, that means we can no longer accept arbitrary generic user fences returned to the drm_sched as a hw fence. Instead we need some kind of wrapper fence similar to what drm_sched itself does, or a special subtype leaking into the dma_fence abstraction, or an open coded proxy into the underlying user fence that isn't actually its own fence, or something. Now your hw fence keeps a reference to the DRM scheduler, fine. Now what happens when a job completes? It signals the hw fence. The DRM scheduler thread/workqueue wakes up, signals the job complete fence, and frees the job including dropping its fence reference. What happens if nobody else is actually holding a reference to those fences? Freeing the job frees them too. Which frees the hw fence. What if that was the last reference to the DRM scheduler left? We free the DRM scheduler... from its own thread. Deadlock! This is the exact same situation that led to my previous patch to directly kill jobs when you free the scheduler, because the alternative to keep memory safety was to have jobs hold a reference to the scheduler, and then you run into the same deadlock (with fewer steps because fences aren't involved). I tried it, it doesn't work. Of course you could fix it with a sidecar "job freer" thread/workqueue or something on the abstraction side to get it out of that execution context... and now we're on the way to reimplementing half of the scheduler. In my experience, when you start having to add piles of functionality to ensure safety and provide a clean Rust API on top of the underlying C API, to the point where you're duplicating things the C implementation does in order to do them safely, that means the underlying C API is not good and you need to fix it instead... So far, drm_sched has been the only component where I've run into this... repeatedly... across several footguns in the API... ;; ~~ Lina
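The deadlock Lina walks through can be written down concretely. A hedged sketch, assuming a refcounted scheduler wrapper and a free_job callback that runs in the scheduler's own execution context (as it did at the time); all driver-side names are hypothetical.

```c
#include <linux/dma-fence.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

struct refcounted_sched {
	struct kref ref;
	struct drm_gpu_scheduler base;
};

struct my_job {
	struct drm_sched_job base;
	struct dma_fence *hw_fence;
	struct refcounted_sched *sched;
};

static void refcounted_sched_release(struct kref *ref)
{
	struct refcounted_sched *s =
		container_of(ref, struct refcounted_sched, ref);

	/*
	 * drm_sched_fini() waits for the scheduler's own thread to stop.
	 * If we get here from my_free_job() below, we are running *on*
	 * that thread, so this never returns: deadlock.
	 */
	drm_sched_fini(&s->base);
	kfree(s);
}

/* .free_job callback: runs in the scheduler's execution context. */
static void my_free_job(struct drm_sched_job *job)
{
	struct my_job *mj = container_of(job, struct my_job, base);

	dma_fence_put(mj->hw_fence);			/* may drop the last fence ref */
	kref_put(&mj->sched->ref, refcounted_sched_release); /* may be the last sched ref */
	drm_sched_job_cleanup(job);
	kfree(mj);
}
```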
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..8b3b949b2ce8 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -368,7 +368,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 
 		/*
 		 * Fence is from the same scheduler, only need to wait for
-		 * it to be scheduled
+		 * it to be scheduled.
+		 *
+		 * Note: s_fence->sched could have been freed and reallocated
+		 * as another scheduler. This false positive case is okay, as if
+		 * the old scheduler was freed all of its jobs must have
+		 * signaled their completion fences.
 		 */
 		fence = dma_fence_get(&s_fence->scheduled);
 		dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index 7fd869520ef2..33b145dfa38c 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -66,7 +66,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
 static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
 {
 	struct drm_sched_fence *fence = to_drm_sched_fence(f);
-	return (const char *)fence->sched->name;
+	return (const char *)fence->sched_name;
 }
 
 static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -168,6 +168,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
 	unsigned seq;
 
 	fence->sched = entity->rq->sched;
+	strlcpy(fence->sched_name, entity->rq->sched->name,
+		sizeof(fence->sched_name));
 	seq = atomic_inc_return(&entity->fence_seq);
 	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
 		       &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9db9e5e504ee..49f019731891 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -295,6 +295,11 @@ struct drm_sched_fence {
 	 * @lock: the lock used by the scheduled and the finished fences.
 	 */
 	spinlock_t lock;
+	/**
+	 * @sched_name: the name of the scheduler that owns this fence. We
+	 * keep a copy here since fences can outlive their scheduler.
+	 */
+	char sched_name[16];
 	/**
 	 * @owner: job owner for debugging
 	 */