Message ID | 20230418100453.4433-1-dakr@redhat.com |
---|---|
State | New |
Headers |
From: Danilo Krummrich <dakr@redhat.com>
To: luben.tuikov@amd.com, airlied@gmail.com, daniel@ffwll.ch, l.stach@pengutronix.de, christian.koenig@amd.com
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, Danilo Krummrich <dakr@redhat.com>
Subject: [PATCH v2] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
Date: Tue, 18 Apr 2023 12:04:53 +0200
Message-Id: <20230418100453.4433-1-dakr@redhat.com> |
Series |
[v2] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
|
|
Commit Message
Danilo Krummrich
April 18, 2023, 10:04 a.m. UTC
It has already happened a few times that patches slipped through which
implemented access to an entity through a job that was already removed
from the entity's queue. Since jobs and entities might have different
lifecycles, this can potentially cause UAF bugs.

In order to make it obvious that a job's entity pointer shouldn't be
accessed after drm_sched_entity_pop_job() was called successfully, set
the job's entity pointer to NULL once the job is removed from the entity's
queue.

Moreover, debugging a potential NULL pointer dereference is far easier
than debugging memory corrupted through a UAF.
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
drivers/gpu/drm/scheduler/sched_main.c | 4 ++++
2 files changed, 10 insertions(+)
Comments
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

and applied to drm-misc-next. Thanks!

Regards,
Luben

On 2023-04-18 06:04, Danilo Krummrich wrote:
> [full patch quoted above snipped]
On 18/04/2023 11:04, Danilo Krummrich wrote:
> [...]
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>

This triggers a splat for me (with the Panfrost driver), the cause of which is the following code in drm_sched_get_cleanup_job():

	if (job) {
		job->entity->elapsed_ns += ktime_to_ns(
			ktime_sub(job->s_fence->finished.timestamp,
				  job->s_fence->scheduled.timestamp));
	}

which indeed is accessing the entity after the job has been returned from drm_sched_entity_pop_job(). And obviously job->entity is a NULL pointer with this patch.

I'm afraid I don't fully understand the lifecycle, so I'm not sure if this is simply exposing an existing bug in drm_sched_get_cleanup_job() or if this commit needs to be reverted.

Thanks,
Steve
Hi Steven,

Am Mittwoch, dem 19.04.2023 um 10:39 +0100 schrieb Steven Price:
> This triggers a splat for me (with the Panfrost driver), the cause of
> which is the following code in drm_sched_get_cleanup_job():
> [...]
> I'm afraid I don't fully understand the lifecycle, so I'm not sure if
> this is simply exposing an existing bug in drm_sched_get_cleanup_job()
> or if this commit needs to be reverted.

Not sure which tree you are on. The offending commit has been reverted in 6.3-rc5.

Regards,
Lucas
On 19/04/2023 10:44, Lucas Stach wrote:
> [...]
> Not sure which tree you are on. The offending commit has been reverted
> in 6.3-rc5.

This is in drm-misc-next - I'm not sure which "offending commit" you are referring to. I'm referring to:

96c7c2f4d5bd ("drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()")

which was merged yesterday to drm-misc-next (and is currently the top commit).

Is there another commit which has been reverted elsewhere which is conflicting?

Steve
On 19/04/2023 10:53, Steven Price wrote:
> [...]
> Is there another commit which has been reverted elsewhere which is
> conflicting?

Answering my own question, the conflicting commit is:

baad10973fdb ("Revert "drm/scheduler: track GPU active time per entity"")

But that commit isn't (yet) in drm-misc-next. Which unfortunately means drm-misc-next is broken until a back-merge happens.

Steve
Am Mittwoch, dem 19.04.2023 um 10:53 +0100 schrieb Steven Price:
> [...]
> Is there another commit which has been reverted elsewhere which is
> conflicting?

The commit which introduces the use-after-free, which is now upgraded to the above NULL ptr dereference, is df622729ddbf ("drm/scheduler: track GPU active time per entity") and has been reverted in baad10973fdb ("Revert "drm/scheduler: track GPU active time per entity""). The revert is upstream in 6.3-rc5.

Regards,
Lucas
On 2023-04-19 06:07, Lucas Stach wrote:
> [...]
> The commit which introduces the use-after-free, which is now upgraded
> to the above NULL ptr dereference, is df622729ddbf ("drm/scheduler:
> track GPU active time per entity") and has been reverted in
> baad10973fdb ("Revert "drm/scheduler: track GPU active time per
> entity""). The revert is upstream in 6.3-rc5.

We may actually need to track the effective average time a job from a context has spent on the GPU, for scheduling purposes. Just like df622729ddbf ("drm/scheduler: track GPU active time per entity") does it, but also counting the number of samples taken, so that we can compare to other entities (contexts) and pick an appropriate context from which to take out jobs.

When we used Round Robin scheduling, there was a complaint that some contexts were starved, namely ones with a large number of jobs submitted early, relative to contexts which had just one or two jobs. This is described in 977d97f18b5b8e ("drm/scheduler: Set the FIFO scheduling policy as the default").

Now there's a complaint that because some contexts submit many long-lived jobs, possibly back-to-back, while other contexts submit few short-lived jobs, then because of the FIFO scheduling, i.e. "oldest ready job executes next", we get to starve contexts which submit a sparse load of short-lived jobs.

There are a few scheduling options to make scheduling fairer, i.e. selecting the next entity from which to get a job to run on the GPU, and it would seem we might not be able to get away with "oldest job waiting" or "last time a job was scheduled from this entity", exactly because of the reasons described in the previous paragraph.

Other scheduling algorithms are unavailable since we don't know a priori how long a job will take, and the next best thing is to approximate it from past jobs--for which we need a measure of the average job execution time per entity.

In principle, decoupling at the entity <--> job level might be too low a level to do this at, and perhaps this decoupling should be done at a higher level, but we might get away with something as simple as

	if (job->entity) {
		track time...
	}
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..a9c6118e534b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 		drm_sched_rq_update_fifo(entity, next->submit_ts);
 	}
 
+	/* Jobs and entities might have different lifecycles. Since we're
+	 * removing the job from the entities queue, set the jobs entity pointer
+	 * to NULL to prevent any future access of the entity through this job.
+	 */
+	sched_job->entity = NULL;
+
 	return sched_job;
 }
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9b16480686f6..e89a3e469cd5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -42,6 +42,10 @@
  * the hardware.
  *
  * The jobs in a entity are always scheduled in the order that they were pushed.
+ *
+ * Note that once a job was taken from the entities queue and pushed to the
+ * hardware, i.e. the pending queue, the entity must not be referenced anymore
+ * through the jobs entity pointer.
  */
 
 #include <linux/kthread.h>