Message ID: 20230714-drm-sched-fixes-v1-3-c567249709f7@asahilina.net
State: New

Headers:
From: Asahi Lina <lina@asahilina.net>
Date: Fri, 14 Jul 2023 17:21:31 +0900
Subject: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.
Message-Id: <20230714-drm-sched-fixes-v1-3-c567249709f7@asahilina.net>
In-Reply-To: <20230714-drm-sched-fixes-v1-0-c567249709f7@asahilina.net>
To: Luben Tuikov <luben.tuikov@amd.com>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, Sumit Semwal <sumit.semwal@linaro.org>, Christian König <christian.koenig@amd.com>
Cc: Faith Ekstrand <faith.ekstrand@collabora.com>, Alyssa Rosenzweig <alyssa@rosenzweig.io>, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, asahi@lists.linux.dev, Asahi Lina <lina@asahilina.net>

Series: DRM scheduler documentation & bug fixes

Commit Message
Asahi Lina
July 14, 2023, 8:21 a.m. UTC
drm_sched_fini() currently leaves any pending jobs dangling, which
causes segfaults and other badness when job completion fences are
signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free
them. This makes it possible to write a sensible safe abstraction for
drm_sched, without having to externally duplicate the tracking of
in-flight jobs.
This shouldn't regress any existing drivers, since calling
drm_sched_fini() with any pending jobs is broken and this change should
be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
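
For context, the body of the change under review, reconstructed from the hunks quoted verbatim in the replies below:

```diff
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1f3bc3606239..a4da4aac0efd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+
+		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
+		drm_sched_fence_finished(s_job->s_fence);
+
+		WARN_ON(s_job->s_fence->parent);
+		sched->ops->free_job(s_job);
+	}
+
+	kthread_stop(sched->thread);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
```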
Comments
On 2023-07-14 04:21, Asahi Lina wrote: > drm_sched_fini() currently leaves any pending jobs dangling, which > causes segfaults and other badness when job completion fences are > signaled after the scheduler is torn down. If there are pending jobs, ideally we want to call into the driver, so that it can release resources it may be holding for those. The idea behind "pending" is that they are pending in the hardware and we don't know their state until signalled/the callback called. (Or unless the device is reset and we get a notification of that fact.) > Explicitly detach all jobs from their completion callbacks and free > them. This makes it possible to write a sensible safe abstraction for > drm_sched, without having to externally duplicate the tracking of > in-flight jobs. > > This shouldn't regress any existing drivers, since calling > drm_sched_fini() with any pending jobs is broken and this change should > be a no-op if there are no pending jobs. While this statement is true on its own, it kind of contradicts the premise of the first paragraph. > Signed-off-by: Asahi Lina <lina@asahilina.net> > --- > drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++-- > 1 file changed, 30 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 1f3bc3606239..a4da4aac0efd 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init); > void drm_sched_fini(struct drm_gpu_scheduler *sched) > { > struct drm_sched_entity *s_entity; > + struct drm_sched_job *s_job, *tmp; > int i; > > - if (sched->thread) > - kthread_stop(sched->thread); > + if (!sched->thread) > + return; > + > + /* > + * Stop the scheduler, detaching all jobs from their hardware callbacks > + * and cleaning up complete jobs. > + */ > + drm_sched_stop(sched, NULL); > + > + /* > + * Iterate through the pending job list and free all jobs. > + * This assumes the driver has either guaranteed jobs are already stopped, or that > + * otherwise it is responsible for keeping any necessary data structures for > + * in-progress jobs alive even when the free_job() callback is called early (e.g. by > + * putting them in its own queue or doing its own refcounting). > + */ > + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { > + spin_lock(&sched->job_list_lock); > + list_del_init(&s_job->list); > + spin_unlock(&sched->job_list_lock); > + > + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH); > + drm_sched_fence_finished(s_job->s_fence); I'd imagine it's better to rebase this on top of drm-misc-next where drm_sched_fence_finished() takes one more parameter--the error. > + > + WARN_ON(s_job->s_fence->parent); > + sched->ops->free_job(s_job); > + } > + > + kthread_stop(sched->thread); > > for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) { > struct drm_sched_rq *rq = &sched->sched_rq[i]; > Conceptually I don't mind this patch--I see what it is trying to achieve, but technically, we want the driver to detect GPU removal and return shared resources back, such as "jobs", which DRM is also aware of. In the case where we're initiating the tear, we should notify the driver that we're about to forget jobs (resources), so that it knows to return them back or that it shouldn't notify us for them (since we've notified we're forgetting them.) 
(Note also that in this latter case, traditionally, the device would be reset, so that we can guarantee that it has forgotten all shared resources which we are to tear up. This is somewhat more complicated with GPUs, thus the method pointed out above.)
On 15/07/2023 16.14, Luben Tuikov wrote: > On 2023-07-14 04:21, Asahi Lina wrote: >> drm_sched_fini() currently leaves any pending jobs dangling, which >> causes segfaults and other badness when job completion fences are >> signaled after the scheduler is torn down. > > If there are pending jobs, ideally we want to call into the driver, > so that it can release resources it may be holding for those. > The idea behind "pending" is that they are pending in the hardware > and we don't know their state until signalled/the callback called. > (Or unless the device is reset and we get a notification of that fact.) That's what the job->free_job() callback does, then the driver is free to do whatever it wants with those jobs. A driver could opt to synchronously kill those jobs (if it can) or account for them separately/asynchronously. What this patch basically says is that if you destroy a scheduler with pending jobs, it immediately considers them terminated with an error, and returns ownership back to the driver for freeing. Then the driver can decide how to handle the rest and whatever the underlying hardware state is. >> Explicitly detach all jobs from their completion callbacks and free >> them. This makes it possible to write a sensible safe abstraction for >> drm_sched, without having to externally duplicate the tracking of >> in-flight jobs. >> >> This shouldn't regress any existing drivers, since calling >> drm_sched_fini() with any pending jobs is broken and this change should >> be a no-op if there are no pending jobs. > > While this statement is true on its own, it kind of contradicts > the premise of the first paragraph. I mean right *now* it's broken, before this patch. I'm trying to make it safe, but it shouldn't regress any exiting drivers since if they trigger this code path they are broken today. > >> Signed-off-by: Asahi Lina <lina@asahilina.net> >> --- >> drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++-- >> 1 file changed, 30 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> index 1f3bc3606239..a4da4aac0efd 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init); >> void drm_sched_fini(struct drm_gpu_scheduler *sched) >> { >> struct drm_sched_entity *s_entity; >> + struct drm_sched_job *s_job, *tmp; >> int i; >> >> - if (sched->thread) >> - kthread_stop(sched->thread); >> + if (!sched->thread) >> + return; >> + >> + /* >> + * Stop the scheduler, detaching all jobs from their hardware callbacks >> + * and cleaning up complete jobs. >> + */ >> + drm_sched_stop(sched, NULL); >> + >> + /* >> + * Iterate through the pending job list and free all jobs. >> + * This assumes the driver has either guaranteed jobs are already stopped, or that >> + * otherwise it is responsible for keeping any necessary data structures for >> + * in-progress jobs alive even when the free_job() callback is called early (e.g. by >> + * putting them in its own queue or doing its own refcounting). 
>> + */ >> + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { >> + spin_lock(&sched->job_list_lock); >> + list_del_init(&s_job->list); >> + spin_unlock(&sched->job_list_lock); >> + >> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH); >> + drm_sched_fence_finished(s_job->s_fence); > > I'd imagine it's better to rebase this on top of drm-misc-next where > drm_sched_fence_finished() takes one more parameter--the error. Ah, sure! I can do that. > >> + >> + WARN_ON(s_job->s_fence->parent); >> + sched->ops->free_job(s_job); >> + } >> + >> + kthread_stop(sched->thread); >> >> for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) { >> struct drm_sched_rq *rq = &sched->sched_rq[i]; >> > > Conceptually I don't mind this patch--I see what it is trying to achieve, > but technically, we want the driver to detect GPU removal and return shared > resources back, such as "jobs", which DRM is also aware of. I think you missed the context of why I'm doing this, so in short: my use case (like Xe's) involves using a separate drm_sched instance *per file* since these queues are scheduled directly by the firmware. So this isn't about GPU removal, but rather about a GPU context going away while jobs are in flight (e.g. the process got killed). We want that to quickly kill the "DRM view" of the world, including signaling all the fences with an error and freeing resources like the scheduler itself. In the case of this particular GPU, there is no known way to actively and synchronously abort GPU jobs, so we need to let them run to completion (or failure), but we don't want that to block process cleanup and freeing a bunch of high-level resources. The driver is architected roughly along the lines of a firmware abstraction layer that maps to the firmware shared memory structures, and then a layer on top that implements the DRM view. When a process gets killed, the DRM side (which includes the scheduler, etc.) gets torn down immediately, and it makes sense to handle this cleanup inside drm_sched since it already has a view into what jobs are in flight. Otherwise, I would have to duplicate job tracking in the driver (actually worse: in the Rust abstraction for safety), which doesn't make much sense. But what I *do* have in the driver is tracking of the firmware structures. So when the drm_sched gets torn down and all the jobs killed, the underlying firmware jobs do run to completion, and the resources they use are all cleaned up after that (it's all reference counted). The primitive involved here is that in-flight firmware jobs are assigned an event completion slot, and that keeps a reference to them from a global array until the events fire and all the jobs are known to have completed. This keeps things memory-safe, since we absolutely cannot free/destroy firmware structures while they are in use (otherwise the firmware crashes, which is fatal on these GPUs - requires a full system reboot to recover). In practice, with the VM map model that we use, what ends up happening when a process gets killed is that all the user objects for in-flight jobs get unmapped, which usually causes the GPU hardware (not firmware) to fault. This then triggers early termination of jobs anyway via the firmware fault recovery flow. But even that takes some short amount of time, and by then all the drm_sched stuff is long gone and we're just dealing with the in-flight firmware stuff. 
> In the case where we're initiating the tear, we should notify the driver that > we're about to forget jobs (resources), so that it knows to return them back > or that it shouldn't notify us for them (since we've notified we're forgetting them.) That contradicts Christian's comment. I tried to document that (after this patch) the scheduler no longer cares about hw fences and whether they are signaled or not after it's destroyed, and I got a strongly worded NAK for it. Sooo... which is it? Is it okay for drivers not to signal the hw fence after a scheduler teardown, or not? But really, I don't see a use case for an explicit "about to forget job" callback. The job free callback already serves the purpose of telling the driver to clean up resources associated with a job. If it wants to synchronously abort things there, it could easily take over its own fence signaling and do something with the underlying stuff if the fence is not signaled yet. In my case, since the driver is written in Rust and free_job() just maps to the destructor (Drop impl), that just ends up freeing a bunch of memory and other objects, and I don't particularly care about the state of the firmware side any more after that. The flow is the same whether it was a successful job completion, a failure, or an early destruction due to the drm_sched getting torn down. > (Note also that in this latter case, traditionally, the device would be reset, > so that we can guarantee that it has forgotten all shared resources which > we are to tear up. This is somewhat more complicated with GPUs, thus the method > pointed out above.) Yeah, in the firmware scheduling case we can't do this at all unless the firmware has an explicit teardown/forget op (which I'm not aware of) and a full GPU reset isn't something we can do either. Hence we just let the underlying jobs complete. In practice they tend to die pretty quickly anyway once all the buffers are unmapped. ~~ Lina
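
An aside to make the ownership hand-back described above concrete: a minimal driver-side free_job() sketch, assuming a refcounted firmware-facing job object. All my_* names are hypothetical and not taken from the Rust driver being discussed.

```c
/*
 * Minimal sketch (not from any real driver): once free_job() runs, drm_sched
 * forgets the job, while the driver's own refcounted firmware state stays
 * alive until the hardware/firmware is actually done with it.
 */
struct my_job {
	struct drm_sched_job base;	/* embedded scheduler job */
	struct my_fw_job *fw_job;	/* refcounted firmware-facing state */
};

static void my_sched_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	/* Detach the job from the scheduler's bookkeeping. */
	drm_sched_job_cleanup(sched_job);

	/*
	 * The firmware-facing structures hold their own reference (e.g. via
	 * an event completion slot), so dropping the DRM-side reference here
	 * is safe even if the underlying job is still executing.
	 */
	my_fw_job_put(job->fw_job);
	kfree(job);
}

static const struct drm_sched_backend_ops my_sched_ops = {
	/* ... run_job, timedout_job ... */
	.free_job = my_sched_free_job,
};
```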
On 2023-07-16 03:51, Asahi Lina wrote: > On 15/07/2023 16.14, Luben Tuikov wrote: >> On 2023-07-14 04:21, Asahi Lina wrote: >>> drm_sched_fini() currently leaves any pending jobs dangling, which >>> causes segfaults and other badness when job completion fences are >>> signaled after the scheduler is torn down. >> >> If there are pending jobs, ideally we want to call into the driver, >> so that it can release resources it may be holding for those. >> The idea behind "pending" is that they are pending in the hardware >> and we don't know their state until signalled/the callback called. >> (Or unless the device is reset and we get a notification of that fact.) > > That's what the job->free_job() callback does, then the driver is free > to do whatever it wants with those jobs. A driver could opt to > synchronously kill those jobs (if it can) or account for them > separately/asynchronously. > > What this patch basically says is that if you destroy a scheduler with > pending jobs, it immediately considers them terminated with an error, > and returns ownership back to the driver for freeing. Then the driver > can decide how to handle the rest and whatever the underlying hardware > state is. > >>> Explicitly detach all jobs from their completion callbacks and free >>> them. This makes it possible to write a sensible safe abstraction for >>> drm_sched, without having to externally duplicate the tracking of >>> in-flight jobs. >>> >>> This shouldn't regress any existing drivers, since calling >>> drm_sched_fini() with any pending jobs is broken and this change should >>> be a no-op if there are no pending jobs. >> >> While this statement is true on its own, it kind of contradicts >> the premise of the first paragraph. > > I mean right *now* it's broken, before this patch. I'm trying to make it > safe, but it shouldn't regress any exiting drivers since if they trigger > this code path they are broken today. Not sure about other drivers--they can speak for themselves and the CC list should include them--please use "dim add-missing-cc" and make sure that the Git commit description contains the Cc tags--then git send-email will populate the SMTP CC. Feel free to add more Cc tags on top of that. > >> >>> Signed-off-by: Asahi Lina <lina@asahilina.net> >>> --- >>> drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++-- >>> 1 file changed, 30 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>> index 1f3bc3606239..a4da4aac0efd 100644 >>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init); >>> void drm_sched_fini(struct drm_gpu_scheduler *sched) >>> { >>> struct drm_sched_entity *s_entity; >>> + struct drm_sched_job *s_job, *tmp; >>> int i; >>> >>> - if (sched->thread) >>> - kthread_stop(sched->thread); >>> + if (!sched->thread) >>> + return; >>> + >>> + /* >>> + * Stop the scheduler, detaching all jobs from their hardware callbacks >>> + * and cleaning up complete jobs. >>> + */ >>> + drm_sched_stop(sched, NULL); >>> + >>> + /* >>> + * Iterate through the pending job list and free all jobs. >>> + * This assumes the driver has either guaranteed jobs are already stopped, or that >>> + * otherwise it is responsible for keeping any necessary data structures for >>> + * in-progress jobs alive even when the free_job() callback is called early (e.g. 
by >>> + * putting them in its own queue or doing its own refcounting). >>> + */ >>> + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { >>> + spin_lock(&sched->job_list_lock); >>> + list_del_init(&s_job->list); >>> + spin_unlock(&sched->job_list_lock); >>> + >>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH); >>> + drm_sched_fence_finished(s_job->s_fence); >> >> I'd imagine it's better to rebase this on top of drm-misc-next where >> drm_sched_fence_finished() takes one more parameter--the error. > > Ah, sure! I can do that. It's worth posting it as a stand-alone patch. Please make sure to add Cc tags into the commit description--use "dim add-missing-cc", perhaps also git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl" for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl for affected files.) Feel free to post it stand-alone and we'll let the natural review process take over. :-) > >> >>> + >>> + WARN_ON(s_job->s_fence->parent); >>> + sched->ops->free_job(s_job); >>> + } >>> + >>> + kthread_stop(sched->thread); >>> >>> for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) { >>> struct drm_sched_rq *rq = &sched->sched_rq[i]; >>> >> >> Conceptually I don't mind this patch--I see what it is trying to achieve, >> but technically, we want the driver to detect GPU removal and return shared >> resources back, such as "jobs", which DRM is also aware of. > > I think you missed the context of why I'm doing this, so in short: my As a general rule of thumb, in my writing emails I try to avoid using "you" and "I" as much as possible--it sets this divisive stage, and it can get misrepresented, especially in email. As is the case in research literature, if I absolutely have to use a pronoun--which rarely happens, I always use "we", and this is the most number of "I"-s I've used in a long while. > use case (like Xe's) involves using a separate drm_sched instance *per > file* since these queues are scheduled directly by the firmware. So this > isn't about GPU removal, but rather about a GPU context going away while > jobs are in flight (e.g. the process got killed). We want that to > quickly kill the "DRM view" of the world, including signaling all the > fences with an error and freeing resources like the scheduler itself. > > In the case of this particular GPU, there is no known way to actively > and synchronously abort GPU jobs, so we need to let them run to > completion (or failure), but we don't want that to block process cleanup > and freeing a bunch of high-level resources. The driver is architected > roughly along the lines of a firmware abstraction layer that maps to the > firmware shared memory structures, and then a layer on top that > implements the DRM view. When a process gets killed, the DRM side (which > includes the scheduler, etc.) gets torn down immediately, and it makes > sense to handle this cleanup inside drm_sched since it already has a > view into what jobs are in flight. Otherwise, I would have to duplicate > job tracking in the driver (actually worse: in the Rust abstraction for > safety), which doesn't make much sense. > > But what I *do* have in the driver is tracking of the firmware > structures. So when the drm_sched gets torn down and all the jobs > killed, the underlying firmware jobs do run to completion, and the > resources they use are all cleaned up after that (it's all reference > counted). The ref-count definitely helps here. 
> The primitive involved here is that in-flight firmware jobs > are assigned an event completion slot, and that keeps a reference to > them from a global array until the events fire and all the jobs are > known to have completed. This keeps things memory-safe, since we > absolutely cannot free/destroy firmware structures while they are in use > (otherwise the firmware crashes, which is fatal on these GPUs - requires > a full system reboot to recover). > > In practice, with the VM map model that we use, what ends up happening > when a process gets killed is that all the user objects for in-flight > jobs get unmapped, which usually causes the GPU hardware (not firmware) > to fault. This then triggers early termination of jobs anyway via the > firmware fault recovery flow. But even that takes some short amount of > time, and by then all the drm_sched stuff is long gone and we're just > dealing with the in-flight firmware stuff. > >> In the case where we're initiating the tear, we should notify the driver that >> we're about to forget jobs (resources), so that it knows to return them back >> or that it shouldn't notify us for them (since we've notified we're forgetting them.) > > That contradicts Christian's comment. I tried to document that (after > this patch) the scheduler no longer cares about hw fences and whether > they are signaled or not after it's destroyed, and I got a strongly > worded NAK for it. Sooo... which is it? Is it okay for drivers not to > signal the hw fence after a scheduler teardown, or not? Christian is correct in that we don't want to hang upstream control to the whims of a low-level device driver. > But really, I don't see a use case for an explicit "about to forget job" > callback. The job free callback already serves the purpose of telling Long time ago, in a galaxy far far away, this was needed in order to prevent device write-DMA into non-existing (random) memory. As this is not the case anymore, go with Christian's comment. > the driver to clean up resources associated with a job. If it wants to > synchronously abort things there, it could easily take over its own > fence signaling and do something with the underlying stuff if the fence > is not signaled yet. > > In my case, since the driver is written in Rust and free_job() just maps > to the destructor (Drop impl), that just ends up freeing a bunch of > memory and other objects, and I don't particularly care about the state > of the firmware side any more after that. The flow is the same whether > it was a successful job completion, a failure, or an early destruction > due to the drm_sched getting torn down. > >> (Note also that in this latter case, traditionally, the device would be reset, >> so that we can guarantee that it has forgotten all shared resources which >> we are to tear up. This is somewhat more complicated with GPUs, thus the method >> pointed out above.) > > Yeah, in the firmware scheduling case we can't do this at all unless the > firmware has an explicit teardown/forget op (which I'm not aware of) and > a full GPU reset isn't something we can do either. Hence we just let the > underlying jobs complete. In practice they tend to die pretty quickly > anyway once all the buffers are unmapped. Perhaps in the future, as more complex workloads are deferred to this hardware and driver, a real-time requirement might be needed for this "tend to die pretty quickly", that that there's some guarantee of work resuming in some finite time.
On 18/07/2023 02.40, Luben Tuikov wrote: > On 2023-07-16 03:51, Asahi Lina wrote: >> On 15/07/2023 16.14, Luben Tuikov wrote: >>> On 2023-07-14 04:21, Asahi Lina wrote: >>>> drm_sched_fini() currently leaves any pending jobs dangling, which >>>> causes segfaults and other badness when job completion fences are >>>> signaled after the scheduler is torn down. >>> >>> If there are pending jobs, ideally we want to call into the driver, >>> so that it can release resources it may be holding for those. >>> The idea behind "pending" is that they are pending in the hardware >>> and we don't know their state until signalled/the callback called. >>> (Or unless the device is reset and we get a notification of that fact.) >> >> That's what the job->free_job() callback does, then the driver is free >> to do whatever it wants with those jobs. A driver could opt to >> synchronously kill those jobs (if it can) or account for them >> separately/asynchronously. >> >> What this patch basically says is that if you destroy a scheduler with >> pending jobs, it immediately considers them terminated with an error, >> and returns ownership back to the driver for freeing. Then the driver >> can decide how to handle the rest and whatever the underlying hardware >> state is. >> >>>> Explicitly detach all jobs from their completion callbacks and free >>>> them. This makes it possible to write a sensible safe abstraction for >>>> drm_sched, without having to externally duplicate the tracking of >>>> in-flight jobs. >>>> >>>> This shouldn't regress any existing drivers, since calling >>>> drm_sched_fini() with any pending jobs is broken and this change should >>>> be a no-op if there are no pending jobs. >>> >>> While this statement is true on its own, it kind of contradicts >>> the premise of the first paragraph. >> >> I mean right *now* it's broken, before this patch. I'm trying to make it >> safe, but it shouldn't regress any exiting drivers since if they trigger >> this code path they are broken today. > > Not sure about other drivers--they can speak for themselves and the CC list > should include them--please use "dim add-missing-cc" and make sure > that the Git commit description contains the Cc tags--then git send-email > will populate the SMTP CC. Feel free to add more Cc tags on top of that. I use `b4 prep -c` which I think does the same thing? I just ran it again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why that one wasn't there. Am I missing anything else? >> >>> >>>> Signed-off-by: Asahi Lina <lina@asahilina.net> >>>> --- >>>> drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++-- >>>> 1 file changed, 30 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>>> index 1f3bc3606239..a4da4aac0efd 100644 >>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init); >>>> void drm_sched_fini(struct drm_gpu_scheduler *sched) >>>> { >>>> struct drm_sched_entity *s_entity; >>>> + struct drm_sched_job *s_job, *tmp; >>>> int i; >>>> >>>> - if (sched->thread) >>>> - kthread_stop(sched->thread); >>>> + if (!sched->thread) >>>> + return; >>>> + >>>> + /* >>>> + * Stop the scheduler, detaching all jobs from their hardware callbacks >>>> + * and cleaning up complete jobs. >>>> + */ >>>> + drm_sched_stop(sched, NULL); >>>> + >>>> + /* >>>> + * Iterate through the pending job list and free all jobs. 
>>>> + * This assumes the driver has either guaranteed jobs are already stopped, or that >>>> + * otherwise it is responsible for keeping any necessary data structures for >>>> + * in-progress jobs alive even when the free_job() callback is called early (e.g. by >>>> + * putting them in its own queue or doing its own refcounting). >>>> + */ >>>> + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { >>>> + spin_lock(&sched->job_list_lock); >>>> + list_del_init(&s_job->list); >>>> + spin_unlock(&sched->job_list_lock); >>>> + >>>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH); >>>> + drm_sched_fence_finished(s_job->s_fence); >>> >>> I'd imagine it's better to rebase this on top of drm-misc-next where >>> drm_sched_fence_finished() takes one more parameter--the error. >> >> Ah, sure! I can do that. > > It's worth posting it as a stand-alone patch. Please make sure to add Cc tags > into the commit description--use "dim add-missing-cc", perhaps also > git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl" > for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl > for affected files.) > > Feel free to post it stand-alone and we'll let the natural review process take over. :-) I already posted this one as part of the bindings RFC and the other one stand-alone, and they got NAKed by Christian, that's why it's a specific series for sched now with the docs, per Daniel's suggestion... now you're saying I should post them stand-alone again... ? >> >>> >>>> + >>>> + WARN_ON(s_job->s_fence->parent); >>>> + sched->ops->free_job(s_job); >>>> + } >>>> + >>>> + kthread_stop(sched->thread); >>>> >>>> for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) { >>>> struct drm_sched_rq *rq = &sched->sched_rq[i]; >>>> >>> >>> Conceptually I don't mind this patch--I see what it is trying to achieve, >>> but technically, we want the driver to detect GPU removal and return shared >>> resources back, such as "jobs", which DRM is also aware of. >> >> I think you missed the context of why I'm doing this, so in short: my > > As a general rule of thumb, in my writing emails I try to avoid using > "you" and "I" as much as possible--it sets this divisive stage, and it > can get misrepresented, especially in email. > > As is the case in research literature, if I absolutely have to use a pronoun--which > rarely happens, I always use "we", and this is the most number of "I"-s I've used > in a long while. > >> use case (like Xe's) involves using a separate drm_sched instance *per >> file* since these queues are scheduled directly by the firmware. So this >> isn't about GPU removal, but rather about a GPU context going away while >> jobs are in flight (e.g. the process got killed). We want that to >> quickly kill the "DRM view" of the world, including signaling all the >> fences with an error and freeing resources like the scheduler itself. >> >> In the case of this particular GPU, there is no known way to actively >> and synchronously abort GPU jobs, so we need to let them run to >> completion (or failure), but we don't want that to block process cleanup >> and freeing a bunch of high-level resources. The driver is architected >> roughly along the lines of a firmware abstraction layer that maps to the >> firmware shared memory structures, and then a layer on top that >> implements the DRM view. When a process gets killed, the DRM side (which >> includes the scheduler, etc.) 
gets torn down immediately, and it makes >> sense to handle this cleanup inside drm_sched since it already has a >> view into what jobs are in flight. Otherwise, I would have to duplicate >> job tracking in the driver (actually worse: in the Rust abstraction for >> safety), which doesn't make much sense. >> >> But what I *do* have in the driver is tracking of the firmware >> structures. So when the drm_sched gets torn down and all the jobs >> killed, the underlying firmware jobs do run to completion, and the >> resources they use are all cleaned up after that (it's all reference >> counted). > > The ref-count definitely helps here. > >> The primitive involved here is that in-flight firmware jobs >> are assigned an event completion slot, and that keeps a reference to >> them from a global array until the events fire and all the jobs are >> known to have completed. This keeps things memory-safe, since we >> absolutely cannot free/destroy firmware structures while they are in use >> (otherwise the firmware crashes, which is fatal on these GPUs - requires >> a full system reboot to recover). >> >> In practice, with the VM map model that we use, what ends up happening >> when a process gets killed is that all the user objects for in-flight >> jobs get unmapped, which usually causes the GPU hardware (not firmware) >> to fault. This then triggers early termination of jobs anyway via the >> firmware fault recovery flow. But even that takes some short amount of >> time, and by then all the drm_sched stuff is long gone and we're just >> dealing with the in-flight firmware stuff. >> >>> In the case where we're initiating the tear, we should notify the driver that >>> we're about to forget jobs (resources), so that it knows to return them back >>> or that it shouldn't notify us for them (since we've notified we're forgetting them.) >> >> That contradicts Christian's comment. I tried to document that (after >> this patch) the scheduler no longer cares about hw fences and whether >> they are signaled or not after it's destroyed, and I got a strongly >> worded NAK for it. Sooo... which is it? Is it okay for drivers not to >> signal the hw fence after a scheduler teardown, or not? > > Christian is correct in that we don't want to hang upstream control > to the whims of a low-level device driver. > >> But really, I don't see a use case for an explicit "about to forget job" >> callback. The job free callback already serves the purpose of telling > > Long time ago, in a galaxy far far away, this was needed in order > to prevent device write-DMA into non-existing (random) memory. As > this is not the case anymore, go with Christian's comment. > >> the driver to clean up resources associated with a job. If it wants to >> synchronously abort things there, it could easily take over its own >> fence signaling and do something with the underlying stuff if the fence >> is not signaled yet. >> >> In my case, since the driver is written in Rust and free_job() just maps >> to the destructor (Drop impl), that just ends up freeing a bunch of >> memory and other objects, and I don't particularly care about the state >> of the firmware side any more after that. The flow is the same whether >> it was a successful job completion, a failure, or an early destruction >> due to the drm_sched getting torn down. >> >>> (Note also that in this latter case, traditionally, the device would be reset, >>> so that we can guarantee that it has forgotten all shared resources which >>> we are to tear up. 
This is somewhat more complicated with GPUs, thus the method >>> pointed out above.) >> >> Yeah, in the firmware scheduling case we can't do this at all unless the >> firmware has an explicit teardown/forget op (which I'm not aware of) and >> a full GPU reset isn't something we can do either. Hence we just let the >> underlying jobs complete. In practice they tend to die pretty quickly >> anyway once all the buffers are unmapped. > > Perhaps in the future, as more complex workloads are deferred to this > hardware and driver, a real-time requirement might be needed for this > "tend to die pretty quickly", that that there's some guarantee of > work resuming in some finite time. That's not something we can control. This hardware is reverse-engineered and we don't get to write the firmware (it's signed). Maybe there is a job cancel op, and maybe we'll find it some day, or maybe not. I've certainly never seen macOS do anything like that, including in very blatant cases like a 30-second compute job. On macOS it kept running to completion even after I killed the process. We can't make the hardware/firmware do something it can't do. At least there's firmware preemption though, so a rogue long-running job shouldn't block everything else. ~~ Lina
On 2023-07-17 18:45, Asahi Lina wrote: > On 18/07/2023 02.40, Luben Tuikov wrote: >> On 2023-07-16 03:51, Asahi Lina wrote: >>> On 15/07/2023 16.14, Luben Tuikov wrote: >>>> On 2023-07-14 04:21, Asahi Lina wrote: >>>>> drm_sched_fini() currently leaves any pending jobs dangling, which >>>>> causes segfaults and other badness when job completion fences are >>>>> signaled after the scheduler is torn down. >>>> >>>> If there are pending jobs, ideally we want to call into the driver, >>>> so that it can release resources it may be holding for those. >>>> The idea behind "pending" is that they are pending in the hardware >>>> and we don't know their state until signalled/the callback called. >>>> (Or unless the device is reset and we get a notification of that fact.) >>> >>> That's what the job->free_job() callback does, then the driver is free >>> to do whatever it wants with those jobs. A driver could opt to >>> synchronously kill those jobs (if it can) or account for them >>> separately/asynchronously. >>> >>> What this patch basically says is that if you destroy a scheduler with >>> pending jobs, it immediately considers them terminated with an error, >>> and returns ownership back to the driver for freeing. Then the driver >>> can decide how to handle the rest and whatever the underlying hardware >>> state is. >>> >>>>> Explicitly detach all jobs from their completion callbacks and free >>>>> them. This makes it possible to write a sensible safe abstraction for >>>>> drm_sched, without having to externally duplicate the tracking of >>>>> in-flight jobs. >>>>> >>>>> This shouldn't regress any existing drivers, since calling >>>>> drm_sched_fini() with any pending jobs is broken and this change should >>>>> be a no-op if there are no pending jobs. >>>> >>>> While this statement is true on its own, it kind of contradicts >>>> the premise of the first paragraph. >>> >>> I mean right *now* it's broken, before this patch. I'm trying to make it >>> safe, but it shouldn't regress any exiting drivers since if they trigger >>> this code path they are broken today. >> >> Not sure about other drivers--they can speak for themselves and the CC list >> should include them--please use "dim add-missing-cc" and make sure >> that the Git commit description contains the Cc tags--then git send-email >> will populate the SMTP CC. Feel free to add more Cc tags on top of that. > > I use `b4 prep -c` which I think does the same thing? I just ran it > again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why > that one wasn't there. Am I missing anything else? Not sure about "b4 prep -c"--using "git send-email" instead, but what is important is to add the Cc: tags as part of the commit message. A "git log" of drm-misc-next shows the proper format. Then maintainers add Link: tag to the correct email thread, which is usually completely automated by "dim" or by "git am", or both. I never do any of this stuff manually and it's all done by tools like "dim", and the such. Sometimes I'd run "scripts/get_maintainer.pl" manually ("dim add-missing-cc" runs that script too), as well as "git blame" and "git log -- <file>" to see if I can add more Cc: tags to the commit message to keep people well informed. Then let "git send-email" add them to the SMTP CC, when the patch is actually emailed out. 
> >>> >>>> >>>>> Signed-off-by: Asahi Lina <lina@asahilina.net> >>>>> --- >>>>> drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++-- >>>>> 1 file changed, 30 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>>>> index 1f3bc3606239..a4da4aac0efd 100644 >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init); >>>>> void drm_sched_fini(struct drm_gpu_scheduler *sched) >>>>> { >>>>> struct drm_sched_entity *s_entity; >>>>> + struct drm_sched_job *s_job, *tmp; >>>>> int i; >>>>> >>>>> - if (sched->thread) >>>>> - kthread_stop(sched->thread); >>>>> + if (!sched->thread) >>>>> + return; >>>>> + >>>>> + /* >>>>> + * Stop the scheduler, detaching all jobs from their hardware callbacks >>>>> + * and cleaning up complete jobs. >>>>> + */ >>>>> + drm_sched_stop(sched, NULL); >>>>> + >>>>> + /* >>>>> + * Iterate through the pending job list and free all jobs. >>>>> + * This assumes the driver has either guaranteed jobs are already stopped, or that >>>>> + * otherwise it is responsible for keeping any necessary data structures for >>>>> + * in-progress jobs alive even when the free_job() callback is called early (e.g. by >>>>> + * putting them in its own queue or doing its own refcounting). >>>>> + */ >>>>> + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { >>>>> + spin_lock(&sched->job_list_lock); >>>>> + list_del_init(&s_job->list); >>>>> + spin_unlock(&sched->job_list_lock); >>>>> + >>>>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH); >>>>> + drm_sched_fence_finished(s_job->s_fence); >>>> >>>> I'd imagine it's better to rebase this on top of drm-misc-next where >>>> drm_sched_fence_finished() takes one more parameter--the error. >>> >>> Ah, sure! I can do that. >> >> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags >> into the commit description--use "dim add-missing-cc", perhaps also >> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl" >> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl >> for affected files.) >> >> Feel free to post it stand-alone and we'll let the natural review process take over. :-) > > I already posted this one as part of the bindings RFC and the other one > stand-alone, and they got NAKed by Christian, that's why it's a specific > series for sched now with the docs, per Daniel's suggestion... now > you're saying I should post them stand-alone again... ? Oh, I see. I don't remember why Christian NAK-ed it--do you have a link by any chance? As I said, conceptually I don't mind this patch as there is some merit to what it is trying to do, but this does beg the question why no drivers seem to have wanted it thus far? However, it is worth nothing that there is some logic to this patch, so I'd say, if driver writers agree with it (we do call their free_job() method after all--do we need to check that it is non-null?), we can at least try to cement whether this is something they think is good to have, or is redundant, or breaks some assumption, and so on.
Am 16.07.23 um 09:51 schrieb Asahi Lina: > On 15/07/2023 16.14, Luben Tuikov wrote: >> On 2023-07-14 04:21, Asahi Lina wrote: >>> drm_sched_fini() currently leaves any pending jobs dangling, which >>> causes segfaults and other badness when job completion fences are >>> signaled after the scheduler is torn down. >> >> If there are pending jobs, ideally we want to call into the driver, >> so that it can release resources it may be holding for those. >> The idea behind "pending" is that they are pending in the hardware >> and we don't know their state until signalled/the callback called. >> (Or unless the device is reset and we get a notification of that fact.) > > That's what the job->free_job() callback does, then the driver is free > to do whatever it wants with those jobs. A driver could opt to > synchronously kill those jobs (if it can) or account for them > separately/asynchronously. > > What this patch basically says is that if you destroy a scheduler with > pending jobs, it immediately considers them terminated with an error, > and returns ownership back to the driver for freeing. Then the driver > can decide how to handle the rest and whatever the underlying hardware > state is. Yeah, and exactly that is absolutely *not* a good idea. Keep in mind that memory management depends on all this stuff and signal a dma_fence always requires that the hw give a go for that. If you want to cleanup a scheduler with pending jobs what needs to happen instead is that the driver cancels the processing and signals the hw fence. > >>> Explicitly detach all jobs from their completion callbacks and free >>> them. This makes it possible to write a sensible safe abstraction for >>> drm_sched, without having to externally duplicate the tracking of >>> in-flight jobs. >>> >>> This shouldn't regress any existing drivers, since calling >>> drm_sched_fini() with any pending jobs is broken and this change should >>> be a no-op if there are no pending jobs. >> >> While this statement is true on its own, it kind of contradicts >> the premise of the first paragraph. > > I mean right *now* it's broken, before this patch. I'm trying to make > it safe, but it shouldn't regress any exiting drivers since if they > trigger this code path they are broken today. Yes and exactly that's intentional. What you can do is to issue a *big* warning here when there are still pending unsignaled hw fences when the driver calls drm_sched_fini(). But setting the scheduler fence to signaled without getting a signaled state from the hw fence is a complete NO-GO. Regards, Christian. > >> >>> Signed-off-by: Asahi Lina <lina@asahilina.net> >>> --- >>> drivers/gpu/drm/scheduler/sched_main.c | 32 >>> ++++++++++++++++++++++++++++++-- >>> 1 file changed, 30 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c >>> b/drivers/gpu/drm/scheduler/sched_main.c >>> index 1f3bc3606239..a4da4aac0efd 100644 >>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init); >>> void drm_sched_fini(struct drm_gpu_scheduler *sched) >>> { >>> struct drm_sched_entity *s_entity; >>> + struct drm_sched_job *s_job, *tmp; >>> int i; >>> - if (sched->thread) >>> - kthread_stop(sched->thread); >>> + if (!sched->thread) >>> + return; >>> + >>> + /* >>> + * Stop the scheduler, detaching all jobs from their hardware >>> callbacks >>> + * and cleaning up complete jobs. 
>>> + */ >>> + drm_sched_stop(sched, NULL); >>> + >>> + /* >>> + * Iterate through the pending job list and free all jobs. >>> + * This assumes the driver has either guaranteed jobs are >>> already stopped, or that >>> + * otherwise it is responsible for keeping any necessary data >>> structures for >>> + * in-progress jobs alive even when the free_job() callback is >>> called early (e.g. by >>> + * putting them in its own queue or doing its own refcounting). >>> + */ >>> + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) { >>> + spin_lock(&sched->job_list_lock); >>> + list_del_init(&s_job->list); >>> + spin_unlock(&sched->job_list_lock); >>> + >>> + dma_fence_set_error(&s_job->s_fence->finished, -ESRCH); >>> + drm_sched_fence_finished(s_job->s_fence); >> >> I'd imagine it's better to rebase this on top of drm-misc-next where >> drm_sched_fence_finished() takes one more parameter--the error. > > Ah, sure! I can do that. > >> >>> + >>> + WARN_ON(s_job->s_fence->parent); >>> + sched->ops->free_job(s_job); >>> + } >>> + >>> + kthread_stop(sched->thread); >>> for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= >>> DRM_SCHED_PRIORITY_MIN; i--) { >>> struct drm_sched_rq *rq = &sched->sched_rq[i]; >>> >> >> Conceptually I don't mind this patch--I see what it is trying to >> achieve, >> but technically, we want the driver to detect GPU removal and return >> shared >> resources back, such as "jobs", which DRM is also aware of. > > I think you missed the context of why I'm doing this, so in short: my > use case (like Xe's) involves using a separate drm_sched instance *per > file* since these queues are scheduled directly by the firmware. So > this isn't about GPU removal, but rather about a GPU context going > away while jobs are in flight (e.g. the process got killed). We want > that to quickly kill the "DRM view" of the world, including signaling > all the fences with an error and freeing resources like the scheduler > itself. > > In the case of this particular GPU, there is no known way to actively > and synchronously abort GPU jobs, so we need to let them run to > completion (or failure), but we don't want that to block process > cleanup and freeing a bunch of high-level resources. The driver is > architected roughly along the lines of a firmware abstraction layer > that maps to the firmware shared memory structures, and then a layer > on top that implements the DRM view. When a process gets killed, the > DRM side (which includes the scheduler, etc.) gets torn down > immediately, and it makes sense to handle this cleanup inside > drm_sched since it already has a view into what jobs are in flight. > Otherwise, I would have to duplicate job tracking in the driver > (actually worse: in the Rust abstraction for safety), which doesn't > make much sense. > > But what I *do* have in the driver is tracking of the firmware > structures. So when the drm_sched gets torn down and all the jobs > killed, the underlying firmware jobs do run to completion, and the > resources they use are all cleaned up after that (it's all reference > counted). The primitive involved here is that in-flight firmware jobs > are assigned an event completion slot, and that keeps a reference to > them from a global array until the events fire and all the jobs are > known to have completed. This keeps things memory-safe, since we > absolutely cannot free/destroy firmware structures while they are in > use (otherwise the firmware crashes, which is fatal on these GPUs - > requires a full system reboot to recover). 
> > In practice, with the VM map model that we use, what ends up happening > when a process gets killed is that all the user objects for in-flight > jobs get unmapped, which usually causes the GPU hardware (not > firmware) to fault. This then triggers early termination of jobs > anyway via the firmware fault recovery flow. But even that takes some > short amount of time, and by then all the drm_sched stuff is long gone > and we're just dealing with the in-flight firmware stuff. > >> In the case where we're initiating the tear, we should notify the >> driver that >> we're about to forget jobs (resources), so that it knows to return >> them back >> or that it shouldn't notify us for them (since we've notified we're >> forgetting them.) > > That contradicts Christian's comment. I tried to document that (after > this patch) the scheduler no longer cares about hw fences and whether > they are signaled or not after it's destroyed, and I got a strongly > worded NAK for it. Sooo... which is it? Is it okay for drivers not to > signal the hw fence after a scheduler teardown, or not? > > But really, I don't see a use case for an explicit "about to forget > job" callback. The job free callback already serves the purpose of > telling the driver to clean up resources associated with a job. If it > wants to synchronously abort things there, it could easily take over > its own fence signaling and do something with the underlying stuff if > the fence is not signaled yet. > > In my case, since the driver is written in Rust and free_job() just > maps to the destructor (Drop impl), that just ends up freeing a bunch > of memory and other objects, and I don't particularly care about the > state of the firmware side any more after that. The flow is the same > whether it was a successful job completion, a failure, or an early > destruction due to the drm_sched getting torn down. > >> (Note also that in this latter case, traditionally, the device would >> be reset, >> so that we can guarantee that it has forgotten all shared resources >> which >> we are to tear up. This is somewhat more complicated with GPUs, thus >> the method >> pointed out above.) > > Yeah, in the firmware scheduling case we can't do this at all unless > the firmware has an explicit teardown/forget op (which I'm not aware > of) and a full GPU reset isn't something we can do either. Hence we > just let the underlying jobs complete. In practice they tend to die > pretty quickly anyway once all the buffers are unmapped. > > ~~ Lina >
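
The requirement Christian states translates into a teardown ordering roughly like the following sketch, in which the driver signals its own hardware fences before destroying the scheduler; all my_* names are hypothetical.

```c
/*
 * Sketch only: the driver cancels processing and signals its hardware
 * fences (with an error) *before* calling drm_sched_fini(), so the
 * scheduler never has to forge a completion.
 */
static void my_queue_destroy(struct my_queue *queue)
{
	struct my_job *job, *tmp;

	list_for_each_entry_safe(job, tmp, &queue->in_flight, node) {
		my_hw_cancel(job);			/* best-effort abort on the hardware */
		dma_fence_set_error(job->hw_fence, -ECANCELED);
		dma_fence_signal(job->hw_fence);	/* hw fence signals first... */
	}

	drm_sched_fini(&queue->sched);			/* ...then the scheduler is torn down */
}
```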
On 2023-07-19 04:45, Christian König wrote:
> On 16.07.23 09:51, Asahi Lina wrote:
>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>> causes segfaults and other badness when job completion fences are
>>>> signaled after the scheduler is torn down.
>>>
>>> If there are pending jobs, ideally we want to call into the driver,
>>> so that it can release resources it may be holding for those.
>>> The idea behind "pending" is that they are pending in the hardware
>>> and we don't know their state until signalled/the callback called.
>>> (Or unless the device is reset and we get a notification of that fact.)
>>
>> That's what the job->free_job() callback does, then the driver is free
>> to do whatever it wants with those jobs. A driver could opt to
>> synchronously kill those jobs (if it can) or account for them
>> separately/asynchronously.
>>
>> What this patch basically says is that if you destroy a scheduler with
>> pending jobs, it immediately considers them terminated with an error,
>> and returns ownership back to the driver for freeing. Then the driver
>> can decide how to handle the rest and whatever the underlying hardware
>> state is.
>
> Yeah, and exactly that is absolutely *not* a good idea. Keep in mind
> that memory management depends on all this stuff, and signaling a
> dma_fence always requires that the hw gives a go for that.
>
> If you want to clean up a scheduler with pending jobs, what needs to
> happen instead is that the driver cancels the processing and signals
> the hw fence.
>
>>
>>>> Explicitly detach all jobs from their completion callbacks and free
>>>> them. This makes it possible to write a sensible safe abstraction for
>>>> drm_sched, without having to externally duplicate the tracking of
>>>> in-flight jobs.
>>>>
>>>> This shouldn't regress any existing drivers, since calling
>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>> be a no-op if there are no pending jobs.
>>>
>>> While this statement is true on its own, it kind of contradicts
>>> the premise of the first paragraph.
>>
>> I mean right *now* it's broken, before this patch. I'm trying to make
>> it safe, but it shouldn't regress any existing drivers since if they
>> trigger this code path they are broken today.
>
> Yes, and exactly that is intentional.
>
> What you can do is to issue a *big* warning here when there are still
> pending unsignaled hw fences when the driver calls drm_sched_fini().
>
> But setting the scheduler fence to signaled without getting a signaled
> state from the hw fence is a complete NO-GO.

Okay, so we have the requirement (how). If we can also get a reason behind
it (why), perhaps we can add the requirement and the reason as a lucid
comment to drm_sched_fini() to come with this patch when reworked, so that
future drivers, whether they be in Rust or C, can take note.

Perhaps this will also help future development in DRM itself.
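[A minimal sketch of the alternative suggested above (warn loudly instead of force-signaling), written as a standalone helper so as not to imply this is the actual drm_sched_fini() implementation; it is an illustration of the idea only.]

/* Sketch only: a helper a reworked drm_sched_fini() might call to warn
 * loudly when a driver tears the scheduler down while jobs with unsignaled
 * hardware fences are still on the pending list, instead of force-signaling
 * the scheduler fences. */
static void warn_on_pending_hw_fences(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *s_job;

	spin_lock(&sched->job_list_lock);
	list_for_each_entry(s_job, &sched->pending_list, list) {
		/* s_fence->parent is the hardware fence of a submitted job. */
		WARN(s_job->s_fence->parent &&
		     !dma_fence_is_signaled(s_job->s_fence->parent),
		     "drm_sched_fini() called with unsignaled hw fence pending\n");
	}
	spin_unlock(&sched->job_list_lock);
}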
July 18, 2023 at 1:14 AM, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
> > > Not sure about other drivers--they can speak for themselves and the CC list
> > > should include them--please use "dim add-missing-cc" and make sure
> > > that the Git commit description contains the Cc tags--then git send-email
> > > will populate the SMTP CC. Feel free to add more Cc tags on top of that.
> >
> > I use `b4 prep -c` which I think does the same thing? I just ran it
> > again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why
> > that one wasn't there. Am I missing anything else?
>
> Not sure about "b4 prep -c"--using "git send-email" instead, but what is
> important is to add the Cc: tags as part of the commit message. A "git log" of
> drm-misc-next shows the proper format. Then maintainers add Link:
> tag to the correct email thread, which is usually completely automated
> by "dim" or by "git am", or both.

It's useful to note here that this is not standard practice across the
entirety of the Linux tree. In general, Cc: trailers are added to individual
commits when get_maintainer.pl wouldn't otherwise include someone in the
recipient list. The "dim" tool mentioned here is specific to the DRM
subsystem (the "d" stands for "DRM"). Since both tools work on git series,
you can use it alongside b4.

DRM folks, if get_maintainer.pl isn't finding someone who should be included
on a series of patches, should the MAINTAINERS file be updated to make it
easier to submit valid patches without needing to know of "dim"?

Regards,
-K
On 2023-07-19 14:16, Konstantin Ryabitsev wrote:
> July 18, 2023 at 1:14 AM, "Luben Tuikov" <luben.tuikov@amd.com> wrote:
>>>> Not sure about other drivers--they can speak for themselves and the CC list
>>>> should include them--please use "dim add-missing-cc" and make sure
>>>> that the Git commit description contains the Cc tags--then git send-email
>>>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>>
>>> I use `b4 prep -c` which I think does the same thing? I just ran it
>>> again and it only added 'linaro-mm-sig@lists.linaro.org', not sure why
>>> that one wasn't there. Am I missing anything else?
>>
>> Not sure about "b4 prep -c"--using "git send-email" instead, but what is
>> important is to add the Cc: tags as part of the commit message. A "git log" of
>> drm-misc-next shows the proper format. Then maintainers add Link:
>> tag to the correct email thread, which is usually completely automated
>> by "dim" or by "git am", or both.
>
> It's useful to note here that this is not standard practice across the
> entirety of the Linux tree. In general, Cc: trailers are added to
> individual commits when get_maintainer.pl wouldn't otherwise include
> someone in the recipient list. The "dim" tool mentioned here is specific
> to the DRM subsystem (the "d" stands for "DRM"). Since both tools work on
> git series, you can use it alongside b4.

In DRM we use "dim"--it's just how we do things and everyone complies with
this. "dim" also includes the Link: tag (which "git am" can also be made to
add), and this adds a certain amount of accountability, which is a good
thing. This is why I suggested that a subsequent version of these patches
include the Cc: tags, which would normally come from "dim add-missing-cc",
which uses "scripts/get_maintainer.pl". DRM maintainers regularly use
`git rebase --exec "dim add-missing-cc" ...'.

> DRM folks, if get_maintainer.pl isn't finding someone who should be
> included on a series of patches, should the MAINTAINERS file be updated
> to make it easier to submit valid patches without needing to know of "dim"?

"scripts/get_maintainer.pl" does consult the MAINTAINERS file. There's been
no immediate need to update the MAINTAINERS file. Sometimes a single
function or a single line in a function (as in some kind of complex
calculation) might be coming from someone who doesn't normally commit to
the subsystem. This is where "git blame" and "git log" are helpful to
inspect and add a Cc: tag with that email to the commit message, and this of
course depends on the nature of the incoming patch.
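[For readers unfamiliar with the workflow discussed above, a brief illustration; the commands are the ones named in this thread, while the subject line and the second Cc: address are placeholders. The Cc: tags end up as trailers at the bottom of the commit message:]

    drm/scheduler: <subject line>

    <commit description>

    Cc: linaro-mm-sig@lists.linaro.org
    Cc: <additional reviewer>
    Signed-off-by: <author>

[and they can be added or checked before sending with either tool:]

    b4 prep -c
    git rebase --exec "dim add-missing-cc" <base-commit>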
On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
> On 2023-07-16 03:51, Asahi Lina wrote:
> > On 15/07/2023 16.14, Luben Tuikov wrote:
> >> On 2023-07-14 04:21, Asahi Lina wrote:
> >>> drm_sched_fini() currently leaves any pending jobs dangling, which
> >>> causes segfaults and other badness when job completion fences are
> >>> signaled after the scheduler is torn down.
> >>
> >> If there are pending jobs, ideally we want to call into the driver,
> >> so that it can release resources it may be holding for those.
> >> The idea behind "pending" is that they are pending in the hardware
> >> and we don't know their state until signalled/the callback called.
> >> (Or unless the device is reset and we get a notification of that fact.)
> >
> > That's what the job->free_job() callback does, then the driver is free
> > to do whatever it wants with those jobs. A driver could opt to
> > synchronously kill those jobs (if it can) or account for them
> > separately/asynchronously.
> >
> > What this patch basically says is that if you destroy a scheduler with
> > pending jobs, it immediately considers them terminated with an error,
> > and returns ownership back to the driver for freeing. Then the driver
> > can decide how to handle the rest and whatever the underlying hardware
> > state is.
> >
> >>> Explicitly detach all jobs from their completion callbacks and free
> >>> them. This makes it possible to write a sensible safe abstraction for
> >>> drm_sched, without having to externally duplicate the tracking of
> >>> in-flight jobs.
> >>>
> >>> This shouldn't regress any existing drivers, since calling
> >>> drm_sched_fini() with any pending jobs is broken and this change should
> >>> be a no-op if there are no pending jobs.
> >>
> >> While this statement is true on its own, it kind of contradicts
> >> the premise of the first paragraph.
> >
> > I mean right *now* it's broken, before this patch. I'm trying to make it
> > safe, but it shouldn't regress any existing drivers since if they trigger
> > this code path they are broken today.
>
> Not sure about other drivers--they can speak for themselves and the CC list
> should include them--please use "dim add-missing-cc" and make sure
> that the Git commit description contains the Cc tags--then git send-email
> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>

Xe doesn't need this as our reference counting scheme doesn't allow
drm_sched_fini to be called when jobs are pending. If we want to tear down
a drm_sched we set the TDR timeout to zero and all pending jobs get cleaned
up that way, the ref of the sched will go to zero, and drm_sched_fini is
called. The caveat here being I think we need a worker to call
drm_sched_fini, as the last ref to the scheduler might be dropped from
within the scheduler main thread.

That being said, I doubt this patch breaks anything in Xe, so I do not have
a real strong opinion on this.
Matt

> >
> >>
> >>> Signed-off-by: Asahi Lina <lina@asahilina.net>
> >>> ---
> >>>  drivers/gpu/drm/scheduler/sched_main.c | 32 ++++++++++++++++++++++++++++++--
> >>>  1 file changed, 30 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>> index 1f3bc3606239..a4da4aac0efd 100644
> >>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>> @@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
> >>>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >>>  {
> >>>  	struct drm_sched_entity *s_entity;
> >>> +	struct drm_sched_job *s_job, *tmp;
> >>>  	int i;
> >>>
> >>> -	if (sched->thread)
> >>> -		kthread_stop(sched->thread);
> >>> +	if (!sched->thread)
> >>> +		return;
> >>> +
> >>> +	/*
> >>> +	 * Stop the scheduler, detaching all jobs from their hardware callbacks
> >>> +	 * and cleaning up complete jobs.
> >>> +	 */
> >>> +	drm_sched_stop(sched, NULL);
> >>> +
> >>> +	/*
> >>> +	 * Iterate through the pending job list and free all jobs.
> >>> +	 * This assumes the driver has either guaranteed jobs are already stopped, or that
> >>> +	 * otherwise it is responsible for keeping any necessary data structures for
> >>> +	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
> >>> +	 * putting them in its own queue or doing its own refcounting).
> >>> +	 */
> >>> +	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
> >>> +		spin_lock(&sched->job_list_lock);
> >>> +		list_del_init(&s_job->list);
> >>> +		spin_unlock(&sched->job_list_lock);
> >>> +
> >>> +		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
> >>> +		drm_sched_fence_finished(s_job->s_fence);
> >>
> >> I'd imagine it's better to rebase this on top of drm-misc-next where
> >> drm_sched_fence_finished() takes one more parameter--the error.
> >
> > Ah, sure! I can do that.
>
> It's worth posting it as a stand-alone patch. Please make sure to add Cc tags
> into the commit description--use "dim add-missing-cc", perhaps also
> git-blame and git-log might help with additional Cc. "scripts/get_maintainer.pl"
> for files unaffected by this commit. (dim add-missing-cc uses get_maintainer.pl
> for affected files.)
>
> Feel free to post it stand-alone and we'll let the natural review process take over. :-)
>
> >
> >>
> >>> +
> >>> +		WARN_ON(s_job->s_fence->parent);
> >>> +		sched->ops->free_job(s_job);
> >>> +	}
> >>> +
> >>> +	kthread_stop(sched->thread);
> >>>
> >>>  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >>>  		struct drm_sched_rq *rq = &sched->sched_rq[i];
> >>>
> >>
> >> Conceptually I don't mind this patch--I see what it is trying to achieve,
> >> but technically, we want the driver to detect GPU removal and return shared
> >> resources back, such as "jobs", which DRM is also aware of.
> >
> > I think you missed the context of why I'm doing this, so in short: my
>
> As a general rule of thumb, in my writing emails I try to avoid using
> "you" and "I" as much as possible--it sets this divisive stage, and it
> can get misrepresented, especially in email.
>
> As is the case in research literature, if I absolutely have to use a
> pronoun--which rarely happens, I always use "we", and this is the most
> number of "I"-s I've used in a long while.
>
> > use case (like Xe's) involves using a separate drm_sched instance *per
> > file* since these queues are scheduled directly by the firmware.
> > So this isn't about GPU removal, but rather about a GPU context going
> > away while jobs are in flight (e.g. the process got killed). We want
> > that to quickly kill the "DRM view" of the world, including signaling
> > all the fences with an error and freeing resources like the scheduler
> > itself.
> >
> > In the case of this particular GPU, there is no known way to actively
> > and synchronously abort GPU jobs, so we need to let them run to
> > completion (or failure), but we don't want that to block process cleanup
> > and freeing a bunch of high-level resources. The driver is architected
> > roughly along the lines of a firmware abstraction layer that maps to the
> > firmware shared memory structures, and then a layer on top that
> > implements the DRM view. When a process gets killed, the DRM side (which
> > includes the scheduler, etc.) gets torn down immediately, and it makes
> > sense to handle this cleanup inside drm_sched since it already has a
> > view into what jobs are in flight. Otherwise, I would have to duplicate
> > job tracking in the driver (actually worse: in the Rust abstraction for
> > safety), which doesn't make much sense.
> >
> > But what I *do* have in the driver is tracking of the firmware
> > structures. So when the drm_sched gets torn down and all the jobs
> > killed, the underlying firmware jobs do run to completion, and the
> > resources they use are all cleaned up after that (it's all reference
> > counted).
>
> The ref-count definitely helps here.
>
> > The primitive involved here is that in-flight firmware jobs
> > are assigned an event completion slot, and that keeps a reference to
> > them from a global array until the events fire and all the jobs are
> > known to have completed. This keeps things memory-safe, since we
> > absolutely cannot free/destroy firmware structures while they are in use
> > (otherwise the firmware crashes, which is fatal on these GPUs - requires
> > a full system reboot to recover).
> >
> > In practice, with the VM map model that we use, what ends up happening
> > when a process gets killed is that all the user objects for in-flight
> > jobs get unmapped, which usually causes the GPU hardware (not firmware)
> > to fault. This then triggers early termination of jobs anyway via the
> > firmware fault recovery flow. But even that takes some short amount of
> > time, and by then all the drm_sched stuff is long gone and we're just
> > dealing with the in-flight firmware stuff.
> >
> >> In the case where we're initiating the tear, we should notify the driver that
> >> we're about to forget jobs (resources), so that it knows to return them back
> >> or that it shouldn't notify us for them (since we've notified we're forgetting them.)
> >
> > That contradicts Christian's comment. I tried to document that (after
> > this patch) the scheduler no longer cares about hw fences and whether
> > they are signaled or not after it's destroyed, and I got a strongly
> > worded NAK for it. Sooo... which is it? Is it okay for drivers not to
> > signal the hw fence after a scheduler teardown, or not?
>
> Christian is correct in that we don't want to hang upstream control
> to the whims of a low-level device driver.
>
> > But really, I don't see a use case for an explicit "about to forget job"
> > callback. The job free callback already serves the purpose of telling
>
> Long time ago, in a galaxy far far away, this was needed in order
> to prevent device write-DMA into non-existing (random) memory.
> As this is not the case anymore, go with Christian's comment.
>
> > the driver to clean up resources associated with a job. If it wants to
> > synchronously abort things there, it could easily take over its own
> > fence signaling and do something with the underlying stuff if the fence
> > is not signaled yet.
> >
> > In my case, since the driver is written in Rust and free_job() just maps
> > to the destructor (Drop impl), that just ends up freeing a bunch of
> > memory and other objects, and I don't particularly care about the state
> > of the firmware side any more after that. The flow is the same whether
> > it was a successful job completion, a failure, or an early destruction
> > due to the drm_sched getting torn down.
> >
> >> (Note also that in this latter case, traditionally, the device would be reset,
> >> so that we can guarantee that it has forgotten all shared resources which
> >> we are to tear up. This is somewhat more complicated with GPUs, thus the method
> >> pointed out above.)
> >
> > Yeah, in the firmware scheduling case we can't do this at all unless the
> > firmware has an explicit teardown/forget op (which I'm not aware of) and
> > a full GPU reset isn't something we can do either. Hence we just let the
> > underlying jobs complete. In practice they tend to die pretty quickly
> > anyway once all the buffers are unmapped.
>
> Perhaps in the future, as more complex workloads are deferred to this
> hardware and driver, a real-time requirement might be needed for this
> "tend to die pretty quickly", so that there's some guarantee of
> work resuming in some finite time.
> --
> Regards,
> Luben
>
> >
> > ~~ Lina
> >
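[A minimal C sketch of the Xe-style teardown flow Matt describes earlier in this thread, with hypothetical driver structures; my_gpu_queue and its fields are illustrative assumptions, not Xe's actual code. The point shown is only that, because the last reference may be dropped from the scheduler's own thread, drm_sched_fini() is deferred to a worker.]

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical per-queue object owning one firmware-scheduled drm_sched. */
struct my_gpu_queue {
	struct drm_gpu_scheduler sched;
	struct kref refcount;
	struct work_struct fini_work;	/* INIT_WORK()'d at creation (not shown) */
};

/* Deferred teardown: drm_sched_fini() must not run on the scheduler's own
 * thread, which may be the context dropping the final reference. */
static void my_queue_fini_work(struct work_struct *work)
{
	struct my_gpu_queue *q = container_of(work, struct my_gpu_queue, fini_work);

	drm_sched_fini(&q->sched);
	kfree(q);
}

static void my_queue_release(struct kref *kref)
{
	struct my_gpu_queue *q = container_of(kref, struct my_gpu_queue, refcount);

	schedule_work(&q->fini_work);
}

/* Teardown entry point: once pending jobs have been reaped (e.g. forced out
 * via a zero TDR timeout), job references drop and the final kref_put()
 * lands here or in the job completion path. */
static void my_queue_kill(struct my_gpu_queue *q)
{
	kref_put(&q->refcount, my_queue_release);
}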
On 2023-08-02 00:06, Matthew Brost wrote:
> On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
>> On 2023-07-16 03:51, Asahi Lina wrote:
>>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>
>>>> If there are pending jobs, ideally we want to call into the driver,
>>>> so that it can release resources it may be holding for those.
>>>> The idea behind "pending" is that they are pending in the hardware
>>>> and we don't know their state until signalled/the callback called.
>>>> (Or unless the device is reset and we get a notification of that fact.)
>>>
>>> That's what the job->free_job() callback does, then the driver is free
>>> to do whatever it wants with those jobs. A driver could opt to
>>> synchronously kill those jobs (if it can) or account for them
>>> separately/asynchronously.
>>>
>>> What this patch basically says is that if you destroy a scheduler with
>>> pending jobs, it immediately considers them terminated with an error,
>>> and returns ownership back to the driver for freeing. Then the driver
>>> can decide how to handle the rest and whatever the underlying hardware
>>> state is.
>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>
>>>> While this statement is true on its own, it kind of contradicts
>>>> the premise of the first paragraph.
>>>
>>> I mean right *now* it's broken, before this patch. I'm trying to make it
>>> safe, but it shouldn't regress any existing drivers since if they trigger
>>> this code path they are broken today.
>>
>> Not sure about other drivers--they can speak for themselves and the CC list
>> should include them--please use "dim add-missing-cc" and make sure
>> that the Git commit description contains the Cc tags--then git send-email
>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>
>
> Xe doesn't need this as our reference counting scheme doesn't allow
> drm_sched_fini to be called when jobs are pending. If we want to tear down
> a drm_sched we set the TDR timeout to zero and all pending jobs get cleaned
> up that way, the ref of the sched will go to zero, and drm_sched_fini is
> called. The caveat here being I think we need a worker to call
> drm_sched_fini, as the last ref to the scheduler might be dropped from
> within the scheduler main thread.
>
> That being said, I doubt this patch breaks anything in Xe, so I do not have
> a real strong opinion on this.

Yes, that's my understanding as well.

If the drivers call drm_sched_fini() then they are responsible for cleaning
up _before_ calling drm_sched_fini(). The Xe driver seems to be doing the
right thing. All in all, since drm_sched_fini() is called by the driver, the
driver is supposed to have cleaned up before calling it, so I don't see much
need for this patch after all.
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1f3bc3606239..a4da4aac0efd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1186,10 +1186,38 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+
+		dma_fence_set_error(&s_job->s_fence->finished, -ESRCH);
+		drm_sched_fence_finished(s_job->s_fence);
+
+		WARN_ON(s_job->s_fence->parent);
+		sched->ops->free_job(s_job);
+	}
+
+	kthread_stop(sched->thread);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
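[For drivers that would rely on the behavior introduced by this patch, a hedged C sketch of the kind of free_job() callback the patch comment assumes; my_job, my_hw_state, and their fields are illustrative, not from any real driver. The point is only that free_job() must be safe to call before the hardware fence signals, by keeping hardware-owned state alive via its own reference.]

#include <linux/kref.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

/* Illustrative types only. */
struct my_hw_state {
	struct kref refcount;
	/* ... firmware/hardware descriptors that must outlive the job ... */
};

struct my_job {
	struct drm_sched_job base;
	struct my_hw_state *hw_state;
};

static void my_hw_state_release(struct kref *kref)
{
	kfree(container_of(kref, struct my_hw_state, refcount));
}

/* free_job() callback that tolerates being called before the hw fence
 * signals: scheduler bookkeeping is dropped here, while hw-owned state is
 * kept alive by the completion path's own reference. */
static void my_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	drm_sched_job_cleanup(sched_job);
	kref_put(&job->hw_state->refcount, my_hw_state_release);
	kfree(job);
}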