Message ID | 20230621005719.836857-2-andrealmeid@igalia.com |
---|---|
State | New |
Headers |
From: André Almeida <andrealmeid@igalia.com>
To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, Simon Ser <contact@emersion.fr>, Rob Clark <robdclark@gmail.com>, Pekka Paalanen <ppaalanen@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, Daniel Stone <daniel@fooishbar.org>, 'Marek Olšák' <maraeo@gmail.com>, Dave Airlie <airlied@gmail.com>, Michel Dänzer <michel.daenzer@mailbox.org>, Samuel Pitoiset <samuel.pitoiset@gmail.com>, Timur Kristóf <timur.kristof@gmail.com>, Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>, André Almeida <andrealmeid@igalia.com>
Subject: [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations
Date: Tue, 20 Jun 2023 21:57:16 -0300
Message-ID: <20230621005719.836857-2-andrealmeid@igalia.com>
In-Reply-To: <20230621005719.836857-1-andrealmeid@igalia.com>
References: <20230621005719.836857-1-andrealmeid@igalia.com> |
Series |
drm: Standardize device reset notification
|
|
Commit Message
André Almeida
June 21, 2023, 12:57 a.m. UTC
Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
Comments
On Tue, 20 Jun 2023 21:57:16 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>

Hi André,

nice to see this! I ended up giving lots of grammar comments, but I'm
not a native speaker. Generally it looks good to me.

> ---
>  Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..da4f8a694d8d 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. To recover
> +from this kind of state, sometimes is needed to reset the device. This section

It seems unclear what "this kind of state" refers to, so maybe just write "errors"?

Maybe:

	Some errors require resetting the device in order to make the
	device usable again.

I presume that recovery does not mean that the failed job could recover.

> +describes what's the expectations for DRM and usermode drivers when a device
> +resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD

s/hung/hang/ ?

> +then update it's internal reset tracking to be ready when userspace asks the

updates its

"update reset tracking"... do you mean that KMD records information
about the reset in case userspace asks for it later?

> +kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
> +for that.

At this point, I'm not sure what "reset tracking" or "reset
information" entails. Could something be said about those?

> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if it requires to. The
> +DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriated API error code, as explained in the bellow section about

s/bellow/below/

> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that is using.

that it is using.

> +
> +Graphical APIs provide ways to application to deal with device resets. However,

provide ways for applications to deal with

> +there's no guarantee that the app will be correctly using such features, and UMD
> +can implement policies to close the app if it's a repeating offender, likely in
> +a broken loop. This is done to ensure that it doesn't keeps blocking the user

does not keep

I think contractions are usually avoided in documents, but I'm not
bothering to flag them all.

> +interface to be correctly displayed.

interface from being correctly displayed.

> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
> +tells if a reset has happened, and if so, all the context state is considered
> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
> +will terminate the app when a reset is detected, giving that the contexts are
> +lost and the app won't be able to figure this out and recreate the contexts.

What about GL ES? Is GL_ARB_robustness implemented or even defined there?

What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
be enough from the app? The documented expectation is: "The application
must destroy all contexts and reinitialise OpenGL ES state and objects
to continue rendering."

> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting resets causes
> +-----------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes

What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
reset wipe that buffer? How would that be communicated?

The dmabuf may have originated in another process.


Thanks,
pq
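For readers who want something concrete, the "User Mode Driver" paragraph of
the patch (check for a reset before submitting, then report it to the
application) might look roughly like the C sketch below. It is only an
illustration under assumptions: DRM_IOCTL_GET_RESET is the interface proposed
by this RFC and the thread does not define its arguments, so the sketch
replaces it with a hypothetical fake_kmd_query_reset() that returns a reset
counter. struct fake_kmd, struct umd_ctx and every other name here are made up
for the example and exist in neither the kernel nor Mesa.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the KMD: it only remembers how many resets
 * have happened, which is the simplest form of "reset tracking". */
struct fake_kmd {
    uint64_t reset_count;
};

/* Hypothetical stand-in for the proposed DRM_IOCTL_GET_RESET query. */
static uint64_t fake_kmd_query_reset(const struct fake_kmd *kmd)
{
    return kmd->reset_count;
}

/* Called by the fake KMD after it has recovered from a hang. */
static void fake_kmd_record_reset(struct fake_kmd *kmd)
{
    kmd->reset_count++;
}

/* Hypothetical UMD-level result codes; a real driver would map the lost
 * state to the API's own error (see the Robustness section). */
enum umd_status { UMD_OK = 0, UMD_CONTEXT_LOST = -1 };

struct umd_ctx {
    struct fake_kmd *kmd;
    uint64_t seen_resets; /* reset count observed when the context was created */
    bool lost;            /* sticky: a lost context never becomes valid again */
};

static void umd_ctx_init(struct umd_ctx *ctx, struct fake_kmd *kmd)
{
    assert(ctx != NULL && kmd != NULL);
    ctx->kmd = kmd;
    ctx->seen_resets = fake_kmd_query_reset(kmd);
    ctx->lost = false;
}

/* Check for a reset before submitting new commands, as the patch asks. */
static enum umd_status umd_submit(struct umd_ctx *ctx)
{
    assert(ctx != NULL);
    if (!ctx->lost && fake_kmd_query_reset(ctx->kmd) != ctx->seen_resets)
        ctx->lost = true;
    if (ctx->lost)
        return UMD_CONTEXT_LOST;
    /* A real driver would build and submit the command buffer here. */
    return UMD_OK;
}
```

In this model a context created after the reset starts from the new counter
value and submits normally, while the old context keeps failing. That is the
"recreate the contexts" behaviour the OpenGL and Vulkan sections describe.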
On 21/06/2023 04:58, Pekka Paalanen wrote:
> On Tue, 20 Jun 2023 21:57:16 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
>
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>
> Hi André,
>
> nice to see this! I ended up giving lots of grammar comments, but I'm
> not a native speaker. Generally it looks good to me.

Thank you for your feedback :)

[...]

>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and to perform
>> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD
>
> s/hung/hang/ ?
>
>> +then update it's internal reset tracking to be ready when userspace asks the
>
> updates its
>
> "update reset tracking"... do you mean that KMD records information
> about the reset in case userspace asks for it later?

Yes, kernel drivers do record whenever a reset happens, so they can
report it to userspace when it asks about resets.

For instance, this is the amdgpu implementation of
AMDGPU_CTX_OP_QUERY_STATE2:

https://elixir.bootlin.com/linux/v6.3.8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c#L548

You can see the stored information about resets there.

[...]

>> +OpenGL
>> +~~~~~~
>> +
>> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
>> +tells if a reset has happened, and if so, all the context state is considered
>> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
>> +will terminate the app when a reset is detected, giving that the contexts are
>> +lost and the app won't be able to figure this out and recreate the contexts.
>
> What about GL ES? Is GL_ARB_robustness implemented or even defined there?
>

I found this:
https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt

"Since this is intended to be a version of ARB_robustness for OpenGL ES,
it should be named accordingly."

I can add this to this paragraph.

> What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
> be enough from the app? The documented expectation is: "The application
> must destroy all contexts and reinitialise OpenGL ES state and objects
> to continue rendering."

I couldn't find the spec for EGL_CONTEXT_LOST, but I found the one for
GL_CONTEXT_LOST, which I assume is similar.

GL_CONTEXT_LOST is only returned in some specific commands (that might
cause a polling application to block indefinitely), so I don't think
it's enough, given that we can't guarantee that the application will
call such commands after a reset, thus not being able to notice a reset.

https://registry.khronos.org/OpenGL-Refpages/gl4/html/glGetGraphicsResetStatus.xhtml

[...]

>>   .. _drm_driver_ioctl:
>>
>>   IOCTL Support on Device Nodes
>
> What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
> reset wipe that buffer? How would that be communicated?
>

Yes, it can.

> The dmabuf may have originated in another process.
>

Indeed, I think we might need to add an error code for dmabuf calls so
the buffer user knows that it's invalid now because a reset has happened
in the other device. I will need to read more dmabuf code to figure out
how this would be possible.

>
> Thanks,
> pq
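The point made above, that an application cannot count on stumbling over
GL_CONTEXT_LOST and has to poll for resets itself, can be sketched in C as
follows. This is a model, not real GL code: the reset query is hidden behind a
function pointer that a real application would implement with
glGetGraphicsResetStatus(), and the enum mirrors that function's documented
results (GL_NO_ERROR, GL_GUILTY_CONTEXT_RESET, GL_INNOCENT_CONTEXT_RESET,
GL_UNKNOWN_CONTEXT_RESET) under hypothetical local names. struct app, struct
scripted and the helpers are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the values glGetGraphicsResetStatus() can return. */
enum reset_status { RESET_NONE, RESET_GUILTY, RESET_INNOCENT, RESET_UNKNOWN };

/* Hypothetical application state; a real app would hold GL/EGL handles. */
struct app {
    /* In a real app this would wrap glGetGraphicsResetStatus(). */
    enum reset_status (*query_reset)(void *user);
    void *user;
    int context_generation; /* bumped each time the context is recreated */
    int frames_drawn;
};

/* One iteration of the render loop.
 * Returns 1 if a frame was drawn, 0 if the context had to be recreated. */
static int app_frame(struct app *app)
{
    assert(app != NULL && app->query_reset != NULL);
    if (app->query_reset(app->user) != RESET_NONE) {
        /* All context state is lost: destroy and recreate the context,
         * then re-upload every object before drawing again. */
        app->context_generation++;
        return 0;
    }
    app->frames_drawn++;
    return 1;
}

/* Scripted status source so the loop can be exercised without a GPU. */
struct scripted {
    const enum reset_status *seq;
    size_t len;
    size_t pos;
};

static enum reset_status scripted_query(void *user)
{
    struct scripted *s = user;
    return s->pos < s->len ? s->seq[s->pos++] : RESET_NONE;
}
```

Polling once per frame is a design choice of this sketch. The thread only
establishes that the application must call the query at some point to notice
the reset at all.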
On Wed, 21 Jun 2023 13:28:34 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> On 21/06/2023 04:58, Pekka Paalanen wrote:
> > On Tue, 20 Jun 2023 21:57:16 -0300
> > André Almeida <andrealmeid@igalia.com> wrote:

[...]

> >> +then update it's internal reset tracking to be ready when userspace asks the
> >
> > updates its
> >
> > "update reset tracking"... do you mean that KMD records information
> > about the reset in case userspace asks for it later?
>
> Yes, kernel drivers do record whenever a reset happens, so they can
> report it to userspace when it asks about resets.
>
> For instance, this is the amdgpu implementation of
> AMDGPU_CTX_OP_QUERY_STATE2:
>
> https://elixir.bootlin.com/linux/v6.3.8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c#L548
>
> You can see the stored information about resets there.

Hi André,

right. What I mean is, if I have to ask this, then that implies that
the wording could be more clear.

I don't know if "reset tracking" is some sub-system that is turned on
and off as needed or what updating it would mean.

[...]

> > What about GL ES? Is GL_ARB_robustness implemented or even defined there?
> >
>
> I found this:
> https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
>
> "Since this is intended to be a version of ARB_robustness for OpenGL ES,
> it should be named accordingly."
>
> I can add this to this paragraph.

Yes, please!

I suppose there could be even more extensions with similar benefits, so
maybe these extensions should be mentioned as examples. Right now the
wording sounds like these are the chosen extensions, and if you don't
use one, the process will be terminated.

> > What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
> > be enough from the app? The documented expectation is: "The application
> > must destroy all contexts and reinitialise OpenGL ES state and objects
> > to continue rendering."
>
> I couldn't find the spec for EGL_CONTEXT_LOST, but I found the one for
> GL_CONTEXT_LOST, which I assume is similar.

EGL Version 1.5 - August 27, 2014

Section 2.7 Power Management

	Following a power management event, calls to eglSwapBuffers,
	eglCopyBuffers, or eglMakeCurrent will indicate failure by
	returning EGL_FALSE. The error EGL_CONTEXT_LOST will be
	returned if a power management event has occurred.

	On detection of this error, the application must destroy all
	contexts (by calling eglDestroyContext for each context). To
	continue rendering the application must recreate any contexts
	it requires, and subsequently restore any client API state and
	objects it wishes to use.

It is talking about power management which is not quite GPU reset, but
I see so much similarity that I'd say it doesn't matter which one
actually happened. The only difference is that power management events
are not caused by application bugs, which means that the application
will simply re-initialize and retry, which may result in a reset loop.

You already wrote provision to handle reset loops, and I'm not sure
applications handling EGL_CONTEXT_LOST would/could ever infer that they
are the culprit without using robustness extensions.

I can see how EGL_CONTEXT_LOST could be deemed unsuitable for resets,
too.

>
> GL_CONTEXT_LOST is only returned in some specific commands (that might
> cause a polling application to block indefinitely), so I don't think
> it's enough, given that we can't guarantee that the application will
> call such commands after a reset, thus not being able to notice a reset.
>
> https://registry.khronos.org/OpenGL-Refpages/gl4/html/glGetGraphicsResetStatus.xhtml

Ok, another API for a similar thing.

So in that case, the app does not need to use a robustness extension if
it uses OpenGL 4.5 and bothers to check.

This makes the wording "If robustness is not in use" problematic,
because it seems complicated to determine if robustness is in use in
any particular application. I suppose Mesa would track if the app ever
called glGetGraphicsResetStatus() before drawing after reset?


Thanks,
pq
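The two policy questions raised in this message, whether an application has
ever queried the reset status and whether it keeps triggering resets in a
loop, could be combined in a UMD along the lines of the C sketch below. This
is speculative: the thread does not say how Mesa or any other UMD tracks this,
and struct reset_policy, its fields, the threshold and both functions are
hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum policy_action { POLICY_CONTINUE, POLICY_TERMINATE };

/* Hypothetical per-application bookkeeping kept by the UMD. */
struct reset_policy {
    bool app_ever_queried;      /* app has called the reset-status query at least once */
    unsigned guilty_resets;     /* resets attributed to this app so far */
    unsigned max_guilty_resets; /* assumed threshold before giving up on the app */
};

/* Called by the UMD whenever the app asks for the reset status
 * (glGetGraphicsResetStatus() in the OpenGL case). */
static void policy_note_status_query(struct reset_policy *p)
{
    assert(p != NULL);
    p->app_ever_queried = true;
}

/* Called by the UMD when it learns that a reset affected this app. */
static enum policy_action policy_on_reset(struct reset_policy *p, bool guilty)
{
    assert(p != NULL);
    if (guilty)
        p->guilty_resets++;
    /* An app that never polls cannot notice that its contexts are lost. */
    if (!p->app_ever_queried)
        return POLICY_TERMINATE;
    /* A repeat offender is probably stuck in a broken reset loop. */
    if (p->guilty_resets > p->max_guilty_resets)
        return POLICY_TERMINATE;
    return POLICY_CONTINUE;
}
```

Counting only guilty resets is one possible reading of "repeating offender".
An implementation might equally count resets per time window, which the
thread leaves open.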
Em 22/06/2023 05:12, Pekka Paalanen escreveu: > On Wed, 21 Jun 2023 13:28:34 -0300 > André Almeida <andrealmeid@igalia.com> wrote: > >> Em 21/06/2023 04:58, Pekka Paalanen escreveu: >>> On Tue, 20 Jun 2023 21:57:16 -0300 >>> André Almeida <andrealmeid@igalia.com> wrote: >>> >>>> Create a section that specifies how to deal with DRM device resets for >>>> kernel and userspace drivers. >>>> >>>> Signed-off-by: André Almeida <andrealmeid@igalia.com> >>> >>> Hi André, >>> >>> nice to see this! I ended up giving lots of grammar comments, but I'm >>> not a native speaker. Generally it looks good to me. >> >> Thank you for your feedback :) >> >>> >>>> --- >>>> Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 65 insertions(+) >>>> >>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst >>>> index 65fb3036a580..da4f8a694d8d 100644 >>>> --- a/Documentation/gpu/drm-uapi.rst >>>> +++ b/Documentation/gpu/drm-uapi.rst >>>> @@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for >>>> mmapped regular files. Threads cause additional pain with signal >>>> handling as well. >>>> >>>> +Device reset >>>> +============ >>>> + >>>> +The GPU stack is really complex and is prone to errors, from hardware bugs, >>>> +faulty applications and everything in between the many layers. To recover >>>> +from this kind of state, sometimes is needed to reset the device. This section >>> >>> It seems unclear what "this kind of state" refers to, so maybe just write "errors"? >>> >>> Maybe: >>> >>> Some errors require resetting the device in order to make the >>> device usable again. >>> >>> I presume that recovery does not mean that the failed job could recover. >>> >>>> +describes what's the expectations for DRM and usermode drivers when a device >>>> +resets and how to propagate the reset status. 
>>>> + >>>> +Kernel Mode Driver >>>> +------------------ >>>> + >>>> +The KMD is responsible for checking if the device needs a reset, and to perform >>>> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD >>> >>> s/hung/hang/ ? >>> >>>> +then update it's internal reset tracking to be ready when userspace asks the >>> >>> updates its >>> >>> "update reset tracking"... do you mean that KMD records information >>> about the reset in case userspace asks for it later? >> >> Yes, kernel drivers do annotate whenever a reset happens, so it can >> report to userspace when it asks about resets. >> >> For instance, this is the amdgpu implementation of >> AMDGPU_CTX_OP_QUERY_STATE2: >> >> https://elixir.bootlin.com/linux/v6.3.8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c#L548 >> >> >> You can see there stored information about resets. > > Hi André, > > right. What I mean is, if I have to ask this, then that implies that > the wording could be more clear. > > I don't know if "reset tracking" is some sub-system that is turned on > and off as needed or what updating it would mean. > Understood, I'll rewrite it to be more clear. >>> >>>> +kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET >>>> +for that. >>> >>> At this point, I'm not sure what "reset tracking" or "reset >>> information" entails. Could something be said about those? >>> >> + >>>> +User Mode Driver >>>> +---------------- >>>> + >>>> +The UMD should check before submitting new commands to the KMD if the device has >>>> +been reset, and this can be checked more often if it requires to. The >>>> +DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After >>>> +detecting a reset, UMD will then proceed to report it to the application using >>>> +the appropriated API error code, as explained in the bellow section about >>> >>> s/bellow/below/ >>> >>>> +robustness. 
>>>> + >>>> +Robustness >>>> +---------- >>>> + >>>> +The only way to try to keep an application working after a reset is if it >>>> +complies with the robustness aspects of the graphical API that is using. >>> >>> that it is using. >>> >>>> + >>>> +Graphical APIs provide ways to application to deal with device resets. However, >>> >>> provide ways for applications to deal with >>> >>>> +there's no guarantee that the app will be correctly using such features, and UMD >>>> +can implement policies to close the app if it's a repeating offender, likely in >>>> +a broken loop. This is done to ensure that it doesn't keeps blocking the user >>> >>> does not keep >>> >>> I think contractions are usually avoided in documents, but I'm not >>> bothering to flag them all. >>> >>>> +interface to be correctly displayed. >>> >>> interface from being correctly displayed. >>> >>>> + >>>> +OpenGL >>>> +~~~~~~ >>>> + >>>> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension >>>> +tells if a reset has happened, and if so, all the context state is considered >>>> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD >>>> +will terminate the app when a reset is detected, giving that the contexts are >>>> +lost and the app won't be able to figure this out and recreate the contexts. >>> >>> What about GL ES? Is GL_ARB_robustness implemented or even defined there? >>> >> >> I found this: >> https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt >> >> "Since this is intended to be a version of ARB_robustness for OpenGL ES, >> it should be named accordingly." >> >> I can add this to this paragraph. > > Yes, please! > > I suppose there could be even more extensions with similar benefits, so > maybe these extension should be mentioned as examples. Right now the > wording sounds like these are the chosen extensions, and if you don't > use one, the process will be terminated. 
>
>>
>>> What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
>>> be enough from the app? The documented expectation is: "The application
>>> must destroy all contexts and reinitialise OpenGL ES state and objects
>>> to continue rendering."
>>
>> I couldn't find the spec for EGL_CONTEXT_LOST, but I found for
>> GL_CONTEXT_LOST, which I assume is similar.
>
> EGL Version 1.5 - August 27, 2014
>
> Section 2.7 Power Management
>
>     Following a power management event, calls to eglSwapBuffers,
>     eglCopyBuffers, or eglMakeCurrent will indicate failure by
>     returning EGL_FALSE. The error EGL_CONTEXT_LOST will be
>     returned if a power management event has occurred.
>
>     On detection of this error, the application must destroy all
>     contexts (by calling eglDestroyContext for each context). To
>     continue rendering the application must recreate any contexts
>     it requires, and subsequently restore any client API state and
>     objects it wishes to use.
>
> It is talking about power management which is not quite GPU reset, but
> I see so much similarity that I'd say it doesn't matter which one
> actually happened. The only difference is that power management events
> are not caused by application bugs, which means that the application
> will simply re-initialize and retry, which may result in a reset loop.
>
> You already wrote provision to handle reset loops, and I'm not sure
> applications handling EGL_CONTEXT_LOST would/could ever infer that they
> are the culprit without using robustness extensions.
>
> I can see how EGL_CONTEXT_LOST could be deemed unsuitable for resets,
> too.
>

Indeed, this is tricky. However, I believe that the complex nature of
the stack can lead to situations where an app is causing the hardware
to change its power management settings.
Even though the app isn't doing it on purpose, as stated in the
introduction of this section, resets can be caused by hardware bugs, so
unfortunately apps might not be able to run correctly and will get reset
notifications, and even get terminated.

>>
>> GL_CONTEXT_LOST is only returned in some specific commands (that might
>> cause a polling application to block indefinitely), so I don't think
>> it's enough, given that the we can't guarantee that the application will
>> call such commands after a reset, thus not being able to notice a reset.
>>
>> https://registry.khronos.org/OpenGL-Refpages/gl4/html/glGetGraphicsResetStatus.xhtml
>
> Ok, another API for a similar thing.
>
> So in that case, the app does not need to use a robustness extension if
> it uses OpenGL 4.5 and bothers to check.
>
> This makes the wording "If robustness is not in use" problematic,
> because it seems complicated to determine if robusteness is in use in
> any particular application. I suppose Mesa would track if the app ever
> called glGetGraphicsResetStatus() before drawing after reset?
>
>

I agree, I think tracking glGetGraphicsResetStatus() should be enough,
but I will leave this part of the documentation as follows for now:

If it's _possible to guarantee_ that robustness is not in use [...]

> Thanks,
> pq
>
>>>
>>>> +
>>>> +Vulkan
>>>> +~~~~~~
>>>> +
>>>> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
>>>> +This error code means, among other things, that a device reset has happened and
>>>> +it needs to recreate the contexts to keep going.
>>>> +
>>>> +Reporting resets causes
>>>> +-----------------------
>>>> +
>>>> +Apart from propagating the reset through the stack so apps can recover, it's
>>>> +really useful for driver developers to learn more about what caused the reset in
>>>> +first place. DRM devices should make use of devcoredump to store relevant
>>>> +information about the reset, so this information can be added to user bug
>>>> +reports.
>>>> +
>>>> .. _drm_driver_ioctl:
>>>>
>>>> IOCTL Support on Device Nodes
>>>
>>> What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
>>> reset wipe that buffer? How would that be communicated?
>>>
>>
>> Yes, it can.
>>
>>> The dmabuf may have originated in another process.
>>>
>>
>> Indeed, I think we might need to add an error code for dmabuf calls so
>> the buffer user knows that it's invalid now because a reset has happened
>> in the other device. I will need to read more dmabuf code to make sure
>> how this would be possible.
>>
>>>
>>> Thanks,
>>> pq
>
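
Since the thread asks what "reset tracking" and "reset information" could entail, here is a minimal user-space sketch of one possible counter-based scheme, loosely following the general idea of the amdgpu query linked above: the device keeps a global reset counter, each context remembers the value it saw at creation, and a query reports a reset when the two differ. Every name below (struct toy_device, toy_ctx_query_reset(), the TOY_* flags) is hypothetical and exists only for illustration; this is not kernel code and not the proposed DRM_IOCTL_GET_RESET interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags, for illustration only. */
#define TOY_RESET_HAPPENED (1u << 0) /* device was reset since ctx creation */
#define TOY_RESET_GUILTY   (1u << 1) /* a job from this context caused it */

struct toy_device {
	uint32_t reset_counter; /* incremented once per completed reset */
};

struct toy_ctx {
	struct toy_device *dev;
	uint32_t reset_counter_at_create; /* snapshot taken at creation */
	bool guilty;
};

static void toy_ctx_init(struct toy_ctx *ctx, struct toy_device *dev)
{
	assert(ctx != NULL && dev != NULL);
	ctx->dev = dev;
	ctx->reset_counter_at_create = dev->reset_counter;
	ctx->guilty = false;
}

/* "KMD" side: record that a reset happened; offender may be NULL. */
static void toy_device_reset(struct toy_device *dev, struct toy_ctx *offender)
{
	assert(dev != NULL);
	dev->reset_counter++;
	if (offender != NULL)
		offender->guilty = true;
}

/* What a reset query could report back to the UMD for one context. */
static uint32_t toy_ctx_query_reset(const struct toy_ctx *ctx)
{
	uint32_t flags = 0;

	assert(ctx != NULL && ctx->dev != NULL);
	if (ctx->reset_counter_at_create != ctx->dev->reset_counter)
		flags |= TOY_RESET_HAPPENED;
	if (ctx->guilty)
		flags |= TOY_RESET_GUILTY;
	return flags;
}
```

In this toy model, a UMD would call the query before each submission; a context created after the reset starts from the new counter value and therefore reports no reset, which is one way to express "the app recreated its contexts".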
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..da4f8a694d8d 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
 mmapped regular files. Threads cause additional pain with signal
 handling as well.
 
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. To recover
+from this kind of state, sometimes is needed to reset the device. This section
+describes what's the expectations for DRM and usermode drivers when a device
+resets and how to propagate the reset status.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hung is detected when a job gets stuck executing. KMD
+then update it's internal reset tracking to be ready when userspace asks the
+kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
+for that.
+
+User Mode Driver
+----------------
+
+The UMD should check before submitting new commands to the KMD if the device has
+been reset, and this can be checked more often if it requires to. The
+DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
+detecting a reset, UMD will then proceed to report it to the application using
+the appropriated API error code, as explained in the bellow section about
+robustness.
+
+Robustness
+----------
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that is using.
+
+Graphical APIs provide ways to application to deal with device resets. However,
+there's no guarantee that the app will be correctly using such features, and UMD
+can implement policies to close the app if it's a repeating offender, likely in
+a broken loop. This is done to ensure that it doesn't keeps blocking the user
+interface to be correctly displayed.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
+tells if a reset has happened, and if so, all the context state is considered
+lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
+will terminate the app when a reset is detected, giving that the contexts are
+lost and the app won't be able to figure this out and recreate the contexts.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting resets causes
+-----------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
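
The patch leaves the "repeating offender" policy to the UMD without describing it. As a rough sketch of what such a policy might look like, the C fragment below flags an app once it has triggered a given number of resets inside a sliding time window. The threshold, the window length, and all identifiers (struct toy_reset_policy, toy_policy_note_reset()) are assumptions made up for this example; the patch does not specify any of them, and timestamps are assumed to be monotonic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RESETS 3u      /* hypothetical: resets tolerated per window */
#define TOY_WINDOW_MS  10000u  /* hypothetical: sliding window length */

struct toy_reset_policy {
	uint64_t stamps_ms[TOY_MAX_RESETS]; /* ring buffer of recent reset times */
	unsigned int count;                 /* total resets recorded so far */
};

/*
 * Record a reset observed at now_ms (monotonic milliseconds). Returns true
 * when the last TOY_MAX_RESETS resets all fall within TOY_WINDOW_MS, i.e.
 * the app looks stuck in a reset loop and could be closed.
 */
static bool toy_policy_note_reset(struct toy_reset_policy *p, uint64_t now_ms)
{
	uint64_t oldest;

	assert(p != NULL);
	p->stamps_ms[p->count % TOY_MAX_RESETS] = now_ms;
	p->count++;

	if (p->count < TOY_MAX_RESETS)
		return false;

	/* The slot that would be overwritten next holds the oldest stamp. */
	oldest = p->stamps_ms[p->count % TOY_MAX_RESETS];
	return now_ms - oldest <= TOY_WINDOW_MS;
}
```

A time window rather than a plain lifetime counter keeps a long-running app that hits an occasional, unrelated reset from being closed, while still catching an app that re-submits the same faulty work right after recreating its contexts.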