Message ID: 20230501185747.33519-1-andrealmeid@igalia.com
Headers
From: André Almeida <andrealmeid@igalia.com>
To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, 'Marek Olšák' <maraeo@gmail.com>, Samuel Pitoiset <samuel.pitoiset@gmail.com>, Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>, Timur Kristóf <timur.kristof@gmail.com>, michel.daenzer@mailbox.org
Subject: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Date: Mon, 1 May 2023 15:57:46 -0300
Series: Add AMDGPU_INFO_GUILTY_APP ioctl
Message
André Almeida
May 1, 2023, 6:57 p.m. UTC
Currently the UMD doesn't have much information about what went wrong during a GPU reset. To help with that, this patch proposes a new IOCTL that can be used to query information about the resources that caused the hang.

The goal of this RFC is to gather feedback about this interface. The Mesa part can be found at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785

The current implementation is racy: if two resets happen (even on different rings), the app will get the last reset information available rather than the one it is looking for. Maybe this can be fixed with a ring_id parameter to query the information for a specific ring, but that also requires an interface to tell the UMD which ring caused the hang.

I know that devcoredump is also used for this kind of information, but I believe that an IOCTL is a better interface between Mesa and Linux than parsing a file whose contents are subject to change.

André Almeida (1):
  drm/amdgpu: Add interface to dump guilty IB on GPU hang

 drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  7 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 29 ++++++++++++++++++++++++
 include/uapi/drm/amdgpu_drm.h            |  7 ++++++
 7 files changed, 52 insertions(+), 1 deletion(-)
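As a rough illustration of the kind of interface being discussed, here is a minimal sketch of what the uapi query could look like. The struct layout and names are hypothetical, inferred from the cover letter and diffstat, not copied from the actual patch:

```c
/* Hypothetical sketch of the proposed query -- names and layout are
 * illustrative only; the real definition lives in the accompanying
 * patch to include/uapi/drm/amdgpu_drm.h. */
#include <stdint.h>

struct drm_amdgpu_info_guilty_app {
	uint64_t ib_gpu_addr;	/* GPU VA of the IB that caused the hang */
	uint32_t ib_size_dw;	/* size of that IB, in dwords */
	uint32_t pad;
};
```

The UMD would presumably pass this through the existing DRM_AMDGPU_INFO path; as the cover letter notes, without a ring identifier the result is racy when multiple resets happen.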
Comments
On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Currently the UMD doesn't have much information about what went wrong
> during a GPU reset. To help with that, this patch proposes a new IOCTL
> that can be used to query information about the resources that caused
> the hang.

If we went with the IOCTL, we'd want to limit this to the guilty process.

> The goal of this RFC is to gather feedback about this interface. The
> Mesa part can be found at
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785
>
> The current implementation is racy: if two resets happen (even on
> different rings), the app will get the last reset information available
> rather than the one it is looking for. Maybe this can be fixed with a
> ring_id parameter to query the information for a specific ring, but
> that also requires an interface to tell the UMD which ring caused the
> hang.

I think you'd want engine type or something like that so mesa knows how
to interpret the IB info. You could store the most recent info in the fd
priv for the guilty app. E.g., see what I did for tracking GPU page
fault info:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/gpu_fault_info_ioctl

> I know that devcoredump is also used for this kind of information, but
> I believe that an IOCTL is a better interface between Mesa and Linux
> than parsing a file whose contents are subject to change.

Can you elaborate a bit on that? Isn't the whole point of devcoredump to
store this sort of information?

Alex

> [SNIP]
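A minimal kernel-side sketch of the fd-priv suggestion above, under the assumption that the job-timeout handler can identify the guilty context; all names here are hypothetical, not from the actual amdgpu code:

```c
/* Hypothetical sketch of per-fd hang info, following the suggestion to
 * store the most recent reset data in the file private data rather than
 * globally. None of these names exist in the actual amdgpu driver. */
#include <stdbool.h>
#include <stdint.h>

struct fpriv_hang_info {
	bool     valid;         /* set by the timeout handler, cleared on query */
	uint32_t engine_type;   /* so the UMD knows how to interpret the IB */
	uint64_t ib_gpu_addr;   /* VA of the IB that timed out */
};

/* Called from the job timeout handler for the guilty context only,
 * which would also sidestep the cover letter's race between rings,
 * since each fd sees only its own hangs. */
static void record_hang_info(struct fpriv_hang_info *info,
			     uint32_t engine_type, uint64_t ib_va)
{
	info->engine_type = engine_type;
	info->ib_gpu_addr = ib_va;
	info->valid = true;
}
```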
Em 01/05/2023 16:24, Alex Deucher escreveu:
> On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@igalia.com> wrote:
>>
>> I know that devcoredump is also used for this kind of information, but
>> I believe that an IOCTL is a better interface between Mesa and Linux
>> than parsing a file whose contents are subject to change.
>
> Can you elaborate a bit on that? Isn't the whole point of devcoredump
> to store this sort of information?

I think that devcoredump is something that you could use to submit to a
bug report as it is, and then people can read/parse it as they want, not
an interface to be read by Mesa... I'm not sure it's something that I
would call an API. But I might be wrong; if you know something that uses
it as an API, please share.

Anyway, relying on that for Mesa would mean that we would need to ensure
stability for the file content and format, making it less flexible to
modify in the future and prone to bugs, while the IOCTL is well defined
and extensible. Maybe the dump from Mesa + devcoredump could be
complementary information for a bug report.
Well, first of all, don't expose the VMID to userspace. The UMD doesn't
know (and shouldn't know) which VMID is used for a submission, since this
is dynamically assigned and can change at any time. For debugging there
is an interface to use a reserved VMID for your debugged process, which
allows associating logs, tracepoints and hw dumps with the work executed
by this specific process.

Then we already have a feedback mechanism in the form of the error number
in the fence. What we still need is an IOCTL to query that.

Regarding how far processing inside the IB was when the issue was
detected, intermediate debug fences are much more reliable than asking
the kernel for that.

Regards,
Christian.

Am 01.05.23 um 20:57 schrieb André Almeida:
> Currently the UMD doesn't have much information about what went wrong
> during a GPU reset. To help with that, this patch proposes a new IOCTL
> that can be used to query information about the resources that caused
> the hang.
> [SNIP]
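For reference, the fence error mentioned above is already readable in-kernel via dma_fence_get_status(); a sketch of how a query IOCTL (the IOCTL itself being hypothetical here) could forward it:

```c
/* Sketch only: dma_fence_get_status() is a real kernel helper, but the
 * idea of forwarding its result through a new query ioctl is the
 * hypothetical part. */
#include <linux/dma-fence.h>

static int query_fence_error(struct dma_fence *fence)
{
	int status = dma_fence_get_status(fence);

	/* < 0: signaled with an error (e.g. set during a GPU reset),
	 *   0: not signaled yet,
	 *   1: signaled without error. */
	return status < 0 ? status : 0;
}
```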
Am 02.05.23 um 03:26 schrieb André Almeida:
> Em 01/05/2023 16:24, Alex Deucher escreveu:
>> Can you elaborate a bit on that? Isn't the whole point of devcoredump
>> to store this sort of information?
>
> I think that devcoredump is something that you could use to submit to a
> bug report as it is, and then people can read/parse it as they want, not
> an interface to be read by Mesa...
> [SNIP]

Neither using an IOCTL nor devcoredump is a good approach for this, since
the values read from the hw registers are completely unreliable. They
could be unavailable because of GFXOFF, or they could be overwritten or
not even updated by the CP in the first place because of a hang, etc.

If you want to track progress inside an IB, what you do instead is to
insert intermediate fence write commands into the IB, e.g. something like
"write value X to location Y" when this point executes.

This way you can not only track how far the IB processed, but also which
stages of processing we were in when the hang occurred, e.g. End of Pipe,
End of Shaders, specific shader stages, etc.

Regards,
Christian.
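To make the suggestion concrete, here is a hedged sketch of such an intermediate write using the PM4 WRITE_DATA packet. The type-3 header and control-dword fields follow the public PM4 definitions for recent gfx generations, but the helper itself is illustrative and the encoding should be double-checked against the target generation:

```c
#include <stdint.h>

#define PKT3(op, count)   ((3u << 30) | ((count) << 16) | ((op) << 8))
#define PKT3_WRITE_DATA   0x37

/* Append "write 'marker' to memory at 'dst_va' when the CP reaches this
 * point" to an IB under construction. After a hang, the last marker
 * that made it to memory tells you how far execution got. */
static void emit_breadcrumb(uint32_t *ib, unsigned *ndw,
			    uint64_t dst_va, uint32_t marker)
{
	ib[(*ndw)++] = PKT3(PKT3_WRITE_DATA, 3);
	ib[(*ndw)++] = (5u << 8) | (1u << 20);      /* dst_sel=memory, wr_confirm */
	ib[(*ndw)++] = (uint32_t)dst_va;            /* address low */
	ib[(*ndw)++] = (uint32_t)(dst_va >> 32);    /* address high */
	ib[(*ndw)++] = marker;                      /* breadcrumb value */
}
```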
On Tue, May 2, 2023 at 11:12 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>
> Hi Christian,
>
> Christian König <christian.koenig@amd.com> ezt írta (időpont: 2023. máj. 2., Ke 9:59):
>> [SNIP]
>> If you want to track progress inside an IB, what you do instead is to
>> insert intermediate fence write commands into the IB, e.g. something
>> like "write value X to location Y" when this point executes.
>>
>> This way you can not only track how far the IB processed, but also
>> which stages of processing we were in when the hang occurred, e.g. End
>> of Pipe, End of Shaders, specific shader stages, etc.
>
> Currently our biggest challenge in the userspace driver is debugging
> "random" GPU hangs. We have many dozens of bug reports from users which
> are like: "play the game for X hours and it will eventually hang the
> GPU". With the currently available tools, it is impossible for us to
> tackle these issues. André's proposal would be a step in improving this
> situation.
>
> We already do something like what you suggest, but there are multiple
> problems with that approach:
>
> 1. We can only submit 1 command buffer at a time because we won't know
>    which IB hung
> 2. We can't use chaining because we don't know where in the IB it hung
> 3. It needs userspace to insert (a lot of) extra commands such as extra
>    synchronization and memory writes
> 4. It doesn't work when GPU recovery is enabled because the information
>    is already gone when we detect the hang
>
> Consequences:
>
> A. It has a huge perf impact, so we can't enable it always
> B. Thanks to the extra synchronization, some issues can't be reproduced
>    when this kind of debugging is enabled
> C. We have to ask users to disable GPU recovery to collect logs for us

I think the problem is that the hang debugging in radv combines too many
things.

The information here can be gotten easily by adding a breadcrumb at the
start of the cmdbuffer to store the IB address (or even just the
cmdbuffer CPU pointer) in the trace buffer (see the sketch after this
message). That should be approximately zero overhead and would give us
the same info as this. I tried to remove (1/2) at some point, because
with a breadcrumb like the above I don't think it is necessary, but I
think Samuel was against it at the time?

As for all the other synchronization, that is for figuring out which
part of the IB hung (e.g. without barriers the IB processing might have
moved past the hanging shader already), and I don't think this kernel
mechanism changes that.

So if we want to make this low overhead, we can do this already without
new kernel support; we just need to rework radv a bit.

> In my opinion, the correct solution to those problems would be if the
> kernel could give userspace the necessary information about a GPU hang
> before a GPU reset. To avoid the massive performance cost, it would be
> best if we could know which IB hung and what were the commands being
> executed when it hung (perhaps pointers to the VA of the commands),
> along with which shaders were in flight (perhaps pointers to the VA of
> the shader binaries).
>
> If such an interface could be created, that would mean we could easily
> query this information and create useful logs of GPU hangs without much
> userspace overhead and without requiring the user to disable GPU resets
> etc.
>
> If it's not possible to do this, we'd appreciate some suggestions on
> how to properly solve this without the massive performance cost and
> without requiring the user to disable GPU recovery.
>
> Side note, it is also extremely difficult to even determine whether the
> problem is in userspace or the kernel. While kernel developers usually
> dismiss all GPU hangs as userspace problems, we've seen many issues
> where the problem was in the kernel (e.g. bugs where wrong voltages
> were set, etc.) - any idea for tackling those kinds of issues is also
> welcome.
>
> Thanks & best regards,
> Timur
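A sketch of the start-of-command-buffer breadcrumb described above, reusing the hypothetical emit_breadcrumb() helper from the earlier sketch; the trace-buffer location and helper names are assumptions, not radv's actual code:

```c
/* Illustrative only: record the IB's GPU VA in a per-queue trace buffer
 * at the very start of the command buffer. After a hang, the trace
 * buffer names the last IB that started executing, without any extra
 * synchronization inside the IB. */
static void begin_cmdbuf_breadcrumb(uint32_t *ib, unsigned *ndw,
				    uint64_t trace_va, uint64_t ib_va)
{
	emit_breadcrumb(ib, ndw, trace_va + 0, (uint32_t)ib_va);
	emit_breadcrumb(ib, ndw, trace_va + 4, (uint32_t)(ib_va >> 32));
}
```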
Hi,

On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> [SNIP]
> You can still submit multiple IBs and even chain them. All you need to
> do is to insert into each IB commands which write to an extra memory
> location with the IB executed and the position inside the IB.
>
> The write data command allows writing as many dwords as you want (up
> to multiple kb). The only potential problem is when you submit the
> same IB multiple times.
>
> And yes, that is of course quite some extra overhead, but I think that
> should be manageable.

Thanks, this sounds doable and would solve the limitation of how many
IBs are submitted at a time. However, it doesn't address the problem
that enabling this sort of debugging will still have extra overhead.

I don't mean the overhead from writing a couple of dwords for the
trace, but rather the overhead from needing to emit flushes or top of
pipe events or whatever else we need so that we can tell which command
hung the GPU.

> > In my opinion, the correct solution to those problems would be if
> > the kernel could give userspace the necessary information about a
> > GPU hang before a GPU reset.
>
> The fundamental problem here is that the kernel doesn't have that
> information either. We know which IB timed out and can potentially do
> a devcoredump when that happens, but that's it.

Is it really not possible to know such a fundamental thing as what the
GPU was doing when it hung? How are we supposed to do any kind of
debugging without knowing that?

I wonder what AMD's Windows driver team is doing with this problem,
surely they must have better tools to deal with GPU hangs?

Best regards,
Timur
On Tue, May 2, 2023 at 9:35 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
> [SNIP]
> Is it really not possible to know such a fundamental thing as what the
> GPU was doing when it hung? How are we supposed to do any kind of
> debugging without knowing that?
>
> I wonder what AMD's Windows driver team is doing with this problem,
> surely they must have better tools to deal with GPU hangs?

For better or worse, most teams internally rely on scan dumps via JTAG,
which sort of limits the usefulness outside of AMD, but also gives you
the exact state of the hardware when it's hung, so the hardware teams
prefer it.

Alex
On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote:
> On Tue, May 2, 2023 at 9:35 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
> > [SNIP]
> > I wonder what AMD's Windows driver team is doing with this problem,
> > surely they must have better tools to deal with GPU hangs?
>
> For better or worse, most teams internally rely on scan dumps via JTAG,
> which sort of limits the usefulness outside of AMD, but also gives you
> the exact state of the hardware when it's hung, so the hardware teams
> prefer it.

How does this approach scale? It's not something we can ask users to do,
and even if all of us in the radv team had a JTAG device, we wouldn't be
able to play every game that users experience random hangs with.
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
> [SNIP]
> How does this approach scale? It's not something we can ask users to
> do, and even if all of us in the radv team had a JTAG device, we
> wouldn't be able to play every game that users experience random hangs
> with.

It doesn't scale or lend itself particularly well to external
development, but that's the current state of affairs.

Alex
Am 02.05.23 um 20:41 schrieb Alex Deucher:
> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>> [SNIP]
>>>> Is it really not possible to know such a fundamental thing as what
>>>> the GPU was doing when it hung? How are we supposed to do any kind
>>>> of debugging without knowing that?

Yes, that's indeed something at least I have tried to figure out for
years as well.

Basically there are two major problems:

1. When the ASIC is hung you can't talk to the firmware engines any
more, and most state is not exposed directly, but just through some
fw/hw interface. Just take a look at how umr reads the shader state
from the SQ. When that block is hung you can't do that any more and
basically have no chance at all to figure out why it's hung.

Same for other engines: I remember once spending a week figuring out
why the UVD block was hung during suspend. It turned out to be a
debugging nightmare because any time you touched any register of that
block the whole system would hang.

2. There are tons of things going on in a pipeline fashion or even
completely in parallel. For example, the CP is just the beginning of a
rather long pipeline which at the end produces a bunch of pixels. In
almost all cases I've seen, you run into a problem somewhere deep in
the pipeline and only very rarely at the beginning.

>>>> I wonder what AMD's Windows driver team is doing with this problem,
>>>> surely they must have better tools to deal with GPU hangs?
>>> For better or worse, most teams internally rely on scan dumps via
>>> JTAG, which sort of limits the usefulness outside of AMD, but also
>>> gives you the exact state of the hardware when it's hung, so the
>>> hardware teams prefer it.
>>
>> How does this approach scale? It's not something we can ask users to
>> do, and even if all of us in the radv team had a JTAG device, we
>> wouldn't be able to play every game that users experience random
>> hangs with.
> It doesn't scale or lend itself particularly well to external
> development, but that's the current state of affairs.

The usual approach seems to be to reproduce a problem in a lab and have
a JTAG attached to give the hw guys a scan dump, and they can then tell
you why something didn't work as expected.

And yes, that absolutely doesn't scale.

Christian.
Am 2023-05-03 um 03:59 schrieb Christian König:
> [SNIP]
> Basically there are two major problems:
>
> 1. When the ASIC is hung you can't talk to the firmware engines any
> more, and most state is not exposed directly, but just through some
> fw/hw interface.
> [SNIP]
> 2. There are tons of things going on in a pipeline fashion or even
> completely in parallel.
> [SNIP]
> The usual approach seems to be to reproduce a problem in a lab and
> have a JTAG attached to give the hw guys a scan dump, and they can
> then tell you why something didn't work as expected.

That's the worst-case scenario, where you're debugging HW or FW issues.
Those should be pretty rare post-bringup. But are there hangs caused by
user mode driver or application bugs that are easier to debug and
probably don't even require a GPU reset? For example, most VM faults can
be handled without hanging the GPU. Similarly, a shader in an endless
loop should not require a full GPU reset. In the KFD compute case,
that's still preemptible and the offending process can be killed with
Ctrl-C or debugged with rocm-gdb.

It's more complicated for graphics because of the more complex pipeline
and the lack of CWSR. But it should still be possible to do some
debugging without JTAG if the problem is in SW and not HW or FW. It's
probably worth improving that debuggability without getting hung up on
the worst case.

Maybe user mode graphics queues will offer a better way of recovering
from these kinds of bugs, if the graphics pipeline can be unstuck
without a GPU reset, just by killing the offending user mode queue.

Regards,
Felix
Am 03.05.23 um 17:08 schrieb Felix Kuehling:
> [SNIP]
> That's the worst-case scenario, where you're debugging HW or FW issues.
> Those should be pretty rare post-bringup. But are there hangs caused by
> user mode driver or application bugs that are easier to debug and
> probably don't even require a GPU reset? For example, most VM faults
> can be handled without hanging the GPU. Similarly, a shader in an
> endless loop should not require a full GPU reset. In the KFD compute
> case, that's still preemptible and the offending process can be killed
> with Ctrl-C or debugged with rocm-gdb.

We also have infinite-loop-in-shader abort for gfx, and page faults are
pretty rare with OpenGL (a bit more often with Vulkan) and can be
handled gracefully on modern hw (they just spam the logs).

The majority of the problems, unfortunately, is that we really get hard
hangs because of some hw issues. Those can be caused by unlucky timing,
power management, or doing things in an order the hw doesn't expect.

Regards,
Christian.

> It's more complicated for graphics because of the more complex pipeline
> and the lack of CWSR. But it should still be possible to do some
> debugging without JTAG if the problem is in SW and not HW or FW. It's
> probably worth improving that debuggability without getting hung up on
> the worst case.
>
> Maybe user mode graphics queues will offer a better way of recovering
> from these kinds of bugs, if the graphics pipeline can be unstuck
> without a GPU reset, just by killing the offending user mode queue.
Hi Felix,

On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
> That's the worst-case scenario, where you're debugging HW or FW issues.
> Those should be pretty rare post-bringup. But are there hangs caused by
> user mode driver or application bugs that are easier to debug and
> probably don't even require a GPU reset?

There are many GPU hangs that gamers experience while playing. We have
dozens of open bug reports against RADV about GPU hangs on various GPU
generations. These usually fall into two categories:

1. When the hang always happens at the same point in a game. These are
painful to debug but manageable.

2. "Random" hangs that happen to users over the course of playing a game
for several hours. It is absolute hell to try to even reproduce, let
alone diagnose, these issues, and this is what we would like to improve.

For these hard-to-diagnose problems, it is already a challenge to
determine whether the problem is in the kernel (e.g. setting wrong
voltages/frequencies) or userspace (e.g. missing some synchronization);
it can even be a game bug that we need to work around.

> For example, most VM faults can be handled without hanging the GPU.
> Similarly, a shader in an endless loop should not require a full GPU
> reset.

This is actually not the case. AFAIK André's test case was an app that
had an infinite loop in a shader.

> It's more complicated for graphics because of the more complex pipeline
> and the lack of CWSR. But it should still be possible to do some
> debugging without JTAG if the problem is in SW and not HW or FW. It's
> probably worth improving that debuggability without getting hung up on
> the worst case.

I agree, and we welcome any constructive suggestion to improve the
situation. It seems like our idea doesn't work if the kernel can't give
us the information we need.

How do we move forward?

Best regards,
Timur
On 03/05/2023 14:08, Marek Olšák wrote:
> GPU hangs are pretty common post-bringup. They are not common per
> user, but if we gather all hangs from all users, we can have lots
> and lots of them.
>
> GPU hangs are indeed not very debuggable. There are however some
> things we can do:
> - Identify the hanging IB by its VA (the kernel should know it)

How can the kernel tell which VA range is being executed? The only
place I found that information is the mmCP_IB1_BASE_* registers, but
as Christian stated in this thread, they cannot be read reliably.

> - Read and parse the IB to detect memory corruption.
> - Print active waves with shader disassembly if SQ isn't hung
> (often it's not).
>
> Determining which packet the CP is stuck on is tricky. The CP has 2
> engines (one frontend and one backend) that work on the same
> command buffer. The frontend engine runs ahead, executes some
> packets and forwards others to the backend engine. Only the
> frontend engine has the command buffer VA somewhere. The backend
> engine only receives packets from the frontend engine via a FIFO,
> so it might not be possible to tell where it's stuck if it's stuck.

Do the two engines run asynchronously, or does the frontend wait for
the backend to finish executing?

> When the gfx pipeline hangs outside of shaders, making a scandump
> seems to be the only way to have a chance at finding out what's
> going wrong, and only AMD-internal versions of hw can be scanned.
>
> Marek
>
> On Wed, May 3, 2023 at 11:23 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>
> On 03.05.23 17:08, Felix Kuehling wrote:
> > On 2023-05-03 03:59, Christian König wrote:
> >> On 02.05.23 20:41, Alex Deucher wrote:
> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
> >>> <timur.kristof@gmail.com> wrote:
> >>>> [SNIP]
> >>>>>>>> In my opinion, the correct solution to those problems would
> >>>>>>>> be if the kernel could give userspace the necessary
> >>>>>>>> information about a GPU hang before a GPU reset.
> >>>>>>>
> >>>>>>> The fundamental problem here is that the kernel doesn't have
> >>>>>>> that information either. We know which IB timed out and can
> >>>>>>> potentially do a devcoredump when that happens, but that's
> >>>>>>> it.
> >>>>>>
> >>>>>> Is it really not possible to know such a fundamental thing as
> >>>>>> what the GPU was doing when it hung? How are we supposed to
> >>>>>> do any kind of debugging without knowing that?
> >>
> >> Yes, that's indeed something at least I have been trying to
> >> figure out for years as well.
> >>
> >> Basically there are two major problems:
> >> 1. When the ASIC is hung you can't talk to the firmware engines
> >> any more, and most state is not exposed directly, but only
> >> through some fw/hw interface. Just take a look at how umr reads
> >> the shader state from the SQ. When that block is hung you can't
> >> do that any more and basically have no chance at all to figure
> >> out why it's hung.
> >>
> >> Same for other engines. I remember once spending a week figuring
> >> out why the UVD block was hung during suspend. It turned out to
> >> be a debugging nightmare because any time you touched any
> >> register of that block, the whole system would hang.
> >>
> >> 2. There are tons of things going on in a pipeline fashion or
> >> even completely in parallel. For example the CP is just the
> >> beginning of a rather long pipeline which at the end produces a
> >> bunch of pixels. In almost all cases I've seen, you run into a
> >> problem somewhere deep in the pipeline and only very rarely at
> >> the beginning.
> >>
> >>>>>> I wonder what AMD's Windows driver team is doing with this
> >>>>>> problem, surely they must have better tools to deal with GPU
> >>>>>> hangs?
> >>>>> For better or worse, most teams internally rely on scan dumps
> >>>>> via JTAG, which sort of limits the usefulness outside of AMD,
> >>>>> but also gives you the exact state of the hardware when it's
> >>>>> hung, so the hardware teams prefer it.
> >>>>>
> >>>> How does this approach scale? It's not something we can ask
> >>>> users to do, and even if all of us in the radv team had a JTAG
> >>>> device, we wouldn't be able to play every game that users
> >>>> experience random hangs with.
> >>> It doesn't scale or lend itself particularly well to external
> >>> development, but that's the current state of affairs.
> >>
> >> The usual approach seems to be to reproduce a problem in a lab
> >> with a JTAG attached, give the hw guys a scan dump, and have
> >> them tell you why something didn't work as expected.
> >
> > That's the worst-case scenario where you're debugging HW or FW
> > issues. Those should be pretty rare post-bringup. But are there
> > hangs caused by user mode driver or application bugs that are
> > easier to debug and probably don't even require a GPU reset? For
> > example most VM faults can be handled without hanging the GPU.
> > Similarly, a shader in an endless loop should not require a full
> > GPU reset. In the KFD compute case, that's still preemptible and
> > the offending process can be killed with Ctrl-C or debugged with
> > rocm-gdb.
>
> We also have an abort for infinite loops in shaders on gfx, and
> page faults are pretty rare with OpenGL (a bit more often with
> Vulkan) and can be handled gracefully on modern hw (they just spam
> the logs).
>
> The majority of the problems are unfortunately hard hangs caused by
> some hw issue. Those can be caused by unlucky timing, power
> management, or doing things in an order the hw doesn't expect.
>
> Regards,
> Christian.
>
> > It's more complicated for graphics because of the more complex
> > pipeline and the lack of CWSR. But it should still be possible to
> > do some debugging without JTAG if the problem is in SW and not HW
> > or FW. It's probably worth improving that debuggability without
> > getting hung up on the worst case.
> >
> > Maybe user mode graphics queues will offer a better way of
> > recovering from these kinds of bugs, if the graphics pipeline can
> > be unstuck without a GPU reset, just by killing the offending
> > user mode queue.
> >
> > Regards,
> > Felix
> >
> >> And yes, that absolutely doesn't scale.
> >>
> >> Christian.
> >>
> >>> Alex
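For illustration, here is a minimal sketch of how the kernel could
sample the IB base that André asks about above, assuming the gc_9_0
register names and the SOC15 read macros; the helper name
sample_cp_ib1_base() is hypothetical, and, as Christian points out in
this thread, the value cannot be trusted once the CP itself is hung:

    #include "amdgpu.h"
    #include "soc15_common.h"
    #include "gc/gc_9_0_offset.h"

    /* Hypothetical helper: read the CP's current IB1 base address.
     * Register names are from the gc_9_0 offset headers; the result
     * is unreliable when the CP is hung, which is exactly the case
     * being debated here.
     */
    static u64 sample_cp_ib1_base(struct amdgpu_device *adev)
    {
            u64 lo = RREG32_SOC15(GC, 0, mmCP_IB1_BASE_LO);
            u64 hi = RREG32_SOC15(GC, 0, mmCP_IB1_BASE_HI);

            return (hi << 32) | lo;
    }

Even when readable, this only gives the IB the frontend engine is
fetching from, not the packet the backend engine is stuck on.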
On 03/05/2023 14:43, Timur Kristóf wrote:
> Hi Felix,
>
> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>> That's the worst-case scenario where you're debugging HW or FW
>> issues. Those should be pretty rare post-bringup. But are there
>> hangs caused by user mode driver or application bugs that are
>> easier to debug and probably don't even require a GPU reset?
>
> There are many GPU hangs that gamers experience while playing. We
> have dozens of open bug reports against RADV about GPU hangs on
> various GPU generations. These usually fall into two categories:
>
> 1. Hangs that always happen at the same point in a game. These are
> painful to debug but manageable.
> 2. "Random" hangs that happen to users over the course of playing a
> game for several hours. It is absolute hell to even reproduce, let
> alone diagnose, these issues, and this is what we would like to
> improve.
>
> For these hard-to-diagnose problems, it is already a challenge to
> determine whether the problem is in the kernel (e.g. setting wrong
> voltages / frequencies) or in userspace (e.g. missing some
> synchronization); it can even be a game bug that we need to work
> around.
>
>> For example most VM faults can be handled without hanging the GPU.
>> Similarly, a shader in an endless loop should not require a full
>> GPU reset.
>
> This is actually not the case. AFAIK André's test case was an app
> that had an infinite loop in a shader.
>

This is the test app if anyone wants to try it out:
https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and
run it.

The kernel calls amdgpu_ring_soft_recovery() when I run my example,
but I'm not sure what a soft recovery means here and whether it is a
full GPU reset or not.

But if we can at least trust the CP registers to dump information
for soft resets, I think that would be an improvement over the
current state.

>> It's more complicated for graphics because of the more complex
>> pipeline and the lack of CWSR. But it should still be possible to
>> do some debugging without JTAG if the problem is in SW and not HW
>> or FW. It's probably worth improving that debuggability without
>> getting hung up on the worst case.
>
> I agree, and we welcome any constructive suggestion to improve the
> situation. It seems like our idea doesn't work if the kernel can't
> give us the information we need.
>
> How do we move forward?
>
> Best regards,
> Timur
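For context on what that soft recovery does, below is a rough sketch
of the gfx9-style path, written from memory of the kernel's per-ring
soft_recovery hook rather than copied verbatim: the driver asks the
SQ to kill the waves of the guilty VMID instead of resetting the
whole GPU. The helper name is hypothetical and field encodings vary
per gfx level:

    #include "amdgpu.h"
    #include "soc15_common.h"
    #include "gc/gc_9_0_offset.h"
    #include "gc/gc_9_0_sh_mask.h"

    /* Sketch of a gfx9-style soft recovery: write a "kill waves"
     * command for the offending VMID into SQ_CMD. This unhangs
     * shaders stuck in endless loops without a full GPU reset.
     */
    static void soft_recovery_kill_waves(struct amdgpu_device *adev,
                                         unsigned int vmid)
    {
            u32 value = 0;

            value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03); /* kill */
            value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
            value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
            value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);

            WREG32_SOC15(GC, 0, mmSQ_CMD, value);
    }

This only helps when the SQ itself is still responsive, which is why
it works for infinite-loop shaders but not for harder hangs.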
On 03.05.23 21:14, André Almeida wrote:
> On 03/05/2023 14:43, Timur Kristóf wrote:
>> Hi Felix,
>>
>> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>>> That's the worst-case scenario where you're debugging HW or FW
>>> issues. Those should be pretty rare post-bringup. But are there
>>> hangs caused by user mode driver or application bugs that are
>>> easier to debug and probably don't even require a GPU reset?
>>
>> There are many GPU hangs that gamers experience while playing. We
>> have dozens of open bug reports against RADV about GPU hangs on
>> various GPU generations. These usually fall into two categories:
>>
>> 1. Hangs that always happen at the same point in a game. These are
>> painful to debug but manageable.
>> 2. "Random" hangs that happen to users over the course of playing
>> a game for several hours. It is absolute hell to even reproduce,
>> let alone diagnose, these issues, and this is what we would like
>> to improve.
>>
>> For these hard-to-diagnose problems, it is already a challenge to
>> determine whether the problem is in the kernel (e.g. setting wrong
>> voltages / frequencies) or in userspace (e.g. missing some
>> synchronization); it can even be a game bug that we need to work
>> around.
>>
>>> For example most VM faults can be handled without hanging the
>>> GPU. Similarly, a shader in an endless loop should not require a
>>> full GPU reset.
>>
>> This is actually not the case. AFAIK André's test case was an app
>> that had an infinite loop in a shader.
>>
>
> This is the test app if anyone wants to try it out:
> https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and
> run it.
>
> The kernel calls amdgpu_ring_soft_recovery() when I run my example,
> but I'm not sure what a soft recovery means here and whether it is
> a full GPU reset or not.

That's just "soft" recovery. In other words, we send the SQ a
command to kill a shader. That usually works for shaders which
contain an endless loop (which is the most common application bug),
but unfortunately not for any other problem.

> But if we can at least trust the CP registers to dump information
> for soft resets, I think that would be an improvement over the
> current state.

Especially for endless loops the CP registers are completely
useless. The CP just prepares the draw commands and all the state
which is then sent to the SQ for execution.

As Marek wrote, we know in the kernel which submission has timed
out, but we can't figure out where inside this submission we are.

>>> It's more complicated for graphics because of the more complex
>>> pipeline and the lack of CWSR. But it should still be possible to
>>> do some debugging without JTAG if the problem is in SW and not HW
>>> or FW. It's probably worth improving that debuggability without
>>> getting hung up on the worst case.
>>
>> I agree, and we welcome any constructive suggestion to improve the
>> situation. It seems like our idea doesn't work if the kernel can't
>> give us the information we need.
>>
>> How do we move forward?

As I said, the best approach to figure out which draw command hangs
is to sprinkle WRITE_DATA commands into your command stream. That's
not much overhead, and at least Bas thinks that this is doable in
RADV with some changes.

For the kernel we can certainly implement devcoredump and allow
writing out register values and other state when a problem happens.

Regards,
Christian.

>>
>> Best regards,
>> Timur
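To make the WRITE_DATA suggestion concrete, here is a rough
userspace-side sketch of the breadcrumb idea, using the PM4 packet
helpers as they appear in Mesa's radv/si code; emit_breadcrumb() and
the breadcrumb buffer are hypothetical, and the field encodings
should be checked against the target gfx level:

    /* Assumes Mesa's radv internals: sid.h for the PKT3/S_370_*
     * macros, radv_cs.h for radeon_emit() and struct radeon_cmdbuf.
     */
    #include "sid.h"
    #include "radv_cs.h"

    /* Hypothetical breadcrumb write between draws: the CP writes a
     * marker to GPU-visible memory as it passes each draw, so after
     * a hang the last marker visible to the CPU tells you roughly
     * which draw the CP reached.
     */
    static void emit_breadcrumb(struct radeon_cmdbuf *cs,
                                uint64_t breadcrumb_va,
                                uint32_t draw_index)
    {
            radeon_emit(cs, PKT3(PKT3_WRITE_DATA, 3, 0));
            radeon_emit(cs, S_370_DST_SEL(V_370_MEM) |
                            S_370_WR_CONFIRM(1) |
                            S_370_ENGINE_SEL(V_370_ME));
            radeon_emit(cs, breadcrumb_va);
            radeon_emit(cs, breadcrumb_va >> 32);
            radeon_emit(cs, draw_index);
    }

On a hang, the breadcrumb buffer is read back with the CPU and the
offending draw is bisected from the last value written.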
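For the kernel side, devcoredump already provides the plumbing. A
minimal sketch, assuming a hypothetical amdgpu_snapshot_state()
helper that serializes whatever register and ring state is still
readable at reset time:

    #include <linux/devcoredump.h>
    #include <linux/vmalloc.h>

    /* Hypothetical: hand a snapshot of GPU state to devcoredump when
     * a reset happens. dev_coredumpv() takes ownership of the
     * vmalloc'ed buffer (it frees it with vfree()) and exposes the
     * dump to userspace via /sys/class/devcoredump/.
     */
    static void hang_coredump(struct amdgpu_device *adev)
    {
            size_t len = 0;
            void *buf = amdgpu_snapshot_state(adev, &len); /* hypothetical */

            if (buf)
                    dev_coredumpv(adev->dev, buf, len, GFP_KERNEL);
    }

Userspace (or a bug-report script) can then pick the dump up from
sysfs and attach it to an issue, which is exactly the kind of
information the RADV developers are asking for in this thread.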