Message ID: 20230501185747.33519-1-andrealmeid@igalia.com
Headers
From: André Almeida <andrealmeid@igalia.com>
To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: kernel-dev@igalia.com, alexander.deucher@amd.com, christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com, 'Marek Olšák' <maraeo@gmail.com>, Samuel Pitoiset <samuel.pitoiset@gmail.com>, Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>, Timur Kristóf <timur.kristof@gmail.com>, michel.daenzer@mailbox.org
Subject: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Date: Mon, 1 May 2023 15:57:46 -0300
Series: Add AMDGPU_INFO_GUILTY_APP ioctl
Message
André Almeida
May 1, 2023, 6:57 p.m. UTC
Currently the UMD doesn't have much information about what went wrong during a GPU reset. To help with that, this patch proposes a new IOCTL that can be used to query information about the resources that caused the hang.

The goal of this RFC is to gather feedback about this interface. The Mesa part can be found at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785

The current implementation is racy: if two resets happen (even on different rings), the app will get the last reset information available rather than the one it is looking for. Maybe this can be fixed with a ring_id parameter to query the information for a specific ring, but that also requires an interface to tell the UMD which ring caused the hang.

I know that devcoredump is also used for this kind of information, but I believe that an IOCTL is a better interface between Mesa and Linux than parsing a file whose contents are subject to change.

André Almeida (1):
  drm/amdgpu: Add interface to dump guilty IB on GPU hang

 drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  7 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 29 ++++++++++++++++++++++++
 include/uapi/drm/amdgpu_drm.h            |  7 ++++++
 7 files changed, 52 insertions(+), 1 deletion(-)
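As a rough illustration of the kind of interface being discussed, here is a minimal sketch of what the uapi query could look like. The struct layout and names are hypothetical, inferred from the cover letter and diffstat, not copied from the actual patch:

```c
/* Hypothetical sketch of the proposed query -- names and layout are
 * illustrative only; the real definition lives in the accompanying
 * patch to include/uapi/drm/amdgpu_drm.h. */
#include <stdint.h>

struct drm_amdgpu_info_guilty_app {
	uint64_t ib_gpu_addr;	/* GPU VA of the IB that caused the hang */
	uint32_t ib_size_dw;	/* size of that IB, in dwords */
	uint32_t pad;
};
```

The UMD would presumably pass this through the existing DRM_AMDGPU_INFO path; as the cover letter notes, without a ring identifier the result is racy when multiple resets happen.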
Comments
On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Currently the UMD doesn't have much information about what went wrong
> during a GPU reset. To help with that, this patch proposes a new IOCTL
> that can be used to query information about the resources that caused
> the hang.

If we went with the IOCTL, we'd want to limit this to the guilty process.

> The goal of this RFC is to gather feedback about this interface. The
> Mesa part can be found at
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785
>
> The current implementation is racy: if two resets happen (even on
> different rings), the app will get the last reset information available
> rather than the one it is looking for. Maybe this can be fixed with a
> ring_id parameter to query the information for a specific ring, but
> that also requires an interface to tell the UMD which ring caused the
> hang.

I think you'd want engine type or something like that so mesa knows how
to interpret the IB info. You could store the most recent info in the fd
priv for the guilty app. E.g., see what I did for tracking GPU page
fault info:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/gpu_fault_info_ioctl

> I know that devcoredump is also used for this kind of information, but
> I believe that an IOCTL is a better interface between Mesa and Linux
> than parsing a file whose contents are subject to change.

Can you elaborate a bit on that? Isn't the whole point of devcoredump to
store this sort of information?

Alex

> [SNIP]
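A minimal kernel-side sketch of the fd-priv suggestion above, under the assumption that the job-timeout handler can identify the guilty context; all names here are hypothetical, not from the actual amdgpu code:

```c
/* Hypothetical sketch of per-fd hang info, following the suggestion to
 * store the most recent reset data in the file private data rather than
 * globally. None of these names exist in the actual amdgpu driver. */
#include <stdbool.h>
#include <stdint.h>

struct fpriv_hang_info {
	bool     valid;         /* set by the timeout handler, cleared on query */
	uint32_t engine_type;   /* so the UMD knows how to interpret the IB */
	uint64_t ib_gpu_addr;   /* VA of the IB that timed out */
};

/* Called from the job timeout handler for the guilty context only,
 * which would also sidestep the cover letter's race between rings,
 * since each fd sees only its own hangs. */
static void record_hang_info(struct fpriv_hang_info *info,
			     uint32_t engine_type, uint64_t ib_va)
{
	info->engine_type = engine_type;
	info->ib_gpu_addr = ib_va;
	info->valid = true;
}
```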
Em 01/05/2023 16:24, Alex Deucher escreveu:
> On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@igalia.com> wrote:
>>
>> I know that devcoredump is also used for this kind of information, but
>> I believe that an IOCTL is a better interface between Mesa and Linux
>> than parsing a file whose contents are subject to change.
>
> Can you elaborate a bit on that? Isn't the whole point of devcoredump
> to store this sort of information?

I think that devcoredump is something that you could use to submit to a
bug report as it is, and then people can read/parse it as they want, not
an interface to be read by Mesa... I'm not sure it's something that I
would call an API. But I might be wrong; if you know something that uses
it as an API, please share.

Anyway, relying on that for Mesa would mean that we would need to ensure
stability for the file content and format, making it less flexible to
modify in the future and prone to bugs, while the IOCTL is well defined
and extensible. Maybe the dump from Mesa + devcoredump could be
complementary information for a bug report.
Well, first of all, don't expose the VMID to userspace. The UMD doesn't
know (and shouldn't know) which VMID is used for a submission, since this
is dynamically assigned and can change at any time. For debugging there
is an interface to use a reserved VMID for your debugged process, which
allows associating logs, tracepoints and hw dumps with the work executed
by this specific process.

Then we already have a feedback mechanism in the form of the error number
in the fence. What we still need is an IOCTL to query that.

Regarding how far processing inside the IB was when the issue was
detected, intermediate debug fences are much more reliable than asking
the kernel for that.

Regards,
Christian.

Am 01.05.23 um 20:57 schrieb André Almeida:
> Currently the UMD doesn't have much information about what went wrong
> during a GPU reset. To help with that, this patch proposes a new IOCTL
> that can be used to query information about the resources that caused
> the hang.
> [SNIP]
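For reference, the fence error mentioned above is already readable in-kernel via dma_fence_get_status(); a sketch of how a query IOCTL (the IOCTL itself being hypothetical here) could forward it:

```c
/* Sketch only: dma_fence_get_status() is a real kernel helper, but the
 * idea of forwarding its result through a new query ioctl is the
 * hypothetical part. */
#include <linux/dma-fence.h>

static int query_fence_error(struct dma_fence *fence)
{
	int status = dma_fence_get_status(fence);

	/* < 0: signaled with an error (e.g. set during a GPU reset),
	 *   0: not signaled yet,
	 *   1: signaled without error. */
	return status < 0 ? status : 0;
}
```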
Am 02.05.23 um 03:26 schrieb André Almeida:
> Em 01/05/2023 16:24, Alex Deucher escreveu:
>> Can you elaborate a bit on that? Isn't the whole point of devcoredump
>> to store this sort of information?
>
> I think that devcoredump is something that you could use to submit to a
> bug report as it is, and then people can read/parse it as they want, not
> an interface to be read by Mesa...
> [SNIP]

Neither using an IOCTL nor devcoredump is a good approach for this, since
the values read from the hw registers are completely unreliable. They
could be unavailable because of GFXOFF, or they could be overwritten or
not even updated by the CP in the first place because of a hang, etc.

If you want to track progress inside an IB, what you do instead is to
insert intermediate fence write commands into the IB, e.g. something like
"write value X to location Y" when this point executes.

This way you can not only track how far the IB processed, but also which
stages of processing we were in when the hang occurred, e.g. End of Pipe,
End of Shaders, specific shader stages, etc.

Regards,
Christian.
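To make the suggestion concrete, here is a hedged sketch of such an intermediate write using the PM4 WRITE_DATA packet. The type-3 header and control-dword fields follow the public PM4 definitions for recent gfx generations, but the helper itself is illustrative and the encoding should be double-checked against the target generation:

```c
#include <stdint.h>

#define PKT3(op, count)   ((3u << 30) | ((count) << 16) | ((op) << 8))
#define PKT3_WRITE_DATA   0x37

/* Append "write 'marker' to memory at 'dst_va' when the CP reaches this
 * point" to an IB under construction. After a hang, the last marker
 * that made it to memory tells you how far execution got. */
static void emit_breadcrumb(uint32_t *ib, unsigned *ndw,
			    uint64_t dst_va, uint32_t marker)
{
	ib[(*ndw)++] = PKT3(PKT3_WRITE_DATA, 3);
	ib[(*ndw)++] = (5u << 8) | (1u << 20);      /* dst_sel=memory, wr_confirm */
	ib[(*ndw)++] = (uint32_t)dst_va;            /* address low */
	ib[(*ndw)++] = (uint32_t)(dst_va >> 32);    /* address high */
	ib[(*ndw)++] = marker;                      /* breadcrumb value */
}
```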
On Tue, May 2, 2023 at 11:12 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>
> Hi Christian,
>
> Christian König <christian.koenig@amd.com> ezt írta (időpont: 2023. máj. 2., Ke 9:59):
>> [SNIP]
>> If you want to track progress inside an IB, what you do instead is to
>> insert intermediate fence write commands into the IB, e.g. something
>> like "write value X to location Y" when this point executes.
>>
>> This way you can not only track how far the IB processed, but also
>> which stages of processing we were in when the hang occurred, e.g. End
>> of Pipe, End of Shaders, specific shader stages, etc.
>
> Currently our biggest challenge in the userspace driver is debugging
> "random" GPU hangs. We have many dozens of bug reports from users which
> are like: "play the game for X hours and it will eventually hang the
> GPU". With the currently available tools, it is impossible for us to
> tackle these issues. André's proposal would be a step in improving this
> situation.
>
> We already do something like what you suggest, but there are multiple
> problems with that approach:
>
> 1. We can only submit 1 command buffer at a time because we won't know
>    which IB hung
> 2. We can't use chaining because we don't know where in the IB it hung
> 3. It needs userspace to insert (a lot of) extra commands such as extra
>    synchronization and memory writes
> 4. It doesn't work when GPU recovery is enabled because the information
>    is already gone when we detect the hang
>
> Consequences:
>
> A. It has a huge perf impact, so we can't enable it always
> B. Thanks to the extra synchronization, some issues can't be reproduced
>    when this kind of debugging is enabled
> C. We have to ask users to disable GPU recovery to collect logs for us

I think the problem is that the hang debugging in radv combines too many
things.

The information here can be gotten easily by adding a breadcrumb at the
start of the cmdbuffer to store the IB address (or even just the
cmdbuffer CPU pointer) in the trace buffer (see the sketch after this
message). That should be approximately zero overhead and would give us
the same info as this. I tried to remove (1/2) at some point, because
with a breadcrumb like the above I don't think it is necessary, but I
think Samuel was against it at the time?

As for all the other synchronization, that is for figuring out which
part of the IB hung (e.g. without barriers the IB processing might have
moved past the hanging shader already), and I don't think this kernel
mechanism changes that.

So if we want to make this low overhead, we can do this already without
new kernel support; we just need to rework radv a bit.

> In my opinion, the correct solution to those problems would be if the
> kernel could give userspace the necessary information about a GPU hang
> before a GPU reset. To avoid the massive performance cost, it would be
> best if we could know which IB hung and what were the commands being
> executed when it hung (perhaps pointers to the VA of the commands),
> along with which shaders were in flight (perhaps pointers to the VA of
> the shader binaries).
>
> If such an interface could be created, that would mean we could easily
> query this information and create useful logs of GPU hangs without much
> userspace overhead and without requiring the user to disable GPU resets
> etc.
>
> If it's not possible to do this, we'd appreciate some suggestions on
> how to properly solve this without the massive performance cost and
> without requiring the user to disable GPU recovery.
>
> Side note, it is also extremely difficult to even determine whether the
> problem is in userspace or the kernel. While kernel developers usually
> dismiss all GPU hangs as userspace problems, we've seen many issues
> where the problem was in the kernel (e.g. bugs where wrong voltages
> were set, etc.) - any idea for tackling those kinds of issues is also
> welcome.
>
> Thanks & best regards,
> Timur
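A sketch of the start-of-command-buffer breadcrumb described above, reusing the hypothetical emit_breadcrumb() helper from the earlier sketch; the trace-buffer location and helper names are assumptions, not radv's actual code:

```c
/* Illustrative only: record the IB's GPU VA in a per-queue trace buffer
 * at the very start of the command buffer. After a hang, the trace
 * buffer names the last IB that started executing, without any extra
 * synchronization inside the IB. */
static void begin_cmdbuf_breadcrumb(uint32_t *ib, unsigned *ndw,
				    uint64_t trace_va, uint64_t ib_va)
{
	emit_breadcrumb(ib, ndw, trace_va + 0, (uint32_t)ib_va);
	emit_breadcrumb(ib, ndw, trace_va + 4, (uint32_t)(ib_va >> 32));
}
```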
Hi,

On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> [SNIP]
> You can still submit multiple IBs and even chain them. All you need to
> do is to insert into each IB commands which write to an extra memory
> location with the IB executed and the position inside the IB.
>
> The write data command allows writing as many dwords as you want (up
> to multiple kb). The only potential problem is when you submit the
> same IB multiple times.
>
> And yes, that is of course quite some extra overhead, but I think that
> should be manageable.

Thanks, this sounds doable and would solve the limitation of how many
IBs are submitted at a time. However, it doesn't address the problem
that enabling this sort of debugging will still have extra overhead.

I don't mean the overhead from writing a couple of dwords for the
trace, but rather the overhead from needing to emit flushes or top of
pipe events or whatever else we need so that we can tell which command
hung the GPU.

> > In my opinion, the correct solution to those problems would be if
> > the kernel could give userspace the necessary information about a
> > GPU hang before a GPU reset.
>
> The fundamental problem here is that the kernel doesn't have that
> information either. We know which IB timed out and can potentially do
> a devcoredump when that happens, but that's it.

Is it really not possible to know such a fundamental thing as what the
GPU was doing when it hung? How are we supposed to do any kind of
debugging without knowing that?

I wonder what AMD's Windows driver team is doing with this problem,
surely they must have better tools to deal with GPU hangs?

Best regards,
Timur
On Tue, May 2, 2023 at 9:35 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
> [SNIP]
> Is it really not possible to know such a fundamental thing as what the
> GPU was doing when it hung? How are we supposed to do any kind of
> debugging without knowing that?
>
> I wonder what AMD's Windows driver team is doing with this problem,
> surely they must have better tools to deal with GPU hangs?

For better or worse, most teams internally rely on scan dumps via JTAG,
which sort of limits the usefulness outside of AMD, but also gives you
the exact state of the hardware when it's hung, so the hardware teams
prefer it.

Alex
On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote:
> On Tue, May 2, 2023 at 9:35 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
> > [SNIP]
> > I wonder what AMD's Windows driver team is doing with this problem,
> > surely they must have better tools to deal with GPU hangs?
>
> For better or worse, most teams internally rely on scan dumps via JTAG,
> which sort of limits the usefulness outside of AMD, but also gives you
> the exact state of the hardware when it's hung, so the hardware teams
> prefer it.

How does this approach scale? It's not something we can ask users to do,
and even if all of us in the radv team had a JTAG device, we wouldn't be
able to play every game that users experience random hangs with.
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
> [SNIP]
> How does this approach scale? It's not something we can ask users to
> do, and even if all of us in the radv team had a JTAG device, we
> wouldn't be able to play every game that users experience random hangs
> with.

It doesn't scale or lend itself particularly well to external
development, but that's the current state of affairs.

Alex
Am 02.05.23 um 20:41 schrieb Alex Deucher:
> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>> [SNIP]
>>>> Is it really not possible to know such a fundamental thing as what
>>>> the GPU was doing when it hung? How are we supposed to do any kind
>>>> of debugging without knowing that?

Yes, that's indeed something at least I have tried to figure out for
years as well.

Basically there are two major problems:

1. When the ASIC is hung you can't talk to the firmware engines any
more, and most state is not exposed directly, but just through some
fw/hw interface. Just take a look at how umr reads the shader state
from the SQ. When that block is hung you can't do that any more and
basically have no chance at all to figure out why it's hung.

Same for other engines: I remember once spending a week figuring out
why the UVD block was hung during suspend. It turned out to be a
debugging nightmare because any time you touched any register of that
block the whole system would hang.

2. There are tons of things going on in a pipeline fashion or even
completely in parallel. For example, the CP is just the beginning of a
rather long pipeline which at the end produces a bunch of pixels. In
almost all cases I've seen, you run into a problem somewhere deep in
the pipeline and only very rarely at the beginning.

>>>> I wonder what AMD's Windows driver team is doing with this problem,
>>>> surely they must have better tools to deal with GPU hangs?
>>> For better or worse, most teams internally rely on scan dumps via
>>> JTAG, which sort of limits the usefulness outside of AMD, but also
>>> gives you the exact state of the hardware when it's hung, so the
>>> hardware teams prefer it.
>>
>> How does this approach scale? It's not something we can ask users to
>> do, and even if all of us in the radv team had a JTAG device, we
>> wouldn't be able to play every game that users experience random
>> hangs with.
> It doesn't scale or lend itself particularly well to external
> development, but that's the current state of affairs.

The usual approach seems to be to reproduce a problem in a lab and have
a JTAG attached to give the hw guys a scan dump, and they can then tell
you why something didn't work as expected.

And yes, that absolutely doesn't scale.

Christian.
Am 2023-05-03 um 03:59 schrieb Christian König:
> [SNIP]
> Basically there are two major problems:
>
> 1. When the ASIC is hung you can't talk to the firmware engines any
> more, and most state is not exposed directly, but just through some
> fw/hw interface.
> [SNIP]
> 2. There are tons of things going on in a pipeline fashion or even
> completely in parallel.
> [SNIP]
> The usual approach seems to be to reproduce a problem in a lab and
> have a JTAG attached to give the hw guys a scan dump, and they can
> then tell you why something didn't work as expected.

That's the worst-case scenario, where you're debugging HW or FW issues.
Those should be pretty rare post-bringup. But are there hangs caused by
user mode driver or application bugs that are easier to debug and
probably don't even require a GPU reset? For example, most VM faults can
be handled without hanging the GPU. Similarly, a shader in an endless
loop should not require a full GPU reset. In the KFD compute case,
that's still preemptible and the offending process can be killed with
Ctrl-C or debugged with rocm-gdb.

It's more complicated for graphics because of the more complex pipeline
and the lack of CWSR. But it should still be possible to do some
debugging without JTAG if the problem is in SW and not HW or FW. It's
probably worth improving that debuggability without getting hung up on
the worst case.

Maybe user mode graphics queues will offer a better way of recovering
from these kinds of bugs, if the graphics pipeline can be unstuck
without a GPU reset, just by killing the offending user mode queue.

Regards,
Felix
Am 03.05.23 um 17:08 schrieb Felix Kuehling:
> [SNIP]
> That's the worst-case scenario, where you're debugging HW or FW issues.
> Those should be pretty rare post-bringup. But are there hangs caused by
> user mode driver or application bugs that are easier to debug and
> probably don't even require a GPU reset? For example, most VM faults
> can be handled without hanging the GPU. Similarly, a shader in an
> endless loop should not require a full GPU reset. In the KFD compute
> case, that's still preemptible and the offending process can be killed
> with Ctrl-C or debugged with rocm-gdb.

We also have infinite-loop-in-shader abort for gfx, and page faults are
pretty rare with OpenGL (a bit more often with Vulkan) and can be
handled gracefully on modern hw (they just spam the logs).

The majority of the problems, unfortunately, is that we really get hard
hangs because of some hw issues. Those can be caused by unlucky timing,
power management, or doing things in an order the hw doesn't expect.

Regards,
Christian.

> It's more complicated for graphics because of the more complex pipeline
> and the lack of CWSR. But it should still be possible to do some
> debugging without JTAG if the problem is in SW and not HW or FW. It's
> probably worth improving that debuggability without getting hung up on
> the worst case.
>
> Maybe user mode graphics queues will offer a better way of recovering
> from these kinds of bugs, if the graphics pipeline can be unstuck
> without a GPU reset, just by killing the offending user mode queue.
Hi Felix,

On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
> That's the worst-case scenario, where you're debugging HW or FW issues.
> Those should be pretty rare post-bringup. But are there hangs caused by
> user mode driver or application bugs that are easier to debug and
> probably don't even require a GPU reset?

There are many GPU hangs that gamers experience while playing. We have
dozens of open bug reports against RADV about GPU hangs on various GPU
generations. These usually fall into two categories:

1. When the hang always happens at the same point in a game. These are
painful to debug but manageable.

2. "Random" hangs that happen to users over the course of playing a game
for several hours. It is absolute hell to try to even reproduce, let
alone diagnose, these issues, and this is what we would like to improve.

For these hard-to-diagnose problems, it is already a challenge to
determine whether the problem is in the kernel (e.g. setting wrong
voltages/frequencies) or userspace (e.g. missing some synchronization);
it can even be a game bug that we need to work around.

> For example, most VM faults can be handled without hanging the GPU.
> Similarly, a shader in an endless loop should not require a full GPU
> reset.

This is actually not the case. AFAIK André's test case was an app that
had an infinite loop in a shader.

> It's more complicated for graphics because of the more complex pipeline
> and the lack of CWSR. But it should still be possible to do some
> debugging without JTAG if the problem is in SW and not HW or FW. It's
> probably worth improving that debuggability without getting hung up on
> the worst case.

I agree, and we welcome any constructive suggestion to improve the
situation. It seems like our idea doesn't work if the kernel can't give
us the information we need.

How do we move forward?

Best regards,
Timur
On 03/05/2023 14:08, Marek Olšák wrote:
> GPU hangs are pretty common post-bringup. They are not common per
> user, but if we gather all hangs from all users, we can have lots
> and lots of them.
>
> GPU hangs are indeed not very debuggable. There are however some
> things we can do:
> - Identify the hanging IB by its VA (the kernel should know it)

How can the kernel tell which VA range is being executed? The only
place I found that information is the mmCP_IB1_BASE_* registers, but
as Christian stated in this thread, they cannot be read reliably.

> - Read and parse the IB to detect memory corruption.
> - Print active waves with shader disassembly if SQ isn't hung
> (often it's not).
>
> Determining which packet the CP is stuck on is tricky. The CP has 2
> engines (one frontend and one backend) that work on the same
> command buffer. The frontend engine runs ahead, executes some
> packets and forwards others to the backend engine. Only the
> frontend engine has the command buffer VA somewhere. The backend
> engine only receives packets from the frontend engine via a FIFO,
> so it might not be possible to tell where it's stuck if it's stuck.

Do the two engines run asynchronously, or does the frontend wait for
the backend to finish executing?

> When the gfx pipeline hangs outside of shaders, making a scandump
> seems to be the only way to have a chance at finding out what's
> going wrong, and only AMD-internal versions of hw can be scanned.
>
> Marek
>
> On Wed, May 3, 2023 at 11:23 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>
> On 03.05.23 17:08, Felix Kuehling wrote:
> > On 2023-05-03 03:59, Christian König wrote:
> >> On 02.05.23 20:41, Alex Deucher wrote:
> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
> >>> <timur.kristof@gmail.com> wrote:
> >>>> [SNIP]
> >>>>>>>> In my opinion, the correct solution to those problems would
> >>>>>>>> be if the kernel could give userspace the necessary
> >>>>>>>> information about a GPU hang before a GPU reset.
> >>>>>>>
> >>>>>>> The fundamental problem here is that the kernel doesn't have
> >>>>>>> that information either. We know which IB timed out and can
> >>>>>>> potentially do a devcoredump when that happens, but that's
> >>>>>>> it.
> >>>>>>
> >>>>>> Is it really not possible to know such a fundamental thing as
> >>>>>> what the GPU was doing when it hung? How are we supposed to
> >>>>>> do any kind of debugging without knowing that?
> >>
> >> Yes, that's indeed something at least I have been trying to
> >> figure out for years as well.
> >>
> >> Basically there are two major problems:
> >> 1. When the ASIC is hung you can't talk to the firmware engines
> >> any more, and most state is not exposed directly, but only
> >> through some fw/hw interface. Just take a look at how umr reads
> >> the shader state from the SQ. When that block is hung you can't
> >> do that any more and basically have no chance at all to figure
> >> out why it's hung.
> >>
> >> Same for other engines. I remember once spending a week figuring
> >> out why the UVD block was hung during suspend. It turned out to
> >> be a debugging nightmare because any time you touched any
> >> register of that block, the whole system would hang.
> >>
> >> 2. There are tons of things going on in a pipeline fashion or
> >> even completely in parallel. For example the CP is just the
> >> beginning of a rather long pipeline which at the end produces a
> >> bunch of pixels. In almost all cases I've seen, you run into a
> >> problem somewhere deep in the pipeline and only very rarely at
> >> the beginning.
> >>
> >>>>>> I wonder what AMD's Windows driver team is doing with this
> >>>>>> problem, surely they must have better tools to deal with GPU
> >>>>>> hangs?
> >>>>> For better or worse, most teams internally rely on scan dumps
> >>>>> via JTAG, which sort of limits the usefulness outside of AMD,
> >>>>> but also gives you the exact state of the hardware when it's
> >>>>> hung, so the hardware teams prefer it.
> >>>>>
> >>>> How does this approach scale? It's not something we can ask
> >>>> users to do, and even if all of us in the radv team had a JTAG
> >>>> device, we wouldn't be able to play every game that users
> >>>> experience random hangs with.
> >>> It doesn't scale or lend itself particularly well to external
> >>> development, but that's the current state of affairs.
> >>
> >> The usual approach seems to be to reproduce a problem in a lab
> >> with a JTAG attached, give the hw guys a scan dump, and have
> >> them tell you why something didn't work as expected.
> >
> > That's the worst-case scenario where you're debugging HW or FW
> > issues. Those should be pretty rare post-bringup. But are there
> > hangs caused by user mode driver or application bugs that are
> > easier to debug and probably don't even require a GPU reset? For
> > example most VM faults can be handled without hanging the GPU.
> > Similarly, a shader in an endless loop should not require a full
> > GPU reset. In the KFD compute case, that's still preemptible and
> > the offending process can be killed with Ctrl-C or debugged with
> > rocm-gdb.
>
> We also have an abort for infinite loops in shaders on gfx, and
> page faults are pretty rare with OpenGL (a bit more often with
> Vulkan) and can be handled gracefully on modern hw (they just spam
> the logs).
>
> The majority of the problems are unfortunately hard hangs caused by
> some hw issue. Those can be caused by unlucky timing, power
> management, or doing things in an order the hw doesn't expect.
>
> Regards,
> Christian.
>
> > It's more complicated for graphics because of the more complex
> > pipeline and the lack of CWSR. But it should still be possible to
> > do some debugging without JTAG if the problem is in SW and not HW
> > or FW. It's probably worth improving that debuggability without
> > getting hung up on the worst case.
> >
> > Maybe user mode graphics queues will offer a better way of
> > recovering from these kinds of bugs, if the graphics pipeline can
> > be unstuck without a GPU reset, just by killing the offending
> > user mode queue.
> >
> > Regards,
> > Felix
> >
> >> And yes, that absolutely doesn't scale.
> >>
> >> Christian.
> >>
> >>> Alex
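For illustration, here is a minimal sketch of how the kernel could
sample the IB base that André asks about above, assuming the gc_9_0
register names and the SOC15 read macros; the helper name
sample_cp_ib1_base() is hypothetical, and, as Christian points out in
this thread, the value cannot be trusted once the CP itself is hung:

    #include "amdgpu.h"
    #include "soc15_common.h"
    #include "gc/gc_9_0_offset.h"

    /* Hypothetical helper: read the CP's current IB1 base address.
     * Register names are from the gc_9_0 offset headers; the result
     * is unreliable when the CP is hung, which is exactly the case
     * being debated here.
     */
    static u64 sample_cp_ib1_base(struct amdgpu_device *adev)
    {
            u64 lo = RREG32_SOC15(GC, 0, mmCP_IB1_BASE_LO);
            u64 hi = RREG32_SOC15(GC, 0, mmCP_IB1_BASE_HI);

            return (hi << 32) | lo;
    }

Even when readable, this only gives the IB the frontend engine is
fetching from, not the packet the backend engine is stuck on.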
On 03/05/2023 14:43, Timur Kristóf wrote:
> Hi Felix,
>
> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>> That's the worst-case scenario where you're debugging HW or FW
>> issues. Those should be pretty rare post-bringup. But are there
>> hangs caused by user mode driver or application bugs that are
>> easier to debug and probably don't even require a GPU reset?
>
> There are many GPU hangs that gamers experience while playing. We
> have dozens of open bug reports against RADV about GPU hangs on
> various GPU generations. These usually fall into two categories:
>
> 1. Hangs that always happen at the same point in a game. These are
> painful to debug but manageable.
> 2. "Random" hangs that happen to users over the course of playing a
> game for several hours. It is absolute hell to even reproduce, let
> alone diagnose, these issues, and this is what we would like to
> improve.
>
> For these hard-to-diagnose problems, it is already a challenge to
> determine whether the problem is in the kernel (e.g. setting wrong
> voltages / frequencies) or in userspace (e.g. missing some
> synchronization); it can even be a game bug that we need to work
> around.
>
>> For example most VM faults can be handled without hanging the GPU.
>> Similarly, a shader in an endless loop should not require a full
>> GPU reset.
>
> This is actually not the case. AFAIK André's test case was an app
> that had an infinite loop in a shader.
>

This is the test app if anyone wants to try it out:
https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and
run it.

The kernel calls amdgpu_ring_soft_recovery() when I run my example,
but I'm not sure what a soft recovery means here and whether it is a
full GPU reset or not.

But if we can at least trust the CP registers to dump information
for soft resets, I think that would be an improvement over the
current state.

>> It's more complicated for graphics because of the more complex
>> pipeline and the lack of CWSR. But it should still be possible to
>> do some debugging without JTAG if the problem is in SW and not HW
>> or FW. It's probably worth improving that debuggability without
>> getting hung up on the worst case.
>
> I agree, and we welcome any constructive suggestion to improve the
> situation. It seems like our idea doesn't work if the kernel can't
> give us the information we need.
>
> How do we move forward?
>
> Best regards,
> Timur
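For context on what that soft recovery does, below is a rough sketch
of the gfx9-style path, written from memory of the kernel's per-ring
soft_recovery hook rather than copied verbatim: the driver asks the
SQ to kill the waves of the guilty VMID instead of resetting the
whole GPU. The helper name is hypothetical and field encodings vary
per gfx level:

    #include "amdgpu.h"
    #include "soc15_common.h"
    #include "gc/gc_9_0_offset.h"
    #include "gc/gc_9_0_sh_mask.h"

    /* Sketch of a gfx9-style soft recovery: write a "kill waves"
     * command for the offending VMID into SQ_CMD. This unhangs
     * shaders stuck in endless loops without a full GPU reset.
     */
    static void soft_recovery_kill_waves(struct amdgpu_device *adev,
                                         unsigned int vmid)
    {
            u32 value = 0;

            value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03); /* kill */
            value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
            value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
            value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);

            WREG32_SOC15(GC, 0, mmSQ_CMD, value);
    }

This only helps when the SQ itself is still responsive, which is why
it works for infinite-loop shaders but not for harder hangs.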
On 03.05.23 21:14, André Almeida wrote:
> On 03/05/2023 14:43, Timur Kristóf wrote:
>> Hi Felix,
>>
>> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>>> That's the worst-case scenario where you're debugging HW or FW
>>> issues. Those should be pretty rare post-bringup. But are there
>>> hangs caused by user mode driver or application bugs that are
>>> easier to debug and probably don't even require a GPU reset?
>>
>> There are many GPU hangs that gamers experience while playing. We
>> have dozens of open bug reports against RADV about GPU hangs on
>> various GPU generations. These usually fall into two categories:
>>
>> 1. Hangs that always happen at the same point in a game. These are
>> painful to debug but manageable.
>> 2. "Random" hangs that happen to users over the course of playing
>> a game for several hours. It is absolute hell to even reproduce,
>> let alone diagnose, these issues, and this is what we would like
>> to improve.
>>
>> For these hard-to-diagnose problems, it is already a challenge to
>> determine whether the problem is in the kernel (e.g. setting wrong
>> voltages / frequencies) or in userspace (e.g. missing some
>> synchronization); it can even be a game bug that we need to work
>> around.
>>
>>> For example most VM faults can be handled without hanging the
>>> GPU. Similarly, a shader in an endless loop should not require a
>>> full GPU reset.
>>
>> This is actually not the case. AFAIK André's test case was an app
>> that had an infinite loop in a shader.
>>
>
> This is the test app if anyone wants to try it out:
> https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and
> run it.
>
> The kernel calls amdgpu_ring_soft_recovery() when I run my example,
> but I'm not sure what a soft recovery means here and whether it is
> a full GPU reset or not.

That's just "soft" recovery. In other words, we send the SQ a
command to kill a shader. That usually works for shaders which
contain an endless loop (which is the most common application bug),
but unfortunately not for any other problem.

> But if we can at least trust the CP registers to dump information
> for soft resets, I think that would be an improvement over the
> current state.

Especially for endless loops the CP registers are completely
useless. The CP just prepares the draw commands and all the state
which is then sent to the SQ for execution.

As Marek wrote, we know in the kernel which submission has timed
out, but we can't figure out where inside this submission we are.

>>> It's more complicated for graphics because of the more complex
>>> pipeline and the lack of CWSR. But it should still be possible to
>>> do some debugging without JTAG if the problem is in SW and not HW
>>> or FW. It's probably worth improving that debuggability without
>>> getting hung up on the worst case.
>>
>> I agree, and we welcome any constructive suggestion to improve the
>> situation. It seems like our idea doesn't work if the kernel can't
>> give us the information we need.
>>
>> How do we move forward?

As I said, the best approach to figure out which draw command hangs
is to sprinkle WRITE_DATA commands into your command stream. That's
not much overhead, and at least Bas thinks that this is doable in
RADV with some changes.

For the kernel we can certainly implement devcoredump and allow
writing out register values and other state when a problem happens.

Regards,
Christian.

>>
>> Best regards,
>> Timur
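To make the WRITE_DATA suggestion concrete, here is a rough
userspace-side sketch of the breadcrumb idea, using the PM4 packet
helpers as they appear in Mesa's radv/si code; emit_breadcrumb() and
the breadcrumb buffer are hypothetical, and the field encodings
should be checked against the target gfx level:

    /* Assumes Mesa's radv internals: sid.h for the PKT3/S_370_*
     * macros, radv_cs.h for radeon_emit() and struct radeon_cmdbuf.
     */
    #include "sid.h"
    #include "radv_cs.h"

    /* Hypothetical breadcrumb write between draws: the CP writes a
     * marker to GPU-visible memory as it passes each draw, so after
     * a hang the last marker visible to the CPU tells you roughly
     * which draw the CP reached.
     */
    static void emit_breadcrumb(struct radeon_cmdbuf *cs,
                                uint64_t breadcrumb_va,
                                uint32_t draw_index)
    {
            radeon_emit(cs, PKT3(PKT3_WRITE_DATA, 3, 0));
            radeon_emit(cs, S_370_DST_SEL(V_370_MEM) |
                            S_370_WR_CONFIRM(1) |
                            S_370_ENGINE_SEL(V_370_ME));
            radeon_emit(cs, breadcrumb_va);
            radeon_emit(cs, breadcrumb_va >> 32);
            radeon_emit(cs, draw_index);
    }

On a hang, the breadcrumb buffer is read back with the CPU and the
offending draw is bisected from the last value written.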
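For the kernel side, devcoredump already provides the plumbing. A
minimal sketch, assuming a hypothetical amdgpu_snapshot_state()
helper that serializes whatever register and ring state is still
readable at reset time:

    #include <linux/devcoredump.h>
    #include <linux/vmalloc.h>

    /* Hypothetical: hand a snapshot of GPU state to devcoredump when
     * a reset happens. dev_coredumpv() takes ownership of the
     * vmalloc'ed buffer (it frees it with vfree()) and exposes the
     * dump to userspace via /sys/class/devcoredump/.
     */
    static void hang_coredump(struct amdgpu_device *adev)
    {
            size_t len = 0;
            void *buf = amdgpu_snapshot_state(adev, &len); /* hypothetical */

            if (buf)
                    dev_coredumpv(adev->dev, buf, len, GFP_KERNEL);
    }

Userspace (or a bug-report script) can then pick the dump up from
sysfs and attach it to an issue, which is exactly the kind of
information the RADV developers are asking for in this thread.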