[RFC,v3,0/4] drm: Standardize device reset notification

Message ID 20230621005719.836857-1-andrealmeid@igalia.com

André Almeida June 21, 2023, 12:57 a.m. UTC
Hi,

This is a new version of the documentation for DRM device resets. As I dove
deeper into the subject, I started to believe that part of the problem was the
lack of a DRM API to get reset information from the driver. With an API, we can
better standardize reset queries, increase the amount of common code in both
DRM and Mesa, and make it easier to write end-to-end tests.

So this patchset, along with the documentation, comes with a new IOCTL and two
implementations of it for amdgpu and i915 (although just the former was really
tested). This IOCTL uses the "context id" to query reset information, but this
might not be generic enough to be included in a DRM API. At least for amdgpu,
this information is encapsulated by libdrm, so one can't just call the ioctl
directly from the UMD as I was planning to, but a small refactor can be done to
expose the id. Anyway, I'm sharing it as it is to gather feedback on whether
this approach seems workable.
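
As an illustration only, a context-id-based reset query from userspace could
look roughly like the sketch below. This is not the interface from patch 2/4,
which this cover letter does not show: the struct layout, the field names, the
helper name and the request number are all hypothetical placeholders.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * HYPOTHETICAL argument layout, for illustration only. The real
 * definition lives in include/uapi/drm/drm.h of the series.
 */
struct example_drm_get_reset {
	uint32_t ctx_id;      /* in: driver-specific context id */
	uint32_t flags;       /* out: placeholder for reset status bits */
	uint64_t reset_count; /* out: placeholder for a reset counter */
};

/* HYPOTHETICAL request number; 'd' is the DRM ioctl base, 0xCF is made up. */
#define EXAMPLE_IOCTL_GET_RESET _IOWR('d', 0xCF, struct example_drm_get_reset)

/*
 * Ask the driver behind fd for the reset state of one context.
 * Returns 0 on success and fills *out, or a negative errno value.
 */
int example_query_reset(int fd, uint32_t ctx_id,
			struct example_drm_get_reset *out)
{
	struct example_drm_get_reset args;

	memset(&args, 0, sizeof(args));
	args.ctx_id = ctx_id;

	if (ioctl(fd, EXAMPLE_IOCTL_GET_RESET, &args) < 0)
		return -errno;

	*out = args;
	return 0;
}
```

The sketch takes the context id as a plain integer, which is exactly the part
that may not be generic enough: there is no common DRM definition of a context,
so each driver would interpret that id in its own way.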

The amdgpu and i915 implementations are provided as a means of testing and as
examples, not as reference code yet, as the goal is more about the interface
itself than the driver parts.

For the documentation itself, after spending some time reading the reset path in
the kernel and in Mesa, I decided to rewrite it to better reflect how it works,
from bottom to top.

You can check the userspace side of the IOCTL here:
 Mesa: https://gitlab.freedesktop.org/andrealmeid/mesa/-/commit/cd687b22fb32c21b23596c607003e2a495f465
 libdrm: https://gitlab.freedesktop.org/andrealmeid/libdrm/-/commit/b31e5404893ee9a85d1aa67e81c2f58c1dac3c46

For testing, I use this Vulkan app that has an infinite loop in the shader:
https://github.com/andrealmeid/vulkan-triangle-v1

Feedback is welcome!

Thanks,
		André

v2: https://lore.kernel.org/all/20230227204000.56787-1-andrealmeid@igalia.com/
v1: https://lore.kernel.org/all/20230123202646.356592-1-andrealmeid@igalia.com/

André Almeida (4):
  drm/doc: Document DRM device reset expectations
  drm: Create DRM_IOCTL_GET_RESET
  drm/amdgpu: Implement DRM_IOCTL_GET_RESET
  drm/i915: Implement DRM_IOCTL_GET_RESET

 Documentation/gpu/drm-uapi.rst                | 51 ++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 35 +++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h       |  5 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 12 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |  2 +
 drivers/gpu/drm/drm_debugfs.c                 |  2 +
 drivers/gpu/drm/drm_ioctl.c                   | 58 +++++++++++++++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 18 ++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.h   |  2 +
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  2 +
 drivers/gpu/drm/i915/i915_driver.c            |  2 +
 include/drm/drm_device.h                      |  3 +
 include/drm/drm_drv.h                         |  3 +
 include/uapi/drm/drm.h                        | 21 +++++++
 include/uapi/drm/drm_mode.h                   | 15 +++++
 17 files changed, 233 insertions(+), 3 deletions(-)
  

Christian König June 21, 2023, 7:42 a.m. UTC | #1
On 21.06.23 at 02:57, André Almeida wrote:
> Hi,
>
> This is a new version of the documentation for DRM device resets. As I dove
> deeper into the subject, I started to believe that part of the problem was the
> lack of a DRM API to get reset information from the driver. With an API, we can
> better standardize reset queries, increase the amount of common code in both
> DRM and Mesa, and make it easier to write end-to-end tests.
>
> So this patchset, along with the documentation, comes with a new IOCTL and two
> implementations of it for amdgpu and i915 (although just the former was really
> tested). This IOCTL uses the "context id" to query reset information, but this
> might not be generic enough to be included in a DRM API.

Well the basic problem with that is that we don't have a standard DRM 
context defined.

If you want to do this you should probably start there first.

Apart from that this looks like a really really good idea to me, 
especially that we document the reset expectations.

Regards,
Christian.

  
André Almeida June 21, 2023, 3:06 p.m. UTC | #2
On 21/06/2023 04:42, Christian König wrote:
> On 21.06.23 at 02:57, André Almeida wrote:
>> Hi,
>>
>> This is a new version of the documentation for DRM device resets. As I dove
>> deeper into the subject, I started to believe that part of the problem was
>> the lack of a DRM API to get reset information from the driver. With an API,
>> we can better standardize reset queries, increase the amount of common code
>> in both DRM and Mesa, and make it easier to write end-to-end tests.
>>
>> So this patchset, along with the documentation, comes with a new IOCTL and
>> two implementations of it for amdgpu and i915 (although just the former was
>> really tested). This IOCTL uses the "context id" to query reset information,
>> but this might not be generic enough to be included in a DRM API.
> 
> Well the basic problem with that is that we don't have a standard DRM 
> context defined.
> 
> If you want to do this you should probably start there first.

Any idea how to start on this? I tried to find previous work on that,
but I didn't find any.

> 
> Apart from that this looks like a really really good idea to me, 
> especially that we document the reset expectations.

I think I'll submit just the doc for the next version then, given that 
the IOCTL will need a lot of rework.

  
Christian König June 21, 2023, 3:09 p.m. UTC | #3
On 21.06.23 at 17:06, André Almeida wrote:
> On 21/06/2023 04:42, Christian König wrote:
>> On 21.06.23 at 02:57, André Almeida wrote:
>>> Hi,
>>>
>>> This is a new version of the documentation for DRM device resets. As I
>>> dove deeper into the subject, I started to believe that part of the
>>> problem was the lack of a DRM API to get reset information from the
>>> driver. With an API, we can better standardize reset queries, increase
>>> the amount of common code in both DRM and Mesa, and make it easier to
>>> write end-to-end tests.
>>>
>>> So this patchset, along with the documentation, comes with a new IOCTL
>>> and two implementations of it for amdgpu and i915 (although just the
>>> former was really tested). This IOCTL uses the "context id" to query
>>> reset information, but this might not be generic enough to be included
>>> in a DRM API.
>>
>> Well the basic problem with that is that we don't have a standard DRM 
>> context defined.
>>
>> If you want to do this you should probably start there first.
>
> Any idea how to start on this? I tried to find previous work on that,
> but I didn't find any.

I'm not aware of any work in this area; maybe ping the Mesa list as well.

It could be that someone looked into it but never sent anything out.

>
>>
>> Apart from that this looks like a really really good idea to me, 
>> especially that we document the reset expectations.
>
> I think I'll submit just the doc for the next version then, given that 
> the IOCTL will need a lot of rework.

Yeah, agree completely.

Thanks,
Christian.
