[drm-next,v4,00/14,RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI

Message ID 20230606223130.6132-1-dakr@redhat.com
Headers
Series DRM GPUVA Manager & Nouveau VM_BIND UAPI |

Message

Danilo Krummrich June 6, 2023, 10:31 p.m. UTC
  Furthermore, with the DRM GPUVA manager it provides a new DRM core feature to
keep track of GPU virtual address (VA) mappings in a more generic way.

The DRM GPUVA manager is indented to help drivers implement userspace-manageable
GPU VA spaces in reference to the Vulkan API. In order to achieve this goal it
serves the following purposes in this context.

    1) Provide infrastructure to track GPU VA allocations and mappings,
       making use of the maple_tree.

    2) Generically connect GPU VA mappings to their backing buffers, in
       particular DRM GEM objects.

    3) Provide a common implementation to perform more complex mapping
       operations on the GPU VA space. In particular splitting and merging
       of GPU VA mappings, e.g. for intersecting mapping requests or partial
       unmap requests.

The new VM_BIND Nouveau UAPI build on top of the DRM GPUVA manager, itself
providing the following new interfaces.

    1) Initialize a GPU VA space via the new DRM_IOCTL_NOUVEAU_VM_INIT ioctl
       for UMDs to specify the portion of VA space managed by the kernel and
       userspace, respectively.

    2) Allocate and free a VA space region as well as bind and unbind memory
       to the GPUs VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl.

    3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl.

Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC, make use of the DRM
scheduler to queue jobs and support asynchronous processing with DRM syncobjs
as synchronization mechanism.

By default DRM_IOCTL_NOUVEAU_VM_BIND does synchronous processing,
DRM_IOCTL_NOUVEAU_EXEC supports asynchronous processing only.

The new VM_BIND UAPI for Nouveau makes also use of drm_exec (execution context
for GEM buffers) by Christian König. Since the patch implementing drm_exec was
not yet merged into drm-next it is part of this series, as well as a small fix
for this patch, which was found while testing this series.

This patch series is also available at [1].

There is a Mesa NVK merge request by Dave Airlie [2] implementing the
corresponding userspace parts for this series.

The Vulkan CTS test suite passes the sparse binding and sparse residency test
cases for the new UAPI together with Dave's Mesa work.

There are also some test cases in the igt-gpu-tools project [3] for the new UAPI
and hence the DRM GPU VA manager. However, most of them are testing the DRM GPU
VA manager's logic through Nouveau's new UAPI and should be considered just as
helper for implementation.

However, I absolutely intend to change those test cases to proper kunit test
cases for the DRM GPUVA manager, once and if we agree on it's usefulness and
design.

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/new-uapi-drm-next /
    https://gitlab.freedesktop.org/nouvelles/kernel/-/merge_requests/1
[2] https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/150/
[3] https://gitlab.freedesktop.org/dakr/igt-gpu-tools/-/tree/wip_nouveau_vm_bind

Changes in V2:
==============
  Nouveau:
    - Reworked the Nouveau VM_BIND UAPI to avoid memory allocations in fence
      signalling critical sections. Updates to the VA space are split up in three
      separate stages, where only the 2. stage executes in a fence signalling
      critical section:

        1. update the VA space, allocate new structures and page tables
        2. (un-)map the requested memory bindings
        3. free structures and page tables

    - Separated generic job scheduler code from specific job implementations.
    - Separated the EXEC and VM_BIND implementation of the UAPI.
    - Reworked the locking parts of the nvkm/vmm RAW interface, such that
      (un-)map operations can be executed in fence signalling critical sections.

  GPUVA Manager:
    - made drm_gpuva_regions optional for users of the GPUVA manager
    - allow NULL GEMs for drm_gpuva entries
    - swichted from drm_mm to maple_tree for track drm_gpuva / drm_gpuva_region
      entries
    - provide callbacks for users to allocate custom drm_gpuva_op structures to
      allow inheritance
    - added user bits to drm_gpuva_flags
    - added a prefetch operation type in order to support generating prefetch
      operations in the same way other operations generated
    - hand the responsibility for mutual exclusion for a GEM's
      drm_gpuva list to the user; simplified corresponding (un-)link functions

  Maple Tree:
    - I added two maple tree patches to the series, one to support custom tree
      walk macros and one to hand the locking responsibility to the user of the
      GPUVA manager without pre-defined lockdep checks.

Changes in V3:
==============
  Nouveau:
    - Reworked the Nouveau VM_BIND UAPI to do the job cleanup (including page
      table cleanup) within a workqueue rather than the job_free() callback of
      the scheduler itself. A job_free() callback can stall the execution (run()
      callback) of the next job in the queue. Since the page table cleanup
      requires to take the same locks as need to be taken for page table
      allocation, doing it directly in the job_free() callback would still
      violate the fence signalling critical path.
    - Separated Nouveau fence allocation and emit, such that we do not violate
      the fence signalling critical path in EXEC jobs.
    - Implement "regions" (for handling sparse mappings through PDEs and dual
      page tables) within Nouveau.
    - Drop the requirement for every mapping to be contained within a region.
    - Add necassary synchronization of VM_BIND job operation sequences in order
      to work around limitations in page table handling. This will be addressed
      in a future re-work of Nouveau's page table handling.
    - Fixed a couple of race conditions found through more testing. Thanks to
      Dave for consitently trying to break it. :-)

  GPUVA Manager:
    - Implement pre-allocation capabilities for tree modifications within fence
      signalling critical sections.
    - Implement accessors to to apply tree modification while walking the GPUVA
      tree in order to actually support processing of drm_gpuva_ops through
      callbacks in fence signalling critical sections rather than through
      pre-allocated operation lists.
    - Remove merging of GPUVAs; the kernel has limited to none knowlege about
      the semantics of mapping sequences. Hence, merging is purely speculative.
      It seems that gaining a significant (or at least a measurable) performance
      increase through merging is way more likely to happen when userspace is
      responsible for merging mappings up to the next larger page size if
      possible.
    - Since merging was removed, regions pretty much loose their right to exist.
      They might still be useful for handling dual page tables or similar
      mechanisms, but since Nouveau seems to be the only driver having a need
      for this for now, regions were removed from the GPUVA manager.
    - Fixed a couple of maple_tree related issues; thanks to Liam for helping me
      out.

Changes in V4:
==============
  Nouveau:
    - Refactored how specific VM_BIND and EXEC jobs are created and how their
      arguments are passed to the generic job implementation.
    - Fixed a UAF race condition where bind job ops could have been freed
      already while still waiting for a job cleanup to finish. This is due to
      in certain cases we need to wait for mappings actually being unmapped
      before creating sparse regions in the same area.
    - Re-based the code onto drm_exec v4 patch.

  GPUVA Manager:
    - Fixed a maple tree related bug when pre-allocating MA states.
      (Boris Brezillion)
    - Made struct drm_gpuva_fn_ops a const object in all occurrences.
      (Boris Brezillion)

TODO
====
  Maple Tree:
    - Maple tree uses the 'unsinged long' type for node entries. While this
      works for 64bit, it's incompatible with the DRM GPUVA Manager on 32bit,
      since the DRM GPUVA Manager uses the u64 type and so do drivers using it.
      While it's questionable whether a 32bit kernel and a > 32bit GPU address
      space make any sense, it creates tons of compiler warnings when compiling
      for 32bit. Maybe it makes sense to expand the maple tree API to let users
      decide which size to pick - other ideas / proposals are welcome.


Christian König (1):
  drm: execution context for GEM buffers v4

Danilo Krummrich (13):
  maple_tree: split up MA_STATE() macro
  drm: manager to keep track of GPUs VA mappings
  drm: debugfs: provide infrastructure to dump a DRM GPU VA space
  drm/nouveau: new VM_BIND uapi interfaces
  drm/nouveau: get vmm via nouveau_cli_vmm()
  drm/nouveau: bo: initialize GEM GPU VA interface
  drm/nouveau: move usercopy helpers to nouveau_drv.h
  drm/nouveau: fence: separate fence alloc and emit
  drm/nouveau: fence: fail to emit when fence context is killed
  drm/nouveau: chan: provide nouveau_channel_kill()
  drm/nouveau: nvkm/vmm: implement raw ops to manage uvmm
  drm/nouveau: implement new VM_BIND uAPI
  drm/nouveau: debugfs: implement DRM GPU VA debugfs

 Documentation/gpu/driver-uapi.rst             |   11 +
 Documentation/gpu/drm-mm.rst                  |   43 +
 drivers/gpu/drm/Kconfig                       |    6 +
 drivers/gpu/drm/Makefile                      |    3 +
 drivers/gpu/drm/drm_debugfs.c                 |   41 +
 drivers/gpu/drm/drm_exec.c                    |  278 +++
 drivers/gpu/drm/drm_gem.c                     |    3 +
 drivers/gpu/drm/drm_gpuva_mgr.c               | 1687 +++++++++++++++
 drivers/gpu/drm/nouveau/Kbuild                |    3 +
 drivers/gpu/drm/nouveau/Kconfig               |    2 +
 drivers/gpu/drm/nouveau/dispnv04/crtc.c       |    9 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   26 +-
 drivers/gpu/drm/nouveau/include/nvif/vmm.h    |   19 +-
 .../gpu/drm/nouveau/include/nvkm/subdev/mmu.h |   20 +-
 drivers/gpu/drm/nouveau/nouveau_abi16.c       |   24 +
 drivers/gpu/drm/nouveau/nouveau_abi16.h       |    1 +
 drivers/gpu/drm/nouveau/nouveau_bo.c          |  204 +-
 drivers/gpu/drm/nouveau/nouveau_bo.h          |    2 +-
 drivers/gpu/drm/nouveau/nouveau_chan.c        |   22 +-
 drivers/gpu/drm/nouveau/nouveau_chan.h        |    1 +
 drivers/gpu/drm/nouveau/nouveau_debugfs.c     |   39 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c        |    9 +-
 drivers/gpu/drm/nouveau/nouveau_drm.c         |   27 +-
 drivers/gpu/drm/nouveau/nouveau_drv.h         |   94 +-
 drivers/gpu/drm/nouveau/nouveau_exec.c        |  418 ++++
 drivers/gpu/drm/nouveau/nouveau_exec.h        |   54 +
 drivers/gpu/drm/nouveau/nouveau_fence.c       |   23 +-
 drivers/gpu/drm/nouveau/nouveau_fence.h       |    5 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c         |   62 +-
 drivers/gpu/drm/nouveau/nouveau_mem.h         |    5 +
 drivers/gpu/drm/nouveau/nouveau_prime.c       |    2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c       |  461 ++++
 drivers/gpu/drm/nouveau/nouveau_sched.h       |  123 ++
 drivers/gpu/drm/nouveau/nouveau_svm.c         |    2 +-
 drivers/gpu/drm/nouveau/nouveau_uvmm.c        | 1898 +++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_uvmm.h        |  107 +
 drivers/gpu/drm/nouveau/nouveau_vmm.c         |    4 +-
 drivers/gpu/drm/nouveau/nvif/vmm.c            |  100 +-
 .../gpu/drm/nouveau/nvkm/subdev/mmu/uvmm.c    |  213 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c |  197 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   25 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgf100.c    |   16 +-
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   16 +-
 .../gpu/drm/nouveau/nvkm/subdev/mmu/vmmnv50.c |   27 +-
 include/drm/drm_debugfs.h                     |   25 +
 include/drm/drm_drv.h                         |    6 +
 include/drm/drm_exec.h                        |  119 ++
 include/drm/drm_gem.h                         |   75 +
 include/drm/drm_gpuva_mgr.h                   |  681 ++++++
 include/linux/maple_tree.h                    |    7 +-
 include/uapi/drm/nouveau_drm.h                |  209 ++
 51 files changed, 7212 insertions(+), 242 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_exec.c
 create mode 100644 drivers/gpu/drm/drm_gpuva_mgr.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_uvmm.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_uvmm.h
 create mode 100644 include/drm/drm_exec.h
 create mode 100644 include/drm/drm_gpuva_mgr.h


base-commit: 33a86170888b7e4aa0cea94ebb9c67180139cea9
  

Comments

Donald Robson June 9, 2023, 11:56 a.m. UTC | #1
On Wed, 2023-06-07 at 00:31 +0200, Danilo Krummrich wrote:
> 
> Christian König (1):
>   drm: execution context for GEM buffers v4
> 
> Danilo Krummrich (13):
>   maple_tree: split up MA_STATE() macro
>   drm: manager to keep track of GPUs VA mappings

I have tested the drm GPUVA manager as part of using it with our new
driver.  The link below shows use of the drm_gpuva_sm_[un]map()
functions.  I think this is based on the v3 patches, but I have also
tried it locally using v4 patches.  We will be submitting this
driver for review soon.

https://gitlab.freedesktop.org/sarah-walker-imgtec/powervr/-/blob/dev/v3/drivers/gpu/drm/imagination/pvr_vm.c

In a previous incarnation, I used the drm_gpuva_insert() and
drm_gpuva_remove() functions directly.  In some now abandoned work I
used the drm_gpuva_sm_[un]map_ops_create() route.

The only problem I encountered along the way was the maple tree init
issue already reported by Boris and fixed in v4.  One caveat - as
our driver is a work in progress our testing is limited to certain
Sascha Willem tests.

I did find it quite difficult to get the prealloc route with
drm_gpuva_sm_[un]map() working.  I'm not sure to what degree this
reflects me being a novice on matters DRM, but I did find myself
wishing for more direction, even with Boris's help.

Tested-by: Donald Robson <donald.robson@imgtec.com>

>   drm: debugfs: provide infrastructure to dump a DRM GPU VA space
>   drm/nouveau: new VM_BIND uapi interfaces
>   drm/nouveau: get vmm via nouveau_cli_vmm()
>   drm/nouveau: bo: initialize GEM GPU VA interface
>   drm/nouveau: move usercopy helpers to nouveau_drv.h
>   drm/nouveau: fence: separate fence alloc and emit
>   drm/nouveau: fence: fail to emit when fence context is killed
>   drm/nouveau: chan: provide nouveau_channel_kill()
>   drm/nouveau: nvkm/vmm: implement raw ops to manage uvmm
>   drm/nouveau: implement new VM_BIND uAPI
>   drm/nouveau: debugfs: implement DRM GPU VA debugfs
>
  
Danilo Krummrich June 13, 2023, 2:20 p.m. UTC | #2
Hi Donald,

On 6/9/23 13:56, Donald Robson wrote:
> On Wed, 2023-06-07 at 00:31 +0200, Danilo Krummrich wrote:
>>
>> Christian König (1):
>>    drm: execution context for GEM buffers v4
>>
>> Danilo Krummrich (13):
>>    maple_tree: split up MA_STATE() macro
>>    drm: manager to keep track of GPUs VA mappings
> 
> I have tested the drm GPUVA manager as part of using it with our new
> driver.  The link below shows use of the drm_gpuva_sm_[un]map()
> functions.  I think this is based on the v3 patches, but I have also
> tried it locally using v4 patches.  We will be submitting this
> driver for review soon.

That's awesome - thank your for taking the effort!

> 
> https://gitlab.freedesktop.org/sarah-walker-imgtec/powervr/-/blob/dev/v3/drivers/gpu/drm/imagination/pvr_vm.c
> 
> In a previous incarnation, I used the drm_gpuva_insert() and
> drm_gpuva_remove() functions directly.  In some now abandoned work I
> used the drm_gpuva_sm_[un]map_ops_create() route.
> 
> The only problem I encountered along the way was the maple tree init
> issue already reported by Boris and fixed in v4.  One caveat - as
> our driver is a work in progress our testing is limited to certain
> Sascha Willem tests.
> 
> I did find it quite difficult to get the prealloc route with
> drm_gpuva_sm_[un]map() working.  I'm not sure to what degree this
> reflects me being a novice on matters DRM, but I did find myself
> wishing for more direction, even with Boris's help.

I'm definitely up improving the existing documentation. Anything in 
particular you think should be described in more detail?

- Danilo

> 
> Tested-by: Donald Robson <donald.robson@imgtec.com>
> 
>>    drm: debugfs: provide infrastructure to dump a DRM GPU VA space
>>    drm/nouveau: new VM_BIND uapi interfaces
>>    drm/nouveau: get vmm via nouveau_cli_vmm()
>>    drm/nouveau: bo: initialize GEM GPU VA interface
>>    drm/nouveau: move usercopy helpers to nouveau_drv.h
>>    drm/nouveau: fence: separate fence alloc and emit
>>    drm/nouveau: fence: fail to emit when fence context is killed
>>    drm/nouveau: chan: provide nouveau_channel_kill()
>>    drm/nouveau: nvkm/vmm: implement raw ops to manage uvmm
>>    drm/nouveau: implement new VM_BIND uAPI
>>    drm/nouveau: debugfs: implement DRM GPU VA debugfs
>>
  
Donald Robson June 14, 2023, 7:58 a.m. UTC | #3
On Tue, 2023-06-13 at 16:20 +0200, Danilo Krummrich wrote:

> I'm definitely up improving the existing documentation. Anything in 
> particular you think should be described in more detail?
> 
> - Danilo

Hi Danilo,

As I said, with inexperience it's possible I missed what I was
looking for in the existing documentation, which is highly detailed
in regard to how it deals with operations, but usage was where I fell
down.

If I understand there are three ways to use this, which are:
1) Using drm_gpuva_insert() and drm_gpuva_remove() directly using
   stack va objects.
2) Using drm_gpuva_insert() and drm_gpuva_remove() in a callback
   context, after having created ops lists using
   drm_gpuva_sm_[un]map_ops_create().
3) Using drm_gpuva_[un]map() in callback context after having
   prealloced a node and va objects for map/remap function use,
   which must be forwarded in as the 'priv' argument to
   drm_gpuva_sm_[un]map().

The first of these is pretty self-explanatory.  The second was also
fairly easy to understand, it has an example in your own driver, and
since it takes care of allocs in drm_gpuva_sm_map_ops_create() it
leads to pretty clean code too.

The third case, which I am using in the new PowerVR driver did not
have an example of usage and the approach is quite different to 2)
in that you have to prealloc everything explicitly.  I didn't realise
this, so it led to a fair amount of frustration.

I think if you're willing, it would help inexperienced implementers a
lot if there were some brief 'how to' snippets for each of the three
use cases.

Thanks,
Donald
  
Danilo Krummrich June 15, 2023, 4:31 p.m. UTC | #4
On 6/14/23 09:58, Donald Robson wrote:
> On Tue, 2023-06-13 at 16:20 +0200, Danilo Krummrich wrote:
> 
>> I'm definitely up improving the existing documentation. Anything in
>> particular you think should be described in more detail?
>>
>> - Danilo
> 
> Hi Danilo,
> 
> As I said, with inexperience it's possible I missed what I was
> looking for in the existing documentation, which is highly detailed
> in regard to how it deals with operations, but usage was where I fell
> down.
> 
> If I understand there are three ways to use this, which are:
> 1) Using drm_gpuva_insert() and drm_gpuva_remove() directly using
>     stack va objects.

What do you mean with stack va objects?

> 2) Using drm_gpuva_insert() and drm_gpuva_remove() in a callback
>     context, after having created ops lists using
>     drm_gpuva_sm_[un]map_ops_create().
> 3) Using drm_gpuva_[un]map() in callback context after having
>     prealloced a node and va objects for map/remap function use,
>     which must be forwarded in as the 'priv' argument to
>     drm_gpuva_sm_[un]map().

Right, and I think it might be worth concretely mentioning this in the 
documentation.

> 
> The first of these is pretty self-explanatory.  The second was also
> fairly easy to understand, it has an example in your own driver, and
> since it takes care of allocs in drm_gpuva_sm_map_ops_create() it
> leads to pretty clean code too.
> 
> The third case, which I am using in the new PowerVR driver did not
> have an example of usage and the approach is quite different to 2)
> in that you have to prealloc everything explicitly.  I didn't realise
> this, so it led to a fair amount of frustration.

Yeah, I think this is not entirely obvious why this is the case. I 
should maybe add a comment on how the callback way of using this 
interface is motivated.

The requirement of pre-allocation arises out of two circumstances.
First, having a single callback for every drm_gpuva_op on the GPUVA 
space implies that we're not allowed to fail the operation, because 
processing the drm_gpuva_ops directly implies that we can't unwind them 
on failure.

I know that the API functions the documentation guides you to use in 
this case actually can return error codes, but those are just range 
checks. If they fail, it's clearly a bug. However, I did not use WARN() 
for those cases, since the driver could still decide to use the 
callbacks to keep track of the operations in a driver specific way, 
although I would not recommend doing this and rather like to try to 
cover the drivers use case within the regular way of creating a list of 
operations.

Second, most (other) drivers when using the callback way of this 
interface would need to execute the GPUVA space updates asynchronously 
in a dma_fence signalling critical path, where no memory allocations are 
permitted.

> 
> I think if you're willing, it would help inexperienced implementers a
> lot if there were some brief 'how to' snippets for each of the three
> use cases.

Yes, I can definitely add some.

> 
> Thanks,
> Donald
  
Danilo Krummrich June 15, 2023, 4:39 p.m. UTC | #5
On 6/7/23 00:31, Danilo Krummrich wrote:

>    Maple Tree:
>      - Maple tree uses the 'unsinged long' type for node entries. While this
>        works for 64bit, it's incompatible with the DRM GPUVA Manager on 32bit,
>        since the DRM GPUVA Manager uses the u64 type and so do drivers using it.
>        While it's questionable whether a 32bit kernel and a > 32bit GPU address
>        space make any sense, it creates tons of compiler warnings when compiling
>        for 32bit. Maybe it makes sense to expand the maple tree API to let users
>        decide which size to pick - other ideas / proposals are welcome.

I remember you told me that the filesystem folks had some interest in a 
64-bit maple tree for a 32-bit kernel as well. Are there any news or 
plans for such a feature?

For the short term I'd probably add a feature flag to the GPUVA manager, 
where drivers explicitly need to promise not to pass in addresses 
exceeding 32-bit on a 32-bit kernel, and if they don't refuse to 
initialize the GPUVA manager on 32-bit kernels - or something similar...