Message ID: 20230217134422.14116-1-dakr@redhat.com
From: Danilo Krummrich <dakr@redhat.com>
To: airlied@gmail.com, daniel@ffwll.ch, tzimmermann@suse.de, mripard@kernel.org, corbet@lwn.net, christian.koenig@amd.com, bskeggs@redhat.com, Liam.Howlett@oracle.com, matthew.brost@intel.com, boris.brezillon@collabora.com, alexdeucher@gmail.com, ogabbay@kernel.org, bagasdotme@gmail.com, willy@infradead.org, jason@jlekstrand.net
Cc: dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Danilo Krummrich <dakr@redhat.com>
Subject: [PATCH drm-next v2 00/16] [RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI
Date: Fri, 17 Feb 2023 14:44:06 +0100
Series: DRM GPUVA Manager & Nouveau VM_BIND UAPI
Message
Danilo Krummrich
Feb. 17, 2023, 1:44 p.m. UTC
This patch series provides a new UAPI for the Nouveau driver in order to
support Vulkan features, such as sparse bindings and sparse residency.

Furthermore, with the DRM GPUVA manager it provides a new DRM core feature to
keep track of GPU virtual address (VA) mappings in a more generic way.

The DRM GPUVA manager is intended to help drivers implement
userspace-manageable GPU VA spaces in reference to the Vulkan API. In order to
achieve this goal it serves the following purposes in this context.

    1) Provide infrastructure to track GPU VA allocations and mappings,
       making use of the maple_tree.

    2) Generically connect GPU VA mappings to their backing buffers, in
       particular DRM GEM objects.

    3) Provide a common implementation to perform more complex mapping
       operations on the GPU VA space. In particular splitting and merging
       of GPU VA mappings, e.g. for intersecting mapping requests or partial
       unmap requests.

The new VM_BIND Nouveau UAPI builds on top of the DRM GPUVA manager, itself
providing the following new interfaces.

    1) Initialize a GPU VA space via the new DRM_IOCTL_NOUVEAU_VM_INIT ioctl
       for UMDs to specify the portion of VA space managed by the kernel and
       userspace, respectively.

    2) Allocate and free a VA space region as well as bind and unbind memory
       to the GPU's VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND ioctl.

    3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl.

Both DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC make use of the DRM
scheduler to queue jobs and support asynchronous processing with DRM syncobjs
as synchronization mechanism. By default DRM_IOCTL_NOUVEAU_VM_BIND does
synchronous processing; DRM_IOCTL_NOUVEAU_EXEC supports asynchronous
processing only.

The new VM_BIND UAPI for Nouveau also makes use of drm_exec (execution context
for GEM buffers) by Christian König. Since the patch implementing drm_exec was
not yet merged into drm-next, it is part of this series, as well as a small
fix for this patch, which was found while testing this series.

This patch series is also available at [1].

There is a Mesa NVK merge request by Dave Airlie [2] implementing the
corresponding userspace parts for this series. The Vulkan CTS test suite
passes the sparse binding and sparse residency test cases for the new UAPI
together with Dave's Mesa work.

There are also some test cases in the igt-gpu-tools project [3] for the new
UAPI and hence the DRM GPUVA manager. However, most of them exercise the DRM
GPUVA manager's logic through Nouveau's new UAPI and should be considered just
as helpers for the implementation. I absolutely intend to convert those test
cases into proper kunit test cases for the DRM GPUVA manager, once and if we
agree on its usefulness and design.

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/new-uapi-drm-next /
    https://gitlab.freedesktop.org/nouvelles/kernel/-/merge_requests/1
[2] https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/150/
[3] https://gitlab.freedesktop.org/dakr/igt-gpu-tools/-/tree/wip_nouveau_vm_bind

Changes in V2:
==============
Nouveau:
    - Reworked the Nouveau VM_BIND UAPI to avoid memory allocations in fence
      signalling critical sections. Updates to the VA space are split up into
      three separate stages (a rough sketch follows below), where only the
      second stage executes in a fence signalling critical section:

        1. update the VA space, allocate new structures and page tables
        2. (un-)map the requested memory bindings
        3. free structures and page tables

    - Separated generic job scheduler code from specific job implementations.
    - Separated the EXEC and VM_BIND implementation of the UAPI.
    - Reworked the locking parts of the nvkm/vmm RAW interface, such that
      (un-)map operations can be executed in fence signalling critical
      sections.
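For readers skimming the changelog, here is a minimal sketch of the three-stage
split described above, i.e. keeping allocations out of the fence signalling
critical section. All names are illustrative only and do not correspond to the
actual functions in this series:

```c
#include <linux/dma-fence.h>

struct bind_job;

/* Stage 1: submit time (process context) - update the VA space view and
 * pre-allocate gpuva structures and page tables. */
static int bind_job_submit(struct bind_job *job)
{
	/* allocations happen here */
	return 0;
}

/* Stage 2: run_job() callback, fence signalling critical section - only
 * (un-)map using what stage 1 prepared, no allocations allowed. */
static struct dma_fence *bind_job_run(struct bind_job *job)
{
	return NULL; /* would return the job's fence */
}

/* Stage 3: job completion - free structures and now-unused page tables. */
static void bind_job_free(struct bind_job *job)
{
}
```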
GPUVA Manager:
    - made drm_gpuva_regions optional for users of the GPUVA manager
    - allow NULL GEMs for drm_gpuva entries
    - switched from drm_mm to maple_tree for tracking drm_gpuva /
      drm_gpuva_region entries
    - provide callbacks for users to allocate custom drm_gpuva_op structures
      to allow inheritance
    - added user bits to drm_gpuva_flags
    - added a prefetch operation type in order to support generating prefetch
      operations in the same way other operations are generated
    - hand the responsibility for mutual exclusion for a GEM's drm_gpuva list
      to the user; simplified the corresponding (un-)link functions

Maple Tree:
    - I added two maple tree patches to the series, one to support custom tree
      walk macros and one to hand the locking responsibility to the user of
      the GPUVA manager without pre-defined lockdep checks.

TODO
====
Maple Tree:
    - The maple tree uses the 'unsigned long' type for node entries. While
      this works for 64bit, it's incompatible with the DRM GPUVA manager on
      32bit, since the DRM GPUVA manager uses the u64 type and so do drivers
      using it. While it's questionable whether a 32bit kernel and a > 32bit
      GPU address space make any sense, it creates tons of compiler warnings
      when compiling for 32bit. Maybe it makes sense to expand the maple tree
      API to let users decide which size to pick - other ideas / proposals
      are welcome.
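As a small illustration of what the two maple tree patches are about (and of
the 'unsigned long' index type mentioned in the TODO), a sketch of how a
GPUVA-manager-style user might drive the tree; MT_FLAGS_LOCK_NONE is the flag
proposed by this series, the remaining calls are the existing maple tree API:

```c
#include <linux/gfp.h>
#include <linux/maple_tree.h>

static struct maple_tree va_tree;

static void va_tree_init(void)
{
	/* Proposed flag: no internal locking and no pre-defined lockdep
	 * checks; the GPUVA manager user serializes accesses itself. With
	 * mainline flags, MT_FLAGS_LOCK_EXTERN is the closest equivalent. */
	mt_init_flags(&va_tree, MT_FLAGS_LOCK_NONE);
}

static int va_tree_insert(void *gpuva, unsigned long addr, unsigned long range)
{
	/* Note the 'unsigned long' index type: this is the 32bit issue from
	 * the TODO, since the GPUVA manager and its users work with u64. */
	return mtree_store_range(&va_tree, addr, addr + range - 1, gpuva,
				 GFP_KERNEL);
}
```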
Christian König (1):
  drm: execution context for GEM buffers

Danilo Krummrich (15):
  drm/exec: fix memory leak in drm_exec_prepare_obj()
  maple_tree: split up MA_STATE() macro
  maple_tree: add flag MT_FLAGS_LOCK_NONE
  drm: manager to keep track of GPUs VA mappings
  drm: debugfs: provide infrastructure to dump a DRM GPU VA space
  drm/nouveau: new VM_BIND uapi interfaces
  drm/nouveau: get vmm via nouveau_cli_vmm()
  drm/nouveau: bo: initialize GEM GPU VA interface
  drm/nouveau: move usercopy helpers to nouveau_drv.h
  drm/nouveau: fence: fail to emit when fence context is killed
  drm/nouveau: chan: provide nouveau_channel_kill()
  drm/nouveau: nvkm/vmm: implement raw ops to manage uvmm
  drm/nouveau: implement uvmm for user mode bindings
  drm/nouveau: implement new VM_BIND UAPI
  drm/nouveau: debugfs: implement DRM GPU VA debugfs

 Documentation/gpu/driver-uapi.rst             |   11 +
 Documentation/gpu/drm-mm.rst                  |   43 +
 drivers/gpu/drm/Kconfig                       |    6 +
 drivers/gpu/drm/Makefile                      |    3 +
 drivers/gpu/drm/amd/amdgpu/Kconfig            |    1 +
 drivers/gpu/drm/drm_debugfs.c                 |   56 +
 drivers/gpu/drm/drm_exec.c                    |  294 +++
 drivers/gpu/drm/drm_gem.c                     |    3 +
 drivers/gpu/drm/drm_gpuva_mgr.c               | 1704 +++++++++++++++++
 drivers/gpu/drm/nouveau/Kbuild                |    3 +
 drivers/gpu/drm/nouveau/Kconfig               |    2 +
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   26 +-
 drivers/gpu/drm/nouveau/include/nvif/vmm.h    |   19 +-
 .../gpu/drm/nouveau/include/nvkm/subdev/mmu.h |   20 +-
 drivers/gpu/drm/nouveau/nouveau_abi16.c       |   23 +
 drivers/gpu/drm/nouveau/nouveau_abi16.h       |    1 +
 drivers/gpu/drm/nouveau/nouveau_bo.c          |  152 +-
 drivers/gpu/drm/nouveau/nouveau_bo.h          |    2 +-
 drivers/gpu/drm/nouveau/nouveau_chan.c        |   16 +-
 drivers/gpu/drm/nouveau/nouveau_chan.h        |    1 +
 drivers/gpu/drm/nouveau/nouveau_debugfs.c     |   24 +
 drivers/gpu/drm/nouveau/nouveau_drm.c         |   26 +-
 drivers/gpu/drm/nouveau/nouveau_drv.h         |   92 +-
 drivers/gpu/drm/nouveau/nouveau_exec.c        |  322 ++++
 drivers/gpu/drm/nouveau/nouveau_exec.h        |   39 +
 drivers/gpu/drm/nouveau/nouveau_fence.c       |    7 +
 drivers/gpu/drm/nouveau/nouveau_fence.h       |    2 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c         |   57 +-
 drivers/gpu/drm/nouveau/nouveau_mem.h         |    5 +
 drivers/gpu/drm/nouveau/nouveau_prime.c       |    2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c       |  467 +++++
 drivers/gpu/drm/nouveau/nouveau_sched.h       |   96 +
 drivers/gpu/drm/nouveau/nouveau_svm.c         |    2 +-
 drivers/gpu/drm/nouveau/nouveau_uvmm.c        | 1536 +++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_uvmm.h        |  138 ++
 drivers/gpu/drm/nouveau/nouveau_vmm.c         |    4 +-
 drivers/gpu/drm/nouveau/nvif/vmm.c            |  100 +-
 .../gpu/drm/nouveau/nvkm/subdev/mmu/uvmm.c    |  213 ++-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c |  197 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   25 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgf100.c    |   16 +-
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   16 +-
 .../gpu/drm/nouveau/nvkm/subdev/mmu/vmmnv50.c |   27 +-
 include/drm/drm_debugfs.h                     |   25 +
 include/drm/drm_drv.h                         |    6 +
 include/drm/drm_exec.h                        |  144 ++
 include/drm/drm_gem.h                         |   75 +
 include/drm/drm_gpuva_mgr.h                   |  714 +++++++
 include/linux/maple_tree.h                    |   27 +-
 include/uapi/drm/nouveau_drm.h                |  220 +++
 lib/maple_tree.c                              |    7 +-
 51 files changed, 6808 insertions(+), 209 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_exec.c
 create mode 100644 drivers/gpu/drm/drm_gpuva_mgr.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_uvmm.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_uvmm.h
 create mode 100644 include/drm/drm_exec.h
 create mode 100644 include/drm/drm_gpuva_mgr.h

base-commit: 48075a66fca613477ac1969b576a93ef5db0164f
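For orientation, the rough userspace flow for the three ioctls listed at the
top of the cover letter might look as follows. This assumes the series'
updated include/uapi/drm/nouveau_drm.h is available; the argument structs are
intentionally left opaque here since their layout is defined by that header
(see the Mesa NVK MR [2] for real usage):

```c
#include <sys/ioctl.h>
#include <drm/nouveau_drm.h>	/* the series' updated UAPI header (assumed) */

static int vm_init_bind_exec(int fd, void *init_args, void *bind_args,
			     void *exec_args)
{
	/* 1) Tell the kernel which part of the VA space it manages and which
	 *    part userspace manages. */
	if (ioctl(fd, DRM_IOCTL_NOUVEAU_VM_INIT, init_args))
		return -1;

	/* 2) Bind memory to the GPU VA space; synchronous by default,
	 *    asynchronous processing is synced through DRM syncobjs. */
	if (ioctl(fd, DRM_IOCTL_NOUVEAU_VM_BIND, bind_args))
		return -1;

	/* 3) Execute push buffers; asynchronous only, also synced through
	 *    DRM syncobjs. */
	return ioctl(fd, DRM_IOCTL_NOUVEAU_EXEC, exec_args);
}
```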
Comments
Hi Danilo,

On Fri, 17 Feb 2023 14:44:06 +0100
Danilo Krummrich <dakr@redhat.com> wrote:

> Changes in V2:
> ==============
> Nouveau:
> - Reworked the Nouveau VM_BIND UAPI to avoid memory allocations in fence
>   signalling critical sections. Updates to the VA space are split up into
>   three separate stages, where only the second stage executes in a fence
>   signalling critical section:
>
>   1. update the VA space, allocate new structures and page tables

Sorry for the silly question, but I didn't find where the page table
pre-allocation happens. Mind pointing it out to me? It's also unclear when
this step happens. Is this at bind-job submission time, when the job is not
necessarily ready to run, potentially waiting for other deps to be signaled?
Or is it done when all deps are met, as an extra step before jumping to
step 2?

If it's the former, then I don't see how the VA space update can happen, since
the bind-job might depend on other bind-jobs modifying the same portion of the
VA space (unbind ops might lead to intermediate page table levels disappearing
while we were waiting for deps). If it's the latter, I wonder why this is not
considered an allocation in the fence signaling path (for the bind-job
out-fence to be signaled, you need these allocations to succeed, unless
failing to allocate page tables is considered a HW misbehavior and the fence
is signaled with an error in that case).

Note that I'm not familiar at all with Nouveau or TTM, and it might be
something that's solved by another component, or I'm just misunderstanding how
the whole thing is supposed to work. This being said, I'd really like to
implement a VM_BIND-like uAPI in pancsf using the gpuva_manager infra you're
proposing here, so please bear with me :-).

>   2. (un-)map the requested memory bindings
>   3. free structures and page tables
>
> - Separated generic job scheduler code from specific job implementations.
> - Separated the EXEC and VM_BIND implementation of the UAPI.
> - Reworked the locking parts of the nvkm/vmm RAW interface, such that
>   (un-)map operations can be executed in fence signalling critical sections.

Regards,

Boris
On Thu, 9 Mar 2023 10:12:43 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> Hi Danilo,
>
> [...]
>
> Sorry for the silly question, but I didn't find where the page table
> pre-allocation happens. Mind pointing it out to me? It's also unclear when
> this step happens. Is this at bind-job submission time, when the job is not
> necessarily ready to run, potentially waiting for other deps to be signaled?
> Or is it done when all deps are met, as an extra step before jumping to
> step 2? If it's the former, then I don't see how the VA space update can
> happen, since the bind-job might depend on other bind-jobs modifying the
> same portion of the VA space (unbind ops might lead to intermediate page
> table levels disappearing while we were waiting for deps). If it's the
> latter, I wonder why this is not considered an allocation in the fence
> signaling path (for the bind-job out-fence to be signaled, you need these
> allocations to succeed, unless failing to allocate page tables is considered
> a HW misbehavior and the fence is signaled with an error in that case).

Ok, so I just noticed you only have one bind queue per drm_file
(cli->sched_entity), and jobs are executed in-order on a given queue, so I
guess that allows you to modify the VA space at submit time without risking
any modifications to the VA space coming from other bind-queues targeting the
same VM. And, if I'm correct, synchronous bind/unbind ops take the same path,
so no risk for those to modify the VA space either (I just wonder if it's a
good thing to have sync bind/unbind operations wait on async ones, but that's
a different topic).

> [...]
>
> Regards,
>
> Boris
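For context, the "one bind queue per drm_file" setup Boris refers to
corresponds to a single DRM scheduler entity per client, which gives in-order
execution of that client's bind jobs. A minimal sketch with illustrative
names (not the actual Nouveau code):

```c
#include <drm/gpu_scheduler.h>

struct bind_client {
	struct drm_sched_entity entity;		/* the client's bind queue */
};

static int bind_client_init(struct bind_client *cli,
			    struct drm_gpu_scheduler *sched)
{
	struct drm_gpu_scheduler *sched_list[] = { sched };

	/* All VM_BIND jobs of this drm_file go through one entity, so the
	 * scheduler processes them in submission order; VA space updates done
	 * at submit time therefore cannot be overtaken by other bind jobs of
	 * the same client. */
	return drm_sched_entity_init(&cli->entity, DRM_SCHED_PRIORITY_NORMAL,
				     sched_list, 1, NULL);
}
```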
Hi Boris,

On 3/9/23 10:48, Boris Brezillon wrote:
> On Thu, 9 Mar 2023 10:12:43 +0100
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>
> [...]
>
> Ok, so I just noticed you only have one bind queue per drm_file
> (cli->sched_entity), and jobs are executed in-order on a given queue, so I
> guess that allows you to modify the VA space at submit time without risking
> any modifications to the VA space coming from other bind-queues targeting
> the same VM. And, if I'm correct, synchronous bind/unbind ops take the same
> path, so no risk for those to modify the VA space either (I just wonder if
> it's a good thing to have sync bind/unbind operations wait on async ones,
> but that's a different topic).

Yes, that's all correct.

The page table allocation happens through nouveau_uvmm_vmm_get(), which either
allocates the corresponding page tables or, in case they already exist,
increases their reference count accordingly. The call goes all the way through
nvif into the nvkm layer (not the easiest call chain to follow) and ends up in
nvkm_vmm_ptes_get().

There are multiple reasons for updating the VA space at submit time in
Nouveau.

1) Subsequent EXEC ioctl() calls would need to wait for the bind jobs they
depend on within the ioctl() rather than in the scheduler queue, because at
the point in time where the ioctl() happens the VA space wouldn't be
up-to-date.

2) Let's assume a new mapping is requested and within its range other mappings
already exist. Let's also assume that those existing mappings aren't
contiguous, such that there are gaps between them. In such a case I need to
allocate page tables only for the gaps between the existing mappings, or
alternatively, allocate them for the whole range of the new mapping, but
free / decrease the reference count of the page tables for the ranges of the
previously existing mappings afterwards.

In the first case I need to know the gaps to allocate page tables for when
submitting the job, which means the VA space must be up-to-date. In the latter
one I must save the ranges of the previously existing mappings somewhere in
order to clean them up, hence I need to allocate memory to store this
information. Since I can't allocate this memory in the job's run() callback
(fence signalling critical section), I need to do it when submitting the job
already, and hence the VA space must be up-to-date again.

However, this is due to how page table management currently works in Nouveau
and we might change that in the future.

Synchronous binds/unbinds taking the same path through the scheduler is a
downside of this approach.

- Danilo
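A conceptual sketch of case 2) above, i.e. reserving page tables only for the
gaps between pre-existing mappings in the requested range. All types and
helpers here are stand-ins invented for illustration; the real logic lives in
nouveau_uvmm.c and ends up in nvkm_vmm_ptes_get():

```c
#include <linux/list.h>
#include <linux/minmax.h>
#include <linux/types.h>

struct sk_mapping {
	struct list_head entry;
	u64 addr, range;		/* an already existing mapping */
};

int sk_ptes_get(u64 addr, u64 range);	/* stand-in for nvkm_vmm_ptes_get() */

/* Reserve page tables only for the gaps between existing mappings within
 * [addr, addr + range). Walking the existing mappings requires an up-to-date
 * VA space view, which is why this has to happen at submit time. */
static int sk_get_ptes_for_gaps(struct list_head *mappings, u64 addr, u64 range)
{
	u64 cursor = addr, end = addr + range;
	struct sk_mapping *m;
	int ret;

	list_for_each_entry(m, mappings, entry) {	/* sorted by addr */
		if (m->addr >= end)
			break;
		if (m->addr > cursor) {
			ret = sk_ptes_get(cursor, m->addr - cursor);
			if (ret)
				return ret;
		}
		cursor = max(cursor, m->addr + m->range);
	}

	return cursor < end ? sk_ptes_get(cursor, end - cursor) : 0;
}
```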
Hi Danilo,

On Fri, 10 Mar 2023 17:45:58 +0100
Danilo Krummrich <dakr@redhat.com> wrote:

> Hi Boris,
>
> [...]
>
> Yes, that's all correct.
>
> The page table allocation happens through nouveau_uvmm_vmm_get(), which
> either allocates the corresponding page tables or, in case they already
> exist, increases their reference count accordingly. The call goes all the
> way through nvif into the nvkm layer (not the easiest call chain to follow)
> and ends up in nvkm_vmm_ptes_get().
>
> There are multiple reasons for updating the VA space at submit time in
> Nouveau.
>
> 1) Subsequent EXEC ioctl() calls would need to wait for the bind jobs they
> depend on within the ioctl() rather than in the scheduler queue, because at
> the point in time where the ioctl() happens the VA space wouldn't be
> up-to-date.

Hm, actually that's what explicit sync is all about, isn't it? If you have
async binding ops, you should retrieve the bind-op out-fences and pass them
back as in-fences to the EXEC call, so you're sure all the memory mappings you
depend on are active when you execute those GPU jobs. And if you're using sync
binds, the changes are guaranteed to be applied before the ioctl() returns. Am
I missing something?

> 2) Let's assume a new mapping is requested and within its range other
> mappings already exist. Let's also assume that those existing mappings
> aren't contiguous, such that there are gaps between them. In such a case I
> need to allocate page tables only for the gaps between the existing
> mappings, or alternatively, allocate them for the whole range of the new
> mapping, but free / decrease the reference count of the page tables for the
> ranges of the previously existing mappings afterwards.
>
> In the first case I need to know the gaps to allocate page tables for when
> submitting the job, which means the VA space must be up-to-date. In the
> latter one I must save the ranges of the previously existing mappings
> somewhere in order to clean them up, hence I need to allocate memory to
> store this information. Since I can't allocate this memory in the job's
> run() callback (fence signalling critical section), I need to do it when
> submitting the job already, and hence the VA space must be up-to-date again.

Yep, that makes perfect sense, and that explains how the whole thing can work.
When I initially read the patch series, I had more complex use cases in mind,
with multiple bind queues targeting the same VM, and synchronous binds taking
a fast path (so they don't have to wait on async binds, which can in turn wait
on external deps). This model makes it hard to predict what the VA space will
look like when an async bind operation gets to be executed, thus making page
table allocation more complex, or forcing us to over-estimate the amount of
pages we need for this update (basically one page per MMU level, except maybe
the top level, plus the number of pages you'll always need for the bind
operation itself).

> However, this is due to how page table management currently works in Nouveau
> and we might change that in the future.

I'm curious to hear about that if you have a bit of time. I'm starting from
scratch with pancsf, and I might consider going for something similar to what
you plan to do next.

> Synchronous binds/unbinds taking the same path through the scheduler is a
> downside of this approach.

Indeed. I mean, I can probably live with this limitation, but I'm curious to
know if the page table management changes you're considering for the future
would solve that problem.

Anyway, thanks for taking the time to answer my question, things are much
clearer now.

Boris
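A userspace-side sketch of the explicit-sync flow Boris describes, where the
out-fence of an asynchronous VM_BIND is fed back as in-fence to EXEC. The
syncobj calls are regular libdrm API; the actual ioctl argument layouts are
left out since they are defined by this series' UAPI header:

```c
#include <stdint.h>
#include <xf86drm.h>

static int bind_then_exec(int fd)
{
	uint32_t syncobj;
	int ret;

	ret = drmSyncobjCreate(fd, 0, &syncobj);
	if (ret)
		return ret;

	/* 1) Queue the asynchronous bind job and ask for 'syncobj' to be
	 *    signalled once the mapping is actually in place (out-fence):
	 *    ioctl(fd, DRM_IOCTL_NOUVEAU_VM_BIND, &bind_args); */

	/* 2) Submit the GPU job with 'syncobj' as in-fence, so it only runs
	 *    after the binding it relies on exists:
	 *    ioctl(fd, DRM_IOCTL_NOUVEAU_EXEC, &exec_args); */

	drmSyncobjDestroy(fd, syncobj);
	return 0;
}
```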
On 3/10/23 18:25, Boris Brezillon wrote:
> Hi Danilo,
>
> On Fri, 10 Mar 2023 17:45:58 +0100
> Danilo Krummrich <dakr@redhat.com> wrote:
>
> [...]
>
>> 1) Subsequent EXEC ioctl() calls would need to wait for the bind jobs they
>> depend on within the ioctl() rather than in the scheduler queue, because at
>> the point in time where the ioctl() happens the VA space wouldn't be
>> up-to-date.
>
> Hm, actually that's what explicit sync is all about, isn't it? If you have
> async binding ops, you should retrieve the bind-op out-fences and pass them
> back as in-fences to the EXEC call, so you're sure all the memory mappings
> you depend on are active when you execute those GPU jobs. And if you're
> using sync binds, the changes are guaranteed to be applied before the
> ioctl() returns. Am I missing something?

No, you're right and this is exactly how I implemented it. The difference is
where to wait for the bind jobs' out-fences.

In the EXEC ioctl() we need to validate the GEM objects backing the dependent
mappings and add the job's fence to the GEMs' DMA reservations. If the VA
space isn't up-to-date we might not be able to look up the relevant GEMs and
miss them.

If the VA space change happens in the bind job's submit path (ioctl()), it is
guaranteed that the view of the VA space is up-to-date (actually it might even
be ahead of the actual current state) when the EXEC ioctl() is called. Hence,
I can just pass the out-fences of the bind jobs the EXEC depends on to the job
scheduler and return from the ioctl(). The job scheduler will then wait for
the actual mappings being populated before executing the EXEC job.

If the VA space change is done when the bind job executes on the scheduler, we
would need to wait for the bind jobs' out-fences in the EXEC ioctl() itself.

> [...]
>
>> However, this is due to how page table management currently works in
>> Nouveau and we might change that in the future.
>
> I'm curious to hear about that if you have a bit of time. I'm starting from
> scratch with pancsf, and I might consider going for something similar to
> what you plan to do next.

There is no concrete plan yet. However, with the current implementation there
are a few shortcomings (also in handling sparse ranges) that I'd like to
address in the future.

>> Synchronous binds/unbinds taking the same path through the scheduler is a
>> downside of this approach.
>
> Indeed. I mean, I can probably live with this limitation, but I'm curious to
> know if the page table management changes you're considering for the future
> would solve that problem.

As mentioned above, I have a few ideas, but I did not think them through
entirely yet. A few thoughts though:

If running synchronous binds/unbinds through the job scheduler is a concern, I
think it could be beneficial to (pre-)allocate page tables for newly requested
mappings without the need to know whether there are existing mappings within
this range already (ideally without tracking page table allocations separately
from GPUVAs), such that we can update the VA space at job execution time. The
same goes for freeing page tables for a range that only partially contains
mappings at all. For that, reference counting page tables per mapping wouldn't
really work.

On the other hand, we need to consider that freeing page tables for a given
range and allocating new page tables for the same or an overlapping range
would need to be ordered in order to avoid races.

> Anyway, thanks for taking the time to answer my question, things are much
> clearer now.

I'm happy to discuss this. Feel free to also reach out on IRC, my nick is
'dakr'.
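A rough sketch of the fence bookkeeping Danilo describes for the EXEC path:
once the backing GEM of a mapping has been looked up (which is why the VA
space view must be up-to-date at submit time), the job fence is added to that
GEM's DMA reservation. The helper itself is invented for illustration; the
dma_resv calls are the regular kernel API:

```c
#include <linux/dma-resv.h>
#include <drm/drm_gem.h>

/* Attach the EXEC job's fence to the reservation object of a GEM that backs
 * one of the job's mappings. */
static int sk_add_job_fence(struct drm_gem_object *obj, struct dma_fence *fence)
{
	int ret;

	ret = dma_resv_lock(obj->resv, NULL);
	if (ret)
		return ret;

	ret = dma_resv_reserve_fences(obj->resv, 1);
	if (!ret)
		dma_resv_add_fence(obj->resv, fence, DMA_RESV_USAGE_BOOKKEEP);

	dma_resv_unlock(obj->resv);
	return ret;
}
```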