vfio/pci: take mmap write lock for io_remap_pfn_range

Message ID 20230508125842.28193-1-yan.y.zhao@intel.com
State New
Series vfio/pci: take mmap write lock for io_remap_pfn_range

Commit Message

Yan Zhao May 8, 2023, 12:58 p.m. UTC
In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
pin_user_pages_remote() returns -EFAULT.

follow_fault_pfn
 fixup_user_fault
  handle_mm_fault
   __handle_mm_fault
    handle_pte_fault
     do_fault
      do_shared_fault
       __do_fault
        vfio_pci_mmap_fault
         io_remap_pfn_range
          remap_pfn_range
           track_pfn_remap
            vm_flags_set         ==> mmap_assert_write_locked(vma->vm_mm)
           remap_pfn_range_notrack
            vm_flags_set         ==> mmap_assert_write_locked(vma->vm_mm)

As io_remap_pfn_range() calls vm_flags_set() to update vm_flags [1],
the mmap write lock must be held when it runs.
So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
mmap write lock instead.

[1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
commit 1c71222e5f23
("mm: replace vma->vm_flags direct modifications with modifier calls")

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)


base-commit: 705b004ee377b789e39ae237519bab714297ac83
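
For reference, the wrapper named in [1] looks roughly like this (paraphrased
from commit bc292ab00f6c; the exact body has since changed, but the assertion
is the point):

static inline void vm_flags_set(struct vm_area_struct *vma,
				vm_flags_t flags)
{
	/* this is what fires when a fault handler holds only the read lock */
	mmap_assert_write_locked(vma->vm_mm);
	vma->vm_flags |= flags;
}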
  

Comments

Jason Gunthorpe May 8, 2023, 4:48 p.m. UTC | #1
On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> pin_user_pages_remote() returns -EFAULT.
> 
> follow_fault_pfn
>  fixup_user_fault
>   handle_mm_fault
>    __handle_mm_fault
>     handle_pte_fault
>      do_fault
>       do_shared_fault
>        __do_fault
>         vfio_pci_mmap_fault
>          io_remap_pfn_range
>           remap_pfn_range
>            track_pfn_remap
>             vm_flags_set         ==> mmap_assert_write_locked(vma->vm_mm)
>            remap_pfn_range_notrack
>             vm_flags_set         ==> mmap_assert_write_locked(vma->vm_mm)
> 
> As io_remap_pfn_range() calls vm_flags_set() to update vm_flags [1],
> the mmap write lock must be held when it runs.
> So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> mmap write lock instead.
> 
> [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> commit 1c71222e5f23
> ("mm: replace vma->vm_flags direct modifications with modifier calls")
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..5082f89152b3 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	struct vfio_pci_mmap_vma *mmap_vma;
>  	vm_fault_t ret = VM_FAULT_NOPAGE;
>  
> +	mmap_assert_locked(vma->vm_mm);
> +	mmap_read_unlock(vma->vm_mm);
> +
> +	if (mmap_write_lock_killable(vma->vm_mm))
> +		return VM_FAULT_RETRY;

Certainly not..

I'm not sure how to resolve this properly, set the flags in advance?

The address space conversion?

Jason
  
Alex Williamson May 8, 2023, 8:57 p.m. UTC | #2
On Mon, 8 May 2023 13:48:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> > In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> > pin_user_pages_remote() returns -EFAULT.
> > 
> > follow_fault_pfn
> >  fixup_user_fault
> >   handle_mm_fault
> >    __handle_mm_fault
> >     handle_pte_fault
> >      do_fault
> >       do_shared_fault
> >        __do_fault
> >         vfio_pci_mmap_fault
> >          io_remap_pfn_range
> >           remap_pfn_range
> >            track_pfn_remap
> >             vm_flags_set         ==> mmap_assert_write_locked(vma->vm_mm)
> >            remap_pfn_range_notrack
> >             vm_flags_set         ==> mmap_assert_write_locked(vma->vm_mm)
> > 
> > As io_remap_pfn_range() calls vm_flags_set() to update vm_flags [1],
> > the mmap write lock must be held when it runs.
> > So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> > mmap write lock instead.
> > 
> > [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> > commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> > commit 1c71222e5f23
> > ("mm: replace vma->vm_flags direct modifications with modifier calls")
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index a5ab416cf476..5082f89152b3 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  	struct vfio_pci_mmap_vma *mmap_vma;
> >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> >  
> > +	mmap_assert_locked(vma->vm_mm);
> > +	mmap_read_unlock(vma->vm_mm);
> > +
> > +	if (mmap_write_lock_killable(vma->vm_mm))
> > +		return VM_FAULT_RETRY;  
> 
> Certainly not..
> 
> I'm not sure how to resolve this properly, set the flags in advance?
> 
> The address space conversion?

We already try to set the flags in advance, but there are some
architectural flags like VM_PAT that make that tricky.  Cedric has been
looking at inserting individual pages with vmf_insert_pfn(), but that
incurs a lot more faults and therefore latency vs remapping the entire
vma on fault.  I'm not convinced that we shouldn't just attempt to
remove the fault handler entirely, but I haven't tried it yet to know
what gotchas are down that path.  Thanks,

Alex
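
For context, the per-page vmf_insert_pfn() approach mentioned above would
look roughly like the sketch below in the fault handler (an illustration
only, not code from Cedric's series; locking against zapping is omitted):

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pfn;

	/* vm_pgoff holds the base PFN of the mapping, so the PFN backing
	 * the faulting address is a fixed offset from it */
	pfn = vma->vm_pgoff + ((vmf->address - vma->vm_start) >> PAGE_SHIFT);

	/* insert a single PTE; every other page takes its own fault,
	 * which is where the extra latency comes from */
	return vmf_insert_pfn(vma, vmf->address, pfn);
}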
  
Jason Gunthorpe May 10, 2023, 8:41 p.m. UTC | #3
On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:

> We already try to set the flags in advance, but there are some
> architectural flags like VM_PAT that make that tricky.  Cedric has been
> looking at inserting individual pages with vmf_insert_pfn(), but that
> incurs a lot more faults and therefore latency vs remapping the entire
> vma on fault.  I'm not convinced that we shouldn't just attempt to
> remove the fault handler entirely, but I haven't tried it yet to know
> what gotchas are down that path.  Thanks,

I thought we did it like this because there were races otherwise with
PTE insertion and zapping? I don't remember well anymore.

I vaguely remember the address_space conversion might help remove the
fault handler?

Jason
  
Yan Zhao May 11, 2023, 6:56 a.m. UTC | #4
On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> 
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky.  Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault.  I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path.  Thanks,
> 
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
> 
> I vaguely remember the address_space conversion might help remove the
> fault handler?
>
What about calling vmf_insert_pfn() in bulk as below?
And what is the address_space conversion?


diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..1476e537f593 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
        struct vfio_pci_core_device *vdev = vma->vm_private_data;
        struct vfio_pci_mmap_vma *mmap_vma;
        vm_fault_t ret = VM_FAULT_NOPAGE;
+       unsigned long base_pfn, offset, i;

        mutex_lock(&vdev->vma_lock);
        down_read(&vdev->memory_lock);
@@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
                        goto up_out;
        }

-       if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-                              vma->vm_end - vma->vm_start,
-                              vma->vm_page_prot)) {
-               ret = VM_FAULT_SIGBUS;
-               zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-               goto up_out;
+       /* vm_pgoff holds the base PFN of the vma (see the removed call above) */
+       base_pfn = vma->vm_pgoff;
+       for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
+               offset = (i - vma->vm_start) >> PAGE_SHIFT;
+               ret = vmf_insert_pfn(vma, i, base_pfn + offset);
+               if (ret != VM_FAULT_NOPAGE) {
+                       zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+                       goto up_out;
+               }
        }

        if (__vfio_pci_add_vma(vdev, vma)) {
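
(A note on why vmf_insert_pfn() avoids the lockdep splat here: it requires
VM_PFNMAP to already be set on the vma, and vfio sets that at mmap time
while the mmap write lock is held, so no fault-time vm_flags_set() is
needed, unlike with io_remap_pfn_range().)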
  
Cédric Le Goater May 11, 2023, 7:32 a.m. UTC | #5
On 5/10/23 22:41, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> 
>> We already try to set the flags in advance, but there are some
>> architectural flags like VM_PAT that make that tricky.  Cedric has been
>> looking at inserting individual pages with vmf_insert_pfn(), but that
>> incurs a lot more faults and therefore latency vs remapping the entire
>> vma on fault.  I'm not convinced that we shouldn't just attempt to
>> remove the fault handler entirely, but I haven't tried it yet to know
>> what gotchas are down that path.  Thanks,

OTOH I didn't see any noticeable slowdowns in the tests I did with
NICs, IGD and NVIDIA GPU pass-through. I lack devices with large
BARs though. If anyone has some time for it, here are the commits:

   https://github.com/legoater/linux/commits/vfio

The first does a one-page insert and fixes the lockdep issues we are
seeing with the recent series:

   https://lore.kernel.org/all/20230126193752.297968-1-surenb@google.com/

The second adds some statistics. For the NVIDIA GPU, faults reach 800k.

Thanks,

C.

> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
> 
> I vaguely remember the address_space conversion might help remove the
> fault handler?
> 
> Jason
>
  
Cédric Le Goater May 11, 2023, 7:38 a.m. UTC | #6
On 5/11/23 08:56, Yan Zhao wrote:
> On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
>> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>>
>>> We already try to set the flags in advance, but there are some
>>> architectural flags like VM_PAT that make that tricky.  Cedric has been
>>> looking at inserting individual pages with vmf_insert_pfn(), but that
>>> incurs a lot more faults and therefore latency vs remapping the entire
>>> vma on fault.  I'm not convinced that we shouldn't just attempt to
>>> remove the fault handler entirely, but I haven't tried it yet to know
>>> what gotchas are down that path.  Thanks,
>>
>> I thought we did it like this because there were races otherwise with
>> PTE insertion and zapping? I don't remember well anymore.
>>
>> I vaguely remember the address_space conversion might help remove the
>> fault handler?
>>
> What about calling vmf_insert_pfn() in bulk as below?

This works too; it is slightly slower than the io_remap_pfn_range() call
but doesn't have the lockdep issues.

Thanks,

C.

> And what is the address_space conversion?
> 
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..1476e537f593 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>          struct vfio_pci_core_device *vdev = vma->vm_private_data;
>          struct vfio_pci_mmap_vma *mmap_vma;
>          vm_fault_t ret = VM_FAULT_NOPAGE;
> +       unsigned long base_pfn, offset, i;
> 
>          mutex_lock(&vdev->vma_lock);
>          down_read(&vdev->memory_lock);
> @@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>                          goto up_out;
>          }
> 
> -       if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -                              vma->vm_end - vma->vm_start,
> -                              vma->vm_page_prot)) {
> -               ret = VM_FAULT_SIGBUS;
> -               zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> -               goto up_out;
> +       /* vm_pgoff holds the base PFN of the vma (see the removed call above) */
> +       base_pfn = vma->vm_pgoff;
> +       for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> +               offset = (i - vma->vm_start) >> PAGE_SHIFT;
> +               ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> +               if (ret != VM_FAULT_NOPAGE) {
> +                       zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> +                       goto up_out;
> +               }
>          }
> 
>          if (__vfio_pci_add_vma(vdev, vma)) {
>
  
Alex Williamson May 11, 2023, 4:07 p.m. UTC | #7
On Wed, 10 May 2023 17:41:06 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> 
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky.  Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault.  I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path.  Thanks,  
> 
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.

TBH, I don't recall if we tried a synchronous approach previously.  The
benefit of the faulting approach was that we could track the minimum
set of vmas which are actually making use of the mapping and throw that
tracking list away when zapping.  Without that, we need to add vmas
both on mmap and in vm_ops.open, removing only in vm_ops.close, and
acquire all the proper mm locking for each vma to re-insert the
mappings.

> I vaguely remember the address_space conversion might help remove the
> fault handler?

Yes, this did remove the fault handler entirely, it's (obviously)
dropped off my radar, but perhaps in the interim we could switch to
vmf_insert_pfn() and revive the address space series to eventually
remove the fault handling and vma list altogether.

For reference, I think this was the last posting of the address space
series:

https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Thanks,
Alex
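
As a rough illustration of the bookkeeping Alex describes above (a sketch
only, reusing the existing vma_lock/vma_list helpers in vfio_pci_core.c;
error handling omitted):

static void vfio_pci_mmap_open(struct vm_area_struct *vma)
{
	struct vfio_pci_core_device *vdev = vma->vm_private_data;

	mutex_lock(&vdev->vma_lock);
	__vfio_pci_add_vma(vdev, vma);	/* -ENOMEM ignored in this sketch */
	mutex_unlock(&vdev->vma_lock);
}

static void vfio_pci_mmap_close(struct vm_area_struct *vma)
{
	struct vfio_pci_core_device *vdev = vma->vm_private_data;
	struct vfio_pci_mmap_vma *mmap_vma;

	mutex_lock(&vdev->vma_lock);
	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
		if (mmap_vma->vma == vma) {
			list_del(&mmap_vma->vma_next);
			kfree(mmap_vma);
			break;
		}
	}
	mutex_unlock(&vdev->vma_lock);
}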
  
Jason Gunthorpe May 11, 2023, 5:47 p.m. UTC | #8
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:

> > I vaguely remember the address_space conversion might help remove the
> > fault handler?
> 
> Yes, this did remove the fault handler entirely, it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.

vmf_insert_pfn() technically isn't supposed to be used for MMIO..

Eg it doesn't do the PAT stuff on x86 that is causing this problem in
the first place.

So doing the address_space conversion series seems like the best fix. It
has been mislocked for a long time; I suspect there isn't a real
urgent problem beyond the fact that we now have lockdep annotations to
catch the mislocking.

Jason
  
Yan Zhao May 12, 2023, 8:02 a.m. UTC | #9
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:
> On Wed, 10 May 2023 17:41:06 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> > 
> > > We already try to set the flags in advance, but there are some
> > > architectural flags like VM_PAT that make that tricky.  Cedric has been
> > > looking at inserting individual pages with vmf_insert_pfn(), but that
> > > incurs a lot more faults and therefore latency vs remapping the entire
> > > vma on fault.  I'm not convinced that we shouldn't just attempt to
> > > remove the fault handler entirely, but I haven't tried it yet to know
> > > what gotchas are down that path.  Thanks,  
> > 
> > I thought we did it like this because there were races otherwise with
> > PTE insertion and zapping? I don't remember well anymore.
> 
> TBH, I don't recall if we tried a synchronous approach previously.  The
> benefit of the faulting approach was that we could track the minimum
> set of vmas which are actually making use of the mapping and throw that
> tracking list away when zapping.  Without that, we need to add vmas
> both on mmap and in vm_ops.open, removing only in vm_ops.close, and
> acquire all the proper mm locking for each vma to re-insert the
> mappings.
> 
> > I vaguely remember the address_space conversion might help remove the
> > fault handler?
> 
> Yes, this did remove the fault handler entirely, it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.
> 
> For reference, I think this was the last posting of the address space
> series:
> 
> https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Just took a quick look at this series.
One question: it looks like it still needs to call io_remap_pfn_range() in
places like vfio_basic_config_write() for PCI_COMMAND and on device reset,
so the mmap write lock is still required around vdev->memory_lock.
  

Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..5082f89152b3 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1687,6 +1687,12 @@  static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	struct vfio_pci_mmap_vma *mmap_vma;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
 
+	mmap_assert_locked(vma->vm_mm);
+	mmap_read_unlock(vma->vm_mm);
+
+	if (mmap_write_lock_killable(vma->vm_mm))
+		return VM_FAULT_RETRY;
+
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
@@ -1726,6 +1732,17 @@  static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 up_out:
 	up_read(&vdev->memory_lock);
 	mutex_unlock(&vdev->vma_lock);
+	mmap_write_unlock(vma->vm_mm);
+
+	/* If the PTEs were installed successfully, set the completed
+	 * flag to indicate the mmap lock has been released; otherwise
+	 * retake the read lock.
+	 */
+	if (ret == VM_FAULT_NOPAGE)
+		ret |= VM_FAULT_COMPLETED;
+	else
+		mmap_read_lock(vma->vm_mm);
+
 	return ret;
 }