Message ID | 20230508125842.28193-1-yan.y.zhao@intel.com |
---|---|
State | New |
Headers |
From: Yan Zhao <yan.y.zhao@intel.com>
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: alex.williamson@redhat.com, kevin.tian@intel.com, jgg@nvidia.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, Yan Zhao <yan.y.zhao@intel.com>
Subject: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range
Date: Mon, 8 May 2023 20:58:42 +0800
Message-Id: <20230508125842.28193-1-yan.y.zhao@intel.com>
X-Mailer: git-send-email 2.17.1
Series | vfio/pci: take mmap write lock for io_remap_pfn_range |
Commit Message
Yan Zhao
May 8, 2023, 12:58 p.m. UTC
In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
pin_user_pages_remote() returns -EFAULT.
follow_fault_pfn
  fixup_user_fault
    handle_mm_fault
      handle_mm_fault
        do_fault
          do_shared_fault
            do_fault
              __do_fault
                vfio_pci_mmap_fault
                  io_remap_pfn_range
                    remap_pfn_range
                      track_pfn_remap
                        vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
                      remap_pfn_range_notrack
                        vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
As io_remap_pfn_range() calls vm_flags_set() to update vm_flags [1],
the mmap write lock must be held.
So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
mmap write lock instead.
[1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
commit 1c71222e5f23
("mm: replace vma->vm_flags direct modifications with modifier calls")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
base-commit: 705b004ee377b789e39ae237519bab714297ac83
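For context on why the call chain above trips an assertion, here is a rough, paraphrased sketch of the wrapper introduced by the referenced commit bc292ab00f6c (see the commit itself for the exact definition):

/*
 * Paraphrased from commit bc292ab00f6c ("mm: introduce vma->vm_flags
 * wrapper functions"); not the exact upstream code.
 */
static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
{
        /* Fires when the fault path holds only the mmap read lock */
        mmap_assert_write_locked(vma->vm_mm);
        vma->vm_flags |= flags;
}

Both track_pfn_remap() and remap_pfn_range_notrack() in the chain above update vm_flags through this wrapper, which is why io_remap_pfn_range() now expects its caller to hold the mmap lock for write.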
Comments
On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> In VFIO type1, vaddr_get_pfns() will try fault in MMIO PFNs after
> pin_user_pages_remote() returns -EFAULT.
>
> follow_fault_pfn
>   fixup_user_fault
>     handle_mm_fault
>       handle_mm_fault
>         do_fault
>           do_shared_fault
>             do_fault
>               __do_fault
>                 vfio_pci_mmap_fault
>                   io_remap_pfn_range
>                     remap_pfn_range
>                       track_pfn_remap
>                         vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>                       remap_pfn_range_notrack
>                         vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>
> As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> holding of mmap write lock is required.
> So, update vfio_pci_mmap_fault() to drop mmap read lock and take mmap
> write lock.
>
> [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> commit 1c71222e5f23
> ("mm: replace vma->vm_flags direct modifications with modifier calls")
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..5082f89152b3 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>          struct vfio_pci_mmap_vma *mmap_vma;
>          vm_fault_t ret = VM_FAULT_NOPAGE;
>
> +        mmap_assert_locked(vma->vm_mm);
> +        mmap_read_unlock(vma->vm_mm);
> +
> +        if (mmap_write_lock_killable(vma->vm_mm))
> +                return VM_FAULT_RETRY;

Certainly not..

I'm not sure how to resolve this properly, set the flags in advance?

The address space conversion?

Jason
On Mon, 8 May 2023 13:48:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> > In VFIO type1, vaddr_get_pfns() will try fault in MMIO PFNs after
> > pin_user_pages_remote() returns -EFAULT.
> >
> > follow_fault_pfn
> >   fixup_user_fault
> >     handle_mm_fault
> >       handle_mm_fault
> >         do_fault
> >           do_shared_fault
> >             do_fault
> >               __do_fault
> >                 vfio_pci_mmap_fault
> >                   io_remap_pfn_range
> >                     remap_pfn_range
> >                       track_pfn_remap
> >                         vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >                       remap_pfn_range_notrack
> >                         vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >
> > As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> > holding of mmap write lock is required.
> > So, update vfio_pci_mmap_fault() to drop mmap read lock and take mmap
> > write lock.
> >
> > [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> > commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> > commit 1c71222e5f23
> > ("mm: replace vma->vm_flags direct modifications with modifier calls")
> >
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index a5ab416cf476..5082f89152b3 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >          struct vfio_pci_mmap_vma *mmap_vma;
> >          vm_fault_t ret = VM_FAULT_NOPAGE;
> >
> > +        mmap_assert_locked(vma->vm_mm);
> > +        mmap_read_unlock(vma->vm_mm);
> > +
> > +        if (mmap_write_lock_killable(vma->vm_mm))
> > +                return VM_FAULT_RETRY;
>
> Certainly not..
>
> I'm not sure how to resolve this properly, set the flags in advance?
>
> The address space conversion?

We already try to set the flags in advance, but there are some
architectural flags like VM_PAT that make that tricky.  Cedric has been
looking at inserting individual pages with vmf_insert_pfn(), but that
incurs a lot more faults and therefore latency vs remapping the entire
vma on fault.  I'm not convinced that we shouldn't just attempt to
remove the fault handler entirely, but I haven't tried it yet to know
what gotchas are down that path.  Thanks,

Alex
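For illustration only, a minimal sketch of the per-page approach mentioned above (hypothetical code, not Cedric's actual patches: it assumes the memory_lock semaphore and a __vfio_pci_memory_enabled() helper as in vfio_pci_core.c, and omits the vma-list bookkeeping):

/*
 * Hypothetical single-page fault handler built on vmf_insert_pfn(); a sketch
 * of the approach discussed above, not a posted patch.
 */
static vm_fault_t vfio_pci_mmap_fault_onepage(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vfio_pci_core_device *vdev = vma->vm_private_data;
        unsigned long pfn;
        vm_fault_t ret = VM_FAULT_SIGBUS;

        down_read(&vdev->memory_lock);

        if (__vfio_pci_memory_enabled(vdev)) {
                /* vm_pgoff carries the base PFN of the BAR backing this vma */
                pfn = vma->vm_pgoff + ((vmf->address - vma->vm_start) >> PAGE_SHIFT);
                /* Installs a single pte; no mmap write lock or vm_flags update needed */
                ret = vmf_insert_pfn(vma, vmf->address, pfn);
        }

        up_read(&vdev->memory_lock);
        return ret;
}

The trade-off described above is visible here: every page of the BAR takes its own fault, whereas io_remap_pfn_range() populates the whole vma on the first fault.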
On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:

> We already try to set the flags in advance, but there are some
> architectural flags like VM_PAT that make that tricky.  Cedric has been
> looking at inserting individual pages with vmf_insert_pfn(), but that
> incurs a lot more faults and therefore latency vs remapping the entire
> vma on fault.  I'm not convinced that we shouldn't just attempt to
> remove the fault handler entirely, but I haven't tried it yet to know
> what gotchas are down that path.  Thanks,

I thought we did it like this because there were races otherwise with
PTE insertion and zapping? I don't remember well anymore.

I vaguely remember the address_space conversion might help remove the
fault handler?

Jason
On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky.  Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault.  I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path.  Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaugely remember the address_space conversion might help remove the
> fault handler?
>
What about calling vmf_insert_pfn() in bulk as below?
And what is address_space conversion?


diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..1476e537f593 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
         struct vfio_pci_core_device *vdev = vma->vm_private_data;
         struct vfio_pci_mmap_vma *mmap_vma;
         vm_fault_t ret = VM_FAULT_NOPAGE;
+        unsigned long base_pfn, offset, i;
 
         mutex_lock(&vdev->vma_lock);
         down_read(&vdev->memory_lock);
@@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
                 goto up_out;
         }
 
-        if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-                               vma->vm_end - vma->vm_start,
-                               vma->vm_page_prot)) {
-                ret = VM_FAULT_SIGBUS;
-                zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-                goto up_out;
+        base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+        base_pfn += vma->vm_pgoff;
+        for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
+                offset = (i - vma->vm_start) >> PAGE_SHIFT;
+                ret = vmf_insert_pfn(vma, i, base_pfn + offset);
+                if (ret != VM_FAULT_NOPAGE) {
+                        zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+                        goto up_out;
+                }
         }
 
         if (__vfio_pci_add_vma(vdev, vma)) {
On 5/10/23 22:41, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
>> We already try to set the flags in advance, but there are some
>> architectural flags like VM_PAT that make that tricky.  Cedric has been
>> looking at inserting individual pages with vmf_insert_pfn(), but that
>> incurs a lot more faults and therefore latency vs remapping the entire
>> vma on fault.  I'm not convinced that we shouldn't just attempt to
>> remove the fault handler entirely, but I haven't tried it yet to know
>> what gotchas are down that path.  Thanks,

OTOH I didn't see any noticeable slowdowns in the tests I did with NICs,
IGD and NVIDIA GPU pass-through. I lack devices with large BARs though.

If anyone has some time for it, here are commits :

  https://github.com/legoater/linux/commits/vfio

First does a one page insert and fixes the lockdep issues we are seeing
with the recent series:

  https://lore.kernel.org/all/20230126193752.297968-1-surenb@google.com/

Second adds some statistics. For the NVIDIA GPU, faults reach 800k.

Thanks,

C.

> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaugely remember the address_space conversion might help remove the
> fault handler?
>
> Jason
>
On 5/11/23 08:56, Yan Zhao wrote:
> On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
>> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>>
>>> We already try to set the flags in advance, but there are some
>>> architectural flags like VM_PAT that make that tricky.  Cedric has been
>>> looking at inserting individual pages with vmf_insert_pfn(), but that
>>> incurs a lot more faults and therefore latency vs remapping the entire
>>> vma on fault.  I'm not convinced that we shouldn't just attempt to
>>> remove the fault handler entirely, but I haven't tried it yet to know
>>> what gotchas are down that path.  Thanks,
>>
>> I thought we did it like this because there were races otherwise with
>> PTE insertion and zapping? I don't remember well anymore.
>>
>> I vaugely remember the address_space conversion might help remove the
>> fault handler?
>>
> What about calling vmf_insert_pfn() in bulk as below?

This works too, it is slightly slower than the io_remap_pfn_range()
call but doesn't have the lockdep issues.

Thanks,

C.

> And what is address_space conversion?
>
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..1476e537f593 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>          struct vfio_pci_core_device *vdev = vma->vm_private_data;
>          struct vfio_pci_mmap_vma *mmap_vma;
>          vm_fault_t ret = VM_FAULT_NOPAGE;
> +        unsigned long base_pfn, offset, i;
>  
>          mutex_lock(&vdev->vma_lock);
>          down_read(&vdev->memory_lock);
> @@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>                  goto up_out;
>          }
>  
> -        if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -                               vma->vm_end - vma->vm_start,
> -                               vma->vm_page_prot)) {
> -                ret = VM_FAULT_SIGBUS;
> -                zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> -                goto up_out;
> +        base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +        base_pfn += vma->vm_pgoff;
> +        for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> +                offset = (i - vma->vm_start) >> PAGE_SHIFT;
> +                ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> +                if (ret != VM_FAULT_NOPAGE) {
> +                        zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> +                        goto up_out;
> +                }
>          }
>  
>          if (__vfio_pci_add_vma(vdev, vma)) {
>
On Wed, 10 May 2023 17:41:06 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky.  Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault.  I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path.  Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.

TBH, I don't recall if we tried a synchronous approach previously.  The
benefit of the faulting approach was that we could track the minimum
set of vmas which are actually making use of the mapping and throw that
tracking list away when zapping.  Without that, we need to add vmas
both on mmap and in vm_ops.open, removing only in vm_ops.close, and
acquire all the proper mm locking for each vma to re-insert the
mappings.

> I vaugely remember the address_space conversion might help remove the
> fault handler?

Yes, this did remove the fault handler entirely, it's (obviously)
dropped off my radar, but perhaps in the interim we could switch to
vmf_insert_pfn() and revive the address space series to eventually
remove the fault handling and vma list altogether.

For reference, I think this was the last posting of the address space
series:

https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Thanks,

Alex
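As a rough illustration of the address-space idea being discussed, assuming all BAR mmaps are backed by a single inode/address_space (as in the referenced series) and using a hypothetical helper name:

/*
 * Sketch only: zap every user mapping of the device's BARs through the
 * backing address_space instead of walking a driver-maintained vma list.
 */
static void vfio_pci_zap_bars_sketch(struct inode *inode)
{
        /*
         * Clears the ptes of every vma currently mapping this file via the
         * mapping's reverse map; no driver-side vma tracking is required.
         */
        unmap_mapping_range(inode->i_mapping, 0, 0, 1 /* even_cows */);
}

This is the piece that removes the need for the driver-side vma list; the series referenced above builds on it to drop the fault handler as well.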
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:
> > I vaugely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely, it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.

vmf_insert_pfn() technically isn't supposed to be used for MMIO.. Eg
it doesn't do the PAT stuff on x86 that is causing this problem in the
first place.

So doing the address space removing series seems like the best fix.

It has been mislocked for a long time, I suspect there isn't a real
urgent problem beyond we actually have lockdep annotations to catch the
mislocking now.

Jason
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:
> On Wed, 10 May 2023 17:41:06 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> >
> > > We already try to set the flags in advance, but there are some
> > > architectural flags like VM_PAT that make that tricky.  Cedric has been
> > > looking at inserting individual pages with vmf_insert_pfn(), but that
> > > incurs a lot more faults and therefore latency vs remapping the entire
> > > vma on fault.  I'm not convinced that we shouldn't just attempt to
> > > remove the fault handler entirely, but I haven't tried it yet to know
> > > what gotchas are down that path.  Thanks,
> >
> > I thought we did it like this because there were races otherwise with
> > PTE insertion and zapping? I don't remember well anymore.
>
> TBH, I don't recall if we tried a synchronous approach previously.  The
> benefit of the faulting approach was that we could track the minimum
> set of vmas which are actually making use of the mapping and throw that
> tracking list away when zapping.  Without that, we need to add vmas
> both on mmap and in vm_ops.open, removing only in vm_ops.close, and
> acquire all the proper mm locking for each vma to re-insert the
> mappings.
>
> > I vaugely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely, it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.
>
> For reference, I think this was the last posting of the address space
> series:
>
> https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Just took a quick look at this series.
A question: it looks like it still needs to call io_remap_pfn_range() in
places like vfio_basic_config_write() for PCI_COMMAND and on device reset,
so the mmap write lock is still required around vdev->memory_lock.
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..5082f89152b3 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
         struct vfio_pci_mmap_vma *mmap_vma;
         vm_fault_t ret = VM_FAULT_NOPAGE;
 
+        mmap_assert_locked(vma->vm_mm);
+        mmap_read_unlock(vma->vm_mm);
+
+        if (mmap_write_lock_killable(vma->vm_mm))
+                return VM_FAULT_RETRY;
+
         mutex_lock(&vdev->vma_lock);
         down_read(&vdev->memory_lock);
 
@@ -1726,6 +1732,17 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 up_out:
         up_read(&vdev->memory_lock);
         mutex_unlock(&vdev->vma_lock);
+        mmap_write_unlock(vma->vm_mm);
+
+        /* If PTE is installed successfully, add the completed flag to
+         * indicate mmap lock is released,
+         * otherwise retake the read lock
+         */
+        if (ret == VM_FAULT_NOPAGE)
+                ret |= VM_FAULT_COMPLETED;
+        else
+                mmap_read_lock(vma->vm_mm);
+
         return ret;
 }
 