[RFC,3/3] KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration

Message ID 20230808071702.20269-1-yan.y.zhao@intel.com
State New
Series Reduce NUMA balance caused TLB-shootdowns in a VM

Commit Message

Yan Zhao Aug. 8, 2023, 7:17 a.m. UTC
Skip zapping pages that are exclusive anonymous and maybe-dma-pinned in the
TDP MMU if the zap is for NUMA migration, to avoid unnecessary zaps and TLB
shootdowns.

For NUMA balancing, change_pmd_range() sends an .invalidate_range_start()
and .invalidate_range_end() pair unconditionally around setting a huge PMD
or PTE to PROT_NONE.

Regardless of whether PROT_NONE ends up being set under change_pmd_range(),
NUMA migration will eventually refuse to migrate exclusive anonymous
maybe-dma-pinned pages in the later try_to_migrate_one() phase and will
restore the affected huge PMD or PTE.

Therefore, if KVM can detect such pages in the zap phase, the zaps and TLB
shootdowns caused by this kind of protection can be avoided.

Corner cases like the one below are still handled correctly:
1. Auto NUMA balancing selects a PMD range to set PROT_NONE in
   change_pmd_range().
2. A page is maybe-dma-pinned at the time .invalidate_range_start() is sent
   with event type MMU_NOTIFY_PROTECTION_VMA,
    ==> so it is not zapped in KVM's secondary MMU.
3. The page is unpinned after .invalidate_range_start() is sent, so it is no
   longer maybe-dma-pinned and is set to PROT_NONE in the primary MMU.
4. For some reason, a page fault is triggered in the primary MMU and the page
   is found to be suitable for NUMA migration.
5. try_to_migrate_one() sends an .invalidate_range_start() notification with
   event type MMU_NOTIFY_CLEAR to KVM, and ===>
   KVM zaps the pages in the secondary MMU.
6. The old page is replaced by a new page in the primary MMU.

If step 4 does not happen, KVM will keep accessing a page that might not be
on the best NUMA node, but this is fixed by the next round of step 1 in auto
NUMA balancing, since change_pmd_range() sends the mmu notification without
checking whether PROT_NONE is already set.

Currently, this patch skips only exclusive anonymous maybe-dma-pinned pages
for the NUMA migration protection case. Other kinds of pages, e.g.,
is_zone_device_page() or PageKsm() pages, can be included later if necessary,
as sketched below.
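An illustrative sketch of such an extension (hypothetical, not part of this
patch; it only widens the skip_pinned check shown in the diff below with the
page types mentioned above):

		if (skip_pinned) {
			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
			struct page *page = kvm_pfn_to_refcounted_page(pfn);
			struct folio *folio;

			if (!page)
				continue;

			/* Page types that NUMA balancing will not migrate anyway. */
			if (is_zone_device_page(page) || PageKsm(page))
				continue;

			folio = page_folio(page);

			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
			    folio_maybe_dma_pinned(folio))
				continue;
		}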

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c     |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h |  4 ++--
 include/linux/kvm_host.h   |  1 +
 virt/kvm/kvm_main.c        |  5 +++++
 5 files changed, 32 insertions(+), 8 deletions(-)
  

Comments

Sean Christopherson Aug. 8, 2023, 11:56 p.m. UTC | #1
On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > >  		    !is_last_spte(iter.old_spte, iter.level))
> > > >  			continue;
> > > >  
> > > > +		if (skip_pinned) {
> > > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > +			struct folio *folio;
> > > > +
> > > > +			if (!page)
> > > > +				continue;
> > > > +
> > > > +			folio = page_folio(page);
> > > > +
> > > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > +			    folio_maybe_dma_pinned(folio))
> > > > +				continue;
> > > > +		}
> > > > +
> > > 
> > > I don't get it..
> > > 
> > > The last patch made it so that the NUMA balancing code doesn't change
> > > page_maybe_dma_pinned() pages to PROT_NONE
> > > 
> > > So why doesn't KVM just check if the current and new SPTE are the same
> > > and refrain from invalidating if nothing changed?
> > 
> > Because KVM doesn't have visibility into the current and new PTEs when the zapping
> > occurs.  The contract for invalidate_range_start() requires that KVM drop all
> > references before returning, and so the zapping occurs before change_pte_range()
> > or change_huge_pmd() have done anything.
> > 
> > > Duplicating the checks here seems very frail to me.
> > 
> > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > can yield different results purely based on refcounts, i.e. KVM could skip pages
> > that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> > in general, I am steadfastly against adding any kind of heuristic to KVM's
> > zapping logic.
> > 
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > to be skipped.
> 
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
> scheme could work?

Or maybe treat this as a userspace config problem?  Pinning DMA pages in a VM,
having a fair amount of remote memory, *and* expecting NUMA balancing to do anything
useful for that VM seems like a userspace problem.

Actually, does NUMA balancing even support this particular scenario?  I see this
in do_numa_page()

	/* TODO: handle PTE-mapped THP */
	if (PageCompound(page))
		goto out_map;

and then for PG_anon_exclusive

	 * ... For now, we only expect it to be
	 * set on tail pages for PTE-mapped THP.
	 */
	PG_anon_exclusive = PG_mappedtodisk,

which IIUC means zapping these pages to do migrate-on-fault will never succeed.

Can we just tell userspace to mbind() the pinned region to explicitly exclude the
VMA(s) from NUMA balancing?
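For reference, a minimal userspace sketch of that suggestion (illustrative
only: the function name, node choice, and error handling are placeholders,
and addr/len must be page-aligned):

	#include <numaif.h>	/* mbind(), MPOL_BIND; link with -lnuma */
	#include <stdio.h>

	/*
	 * Give the to-be-pinned range an explicit memory policy so that
	 * automatic NUMA balancing should leave the VMA alone, as suggested
	 * above.
	 */
	static int exclude_from_numa_balancing(void *addr, size_t len, int node)
	{
		unsigned long nodemask = 1UL << node;

		if (mbind(addr, len, MPOL_BIND, &nodemask,
			  8 * sizeof(nodemask), 0)) {
			perror("mbind");
			return -1;
		}
		return 0;
	}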
  
Yan Zhao Aug. 9, 2023, 12:11 a.m. UTC | #2
On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > >  		    !is_last_spte(iter.old_spte, iter.level))
> > > > >  			continue;
> > > > >  
> > > > > +		if (skip_pinned) {
> > > > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > > +			struct folio *folio;
> > > > > +
> > > > > +			if (!page)
> > > > > +				continue;
> > > > > +
> > > > > +			folio = page_folio(page);
> > > > > +
> > > > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > > +			    folio_maybe_dma_pinned(folio))
> > > > > +				continue;
> > > > > +		}
> > > > > +
> > > > 
> > > > I don't get it..
> > > > 
> > > > The last patch made it so that the NUMA balancing code doesn't change
> > > > page_maybe_dma_pinned() pages to PROT_NONE
> > > > 
> > > > So why doesn't KVM just check if the current and new SPTE are the same
> > > > and refrain from invalidating if nothing changed?
> > > 
> > > Because KVM doesn't have visibility into the current and new PTEs when the zapping
> > > occurs.  The contract for invalidate_range_start() requires that KVM drop all
> > > references before returning, and so the zapping occurs before change_pte_range()
> > > or change_huge_pmd() have done anything.
> > > 
> > > > Duplicating the checks here seems very frail to me.
> > > 
> > > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > > can yield different results purely based on refcounts, i.e. KVM could skip pages
> > > that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> > > in general, I am steadfastly against adding any kind of heuristic to KVM's
> > > zapping logic.
> > > 
> > > This really needs to be fixed in the primary MMU and not require any direct
> > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > > to be skipped.
> > 
> > This likely has the same issue you just described, we don't know if it
> > can be skipped until we iterate over the PTEs and by then it is too
> > late to invoke the notifier. Maybe some kind of abort and restart
> > scheme could work?
> 
> Or maybe treat this as a userspace config problem?  Pinning DMA pages in a VM,
> having a fair amount of remote memory, *and* expecting NUMA balancing to do anything
> useful for that VM seems like a userspace problem.
> 
> Actually, does NUMA balancing even support this particular scenario?  I see this
> in do_numa_page()
> 
> 	/* TODO: handle PTE-mapped THP */
> 	if (PageCompound(page))
> 		goto out_map;
Hi Sean,
I think compound pages are handled in do_huge_pmd_numa_page(), and I did
observe NUMA migration of those kinds of pages.


> and then for PG_anon_exclusive
> 
> 	 * ... For now, we only expect it to be
> 	 * set on tail pages for PTE-mapped THP.
> 	 */
> 	PG_anon_exclusive = PG_mappedtodisk,
> 
> which IIUC means zapping these pages to do migrate-on-fault will never succeed.
> 
> Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> VMA(s) from NUMA balancing?
For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
dynamic, so I think it's hard to mbind() them in advance.

Thanks
Yan
  
Yan Zhao Aug. 9, 2023, 12:29 a.m. UTC | #3
On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> > >  			continue;
> > >  
> > > +		if (skip_pinned) {
> > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > +			struct folio *folio;
> > > +
> > > +			if (!page)
> > > +				continue;
> > > +
> > > +			folio = page_folio(page);
> > > +
> > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > +			    folio_maybe_dma_pinned(folio))
> > > +				continue;
> > > +		}
> > > +
> > 
> > I don't get it..
> > 
> > The last patch made it so that the NUMA balancing code doesn't change
> > page_maybe_dma_pinned() pages to PROT_NONE
> > 
> > So why doesn't KVM just check if the current and new SPTE are the same
> > and refrain from invalidating if nothing changed?
> 
> Because KVM doesn't have visibility into the current and new PTEs when the zapping
> occurs.  The contract for invalidate_range_start() requires that KVM drop all
> references before returning, and so the zapping occurs before change_pte_range()
> > or change_huge_pmd() have done anything.
> 
> > Duplicating the checks here seems very frail to me.
> 
> Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> can yield different results purely based on refcounts, i.e. KVM could skip pages
Do you mean that folio_maybe_dma_pinned() and page_maybe_dma_pinned() could
yield different results?

I chose folio_maybe_dma_pinned() in KVM on purpose: in KVM's
.invalidate_range_start() handler we may get tail pages of a folio, so it is
better to call the folio version, folio_maybe_dma_pinned().

However, in the mm core, i.e. in change_huge_pmd() and change_pte_range(),
the "page" obtained is always the head page of a folio, so although
page_maybe_dma_pinned() is called there, it is equivalent to
folio_maybe_dma_pinned(page_folio(page)).

So, I think the two sides should yield equal results.
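For reference, page_maybe_dma_pinned() is (roughly, as I read the
include/linux/mm.h definition; the exact body may differ across kernel
versions) just a wrapper around the folio variant:

	static inline bool page_maybe_dma_pinned(struct page *page)
	{
		return folio_maybe_dma_pinned(page_folio(page));
	}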

On the other hand, if your concern is that the page refcount is dynamic and
that KVM and the mm core do not check the refcount atomically, I think it is
still fine.
A .invalidate_range_start() notification with event type
MMU_NOTIFY_PROTECTION_VMA only means the corresponding PTE is protected in
the primary MMU; it does not mean the page is unmapped.

In series [1], we can even see that for processes other than KVM, the
PROT_NONE set in the primary MMU for NUMA migration is actually ignored and
the underlying PFNs are still accessed.

So, could KVM open a door for maybe-dma-pinned pages and keep mapping those
pages until
(1) an invalidate notification other than MMU_NOTIFY_PROTECTION_VMA arrives, or
(2) an invalidate notification with MMU_NOTIFY_PROTECTION_VMA arrives again
    with a reduced page refcount?

[1]: https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

Thanks
Yan

> that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> in general, I am steadfastly against adding any kind of heuristic to KVM's
> zapping logic.
> 
> This really needs to be fixed in the primary MMU and not require any direct
> involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> to be skipped.
>
  
Yan Zhao Aug. 9, 2023, 2:58 a.m. UTC | #4
On Tue, Aug 08, 2023 at 11:32:37AM -0300, Jason Gunthorpe wrote:
.... 
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > to be skipped.
> 
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
The problem is that KVM currently performs the zap in its
.invalidate_range_start() handler, so before mm could abort, KVM would
already have done the zap in the secondary MMU.

Or, could we move the zap on the KVM side to the .invalidate_range_end()
handler, only for MMU_NOTIFY_PROTECTION_VMA with MMU_NOTIFIER_RANGE_NUMA?

Then, on the mm side, we could abort and update the range passed to
.invalidate_range_end() so that it covers only the successfully protected
subrange. (A rough sketch follows at the end of this mail.)

Is that acceptable?

> scheme could work?
>
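To make the deferral idea above concrete, a rough pseudocode sketch (helper
and function names here are hypothetical, and the real KVM handlers carry far
more state than shown):

	/* True for the blanket NUMA-protection invalidation from change_pmd_range(). */
	static bool is_numa_prot_range(const struct mmu_notifier_range *range)
	{
		return range->event == MMU_NOTIFY_PROTECTION_VMA &&
		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
	}

	/* .invalidate_range_start(): keep NUMA-protection ranges mapped for now. */
	static int sketch_invalidate_range_start(struct mmu_notifier *mn,
						 const struct mmu_notifier_range *range)
	{
		if (is_numa_prot_range(range))
			return 0;	/* zap deferred to ...range_end() */

		return kvm_zap_hva_range(mn, range->start, range->end); /* hypothetical helper */
	}

	/*
	 * .invalidate_range_end(): by now mm could have trimmed the range to
	 * the subrange that was actually set to PROT_NONE.
	 */
	static void sketch_invalidate_range_end(struct mmu_notifier *mn,
						const struct mmu_notifier_range *range)
	{
		if (is_numa_prot_range(range))
			kvm_zap_hva_range(mn, range->start, range->end); /* hypothetical helper */
	}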
  
Yan Zhao Aug. 9, 2023, 5:06 a.m. UTC | #5
On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
 
> and then for PG_anon_exclusive
> 
> 	 * ... For now, we only expect it to be
> 	 * set on tail pages for PTE-mapped THP.
> 	 */
> 	PG_anon_exclusive = PG_mappedtodisk,
> 

	/*
	 * Depending on the way an anonymous folio can be mapped into a page
	 * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
	 * THP), PG_anon_exclusive may be set only for the head page or for
	 * tail pages of an anonymous folio. For now, we only expect it to be
	 * set on tail pages for PTE-mapped THP.
	 */
	PG_anon_exclusive = PG_mappedtodisk,

Not sure why the comment says PG_anon_exclusive is set only on tail pages
for PTE-mapped THP; what I observed is that only the head page of a compound
page has anon_exclusive set.

And the code path is here:
__handle_mm_fault
  |->create_huge_pmd
     |->do_huge_pmd_anonymous_page //if (vma_is_anonymous(vmf->vma)
     	|->folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
        |->__do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
           |->folio_add_new_anon_rmap
              |->__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
	         |->SetPageAnonExclusive(page)

And this code path has been present since
6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
  
Jason Gunthorpe Aug. 9, 2023, 11:59 a.m. UTC | #6
On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote:

> > Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> > VMA(s) from NUMA balancing?

> For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
> dynamic, so I think it's hard to mbind() them in advance.

It is hard to view the mediated devices path as a performance path
that deserves this kind of intervention :\

Jason
  
Yan Zhao Aug. 10, 2023, 9:08 a.m. UTC | #7
On Wed, Aug 09, 2023 at 08:59:16AM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote:
> 
> > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> > > VMA(s) from NUMA balancing?
> 
> > For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
> > dynamic, so I think it's hard to mbind() them in advance.
> 
> It is hard to view the mediated devices path as a performance path
> that deserves this kind of intervention :\

Though you are right, maybe we can still make it better?

What about introducing a new callback that is invoked once a page has
actually been made PROT_NONE for NUMA balancing?

Then, rather than duplicating mm logic in KVM, KVM could rely on this
callback and unmap a page in the secondary MMU only when it is indeed
PROT_NONE-protected for NUMA balancing, excluding pages that are obviously
not NUMA-migratable. (A sketch of the callback shape follows.)
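A hypothetical shape for such a callback (the name and signature here are
purely illustrative; the RFC v2 linked below may define it differently):

	/*
	 * New, optional mmu_notifier op: invoked only after [start, end) in
	 * @mm has actually been made PROT_NONE for NUMA balancing, so a
	 * secondary MMU can unmap exactly that range instead of reacting to
	 * the blanket invalidate_range_start() sent from change_pmd_range().
	 */
	void (*numa_protect)(struct mmu_notifier *mn, struct mm_struct *mm,
			     unsigned long start, unsigned long end);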

I sent an RFC v2 (commit messages and comments are not well polished) to
show this idea:
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/

Do you think we can continue the work?

Thanks a lot for your review!
  

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d72f2b20f430..9dccc25b1389 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6307,8 +6307,8 @@  void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	if (tdp_mmu_enabled) {
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
-						      gfn_end, true, flush);
+			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end,
+						      true, flush, false);
 	}
 
 	if (flush)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6250bd3d20c1..17762b5a2b98 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -838,7 +838,8 @@  bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
  * operation can cause a soft lockup.
  */
 static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
-			      gfn_t start, gfn_t end, bool can_yield, bool flush)
+			      gfn_t start, gfn_t end, bool can_yield, bool flush,
+			      bool skip_pinned)
 {
 	struct tdp_iter iter;
 
@@ -859,6 +860,21 @@  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
+		if (skip_pinned) {
+			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
+			struct page *page = kvm_pfn_to_refcounted_page(pfn);
+			struct folio *folio;
+
+			if (!page)
+				continue;
+
+			folio = page_folio(page);
+
+			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
+			    folio_maybe_dma_pinned(folio))
+				continue;
+		}
+
 		tdp_mmu_iter_set_spte(kvm, &iter, 0);
 		flush = true;
 	}
@@ -878,12 +894,13 @@  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
  * more SPTEs were zapped since the MMU lock was last acquired.
  */
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
-			   bool can_yield, bool flush)
+			   bool can_yield, bool flush, bool skip_pinned)
 {
 	struct kvm_mmu_page *root;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
+		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush,
+					  skip_pinned);
 
 	return flush;
 }
@@ -1147,7 +1164,8 @@  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush)
 {
 	return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
-				     range->end, range->may_block, flush);
+				     range->end, range->may_block, flush,
+				     range->skip_pinned);
 }
 
 typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..2a9de44bc5c3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -20,8 +20,8 @@  __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared);
 
-bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush);
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
+			   bool can_yield, bool flush, bool skip_pinned);
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9125d0ab642d..f883d6b59545 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -266,6 +266,7 @@  struct kvm_gfn_range {
 	gfn_t end;
 	union kvm_mmu_notifier_arg arg;
 	bool may_block;
+	bool skip_pinned;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f84ef9399aee..1202c1daa568 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -532,6 +532,7 @@  struct kvm_hva_range {
 	on_unlock_fn_t on_unlock;
 	bool flush_on_ret;
 	bool may_block;
+	bool skip_pinned;
 };
 
 /*
@@ -595,6 +596,7 @@  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			 */
 			gfn_range.arg = range->arg;
 			gfn_range.may_block = range->may_block;
+			gfn_range.skip_pinned = range->skip_pinned;
 
 			/*
 			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -754,6 +756,9 @@  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
+		.skip_pinned	= test_bit(MMF_HAS_PINNED, &range->mm->flags) &&
+				  (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
+				  (range->flags & MMU_NOTIFIER_RANGE_NUMA),
 	};
 
 	trace_kvm_unmap_hva_range(range->start, range->end);