[v2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

Message ID 20221023025047.470646-1-mike.kravetz@oracle.com
State New
Series: [v2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

Commit Message

Mike Kravetz Oct. 23, 2022, 2:50 a.m. UTC
  madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
page tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be
removed and deletes the vma_lock to prevent pmd sharing as the vma is
on the way out.  In the case of madvise(MADV_DONTNEED) the vma remains,
but the missing vma_lock prevents pmd sharing and could potentially
lead to issues with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
was not set in new pages added to the page table.  This resulted in
pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.

Create a new routine clear_hugetlb_page_range() that can be called from
madvise(MADV_DONTNEED) for hugetlb vmas.  It has the same setup as
zap_page_range, but does not delete the vma_lock.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
---
v2 - Fix build issues associated with !CONFIG_ADVISE_SYSCALLS

 include/linux/hugetlb.h |  7 +++++
 mm/hugetlb.c            | 62 +++++++++++++++++++++++++++++++++--------
 mm/madvise.c            |  5 +++-
 3 files changed, 61 insertions(+), 13 deletions(-)
  

Comments

Mike Kravetz Oct. 24, 2022, 9:55 p.m. UTC | #1
On 10/22/22 19:50, Mike Kravetz wrote:
> madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
> page tables associated with the address range.  For hugetlb vmas,
> zap_page_range will call __unmap_hugepage_range_final.  However,
> __unmap_hugepage_range_final assumes the passed vma is about to be
> removed and deletes the vma_lock to prevent pmd sharing as the vma is
> on the way out.  In the case of madvise(MADV_DONTNEED) the vma remains,
> but the missing vma_lock prevents pmd sharing and could potentially
> lead to issues with truncation/fault races.
> 
> This issue was originally reported here [1] as a BUG triggered in
> page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
> vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> prevent pmd sharing.  Subsequent faults on this vma were confused as
> VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
> was not set in new pages added to the page table.  This resulted in
> pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
> 
> Create a new routine clear_hugetlb_page_range() that can be called from
> madvise(MADV_DONTNEED) for hugetlb vmas.  It has the same setup as
> zap_page_range, but does not delete the vma_lock.

After seeing a syzbot use after free report [2] that is also addressed by
this patch, I started thinking ... 

When __unmap_hugepage_range_final was created, the only time unmap_single_vma
was called for hugetlb vmas was during process exit time via exit_mmap.  I got
in trouble when I added a call via madvise(MADV_DONTNEED) which calls
zap_page_range.  This patch takes care of that calling path by having
madvise(MADV_DONTNEED) call a new routine clear_hugetlb_page_range instead of
zap_page_range for hugetlb vmas.  The use after free bug had me auditing code
paths to make sure __unmap_hugepage_range_final was REALLY only called at
process exit time.  If not, and we could fault on a vma after calling
__unmap_hugepage_range_final we would be in trouble.

My thought was, what if we had __unmap_hugepage_range_final check mm->mm_users
to determine if it was being called in the process exit path?  If !mm_users,
then we can delete the vma_lock to prevent pmd sharing as we know the process
is exiting.  If not, we do not delete the lock.  That seems to be more robust
and would prevent issues if someone accidentally introduces a new code path
where __unmap_hugepage_range_final (unmap_single_vma for a hugetlb vma)
could be called outside process exit context.

Thoughts?

[2] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
  
Mike Kravetz Oct. 24, 2022, 11:14 p.m. UTC | #2
On 10/24/22 14:55, Mike Kravetz wrote:
> On 10/22/22 19:50, Mike Kravetz wrote:
> > madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
> > page tables associated with the address range.  For hugetlb vmas,
> > zap_page_range will call __unmap_hugepage_range_final.  However,
> > __unmap_hugepage_range_final assumes the passed vma is about to be
> > removed and deletes the vma_lock to prevent pmd sharing as the vma is
> > on the way out.  In the case of madvise(MADV_DONTNEED) the vma remains,
> > but the missing vma_lock prevents pmd sharing and could potentially
> > lead to issues with truncation/fault races.
> > 
> > This issue was originally reported here [1] as a BUG triggered in
> > page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
> > vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> > prevent pmd sharing.  Subsequent faults on this vma were confused as
> > VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
> > was not set in new pages added to the page table.  This resulted in
> > pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
> > 
> > Create a new routine clear_hugetlb_page_range() that can be called from
> > madvise(MADV_DONTNEED) for hugetlb vmas.  It has the same setup as
> > zap_page_range, but does not delete the vma_lock.
> 
> After seeing a syzbot use after free report [2] that is also addressed by
> this patch, I started thinking ... 
> 
> When __unmap_hugepage_range_final was created, the only time unmap_single_vma
> was called for hugetlb vmas was during process exit time via exit_mmap.  I got
> in trouble when I added a call via madvise(MADV_DONTNEED) which calls
> zap_page_range.  This patch takes care of that calling path by having
> madvise(MADV_DONTNEED) call a new routine clear_hugetlb_page_range instead of
> zap_page_range for hugetlb vmas.  The use after free bug had me auditing code
> paths to make sure __unmap_hugepage_range_final was REALLY only called at
> process exit time.  If not, and we could fault on a vma after calling
> __unmap_hugepage_range_final we would be in trouble.
> 
> My thought was, what if we had __unmap_hugepage_range_final check mm->mm_users
> to determine if it was being called in the process exit path?  If !mm_users,
> then we can delete the vma_lock to prevent pmd sharing as we know the process
> is exiting.  If not, we do not delete the lock.  That seems to be more robust
> and would prevent issues if someone accidentally introduces a new code path
> where __unmap_hugepage_range_final (unmap_single_vma for a hugetlb vma)
> could be called outside process exit context.
> 
> Thoughts?
> 
> [2] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/

Sorry if this seems like I am talking to myself.  Here is a proposed v3 as
described above.

From 1466fd43e180ede3f6479d1dca4e7f350f86f80b Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 24 Oct 2022 15:40:05 -0700
Subject: [PATCH v3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
 processing

When hugetlb madvise(MADV_DONTNEED) support was added, the existing
code to call zap_page_range() to clear the page tables associated
with the address range was not modified.  However, for hugetlb vmas
zap_page_range will call __unmap_hugepage_range_final. This routine
assumes the passed hugetlb vma is about to be removed and deletes
the vma_lock to prevent pmd sharing as the vma is on the way out.  In
the case of madvise(MADV_DONTNEED) the vma remains, but the missing
vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag
to prevent pmd sharing.  Subsequent faults on this vma were confused
as VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
was not set in new pages added to the page table.  This resulted in
pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.

__unmap_hugepage_range_final was originally designed only to be called
in the context of process exit (exit_mmap).  It is now called in the
context of madvise(MADV_DONTNEED).  Restructure the routine and check
for !mm_users which indicates it is being called in the context of
process exit.  If being called in process exit context, delete the
vma_lock.  Otherwise, just unmap and leave the lock.  Since the routine
is no longer called only in process exit context, rename it to
__unmap_hugetlb_page_range to drop the 'final'.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
---
v3 - Check for !mm_users in __unmap_hugepage_range_final instead of
     creating a separate function.

 include/linux/hugetlb.h |  4 ++--
 mm/hugetlb.c            | 30 ++++++++++++++++++++----------
 mm/memory.c             |  2 +-
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..bc19a1f6ca10 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,7 +158,7 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
 			  struct page *ref_page, zap_flags_t zap_flags);
@@ -418,7 +418,7 @@ static inline unsigned long hugetlb_change_protection(
 	return 0;
 }
 
-static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+static inline void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, struct page *ref_page,
 			zap_flags_t zap_flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..3fe1152c3c20 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5202,27 +5202,37 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		tlb_flush_mmu_tlbonly(tlb);
 }
 
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 
 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
 	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchrously deleted.  If the page tables are shared, there
-	 * will be issues when accessed by someone else.
+	 * Free the vma_lock here if process exiting
 	 */
-	__hugetlb_vma_unlock_write_free(vma);
-
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	if (!atomic_read(&mm->mm_users)) {
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
+		 * pmd sharing.  This is important as page tables for this
+		 * unmapped range will be asynchrously deleted.  If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
diff --git a/mm/memory.c b/mm/memory.c
index 8e72f703ed99..1de8ea504047 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1687,7 +1687,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			if (vma->vm_file) {
 				zap_flags_t zap_flags = details ?
 				    details->zap_flags : 0;
-				__unmap_hugepage_range_final(tlb, vma, start, end,
+				__unmap_hugetlb_page_range(tlb, vma, start, end,
 							     NULL, zap_flags);
 			}
 		} else
  
Peter Xu Oct. 26, 2022, 9:42 p.m. UTC | #3
Hi, Mike,

On Sat, Oct 22, 2022 at 07:50:47PM -0700, Mike Kravetz wrote:

[...]

> -void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> +static void __unmap_hugepage_range_locking(struct mmu_gather *tlb,
>  			  struct vm_area_struct *vma, unsigned long start,
>  			  unsigned long end, struct page *ref_page,
> -			  zap_flags_t zap_flags)
> +			  zap_flags_t zap_flags, bool final)
>  {
>  	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  
>  	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
>  
> -	/*
> -	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
> -	 * the vma_lock is freed, this makes the vma ineligible for pmd
> -	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
> -	 * This is important as page tables for this unmapped range will
> -	 * be asynchrously deleted.  If the page tables are shared, there
> -	 * will be issues when accessed by someone else.
> -	 */
> -	__hugetlb_vma_unlock_write_free(vma);
> +	if (final) {
> +		/*
> +		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
> +		 * When the vma_lock is freed, this makes the vma ineligible
> +		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
> +		 * pmd sharing.  This is important as page tables for this
> +		 * unmapped range will be asynchrously deleted.  If the page
> +		 * tables are shared, there will be issues when accessed by
> +		 * someone else.
> +		 */
> +		__hugetlb_vma_unlock_write_free(vma);
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);

Pure question: can we rely on hugetlb_vm_op_close() to destroy the hugetlb
vma lock?

I read the comment above, it seems we are trying to avoid racing with pmd
sharing, but I don't see how that could ever happen, since iiuc there
should only be two places that unmaps the vma (final==true):

  (1) munmap: we're holding write lock, so no page fault possible
  (2) exit_mmap: we've already reset current->mm so no page fault possible

> +	} else {
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> +		hugetlb_vma_unlock_write(vma);
> +	}
> +}
>  
> -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> +			  struct vm_area_struct *vma, unsigned long start,
> +			  unsigned long end, struct page *ref_page,
> +			  zap_flags_t zap_flags)
> +{
> +	__unmap_hugepage_range_locking(tlb, vma, start, end, ref_page,
> +					zap_flags, true);
>  }
>  
> +#ifdef CONFIG_ADVISE_SYSCALLS
> +/*
> + * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
> + * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
> + * the associated vma_lock.
> + */
> +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> +				unsigned long end)
> +{
> +	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);

Is mmu_notifier_invalidate_range_start() missing here?

> +	tlb_gather_mmu(&tlb, vma->vm_mm);
> +	update_hiwater_rss(vma->vm_mm);
> +
> +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb);
> +}
> +#endif
> +
>  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  			  unsigned long end, struct page *ref_page,
>  			  zap_flags_t zap_flags)
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 2baa93ca2310..90577a669635 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  					unsigned long start, unsigned long end)
>  {
> -	zap_page_range(vma, start, end - start);
> +	if (!is_vm_hugetlb_page(vma))
> +		zap_page_range(vma, start, end - start);
> +	else
> +		clear_hugetlb_page_range(vma, start, end);
>  	return 0;
>  }

This does look a bit unfortunate - zap_page_range() contains yet another
is_vm_hugetlb_page() check (further down in unmap_single_vma), it can be
very confusing on which code path is really handling hugetlb.

The other mm_users check in v3 doesn't need this change, but was a bit
hackish to me, because IIUC we're clear on the call paths to trigger this
(unmap_vmas), so it seems clean to me to pass that info from the upper
stack.

Maybe we can have a new zap_flags passed into unmap_single_vma() showing
that it's destroying the vma?

Thanks,
  
Mike Kravetz Oct. 26, 2022, 11:54 p.m. UTC | #4
On 10/26/22 17:42, Peter Xu wrote:
> Hi, Mike,
> 
> On Sat, Oct 22, 2022 at 07:50:47PM -0700, Mike Kravetz wrote:
> 
> [...]
> 
> > -void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > +static void __unmap_hugepage_range_locking(struct mmu_gather *tlb,
> >  			  struct vm_area_struct *vma, unsigned long start,
> >  			  unsigned long end, struct page *ref_page,
> > -			  zap_flags_t zap_flags)
> > +			  zap_flags_t zap_flags, bool final)
> >  {
> >  	hugetlb_vma_lock_write(vma);
> >  	i_mmap_lock_write(vma->vm_file->f_mapping);
> >  
> >  	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
> >  
> > -	/*
> > -	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
> > -	 * the vma_lock is freed, this makes the vma ineligible for pmd
> > -	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
> > -	 * This is important as page tables for this unmapped range will
> > -	 * be asynchrously deleted.  If the page tables are shared, there
> > -	 * will be issues when accessed by someone else.
> > -	 */
> > -	__hugetlb_vma_unlock_write_free(vma);
> > +	if (final) {
> > +		/*
> > +		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
> > +		 * When the vma_lock is freed, this makes the vma ineligible
> > +		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
> > +		 * pmd sharing.  This is important as page tables for this
> > +		 * unmapped range will be asynchrously deleted.  If the page
> > +		 * tables are shared, there will be issues when accessed by
> > +		 * someone else.
> > +		 */
> > +		__hugetlb_vma_unlock_write_free(vma);
> > +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> 
> Pure question: can we rely on hugetlb_vm_op_close() to destroy the hugetlb
> vma lock?
> 
> I read the comment above, it seems we are trying to avoid racing with pmd
> sharing, but I don't see how that could ever happen, since iiuc there
> should only be two places that unmaps the vma (final==true):
> 
>   (1) munmap: we're holding write lock, so no page fault possible
>   (2) exit_mmap: we've already reset current->mm so no page fault possible
> 

Thanks for taking a look Peter!

The possible sharing we are trying to stop would be initiated by a fault
in a different process on the same underlying mapping object (inode).  The
specific vma in exit processing is still linked into the mapping interval
tree.  So, even though we call huge_pmd_unshare in the unmap processing (in
__unmap_hugepage_range) the sharing could later be initiated by another
process.

Hope that makes sense.  That is also the reason the routine
page_table_shareable contains this check:

	/*
	 * match the virtual addresses, permission and the alignment of the
	 * page table page.
	 *
	 * Also, vma_lock (vm_private_data) is required for sharing.
	 */
	if (pmd_index(addr) != pmd_index(saddr) ||
	    vm_flags != svm_flags ||
	    !range_in_vma(svma, sbase, s_end) ||
	    !svma->vm_private_data)
		return 0;

FYI - The 'flags' check also prevents a non-uffd mapping from initiating
sharing with a uffd mapping.

> > +	} else {
> > +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> > +		hugetlb_vma_unlock_write(vma);
> > +	}
> > +}
> >  
> > -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> > +void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > +			  struct vm_area_struct *vma, unsigned long start,
> > +			  unsigned long end, struct page *ref_page,
> > +			  zap_flags_t zap_flags)
> > +{
> > +	__unmap_hugepage_range_locking(tlb, vma, start, end, ref_page,
> > +					zap_flags, true);
> >  }
> >  
> > +#ifdef CONFIG_ADVISE_SYSCALLS
> > +/*
> > + * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
> > + * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
> > + * the associated vma_lock.
> > + */
> > +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> > +				unsigned long end)
> > +{
> > +	struct mmu_notifier_range range;
> > +	struct mmu_gather tlb;
> > +
> > +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > +				start, end);
> 
> Is mmu_notifier_invalidate_range_start() missing here?
> 

It certainly does look like it.  When I created this routine, I was trying to
mimic what was done in the current calling path zap_page_range to
__unmap_hugepage_range_final.  Now when I look at that, I am not seeing
a mmu_notifier_invalidate_range_start/end.  Am I missing something, or
are these missing today?  Do note that we do MMU_NOTIFY_UNMAP in
__unmap_hugepage_range.

> > +	tlb_gather_mmu(&tlb, vma->vm_mm);
> > +	update_hiwater_rss(vma->vm_mm);
> > +
> > +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);
> > +
> > +	mmu_notifier_invalidate_range_end(&range);
> > +	tlb_finish_mmu(&tlb);
> > +}
> > +#endif
> > +
> >  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> >  			  unsigned long end, struct page *ref_page,
> >  			  zap_flags_t zap_flags)
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 2baa93ca2310..90577a669635 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  					unsigned long start, unsigned long end)
> >  {
> > -	zap_page_range(vma, start, end - start);
> > +	if (!is_vm_hugetlb_page(vma))
> > +		zap_page_range(vma, start, end - start);
> > +	else
> > +		clear_hugetlb_page_range(vma, start, end);
> >  	return 0;
> >  }
> 
> This does look a bit unfortunate - zap_page_range() contains yet another
> is_vm_hugetlb_page() check (further down in unmap_single_vma), it can be
> very confusing on which code path is really handling hugetlb.
> 
> The other mm_users check in v3 doesn't need this change, but was a bit
> hackish to me, because IIUC we're clear on the call paths to trigger this
> (unmap_vmas), so it seems clean to me to pass that info from the upper
> stack.
> 
> Maybe we can have a new zap_flags passed into unmap_single_vma() showing
> that it's destroying the vma?

I thought about that.  However, we would need to start passing the flag
here into zap_page_range as this is the beginning of that call down into
the hugetlb code where we do not want to remove zap_page_rangethe
vma_lock.
  
Peter Xu Oct. 27, 2022, 1:12 a.m. UTC | #5
On Wed, Oct 26, 2022 at 04:54:01PM -0700, Mike Kravetz wrote:
> On 10/26/22 17:42, Peter Xu wrote:
> > Hi, Mike,
> > 
> > On Sat, Oct 22, 2022 at 07:50:47PM -0700, Mike Kravetz wrote:
> > 
> > [...]
> > 
> > > -void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > > +static void __unmap_hugepage_range_locking(struct mmu_gather *tlb,
> > >  			  struct vm_area_struct *vma, unsigned long start,
> > >  			  unsigned long end, struct page *ref_page,
> > > -			  zap_flags_t zap_flags)
> > > +			  zap_flags_t zap_flags, bool final)
> > >  {
> > >  	hugetlb_vma_lock_write(vma);
> > >  	i_mmap_lock_write(vma->vm_file->f_mapping);
> > >  
> > >  	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
> > >  
> > > -	/*
> > > -	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
> > > -	 * the vma_lock is freed, this makes the vma ineligible for pmd
> > > -	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
> > > -	 * This is important as page tables for this unmapped range will
> > > -	 * be asynchrously deleted.  If the page tables are shared, there
> > > -	 * will be issues when accessed by someone else.
> > > -	 */
> > > -	__hugetlb_vma_unlock_write_free(vma);
> > > +	if (final) {
> > > +		/*
> > > +		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
> > > +		 * When the vma_lock is freed, this makes the vma ineligible
> > > +		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
> > > +		 * pmd sharing.  This is important as page tables for this
> > > +		 * unmapped range will be asynchrously deleted.  If the page
> > > +		 * tables are shared, there will be issues when accessed by
> > > +		 * someone else.
> > > +		 */
> > > +		__hugetlb_vma_unlock_write_free(vma);
> > > +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> > 
> > Pure question: can we rely on hugetlb_vm_op_close() to destroy the hugetlb
> > vma lock?
> > 
> > I read the comment above, it seems we are trying to avoid racing with pmd
> > sharing, but I don't see how that could ever happen, since iiuc there
> > should only be two places that unmaps the vma (final==true):
> > 
> >   (1) munmap: we're holding write lock, so no page fault possible
> >   (2) exit_mmap: we've already reset current->mm so no page fault possible
> > 
> 
> Thanks for taking a look Peter!
> 
> The possible sharing we are trying to stop would be initiated by a fault
> in a different process on the same underlying mapping object (inode).  The
> specific vma in exit processing is still linked into the mapping interval
> tree.  So, even though we call huge_pmd_unshare in the unmap processing (in
> __unmap_hugepage_range) the sharing could later be initiated by another
> process.
> 
> Hope that makes sense.  That is also the reason the routine
> page_table_shareable contains this check:
> 
> 	/*
> 	 * match the virtual addresses, permission and the alignment of the
> 	 * page table page.
> 	 *
> 	 * Also, vma_lock (vm_private_data) is required for sharing.
> 	 */
> 	if (pmd_index(addr) != pmd_index(saddr) ||
> 	    vm_flags != svm_flags ||
> 	    !range_in_vma(svma, sbase, s_end) ||
> 	    !svma->vm_private_data)
> 		return 0;

Ah, makes sense.  Hmm, then I'm wondering whether hugetlb_vma_lock_free()
would ever be useful at all?  Because remove_vma() (or say, the close()
hook) seems to always be called after a preceding unmap_vmas().

> 
> FYI - The 'flags' check also prevents a non-uffd mapping from initiating
> sharing with a uffd mapping.
> 
> > > +	} else {
> > > +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > +		hugetlb_vma_unlock_write(vma);
> > > +	}
> > > +}
> > >  
> > > -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> > > +void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > > +			  struct vm_area_struct *vma, unsigned long start,
> > > +			  unsigned long end, struct page *ref_page,
> > > +			  zap_flags_t zap_flags)
> > > +{
> > > +	__unmap_hugepage_range_locking(tlb, vma, start, end, ref_page,
> > > +					zap_flags, true);
> > >  }
> > >  
> > > +#ifdef CONFIG_ADVISE_SYSCALLS
> > > +/*
> > > + * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
> > > + * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
> > > + * the associated vma_lock.
> > > + */
> > > +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> > > +				unsigned long end)
> > > +{
> > > +	struct mmu_notifier_range range;
> > > +	struct mmu_gather tlb;
> > > +
> > > +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > > +				start, end);
> > 
> > Is mmu_notifier_invalidate_range_start() missing here?
> > 
> 
> It certainly does look like it.  When I created this routine, I was trying to
> mimic what was done in the current calling path zap_page_range to
> __unmap_hugepage_range_final.  Now when I look at that, I am not seeing
> a mmu_notifier_invalidate_range_start/end.  Am I missing something, or
> are these missing today?

I'm not sure whether we're looking at the same code base; here it's in
zap_page_range() itself.

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, start + size);
	tlb_gather_mmu(&tlb, vma->vm_mm);
	update_hiwater_rss(vma->vm_mm);
	mmu_notifier_invalidate_range_start(&range);
	do {
		unmap_single_vma(&tlb, vma, start, range.end, NULL);
	} while ((vma = mas_find(&mas, end - 1)) != NULL);
	mmu_notifier_invalidate_range_end(&range);

> Do note that we do MMU_NOTIFY_UNMAP in __unmap_hugepage_range.

Hmm, I think we may want CLEAR for zap-only and UNMAP only for unmap.

 * @MMU_NOTIFY_UNMAP: either munmap() that unmap the range or a mremap() that
 * move the range
 * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
 * madvise() or replacing a page by another one, ...).

The other thing is that unmap_vmas() also notifies (same as
zap_page_range), so it looks like a duplicated notification if either of
them ends up calling __unmap_hugepage_range().

> 
> > > +	tlb_gather_mmu(&tlb, vma->vm_mm);
> > > +	update_hiwater_rss(vma->vm_mm);
> > > +
> > > +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);
> > > +
> > > +	mmu_notifier_invalidate_range_end(&range);
> > > +	tlb_finish_mmu(&tlb);
> > > +}
> > > +#endif
> > > +
> > >  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > >  			  unsigned long end, struct page *ref_page,
> > >  			  zap_flags_t zap_flags)
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 2baa93ca2310..90577a669635 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > >  					unsigned long start, unsigned long end)
> > >  {
> > > -	zap_page_range(vma, start, end - start);
> > > +	if (!is_vm_hugetlb_page(vma))
> > > +		zap_page_range(vma, start, end - start);
> > > +	else
> > > +		clear_hugetlb_page_range(vma, start, end);
> > >  	return 0;
> > >  }
> > 
> > This does look a bit unfortunate - zap_page_range() contains yet another
> > is_vm_hugetlb_page() check (further down in unmap_single_vma), it can be
> > very confusing on which code path is really handling hugetlb.
> > 
> > The other mm_users check in v3 doesn't need this change, but was a bit
> > hackish to me, because IIUC we're clear on the call paths to trigger this
> > (unmap_vmas), so it seems clean to me to pass that info from the upper
> > stack.
> > 
> > Maybe we can have a new zap_flags passed into unmap_single_vma() showing
> > that it's destroying the vma?
> 
> I thought about that.  However, we would need to start passing the flag
> here into zap_page_range as this is the beginning of that call down into
> the hugetlb code where we do not want to remove the vma_lock.

Right.  I was thinking just attach the new flag in unmap_vmas().  A pseudo
(not compiled) code attached.

Thanks,
  
Mike Kravetz Oct. 28, 2022, 3:23 p.m. UTC | #6
On 10/26/22 21:12, Peter Xu wrote:
> On Wed, Oct 26, 2022 at 04:54:01PM -0700, Mike Kravetz wrote:
> > On 10/26/22 17:42, Peter Xu wrote:
> > > 
> > > Pure question: can we rely on hugetlb_vm_op_close() to destroy the hugetlb
> > > vma lock?
> > > 
> > > I read the comment above, it seems we are trying to avoid racing with pmd
> > > sharing, but I don't see how that could ever happen, since iiuc there
> > > should only be two places that unmaps the vma (final==true):
> > > 
> > >   (1) munmap: we're holding write lock, so no page fault possible
> > >   (2) exit_mmap: we've already reset current->mm so no page fault possible
> > > 
> > 
> > Thanks for taking a look Peter!
> > 
> > The possible sharing we are trying to stop would be initiated by a fault
> > in a different process on the same underlying mapping object (inode).  The
> > specific vma in exit processing is still linked into the mapping interval
> > tree.  So, even though we call huge_pmd_unshare in the unmap processing (in
> > __unmap_hugepage_range) the sharing could later be initiated by another
> > process.
> > 
> > Hope that makes sense.  That is also the reason the routine
> > page_table_shareable contains this check:
> > 
> > 	/*
> > 	 * match the virtual addresses, permission and the alignment of the
> > 	 * page table page.
> > 	 *
> > 	 * Also, vma_lock (vm_private_data) is required for sharing.
> > 	 */
> > 	if (pmd_index(addr) != pmd_index(saddr) ||
> > 	    vm_flags != svm_flags ||
> > 	    !range_in_vma(svma, sbase, s_end) ||
> > 	    !svma->vm_private_data)
> > 		return 0;
> 
> Ah, makes sense.  Hmm, then I'm wondering whether hugetlb_vma_lock_free()
> would ever be useful at all?  Because remove_vma() (or say, the close()
> hook) seems to always be called after a preceding unmap_vmas().

You are right.  hugetlb_vma_lock_free will almost always be a noop when
called from the close hook.  It is still 'needed' for vma setup error
paths.

> > > > +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> > > > +				unsigned long end)
> > > > +{
> > > > +	struct mmu_notifier_range range;
> > > > +	struct mmu_gather tlb;
> > > > +
> > > > +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > > > +				start, end);
> > > 
> > > Is mmu_notifier_invalidate_range_start() missing here?
> > > 
> > 
> > It certainly does look like it.  When I created this routine, I was trying to
> > mimic what was done in the current calling path zap_page_range to
> > __unmap_hugepage_range_final.  Now when I look at that, I am not seeing
> > a mmu_notifier_invalidate_range_start/end.  Am I missing something, or
> > are these missing today?
> 
> I'm not sure whether we're looking at the same code base; here it's in
> zap_page_range() itself.
> 
> 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> 				start, start + size);
> 	tlb_gather_mmu(&tlb, vma->vm_mm);
> 	update_hiwater_rss(vma->vm_mm);
> 	mmu_notifier_invalidate_range_start(&range);
> 	do {
> 		unmap_single_vma(&tlb, vma, start, range.end, NULL);
> 	} while ((vma = mas_find(&mas, end - 1)) != NULL);
> 	mmu_notifier_invalidate_range_end(&range);

Yes, I missed that.  Thanks!

> 
> > Do note that we do MMU_NOTIFY_UNMAP in __unmap_hugepage_range.
> 
> Hmm, I think we may want CLEAR for zap-only and UNMAP only for unmap.
> 
>  * @MMU_NOTIFY_UNMAP: either munmap() that unmap the range or a mremap() that
>  * move the range
>  * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
>  * madvise() or replacing a page by another one, ...).
> 
> The other thing is that unmap_vmas() also notifies (same to
> zap_page_range), it looks a duplicated notification if any of them calls
> __unmap_hugepage_range() at last.

The only call into __unmap_hugepage_range() from generic zap/unmap calls
is via __unmap_hugepage_range_final.  Other call paths are entirely
within hugetlb code.

> > > > +	tlb_gather_mmu(&tlb, vma->vm_mm);
> > > > +	update_hiwater_rss(vma->vm_mm);
> > > > +
> > > > +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);
> > > > +
> > > > +	mmu_notifier_invalidate_range_end(&range);
> > > > +	tlb_finish_mmu(&tlb);
> > > > +}
> > > > +#endif
> > > > +
> > > >  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > > >  			  unsigned long end, struct page *ref_page,
> > > >  			  zap_flags_t zap_flags)
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 2baa93ca2310..90577a669635 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > >  					unsigned long start, unsigned long end)
> > > >  {
> > > > -	zap_page_range(vma, start, end - start);
> > > > +	if (!is_vm_hugetlb_page(vma))
> > > > +		zap_page_range(vma, start, end - start);
> > > > +	else
> > > > +		clear_hugetlb_page_range(vma, start, end);
> > > >  	return 0;
> > > >  }
> > > 
> > > This does look a bit unfortunate - zap_page_range() contains yet another
> > > is_vm_hugetlb_page() check (further down in unmap_single_vma), it can be
> > > very confusing on which code path is really handling hugetlb.
> > > 
> > > The other mm_users check in v3 doesn't need this change, but was a bit
> > > hackish to me, because IIUC we're clear on the call paths to trigger this
> > > (unmap_vmas), so it seems clean to me to pass that info from the upper
> > > stack.
> > > 
> > > Maybe we can have a new zap_flags passed into unmap_single_vma() showing
> > > that it's destroying the vma?
> > 
> > I thought about that.  However, we would need to start passing the flag
> > here into zap_page_range as this is the beginning of that call down into
> > the hugetlb code where we do not want to remove the vma_lock.
> 
> Right.  I was thinking just attach the new flag in unmap_vmas().  A pseudo
> (not compiled) code attached.

I took your suggestions and came up with a new version of this patch.  Not
sure if I love the new zap flag, as it is only used by hugetlb code.  I also
added a bool to __unmap_hugepage_range to eliminate the duplicate notification
calls.
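
The dispatch the new flag drives can be modeled in a few lines of plain C
(a toy userspace sketch that mirrors the patch's names; it is illustrative
only, not the kernel implementation):

```c
#include <stdbool.h>

/* Toy model of the decision in __unmap_hugepage_range_locking():
 * the vma_lock is freed only when the vma is truly going away
 * (unmap_vmas() sets ZAP_FLAG_UNMAP); a MADV_DONTNEED-style zap
 * leaves the lock in place so pmd sharing remains possible later. */
typedef unsigned int zap_flags_t;

#define ZAP_FLAG_DROP_MARKER	((zap_flags_t)1 << 0)
#define ZAP_FLAG_UNMAP		((zap_flags_t)1 << 1)	/* set only by unmap_vmas() */

static bool frees_vma_lock(zap_flags_t zap_flags)
{
	return (zap_flags & ZAP_FLAG_UNMAP) != 0;
}
```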

From 15ffe922b60af9f4c19927d5d5aaca75840d0f6c Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Fri, 28 Oct 2022 07:46:50 -0700
Subject: [PATCH v5] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
 processing

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the page
tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
not set in new pages added to the page table.  This resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.

Create a new routine clear_hugetlb_page_range() that can be called from
madvise(MADV_DONTNEED) for hugetlb vmas.  It has the same setup as
zap_page_range, but does not delete the vma_lock.  Also, add a new zap
flag ZAP_FLAG_UNMAP to indicate an unmap call from unmap_vmas().  This
is used to indicate the 'final' unmapping of a vma.  Change the routine
__unmap_hugepage_range to take a notification_needed argument.  This is
used to prevent duplicate notifications.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/hugetlb.h |  7 ++++
 include/linux/mm.h      |  3 ++
 mm/hugetlb.c            | 93 +++++++++++++++++++++++++++++++----------
 mm/madvise.c            |  5 ++-
 mm/memory.c             |  2 +-
 5 files changed, 86 insertions(+), 24 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3568b90b397d..badcb277603d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,8 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
+void clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
@@ -428,6 +430,11 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	BUG();
 }
 
+static void __maybe_unused clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+}
+
 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned int flags)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 978c17df053e..517c8cc8ccb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3464,4 +3464,7 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  */
 #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
 
+/* Set in unmap_vmas() to indicate an unmap call.  Only used by hugetlb */
+#define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a0289ef09fa..0309a7c0f3bc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5062,7 +5062,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 
 static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end,
-				   struct page *ref_page, zap_flags_t zap_flags)
+				   struct page *ref_page, zap_flags_t zap_flags,
+				   bool notification_needed)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -5087,13 +5088,16 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	tlb_change_page_size(tlb, sz);
 	tlb_start_vma(tlb, vma);
 
-	/*
-	 * If sharing possible, alert mmu notifiers of worst case.
-	 */
-	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
-				end);
-	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
-	mmu_notifier_invalidate_range_start(&range);
+	if (notification_needed) {
+		/*
+		 * If sharing possible, alert mmu notifiers of worst case.
+		 */
+		mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm,
+					start, end);
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+					&range.end);
+		mmu_notifier_invalidate_range_start(&range);
+	}
 	last_addr_mask = hugetlb_mask_last_page(h);
 	address = start;
 	for (; address < end; address += sz) {
@@ -5178,7 +5182,8 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		if (ref_page)
 			break;
 	}
-	mmu_notifier_invalidate_range_end(&range);
+	if (notification_needed)
+		mmu_notifier_invalidate_range_end(&range);
 	tlb_end_vma(tlb, vma);
 
 	/*
@@ -5198,29 +5203,72 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		tlb_flush_mmu_tlbonly(tlb);
 }
 
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+static void __unmap_hugepage_range_locking(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
 {
+	bool final = zap_flags & ZAP_FLAG_UNMAP;
+
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 
-	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
+	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags,
+				false);
 
-	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchrously deleted.  If the page tables are shared, there
-	 * will be issues when accessed by someone else.
-	 */
-	__hugetlb_vma_unlock_write_free(vma);
+	if (final) {
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
+		 * pmd sharing.  This is important as page tables for this
+		 * unmapped range will be asynchrously deleted.  If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
+}
 
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+			  struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end, struct page *ref_page,
+			  zap_flags_t zap_flags)
+{
+	__unmap_hugepage_range_locking(tlb, vma, start, end, ref_page,
+					zap_flags);
 }
 
+#ifdef CONFIG_ADVISE_SYSCALLS
+/*
+ * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
+ * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
+ * the associated vma_lock.
+ */
+void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	update_hiwater_rss(vma->vm_mm);
+	mmu_notifier_invalidate_range_start(&range);
+
+	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
+}
+#endif
+
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
@@ -5228,7 +5276,8 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	struct mmu_gather tlb;
 
 	tlb_gather_mmu(&tlb, vma->vm_mm);
-	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
+	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags,
+				true);
 	tlb_finish_mmu(&tlb);
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index c7105ec6d08c..d8b4d7e56939 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	if (!is_vm_hugetlb_page(vma))
+		zap_page_range(vma, start, end - start);
+	else
+		clear_hugetlb_page_range(vma, start, end);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index c5599a9279b1..679b702af4ce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1671,7 +1671,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 {
 	struct mmu_notifier_range range;
 	struct zap_details details = {
-		.zap_flags = ZAP_FLAG_DROP_MARKER,
+		.zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
 		/* Careful - we need to zap private pages too! */
 		.even_cows = true,
 	};
  
Peter Xu Oct. 28, 2022, 4:13 p.m. UTC | #7
On Fri, Oct 28, 2022 at 08:23:25AM -0700, Mike Kravetz wrote:
> On 10/26/22 21:12, Peter Xu wrote:
> > On Wed, Oct 26, 2022 at 04:54:01PM -0700, Mike Kravetz wrote:
> > > On 10/26/22 17:42, Peter Xu wrote:
> > > > 
> > > > Pure question: can we rely on hugetlb_vm_op_close() to destroy the hugetlb
> > > > vma lock?
> > > > 
> > > > I read the comment above, it seems we are trying to avoid racing with pmd
> > > > sharing, but I don't see how that could ever happen, since iiuc there
> > > > should only be two places that unmaps the vma (final==true):
> > > > 
> > > >   (1) munmap: we're holding write lock, so no page fault possible
> > > >   (2) exit_mmap: we've already reset current->mm so no page fault possible
> > > > 
> > > 
> > > Thanks for taking a look Peter!
> > > 
> > > The possible sharing we are trying to stop would be initiated by a fault
> > > in a different process on the same underlying mapping object (inode).  The
> > > specific vma in exit processing is still linked into the mapping interval
> > > tree.  So, even though we call huge_pmd_unshare in the unmap processing (in
> > > __unmap_hugepage_range) the sharing could later be initiated by another
> > > process.
> > > 
> > > Hope that makes sense.  That is also the reason the routine
> > > page_table_shareable contains this check:
> > > 
> > > 	/*
> > > 	 * match the virtual addresses, permission and the alignment of the
> > > 	 * page table page.
> > > 	 *
> > > 	 * Also, vma_lock (vm_private_data) is required for sharing.
> > > 	 */
> > > 	if (pmd_index(addr) != pmd_index(saddr) ||
> > > 	    vm_flags != svm_flags ||
> > > 	    !range_in_vma(svma, sbase, s_end) ||
> > > 	    !svma->vm_private_data)
> > > 		return 0;
> > 
> > Ah, makes sense.  Hmm, then I'm wondering whether hugetlb_vma_lock_free()
> > would ever be useful at all?  Because remove_vma() (or say, the close()
> > hook) seems to always be called after a preceding unmap_vmas().
> 
> You are right.  hugetlb_vma_lock_free will almost always be a noop when
> called from the close hook.  It is still 'needed' for vma setup error
> paths.

Ah, yes.

Not sure whether it would be worthwhile to have a comment for that in the
close() hook, because it's rare that the vma lock is released (and needs to
be released) before the vma destroy hook runs.  The pmd unsharing
definitely complicates things.  In any case, definitely worth a repost,
if only to raise this point.

> 
> > > > > +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> > > > > +				unsigned long end)
> > > > > +{
> > > > > +	struct mmu_notifier_range range;
> > > > > +	struct mmu_gather tlb;
> > > > > +
> > > > > +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > > > > +				start, end);
> > > > 
> > > > Is mmu_notifier_invalidate_range_start() missing here?
> > > > 
> > > 
> > > It certainly does look like it.  When I created this routine, I was trying to
> > > mimic what was done in the current calling path zap_page_range to
> > > __unmap_hugepage_range_final.  Now when I look at that, I am not seeing
> > > a mmu_notifier_invalidate_range_start/end.  Am I missing something, or
> > > are these missing today?
> > 
> > I'm not sure whether we're looking at the same code base; here it's in
> > zap_page_range() itself.
> > 
> > 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > 				start, start + size);
> > 	tlb_gather_mmu(&tlb, vma->vm_mm);
> > 	update_hiwater_rss(vma->vm_mm);
> > 	mmu_notifier_invalidate_range_start(&range);
> > 	do {
> > 		unmap_single_vma(&tlb, vma, start, range.end, NULL);
> > 	} while ((vma = mas_find(&mas, end - 1)) != NULL);
> > 	mmu_notifier_invalidate_range_end(&range);
> 
> Yes, I missed that.  Thanks!
> 
> > 
> > > Do note that we do MMU_NOTIFY_UNMAP in __unmap_hugepage_range.
> > 
> > Hmm, I think we may want CLEAR for zap-only and UNMAP only for unmap.
> > 
> >  * @MMU_NOTIFY_UNMAP: either munmap() that unmap the range or a mremap() that
> >  * move the range
> >  * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> >  * madvise() or replacing a page by another one, ...).
> > 
> > The other thing is that unmap_vmas() also notifies (same to
> > zap_page_range), it looks a duplicated notification if any of them calls
> > __unmap_hugepage_range() at last.
> 
> The only call into __unmap_hugepage_range() from generic zap/unmap calls
> is via __unmap_hugepage_range_final.  Other call paths are entirely
> within hugetlb code.

Right, the duplication only happens on the outside-hugetlb (aka generic mm)
calls.  I saw that below it's being considered, thanks.  Though I had a
(maybe...) better thought, more below.

> 
> > > > > +	tlb_gather_mmu(&tlb, vma->vm_mm);
> > > > > +	update_hiwater_rss(vma->vm_mm);
> > > > > +
> > > > > +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);
> > > > > +
> > > > > +	mmu_notifier_invalidate_range_end(&range);
> > > > > +	tlb_finish_mmu(&tlb);
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > >  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > > > >  			  unsigned long end, struct page *ref_page,
> > > > >  			  zap_flags_t zap_flags)
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 2baa93ca2310..90577a669635 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > > >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > > >  					unsigned long start, unsigned long end)
> > > > >  {
> > > > > -	zap_page_range(vma, start, end - start);
> > > > > +	if (!is_vm_hugetlb_page(vma))
> > > > > +		zap_page_range(vma, start, end - start);
> > > > > +	else
> > > > > +		clear_hugetlb_page_range(vma, start, end);
> > > > >  	return 0;
> > > > >  }
> > > > 
> > > > This does look a bit unfortunate - zap_page_range() contains yet another
> > > > is_vm_hugetlb_page() check (further down in unmap_single_vma), it can be
> > > > very confusing on which code path is really handling hugetlb.
> > > > 
> > > > The other mm_users check in v3 doesn't need this change, but was a bit
> > > > hackish to me, because IIUC we're clear on the call paths to trigger this
> > > > (unmap_vmas), so it seems clean to me to pass that info from the upper
> > > > stack.
> > > > 
> > > > Maybe we can have a new zap_flags passed into unmap_single_vma() showing
> > > > that it's destroying the vma?
> > > 
> > > I thought about that.  However, we would need to start passing the flag
> > > here into zap_page_range as this is the beginning of that call down into
> > > the hugetlb code where we do not want to remove the vma_lock.
> > 
> > Right.  I was thinking just attach the new flag in unmap_vmas().  A pseudo
> > (not compiled) code attached.
> 
> I took your suggestions and came up with a new version of this patch.  Not
> sure if I love the new zap flag, as it is only used by hugetlb code.  I also
> added a bool to __unmap_hugepage_range to eliminate the duplicate notification
> calls.
> 
> From 15ffe922b60af9f4c19927d5d5aaca75840d0f6c Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Fri, 28 Oct 2022 07:46:50 -0700
> Subject: [PATCH v5] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
>  processing
> 
> madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the page
> tables associated with the address range.  For hugetlb vmas,
> zap_page_range will call __unmap_hugepage_range_final.  However,
> __unmap_hugepage_range_final assumes the passed vma is about to be removed
> and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
> missing vma_lock prevents pmd sharing and could potentially lead to issues
> with truncation/fault races.
> 
> This issue was originally reported here [1] as a BUG triggered in
> page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
> vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> prevent pmd sharing.  Subsequent faults on this vma were confused as
> VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
> not set in new pages added to the page table.  This resulted in pages that
> appeared anonymous in a VM_SHARED vma and triggered the BUG.
> 
> Create a new routine clear_hugetlb_page_range() that can be called from
> madvise(MADV_DONTNEED) for hugetlb vmas.  It has the same setup as
> zap_page_range, but does not delete the vma_lock.  Also, add a new zap
> flag ZAP_FLAG_UNMAP to indicate an unmap call from unmap_vmas().  This
> is used to indicate the 'final' unmapping of a vma.  Change the routine
> __unmap_hugepage_range to take a notification_needed argument.  This is
> used to prevent duplicate notifications.
> 
> [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
> Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reported-by: Wei Chen <harperchen1110@gmail.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/hugetlb.h |  7 ++++
>  include/linux/mm.h      |  3 ++
>  mm/hugetlb.c            | 93 +++++++++++++++++++++++++++++++----------
>  mm/madvise.c            |  5 ++-
>  mm/memory.c             |  2 +-
>  5 files changed, 86 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 3568b90b397d..badcb277603d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -158,6 +158,8 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  void unmap_hugepage_range(struct vm_area_struct *,
>  			  unsigned long, unsigned long, struct page *,
>  			  zap_flags_t);
> +void clear_hugetlb_page_range(struct vm_area_struct *vma,
> +			unsigned long start, unsigned long end);
>  void __unmap_hugepage_range_final(struct mmu_gather *tlb,
>  			  struct vm_area_struct *vma,
>  			  unsigned long start, unsigned long end,
> @@ -428,6 +430,11 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
>  	BUG();
>  }
>  
> +static void __maybe_unused clear_hugetlb_page_range(struct vm_area_struct *vma,
> +			unsigned long start, unsigned long end)
> +{
> +}
> +
>  static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
>  			struct vm_area_struct *vma, unsigned long address,
>  			unsigned int flags)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 978c17df053e..517c8cc8ccb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3464,4 +3464,7 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   */
>  #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
>  
> +/* Set in unmap_vmas() to indicate an unmap call.  Only used by hugetlb */
> +#define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
> +
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4a0289ef09fa..0309a7c0f3bc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5062,7 +5062,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  
>  static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  				   unsigned long start, unsigned long end,
> -				   struct page *ref_page, zap_flags_t zap_flags)
> +				   struct page *ref_page, zap_flags_t zap_flags,
> +				   bool notification_needed)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address;
> @@ -5087,13 +5088,16 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>  	tlb_change_page_size(tlb, sz);
>  	tlb_start_vma(tlb, vma);
>  
> -	/*
> -	 * If sharing possible, alert mmu notifiers of worst case.
> -	 */
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
> -				end);
> -	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> -	mmu_notifier_invalidate_range_start(&range);
> +	if (notification_needed) {

I'm not 100% sure whether this is needed.  Can we move the notification
just outside of this function, to where it's needed?  Based on the latest
mm-unstable c59145c0aa2c, what I read is that it's only needed for
unmap_hugepage_range() not __unmap_hugepage_range_locking() (these are the
only two callers of __unmap_hugepage_range).  Then maybe we can move these
notifications into unmap_hugepage_range().

Also note that I _think_ when moving we should change UNMAP to CLEAR
notifies too, but worth double check.

> +		/*
> +		 * If sharing possible, alert mmu notifiers of worst case.
> +		 */
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm,
> +					start, end);
> +		adjust_range_if_pmd_sharing_possible(vma, &range.start,
> +					&range.end);
> +		mmu_notifier_invalidate_range_start(&range);
> +	}
>  	last_addr_mask = hugetlb_mask_last_page(h);
>  	address = start;
>  	for (; address < end; address += sz) {
> @@ -5178,7 +5182,8 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>  		if (ref_page)
>  			break;
>  	}
> -	mmu_notifier_invalidate_range_end(&range);
> +	if (notification_needed)
> +		mmu_notifier_invalidate_range_end(&range);
>  	tlb_end_vma(tlb, vma);
>  
>  	/*
> @@ -5198,29 +5203,72 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>  		tlb_flush_mmu_tlbonly(tlb);
>  }
>  
> -void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> +static void __unmap_hugepage_range_locking(struct mmu_gather *tlb,
>  			  struct vm_area_struct *vma, unsigned long start,
>  			  unsigned long end, struct page *ref_page,
>  			  zap_flags_t zap_flags)
>  {
> +	bool final = zap_flags & ZAP_FLAG_UNMAP;
> +
>  	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  
> -	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
> +	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags,
> +				false);
>  
> -	/*
> -	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
> -	 * the vma_lock is freed, this makes the vma ineligible for pmd
> -	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
> -	 * This is important as page tables for this unmapped range will
> -	 * be asynchrously deleted.  If the page tables are shared, there
> -	 * will be issues when accessed by someone else.
> -	 */
> -	__hugetlb_vma_unlock_write_free(vma);
> +	if (final) {
> +		/*
> +		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
> +		 * When the vma_lock is freed, this makes the vma ineligible
> +		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
> +		 * pmd sharing.  This is important as page tables for this
> +		 * unmapped range will be asynchrously deleted.  If the page
> +		 * tables are shared, there will be issues when accessed by
> +		 * someone else.
> +		 */
> +		__hugetlb_vma_unlock_write_free(vma);
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	} else {
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> +		hugetlb_vma_unlock_write(vma);
> +	}
> +}
>  
> -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> +			  struct vm_area_struct *vma, unsigned long start,
> +			  unsigned long end, struct page *ref_page,
> +			  zap_flags_t zap_flags)
> +{
> +	__unmap_hugepage_range_locking(tlb, vma, start, end, ref_page,
> +					zap_flags);
>  }
>  
> +#ifdef CONFIG_ADVISE_SYSCALLS
> +/*
> + * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
> + * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
> + * the associated vma_lock.
> + */
> +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> +				unsigned long end)
> +{
> +	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);
> +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
> +	update_hiwater_rss(vma->vm_mm);
> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0);
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb);
> +}
> +#endif
> +
>  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  			  unsigned long end, struct page *ref_page,
>  			  zap_flags_t zap_flags)
> @@ -5228,7 +5276,8 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  	struct mmu_gather tlb;
>  
>  	tlb_gather_mmu(&tlb, vma->vm_mm);
> -	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
> +	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags,
> +				true);
>  	tlb_finish_mmu(&tlb);
>  }
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c7105ec6d08c..d8b4d7e56939 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  					unsigned long start, unsigned long end)
>  {
> -	zap_page_range(vma, start, end - start);
> +	if (!is_vm_hugetlb_page(vma))
> +		zap_page_range(vma, start, end - start);
> +	else
> +		clear_hugetlb_page_range(vma, start, end);

With the new ZAP_FLAG_UNMAP flag, clear_hugetlb_page_range() can be dropped
completely?  As zap_page_range() won't be called with ZAP_FLAG_UNMAP so we can
identify things?

IIUC that's the major reason why I thought the zap flag could be helpful..

Thanks!

>  	return 0;
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index c5599a9279b1..679b702af4ce 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1671,7 +1671,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>  {
>  	struct mmu_notifier_range range;
>  	struct zap_details details = {
> -		.zap_flags = ZAP_FLAG_DROP_MARKER,
> +		.zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
>  		/* Careful - we need to zap private pages too! */
>  		.even_cows = true,
>  	};
> -- 
> 2.37.3
>
  
Mike Kravetz Oct. 28, 2022, 9:17 p.m. UTC | #8
On 10/28/22 12:13, Peter Xu wrote:
> On Fri, Oct 28, 2022 at 08:23:25AM -0700, Mike Kravetz wrote:
> > On 10/26/22 21:12, Peter Xu wrote:
> > > On Wed, Oct 26, 2022 at 04:54:01PM -0700, Mike Kravetz wrote:
> > > > On 10/26/22 17:42, Peter Xu wrote:
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c7105ec6d08c..d8b4d7e56939 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  					unsigned long start, unsigned long end)
> >  {
> > -	zap_page_range(vma, start, end - start);
> > +	if (!is_vm_hugetlb_page(vma))
> > +		zap_page_range(vma, start, end - start);
> > +	else
> > +		clear_hugetlb_page_range(vma, start, end);
> 
> With the new ZAP_FLAG_UNMAP flag, clear_hugetlb_page_range() can be dropped
> completely?  As zap_page_range() won't be called with ZAP_FLAG_UNMAP so we can
> identify things?
> 
> IIUC that's the major reason why I thought the zap flag could be helpful..

Argh.  I went to drop clear_hugetlb_page_range() but there is one issue.
In zap_page_range() the MMU_NOTIFY_CLEAR notifier is certainly called.
However, we really need to have an 'adjust_range_if_pmd_sharing_possible'
call in there because the 'range' may be part of a shared pmd.  :(

I think we need to either have a separate routine like clear_hugetlb_page_range
that sets up the appropriate range, or special case hugetlb in zap_page_range.
What do you think?
I think clear_hugetlb_page_range is the least bad of the two options.
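For reference, the helper in question widens the invalidation range out to PUD_SIZE boundaries so that a flush covers any shared pmd page in full. A simplified userspace rendering of just that arithmetic (assuming x86-64's 1 GiB PUD_SIZE, and ignoring the vma-eligibility checks the real kernel helper performs first):

```c
#define PUD_SIZE (1UL << 30)		/* 1 GiB on x86-64 */
#define PUD_MASK (~(PUD_SIZE - 1))

/*
 * Simplified model of adjust_range_if_pmd_sharing_possible(): round the
 * range out to PUD boundaries so the mmu notifier / TLB invalidation
 * covers the whole region a shared pmd page could map.  The real helper
 * first checks whether the vma is eligible for pmd sharing at all.
 */
static void adjust_range_for_pmd_sharing(unsigned long *start,
					 unsigned long *end)
{
	*start = *start & PUD_MASK;
	*end = (*end + PUD_SIZE - 1) & PUD_MASK;
}
```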
  
Peter Xu Oct. 28, 2022, 11:20 p.m. UTC | #9
On Fri, Oct 28, 2022 at 02:17:01PM -0700, Mike Kravetz wrote:
> On 10/28/22 12:13, Peter Xu wrote:
> > On Fri, Oct 28, 2022 at 08:23:25AM -0700, Mike Kravetz wrote:
> > > On 10/26/22 21:12, Peter Xu wrote:
> > > > On Wed, Oct 26, 2022 at 04:54:01PM -0700, Mike Kravetz wrote:
> > > > > On 10/26/22 17:42, Peter Xu wrote:
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index c7105ec6d08c..d8b4d7e56939 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > >  					unsigned long start, unsigned long end)
> > >  {
> > > -	zap_page_range(vma, start, end - start);
> > > +	if (!is_vm_hugetlb_page(vma))
> > > +		zap_page_range(vma, start, end - start);
> > > +	else
> > > +		clear_hugetlb_page_range(vma, start, end);
> > 
> > With the new ZAP_FLAG_UNMAP flag, clear_hugetlb_page_range() can be dropped
> > completely?  As zap_page_range() won't be called with ZAP_FLAG_UNMAP so we can
> > identify things?
> > 
> > IIUC that's the major reason why I thought the zap flag could be helpful..
> 
> Argh.  I went to drop clear_hugetlb_page_range() but there is one issue.
> In zap_page_range() the MMU_NOTIFY_CLEAR notifier is certainly called.
> However, we really need to have an 'adjust_range_if_pmd_sharing_possible'
> call in there because the 'range' may be part of a shared pmd.  :(
> 
> I think we need to either have a separate routine like clear_hugetlb_page_range
> that sets up the appropriate range, or special case hugetlb in zap_page_range.
> What do you think?
> I think clear_hugetlb_page_range is the least bad of the two options.

How about special case hugetlb as you mentioned?  If I'm not wrong, it
should be 3 lines change:

---8<---
diff --git a/mm/memory.c b/mm/memory.c
index c5599a9279b1..0a1632e44571 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1706,11 +1706,13 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
        lru_add_drain();
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
                                start, start + size);
+       if (is_vm_hugetlb_page(vma))
+               adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
        tlb_gather_mmu(&tlb, vma->vm_mm);
        update_hiwater_rss(vma->vm_mm);
        mmu_notifier_invalidate_range_start(&range);
        do {
-               unmap_single_vma(&tlb, vma, start, range.end, NULL);
+               unmap_single_vma(&tlb, vma, start, start + size, NULL);
        } while ((vma = mas_find(&mas, end - 1)) != NULL);
        mmu_notifier_invalidate_range_end(&range);
        tlb_finish_mmu(&tlb);
---8<---

As zap_page_range() is already vma-oriented anyway.  But maybe I missed
something important?
  
Mike Kravetz Oct. 30, 2022, 12:15 a.m. UTC | #10
On 10/28/22 19:20, Peter Xu wrote:
> On Fri, Oct 28, 2022 at 02:17:01PM -0700, Mike Kravetz wrote:
> > On 10/28/22 12:13, Peter Xu wrote:
> > > On Fri, Oct 28, 2022 at 08:23:25AM -0700, Mike Kravetz wrote:
> > > > On 10/26/22 21:12, Peter Xu wrote:
> > > > > On Wed, Oct 26, 2022 at 04:54:01PM -0700, Mike Kravetz wrote:
> > > > > > On 10/26/22 17:42, Peter Xu wrote:
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index c7105ec6d08c..d8b4d7e56939 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > >  					unsigned long start, unsigned long end)
> > > >  {
> > > > -	zap_page_range(vma, start, end - start);
> > > > +	if (!is_vm_hugetlb_page(vma))
> > > > +		zap_page_range(vma, start, end - start);
> > > > +	else
> > > > +		clear_hugetlb_page_range(vma, start, end);
> > > 
> > > With the new ZAP_FLAG_UNMAP flag, clear_hugetlb_page_range() can be dropped
> > > completely?  As zap_page_range() won't be called with ZAP_FLAG_UNMAP so we can
> > > identify things?
> > > 
> > > IIUC that's the major reason why I thought the zap flag could be helpful..
> > 
> > Argh.  I went to drop clear_hugetlb_page_range() but there is one issue.
> > In zap_page_range() the MMU_NOTIFY_CLEAR notifier is certainly called.
> > However, we really need to have an 'adjust_range_if_pmd_sharing_possible'
> > call in there because the 'range' may be part of a shared pmd.  :(
> > 
> > I think we need to either have a separate routine like clear_hugetlb_page_range
> > that sets up the appropriate range, or special case hugetlb in zap_page_range.
> > What do you think?
> > I think clear_hugetlb_page_range is the least bad of the two options.
> 
> How about special case hugetlb as you mentioned?  If I'm not wrong, it
> should be 3 lines change:
> 
> ---8<---
> diff --git a/mm/memory.c b/mm/memory.c
> index c5599a9279b1..0a1632e44571 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1706,11 +1706,13 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>         lru_add_drain();
>         mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
>                                 start, start + size);
> +       if (is_vm_hugetlb_page(vma))
> +               adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
>         tlb_gather_mmu(&tlb, vma->vm_mm);
>         update_hiwater_rss(vma->vm_mm);
>         mmu_notifier_invalidate_range_start(&range);
>         do {
> -               unmap_single_vma(&tlb, vma, start, range.end, NULL);
> +               unmap_single_vma(&tlb, vma, start, start + size, NULL);
>         } while ((vma = mas_find(&mas, end - 1)) != NULL);
>         mmu_notifier_invalidate_range_end(&range);
>         tlb_finish_mmu(&tlb);
> ---8<---
> 
> As zap_page_range() is already vma-oriented anyway.  But maybe I missed
> something important?

zap_page_range is a bit confusing.  It appears that the passed range can
span multiple vmas.  Otherwise, there would be no do while loop.  Yet, there
is only one mmu_notifier_range_init call specifying the passed vma.

It appears all callers pass a range entirely within a single vma.

The modifications above would work for a range within a single vma.  However,
things would be more complicated if the range can indeed span multiple vmas.
For multiple vmas, we would need to check the first and last vmas for
pmd sharing.

Anyone know more about this seemingly confusing behavior?  Perhaps range
spanning multiple vmas was left over from earlier code?
  
Nadav Amit Oct. 30, 2022, 12:54 a.m. UTC | #11
On Oct 29, 2022, at 5:15 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:

> zap_page_range is a bit confusing.  It appears that the passed range can
> span multiple vmas.  Otherwise, there would be no do while loop.  Yet, there
> is only one mmu_notifier_range_init call specifying the passed vma.
> 
> It appears all callers pass a range entirely within a single vma.
> 
> The modifications above would work for a range within a single vma.  However,
> things would be more complicated if the range can indeed span multiple vmas.
> For multiple vmas, we would need to check the first and last vmas for
> pmd sharing.
> 
> Anyone know more about this seemingly confusing behavior?  Perhaps range
> spanning multiple vmas was left over from earlier code?

I don’t have personal knowledge, but I noticed that it does not make much
sense, at least for MADV_DONTNEED. I tried to batch the TLB flushes across
VMAs for madvise’s. [1]

Need to get to it sometime.

[1] https://lore.kernel.org/lkml/20210926161259.238054-7-namit@vmware.com/
  
Peter Xu Oct. 30, 2022, 6:43 p.m. UTC | #12
On Sat, Oct 29, 2022 at 05:54:44PM -0700, Nadav Amit wrote:
> On Oct 29, 2022, at 5:15 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> > zap_page_range is a bit confusing.  It appears that the passed range can
> > span multiple vmas.  Otherwise, there would be no do while loop.  Yet, there
> > is only one mmu_notifier_range_init call specifying the passed vma.
> > 
> > It appears all callers pass a range entirely within a single vma.
> > 
> > The modifications above would work for a range within a single vma.  However,
> > things would be more complicated if the range can indeed span multiple vmas.
> > For multiple vmas, we would need to check the first and last vmas for
> > pmd sharing.
> > 
> > Anyone know more about this seemingly confusing behavior?  Perhaps range
> > spanning multiple vmas was left over from earlier code?
> 
> I don’t have personal knowledge, but I noticed that it does not make much
> sense, at least for MADV_DONTNEED. I tried to batch the TLB flushes across
> VMAs for madvise’s. [1]

The loop comes from 7e027b14d53e ("vm: simplify unmap_vmas() calling
convention", 2012-05-06), where zap_page_range() was used to replace a call
to unmap_vmas() because the patch wanted to eliminate the zap details
pointer for unmap_vmas(), which makes sense.

I didn't check the old code, but from what I can tell (and also as Mike
pointed out) I don't think zap_page_range() in the latest code base is
ever used on multi-vma at all.  Otherwise the mmu notifier is already
broken - see mmu_notifier_range_init() where the vma pointer is also part
of the notification.

Perhaps we should just remove the loop?

> 
> Need to get to it sometime.
> 
> [1] https://lore.kernel.org/lkml/20210926161259.238054-7-namit@vmware.com/
>
  
Nadav Amit Oct. 30, 2022, 6:52 p.m. UTC | #13
On Oct 30, 2022, at 11:43 AM, Peter Xu <peterx@redhat.com> wrote:

> The loop comes from 7e027b14d53e ("vm: simplify unmap_vmas() calling
> convention", 2012-05-06), where zap_page_range() was used to replace a call
> to unmap_vmas() because the patch wanted to eliminate the zap details
> pointer for unmap_vmas(), which makes sense.
> 
> I didn't check the old code, but from what I can tell (and also as Mike
> pointed out) I don't think zap_page_range() in the latest code base is
> ever used on multi-vma at all.  Otherwise the mmu notifier is already
> broken - see mmu_notifier_range_init() where the vma pointer is also part
> of the notification.
> 
> Perhaps we should just remove the loop?

There is already zap_page_range_single() that does exactly that. Just need
to export it.
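A sketch of that direction, as illustrative kernel-style pseudocode only (not a tested patch; it assumes zap_page_range_single() gains an export, and as Mike notes later in the thread its notifier range would still need adjust_range_if_pmd_sharing_possible() for hugetlb vmas):

```c
/* mm/madvise.c -- hypothetical shape, not the committed fix */
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	/*
	 * zap_page_range_single() already operates on exactly one vma.
	 * Under the ZAP_FLAG_UNMAP scheme discussed above, a zap without
	 * that flag is not treated as a final unmap, so the hugetlb
	 * vma_lock would be preserved.
	 */
	zap_page_range_single(vma, start, end - start, NULL);
	return 0;
}
```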
  
Mike Kravetz Oct. 31, 2022, 1:44 a.m. UTC | #14
On 10/30/22 11:52, Nadav Amit wrote:
> On Oct 30, 2022, at 11:43 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > The loop comes from 7e027b14d53e ("vm: simplify unmap_vmas() calling
> > convention", 2012-05-06), where zap_page_range() was used to replace a call
> > to unmap_vmas() because the patch wanted to eliminate the zap details
> > pointer for unmap_vmas(), which makes sense.
> > 
> > I didn't check the old code, but from what I can tell (and also as Mike
> > pointed out) I don't think zap_page_range() in the latest code base is
> > ever used on multi-vma at all.  Otherwise the mmu notifier is already
> > broken - see mmu_notifier_range_init() where the vma pointer is also part
> > of the notification.
> > 
> > Perhaps we should just remove the loop?
> 
> There is already zap_page_range_single() that does exactly that. Just need
> to export it.

I was thinking that zap_page_range() should perform a notification call for
each vma within the loop.  Something like this?

@@ -1704,15 +1704,21 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
 	lru_add_drain();
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				start, start + size);
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
-	mmu_notifier_invalidate_range_start(&range);
 	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
+				vma->vm_mm,
+				max(start, vma->vm_start),
+				min(start + size, vma->vm_end));
+		if (is_vm_hugetlb_page(vma))
+			adjust_range_if_pmd_sharing_possible(vma,
+				&range.start,
+				&range.end);
+		mmu_notifier_invalidate_range_start(&range);
+		unmap_single_vma(&tlb, vma, start, start + size, NULL);
+		mmu_notifier_invalidate_range_end(&range);
 	} while ((vma = mas_find(&mas, end - 1)) != NULL);
-	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
 

One thing to keep in mind is that this patch is a fix that must be
backported to stable.  Therefore, I do not think we want to add too
many changes out of the direct scope of the fix.

We can always change things like this in follow up patches.
  
Peter Xu Nov. 2, 2022, 7:24 p.m. UTC | #15
On Sun, Oct 30, 2022 at 06:44:10PM -0700, Mike Kravetz wrote:
> On 10/30/22 11:52, Nadav Amit wrote:
> > On Oct 30, 2022, at 11:43 AM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > > The loop comes from 7e027b14d53e ("vm: simplify unmap_vmas() calling
> > > convention", 2012-05-06), where zap_page_range() was used to replace a call
> > > to unmap_vmas() because the patch wanted to eliminate the zap details
> > > pointer for unmap_vmas(), which makes sense.
> > > 
> > > I didn't check the old code, but from what I can tell (and also as Mike
> > > pointed out) I don't think zap_page_range() in the latest code base is
> > > ever used on multi-vma at all.  Otherwise the mmu notifier is already
> > > broken - see mmu_notifier_range_init() where the vma pointer is also part
> > > of the notification.
> > > 
> > > Perhaps we should just remove the loop?
> > 
> > There is already zap_page_range_single() that does exactly that. Just need
> > to export it.
> 
> I was thinking that zap_page_range() should perform a notification call for
> each vma within the loop.  Something like this?

I'm boldly guessing what Nadav suggested was using zap_page_range_single()
and exporting it for MADV_DONTNEED.  Hopefully that's also the easiest for
stable?

For the long term, I really think we should just get rid of the loop..

> 
> @@ -1704,15 +1704,21 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
>  
>  	lru_add_drain();
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> -				start, start + size);
>  	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	update_hiwater_rss(vma->vm_mm);
> -	mmu_notifier_invalidate_range_start(&range);
>  	do {
> -		unmap_single_vma(&tlb, vma, start, range.end, NULL);
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> +				vma->vm_mm,
> +				max(start, vma->vm_start),
> +				min(start + size, vma->vm_end));
> +		if (is_vm_hugetlb_page(vma))
> +			adjust_range_if_pmd_sharing_possible(vma,
> +				&range.start,
> +				&range.end);
> +		mmu_notifier_invalidate_range_start(&range);
> +		unmap_single_vma(&tlb, vma, start, start + size, NULL);
> +		mmu_notifier_invalidate_range_end(&range);
>  	} while ((vma = mas_find(&mas, end - 1)) != NULL);
> -	mmu_notifier_invalidate_range_end(&range);
>  	tlb_finish_mmu(&tlb);
>  }
>  
> 
> One thing to keep in mind is that this patch is a fix that must be
> backported to stable.  Therefore, I do not think we want to add too
> many changes out of the direct scope of the fix.
> 
> We can always change things like this in follow up patches.
> -- 
> Mike Kravetz
>
  
Mike Kravetz Nov. 7, 2022, 8:01 p.m. UTC | #16
On 11/02/22 15:24, Peter Xu wrote:
> On Sun, Oct 30, 2022 at 06:44:10PM -0700, Mike Kravetz wrote:
> > On 10/30/22 11:52, Nadav Amit wrote:
> > > On Oct 30, 2022, at 11:43 AM, Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > > The loop comes from 7e027b14d53e ("vm: simplify unmap_vmas() calling
> > > > convention", 2012-05-06), where zap_page_range() was used to replace a call
> > > > to unmap_vmas() because the patch wanted to eliminate the zap details
> > > > pointer for unmap_vmas(), which makes sense.
> > > > 
> > > > I didn't check the old code, but from what I can tell (and also as Mike
> > > > pointed out) I don't think zap_page_range() in the latest code base is
> > > > ever used on multi-vma at all.  Otherwise the mmu notifier is already
> > > > broken - see mmu_notifier_range_init() where the vma pointer is also part
> > > > of the notification.
> > > > 
> > > > Perhaps we should just remove the loop?
> > > 
> > > There is already zap_page_range_single() that does exactly that. Just need
> > > to export it.
> > 
> > I was thinking that zap_page_range() should perform a notification call for
> > each vma within the loop.  Something like this?
> 
> I'm boldly guessing what Nadav suggested was using zap_page_range_single()
> and exporting it for MADV_DONTNEED.  Hopefully that's also the easiest for
> stable?

I started making this change, then noticed that zap_vma_ptes() just calls
zap_page_range_single().  And, it is already exported.  That may be a
better fit since exporting zap_page_range_single would require a wrapper
as I do not think we want to export struct zap_details as well.

In any case, we still need to add the adjust_range_if_pmd_sharing_possible()
call to zap_page_range_single.

> 
> For the long term, I really think we should just get rid of the loop..
> 

Yes.  It will look a little strange if adjust_range_if_pmd_sharing_possible is
added to zap_page_range_single but not zap_page_range.  And, to properly add
it to zap_page_range means rewriting the routine as I did here:
https://lore.kernel.org/linux-mm/20221102013100.455139-1-mike.kravetz@oracle.com/
  

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..0246e77be3a3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,8 @@  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
+void clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
@@ -426,6 +428,11 @@  static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	BUG();
 }
 
+static void __maybe_unused clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+}
+
 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned int flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..807cfd2884fa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5202,29 +5202,67 @@  static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		tlb_flush_mmu_tlbonly(tlb);
 }
 
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+static void __unmap_hugepage_range_locking(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
-			  zap_flags_t zap_flags)
+			  zap_flags_t zap_flags, bool final)
 {
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 
 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
-	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchrously deleted.  If the page tables are shared, there
-	 * will be issues when accessed by someone else.
-	 */
-	__hugetlb_vma_unlock_write_free(vma);
+	if (final) {
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
+		 * pmd sharing.  This is important as page tables for this
+		 * unmapped range will be asynchrously deleted.  If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
+}
 
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+			  struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end, struct page *ref_page,
+			  zap_flags_t zap_flags)
+{
+	__unmap_hugepage_range_locking(tlb, vma, start, end, ref_page,
+					zap_flags, true);
 }
 
+#ifdef CONFIG_ADVISE_SYSCALLS
+/*
+ * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
+ * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
+ * the associated vma_lock.
+ */
+void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	update_hiwater_rss(vma->vm_mm);
+	mmu_notifier_invalidate_range_start(&range);
+
+	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
+}
+#endif
+
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..90577a669635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -790,7 +790,10 @@  static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	if (!is_vm_hugetlb_page(vma))
+		zap_page_range(vma, start, end - start);
+	else
+		clear_hugetlb_page_range(vma, start, end);
 	return 0;
 }