[v8,2/2] mm: remove zap_page_range and change callers to use zap_vma_range

Message ID 20221108011910.350887-3-mike.kravetz@oracle.com
State New
Series hugetlb MADV_DONTNEED fix and zap_page_range cleanup

Commit Message

Mike Kravetz Nov. 8, 2022, 1:19 a.m. UTC
  zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas.  However, today all callers of
zap_page_range pass a range entirely within a single vma.  In addition,
the mmu notification call within zap_page_range is not correct as it
should be vma specific.

Instead of fixing zap_page_range, change all callers to use zap_vma_range
as it is designed for ranges within a single vma.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/arm64/kernel/vdso.c                |  4 ++--
 arch/powerpc/kernel/vdso.c              |  2 +-
 arch/powerpc/platforms/book3s/vas-api.c |  2 +-
 arch/powerpc/platforms/pseries/vas.c    |  2 +-
 arch/riscv/kernel/vdso.c                |  4 ++--
 arch/s390/kernel/vdso.c                 |  2 +-
 arch/s390/mm/gmap.c                     |  2 +-
 arch/x86/entry/vdso/vma.c               |  2 +-
 drivers/android/binder_alloc.c          |  2 +-
 include/linux/mm.h                      |  2 --
 mm/memory.c                             | 30 -------------------------
 mm/page-writeback.c                     |  2 +-
 net/ipv4/tcp.c                          |  6 ++---
 13 files changed, 15 insertions(+), 47 deletions(-)
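
The notifier problem mentioned in the commit message is visible in the body of
zap_page_range() being deleted (the full removal is in the mm/memory.c hunk
below); an annotated excerpt:

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, start + size);	/* covers the first vma only */
	tlb_gather_mmu(&tlb, vma->vm_mm);
	update_hiwater_rss(vma->vm_mm);
	mmu_notifier_invalidate_range_start(&range);
	do {
		unmap_single_vma(&tlb, vma, start, range.end, NULL);
	} while ((vma = mas_find(&mas, end - 1)) != NULL);	/* may walk into later vmas */
	mmu_notifier_invalidate_range_end(&range);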
  

Comments

Nadav Amit Nov. 10, 2022, 9:09 p.m. UTC | #1
On Nov 7, 2022, at 5:19 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:

> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas.  However, today all callers of
> zap_page_range pass a range entirely within a single vma.  In addition,
> the mmu notification call within zap_page range is not correct as it
> should be vma specific.
> 
> Instead of fixing zap_page_range, change all callers to use zap_vma_range
> as it is designed for ranges within a single vma.

I understand the argument about mmu notifiers being broken (which is of
course fixable).

But, are the callers really able to guarantee that the ranges are all in a
single VMA? I am not familiar with the users, but how for instance
tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
sorts that caused the original VMA to be split?
  
Peter Xu Nov. 10, 2022, 9:27 p.m. UTC | #2
Hi, Nadav,

On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
> But, are the callers really able to guarantee that the ranges are all in a
> single VMA? I am not familiar with the users, but how for instance
> tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
> sorts that caused the original VMA to be split?

Let me try to answer this one for Mike..  We have two callers in tcp
zerocopy code for this function:

tcp_zerocopy_vm_insert_batch_error[2095] zap_page_range(vma, *address, maybe_zap_len);
tcp_zerocopy_receive[2237]     zap_page_range(vma, address, total_bytes_to_map);

Both of them take the mmap lock for read, so firstly mprotect is not
possible.

The 1st call has:

	mmap_read_lock(current->mm);

	vma = vma_lookup(current->mm, address);
	if (!vma || vma->vm_ops != &tcp_vm_ops) {
		mmap_read_unlock(current->mm);
		return -EINVAL;
	}
	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
	avail_len = min_t(u32, vma_len, inq);
	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
	if (total_bytes_to_map) {
		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
			zap_page_range(vma, address, total_bytes_to_map);

Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
of the rest vma range.  So total_bytes_to_map will never go beyond the vma.
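
Spelling that bound out with the same names (the inequality chain below is only
illustrative; it is not asserted anywhere in the kernel source):

	/*
	 *   total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1)
	 *                      <= avail_len
	 *                      <= vma_len
	 *                      <= vma->vm_end - address
	 *
	 * so address + total_bytes_to_map <= vma->vm_end, i.e. the zap stays
	 * inside the vma that was looked up under the mmap read lock.
	 */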

The 2nd call uses maybe_zap_len as len, we need to look two layers of the
callers, but ultimately it's something smaller than total_bytes_to_map we
discussed.  Hopefully it proves 100% safety on tcp zerocopy.
  
Nadav Amit Nov. 10, 2022, 10:02 p.m. UTC | #3
On Nov 10, 2022, at 1:27 PM, Peter Xu <peterx@redhat.com> wrote:

> Hi, Nadav,
> 
> On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
>> But, are the callers really able to guarantee that the ranges are all in a
>> single VMA? I am not familiar with the users, but how for instance
>> tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
>> sorts that caused the original VMA to be split?
> 
> Let me try to answer this one for Mike..  We have two callers in tcp
> zerocopy code for this function:
> 
> tcp_zerocopy_vm_insert_batch_error[2095] zap_page_range(vma, *address, maybe_zap_len);
> tcp_zerocopy_receive[2237]     zap_page_range(vma, address, total_bytes_to_map);
> 
> Both of them take the mmap lock for read, so firstly mprotect is not
> possible.
> 
> The 1st call has:
> 
> 	mmap_read_lock(current->mm);
> 
> 	vma = vma_lookup(current->mm, address);
> 	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> 		mmap_read_unlock(current->mm);
> 		return -EINVAL;
> 	}
> 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
> 	avail_len = min_t(u32, vma_len, inq);
> 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> 	if (total_bytes_to_map) {
> 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
> 			zap_page_range(vma, address, total_bytes_to_map);
> 
> Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
> of the rest vma range.  So total_bytes_to_map will never go beyond the vma.
> 
> The 2nd call uses maybe_zap_len as len, we need to look two layers of the
> callers, but ultimately it's something smaller than total_bytes_to_map we
> discussed.  Hopefully it proves 100% safety on tcp zerocopy.

Thanks Peter for the detailed explanation.

I had another look at the code and indeed it should not break. I am not sure
whether users who zero-copy receive and mprotect() part of the memory would
not be surprised, but I guess that’s a different story, which I should
further study at some point.
  
Mike Kravetz Nov. 10, 2022, 10:17 p.m. UTC | #4
On 11/10/22 14:02, Nadav Amit wrote:
> On Nov 10, 2022, at 1:27 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Nadav,
> > 
> > On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
> >> But, are the callers really able to guarantee that the ranges are all in a
> >> single VMA? I am not familiar with the users, but how for instance
> >> tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
> >> sorts that caused the original VMA to be split?
> > 
> > Let me try to answer this one for Mike..  We have two callers in tcp
> > zerocopy code for this function:
> > 
> > tcp_zerocopy_vm_insert_batch_error[2095] zap_page_range(vma, *address, maybe_zap_len);
> > tcp_zerocopy_receive[2237]     zap_page_range(vma, address, total_bytes_to_map);
> > 
> > Both of them take the mmap lock for read, so firstly mprotect is not
> > possible.
> > 
> > The 1st call has:
> > 
> > 	mmap_read_lock(current->mm);
> > 
> > 	vma = vma_lookup(current->mm, address);
> > 	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> > 		mmap_read_unlock(current->mm);
> > 		return -EINVAL;
> > 	}
> > 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
> > 	avail_len = min_t(u32, vma_len, inq);
> > 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> > 	if (total_bytes_to_map) {
> > 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
> > 			zap_page_range(vma, address, total_bytes_to_map);
> > 
> > Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
> > of the rest vma range.  So total_bytes_to_map will never go beyond the vma.
> > 
> > The 2nd call uses maybe_zap_len as len, we need to look two layers of the
> > callers, but ultimately it's something smaller than total_bytes_to_map we
> > discussed.  Hopefully it proves 100% safety on tcp zerocopy.
> 
> Thanks Peter for the detailed explanation.
> 
> I had another look at the code and indeed it should not break. I am not sure
> whether users who zero-copy receive and mprotect() part of the memory would
> not be surprised, but I guess that’s a different story, which I should
> further study at some point.

I did audit all calling sites and am fairly certain passed ranges are within
a single vma.  Because of this, Peter suggested removing zap_page_range.  If
there is concern, we can just fix up the mmu notifiers in zap_page_range and
leave it.  This is what is done in the patch which is currently in
mm-hotfixes-unstable.
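
For reference, a minimal sketch of that alternative (illustrative only, not the
actual patch sitting in mm-hotfixes-unstable): keep zap_page_range() but issue
the notifier invalidation per vma inside the loop, clamped to each vma, instead
of once against the first vma:

	void zap_page_range(struct vm_area_struct *vma, unsigned long start,
			unsigned long size)
	{
		struct maple_tree *mt = &vma->vm_mm->mm_mt;
		unsigned long end = start + size;
		struct mmu_notifier_range range;
		struct mmu_gather tlb;
		MA_STATE(mas, mt, vma->vm_end, vma->vm_end);

		lru_add_drain();
		tlb_gather_mmu(&tlb, vma->vm_mm);
		update_hiwater_rss(vma->vm_mm);
		do {
			/* Notify only the part of [start, end) inside this vma. */
			mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
						vma->vm_mm, max(start, vma->vm_start),
						min(end, vma->vm_end));
			mmu_notifier_invalidate_range_start(&range);
			unmap_single_vma(&tlb, vma, start, range.end, NULL);
			mmu_notifier_invalidate_range_end(&range);
		} while ((vma = mas_find(&mas, end - 1)) != NULL);
		tlb_finish_mmu(&tlb);
	}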
  

Patch

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 99ae81ab91a7..05aa0c68b609 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -141,10 +141,10 @@  int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #ifdef CONFIG_COMPAT_VDSO
 		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #endif
 	}
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 4abc01949702..69210ca35dc8 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -123,7 +123,7 @@  int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_spec))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 	}
 	mmap_read_unlock(mm);
 
diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
index 40f5ae5e1238..475925723981 100644
--- a/arch/powerpc/platforms/book3s/vas-api.c
+++ b/arch/powerpc/platforms/book3s/vas-api.c
@@ -414,7 +414,7 @@  static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
 	/*
 	 * When the LPAR lost credits due to core removal or during
 	 * migration, invalidate the existing mapping for the current
-	 * paste addresses and set windows in-active (zap_page_range in
+	 * paste addresses and set windows in-active (zap_vma_range in
 	 * reconfig_close_windows()).
 	 * New mapping will be done later after migration or new credits
 	 * available. So continue to receive faults if the user space
diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 4ad6e510d405..b70afaa5e399 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -760,7 +760,7 @@  static int reconfig_close_windows(struct vas_caps *vcap, int excess_creds,
 		 * is done before the original mmap() and after the ioctl.
 		 */
 		if (vma)
-			zap_page_range(vma, vma->vm_start,
+			zap_vma_range(vma, vma->vm_start,
 					vma->vm_end - vma->vm_start);
 
 		mmap_write_unlock(task_ref->mm);
diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index 123d05255fcf..47b767215d15 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -127,10 +127,10 @@  int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info.dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #ifdef CONFIG_COMPAT
 		if (vma_is_special_mapping(vma, compat_vdso_info.dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #endif
 	}
 
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 119328e1e2b3..af50c3cefe45 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -78,7 +78,7 @@  int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 
 		if (!vma_is_special_mapping(vma, &vvar_mapping))
 			continue;
-		zap_page_range(vma, vma->vm_start, size);
+		zap_vma_range(vma, vma->vm_start, size);
 		break;
 	}
 	mmap_read_unlock(mm);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 02d15c8dc92e..32f1d4a3d241 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -723,7 +723,7 @@  void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 		if (is_vm_hugetlb_page(vma))
 			continue;
 		size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
-		zap_page_range(vma, vmaddr, size);
+		zap_vma_range(vma, vmaddr, size);
 	}
 	mmap_read_unlock(gmap->mm);
 }
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index d45c5fcfeac2..b3c269cf28d0 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -134,7 +134,7 @@  int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_mapping))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 	}
 	mmap_read_unlock(mm);
 
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 1c39cfce32fa..063a9b4a6c02 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -1012,7 +1012,7 @@  enum lru_status binder_alloc_free_page(struct list_head *item,
 	if (vma) {
 		trace_binder_unmap_user_start(alloc, index);
 
-		zap_page_range(vma, page_addr, PAGE_SIZE);
+		zap_vma_range(vma, page_addr, PAGE_SIZE);
 
 		trace_binder_unmap_user_end(alloc, index);
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d205bcd9cd2e..16052a628ab2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1838,8 +1838,6 @@  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
-void zap_page_range(struct vm_area_struct *vma, unsigned long address,
-		    unsigned long size);
 void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
 		    unsigned long size);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
diff --git a/mm/memory.c b/mm/memory.c
index af3a4724b464..a9b2aa1149b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1686,36 +1686,6 @@  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 	mmu_notifier_invalidate_range_end(&range);
 }
 
-/**
- * zap_page_range - remove user pages in a given range
- * @vma: vm_area_struct holding the applicable pages
- * @start: starting address of pages to zap
- * @size: number of bytes to zap
- *
- * Caller must protect the VMA list
- */
-void zap_page_range(struct vm_area_struct *vma, unsigned long start,
-		unsigned long size)
-{
-	struct maple_tree *mt = &vma->vm_mm->mm_mt;
-	unsigned long end = start + size;
-	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
-	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
-
-	lru_add_drain();
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				start, start + size);
-	tlb_gather_mmu(&tlb, vma->vm_mm);
-	update_hiwater_rss(vma->vm_mm);
-	mmu_notifier_invalidate_range_start(&range);
-	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
-	} while ((vma = mas_find(&mas, end - 1)) != NULL);
-	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb);
-}
-
 /**
  * __zap_page_range_single - remove user pages in a given range
  * @vma: vm_area_struct holding the applicable pages
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e9d8d857ecc..dbfa8b2062fc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2601,7 +2601,7 @@  void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
  *
  * The caller must hold lock_page_memcg().  Most callers have the folio
  * locked.  A few have the folio blocked from truncation through other
- * means (eg zap_page_range() has it mapped and is holding the page table
+ * means (eg zap_vma_range() has it mapped and is holding the page table
  * lock).  This can also be called from mark_buffer_dirty(), which I
  * cannot prove is always protected against truncate.
  */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de8f0cd7cb32..dea1d72ae4e2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2092,7 +2092,7 @@  static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
 		maybe_zap_len = total_bytes_to_map -  /* All bytes to map */
 				*length + /* Mapped or pending */
 				(pages_remaining * PAGE_SIZE); /* Failed map. */
-		zap_page_range(vma, *address, maybe_zap_len);
+		zap_vma_range(vma, *address, maybe_zap_len);
 		err = 0;
 	}
 
@@ -2100,7 +2100,7 @@  static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
 		unsigned long leftover_pages = pages_remaining;
 		int bytes_mapped;
 
-		/* We called zap_page_range, try to reinsert. */
+		/* We called zap_vma_range, try to reinsert. */
 		err = vm_insert_pages(vma, *address,
 				      pending_pages,
 				      &pages_remaining);
@@ -2234,7 +2234,7 @@  static int tcp_zerocopy_receive(struct sock *sk,
 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
 	if (total_bytes_to_map) {
 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
-			zap_page_range(vma, address, total_bytes_to_map);
+			zap_vma_range(vma, address, total_bytes_to_map);
 		zc->length = total_bytes_to_map;
 		zc->recv_skip_hint = 0;
 	} else {