[v3,15/15] mm/mmap: Change vma iteration order in do_vmi_align_munmap()

Message ID 20230724183157.3939892-16-Liam.Howlett@oracle.com
State New
Series Reduce preallocations for maple tree

Commit Message

Liam R. Howlett July 24, 2023, 6:31 p.m. UTC
By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations.  It
also means that prev/next will be loaded closer to when they are used.

This requires changing the loop type when gathering the VMAs that will
be freed.
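
For reference, for_each_vma_range() is an open-coded while () over
vma_find() (paraphrased below from include/linux/mm.h around the time of
this series), which is what lets the gathering loop become a do { ... }
that handles the already-loaded start vma on its first pass without
re-walking the tree:

	/* include/linux/mm.h (paraphrased) */
	#define for_each_vma_range(__vmi, __vma, __end)			\
		while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)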

Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).

Using vma_iter_prev_range() to walk back to the correct location in
the tree will, for the most part, mean walking within the CPU cache.
Usually this is two steps, versus a node reset and a full tree re-walk.
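
Putting the hunks below together, the tail of do_vmi_align_munmap()
roughly becomes the following (a condensed sketch based on this patch,
with error handling and the CONFIG_DEBUG_VM checks omitted; see the
diff for the real code):

	/* Gather and detach [start, end); 'vma' already covers start. */
	next = vma;
	do {
		if (next->vm_end > end)
			__split_vma(vmi, next, end, 0);	/* the part above 'end' stays mapped */
		/* ... vma_start_write(next), store next into mas_detach ... */
	} for_each_vma_range(*vmi, next, end);

	/* Walk back to the start of the range, likely within cached nodes. */
	while (vma_iter_addr(vmi) > start)
		vma_iter_prev_range(vmi);

	/* The write of NULL over [start, end). */
	vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);

	if (unlock)
		mmap_write_downgrade(mm);

	/* Only now load prev/next, shortly before they are used. */
	prev = vma_iter_prev_range(vmi);
	next = vma_next(vmi);
	if (next)
		vma_iter_prev_range(vmi);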

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)
  

Comments

Jann Horn Aug. 14, 2023, 3:43 p.m. UTC | #1
@akpm

On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Since prev will be set later in the function, it is better to reverse
> the splitting direction of the start VMA (modify the new_below argument
> to __split_vma).

It might be a good idea to reorder "mm: always lock new vma before
inserting into vma tree" before this patch.

If you apply this patch without "mm: always lock new vma before
inserting into vma tree", I think move_vma(), when called with a start
address in the middle of a VMA, will behave like this:

 - vma_start_write() [lock the VMA to be moved]
 - move_page_tables() [moves page table entries]
 - do_vmi_munmap()
   - do_vmi_align_munmap()
     - __split_vma()
       - creates a new VMA **covering the moved range** that is **not locked**
       - stores the new VMA in the VMA tree **without locking it** [1]
     - new VMA is locked and removed again [2]
[...]

So after the page tables in the region have already been moved, I
believe there will be a brief window (between [1] and [2]) where page
faults in the region can happen again, which could probably cause new
page tables and PTEs to be created in the region.
(This can't happen in Linus' current tree because the new VMA created
by __split_vma() only covers the range that is not being moved.)

Though I guess that's not going to lead to anything bad, since
do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
So maybe it's not that important.
  

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 58f7b7038e4c..f11c0d663deb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2451,20 +2451,17 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
 			goto map_count_exceeded;
 
-		error = __split_vma(vmi, vma, start, 0);
+		error = __split_vma(vmi, vma, start, 1);
 		if (error)
 			goto start_split_failed;
-
-		vma = vma_iter_load(vmi);
 	}
 
-	prev = vma_prev(vmi);
-
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	for_each_vma_range(*vmi, next, end) {
+	next = vma;
+	do {
 		/* Does it split the end? */
 		if (next->vm_end > end) {
 			error = __split_vma(vmi, next, end, 0);
@@ -2500,13 +2497,7 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		BUG_ON(next->vm_start < start);
 		BUG_ON(next->vm_start > end);
 #endif
-	}
-
-	if (vma_iter_end(vmi) > end)
-		next = vma_iter_load(vmi);
-
-	if (!next)
-		next = vma_next(vmi);
+	} for_each_vma_range(*vmi, next, end);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
@@ -2527,7 +2518,10 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		BUG_ON(count != test_count);
 	}
 #endif
-	vma_iter_set(vmi, start);
+
+	while (vma_iter_addr(vmi) > start)
+		vma_iter_prev_range(vmi);
+
 	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
 	if (error)
 		goto clear_tree_failed;
@@ -2538,6 +2532,11 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (unlock)
 		mmap_write_downgrade(mm);
 
+	prev = vma_iter_prev_range(vmi);
+	next = vma_next(vmi);
+	if (next)
+		vma_iter_prev_range(vmi);
+
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.