[v2,04/16] mm: Change do_vmi_align_munmap() side tree index

Message ID 20230612203953.2093911-5-Liam.Howlett@oracle.com
State New
Series Reduce preallocations for maple tree

Commit Message

Liam R. Howlett June 12, 2023, 8:39 p.m. UTC
  The majority of munmap() calls operate on a single VMA.  The maple
tree can store a single entry at index 0, with a size of 1, as a
pointer and avoid any allocations.  Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count.  This
leverages the ability to store the first entry without a node
allocation.

Storing the entries in a tree by the count rather than by the VMA start
and end means changing the functions which iterate over the entries.
Update unmap_vmas() and free_pgtables() to take a maple state and a
tree end address to support this functionality.

Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls.  This happens in the
static unmap_region() and exit_mmap().

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/internal.h |  8 ++++----
 mm/memory.c   | 19 +++++++++----------
 mm/mmap.c     | 41 ++++++++++++++++++++++++-----------------
 3 files changed, 37 insertions(+), 31 deletions(-)
  

Comments

Sergey Senozhatsky June 20, 2023, 12:26 p.m. UTC | #1
Hello Liam,

On (23/06/12 16:39), Liam R. Howlett wrote:
[..]
> @@ -2450,17 +2452,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
>  	/* Make sure no VMAs are about to be lost. */
>  	{
> -		MA_STATE(test, &mt_detach, start, end - 1);
> +		MA_STATE(test, &mt_detach, 0, 0);
>  		struct vm_area_struct *vma_mas, *vma_test;
>  		int test_count = 0;
>  
>  		vma_iter_set(vmi, start);
>  		rcu_read_lock();
> -		vma_test = mas_find(&test, end - 1);
> +		vma_test = mas_find(&test, count - 1);
>  		for_each_vma_range(*vmi, vma_mas, end) {
>  			BUG_ON(vma_mas != vma_test);
>  			test_count++;
> -			vma_test = mas_next(&test, end - 1);
> +			vma_test = mas_next(&test, count - 1);
>  		}
>  		rcu_read_unlock();
>  		BUG_ON(count != test_count);

Something isn't quite working, I'm hitting BUG_ON(vma_mas != vma_test)

[    8.156437] ------------[ cut here ]------------
[    8.160473] kernel BUG at mm/mmap.c:2439!
[    8.163894] invalid opcode: 0000 [#1] PREEMPT SMP PTI

RIP: 0010:do_vmi_align_munmap+0x489/0x8a0

[    8.207867] Call Trace:
[    8.208463]  <TASK>
[    8.209018]  ? die+0x32/0x80
[    8.209709]  ? do_trap+0xd2/0x100
[    8.210520]  ? do_vmi_align_munmap+0x489/0x8a0
[    8.211576]  ? do_vmi_align_munmap+0x489/0x8a0
[    8.212639]  ? do_error_trap+0x94/0x110
[    8.213549]  ? do_vmi_align_munmap+0x489/0x8a0
[    8.214581]  ? exc_invalid_op+0x49/0x60
[    8.215494]  ? do_vmi_align_munmap+0x489/0x8a0
[    8.216576]  ? asm_exc_invalid_op+0x16/0x20
[    8.217562]  ? do_vmi_align_munmap+0x489/0x8a0
[    8.218626]  do_vmi_munmap+0xc7/0x120
[    8.219494]  __vm_munmap+0xaa/0x1c0
[    8.220370]  __x64_sys_munmap+0x17/0x20
[    8.221275]  do_syscall_64+0x34/0x80
[    8.222165]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[    8.223359] RIP: 0033:0x7fdb0e2fca97
[    8.224224] Code: ff ff ff ff c3 66 0f 1f 44 00 00 f7 d8 89 05 20 28 01 00 48 c7 c0 ff ff ff ff c3 0f 1f 84 00 00 00 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8d 0d f9 27 01 00 f7 d8 89 01 48 83
[    8.228432] RSP: 002b:00007ffd15458348 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[    8.230175] RAX: ffffffffffffffda RBX: 0000562081a347b0 RCX: 00007fdb0e2fca97
[    8.231833] RDX: 0000000000000002 RSI: 0000000000004010 RDI: 00007fdb0e2d5000
[    8.233513] RBP: 00007ffd154584d0 R08: 0000000000000000 R09: 0000562081a3fd30
[    8.235178] R10: 0000562081a3fd18 R11: 0000000000000202 R12: 00007ffd15458388
[    8.236861] R13: 00007ffd15458438 R14: 00007ffd15458370 R15: 0000562081a347b0
  
Liam R. Howlett June 20, 2023, 1:04 p.m. UTC | #2
* Sergey Senozhatsky <senozhatsky@chromium.org> [230620 08:26]:
> Hello Liam,
> 
> On (23/06/12 16:39), Liam R. Howlett wrote:
> [..]
> > @@ -2450,17 +2452,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> >  	/* Make sure no VMAs are about to be lost. */
> >  	{
> > -		MA_STATE(test, &mt_detach, start, end - 1);
> > +		MA_STATE(test, &mt_detach, 0, 0);
> >  		struct vm_area_struct *vma_mas, *vma_test;
> >  		int test_count = 0;
> >  
> >  		vma_iter_set(vmi, start);
> >  		rcu_read_lock();
> > -		vma_test = mas_find(&test, end - 1);
> > +		vma_test = mas_find(&test, count - 1);
> >  		for_each_vma_range(*vmi, vma_mas, end) {
> >  			BUG_ON(vma_mas != vma_test);
> >  			test_count++;
> > -			vma_test = mas_next(&test, end - 1);
> > +			vma_test = mas_next(&test, count - 1);
> >  		}
> >  		rcu_read_unlock();
> >  		BUG_ON(count != test_count);
> 
> Something isn't quite working, I'm hitting BUG_ON(vma_mas != vma_test)

Is this with next by any chance?  There's a merge conflict which I'll
have to fix, but I won't be getting to it in time so the patches will
not make this merge window.

  
Sergey Senozhatsky June 21, 2023, 12:10 a.m. UTC | #3
On (23/06/20 09:04), Liam R. Howlett wrote:
> Is this with next by any chance?

Oh yes, linux-next
  

Patch

diff --git a/mm/internal.h b/mm/internal.h
index 9b665c4e5fc0..24437f11d3c2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -103,7 +103,7 @@  bool __folio_end_writeback(struct folio *folio);
 void deactivate_file_folio(struct folio *folio);
 void folio_activate(struct folio *folio);
 
-void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *start_vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked);
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
@@ -1099,9 +1099,9 @@  static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
 	return 0;
 }
 
-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
-		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr, bool mm_wr_locked);
+void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
+		struct vm_area_struct *start_vma, unsigned long start,
+		unsigned long end, unsigned long tree_end, bool mm_wr_locked);
 
 /*
  * VMA lock generalization
diff --git a/mm/memory.c b/mm/memory.c
index 8358f3b853f2..fa343b8dad4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -360,12 +360,10 @@  void free_pgd_range(struct mmu_gather *tlb,
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked)
 {
-	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
-
 	do {
 		unsigned long addr = vma->vm_start;
 		struct vm_area_struct *next;
@@ -374,7 +372,7 @@  void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
 		 * be 0.  This will underflow and is okay.
 		 */
-		next = mas_find(&mas, ceiling - 1);
+		next = mas_find(mas, ceiling - 1);
 
 		/*
 		 * Hide vma from rmap and truncate_pagecache before freeing
@@ -395,7 +393,7 @@  void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
-				next = mas_find(&mas, ceiling - 1);
+				next = mas_find(mas, ceiling - 1);
 				if (mm_wr_locked)
 					vma_start_write(vma);
 				unlink_anon_vmas(vma);
@@ -1684,10 +1682,11 @@  static void unmap_single_vma(struct mmu_gather *tlb,
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
  * @tlb: address of the caller's struct mmu_gather
- * @mt: the maple tree
+ * @mas: The maple state
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
+ * @tree_end: The end address to search in the maple tree
  *
  * Unmap all pages in the vma list.
  *
@@ -1700,9 +1699,10 @@  static void unmap_single_vma(struct mmu_gather *tlb,
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
+void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr, bool mm_wr_locked)
+		unsigned long end_addr, unsigned long tree_end,
+		bool mm_wr_locked)
 {
 	struct mmu_notifier_range range;
 	struct zap_details details = {
@@ -1710,7 +1710,6 @@  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 		/* Careful - we need to zap private pages too! */
 		.even_cows = true,
 	};
-	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
 				start_addr, end_addr);
@@ -1718,7 +1717,7 @@  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 	do {
 		unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
 				 mm_wr_locked);
-	} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
+	} while ((vma = mas_find(mas, tree_end - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 1503a7bdb192..8e5563668b18 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -76,10 +76,10 @@  int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
 static bool ignore_rlimit_data;
 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
-static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
+static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		struct vm_area_struct *next, unsigned long start,
-		unsigned long end, bool mm_wr_locked);
+		unsigned long end, unsigned long tree_end, bool mm_wr_locked);
 
 static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
 {
@@ -2221,18 +2221,20 @@  static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
+static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		struct vm_area_struct *next,
-		unsigned long start, unsigned long end, bool mm_wr_locked)
+		struct vm_area_struct *next, unsigned long start,
+		unsigned long end, unsigned long tree_end, bool mm_wr_locked)
 {
 	struct mmu_gather tlb;
+	unsigned long mt_start = mas->index;
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked);
-	free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+	unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
+	mas_set(mas, mt_start);
+	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 				 next ? next->vm_start : USER_PGTABLES_CEILING,
 				 mm_wr_locked);
 	tlb_finish_mmu(&tlb);
@@ -2338,7 +2340,6 @@  static inline int munmap_sidetree(struct vm_area_struct *vma,
 				   struct ma_state *mas_detach)
 {
 	vma_start_write(vma);
-	mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
 	if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
 		return -ENOMEM;
 
@@ -2415,6 +2416,7 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			if (error)
 				goto end_split_failed;
 		}
+		mas_set(&mas_detach, count);
 		error = munmap_sidetree(next, &mas_detach);
 		if (error)
 			goto munmap_sidetree_failed;
@@ -2450,17 +2452,17 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
 	{
-		MA_STATE(test, &mt_detach, start, end - 1);
+		MA_STATE(test, &mt_detach, 0, 0);
 		struct vm_area_struct *vma_mas, *vma_test;
 		int test_count = 0;
 
 		vma_iter_set(vmi, start);
 		rcu_read_lock();
-		vma_test = mas_find(&test, end - 1);
+		vma_test = mas_find(&test, count - 1);
 		for_each_vma_range(*vmi, vma_mas, end) {
 			BUG_ON(vma_mas != vma_test);
 			test_count++;
-			vma_test = mas_next(&test, end - 1);
+			vma_test = mas_next(&test, count - 1);
 		}
 		rcu_read_unlock();
 		BUG_ON(count != test_count);
@@ -2490,9 +2492,11 @@  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
 	 */
-	unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
+	mas_set(&mas_detach, 1);
+	unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
+		     !downgrade);
 	/* Statistics and freeing VMAs */
-	mas_set(&mas_detach, start);
+	mas_set(&mas_detach, 0);
 	remove_mt(mm, &mas_detach);
 	__mt_destroy(&mt_detach);
 
@@ -2800,9 +2804,10 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		fput(vma->vm_file);
 		vma->vm_file = NULL;
 
+		vma_iter_set(&vmi, vma->vm_end);
 		/* Undo any partial mapping done by a device driver. */
-		unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start,
-			     vma->vm_end, true);
+		unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
+			     vma->vm_end, vma->vm_end, true);
 	}
 	if (file && (vm_flags & VM_SHARED))
 		mapping_unmap_writable(file->f_mapping);
@@ -3131,7 +3136,7 @@  void exit_mmap(struct mm_struct *mm)
 	tlb_gather_mmu_fullmm(&tlb, mm);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
-	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
+	unmap_vmas(&tlb, &mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
 	mmap_read_unlock(mm);
 
 	/*
@@ -3141,7 +3146,8 @@  void exit_mmap(struct mm_struct *mm)
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 	mmap_write_lock(mm);
 	mt_clear_in_rcu(&mm->mm_mt);
-	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
+	mas_set(&mas, vma->vm_end);
+	free_pgtables(&tlb, &mas, vma, FIRST_USER_ADDRESS,
 		      USER_PGTABLES_CEILING, true);
 	tlb_finish_mmu(&tlb);
 
@@ -3150,6 +3156,7 @@  void exit_mmap(struct mm_struct *mm)
 	 * enabled, without holding any MM locks besides the unreachable
 	 * mmap_write_lock.
 	 */
+	mas_set(&mas, vma->vm_end);
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);