[v3,04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

Message ID 20240102184633.748113-5-urezki@gmail.com
State New
Series: Mitigate a vmap lock contention v3

Commit Message

Uladzislau Rezki Jan. 2, 2024, 6:46 p.m. UTC
Store allocated objects in separate nodes. A va->va_start
address is converted into the correct node where the VA should
be placed and reside. An addr_to_node() function is used
to do the address conversion and determine the node that
contains a VA.

Such an approach balances VAs across nodes; as a result, access
becomes scalable. The number of nodes in a system depends on the
number of CPUs.
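As a rough illustration of the scheme described above, here is a userspace sketch of how an address could be mapped to its "home" node and how a lookup might wrap around to the remaining nodes. `ZONE_SIZE` and `NR_VMAP_NODES` are invented constants for the sketch, not the patch's actual values:

```c
/* Hypothetical parameters: in the patch the node count is derived
 * from the number of CPUs; both values below are made up. */
#define NR_VMAP_NODES  4UL
#define ZONE_SIZE      (1UL << 20)

/* Map an address to the node index expected to hold its VA. */
static unsigned long addr_to_node_id(unsigned long addr)
{
        return (addr / ZONE_SIZE) % NR_VMAP_NODES;
}

/* Round-robin lookup: start at the home node and wrap around once,
 * the same iteration shape as the patch's find_vmap_area() loop. */
static int find_node_of(unsigned long addr, const int *node_has_va)
{
        unsigned long i, j;

        i = j = addr_to_node_id(addr);
        do {
                if (node_has_va[i])
                        return (int)i;
        } while ((i = (i + 1) % NR_VMAP_NODES) != j);

        return -1;      /* no node holds a VA for this address */
}
```

The wrap-around covers the corner case where a VA spans a node boundary and the queried address does not land in the VA's home node.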

Please note:

1. As of now, allocated VAs are bound to node-0, so the
   patch does not change anything compared with the current
   behavior;

2. The global vmap_area_lock and vmap_area_root are removed as there
   is no need for them anymore. The vmap_area_list is still kept and
   is _empty_. It is exported for kexec only;

3. The vmallocinfo and vread() have to be reworked to be able to
   handle multiple nodes.

Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 173 insertions(+), 67 deletions(-)
  

Comments

Wen Gu Jan. 5, 2024, 8:10 a.m. UTC | #1
On 2024/01/03 02:46, Uladzislau Rezki wrote:

> Store allocated objects in separate nodes. A va->va_start
> address is converted into the correct node where the VA should
> be placed and reside. An addr_to_node() function is used
> to do the address conversion and determine the node that
> contains a VA.
> 
> Such an approach balances VAs across nodes; as a result, access
> becomes scalable. The number of nodes in a system depends on the
> number of CPUs.
> 
> Please note:
> 
> 1. As of now, allocated VAs are bound to node-0, so the
>    patch does not change anything compared with the current
>    behavior;
> 
> 2. The global vmap_area_lock and vmap_area_root are removed as there
>    is no need for them anymore. The vmap_area_list is still kept and
>    is _empty_. It is exported for kexec only;
> 
> 3. The vmallocinfo and vread() have to be reworked to be able to
>    handle multiple nodes.
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---

<...>

>  struct vmap_area *find_vmap_area(unsigned long addr)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
> +	int i, j;
>  
> -	spin_lock(&vmap_area_lock);
> -	va = __find_vmap_area(addr, &vmap_area_root);
> -	spin_unlock(&vmap_area_lock);
> +	/*
> +	 * An addr_to_node_id(addr) converts an address to a node index
> +	 * where a VA is located. If a VA spans several nodes and the
> +	 * passed addr is not the same as va->va_start, which is not
> +	 * common, we may need to scan extra nodes. See an example:
> +	 *
> +	 *      <--va-->
> +	 * -|-----|-----|-----|-----|-
> +	 *     1     2     0     1
> +	 *
> +	 * The VA resides in node 1 whereas it spans nodes 1 and 2. If the
> +	 * passed addr is within the second node we should do extra work.
> +	 * It is a rare corner case, but on the other hand it has to be
> +	 * covered.
> +	 */
> +	i = j = addr_to_node_id(addr);
> +	do {
> +		vn = &vmap_nodes[i];
>  
> -	return va;
> +		spin_lock(&vn->busy.lock);
> +		va = __find_vmap_area(addr, &vn->busy.root);
> +		spin_unlock(&vn->busy.lock);
> +
> +		if (va)
> +			return va;
> +	} while ((i = (i + 1) % nr_vmap_nodes) != j);
> +
> +	return NULL;
>  }
>  

Hi Uladzislau Rezki,

I really like your work, it is great and helpful!

Currently, I am working on using shared memory communication (SMC [1])
to transparently accelerate TCP communication between two peers within
the same OS instance[2].

In this scenario, a vzalloced kernel buffer acts as shared memory and
is simultaneously read and written by two SMC sockets, which form an
SMC connection.


           socket1                    socket2
              |                          ^
              |                          |    userspace
       ---- write -------------------- read ------
              |    +-----------------+   |    kernel
              +--->|  shared memory  |---+
                   | (vzalloced now) |
                   +-----------------+

I then encountered a performance regression caused by lock contention
in find_vmap_area() when multiple threads transfer data through multiple
SMC connections on machines with many CPUs[3].

According to perf, the performance bottleneck is caused by the global
vmap_area_lock contention[4]:

- writer:

   smc_tx_sendmsg
   -> memcpy_from_msg
      -> copy_from_iter
         -> check_copy_size
            -> check_object_size
               -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
                  -> if(vm) find_vmap_area
                     -> try to hold vmap_area_lock lock
- reader:

   smc_rx_recvmsg
    -> memcpy_to_msg
       -> copy_to_iter
          -> check_copy_size
             -> check_object_size
                -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
                   -> if(vm) find_vmap_area
                      -> try to hold vmap_area_lock lock
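The call chains above can be condensed into a small userspace model showing where the extra cost appears: the hardened-usercopy check only pays for a vmap-area lookup when the buffer is vmalloc'ed. All names here (`check_copy_ok`, `find_vmap_area_model`, the address threshold) are illustrative stand-ins, not the kernel's real signatures:

```c
#include <stdbool.h>

/* Stand-ins for kernel state; names are invented for the model. */
static bool hardened_usercopy = true;   /* CONFIG_HARDENED_USERCOPY */
static unsigned long va_lookups;        /* times the VA lookup ran  */

/* Pretend addresses at or above this value are vmalloc'ed. */
#define FAKE_VMALLOC_START 0xffff800000000000UL

static bool is_vmalloc_addr_model(unsigned long addr)
{
        return addr >= FAKE_VMALLOC_START;
}

/* Model of find_vmap_area(): the lock-taking lookup that shows
 * up in the flamegraphs under contention. */
static bool find_vmap_area_model(unsigned long addr)
{
        (void)addr;
        va_lookups++;
        return true;    /* assume the VA exists */
}

/* Model of the check_copy_size() -> check_heap_object() path. */
static bool check_copy_ok(unsigned long addr)
{
        if (!hardened_usercopy)
                return true;                    /* no hardening: no lookup */
        if (is_vmalloc_addr_model(addr))
                return find_vmap_area_model(addr);
        return true;                            /* kmalloc'ed: no VA lookup */
}
```

The model makes the trade-off visible: every copy to or from a vmalloc'ed buffer triggers a lookup, which is why a kzalloced buffer or unsetting CONFIG_HARDENED_USERCOPY sidesteps the contention entirely.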

Fortunately, thanks to this patch set, the global vmap_area_lock was
removed and the per-node lock vn->busy.lock was introduced. It is really helpful:

In a 48-CPU qemu environment, Requests/sec increased by 5 times:
- nginx
- wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80

                 vzalloced shmem      vzalloced shmem(with this patch set)
Requests/sec          113536.56            583729.93


But it still has some overhead compared to using kzalloced shared memory
or unsetting CONFIG_HARDENED_USERCOPY, neither of which involves finding a vmap area:

                 kzalloced shmem      vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
Requests/sec          831950.39            805164.78


So, as a newbie in Linux-mm, I would like to ask for some suggestions:

Is it possible to further eliminate the overhead caused by lock contention
in find_vmap_area() in this scenario (maybe this is asking too much), or is the
only way out not setting CONFIG_HARDENED_USERCOPY or not using a vzalloced
buffer in situations where concurrent kernel-userspace copies happen?

Any feedback will be appreciated. Thanks again for your time.


[1] Shared Memory Communications (SMC) enables two SMC capable peers to
     communicate by using memory buffers that each peer allocates for the
     partner's use. It improves throughput, lowers latency and cost, and
     maintains existing functions. See details in https://www.ibm.com/support/pages/node/7009315

[2] https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/

[3] issues: https://lore.kernel.org/all/1fbd6b74-1080-923a-01c1-689c3d65f880@huawei.com/
     analysis: https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/

[4] Some flamegraphs are attached,
     - SMC using vzalloced buffer: vzalloc_t96.svg
     - SMC using vzalloced buffer and with this patchset: vzalloc_t96_improve.svg
     - SMC using vzalloced buffer and unset CONFIG_HARDENED_USERCOPY: vzalloc_t96_nocheck.svg
     - SMC using kzalloced buffer: kzalloc_t96.svg


Best regards,
Wen Gu
  
Uladzislau Rezki Jan. 5, 2024, 10:50 a.m. UTC | #2
Hello, Wen Gu.

> 
> Hi Uladzislau Rezki,
> 
> I really like your work, it is great and helpful!
> 
> Currently, I am working on using shared memory communication (SMC [1])
> to transparently accelerate TCP communication between two peers within
> the same OS instance[2].
> 
> In this scenario, a vzalloced kernel buffer acts as shared memory and
> is simultaneously read and written by two SMC sockets, which form an
> SMC connection.
> 
> 
>           socket1                    socket2
>              |                          ^
>              |                          |    userspace
>       ---- write -------------------- read ------
>              |    +-----------------+   |    kernel
>              +--->|  shared memory  |---+
>                   | (vzalloced now) |
>                   +-----------------+
> 
> I then encountered a performance regression caused by lock contention
> in find_vmap_area() when multiple threads transfer data through multiple
> SMC connections on machines with many CPUs[3].
> 
> According to perf, the performance bottleneck is caused by the global
> vmap_area_lock contention[4]:
> 
> - writer:
> 
>   smc_tx_sendmsg
>   -> memcpy_from_msg
>      -> copy_from_iter
>         -> check_copy_size
>            -> check_object_size
>               -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
>                  -> if(vm) find_vmap_area
>                     -> try to hold vmap_area_lock lock
> - reader:
> 
>   smc_rx_recvmsg
>    -> memcpy_to_msg
>       -> copy_to_iter
>          -> check_copy_size
>             -> check_object_size
>                -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
>                   -> if(vm) find_vmap_area
>                      -> try to hold vmap_area_lock lock
> 
> Fortunately, thanks to this patch set, the global vmap_area_lock was
> removed and the per-node lock vn->busy.lock was introduced. It is really helpful:
> 
> In a 48-CPU qemu environment, Requests/sec increased by 5 times:
> - nginx
> - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80
> 
>                 vzalloced shmem      vzalloced shmem(with this patch set)
> Requests/sec          113536.56            583729.93
> 
> 
Thank you for confirming that your workload is improved. The "nginx"
result is 5 times better!

> But it still has some overhead compared to using kzalloced shared memory
> or unsetting CONFIG_HARDENED_USERCOPY, neither of which involves finding a vmap area:
> 
>                 kzalloced shmem      vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
> Requests/sec          831950.39            805164.78
> 
> 
CONFIG_HARDENED_USERCOPY prevents copying "wrong" memory regions. That is
why, if it is vmalloced memory, it wants to make sure that is really true;
if not, the user-copy is aborted.

So there is extra work that involves finding the VA associated with an address.

> So, as a newbie in Linux-mm, I would like to ask for some suggestions:
> 
> Is it possible to further eliminate the overhead caused by lock contention
> in find_vmap_area() in this scenario (maybe this is asking too much), or is the
> only way out not setting CONFIG_HARDENED_USERCOPY or not using a vzalloced
> buffer in situations where concurrent kernel-userspace copies happen?
> 
Could you please try below patch, if it improves this series further?
Just in case:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e30dabf68263..40acf53cadfb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -772,7 +772,7 @@ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
 struct rb_list {
 	struct rb_root root;
 	struct list_head head;
-	spinlock_t lock;
+	rwlock_t lock;
 };
 
 struct vmap_pool {
@@ -947,19 +947,19 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
 	for (i = 0; i < nr_vmap_nodes; i++) {
 		vn = &vmap_nodes[i];
 
-		spin_lock(&vn->busy.lock);
+		read_lock(&vn->busy.lock);
 		va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
 		if (va_lowest) {
 			if (!va_node || va_lowest->va_start < (*va)->va_start) {
 				if (va_node)
-					spin_unlock(&va_node->busy.lock);
+					read_unlock(&va_node->busy.lock);
 
 				*va = va_lowest;
 				va_node = vn;
 				continue;
 			}
 		}
-		spin_unlock(&vn->busy.lock);
+		read_unlock(&vn->busy.lock);
 	}
 
 	return va_node;
@@ -1695,9 +1695,9 @@ static void free_vmap_area(struct vmap_area *va)
 	/*
 	 * Remove from the busy tree/list.
 	 */
-	spin_lock(&vn->busy.lock);
+	write_lock(&vn->busy.lock);
 	unlink_va(va, &vn->busy.root);
-	spin_unlock(&vn->busy.lock);
+	write_unlock(&vn->busy.lock);
 
 	/*
 	 * Insert/Merge it back to the free tree/list.
@@ -1901,9 +1901,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 
 	vn = addr_to_node(va->va_start);
 
-	spin_lock(&vn->busy.lock);
+	write_lock(&vn->busy.lock);
 	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
-	spin_unlock(&vn->busy.lock);
+	write_unlock(&vn->busy.lock);
 
 	BUG_ON(!IS_ALIGNED(va->va_start, align));
 	BUG_ON(va->va_start < vstart);
@@ -2123,10 +2123,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 		if (RB_EMPTY_ROOT(&vn->lazy.root))
 			continue;
 
-		spin_lock(&vn->lazy.lock);
+		write_lock(&vn->lazy.lock);
 		WRITE_ONCE(vn->lazy.root.rb_node, NULL);
 		list_replace_init(&vn->lazy.head, &vn->purge_list);
-		spin_unlock(&vn->lazy.lock);
+		write_unlock(&vn->lazy.lock);
 
 		start = min(start, list_first_entry(&vn->purge_list,
 			struct vmap_area, list)->va_start);
@@ -2223,9 +2223,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	vn = is_vn_id_valid(vn_id) ?
 		id_to_node(vn_id):addr_to_node(va->va_start);
 
-	spin_lock(&vn->lazy.lock);
+	write_lock(&vn->lazy.lock);
 	insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
-	spin_unlock(&vn->lazy.lock);
+	write_unlock(&vn->lazy.lock);
 
 	trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
 
@@ -2272,9 +2272,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 	do {
 		vn = &vmap_nodes[i];
 
-		spin_lock(&vn->busy.lock);
+		read_lock(&vn->busy.lock);
 		va = __find_vmap_area(addr, &vn->busy.root);
-		spin_unlock(&vn->busy.lock);
+		read_unlock(&vn->busy.lock);
 
 		if (va)
 			return va;
@@ -2293,11 +2293,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 	do {
 		vn = &vmap_nodes[i];
 
-		spin_lock(&vn->busy.lock);
+		write_lock(&vn->busy.lock);
 		va = __find_vmap_area(addr, &vn->busy.root);
 		if (va)
 			unlink_va(va, &vn->busy.root);
-		spin_unlock(&vn->busy.lock);
+		write_unlock(&vn->busy.lock);
 
 		if (va)
 			return va;
@@ -2514,9 +2514,9 @@ static void free_vmap_block(struct vmap_block *vb)
 	BUG_ON(tmp != vb);
 
 	vn = addr_to_node(vb->va->va_start);
-	spin_lock(&vn->busy.lock);
+	write_lock(&vn->busy.lock);
 	unlink_va(vb->va, &vn->busy.root);
-	spin_unlock(&vn->busy.lock);
+	write_unlock(&vn->busy.lock);
 
 	free_vmap_area_noflush(vb->va);
 	kfree_rcu(vb, rcu_head);
@@ -2942,9 +2942,9 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
 {
 	struct vmap_node *vn = addr_to_node(va->va_start);
 
-	spin_lock(&vn->busy.lock);
+	read_lock(&vn->busy.lock);
 	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vn->busy.lock);
+	read_unlock(&vn->busy.lock);
 }
 
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -4214,19 +4214,19 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
 	next_va:
 		next = va->va_end;
-		spin_unlock(&vn->busy.lock);
+		read_unlock(&vn->busy.lock);
 	} while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));
 
 finished_zero:
 	if (vn)
-		spin_unlock(&vn->busy.lock);
+		read_unlock(&vn->busy.lock);
 
 	/* zero-fill memory holes */
 	return count - remains + zero_iter(iter, remains);
 finished:
 	/* Nothing remains, or We couldn't copy/zero everything. */
 	if (vn)
-		spin_unlock(&vn->busy.lock);
+		read_unlock(&vn->busy.lock);
 
 	return count - remains;
 }
@@ -4563,11 +4563,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	for (area = 0; area < nr_vms; area++) {
 		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
 
-		spin_lock(&vn->busy.lock);
+		write_lock(&vn->busy.lock);
 		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
 		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
-		spin_unlock(&vn->busy.lock);
+		write_unlock(&vn->busy.lock);
 	}
 
 	/*
@@ -4687,7 +4687,7 @@ bool vmalloc_dump_obj(void *object)
 
 	vn = addr_to_node((unsigned long)objp);
 
-	if (spin_trylock(&vn->busy.lock)) {
+	if (read_trylock(&vn->busy.lock)) {
 		va = __find_vmap_area(addr, &vn->busy.root);
 
 		if (va && va->vm) {
@@ -4697,7 +4697,7 @@ bool vmalloc_dump_obj(void *object)
 			success = true;
 		}
 
-		spin_unlock(&vn->busy.lock);
+		read_unlock(&vn->busy.lock);
 	}
 
 	if (success)
@@ -4742,13 +4742,13 @@ static void show_purge_info(struct seq_file *m)
 	for (i = 0; i < nr_vmap_nodes; i++) {
 		vn = &vmap_nodes[i];
 
-		spin_lock(&vn->lazy.lock);
+		read_lock(&vn->lazy.lock);
 		list_for_each_entry(va, &vn->lazy.head, list) {
 			seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
 				(void *)va->va_start, (void *)va->va_end,
 				va->va_end - va->va_start);
 		}
-		spin_unlock(&vn->lazy.lock);
+		read_unlock(&vn->lazy.lock);
 	}
 }
 
@@ -4762,7 +4762,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
 	for (i = 0; i < nr_vmap_nodes; i++) {
 		vn = &vmap_nodes[i];
 
-		spin_lock(&vn->busy.lock);
+		read_lock(&vn->busy.lock);
 		list_for_each_entry(va, &vn->busy.head, list) {
 			if (!va->vm) {
 				if (va->flags & VMAP_RAM)
@@ -4808,7 +4808,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
 			show_numa_info(m, v);
 			seq_putc(m, '\n');
 		}
-		spin_unlock(&vn->busy.lock);
+		read_unlock(&vn->busy.lock);
 	}
 
 	/*
@@ -4902,11 +4902,11 @@ static void vmap_init_nodes(void)
 		vn = &vmap_nodes[n];
 		vn->busy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->busy.head);
-		spin_lock_init(&vn->busy.lock);
+		rwlock_init(&vn->busy.lock);
 
 		vn->lazy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->lazy.head);
-		spin_lock_init(&vn->lazy.lock);
+		rwlock_init(&vn->lazy.lock);
 
 		for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
 			INIT_LIST_HEAD(&vn->pool[i].head);
<snip>

Thank you!

--
Uladzislau Rezki
  
Wen Gu Jan. 6, 2024, 9:17 a.m. UTC | #3
On 2024/1/5 18:50, Uladzislau Rezki wrote:

> Hello, Wen Gu.
> 
>>
>> Hi Uladzislau Rezki,
>>

<...>

>> Fortunately, thanks to this patch set, the global vmap_area_lock was
>> removed and the per-node lock vn->busy.lock was introduced. It is really helpful:
>>
>> In a 48-CPU qemu environment, Requests/sec increased by 5 times:
>> - nginx
>> - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80
>>
>>                  vzalloced shmem      vzalloced shmem(with this patch set)
>> Requests/sec          113536.56            583729.93
>>
>>
> Thank you for the confirmation that your workload is improved. The "nginx"
> is 5 times better!
> 

Yes, thank you very much for the improvement!

>> But it also has some overhead, compared to using kzalloced shared memory
>> or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:
>>
>>                  kzalloced shmem      vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
>> Requests/sec          831950.39            805164.78
>>
>>
> CONFIG_HARDENED_USERCOPY prevents copying "wrong" memory regions. That is
> why, if it is vmalloced memory, it wants to make sure that is really true;
> if not, the user-copy is aborted.
> 
> So there is extra work that involves finding the VA associated with an address.
> 

Yes, and lock contention when finding a VA is likely to be a performance
bottleneck, which your work mitigates a lot.

>> So, as a newbie in Linux-mm, I would like to ask for some suggestions:
>>
>> Is it possible to further eliminate the overhead caused by lock contention
>> in find_vmap_area() in this scenario (maybe this is asking too much), or the
>> only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
>> buffer in the situation where concurrent kernel-userspace-copy happens?
>>
> Could you please try below patch, if it improves this series further?
> Just in case:
> 

Thank you! I tried the patch, and it seems that the wait on the rwlock_t
also exists, as much as with the spinlock_t. (The flamegraph is attached.
I am not sure why the read_lock waits so long, given that there is no frequent
write_lock competition.)

                vzalloced shmem(spinlock_t)   vzalloced shmem(rwlock_t)
Requests/sec         583729.93                     460007.44

So I guess the overhead of finding the vmap area is inevitable here and the
original spin_lock is fine in this series.

Thanks again for your help!

Best regards,
Wen Gu

> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e30dabf68263..40acf53cadfb 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -772,7 +772,7 @@ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
>   struct rb_list {
>   	struct rb_root root;
>   	struct list_head head;
> -	spinlock_t lock;
> +	rwlock_t lock;
>   };
>   
>   struct vmap_pool {
> @@ -947,19 +947,19 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
>   	for (i = 0; i < nr_vmap_nodes; i++) {
>   		vn = &vmap_nodes[i];
>   
> -		spin_lock(&vn->busy.lock);
> +		read_lock(&vn->busy.lock);
>   		va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
>   		if (va_lowest) {
>   			if (!va_node || va_lowest->va_start < (*va)->va_start) {
>   				if (va_node)
> -					spin_unlock(&va_node->busy.lock);
> +					read_unlock(&va_node->busy.lock);
>   
>   				*va = va_lowest;
>   				va_node = vn;
>   				continue;
>   			}
>   		}
> -		spin_unlock(&vn->busy.lock);
> +		read_unlock(&vn->busy.lock);
>   	}
>   
>   	return va_node;
> @@ -1695,9 +1695,9 @@ static void free_vmap_area(struct vmap_area *va)
>   	/*
>   	 * Remove from the busy tree/list.
>   	 */
> -	spin_lock(&vn->busy.lock);
> +	write_lock(&vn->busy.lock);
>   	unlink_va(va, &vn->busy.root);
> -	spin_unlock(&vn->busy.lock);
> +	write_unlock(&vn->busy.lock);
>   
>   	/*
>   	 * Insert/Merge it back to the free tree/list.
> @@ -1901,9 +1901,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>   
>   	vn = addr_to_node(va->va_start);
>   
> -	spin_lock(&vn->busy.lock);
> +	write_lock(&vn->busy.lock);
>   	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
> -	spin_unlock(&vn->busy.lock);
> +	write_unlock(&vn->busy.lock);
>   
>   	BUG_ON(!IS_ALIGNED(va->va_start, align));
>   	BUG_ON(va->va_start < vstart);
> @@ -2123,10 +2123,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
>   		if (RB_EMPTY_ROOT(&vn->lazy.root))
>   			continue;
>   
> -		spin_lock(&vn->lazy.lock);
> +		write_lock(&vn->lazy.lock);
>   		WRITE_ONCE(vn->lazy.root.rb_node, NULL);
>   		list_replace_init(&vn->lazy.head, &vn->purge_list);
> -		spin_unlock(&vn->lazy.lock);
> +		write_unlock(&vn->lazy.lock);
>   
>   		start = min(start, list_first_entry(&vn->purge_list,
>   			struct vmap_area, list)->va_start);
> @@ -2223,9 +2223,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
>   	vn = is_vn_id_valid(vn_id) ?
>   		id_to_node(vn_id):addr_to_node(va->va_start);
>   
> -	spin_lock(&vn->lazy.lock);
> +	write_lock(&vn->lazy.lock);
>   	insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
> -	spin_unlock(&vn->lazy.lock);
> +	write_unlock(&vn->lazy.lock);
>   
>   	trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
>   
> @@ -2272,9 +2272,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
>   	do {
>   		vn = &vmap_nodes[i];
>   
> -		spin_lock(&vn->busy.lock);
> +		read_lock(&vn->busy.lock);
>   		va = __find_vmap_area(addr, &vn->busy.root);
> -		spin_unlock(&vn->busy.lock);
> +		read_unlock(&vn->busy.lock);
>   
>   		if (va)
>   			return va;
> @@ -2293,11 +2293,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
>   	do {
>   		vn = &vmap_nodes[i];
>   
> -		spin_lock(&vn->busy.lock);
> +		write_lock(&vn->busy.lock);
>   		va = __find_vmap_area(addr, &vn->busy.root);
>   		if (va)
>   			unlink_va(va, &vn->busy.root);
> -		spin_unlock(&vn->busy.lock);
> +		write_unlock(&vn->busy.lock);
>   
>   		if (va)
>   			return va;
> @@ -2514,9 +2514,9 @@ static void free_vmap_block(struct vmap_block *vb)
>   	BUG_ON(tmp != vb);
>   
>   	vn = addr_to_node(vb->va->va_start);
> -	spin_lock(&vn->busy.lock);
> +	write_lock(&vn->busy.lock);
>   	unlink_va(vb->va, &vn->busy.root);
> -	spin_unlock(&vn->busy.lock);
> +	write_unlock(&vn->busy.lock);
>   
>   	free_vmap_area_noflush(vb->va);
>   	kfree_rcu(vb, rcu_head);
> @@ -2942,9 +2942,9 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
>   {
>   	struct vmap_node *vn = addr_to_node(va->va_start);
>   
> -	spin_lock(&vn->busy.lock);
> +	read_lock(&vn->busy.lock);
>   	setup_vmalloc_vm_locked(vm, va, flags, caller);
> -	spin_unlock(&vn->busy.lock);
> +	read_unlock(&vn->busy.lock);
>   }
>   
>   static void clear_vm_uninitialized_flag(struct vm_struct *vm)
> @@ -4214,19 +4214,19 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>   
>   	next_va:
>   		next = va->va_end;
> -		spin_unlock(&vn->busy.lock);
> +		read_unlock(&vn->busy.lock);
>   	} while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));
>   
>   finished_zero:
>   	if (vn)
> -		spin_unlock(&vn->busy.lock);
> +		read_unlock(&vn->busy.lock);
>   
>   	/* zero-fill memory holes */
>   	return count - remains + zero_iter(iter, remains);
>   finished:
>   	/* Nothing remains, or We couldn't copy/zero everything. */
>   	if (vn)
> -		spin_unlock(&vn->busy.lock);
> +		read_unlock(&vn->busy.lock);
>   
>   	return count - remains;
>   }
> @@ -4563,11 +4563,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>   	for (area = 0; area < nr_vms; area++) {
>   		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
>   
> -		spin_lock(&vn->busy.lock);
> +		write_lock(&vn->busy.lock);
>   		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
>   		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
>   				 pcpu_get_vm_areas);
> -		spin_unlock(&vn->busy.lock);
> +		write_unlock(&vn->busy.lock);
>   	}
>   
>   	/*
> @@ -4687,7 +4687,7 @@ bool vmalloc_dump_obj(void *object)
>   
>   	vn = addr_to_node((unsigned long)objp);
>   
> -	if (spin_trylock(&vn->busy.lock)) {
> +	if (read_trylock(&vn->busy.lock)) {
>   		va = __find_vmap_area(addr, &vn->busy.root);
>   
>   		if (va && va->vm) {
> @@ -4697,7 +4697,7 @@ bool vmalloc_dump_obj(void *object)
>   			success = true;
>   		}
>   
> -		spin_unlock(&vn->busy.lock);
> +		read_unlock(&vn->busy.lock);
>   	}
>   
>   	if (success)
> @@ -4742,13 +4742,13 @@ static void show_purge_info(struct seq_file *m)
>   	for (i = 0; i < nr_vmap_nodes; i++) {
>   		vn = &vmap_nodes[i];
>   
> -		spin_lock(&vn->lazy.lock);
> +		read_lock(&vn->lazy.lock);
>   		list_for_each_entry(va, &vn->lazy.head, list) {
>   			seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
>   				(void *)va->va_start, (void *)va->va_end,
>   				va->va_end - va->va_start);
>   		}
> -		spin_unlock(&vn->lazy.lock);
> +		read_unlock(&vn->lazy.lock);
>   	}
>   }
>   
> @@ -4762,7 +4762,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
>   	for (i = 0; i < nr_vmap_nodes; i++) {
>   		vn = &vmap_nodes[i];
>   
> -		spin_lock(&vn->busy.lock);
> +		read_lock(&vn->busy.lock);
>   		list_for_each_entry(va, &vn->busy.head, list) {
>   			if (!va->vm) {
>   				if (va->flags & VMAP_RAM)
> @@ -4808,7 +4808,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
>   			show_numa_info(m, v);
>   			seq_putc(m, '\n');
>   		}
> -		spin_unlock(&vn->busy.lock);
> +		read_unlock(&vn->busy.lock);
>   	}
>   
>   	/*
> @@ -4902,11 +4902,11 @@ static void vmap_init_nodes(void)
>   		vn = &vmap_nodes[n];
>   		vn->busy.root = RB_ROOT;
>   		INIT_LIST_HEAD(&vn->busy.head);
> -		spin_lock_init(&vn->busy.lock);
> +		rwlock_init(&vn->busy.lock);
>   
>   		vn->lazy.root = RB_ROOT;
>   		INIT_LIST_HEAD(&vn->lazy.head);
> -		spin_lock_init(&vn->lazy.lock);
> +		rwlock_init(&vn->lazy.lock);
>   
>   		for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
>   			INIT_LIST_HEAD(&vn->pool[i].head);
> <snip>
> 
> Thank you!
> 
> --
> Uladzislau Rezki
  
Uladzislau Rezki Jan. 6, 2024, 4:36 p.m. UTC | #4
> 
> On 2024/1/5 18:50, Uladzislau Rezki wrote:
> 
> > Hello, Wen Gu.
> > 
> > > 
> > > Hi Uladzislau Rezki,
> > > 
> 
> <...>
> 
> > > Fortunately, thanks to this patch set, the global vmap_area_lock was
> > > removed and the per-node lock vn->busy.lock was introduced. It is really helpful:
> > > 
> > > In a 48-CPU qemu environment, Requests/sec increased by 5 times:
> > > - nginx
> > > - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80
> > > 
> > >                  vzalloced shmem      vzalloced shmem(with this patch set)
> > > Requests/sec          113536.56            583729.93
> > > 
> > > 
> > Thank you for the confirmation that your workload is improved. The "nginx"
> > is 5 times better!
> > 
> 
> Yes, thank you very much for the improvement!
> 
> > > But it also has some overhead, compared to using kzalloced shared memory
> > > or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:
> > > 
> > >                  kzalloced shmem      vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
> > > Requests/sec          831950.39            805164.78
> > > 
> > > 
> > CONFIG_HARDENED_USERCOPY prevents copying "wrong" memory regions. That is
> > why if it is a vmalloced memory it wants to make sure it is really true,
> > if not user-copy is aborted.
> > 
> > So there is an extra work that involves finding a VA associated with an address.
> > 
> 
> Yes, and lock contention in finding VA is likely to be a performance bottleneck,
> which is mitigated a lot by your work.
> 
> > > So, as a newbie in Linux-mm, I would like to ask for some suggestions:
> > > 
> > > Is it possible to further eliminate the overhead caused by lock contention
> > > in find_vmap_area() in this scenario (maybe this is asking too much), or the
> > > only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
> > > buffer in the situation where concurrent kernel-userspace-copy happens?
> > > 
> > Could you please try below patch, if it improves this series further?
> > Just in case:
> > 
> 
> Thank you! I tried the patch, and it seems that the wait on the rwlock_t
> also exists, as much as with the spinlock_t. (The flamegraph is attached.
> I am not sure why the read_lock waits so long, given that there is no frequent
> write_lock competition.)
> 
>                vzalloced shmem(spinlock_t)   vzalloced shmem(rwlock_t)
> Requests/sec         583729.93                     460007.44
> 
> So I guess the overhead of finding the vmap area is inevitable here and the
> original spin_lock is fine in this series.
> 
I have also noticed a performance difference between rwlock and spinlock.
So, yes. This is the extra work we need to do if CONFIG_HARDENED_USERCOPY is
set, i.e. find a VA.

--
Uladzislau Rezki
  
Hillf Danton Jan. 7, 2024, 6:59 a.m. UTC | #5
On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <urezki@gmail.com>
> > 
> > Thank you! I tried the patch, and it seems that the wait for rwlock_t
> > also exists, as much as using spinlock_t. (The flamegraph is attached.
> > Not sure why the read_lock waits so long, given that there is no frequent
> > write_lock competition)
> > 
> >                vzalloced shmem(spinlock_t)   vzalloced shmem(rwlock_t)
> > Requests/sec         583729.93                     460007.44
> > 
> > So I guess the overhead in finding vmap area is inevitable here and the
> > original spin_lock is fine in this series.
> > 
> I have also noticed a performance difference between rwlock and spinlock.
> So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is
> set, i.e. find a VA.

See if read bias helps to understand the gap between spinlock and rwlock.

--- x/kernel/locking/qrwlock.c
+++ y/kernel/locking/qrwlock.c
@@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat
 	/*
 	 * Readers come here when they cannot get the lock without waiting
 	 */
-	if (unlikely(in_interrupt())) {
+	if (1) {
 		/*
 		 * Readers in interrupt context will get the lock immediately
 		 * if the writer is just waiting (not holding the lock yet),
  
Wen Gu Jan. 8, 2024, 7:45 a.m. UTC | #6
On 2024/1/7 14:59, Hillf Danton wrote:
> On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <urezki@gmail.com>
>>>
>>> Thank you! I tried the patch, and it seems that the wait for rwlock_t
>>> also exists, as much as using spinlock_t. (The flamegraph is attached.
>>> Not sure why the read_lock waits so long, given that there is no frequent
>>> write_lock competition)
>>>
>>>                 vzalloced shmem(spinlock_t)   vzalloced shmem(rwlock_t)
>>> Requests/sec         583729.93                     460007.44
>>>
>>> So I guess the overhead in finding vmap area is inevitable here and the
>>> original spin_lock is fine in this series.
>>>
>> I have also noticed a erformance difference between rwlock and spinlock.
>> So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is
>> set, i.e. find a VA.
> 
> See if read bias helps to understand the gap between spinlock and rwlock.
> 
> --- x/kernel/locking/qrwlock.c
> +++ y/kernel/locking/qrwlock.c
> @@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat
>   	/*
>   	 * Readers come here when they cannot get the lock without waiting
>   	 */
> -	if (unlikely(in_interrupt())) {
> +	if (1) {
>   		/*
>   		 * Readers in interrupt context will get the lock immediately
>   		 * if the writer is just waiting (not holding the lock yet),

Thank you for the idea! Hillf.

IIUC, the change makes read operations more likely to acquire the lock
and modifies fairness to favor reading.

The test in my scenario shows:

vzalloced shmem with     spinlock_t      rwlock_t      rwlock_t(with above change)
Requests/sec              564961.29     442534.33     439733.31

In addition to read bias, there seem to be other factors that cause the
gap, but I haven't figured them out yet...

Thanks,
Wen Gu
  
Uladzislau Rezki Jan. 8, 2024, 6:37 p.m. UTC | #7
On Mon, Jan 08, 2024 at 03:45:18PM +0800, Wen Gu wrote:
> 
> 
> On 2024/1/7 14:59, Hillf Danton wrote:
> > On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <urezki@gmail.com>
> > > > 
> > > > Thank you! I tried the patch, and it seems that the wait for rwlock_t
> > > > also exists, as much as using spinlock_t. (The flamegraph is attached.
> > > > Not sure why the read_lock waits so long, given that there is no frequent
> > > > write_lock competition)
> > > > 
> > > >                 vzalloced shmem(spinlock_t)   vzalloced shmem(rwlock_t)
> > > > Requests/sec         583729.93                     460007.44
> > > > 
> > > > So I guess the overhead in finding vmap area is inevitable here and the
> > > > original spin_lock is fine in this series.
> > > > 
> > > I have also noticed a erformance difference between rwlock and spinlock.
> > > So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is
> > > set, i.e. find a VA.
> > 
> > See if read bias helps to understand the gap between spinlock and rwlock.
> > 
> > --- x/kernel/locking/qrwlock.c
> > +++ y/kernel/locking/qrwlock.c
> > @@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat
> >   	/*
> >   	 * Readers come here when they cannot get the lock without waiting
> >   	 */
> > -	if (unlikely(in_interrupt())) {
> > +	if (1) {
> >   		/*
> >   		 * Readers in interrupt context will get the lock immediately
> >   		 * if the writer is just waiting (not holding the lock yet),
> 
> Thank you for the idea! Hillf.
> 
> IIUC, the change makes read operations more likely to acquire lock and
> modified fairness to favor reading.
> 
> The test in my scenario shows:
> 
> vzalloced shmem with     spinlock_t      rwlock_t      rwlock_t(with above change)
> Requests/sec              564961.29     442534.33     439733.31
> 
> In addition to read bias, there seems to be other factors that cause the
> gap, but I haven't figured it out yet..
> 
<snip>
urezki@pc638:~$ cat locktorture.sh 
#!/bin/sh

# available lock types: spin_lock, rw_lock
torture_type=$1
test_time=$2

echo "Start..."

modprobe locktorture $torture_type nreaders_stress=0 > /dev/null 2>&1
sleep $test_time
rmmod locktorture > /dev/null 2>&1

echo "Done."
urezki@pc638:~$
<snip>

Out:

# sudo ./locktorture.sh rw_lock 30
[12107.327566] Writes:  Total: 53304415  Max/Min: 1620715/3225 ???  Fail: 0 
[12107.327898] spin_lock-torture: lock_torture_stats is stopping
[12107.328192] Writes:  Total: 53304415  Max/Min: 1620715/3225 ???  Fail: 0 
[12107.328368] spin_lock-torture:--- End of test: SUCCESS: acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0 long_hold=100 nested_locks=0 nreaders_stress=0 nwriters_stress=128 onoff_holdoff=0 onoff_interval=0 rt_boost=2 rt_boost_factor=50 shuffle_interval=3 shutdown_secs=0 stat_interval=60 stutter=5 verbose=1 writer_fifo=0

# sudo ./locktorture.sh spin_lock 30
[12051.968992] Writes:  Total: 47843400  Max/Min: 1335320/5942 ???  Fail: 0 
[12051.969236] spin_lock-torture: lock_torture_stats is stopping
[12051.969507] Writes:  Total: 47843400  Max/Min: 1335320/5942 ???  Fail: 0 
[12051.969813] spin_lock-torture:--- End of test: SUCCESS: acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0 long_hold=100 nested_locks=0 nreaders_stress=0 nwriters_stress=128 onoff_holdoff=0 onoff_interval=0 rt_boost=2 rt_boost_factor=50 shuffle_interval=3 shutdown_secs=0 stat_interval=60 stutter=5 verbose=1 writer_fifo=0

I do not see a big difference between spin_lock and rw_lock. In fact,
the locktorture.ko test shows that spin_lock is slightly worse, but that
might be something I have missed, or the test may not be accurate enough.

As for the vmap test-suite and the difference between rw_lock and
spin_lock, I have not spent much time figuring out the cause. At first
glance, it could be that the cache-miss rate is higher when switching to
rw_lock, or that rw_lock requires more atomics.

--
Uladzislau Rezki
  
Lorenzo Stoakes Jan. 16, 2024, 11:25 p.m. UTC | #8
On Tue, Jan 02, 2024 at 07:46:26PM +0100, Uladzislau Rezki (Sony) wrote:
> Store allocated objects in a separate nodes. A va->va_start
> address is converted into a correct node where it should
> be placed and resided. An addr_to_node() function is used
> to do a proper address conversion to determine a node that
> contains a VA.
>
> Such approach balances VAs across nodes as a result an access
> becomes scalable. Number of nodes in a system depends on number
> of CPUs.
>
> Please note:
>
> 1. As of now allocated VAs are bound to a node-0. It means the
>    patch does not give any difference comparing with a current
>    behavior;
>
> 2. The global vmap_area_lock, vmap_area_root are removed as there
>    is no need in it anymore. The vmap_area_list is still kept and
>    is _empty_. It is exported for a kexec only;
>
> 3. The vmallocinfo and vread() have to be reworked to be able to
>    handle multiple nodes.
>
> Reviewed-by: Baoquan He <bhe@redhat.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 173 insertions(+), 67 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 06bd843d18ae..786ecb18ae22 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
>  #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
>
>
> -static DEFINE_SPINLOCK(vmap_area_lock);
>  static DEFINE_SPINLOCK(free_vmap_area_lock);
>  /* Export for kexec only */
>  LIST_HEAD(vmap_area_list);
> -static struct rb_root vmap_area_root = RB_ROOT;
>  static bool vmap_initialized __read_mostly;
>
>  static struct rb_root purge_vmap_area_root = RB_ROOT;
> @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
>   */
>  static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
>
> +/*
> + * An effective vmap-node logic. Users make use of nodes instead
> + * of a global heap. It allows to balance an access and mitigate
> + * contention.
> + */
> +struct rb_list {

I'm not sure this name is very instructive - this contains a red/black tree
root node, a list head and a lock, but the meaning of it is that it stores
a red/black tree and a list of vmap_area objects and has a lock to protect
access.

A raw 'list_head' without a comment is always hard to parse, maybe some
comments/embed in vmap_node?

At the very least if you wanted to keep the name generic it should be
something like ordered_rb_tree or something like this.

> +	struct rb_root root;
> +	struct list_head head;
> +	spinlock_t lock;
> +};
> +
> +static struct vmap_node {
> +	/* Bookkeeping data of this node. */
> +	struct rb_list busy;
> +} single;

This may be a thing about encapsulation/naming or similar, but I'm a little
confused as to why the rb_list type is maintained as a field rather than
its fields embedded?

> +
> +static struct vmap_node *vmap_nodes = &single;
> +static __read_mostly unsigned int nr_vmap_nodes = 1;
> +static __read_mostly unsigned int vmap_zone_size = 1;

It might be worth adding a comment here explaining that we're binding to a
single node for now to maintain existing behaviour (and a brief description
of what these values mean - for instance what unit vmap_zone_size is
expressed in?)

> +
> +static inline unsigned int
> +addr_to_node_id(unsigned long addr)
> +{
> +	return (addr / vmap_zone_size) % nr_vmap_nodes;
> +}
> +
> +static inline struct vmap_node *
> +addr_to_node(unsigned long addr)
> +{
> +	return &vmap_nodes[addr_to_node_id(addr)];
> +}
> +
>  static __always_inline unsigned long
>  va_size(struct vmap_area *va)
>  {
> @@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void)
>  }
>
>  /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> +static struct vmap_area *
> +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
>  {
>  	struct vmap_area *va = NULL;
> -	struct rb_node *n = vmap_area_root.rb_node;
> +	struct rb_node *n = root->rb_node;
>
>  	addr = (unsigned long)kasan_reset_tag((void *)addr);
>
> @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
>   */
>  static void free_vmap_area(struct vmap_area *va)
>  {
> +	struct vmap_node *vn = addr_to_node(va->va_start);
> +

I'm being nitty here, and while I know it's a vmalloc convention to use
'va' and 'vm', perhaps we can break away from the super short variable name
convention and use 'vnode' or something for these values?

I feel people might get confused between 'vm' and 'vn' for instance.

>  	/*
>  	 * Remove from the busy tree/list.
>  	 */
> -	spin_lock(&vmap_area_lock);
> -	unlink_va(va, &vmap_area_root);
> -	spin_unlock(&vmap_area_lock);
> +	spin_lock(&vn->busy.lock);
> +	unlink_va(va, &vn->busy.root);
> +	spin_unlock(&vn->busy.lock);
>
>  	/*
>  	 * Insert/Merge it back to the free tree/list.
> @@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  				int node, gfp_t gfp_mask,
>  				unsigned long va_flags)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
>  	unsigned long freed;
>  	unsigned long addr;
> @@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	va->vm = NULL;
>  	va->flags = va_flags;
>
> -	spin_lock(&vmap_area_lock);
> -	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> -	spin_unlock(&vmap_area_lock);
> +	vn = addr_to_node(va->va_start);
> +
> +	spin_lock(&vn->busy.lock);
> +	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
> +	spin_unlock(&vn->busy.lock);
>
>  	BUG_ON(!IS_ALIGNED(va->va_start, align));
>  	BUG_ON(va->va_start < vstart);
> @@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va)
>
>  struct vmap_area *find_vmap_area(unsigned long addr)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
> +	int i, j;
>
> -	spin_lock(&vmap_area_lock);
> -	va = __find_vmap_area(addr, &vmap_area_root);
> -	spin_unlock(&vmap_area_lock);
> +	/*
> +	 * An addr_to_node_id(addr) converts an address to a node index
> +	 * where a VA is located. If VA spans several zones and passed
> +	 * addr is not the same as va->va_start, what is not common, we
> +	 * may need to scan an extra nodes. See an example:

For my understanding, when you say 'scan an extra nodes' do you mean scan
just 1 extra node, or multiple? If the former, I'd replace this with 'may
need to scan an extra node'; if the latter, then 'may need to scan extra
nodes'.

It's a nitty language thing, but also potentially changes the meaning of
this!

> +	 *
> +	 *      <--va-->
> +	 * -|-----|-----|-----|-----|-
> +	 *     1     2     0     1
> +	 *
> +	 * VA resides in node 1 whereas it spans 1 and 2. If passed
> +	 * addr is within a second node we should do extra work. We
> +	 * should mention that it is rare and is a corner case from
> +	 * the other hand it has to be covered.

A very minor language style nit, but you've already said this is not
common, I don't think you need this 'We should mention...' bit. It's not a
big deal however!

> +	 */
> +	i = j = addr_to_node_id(addr);
> +	do {
> +		vn = &vmap_nodes[i];
>
> -	return va;
> +		spin_lock(&vn->busy.lock);
> +		va = __find_vmap_area(addr, &vn->busy.root);
> +		spin_unlock(&vn->busy.lock);
> +
> +		if (va)
> +			return va;
> +	} while ((i = (i + 1) % nr_vmap_nodes) != j);

If your comment above suggests that only 1 extra node might need to be
scanned, should we stop after one iteration?

> +
> +	return NULL;
>  }
>
>  static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
> +	int i, j;
>
> -	spin_lock(&vmap_area_lock);
> -	va = __find_vmap_area(addr, &vmap_area_root);
> -	if (va)
> -		unlink_va(va, &vmap_area_root);
> -	spin_unlock(&vmap_area_lock);
> +	i = j = addr_to_node_id(addr);
> +	do {
> +		vn = &vmap_nodes[i];
>
> -	return va;
> +		spin_lock(&vn->busy.lock);
> +		va = __find_vmap_area(addr, &vn->busy.root);
> +		if (va)
> +			unlink_va(va, &vn->busy.root);
> +		spin_unlock(&vn->busy.lock);
> +
> +		if (va)
> +			return va;
> +	} while ((i = (i + 1) % nr_vmap_nodes) != j);

Maybe worth adding a comment saying to refer to the comment in
find_vmap_area() to see why this loop is necessary.

> +
> +	return NULL;
>  }
>
>  /*** Per cpu kva allocator ***/
> @@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
>
>  static void free_vmap_block(struct vmap_block *vb)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_block *tmp;
>  	struct xarray *xa;
>
> @@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb)
>  	tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start));
>  	BUG_ON(tmp != vb);
>
> -	spin_lock(&vmap_area_lock);
> -	unlink_va(vb->va, &vmap_area_root);
> -	spin_unlock(&vmap_area_lock);
> +	vn = addr_to_node(vb->va->va_start);
> +	spin_lock(&vn->busy.lock);
> +	unlink_va(vb->va, &vn->busy.root);
> +	spin_unlock(&vn->busy.lock);
>
>  	free_vmap_area_noflush(vb->va);
>  	kfree_rcu(vb, rcu_head);
> @@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
>  static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
>  			      unsigned long flags, const void *caller)
>  {
> -	spin_lock(&vmap_area_lock);
> +	struct vmap_node *vn = addr_to_node(va->va_start);
> +
> +	spin_lock(&vn->busy.lock);
>  	setup_vmalloc_vm_locked(vm, va, flags, caller);
> -	spin_unlock(&vmap_area_lock);
> +	spin_unlock(&vn->busy.lock);
>  }
>
>  static void clear_vm_uninitialized_flag(struct vm_struct *vm)
> @@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
>   */
>  long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
>  	struct vm_struct *vm;
>  	char *vaddr;
> @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)

Unrelated to your change but makes me feel a little unwell to see 'const
char *addr'! Can we change this at some point? Or maybe I can :)

>
>  	remains = count;
>
> -	spin_lock(&vmap_area_lock);
> -	va = find_vmap_area_exceed_addr((unsigned long)addr);
> +	/* Hooked to node_0 so far. */
> +	vn = addr_to_node(0);

Why can't we use addr for this call? We already enforce the node-0 only
thing by setting nr_vmap_nodes to 1, right? And won't this be potentially
subtly wrong when we later increase it?

> +	spin_lock(&vn->busy.lock);
> +
> +	va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
>  	if (!va)
>  		goto finished_zero;
>
> @@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>  	if ((unsigned long)addr + remains <= va->va_start)
>  		goto finished_zero;
>
> -	list_for_each_entry_from(va, &vmap_area_list, list) {
> +	list_for_each_entry_from(va, &vn->busy.head, list) {
>  		size_t copied;
>
>  		if (remains == 0)
> @@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>  	}
>
>  finished_zero:
> -	spin_unlock(&vmap_area_lock);
> +	spin_unlock(&vn->busy.lock);
>  	/* zero-fill memory holes */
>  	return count - remains + zero_iter(iter, remains);
>  finished:
>  	/* Nothing remains, or We couldn't copy/zero everything. */
> -	spin_unlock(&vmap_area_lock);
> +	spin_unlock(&vn->busy.lock);
>
>  	return count - remains;
>  }
> @@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  	}
>
>  	/* insert all vm's */
> -	spin_lock(&vmap_area_lock);
>  	for (area = 0; area < nr_vms; area++) {
> -		insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
> +		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
>
> +		spin_lock(&vn->busy.lock);
> +		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
>  		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
>  				 pcpu_get_vm_areas);
> +		spin_unlock(&vn->busy.lock);

Hmm, before we were locking/unlocking once before the loop, now we're
locking on each iteration, this seems inefficient.

Seems like we need logic like:

/* ... something to check nr_vms > 0 ... */
struct vmap_node *last_node = NULL;

for (...) {
	struct vmap_node *vnode = addr_to_node(vas[area]->va_start);

	if (vnode != last_node) {
		if (last_node)
			spin_unlock(&last_node->busy.lock);
		spin_lock(&vnode->busy.lock);
		last_node = vnode;
	}

	...
}

if (last_node)
	spin_unlock(&last_node->busy.lock);

To minimise the lock twiddling. What do you think?

>  	}
> -	spin_unlock(&vmap_area_lock);
>
>  	/*
>  	 * Mark allocated areas as accessible. Do it now as a best-effort
> @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
>  {
>  	void *objp = (void *)PAGE_ALIGN((unsigned long)object);
>  	const void *caller;
> -	struct vm_struct *vm;
>  	struct vmap_area *va;
> +	struct vmap_node *vn;
>  	unsigned long addr;
>  	unsigned int nr_pages;
> +	bool success = false;
>
> -	if (!spin_trylock(&vmap_area_lock))
> -		return false;

A nitpick on style: I really don't know why you are removing this early
exit. It's far neater to have a guard clause than to nest a whole bunch
of code below.

> -	va = __find_vmap_area((unsigned long)objp, &vmap_area_root);
> -	if (!va) {
> -		spin_unlock(&vmap_area_lock);
> -		return false;
> -	}
> +	vn = addr_to_node((unsigned long)objp);

Later in the patch where you fix build bot issues with the below
__find_vmap_area() invocation, you move from addr to (unsigned long)objp.

However since you're already referring to that here, why not change what
addr refers to and use that in both instances, e.g.

unsigned long addr = (unsigned long)objp;

Then update things that refer to the objp value as necessary.

>
> -	vm = va->vm;
> -	if (!vm) {
> -		spin_unlock(&vmap_area_lock);
> -		return false;
> +	if (spin_trylock(&vn->busy.lock)) {
> +		va = __find_vmap_area(addr, &vn->busy.root);
> +
> +		if (va && va->vm) {
> +			addr = (unsigned long)va->vm->addr;
> +			caller = va->vm->caller;
> +			nr_pages = va->vm->nr_pages;

Again it feels like you're undoing some good here, now you're referencing
va->vm over and over when you could simply assign vm = va->vm as the
original code did.

Also again it'd be nicer to use an early exit/guard clause approach.

> +			success = true;
> +		}
> +
> +		spin_unlock(&vn->busy.lock);
>  	}
> -	addr = (unsigned long)vm->addr;
> -	caller = vm->caller;
> -	nr_pages = vm->nr_pages;
> -	spin_unlock(&vmap_area_lock);
> -	pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> -		nr_pages, addr, caller);
> -	return true;
> +
> +	if (success)
> +		pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> +			nr_pages, addr, caller);

With the redefinition of addr, you could then simply put va->vm->addr here.

> +
> +	return success;
>  }
>  #endif

These are all essentially style nits (the actual bug was fixed by your
follow up patch for the build bots) :)

>
>  #ifdef CONFIG_PROC_FS
>  static void *s_start(struct seq_file *m, loff_t *pos)
> -	__acquires(&vmap_purge_lock)
> -	__acquires(&vmap_area_lock)

Do we want to replace these __acquires() directives? I suppose we simply
cannot now we need to look up the node.

>  {
> +	struct vmap_node *vn = addr_to_node(0);

Hm, does the procfs operation span only one node? I guess we can start
from the initial node for an iteration, but I wonder if '&vmap_nodes[0]'
here is a more 'honest' thing to do rather than assuming that address 0
gets translated to node zero here?

I think a comment like:

/* We start from node 0 */

Would be useful here at any rate.

> +
>  	mutex_lock(&vmap_purge_lock);
> -	spin_lock(&vmap_area_lock);
> +	spin_lock(&vn->busy.lock);
>
> -	return seq_list_start(&vmap_area_list, *pos);
> +	return seq_list_start(&vn->busy.head, *pos);
>  }
>
>  static void *s_next(struct seq_file *m, void *p, loff_t *pos)
>  {
> -	return seq_list_next(p, &vmap_area_list, pos);
> +	struct vmap_node *vn = addr_to_node(0);

This one I'm a little more uncertain of, obviously comments above still
apply but actually shouldn't we add a check to see if we're at the end of
the list and should look at the next node?

Even if we only have one for now, I don't like the idea of leaving in
hardcoded things that might get missed when we move to nr_vmap_nodes > 1.

For instance right now if you increased this above 1 it'd break things
right?

I'd say better to write logic assuming nr_vmap_nodes _could_ be > 1 even
if, to start, it won't be.

> +	return seq_list_next(p, &vn->busy.head, pos);
>  }
>
>  static void s_stop(struct seq_file *m, void *p)
> -	__releases(&vmap_area_lock)
> -	__releases(&vmap_purge_lock)
>  {
> -	spin_unlock(&vmap_area_lock);
> +	struct vmap_node *vn = addr_to_node(0);

See comments above about use of addr_to_node(0).

> +
> +	spin_unlock(&vn->busy.lock);
>  	mutex_unlock(&vmap_purge_lock);
>  }
>
> @@ -4344,9 +4426,11 @@ static void show_purge_info(struct seq_file *m)
>
>  static int s_show(struct seq_file *m, void *p)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
>  	struct vm_struct *v;
>
> +	vn = addr_to_node(0);

This one is really quite icky, should we make it easy for a vmap_area to
know its vmap_node? How is this going to work once nr_vmap_nodes > 1?

>  	va = list_entry(p, struct vmap_area, list);
>
>  	if (!va->vm) {
> @@ -4397,7 +4481,7 @@ static int s_show(struct seq_file *m, void *p)
>  	 * As a final step, dump "unpurged" areas.
>  	 */
>  final:
> -	if (list_is_last(&va->list, &vmap_area_list))
> +	if (list_is_last(&va->list, &vn->busy.head))
>  		show_purge_info(m);
>
>  	return 0;
> @@ -4428,7 +4512,8 @@ static void vmap_init_free_space(void)
>  {
>  	unsigned long vmap_start = 1;
>  	const unsigned long vmap_end = ULONG_MAX;
> -	struct vmap_area *busy, *free;
> +	struct vmap_area *free;
> +	struct vm_struct *busy;
>
>  	/*
>  	 *     B     F     B     B     B     F
> @@ -4436,12 +4521,12 @@ static void vmap_init_free_space(void)
>  	 *  |           The KVA space           |
>  	 *  |<--------------------------------->|
>  	 */
> -	list_for_each_entry(busy, &vmap_area_list, list) {
> -		if (busy->va_start - vmap_start > 0) {
> +	for (busy = vmlist; busy; busy = busy->next) {
> +		if ((unsigned long) busy->addr - vmap_start > 0) {
>  			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
>  			if (!WARN_ON_ONCE(!free)) {
>  				free->va_start = vmap_start;
> -				free->va_end = busy->va_start;
> +				free->va_end = (unsigned long) busy->addr;
>
>  				insert_vmap_area_augment(free, NULL,
>  					&free_vmap_area_root,
> @@ -4449,7 +4534,7 @@ static void vmap_init_free_space(void)
>  			}
>  		}
>
> -		vmap_start = busy->va_end;
> +		vmap_start = (unsigned long) busy->addr + busy->size;
>  	}
>
>  	if (vmap_end - vmap_start > 0) {
> @@ -4465,9 +4550,23 @@ static void vmap_init_free_space(void)
>  	}
>  }
>
> +static void vmap_init_nodes(void)
> +{
> +	struct vmap_node *vn;
> +	int i;
> +
> +	for (i = 0; i < nr_vmap_nodes; i++) {
> +		vn = &vmap_nodes[i];
> +		vn->busy.root = RB_ROOT;
> +		INIT_LIST_HEAD(&vn->busy.head);
> +		spin_lock_init(&vn->busy.lock);
> +	}
> +}
> +
>  void __init vmalloc_init(void)
>  {
>  	struct vmap_area *va;
> +	struct vmap_node *vn;
>  	struct vm_struct *tmp;
>  	int i;
>
> @@ -4489,6 +4588,11 @@ void __init vmalloc_init(void)
>  		xa_init(&vbq->vmap_blocks);
>  	}
>
> +	/*
> +	 * Setup nodes before importing vmlist.
> +	 */
> +	vmap_init_nodes();
> +
>  	/* Import existing vmlist entries. */
>  	for (tmp = vmlist; tmp; tmp = tmp->next) {
>  		va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> @@ -4498,7 +4602,9 @@ void __init vmalloc_init(void)
>  		va->va_start = (unsigned long)tmp->addr;
>  		va->va_end = va->va_start + tmp->size;
>  		va->vm = tmp;
> -		insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> +
> +		vn = addr_to_node(va->va_start);
> +		insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
>  	}
>
>  	/*
> --
> 2.39.2
>
  
Uladzislau Rezki Jan. 18, 2024, 1:15 p.m. UTC | #9
On Tue, Jan 16, 2024 at 11:25:43PM +0000, Lorenzo Stoakes wrote:
> On Tue, Jan 02, 2024 at 07:46:26PM +0100, Uladzislau Rezki (Sony) wrote:
> > Store allocated objects in a separate nodes. A va->va_start
> > address is converted into a correct node where it should
> > be placed and resided. An addr_to_node() function is used
> > to do a proper address conversion to determine a node that
> > contains a VA.
> >
> > Such approach balances VAs across nodes as a result an access
> > becomes scalable. Number of nodes in a system depends on number
> > of CPUs.
> >
> > Please note:
> >
> > 1. As of now allocated VAs are bound to a node-0. It means the
> >    patch does not give any difference comparing with a current
> >    behavior;
> >
> > 2. The global vmap_area_lock, vmap_area_root are removed as there
> >    is no need in it anymore. The vmap_area_list is still kept and
> >    is _empty_. It is exported for a kexec only;
> >
> > 3. The vmallocinfo and vread() have to be reworked to be able to
> >    handle multiple nodes.
> >
> > Reviewed-by: Baoquan He <bhe@redhat.com>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 173 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 06bd843d18ae..786ecb18ae22 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
> >  #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
> >
> >
> > -static DEFINE_SPINLOCK(vmap_area_lock);
> >  static DEFINE_SPINLOCK(free_vmap_area_lock);
> >  /* Export for kexec only */
> >  LIST_HEAD(vmap_area_list);
> > -static struct rb_root vmap_area_root = RB_ROOT;
> >  static bool vmap_initialized __read_mostly;
> >
> >  static struct rb_root purge_vmap_area_root = RB_ROOT;
> > @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
> >   */
> >  static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> >
> > +/*
> > + * An effective vmap-node logic. Users make use of nodes instead
> > + * of a global heap. It allows to balance an access and mitigate
> > + * contention.
> > + */
> > +struct rb_list {
> 
> I'm not sure this name is very instructive - this contains a red/black tree
> root node, a list head and a lock, but the meaning of it is that it stores
> a red/black tree and a list of vmap_area objects and has a lock to protect
> access.
> 
> A raw 'list_head' without a comment is always hard to parse, maybe some
> comments/embed in vmap_node?
> 
> At the very least if you wanted to keep the name generic it should be
> something like ordered_rb_tree or something like this.
> 
I can add a comment in the vmap_node. rb_list describes a single, solid
data structure where a list and an rb-tree are parts of one entity
protected by a lock. It is similar to a B+tree, where leaf nodes are
linked to each other to allow fast sequential traversal.

> > +	struct rb_root root;
> > +	struct list_head head;
> > +	spinlock_t lock;
> > +};
> > +
> > +static struct vmap_node {
> > +	/* Bookkeeping data of this node. */
> > +	struct rb_list busy;
> > +} single;
> 
> This may be a thing about encapsulation/naming or similar, but I'm a little
> confused as to why the rb_list type is maintained as a field rather than
> its fields embedded?
> 
The "struct vmap_node" will be extended by the following patches in the
series.

> > +
> > +static struct vmap_node *vmap_nodes = &single;
> > +static __read_mostly unsigned int nr_vmap_nodes = 1;
> > +static __read_mostly unsigned int vmap_zone_size = 1;
> 
> It might be worth adding a comment here explaining that we're binding to a
> single node for now to maintain existing behaviour (and a brief description
> of what these values mean - for instance what unit vmap_zone_size is
> expressed in?)
> 
Right. Agree on it :)

> > +
> > +static inline unsigned int
> > +addr_to_node_id(unsigned long addr)
> > +{
> > +	return (addr / vmap_zone_size) % nr_vmap_nodes;
> > +}
> > +
> > +static inline struct vmap_node *
> > +addr_to_node(unsigned long addr)
> > +{
> > +	return &vmap_nodes[addr_to_node_id(addr)];
> > +}
> > +
> >  static __always_inline unsigned long
> >  va_size(struct vmap_area *va)
> >  {
> > @@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void)
> >  }
> >
> >  /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> > +static struct vmap_area *
> > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
> >  {
> >  	struct vmap_area *va = NULL;
> > -	struct rb_node *n = vmap_area_root.rb_node;
> > +	struct rb_node *n = root->rb_node;
> >
> >  	addr = (unsigned long)kasan_reset_tag((void *)addr);
> >
> > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> >   */
> >  static void free_vmap_area(struct vmap_area *va)
> >  {
> > +	struct vmap_node *vn = addr_to_node(va->va_start);
> > +
> 
> I'm being nitty here, and while I know it's a vmalloc convention to use
> 'va' and 'vm', perhaps we can break away from the super short variable name
> convention and use 'vnode' or something for these values?
> 
> I feel people might get confused between 'vm' and 'vn' for instance.
> 
vnode, varea?

> >  	/*
> >  	 * Remove from the busy tree/list.
> >  	 */
> > -	spin_lock(&vmap_area_lock);
> > -	unlink_va(va, &vmap_area_root);
> > -	spin_unlock(&vmap_area_lock);
> > +	spin_lock(&vn->busy.lock);
> > +	unlink_va(va, &vn->busy.root);
> > +	spin_unlock(&vn->busy.lock);
> >
> >  	/*
> >  	 * Insert/Merge it back to the free tree/list.
> > @@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> >  				int node, gfp_t gfp_mask,
> >  				unsigned long va_flags)
> >  {
> > +	struct vmap_node *vn;
> >  	struct vmap_area *va;
> >  	unsigned long freed;
> >  	unsigned long addr;
> > @@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> >  	va->vm = NULL;
> >  	va->flags = va_flags;
> >
> > -	spin_lock(&vmap_area_lock);
> > -	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
> > -	spin_unlock(&vmap_area_lock);
> > +	vn = addr_to_node(va->va_start);
> > +
> > +	spin_lock(&vn->busy.lock);
> > +	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
> > +	spin_unlock(&vn->busy.lock);
> >
> >  	BUG_ON(!IS_ALIGNED(va->va_start, align));
> >  	BUG_ON(va->va_start < vstart);
> > @@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va)
> >
> >  struct vmap_area *find_vmap_area(unsigned long addr)
> >  {
> > +	struct vmap_node *vn;
> >  	struct vmap_area *va;
> > +	int i, j;
> >
> > -	spin_lock(&vmap_area_lock);
> > -	va = __find_vmap_area(addr, &vmap_area_root);
> > -	spin_unlock(&vmap_area_lock);
> > +	/*
> > +	 * An addr_to_node_id(addr) converts an address to a node index
> > +	 * where a VA is located. If VA spans several zones and passed
> > +	 * addr is not the same as va->va_start, what is not common, we
> > +	 * may need to scan an extra nodes. See an example:
> 
> For my understanding, when you say 'scan an extra nodes' do you mean scan
> just 1 extra node, or multiple? If the former I'd replace this with 'may
> need to scan an extra node', if the latter then 'may need to scan extra
> nodes'.
> 
> It's a nitty language thing, but also potentially changes the meaning of
> this!
> 
Typo, I should replace it with: scan extra nodes.

> > +	 *
> > +	 *      <--va-->
> > +	 * -|-----|-----|-----|-----|-
> > +	 *     1     2     0     1
> > +	 *
> > +	 * VA resides in node 1 whereas it spans 1 and 2. If passed
> > +	 * addr is within a second node we should do extra work. We
> > +	 * should mention that it is rare and is a corner case from
> > +	 * the other hand it has to be covered.
> 
> A very minor language style nit, but you've already said this is not
> common, I don't think you need this 'We should mention...' bit. It's not a
> big deal however!
> 
No problem. We can remove it!

> > +	 */
> > +	i = j = addr_to_node_id(addr);
> > +	do {
> > +		vn = &vmap_nodes[i];
> >
> > -	return va;
> > +		spin_lock(&vn->busy.lock);
> > +		va = __find_vmap_area(addr, &vn->busy.root);
> > +		spin_unlock(&vn->busy.lock);
> > +
> > +		if (va)
> > +			return va;
> > +	} while ((i = (i + 1) % nr_vmap_nodes) != j);
> 
> If you comment above suggests that only 1 extra node might need to be
> scanned, should we stop after one iteration?
> 
Not really. Though we can improve it further to scan backward.
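
The wraparound lookup being discussed can be modelled in plain userspace C. This is only an illustrative sketch: the node count, zone size, and the `lookup()` callback stand in for the kernel's globals and the locked rb-tree search, and are not the real API.

```c
#include <assert.h>

/* Illustrative stand-ins for the patch's nr_vmap_nodes/vmap_zone_size. */
#define NR_VMAP_NODES  4UL
#define VMAP_ZONE_SIZE 4096UL

static unsigned int addr_to_node_id(unsigned long addr)
{
	return (addr / VMAP_ZONE_SIZE) % NR_VMAP_NODES;
}

/*
 * Model of find_vmap_area(): start at the node the address maps to and
 * walk forward, wrapping around, until the VA is found or every node
 * has been visited. lookup() stands in for the per-node locked search;
 * it returns a negative value on a miss.
 */
static long find_in_nodes(unsigned long addr,
			  long (*lookup)(unsigned int node, unsigned long addr))
{
	unsigned int i, j;

	i = j = addr_to_node_id(addr);
	do {
		long va = lookup(i, addr);

		if (va >= 0)
			return va;
	} while ((i = (i + 1) % NR_VMAP_NODES) != j);

	return -1;
}
```

The point of the loop shape is exactly the corner case from the comment: a VA whose `va_start` hashes to one node can be hit with an address that hashes to a neighbouring node, so a miss on the first node falls through to the others before giving up.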

> > +
> > +	return NULL;
> >  }
> >
> >  static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> >  {
> > +	struct vmap_node *vn;
> >  	struct vmap_area *va;
> > +	int i, j;
> >
> > -	spin_lock(&vmap_area_lock);
> > -	va = __find_vmap_area(addr, &vmap_area_root);
> > -	if (va)
> > -		unlink_va(va, &vmap_area_root);
> > -	spin_unlock(&vmap_area_lock);
> > +	i = j = addr_to_node_id(addr);
> > +	do {
> > +		vn = &vmap_nodes[i];
> >
> > -	return va;
> > +		spin_lock(&vn->busy.lock);
> > +		va = __find_vmap_area(addr, &vn->busy.root);
> > +		if (va)
> > +			unlink_va(va, &vn->busy.root);
> > +		spin_unlock(&vn->busy.lock);
> > +
> > +		if (va)
> > +			return va;
> > +	} while ((i = (i + 1) % nr_vmap_nodes) != j);
> 
> Maybe worth adding a comment saying to refer to the comment in
> find_vmap_area() to see why this loop is necessary.
> 
OK. We can do it to make it better for reading.

> > +
> > +	return NULL;
> >  }
> >
> >  /*** Per cpu kva allocator ***/
> > @@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> >
> >  static void free_vmap_block(struct vmap_block *vb)
> >  {
> > +	struct vmap_node *vn;
> >  	struct vmap_block *tmp;
> >  	struct xarray *xa;
> >
> > @@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb)
> >  	tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start));
> >  	BUG_ON(tmp != vb);
> >
> > -	spin_lock(&vmap_area_lock);
> > -	unlink_va(vb->va, &vmap_area_root);
> > -	spin_unlock(&vmap_area_lock);
> > +	vn = addr_to_node(vb->va->va_start);
> > +	spin_lock(&vn->busy.lock);
> > +	unlink_va(vb->va, &vn->busy.root);
> > +	spin_unlock(&vn->busy.lock);
> >
> >  	free_vmap_area_noflush(vb->va);
> >  	kfree_rcu(vb, rcu_head);
> > @@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> >  static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
> >  			      unsigned long flags, const void *caller)
> >  {
> > -	spin_lock(&vmap_area_lock);
> > +	struct vmap_node *vn = addr_to_node(va->va_start);
> > +
> > +	spin_lock(&vn->busy.lock);
> >  	setup_vmalloc_vm_locked(vm, va, flags, caller);
> > -	spin_unlock(&vmap_area_lock);
> > +	spin_unlock(&vn->busy.lock);
> >  }
> >
> >  static void clear_vm_uninitialized_flag(struct vm_struct *vm)
> > @@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
> >   */
> >  long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> >  {
> > +	struct vmap_node *vn;
> >  	struct vmap_area *va;
> >  	struct vm_struct *vm;
> >  	char *vaddr;
> > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> 
> Unrelated to your change but makes me feel a little unwell to see 'const
> char *addr'! Can we change this at some point? Or maybe I can :)
> 
You are welcome :)

> >
> >  	remains = count;
> >
> > -	spin_lock(&vmap_area_lock);
> > -	va = find_vmap_area_exceed_addr((unsigned long)addr);
> > +	/* Hooked to node_0 so far. */
> > +	vn = addr_to_node(0);
> 
> Why can't we use addr for this call? We already enforce the node-0 only
> thing by setting nr_vmap_nodes to 1 right? And won't this be potentially
> subtly wrong when we later increase this?
> 
I used to have 0 here. But please note, it is changed by the next patch in
this series.

> > +	spin_lock(&vn->busy.lock);
> > +
> > +	va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
> >  	if (!va)
> >  		goto finished_zero;
> >
> > @@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> >  	if ((unsigned long)addr + remains <= va->va_start)
> >  		goto finished_zero;
> >
> > -	list_for_each_entry_from(va, &vmap_area_list, list) {
> > +	list_for_each_entry_from(va, &vn->busy.head, list) {
> >  		size_t copied;
> >
> >  		if (remains == 0)
> > @@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> >  	}
> >
> >  finished_zero:
> > -	spin_unlock(&vmap_area_lock);
> > +	spin_unlock(&vn->busy.lock);
> >  	/* zero-fill memory holes */
> >  	return count - remains + zero_iter(iter, remains);
> >  finished:
> >  	/* Nothing remains, or We couldn't copy/zero everything. */
> > -	spin_unlock(&vmap_area_lock);
> > +	spin_unlock(&vn->busy.lock);
> >
> >  	return count - remains;
> >  }
> > @@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> >  	}
> >
> >  	/* insert all vm's */
> > -	spin_lock(&vmap_area_lock);
> >  	for (area = 0; area < nr_vms; area++) {
> > -		insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
> > +		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
> >
> > +		spin_lock(&vn->busy.lock);
> > +		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> >  		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> >  				 pcpu_get_vm_areas);
> > +		spin_unlock(&vn->busy.lock);
> 
> Hmm, before we were locking/unlocking once before the loop, now we're
> locking on each iteration, this seems inefficient.
> 
> Seems like we need logic like:
> 
> /* ... something to check nr_vms > 0 ... */
> struct vmap_node *last_node = NULL;
> 
> for (...) {
> 	struct vmap_node *vnode = addr_to_node(vas[area]->va_start);
> 
> 	if (vnode != last_node) {
> 		spin_unlock(last_node->busy.lock);
> 		spin_lock(vnode->busy.lock);
> 		last_node = vnode;
> 	}
> 
> 	...
> }
> 
> if (last_node)
> 	spin_unlock(last_node->busy.lock);
> 
> To minimise the lock twiddling. What do you think?
>
This per-cpu allocator prefetches several VA units per CPU. I do not
consider it critical because it is not a hot path for the per-cpu allocator.
When its buffers are exhausted it does an extra prefetch, so it is not
frequent.
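
For reference, the lock-batching pattern suggested in the review can be sketched as a standalone userspace model. Pthread mutexes stand in for the per-node spinlocks, and `node_of()` is an invented stand-in for `addr_to_node()`; this is a sketch of the idea, not the kernel code.

```c
#include <assert.h>
#include <pthread.h>

#define NNODES 4

struct node {
	pthread_mutex_t lock;
	int inserts;	/* stand-in for the busy tree/list */
};

static struct node *node_of(struct node *nodes, unsigned long addr)
{
	return &nodes[(addr / 4096) % NNODES];
}

/*
 * Insert a batch of addresses, only cycling the lock when the target
 * node actually changes, instead of locking/unlocking per iteration.
 */
static void insert_batch(struct node *nodes, const unsigned long *addrs, int n)
{
	struct node *last = NULL;
	int i;

	for (i = 0; i < n; i++) {
		struct node *cur = node_of(nodes, addrs[i]);

		if (cur != last) {
			if (last)
				pthread_mutex_unlock(&last->lock);
			pthread_mutex_lock(&cur->lock);
			last = cur;
		}
		cur->inserts++;	/* done under cur->lock */
	}

	if (last)
		pthread_mutex_unlock(&last->lock);
}
```

As noted above, this only pays off when consecutive VAs tend to land on the same node; for an infrequent slow path the simpler per-iteration locking is a reasonable trade.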

> 
> >  	}
> > -	spin_unlock(&vmap_area_lock);
> >
> >  	/*
> >  	 * Mark allocated areas as accessible. Do it now as a best-effort
> > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
> >  {
> >  	void *objp = (void *)PAGE_ALIGN((unsigned long)object);
> >  	const void *caller;
> > -	struct vm_struct *vm;
> >  	struct vmap_area *va;
> > +	struct vmap_node *vn;
> >  	unsigned long addr;
> >  	unsigned int nr_pages;
> > +	bool success = false;
> >
> > -	if (!spin_trylock(&vmap_area_lock))
> > -		return false;
> 
> Nitpick on style for this, I really don't know why you are removing this
> early exit? It's far neater to have a guard clause than to nest a whole
> bunch of code below.
> 
Hm... I can return back as it used to be. I do not have a strong opinion here.
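
The guard-clause shape being asked for looks roughly like this userspace model, where a pthread mutex stands in for the busy lock and the lookup/report details are elided; names here are illustrative, not the kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Early-exit style: each failed precondition returns immediately,
 * so the success path stays unindented and there is no 'success'
 * flag threaded through the function.
 */
static bool dump_obj(pthread_mutex_t *lock, const int *vm)
{
	if (pthread_mutex_trylock(lock) != 0)
		return false;		/* lock contended: give up */

	if (!vm) {			/* object not found / not set up */
		pthread_mutex_unlock(lock);
		return false;
	}

	/* ... report *vm here, then drop the lock ... */
	pthread_mutex_unlock(lock);
	return true;
}
```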

> > -	va = __find_vmap_area((unsigned long)objp, &vmap_area_root);
> > -	if (!va) {
> > -		spin_unlock(&vmap_area_lock);
> > -		return false;
> > -	}
> > +	vn = addr_to_node((unsigned long)objp);
> 
> Later in the patch where you fix build bot issues with the below
> __find_vmap_area() invocation, you move from addr to (unsigned long)objp.
> 
> However since you're already referring to that here, why not change what
> addr refers to and use that in both instances, e.g.
> 
> unsigned long addr = (unsigned long)objp;
> 
> Then update things that refer to the objp value as necessary.
> 
That is what I was thinking of. We can send a separate patch for this.

> >
> > -	vm = va->vm;
> > -	if (!vm) {
> > -		spin_unlock(&vmap_area_lock);
> > -		return false;
> > +	if (spin_trylock(&vn->busy.lock)) {
> > +		va = __find_vmap_area(addr, &vn->busy.root);
> > +
> > +		if (va && va->vm) {
> > +			addr = (unsigned long)va->vm->addr;
> > +			caller = va->vm->caller;
> > +			nr_pages = va->vm->nr_pages;
> 
> Again it feels like you're undoing some good here, now you're referencing
> va->vm over and over when you could simply assign vm = va->vm as the
> original code did.
> 
> Also again it'd be nicer to use an early exit/guard clause approach.
> 
We can change it in separate patch.

> > +			success = true;
> > +		}
> > +
> > +		spin_unlock(&vn->busy.lock);
> >  	}
> > -	addr = (unsigned long)vm->addr;
> > -	caller = vm->caller;
> > -	nr_pages = vm->nr_pages;
> > -	spin_unlock(&vmap_area_lock);
> > -	pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> > -		nr_pages, addr, caller);
> > -	return true;
> > +
> > +	if (success)
> > +		pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
> > +			nr_pages, addr, caller);
> 
> With the redefinition of addr, you could then simply put va->vm->addr here.
> 
> > +
> > +	return success;
> >  }
> >  #endif
> 
> These are all essentially style nits (the actual bug was fixed by your
> follow up patch for the build bots) :)
> 
Right :)

> >
> >  #ifdef CONFIG_PROC_FS
> >  static void *s_start(struct seq_file *m, loff_t *pos)
> > -	__acquires(&vmap_purge_lock)
> > -	__acquires(&vmap_area_lock)
> 
> Do we want to replace these __acquires() directives? I suppose we simply
> cannot now we need to look up the node.
> 
Yep. It is reworked anyway in another patch.

> >  {
> > +	struct vmap_node *vn = addr_to_node(0);
> 
> Hm does the procfs operation span only one node? I guess we can start from
> the initial node for an iteration, but I wonder if '&vmap_nodes[0]' here is
> a more 'honest' thing to do rather than to assume that address 0 gets
> translated to node zero here?
> 
> I think a comment like:
> 
> /* We start from node 0 */
> 
> Would be useful here at any rate.
> 
It works since nr_nodes is set to 1. It is fleshed out and completed
by later patches in this series.

> > +
> >  	mutex_lock(&vmap_purge_lock);
> > -	spin_lock(&vmap_area_lock);
> > +	spin_lock(&vn->busy.lock);
> >
> > -	return seq_list_start(&vmap_area_list, *pos);
> > +	return seq_list_start(&vn->busy.head, *pos);
> >  }
> >
> >  static void *s_next(struct seq_file *m, void *p, loff_t *pos)
> >  {
> > -	return seq_list_next(p, &vmap_area_list, pos);
> > +	struct vmap_node *vn = addr_to_node(0);
> 
> This one I'm a little more uncertain of, obviously comments above still
> apply but actually shouldn't we add a check to see if we're at the end of
> the list and should look at the next node?
> 
> Even if we only have one for now, I don't like the idea of leaving in
> hardcoded things that might get missed when we move to nr_vmap_nodes > 1.
> 
> For instance right now if you increased this above 1 it'd break things
> right?
> 
> I'd say better to write logic assuming nr_vmap_nodes _could_ be > 1 even
> if, to start, it won't be.
> 
Same as above. It is incomplete and stuck to a single node. Further
patches solve this fully.

> > +	return seq_list_next(p, &vn->busy.head, pos);
> >  }
> >
> >  static void s_stop(struct seq_file *m, void *p)
> > -	__releases(&vmap_area_lock)
> > -	__releases(&vmap_purge_lock)
> >  {
> > -	spin_unlock(&vmap_area_lock);
> > +	struct vmap_node *vn = addr_to_node(0);
> 
> See comments above about use of addr_to_node(0).
> 
All of s_start/s_next/etc. are removed and reworked by the following
patches.

> > +
> > +	spin_unlock(&vn->busy.lock);
> >  	mutex_unlock(&vmap_purge_lock);
> >  }
> >
> > @@ -4344,9 +4426,11 @@ static void show_purge_info(struct seq_file *m)
> >
> >  static int s_show(struct seq_file *m, void *p)
> >  {
> > +	struct vmap_node *vn;
> >  	struct vmap_area *va;
> >  	struct vm_struct *v;
> >
> > +	vn = addr_to_node(0);
> 
> This one is really quite icky, should we make it easy for a vmap_area to
> know its vmap_node? How is this going to work once nr_vmap_nodes > 1?
> 
Same as above.

Thank you for the review! I can fix the comments as separate patches if
no objections.

--
Uladzislau Rezki
  
Lorenzo Stoakes Jan. 20, 2024, 12:55 p.m. UTC | #10
On Thu, Jan 18, 2024 at 02:15:31PM +0100, Uladzislau Rezki wrote:

[snip]

>
> > > +	struct rb_root root;
> > > +	struct list_head head;
> > > +	spinlock_t lock;
> > > +};
> > > +
> > > +static struct vmap_node {
> > > +	/* Bookkeeping data of this node. */
> > > +	struct rb_list busy;
> > > +} single;
> >
> > This may be a thing about encapsulation/naming or similar, but I'm a little
> > confused as to why the rb_list type is maintained as a field rather than
> > its fields embedded?
> >
> The "struct vmap_node" will be extended by the following patches in the
> series.
>

Yeah sorry I missed this, only realising after I sent...!

> > > +
> > > +static struct vmap_node *vmap_nodes = &single;
> > > +static __read_mostly unsigned int nr_vmap_nodes = 1;
> > > +static __read_mostly unsigned int vmap_zone_size = 1;
> >
> > It might be worth adding a comment here explaining that we're binding to a
> > single node for now to maintain existing behaviour (and a brief description
> > of what these values mean - for instance what unit vmap_zone_size is
> > expressed in?)
> >
> Right. Agree on it :)
>

Indeed :)

[snip]

> > >  /* Look up the first VA which satisfies addr < va_end, NULL if none. */
> > > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
> > > +static struct vmap_area *
> > > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
> > >  {
> > >  	struct vmap_area *va = NULL;
> > > -	struct rb_node *n = vmap_area_root.rb_node;
> > > +	struct rb_node *n = root->rb_node;
> > >
> > >  	addr = (unsigned long)kasan_reset_tag((void *)addr);
> > >
> > > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> > >   */
> > >  static void free_vmap_area(struct vmap_area *va)
> > >  {
> > > +	struct vmap_node *vn = addr_to_node(va->va_start);
> > > +
> >
> > I'm being nitty here, and while I know it's a vmalloc convention to use
> > 'va' and 'vm', perhaps we can break away from the super short variable name
> > convention and use 'vnode' or something for these values?
> >
> > I feel people might get confused between 'vm' and 'vn' for instance.
> >
> vnode, varea?

I think 'vm' and 'va' are fine; it's just that, scanning through, it's easy to
mistake 'vn' for 'vm'. Obviously a little nitpicky! You could replace them all,
but that'd be a bit churny, so I think vn -> vnode works best imo.

[snip]

> > >  struct vmap_area *find_vmap_area(unsigned long addr)
> > >  {
> > > +	struct vmap_node *vn;
> > >  	struct vmap_area *va;
> > > +	int i, j;
> > >
> > > -	spin_lock(&vmap_area_lock);
> > > -	va = __find_vmap_area(addr, &vmap_area_root);
> > > -	spin_unlock(&vmap_area_lock);
> > > +	/*
> > > +	 * An addr_to_node_id(addr) converts an address to a node index
> > > +	 * where a VA is located. If VA spans several zones and passed
> > > +	 * addr is not the same as va->va_start, what is not common, we
> > > +	 * may need to scan an extra nodes. See an example:
> >
> > For my understanding, when you say 'scan an extra nodes' do you mean scan
> > just 1 extra node, or multiple? If the former I'd replace this with 'may
> > need to scan an extra node', if the latter then 'may need to scan extra
> > nodes'.
> >
> > It's a nitty language thing, but also potentially changes the meaning of
> > this!
> >
> Typo, i should replace it to: scan extra nodes.

Thanks.

>
> > > +	 *
> > > +	 *      <--va-->
> > > +	 * -|-----|-----|-----|-----|-
> > > +	 *     1     2     0     1
> > > +	 *
> > > +	 * VA resides in node 1 whereas it spans 1 and 2. If passed
> > > +	 * addr is within a second node we should do extra work. We
> > > +	 * should mention that it is rare and is a corner case from
> > > +	 * the other hand it has to be covered.
> >
> > A very minor language style nit, but you've already said this is not
> > common, I don't think you need this 'We should mention...' bit. It's not a
> > big deal however!
> >
> No problem. We can remove it!

Thanks.

>
> > > +	 */
> > > +	i = j = addr_to_node_id(addr);
> > > +	do {
> > > +		vn = &vmap_nodes[i];
> > >
> > > -	return va;
> > > +		spin_lock(&vn->busy.lock);
> > > +		va = __find_vmap_area(addr, &vn->busy.root);
> > > +		spin_unlock(&vn->busy.lock);
> > > +
> > > +		if (va)
> > > +			return va;
> > > +	} while ((i = (i + 1) % nr_vmap_nodes) != j);
> >
> > If you comment above suggests that only 1 extra node might need to be
> > scanned, should we stop after one iteration?
> >
> Not really. Though we can improve it further to scan backward.

I think it'd be good to clarify in the comment above that the VA could span
more than 1 node then, as the diagram seems to imply only 1 (I think just
simply because of the example you were showing).

[snip]

> > >  static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> > >  {
> > > +	struct vmap_node *vn;
> > >  	struct vmap_area *va;
> > > +	int i, j;
> > >
> > > -	spin_lock(&vmap_area_lock);
> > > -	va = __find_vmap_area(addr, &vmap_area_root);
> > > -	if (va)
> > > -		unlink_va(va, &vmap_area_root);
> > > -	spin_unlock(&vmap_area_lock);
> > > +	i = j = addr_to_node_id(addr);
> > > +	do {
> > > +		vn = &vmap_nodes[i];
> > >
> > > -	return va;
> > > +		spin_lock(&vn->busy.lock);
> > > +		va = __find_vmap_area(addr, &vn->busy.root);
> > > +		if (va)
> > > +			unlink_va(va, &vn->busy.root);
> > > +		spin_unlock(&vn->busy.lock);
> > > +
> > > +		if (va)
> > > +			return va;
> > > +	} while ((i = (i + 1) % nr_vmap_nodes) != j);
> >
> > Maybe worth adding a comment saying to refer to the comment in
> > find_vmap_area() to see why this loop is necessary.
> >
> OK. We can do it to make it better for reading.

Thanks!

[snip]

> > > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> >
> > Unrelated to your change but makes me feel a little unwell to see 'const
> > char *addr'! Can we change this at some point? Or maybe I can :)
> >
> You are welcome :)

Haha ;) yes I think I might tbh, I have noted it down.

>
> > >
> > >  	remains = count;
> > >
> > > -	spin_lock(&vmap_area_lock);
> > > -	va = find_vmap_area_exceed_addr((unsigned long)addr);
> > > +	/* Hooked to node_0 so far. */
> > > +	vn = addr_to_node(0);
> >
> > Why can't we use addr for this call? We already enforce the node-0 only
> > thing by setting nr_vmap_nodes to 1 right? And won't this be potentially
> > subtly wrong when we later increase this?
> >
> I used to have 0 here. But please note, it is changed by the next patch in
> this series.

Yeah sorry, again hadn't noticed this.

[snip]

> > > +		spin_lock(&vn->busy.lock);
> > > +		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> > >  		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> > >  				 pcpu_get_vm_areas);
> > > +		spin_unlock(&vn->busy.lock);
> >
> > Hmm, before we were locking/unlocking once before the loop, now we're
> > locking on each iteration, this seems inefficient.
> >
> > Seems like we need logic like:
> >
> > /* ... something to check nr_vms > 0 ... */
> > struct vmap_node *last_node = NULL;
> >
> > for (...) {
> > 	struct vmap_node *vnode = addr_to_node(vas[area]->va_start);
> >
> > 	if (vnode != last_node) {
> > 		spin_unlock(last_node->busy.lock);
> > 		spin_lock(vnode->busy.lock);
> > 		last_node = vnode;
> > 	}
> >
> > 	...
> > }
> >
> > if (last_node)
> > 	spin_unlock(last_node->busy.lock);
> >
> > To minimise the lock twiddling. What do you think?
> >
> This per-cpu-allocator prefetches several VA units per-cpu. I do not
> find it as critical because it is not a hot path for the per-cpu allocator.
> When its buffers are exhausted it does an extra prefetch. So it is not
> frequent.

OK, sure I mean this is simpler and more readable so if not a huge perf
concern then not a big deal.

>
> >
> > >  	}
> > > -	spin_unlock(&vmap_area_lock);
> > >
> > >  	/*
> > >  	 * Mark allocated areas as accessible. Do it now as a best-effort
> > > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object)
> > >  {
> > >  	void *objp = (void *)PAGE_ALIGN((unsigned long)object);
> > >  	const void *caller;
> > > -	struct vm_struct *vm;
> > >  	struct vmap_area *va;
> > > +	struct vmap_node *vn;
> > >  	unsigned long addr;
> > >  	unsigned int nr_pages;
> > > +	bool success = false;
> > >
> > > -	if (!spin_trylock(&vmap_area_lock))
> > > -		return false;
> >
> > Nitpick on style for this, I really don't know why you are removing this
> > early exit? It's far neater to have a guard clause than to nest a whole
> > bunch of code below.
> >
> Hm... I can return back as it used to be. I do not have a strong opinion here.

Yeah that'd be ideal just for readability.

[snip the rest as broadly fairly trivial comment stuff on which we agree]

>
> Thank you for the review! I can fix the comments as separate patches if
> no objections.

Yes, overall it's style/comment improvement stuff nothing major, feel free
to send as follow-up patches.

I don't want to hold anything up here so for the rest, feel free to add:

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

>
> --
> Uladzislau Rezki
  
Uladzislau Rezki Jan. 22, 2024, 5:44 p.m. UTC | #11
On Sat, Jan 20, 2024 at 12:55:10PM +0000, Lorenzo Stoakes wrote:
> On Thu, Jan 18, 2024 at 02:15:31PM +0100, Uladzislau Rezki wrote:
> 
> [snip]
> 
> Yes, overall it's style/comment improvement stuff nothing major, feel free
> to send as follow-up patches.
> 
> I don't want to hold anything up here so for the rest, feel free to add:
> 
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> 
Appreciated! I will go through it again and send out a patch that adds
the more detailed explanations requested in this review.

Again, thank you!

--
Uladzislau Rezki
  

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 06bd843d18ae..786ecb18ae22 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -728,11 +728,9 @@  EXPORT_SYMBOL(vmalloc_to_pfn);
 #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
 
 
-static DEFINE_SPINLOCK(vmap_area_lock);
 static DEFINE_SPINLOCK(free_vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
-static struct rb_root vmap_area_root = RB_ROOT;
 static bool vmap_initialized __read_mostly;
 
 static struct rb_root purge_vmap_area_root = RB_ROOT;
@@ -772,6 +770,38 @@  static struct rb_root free_vmap_area_root = RB_ROOT;
  */
 static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
 
+/*
+ * An effective vmap-node logic. Users make use of nodes instead
+ * of a global heap. This balances access across nodes and
+ * mitigates contention.
+ */
+struct rb_list {
+	struct rb_root root;
+	struct list_head head;
+	spinlock_t lock;
+};
+
+static struct vmap_node {
+	/* Bookkeeping data of this node. */
+	struct rb_list busy;
+} single;
+
+static struct vmap_node *vmap_nodes = &single;
+static __read_mostly unsigned int nr_vmap_nodes = 1;
+static __read_mostly unsigned int vmap_zone_size = 1;
+
+static inline unsigned int
+addr_to_node_id(unsigned long addr)
+{
+	return (addr / vmap_zone_size) % nr_vmap_nodes;
+}
+
+static inline struct vmap_node *
+addr_to_node(unsigned long addr)
+{
+	return &vmap_nodes[addr_to_node_id(addr)];
+}
+
 static __always_inline unsigned long
 va_size(struct vmap_area *va)
 {
@@ -803,10 +833,11 @@  unsigned long vmalloc_nr_pages(void)
 }
 
 /* Look up the first VA which satisfies addr < va_end, NULL if none. */
-static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
+static struct vmap_area *
+find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
 {
 	struct vmap_area *va = NULL;
-	struct rb_node *n = vmap_area_root.rb_node;
+	struct rb_node *n = root->rb_node;
 
 	addr = (unsigned long)kasan_reset_tag((void *)addr);
 
@@ -1552,12 +1583,14 @@  __alloc_vmap_area(struct rb_root *root, struct list_head *head,
  */
 static void free_vmap_area(struct vmap_area *va)
 {
+	struct vmap_node *vn = addr_to_node(va->va_start);
+
 	/*
 	 * Remove from the busy tree/list.
 	 */
-	spin_lock(&vmap_area_lock);
-	unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	spin_lock(&vn->busy.lock);
+	unlink_va(va, &vn->busy.root);
+	spin_unlock(&vn->busy.lock);
 
 	/*
 	 * Insert/Merge it back to the free tree/list.
@@ -1600,6 +1633,7 @@  static struct vmap_area *alloc_vmap_area(unsigned long size,
 				int node, gfp_t gfp_mask,
 				unsigned long va_flags)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
 	unsigned long freed;
 	unsigned long addr;
@@ -1645,9 +1679,11 @@  static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = va_flags;
 
-	spin_lock(&vmap_area_lock);
-	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
-	spin_unlock(&vmap_area_lock);
+	vn = addr_to_node(va->va_start);
+
+	spin_lock(&vn->busy.lock);
+	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
+	spin_unlock(&vn->busy.lock);
 
 	BUG_ON(!IS_ALIGNED(va->va_start, align));
 	BUG_ON(va->va_start < vstart);
@@ -1871,26 +1907,61 @@  static void free_unmap_vmap_area(struct vmap_area *va)
 
 struct vmap_area *find_vmap_area(unsigned long addr)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
+	int i, j;
 
-	spin_lock(&vmap_area_lock);
-	va = __find_vmap_area(addr, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	/*
+	 * An addr_to_node_id(addr) converts an address to the index
+	 * of the node where a VA is located. If a VA spans several
+	 * zones and the passed addr is not equal to va->va_start,
+	 * which is uncommon, extra nodes may need to be scanned:
+	 *
+	 *      <--va-->
+	 * -|-----|-----|-----|-----|-
+	 *     1     2     0     1
+	 *
+	 * The VA resides in node 1 whereas it spans nodes 1 and 2.
+	 * If the passed addr is within the second node, extra work
+	 * is needed. This is rare and a corner case, but it still
+	 * has to be covered.
+	 */
+	i = j = addr_to_node_id(addr);
+	do {
+		vn = &vmap_nodes[i];
 
-	return va;
+		spin_lock(&vn->busy.lock);
+		va = __find_vmap_area(addr, &vn->busy.root);
+		spin_unlock(&vn->busy.lock);
+
+		if (va)
+			return va;
+	} while ((i = (i + 1) % nr_vmap_nodes) != j);
+
+	return NULL;
 }
 
 static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
+	int i, j;
 
-	spin_lock(&vmap_area_lock);
-	va = __find_vmap_area(addr, &vmap_area_root);
-	if (va)
-		unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	i = j = addr_to_node_id(addr);
+	do {
+		vn = &vmap_nodes[i];
 
-	return va;
+		spin_lock(&vn->busy.lock);
+		va = __find_vmap_area(addr, &vn->busy.root);
+		if (va)
+			unlink_va(va, &vn->busy.root);
+		spin_unlock(&vn->busy.lock);
+
+		if (va)
+			return va;
+	} while ((i = (i + 1) % nr_vmap_nodes) != j);
+
+	return NULL;
 }
 
 /*** Per cpu kva allocator ***/
@@ -2092,6 +2163,7 @@  static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 static void free_vmap_block(struct vmap_block *vb)
 {
+	struct vmap_node *vn;
 	struct vmap_block *tmp;
 	struct xarray *xa;
 
@@ -2099,9 +2171,10 @@  static void free_vmap_block(struct vmap_block *vb)
 	tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start));
 	BUG_ON(tmp != vb);
 
-	spin_lock(&vmap_area_lock);
-	unlink_va(vb->va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	vn = addr_to_node(vb->va->va_start);
+	spin_lock(&vn->busy.lock);
+	unlink_va(vb->va, &vn->busy.root);
+	spin_unlock(&vn->busy.lock);
 
 	free_vmap_area_noflush(vb->va);
 	kfree_rcu(vb, rcu_head);
@@ -2525,9 +2598,11 @@  static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
 			      unsigned long flags, const void *caller)
 {
-	spin_lock(&vmap_area_lock);
+	struct vmap_node *vn = addr_to_node(va->va_start);
+
+	spin_lock(&vn->busy.lock);
 	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vmap_area_lock);
+	spin_unlock(&vn->busy.lock);
 }
 
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -3715,6 +3790,7 @@  static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
  */
 long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
 	struct vm_struct *vm;
 	char *vaddr;
@@ -3728,8 +3804,11 @@  long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
 	remains = count;
 
-	spin_lock(&vmap_area_lock);
-	va = find_vmap_area_exceed_addr((unsigned long)addr);
+	/* Hooked to node_0 so far. */
+	vn = addr_to_node(0);
+	spin_lock(&vn->busy.lock);
+
+	va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
 	if (!va)
 		goto finished_zero;
 
@@ -3737,7 +3816,7 @@  long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 	if ((unsigned long)addr + remains <= va->va_start)
 		goto finished_zero;
 
-	list_for_each_entry_from(va, &vmap_area_list, list) {
+	list_for_each_entry_from(va, &vn->busy.head, list) {
 		size_t copied;
 
 		if (remains == 0)
@@ -3796,12 +3875,12 @@  long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 	}
 
 finished_zero:
-	spin_unlock(&vmap_area_lock);
+	spin_unlock(&vn->busy.lock);
 	/* zero-fill memory holes */
 	return count - remains + zero_iter(iter, remains);
 finished:
 	/* Nothing remains, or We couldn't copy/zero everything. */
-	spin_unlock(&vmap_area_lock);
+	spin_unlock(&vn->busy.lock);
 
 	return count - remains;
 }
@@ -4135,14 +4214,15 @@  struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	}
 
 	/* insert all vm's */
-	spin_lock(&vmap_area_lock);
 	for (area = 0; area < nr_vms; area++) {
-		insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
+		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
 
+		spin_lock(&vn->busy.lock);
+		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
 		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
+		spin_unlock(&vn->busy.lock);
 	}
-	spin_unlock(&vmap_area_lock);
 
 	/*
 	 * Mark allocated areas as accessible. Do it now as a best-effort
@@ -4253,55 +4333,57 @@  bool vmalloc_dump_obj(void *object)
 {
 	void *objp = (void *)PAGE_ALIGN((unsigned long)object);
 	const void *caller;
-	struct vm_struct *vm;
 	struct vmap_area *va;
+	struct vmap_node *vn;
 	unsigned long addr;
 	unsigned int nr_pages;
+	bool success = false;
 
-	if (!spin_trylock(&vmap_area_lock))
-		return false;
-	va = __find_vmap_area((unsigned long)objp, &vmap_area_root);
-	if (!va) {
-		spin_unlock(&vmap_area_lock);
-		return false;
-	}
+	vn = addr_to_node((unsigned long)objp);
 
-	vm = va->vm;
-	if (!vm) {
-		spin_unlock(&vmap_area_lock);
-		return false;
+	if (spin_trylock(&vn->busy.lock)) {
+		va = __find_vmap_area((unsigned long)objp, &vn->busy.root);
+
+		if (va && va->vm) {
+			addr = (unsigned long)va->vm->addr;
+			caller = va->vm->caller;
+			nr_pages = va->vm->nr_pages;
+			success = true;
+		}
+
+		spin_unlock(&vn->busy.lock);
 	}
-	addr = (unsigned long)vm->addr;
-	caller = vm->caller;
-	nr_pages = vm->nr_pages;
-	spin_unlock(&vmap_area_lock);
-	pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
-		nr_pages, addr, caller);
-	return true;
+
+	if (success)
+		pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
+			nr_pages, addr, caller);
+
+	return success;
 }
 #endif
 
 #ifdef CONFIG_PROC_FS
 static void *s_start(struct seq_file *m, loff_t *pos)
-	__acquires(&vmap_purge_lock)
-	__acquires(&vmap_area_lock)
 {
+	struct vmap_node *vn = addr_to_node(0);
+
 	mutex_lock(&vmap_purge_lock);
-	spin_lock(&vmap_area_lock);
+	spin_lock(&vn->busy.lock);
 
-	return seq_list_start(&vmap_area_list, *pos);
+	return seq_list_start(&vn->busy.head, *pos);
 }
 
 static void *s_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	return seq_list_next(p, &vmap_area_list, pos);
+	struct vmap_node *vn = addr_to_node(0);
+	return seq_list_next(p, &vn->busy.head, pos);
 }
 
 static void s_stop(struct seq_file *m, void *p)
-	__releases(&vmap_area_lock)
-	__releases(&vmap_purge_lock)
 {
-	spin_unlock(&vmap_area_lock);
+	struct vmap_node *vn = addr_to_node(0);
+
+	spin_unlock(&vn->busy.lock);
 	mutex_unlock(&vmap_purge_lock);
 }
 
@@ -4344,9 +4426,11 @@  static void show_purge_info(struct seq_file *m)
 
 static int s_show(struct seq_file *m, void *p)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
 	struct vm_struct *v;
 
+	vn = addr_to_node(0);
 	va = list_entry(p, struct vmap_area, list);
 
 	if (!va->vm) {
@@ -4397,7 +4481,7 @@  static int s_show(struct seq_file *m, void *p)
 	 * As a final step, dump "unpurged" areas.
 	 */
 final:
-	if (list_is_last(&va->list, &vmap_area_list))
+	if (list_is_last(&va->list, &vn->busy.head))
 		show_purge_info(m);
 
 	return 0;
@@ -4428,7 +4512,8 @@  static void vmap_init_free_space(void)
 {
 	unsigned long vmap_start = 1;
 	const unsigned long vmap_end = ULONG_MAX;
-	struct vmap_area *busy, *free;
+	struct vmap_area *free;
+	struct vm_struct *busy;
 
 	/*
 	 *     B     F     B     B     B     F
@@ -4436,12 +4521,12 @@  static void vmap_init_free_space(void)
 	 *  |           The KVA space           |
 	 *  |<--------------------------------->|
 	 */
-	list_for_each_entry(busy, &vmap_area_list, list) {
-		if (busy->va_start - vmap_start > 0) {
+	for (busy = vmlist; busy; busy = busy->next) {
+		if ((unsigned long) busy->addr - vmap_start > 0) {
 			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
 			if (!WARN_ON_ONCE(!free)) {
 				free->va_start = vmap_start;
-				free->va_end = busy->va_start;
+				free->va_end = (unsigned long) busy->addr;
 
 				insert_vmap_area_augment(free, NULL,
 					&free_vmap_area_root,
@@ -4449,7 +4534,7 @@  static void vmap_init_free_space(void)
 			}
 		}
 
-		vmap_start = busy->va_end;
+		vmap_start = (unsigned long) busy->addr + busy->size;
 	}
 
 	if (vmap_end - vmap_start > 0) {
@@ -4465,9 +4550,23 @@  static void vmap_init_free_space(void)
 	}
 }
 
+static void vmap_init_nodes(void)
+{
+	struct vmap_node *vn;
+	int i;
+
+	for (i = 0; i < nr_vmap_nodes; i++) {
+		vn = &vmap_nodes[i];
+		vn->busy.root = RB_ROOT;
+		INIT_LIST_HEAD(&vn->busy.head);
+		spin_lock_init(&vn->busy.lock);
+	}
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
+	struct vmap_node *vn;
 	struct vm_struct *tmp;
 	int i;
 
@@ -4489,6 +4588,11 @@  void __init vmalloc_init(void)
 		xa_init(&vbq->vmap_blocks);
 	}
 
+	/*
+	 * Set up nodes before importing the vmlist.
+	 */
+	vmap_init_nodes();
+
 	/* Import existing vmlist entries. */
 	for (tmp = vmlist; tmp; tmp = tmp->next) {
 		va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
@@ -4498,7 +4602,9 @@  void __init vmalloc_init(void)
 		va->va_start = (unsigned long)tmp->addr;
 		va->va_end = va->va_start + tmp->size;
 		va->vm = tmp;
-		insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+
+		vn = addr_to_node(va->va_start);
+		insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
 	}
 
 	/*