[v3,00/11] Mitigate a vmap lock contention v3

Message ID 20240102184633.748113-1-urezki@gmail.com
Uladzislau Rezki Jan. 2, 2024, 6:46 p.m. UTC
  This is v3. It is based on 6.7.0-rc8.

1. Motivation

- Offload the global vmap locks so the code scales with the number of CPUs;
- If possible and there is agreement, we can remove the "per-CPU kva allocator"
  to make the vmap code simpler;
- There were complaints from XFS folks that vmalloc can be contended
  on their workloads.

2. Design(high level overview)

We introduce an effective vmap node logic. A node behaves as an independent
entity and serves an allocation request directly (if possible) from its pool.
That way it bypasses the global vmap space, which is protected by its own lock.

Access to the pools is serialized per CPU. The number of nodes is equal to
the number of CPUs in the system. Please note the upper threshold is bound to
128 nodes.
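
The sizing rule above can be sketched as a simplified userspace model (the structure
layout, field names and `init_nodes()` helper are illustrative only; the real
per-node structures and locking live in mm/vmalloc.c and differ):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 128

/* Simplified model of a vmap node: each node has its own lock and pool,
 * so allocations on different CPUs do not contend on one global lock. */
struct vmap_node {
	int lock;        /* stands in for a per-node spinlock */
	size_t pool_len; /* number of cached vmap_area objects */
};

static struct vmap_node nodes[MAX_NODES];
static unsigned int nr_nodes;

/* The number of nodes equals the number of CPUs, capped at 128. */
static unsigned int init_nodes(unsigned int nr_cpus)
{
	nr_nodes = nr_cpus < MAX_NODES ? nr_cpus : MAX_NODES;
	return nr_nodes;
}
```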

Pools are size-segregated and populated based on system demand. The maximum
alloc request that can be stored in segregated storage is 256 pages. The
lazy drain path decays a pool by 25% as a first step, and as a second step
repopulates it with freshly freed VAs for reuse instead of returning them
to the global space.
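
The 25% decay step can be illustrated with a minimal sketch (the `decay_pool()`
helper name is hypothetical; the in-kernel logic operates on lists of vmap_area
objects rather than a plain counter):

```c
#include <assert.h>
#include <stddef.h>

/* First step of the lazy drain path: shrink a pool by 25%, keeping
 * 75% of the cached entries for reuse. Integer math throughout. */
static size_t decay_pool(size_t pool_len)
{
	return pool_len - pool_len / 4;
}
```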

When a VA is obtained (alloc path), it is stored in a separate node. The
va->va_start address is converted to the node where it should be placed and
reside. Doing so balances VAs across the nodes, and as a result access becomes
scalable. The addr_to_node() function performs the address-to-node conversion.

The vmap space is divided into fixed-size segments of 16 pages each. That way
any address can be associated with a segment number. The number of nodes is
equal to num_possible_cpus() but not greater than 128. The numbering starts
from 0. See below how an address is converted:

/* Map an address to the ID of the node that owns it. */
static inline unsigned int
addr_to_node_id(unsigned long addr)
{
	/* zone_size is the segment size (16 pages); nr_nodes <= 128. */
	return (addr / zone_size) % nr_nodes;
}

On the free path, a VA can easily be found by converting its "va_start" address
to the node it resides in. It is moved from the "busy" to the "lazy" data structure.
Later on, as noted earlier, the lazy kworker decays each node pool and repopulates
it with fresh incoming VAs. Please note, a VA is returned to the node that served
the alloc request.
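
The alloc/free symmetry relies on the same address-to-node conversion on both
paths: whatever node an address hashed to at alloc time, the free path resolves
the same node. A minimal userspace check (the zone_size and nr_nodes values are
illustrative, not the series' actual configuration):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
static const unsigned long zone_size = 16 * PAGE_SIZE; /* 16 pages per segment */
static const unsigned int nr_nodes = 64;               /* e.g. a 64-CPU system */

/* Same conversion as in the cover letter: segment index modulo node count. */
static unsigned int addr_to_node_id(unsigned long addr)
{
	return (addr / zone_size) % nr_nodes;
}
```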

3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor

sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
 94.41%     0.89%  [kernel]        [k] _raw_spin_lock
 93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
 76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
 72.96%     0.81%  [kernel]        [k] alloc_vmap_area
 56.94%     0.00%  [kernel]        [k] __get_vm_area_node
 41.95%     0.00%  [kernel]        [k] vmalloc
 37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
 35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
 35.17%     0.00%  [kernel]        [k] ret_from_fork
 35.17%     0.00%  [kernel]        [k] kthread
 35.08%     0.00%  [test_vmalloc]  [k] test_func
 34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
 28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
 23.53%     0.25%  [kernel]        [k] vfree.part.0
 21.72%     0.00%  [kernel]        [k] remove_vm_area
 20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
 82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
 63.36%     0.02%  [kernel]        [k] vmalloc
 63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
 30.42%     4.46%  [kernel]        [k] vfree.part.0
 28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
 27.28%     0.19%  [kernel]        [k] __get_vm_area_node
 26.13%     1.50%  [kernel]        [k] alloc_vmap_area
 21.72%    21.67%  [kernel]        [k] clear_page_rep
 19.51%     2.43%  [kernel]        [k] _raw_spin_lock
 16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
 13.40%     2.07%  [kernel]        [k] free_unref_page
 10.62%     0.01%  [kernel]        [k] remove_vm_area
  9.02%     8.73%  [kernel]        [k] insert_vmap_area
  8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%     0.00%  [kernel]        [k] ret_from_fork
  8.94%     0.00%  [kernel]        [k] kthread
  8.29%     0.00%  [test_vmalloc]  [k] test_func
  7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%     4.73%  [kernel]        [k] purge_vmap_node
  4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>

This confirms that native_queued_spin_lock_slowpath goes down to
16.51% from 93.07%.

The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$

4. Changelog

v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@gmail.com/

Delta v2 -> v3:
  - address review comments from v2 feedback;
  - switch from the pre-fetch chunk logic to less complex size-based pools.

Baoquan He (1):
  mm/vmalloc: remove vmap_area_list

Uladzislau Rezki (Sony) (10):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Remove global vmap_area_root rb-tree
  mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  mm: vmalloc: Offload free_vmap_area_lock lock
  mm: vmalloc: Support multiple nodes in vread_iter
  mm: vmalloc: Support multiple nodes in vmallocinfo
  mm: vmalloc: Set nr_nodes based on CPUs in a system
  mm: vmalloc: Add a shrinker to drain vmap pools

 .../admin-guide/kdump/vmcoreinfo.rst          |    8 +-
 arch/arm64/kernel/crash_core.c                |    1 -
 arch/riscv/kernel/crash_core.c                |    1 -
 include/linux/vmalloc.h                       |    1 -
 kernel/crash_core.c                           |    4 +-
 kernel/kallsyms_selftest.c                    |    1 -
 mm/nommu.c                                    |    2 -
 mm/vmalloc.c                                  | 1049 ++++++++++++-----
 8 files changed, 786 insertions(+), 281 deletions(-)
  

Comments

Uladzislau Rezki Feb. 22, 2024, 8:35 a.m. UTC | #1
Hello, Folk!

> This is v3. It is based on the 6.7.0-rc8.
> 
> 1. Motivation
> 
> - Offload global vmap locks making it scaled to number of CPUS;
> - If possible and there is an agreement, we can remove the "Per cpu kva allocator"
>   to make the vmap code to be more simple;
> - There were complains from XFS folk that a vmalloc might be contented
>   on the their workloads.
> 
> 2. Design(high level overview)
> 
> We introduce an effective vmap node logic. A node behaves as independent
> entity to serve an allocation request directly(if possible) from its pool.
> That way it bypasses a global vmap space that is protected by its own lock.
> 
> An access to pools are serialized by CPUs. Number of nodes are equal to
> number of CPUs in a system. Please note the high threshold is bound to
> 128 nodes.
> 
> Pools are size segregated and populated based on system demand. The maximum
> alloc request that can be stored into a segregated storage is 256 pages. The
> lazily drain path decays a pool by 25% as a first step and as second populates
> it by fresh freed VAs for reuse instead of returning them into a global space.
> 
> When a VA is obtained(alloc path), it is stored in separate nodes. A va->va_start
> address is converted into a correct node where it should be placed and resided.
> Doing so we balance VAs across the nodes as a result an access becomes scalable.
> The addr_to_node() function does a proper address conversion to a correct node.
> 
> A vmap space is divided on segments with fixed size, it is 16 pages. That way
> any address can be associated with a segment number. Number of segments are
> equal to num_possible_cpus() but not grater then 128. The numeration starts
> from 0. See below how it is converted:
> 
> static inline unsigned int
> addr_to_node_id(unsigned long addr)
> {
> 	return (addr / zone_size) % nr_nodes;
> }
> 
> On a free path, a VA can be easily found by converting its "va_start" address
> to a certain node it resides. It is moved from "busy" data to "lazy" data structure.
> Later on, as noted earlier, the lazy kworker decays each node pool and populates it
> by fresh incoming VAs. Please note, a VA is returned to a node that did an alloc
> request.
> 
> 3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
> 
> sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> 
> <default perf>
>  94.41%     0.89%  [kernel]        [k] _raw_spin_lock
>  93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
>  76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
>  72.96%     0.81%  [kernel]        [k] alloc_vmap_area
>  56.94%     0.00%  [kernel]        [k] __get_vm_area_node
>  41.95%     0.00%  [kernel]        [k] vmalloc
>  37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
>  35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
>  35.17%     0.00%  [kernel]        [k] ret_from_fork
>  35.17%     0.00%  [kernel]        [k] kthread
>  35.08%     0.00%  [test_vmalloc]  [k] test_func
>  34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
>  28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
>  23.53%     0.25%  [kernel]        [k] vfree.part.0
>  21.72%     0.00%  [kernel]        [k] remove_vm_area
>  20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
>   2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
> <default perf>
>    vs
> <patch-series perf>
>  82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
>  63.36%     0.02%  [kernel]        [k] vmalloc
>  63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
>  30.42%     4.46%  [kernel]        [k] vfree.part.0
>  28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
>  27.28%     0.19%  [kernel]        [k] __get_vm_area_node
>  26.13%     1.50%  [kernel]        [k] alloc_vmap_area
>  21.72%    21.67%  [kernel]        [k] clear_page_rep
>  19.51%     2.43%  [kernel]        [k] _raw_spin_lock
>  16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
>  13.40%     2.07%  [kernel]        [k] free_unref_page
>  10.62%     0.01%  [kernel]        [k] remove_vm_area
>   9.02%     8.73%  [kernel]        [k] insert_vmap_area
>   8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
>   8.94%     0.00%  [kernel]        [k] ret_from_fork
>   8.94%     0.00%  [kernel]        [k] kthread
>   8.29%     0.00%  [test_vmalloc]  [k] test_func
>   7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
>   5.30%     4.73%  [kernel]        [k] purge_vmap_node
>   4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
> <patch-series perf>
> 
> confirms that a native_queued_spin_lock_slowpath goes down to
> 16.51% percent from 93.07%.
> 
> The throughput is ~12x higher:
> 
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
> 
> real    10m51.271s
> user    0m0.013s
> sys     0m0.187s
> urezki@pc638:~$
> 
> urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
> 
> real    0m51.301s
> user    0m0.015s
> sys     0m0.040s
> urezki@pc638:~$
> 
> 4. Changelog
> 
> v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
> v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@gmail.com/
> 
> Delta v2 -> v3:
>   - fix comments from v2 feedback;
>   - switch from pre-fetch chunk logic to a less complex size based pools.
> 
> Baoquan He (1):
>   mm/vmalloc: remove vmap_area_list
> 
> Uladzislau Rezki (Sony) (10):
>   mm: vmalloc: Add va_alloc() helper
>   mm: vmalloc: Rename adjust_va_to_fit_type() function
>   mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
>   mm: vmalloc: Remove global vmap_area_root rb-tree
>   mm: vmalloc: Remove global purge_vmap_area_root rb-tree
>   mm: vmalloc: Offload free_vmap_area_lock lock
>   mm: vmalloc: Support multiple nodes in vread_iter
>   mm: vmalloc: Support multiple nodes in vmallocinfo
>   mm: vmalloc: Set nr_nodes based on CPUs in a system
>   mm: vmalloc: Add a shrinker to drain vmap pools
> 
>  .../admin-guide/kdump/vmcoreinfo.rst          |    8 +-
>  arch/arm64/kernel/crash_core.c                |    1 -
>  arch/riscv/kernel/crash_core.c                |    1 -
>  include/linux/vmalloc.h                       |    1 -
>  kernel/crash_core.c                           |    4 +-
>  kernel/kallsyms_selftest.c                    |    1 -
>  mm/nommu.c                                    |    2 -
>  mm/vmalloc.c                                  | 1049 ++++++++++++-----
>  8 files changed, 786 insertions(+), 281 deletions(-)
> 
> -- 
> 2.39.2
> 
There is one thing I still have to clarify, which remains open for me.

Test machine:
  QEMU x86_64 system
  64 CPUs
  64G of memory

test suite:
  test_vmalloc.sh

environment:
  mm-unstable, branch: next-20240220, where this series
  is located. On top of it I locally added Suren Baghdasaryan's
  memory allocation profiling v3 for a better understanding of memory
  usage.

Before running the test, the condition is as below:

urezki@pc638:~$ sort -h /proc/allocinfo
 27.2MiB     6970 mm/memory.c:1122 module:memory func:folio_prealloc
 79.1MiB    20245 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
  112MiB     8689 mm/slub.c:2202 module:slub func:alloc_slab_page
  122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172         936       63618           0         134       63236
Swap:              0           0           0
urezki@pc638:~$

The test suite stresses the vmap/vmalloc layer by creating workers which do
alloc/free in a tight loop, i.e. it is considered extreme. Below, three
identical tests were done with only one difference: 64, 128 and 256 kworkers:

1) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64

urezki@pc638:~$ sort -h /proc/allocinfo
 80.1MiB    20518 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
  122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
  153MiB    39048 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
  178MiB    13259 mm/slub.c:2202 module:slub func:alloc_slab_page
  350MiB    89656 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        1417       63054           0         298       62755
Swap:              0           0           0
urezki@pc638:~$

2) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128

urezki@pc638:~$ sort -h /proc/allocinfo
  122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext 
  154MiB    39440 mm/filemap.c:1919 module:filemap func:__filemap_get_folio 
  196MiB    14038 mm/slub.c:2202 module:slub func:alloc_slab_page 
 1.20GiB   315655 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        2556       61914           0         302       61616
Swap:              0           0           0
urezki@pc638:~$

3) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256

urezki@pc638:~$ sort -h /proc/allocinfo
  127MiB    32565 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
  197MiB    50506 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
  278MiB    18519 mm/slub.c:2202 module:slub func:alloc_slab_page
 5.36GiB  1405072 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        6741       57652           0         394       57431
Swap:              0           0           0
urezki@pc638:~$

pagetable_alloc - increases as soon as higher pressure is applied by
increasing the number of workers. Running the same number of jobs on a
subsequent run does not increase it; it stays at the same level as before.

/**
 * pagetable_alloc - Allocate pagetables
 * @gfp:    GFP flags
 * @order:  desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}

Could you please comment on it? Or do you have any thoughts? Is it expected?
Are page tables ever shrunk?

/proc/slabinfo does not show unusually high "active" or "number of objects"
counts for any cache.

/proc/meminfo - "VmallocUsed" stays low after those 3 tests.

I have checked it with KASAN and KMEMLEAK and I do not see any issues.

Thank you for the help!

--
Uladzislau Rezki
  
Pedro Falcato Feb. 22, 2024, 11:15 p.m. UTC | #2
Hi,

On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> Hello, Folk!
>
>[...]
> pagetable_alloc - gets increased as soon as a higher pressure is applied by
> increasing number of workers. Running same number of jobs on a next run
> does not increase it and stays on same level as on previous.
>
> /**
>  * pagetable_alloc - Allocate pagetables
>  * @gfp:    GFP flags
>  * @order:  desired pagetable order
>  *
>  * pagetable_alloc allocates memory for page tables as well as a page table
>  * descriptor to describe that memory.
>  *
>  * Return: The ptdesc describing the allocated page tables.
>  */
> static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> {
>         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
>
>         return page_ptdesc(page);
> }
>
> Could you please comment on it? Or do you have any thought? Is it expected?
> Is a page-table ever shrink?

It's my understanding that the vunmap_range helpers don't actively
free page tables; they just clear PTEs. munmap does free them in
mmap.c:free_pgtables; maybe something similar could be worked up for
vmalloc too.
I would not be surprised if the memory increase you're seeing is more
or less correlated to the maximum vmalloc footprint throughout the
whole test.
  
Uladzislau Rezki Feb. 23, 2024, 9:34 a.m. UTC | #3
On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> Hi,
> 
> On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > Hello, Folk!
> >
> >[...]
> > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > increasing number of workers. Running same number of jobs on a next run
> > does not increase it and stays on same level as on previous.
> >
> > /**
> >  * pagetable_alloc - Allocate pagetables
> >  * @gfp:    GFP flags
> >  * @order:  desired pagetable order
> >  *
> >  * pagetable_alloc allocates memory for page tables as well as a page table
> >  * descriptor to describe that memory.
> >  *
> >  * Return: The ptdesc describing the allocated page tables.
> >  */
> > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > {
> >         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> >
> >         return page_ptdesc(page);
> > }
> >
> > Could you please comment on it? Or do you have any thought? Is it expected?
> > Is a page-table ever shrink?
> 
> It's my understanding that the vunmap_range helpers don't actively
> free page tables, they just clear PTEs. munmap does free them in
> mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> too.
>
Right. I see that for user space, pgtables are removed. There was
work on it.

>
> I would not be surprised if the memory increase you're seeing is more
> or less correlated to the maximum vmalloc footprint throughout the
> whole test.
> 
Yes, the vmalloc footprint follows the memory usage. Some use cases
map a lot of memory.

Thanks for the input!

--
Uladzislau Rezki
  
Baoquan He Feb. 23, 2024, 10:26 a.m. UTC | #4
On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > Hi,
> > 
> > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > Hello, Folk!
> > >
> > >[...]
> > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > increasing number of workers. Running same number of jobs on a next run
> > > does not increase it and stays on same level as on previous.
> > >
> > > /**
> > >  * pagetable_alloc - Allocate pagetables
> > >  * @gfp:    GFP flags
> > >  * @order:  desired pagetable order
> > >  *
> > >  * pagetable_alloc allocates memory for page tables as well as a page table
> > >  * descriptor to describe that memory.
> > >  *
> > >  * Return: The ptdesc describing the allocated page tables.
> > >  */
> > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > {
> > >         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > >
> > >         return page_ptdesc(page);
> > > }
> > >
> > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > Is a page-table ever shrink?
> > 
> > It's my understanding that the vunmap_range helpers don't actively
> > free page tables, they just clear PTEs. munmap does free them in
> > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > too.
> >
> Right. I see that for a user space, pgtables are removed. There was a
> work on it.
> 
> >
> > I would not be surprised if the memory increase you're seeing is more
> > or less correlated to the maximum vmalloc footprint throughout the
> > whole test.
> > 
> Yes, the vmalloc footprint follows the memory usage. Some uses cases
> map lot of memory.

The 'nr_threads=256' testing may be too radical. I ran the test on
a bare metal machine as below; it's still running and hanging there after
30 minutes. I did this after system boot. I am looking for other
machines with more processors.

[root@dell-r640-068 ~]# nproc 
64
[root@dell-r640-068 ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:           187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
Swap:          4.0Gi          0B       4.0Gi
[root@dell-r640-068 ~]# 

[root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
  
Uladzislau Rezki Feb. 23, 2024, 11:06 a.m. UTC | #5
> On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > Hi,
> > > 
> > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > Hello, Folk!
> > > >
> > > >[...]
> > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > increasing number of workers. Running same number of jobs on a next run
> > > > does not increase it and stays on same level as on previous.
> > > >
> > > > /**
> > > >  * pagetable_alloc - Allocate pagetables
> > > >  * @gfp:    GFP flags
> > > >  * @order:  desired pagetable order
> > > >  *
> > > >  * pagetable_alloc allocates memory for page tables as well as a page table
> > > >  * descriptor to describe that memory.
> > > >  *
> > > >  * Return: The ptdesc describing the allocated page tables.
> > > >  */
> > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > {
> > > >         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > >
> > > >         return page_ptdesc(page);
> > > > }
> > > >
> > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > Is a page-table ever shrink?
> > > 
> > > It's my understanding that the vunmap_range helpers don't actively
> > > free page tables, they just clear PTEs. munmap does free them in
> > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > too.
> > >
> > Right. I see that for a user space, pgtables are removed. There was a
> > work on it.
> > 
> > >
> > > I would not be surprised if the memory increase you're seeing is more
> > > or less correlated to the maximum vmalloc footprint throughout the
> > > whole test.
> > > 
> > Yes, the vmalloc footprint follows the memory usage. Some uses cases
> > map lot of memory.
> 
> The 'nr_threads=256' testing may be too radical. I took the test on
> a bare metal machine as below, it's still running and hang there after
> 30 minutes. I did this after system boot. I am looking for other
> machines with more processors.
> 
> [root@dell-r640-068 ~]# nproc 
> 64
> [root@dell-r640-068 ~]# free -h
>                total        used        free      shared  buff/cache   available
> Mem:           187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
> Swap:          4.0Gi          0B       4.0Gi
> [root@dell-r640-068 ~]# 
> 
> [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> Run the test with following parameters: run_test_mask=127 nr_threads=256
> 
Agree, nr_threads=256 is way too radical :) Mine took 50 minutes to
complete. So wait more :)


--
Uladzislau Rezki
  
Baoquan He Feb. 23, 2024, 3:57 p.m. UTC | #6
On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> > On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > > Hi,
> > > > 
> > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > Hello, Folk!
> > > > >
> > > > >[...]
> > > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > > increasing number of workers. Running same number of jobs on a next run
> > > > > does not increase it and stays on same level as on previous.
> > > > >
> > > > > /**
> > > > >  * pagetable_alloc - Allocate pagetables
> > > > >  * @gfp:    GFP flags
> > > > >  * @order:  desired pagetable order
> > > > >  *
> > > > >  * pagetable_alloc allocates memory for page tables as well as a page table
> > > > >  * descriptor to describe that memory.
> > > > >  *
> > > > >  * Return: The ptdesc describing the allocated page tables.
> > > > >  */
> > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > > {
> > > > >         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > >
> > > > >         return page_ptdesc(page);
> > > > > }
> > > > >
> > > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > > Is a page-table ever shrink?
> > > > 
> > > > It's my understanding that the vunmap_range helpers don't actively
> > > > free page tables, they just clear PTEs. munmap does free them in
> > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > > too.
> > > >
> > > Right. I see that for a user space, pgtables are removed. There was a
> > > work on it.
> > > 
> > > >
> > > > I would not be surprised if the memory increase you're seeing is more
> > > > or less correlated to the maximum vmalloc footprint throughout the
> > > > whole test.
> > > > 
> > > Yes, the vmalloc footprint follows the memory usage. Some uses cases
> > > map lot of memory.
> > 
> > The 'nr_threads=256' testing may be too radical. I took the test on
> > a bare metal machine as below, it's still running and hang there after
> > 30 minutes. I did this after system boot. I am looking for other
> > machines with more processors.
> > 
> > [root@dell-r640-068 ~]# nproc 
> > 64
> > [root@dell-r640-068 ~]# free -h
> >                total        used        free      shared  buff/cache   available
> > Mem:           187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
> > Swap:          4.0Gi          0B       4.0Gi
> > [root@dell-r640-068 ~]# 
> > 
> > [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > Run the test with following parameters: run_test_mask=127 nr_threads=256
> > 
> Agree, nr_threads=256 is a way radical :) Mine took 50 minutes to
> complete. So wait more :)

Right, mine could take a similar time to finish that. I got a machine
with 288 CPUs; let me see if I can get some clues. When I went through the
code flow, I suddenly realized it could be drain_vmap_area_work which is the
bottleneck, causing the tremendous page table page cost.

On your system, there are 64 CPUs. Then

nr_lazy_max = lazy_max_pages() = 7*32M = 224M;

So with nr_threads=128 or 256, it easily gets to nr_lazy_max and
triggers drain_vmap_work(). When CPU resource is very limited, the lazy
vmap purging will be very slow, while the alloc/free in lib/test_vmalloc.c
go far faster and more easily than vmap reclaiming. If an old va is not
reused, a new va is allocated and keeps extending; new page tables surely
need to be created to cover them.

I will test on the system with 288 CPUs and will update when testing
is done.
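
The 224 MB figure above follows from the in-kernel heuristic, which scales with
the position of the highest set bit of the CPU count. A hedged reconstruction
(the real lazy_max_pages() in mm/vmalloc.c returns a page count; shown here in
bytes for clarity, and `fls_u()` is a portable stand-in for the kernel's fls()):

```c
#include <assert.h>

/* fls(): 1-based index of the highest set bit; fls(64) == 7. */
static unsigned int fls_u(unsigned int x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Mirrors the lazy_max_pages() heuristic: fls(cpus) * 32MB, in bytes here. */
static unsigned long lazy_max_bytes(unsigned int nr_cpus)
{
	return (unsigned long)fls_u(nr_cpus) * 32UL * 1024 * 1024;
}
```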
  
Uladzislau Rezki Feb. 23, 2024, 6:55 p.m. UTC | #7
On Fri, Feb 23, 2024 at 11:57:25PM +0800, Baoquan He wrote:
> On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> > > On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > > > > Hi,
> > > > > 
> > > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > Hello, Folk!
> > > > > >
> > > > > >[...]
> > > > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by
> > > > > > increasing number of workers. Running same number of jobs on a next run
> > > > > > does not increase it and stays on same level as on previous.
> > > > > >
> > > > > > /**
> > > > > >  * pagetable_alloc - Allocate pagetables
> > > > > >  * @gfp:    GFP flags
> > > > > >  * @order:  desired pagetable order
> > > > > >  *
> > > > > >  * pagetable_alloc allocates memory for page tables as well as a page table
> > > > > >  * descriptor to describe that memory.
> > > > > >  *
> > > > > >  * Return: The ptdesc describing the allocated page tables.
> > > > > >  */
> > > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
> > > > > > {
> > > > > >         struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > > >
> > > > > >         return page_ptdesc(page);
> > > > > > }
> > > > > >
> > > > > > Could you please comment on it? Or do you have any thought? Is it expected?
> > > > > > Is a page-table ever shrink?
> > > > > 
> > > > > It's my understanding that the vunmap_range helpers don't actively
> > > > > free page tables, they just clear PTEs. munmap does free them in
> > > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> > > > > too.
> > > > >
> > > > Right. I see that for a user space, pgtables are removed. There was a
> > > > work on it.
> > > > 
> > > > >
> > > > > I would not be surprised if the memory increase you're seeing is more
> > > > > or less correlated to the maximum vmalloc footprint throughout the
> > > > > whole test.
> > > > > 
> > > > Yes, the vmalloc footprint follows the memory usage. Some use cases
> > > > map a lot of memory.
> > > 
> > > The 'nr_threads=256' testing may be too radical. I ran the test on
> > > a bare-metal machine as below; it is still running and hanging there
> > > after 30 minutes. I did this right after system boot. I am looking for
> > > other machines with more processors.
> > > 
> > > [root@dell-r640-068 ~]# nproc 
> > > 64
> > > [root@dell-r640-068 ~]# free -h
> > >                total        used        free      shared  buff/cache   available
> > > Mem:           187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
> > > Swap:          4.0Gi          0B       4.0Gi
> > > [root@dell-r640-068 ~]# 
> > > 
> > > [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > > Run the test with following parameters: run_test_mask=127 nr_threads=256
> > > 
> > Agree, nr_threads=256 is way too radical :) Mine took 50 minutes to
> > complete. So wait a bit more :)
> 
> Right, mine could take a similar time to finish. I got a machine
> with 288 cpus, see if I can get some clues there. When I went through
> the code flow, I suddenly realized it could be drain_vmap_area_work that
> is the bottleneck and causes the tremendous page-table page cost.
> 
> On your system there are 64 cpus, so
> 
> nr_lazy_max = lazy_max_pages() = 7*32M = 224M;
> 
> So with nr_threads=128 or 256, it's very easy to reach nr_lazy_max
> and trigger drain_vmap_work(). When CPU resources are very limited, the
> lazy vmap purging will be very slow, while the alloc/free in lib/test_vmalloc.c
> go far faster and more easily than vmap reclaiming. If old VAs are not
> reused, new VAs are allocated and keep extending the range, so new page
> tables surely need to be created to cover them.
> 
> I will run the test on the system with 288 cpus and will update when
> testing is done.
> 
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 12caa794abd4..a90c5393d85f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1754,6 +1754,8 @@ size_to_va_pool(struct vmap_node *vn, unsigned long size)
 	return NULL;
 }
 
+static unsigned long lazy_max_pages(void);
+
 static bool
 node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
 {
@@ -1763,6 +1765,9 @@ node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
 	if (!vp)
 		return false;
 
+	if (READ_ONCE(vp->len) > lazy_max_pages())
+		return false;
+
 	spin_lock(&n->pool_lock);
 	list_add(&va->list, &vp->head);
 	WRITE_ONCE(vp->len, vp->len + 1);
@@ -2170,9 +2175,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 				INIT_WORK(&vn->purge_work, purge_vmap_node);
 
 				if (cpumask_test_cpu(i, cpu_online_mask))
-					schedule_work_on(i, &vn->purge_work);
+					queue_work_on(i, system_highpri_wq, &vn->purge_work);
 				else
-					schedule_work(&vn->purge_work);
+					queue_work(system_highpri_wq, &vn->purge_work);
 
 				nr_purge_helpers--;
 			} else {
<snip>

We need this. It settles things back to normal PTE usage. Tomorrow I
will check whether the cache length should be limited. I tested on my
64-CPU system with the radical 256 kworkers. It looks good.

--
Uladzislau Rezki
  
Baoquan He Feb. 28, 2024, 9:27 a.m. UTC | #8
On 02/23/24 at 07:55pm, Uladzislau Rezki wrote:
> [...]
> 
> We need this. It settles things back to normal PTE usage. Tomorrow I
> will check whether the cache length should be limited. I tested on my
> 64-CPU system with the radical 256 kworkers. It looks good.

I finally finished the testing without and with your above improvement
patch. Testing was done on a system with 128 CPUs; the system with 288
CPUs is not available because of a console connection problem. The log is
attached here. In some testing after rebooting, a run could take more than
30 minutes; I am not sure whether that was caused by my messy code changes.
I finally cleaned them all up, took a clean linux-next to test, and then
applied your above draft code.
[root@dell-per6515-03 linux]# nproc 
128
[root@dell-per6515-03 linux]# free -h
               total        used        free      shared  buff/cache   available
Mem:           124Gi       2.6Gi       122Gi        21Mi       402Mi       122Gi
Swap:          4.0Gi          0B       4.0Gi

1)linux-next kernel w/o improving code from Uladzislau
-------------------------------------------------------
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real	4m28.018s
user	0m0.015s
sys	0m4.712s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
    21405696     5226 mm/memory.c:1122 func:folio_prealloc 
    26199936     7980 kernel/fork.c:309 func:alloc_thread_stack_node 
    29822976     7281 mm/readahead.c:247 func:page_cache_ra_unbounded 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
   107638784     6320 mm/readahead.c:468 func:ra_alloc_folio 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   266797056    65136 include/linux/mm.h:2848 func:pagetable_alloc 
   507617280    32796 mm/slub.c:2305 func:alloc_slab_page 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
Run the test with following parameters: run_test_mask=127 nr_threads=128
Done.
Check the kernel ring buffer to see the summary.

real	6m19.328s
user	0m0.005s
sys	0m9.476s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
    21405696     5226 mm/memory.c:1122 func:folio_prealloc 
    26889408     8190 kernel/fork.c:309 func:alloc_thread_stack_node 
    29822976     7281 mm/readahead.c:247 func:page_cache_ra_unbounded 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
   107638784     6320 mm/readahead.c:468 func:ra_alloc_folio 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   550068224    34086 mm/slub.c:2305 func:alloc_slab_page 
   664535040   162240 include/linux/mm.h:2848 func:pagetable_alloc 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 ~]# 
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	19m10.657s
user	0m0.015s
sys	0m20.959s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
    22441984     5479 mm/shmem.c:1634 func:shmem_alloc_folio 
    26758080     8150 kernel/fork.c:309 func:alloc_thread_stack_node 
    35880960     8760 mm/readahead.c:247 func:page_cache_ra_unbounded 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   122355712     7852 mm/readahead.c:468 func:ra_alloc_folio 
   134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   708231168    50309 mm/slub.c:2305 func:alloc_slab_page 
  1107296256   270336 include/linux/mm.h:2848 func:pagetable_alloc 
[root@dell-per6515-03 ~]# 

2)linux-next kernel with improving code from Uladzislau
-----------------------------------------------------
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real	4m27.226s
user	0m0.006s
sys	0m4.709s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38023168     9283 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72228864    17634 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   184176640    10684 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   284700672    69507 include/linux/mm.h:2848 func:pagetable_alloc 
   601427968    36377 mm/slub.c:2305 func:alloc_slab_page 
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
Run the test with following parameters: run_test_mask=127 nr_threads=128
Done.
Check the kernel ring buffer to see the summary.

real	6m16.960s
user	0m0.007s
sys	0m9.465s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38158336     9316 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72220672    17632 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   184504320    10710 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   427884544   104464 include/linux/mm.h:2848 func:pagetable_alloc 
   697311232    45159 mm/slub.c:2305 func:alloc_slab_page
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	21m15.673s
user	0m0.008s
sys	0m20.259s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38158336     9316 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72224768    17633 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   184504320    10710 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   506974208   123773 include/linux/mm.h:2848 func:pagetable_alloc 
   809504768    53621 mm/slub.c:2305 func:alloc_slab_page
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	21m36.580s
user	0m0.012s
sys	0m19.912s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
    38977536     9516 mm/readahead.c:247 func:page_cache_ra_unbounded 
    72273920    17645 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages 
    99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc 
    99895296    97554 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc 
   120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash 
   141033472    34432 mm/percpu-vm.c:95 func:pcpu_alloc_pages 
   186064896    10841 mm/readahead.c:468 func:ra_alloc_folio 
   263192576    64256 mm/page_ext.c:270 func:alloc_page_ext 
   541237248   132138 include/linux/mm.h:2848 func:pagetable_alloc 
   694718464    41216 mm/slub.c:2305 func:alloc_slab_page
  
Uladzislau Rezki Feb. 29, 2024, 10:38 a.m. UTC | #9
> 
> I finally finished the testing without and with your above improvement
> patch. Testing was done on a system with 128 CPUs; the system with 288
> CPUs is not available because of a console connection problem. The log is
> attached here. In some testing after rebooting, a run could take more than
> 30 minutes; I am not sure whether that was caused by my messy code changes.
> I finally cleaned them all up, took a clean linux-next to test, and then
> applied your above draft code.

> [...]
> 
Thank you for testing this. So ~132mb with the patch. I think it looks
good, but I might change the draft version and send out a new version.

Thank you again!

--
Uladzislau Rezki