[v2,2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock

Message ID 6c7f1ac0aeb55faaa46a09108d3999e4595870d9.1679209395.git.lstoakes@gmail.com
State New
Series convert read_kcore(), vread() to use iterators

Commit Message

Lorenzo Stoakes March 19, 2023, 7:09 a.m. UTC
  vmalloc() is, by design, not permitted to be used in atomic context and
already contains components which may sleep, so avoiding spin locks is not
a problem from the perspective of atomic context.

The global vmap_area_lock is held when the red/black tree rooted in
vmap_area_root is accessed and thus is rather long-held and under
potentially high contention. It is likely to be under contention for reads
rather than writes, so replace it with a rwsem.

Each individual vmap_block->lock is likely to be held for less time but
under low contention, so a mutex is not an outrageous choice here.

A subset of test_vmalloc.sh performance results:-

fix_size_alloc_test             0.40%
full_fit_alloc_test		2.08%
long_busy_list_alloc_test	0.34%
random_size_alloc_test		-0.25%
random_size_align_alloc_test	0.06%
...
all tests cycles                0.2%

This represents a tiny reduction in performance that sits barely above
noise.

The reason for making this change is to build a basis for vread() to be
usable asynchronously, thus eliminating the need for a bounce buffer when
copying data to userland in read_kcore() and allowing that to be converted
to an iterator form.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/vmalloc.c | 77 +++++++++++++++++++++++++++-------------------------
 1 file changed, 40 insertions(+), 37 deletions(-)
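
For readers skimming the thread, a minimal sketch of the shape of this
conversion (illustrative only, not the literal patch hunks; it assumes the
standard spinlock, rwsem and mutex primitives):

<snip>
#include <linux/mutex.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>

/*
 * Before (mainline):
 *         static DEFINE_SPINLOCK(vmap_area_lock);
 *         struct vmap_block { spinlock_t lock; ... };
 *
 * After (this patch): sleepable locks, acceptable because vmalloc() may
 * already sleep and is not usable from atomic context anyway.
 */
static DECLARE_RWSEM(vmap_area_lock);
/* struct vmap_block { struct mutex lock; ... }; */

static void walk_vmap_tree_example(void)
{
        down_read(&vmap_area_lock);   /* shared, for read-mostly tree access */
        /* ... red/black tree lookup rooted at vmap_area_root ... */
        up_read(&vmap_area_lock);
}
<snip>

Blocking while these locks are held then becomes permissible, which is what
the vread() rework in this series relies on.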
  

Comments

Andrew Morton March 19, 2023, 8:10 p.m. UTC | #1
On Sun, 19 Mar 2023 07:09:31 +0000 Lorenzo Stoakes <lstoakes@gmail.com> wrote:

> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
> 
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
> 
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
> 
> A subset of test_vmalloc.sh performance results:-
> 
> fix_size_alloc_test             0.40%
> full_fit_alloc_test		2.08%
> long_busy_list_alloc_test	0.34%
> random_size_alloc_test		-0.25%
> random_size_align_alloc_test	0.06%
> ...
> all tests cycles                0.2%
> 
> This represents a tiny reduction in performance that sits barely above
> noise.
> 
> The reason for making this change is to build a basis for vread() to be
> usable asynchronously, this eliminating the need for a bounce buffer when
> copying data to userland in read_kcore() and allowing that to be converted
> to an iterator form.
> 

I'm not understanding the final paragraph.  How and where is vread()
used "asynchronously"?
  
Lorenzo Stoakes March 19, 2023, 8:29 p.m. UTC | #2
On Sun, Mar 19, 2023 at 01:10:47PM -0700, Andrew Morton wrote:
> On Sun, 19 Mar 2023 07:09:31 +0000 Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> >
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> >
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> >
> > A subset of test_vmalloc.sh performance results:-
> >
> > fix_size_alloc_test             0.40%
> > full_fit_alloc_test		2.08%
> > long_busy_list_alloc_test	0.34%
> > random_size_alloc_test		-0.25%
> > random_size_align_alloc_test	0.06%
> > ...
> > all tests cycles                0.2%
> >
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> >
> > The reason for making this change is to build a basis for vread() to be
> > usable asynchronously, this eliminating the need for a bounce buffer when
> > copying data to userland in read_kcore() and allowing that to be converted
> > to an iterator form.
> >
>
> I'm not understanding the final paragraph.  How and where is vread()
> used "asynchronously"?


The basis for saying asynchronous was based on Documentation/filesystems/vfs.rst
describing read_iter() as 'possibly asynchronous read with iov_iter as
destination', and read_iter() is what is (now) invoked when accessing
/proc/kcore.

However I agree this is vague and it is clearer to refer to the fact that we are
now directly writing to user memory and thus wish to avoid spinlocks as we may
need to fault in user memory in doing so.

Would it be ok for you to go ahead and replace that final paragraph with the
below?:-

The reason for making this change is to build a basis for vread() to write
to user memory directly via an iterator; as a result we may cause page
faults during which we must not hold a spinlock. Doing this eliminates the
need for a bounce buffer in read_kcore() and thus permits that to be
converted to also use an iterator, as a read_iter() handler.
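
For illustration, a minimal sketch of what writing via an iterator means here
(the helper name and signature are invented for this sketch; copy_to_iter()
is the existing iov_iter primitive):

<snip>
#include <linux/uio.h>

/*
 * Push a chunk of already-validated vmalloc memory straight into an
 * iov_iter instead of into an intermediate kernel bounce buffer.
 * When the iterator targets userspace, copy_to_iter() may fault in
 * (and sleep on) user pages, which is why no spinlock may be held
 * around the copy.
 */
static size_t vread_chunk_to_iter(struct iov_iter *iter,
                                  const char *vaddr, size_t count)
{
        return copy_to_iter(vaddr, count, iter); /* bytes actually copied */
}
<snip>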
  
Matthew Wilcox March 19, 2023, 8:47 p.m. UTC | #3
On Sun, Mar 19, 2023 at 08:29:16PM +0000, Lorenzo Stoakes wrote:
> The basis for saying asynchronous was based on Documentation/filesystems/vfs.rst
> describing read_iter() as 'possibly asynchronous read with iov_iter as
> destination', and read_iter() is what is (now) invoked when accessing
> /proc/kcore.
> 
> However I agree this is vague and it is clearer to refer to the fact that we are
> now directly writing to user memory and thus wish to avoid spinlocks as we may
> need to fault in user memory in doing so.
> 
> Would it be ok for you to go ahead and replace that final paragraph with the
> below?:-
> 
> The reason for making this change is to build a basis for vread() to write
> to user memory directly via an iterator; as a result we may cause page
> faults during which we must not hold a spinlock. Doing this eliminates the
> need for a bounce buffer in read_kcore() and thus permits that to be
> converted to also use an iterator, as a read_iter() handler.

I'd say the purpose of the iterator is to abstract whether we're
accessing user memory, kernel memory or a pipe, so I'd suggest:

   The reason for making this change is to build a basis for vread() to
   write to memory via an iterator; as a result we may cause page faults
   during which we must not hold a spinlock. Doing this eliminates the
   need for a bounce buffer in read_kcore() and thus permits that to be
   converted to also use an iterator, as a read_iter() handler.

I'm still undecided whether this change is really a good thing.  I
think we have line-of-sight to making vmalloc (and thus kvmalloc)
usable from interrupt context, and this destroys that possibility.

I wonder if we can't do something like prefaulting the page before
taking the spinlock, then use copy_page_to_iter_atomic()
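
A rough sketch of that prefault idea, assuming an atomic (non-faulting) page
copy such as the copy_page_to_iter_atomic() named above exists or is
introduced; fault_in_iov_iter_writeable() is an existing iov_iter helper and
the lock here merely stands in for vmap_area_lock:

<snip>
#include <linux/spinlock.h>
#include <linux/uio.h>

static DEFINE_SPINLOCK(example_vmap_lock);

static bool prefault_then_copy(struct iov_iter *iter, struct page *page,
                               size_t offset, size_t len)
{
        /* Fault the destination pages in while no spinlock is held. */
        if (fault_in_iov_iter_writeable(iter, len))
                return false;        /* some bytes could not be faulted in */

        spin_lock(&example_vmap_lock);
        /* Hypothetical non-faulting copy; it must not sleep under the lock. */
        copy_page_to_iter_atomic(page, offset, len, iter);
        spin_unlock(&example_vmap_lock);
        return true;
}
<snip>

A real implementation would still have to cope with the copy failing under
the lock, since prefaulted pages can be reclaimed again before the spinlock
is taken.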
  
Lorenzo Stoakes March 19, 2023, 9:16 p.m. UTC | #4
On Sun, Mar 19, 2023 at 08:47:14PM +0000, Matthew Wilcox wrote:
> On Sun, Mar 19, 2023 at 08:29:16PM +0000, Lorenzo Stoakes wrote:
> > The basis for saying asynchronous was based on Documentation/filesystems/vfs.rst
> > describing read_iter() as 'possibly asynchronous read with iov_iter as
> > destination', and read_iter() is what is (now) invoked when accessing
> > /proc/kcore.
> >
> > However I agree this is vague and it is clearer to refer to the fact that we are
> > now directly writing to user memory and thus wish to avoid spinlocks as we may
> > need to fault in user memory in doing so.
> >
> > Would it be ok for you to go ahead and replace that final paragraph with the
> > below?:-
> >
> > The reason for making this change is to build a basis for vread() to write
> > to user memory directly via an iterator; as a result we may cause page
> > faults during which we must not hold a spinlock. Doing this eliminates the
> > need for a bounce buffer in read_kcore() and thus permits that to be
> > converted to also use an iterator, as a read_iter() handler.
>
> I'd say the purpose of the iterator is to abstract whether we're
> accessing user memory, kernel memory or a pipe, so I'd suggest:
>
>    The reason for making this change is to build a basis for vread() to
>    write to memory via an iterator; as a result we may cause page faults
>    during which we must not hold a spinlock. Doing this eliminates the
>    need for a bounce buffer in read_kcore() and thus permits that to be
>    converted to also use an iterator, as a read_iter() handler.
>

Thanks, sorry I missed the detail about iterators abstracting the three
different targets there, that is definitely better!

> I'm still undecided whether this change is really a good thing.  I
> think we have line-of-sight to making vmalloc (and thus kvmalloc)
> usable from interrupt context, and this destroys that possibility.
>
> I wonder if we can't do something like prefaulting the page before
> taking the spinlock, then use copy_page_to_iter_atomic()

There are a number of aspects of vmalloc that are not atomic-safe,
e.g. alloc_vmap_area() and vmap_range_noflush() are annotated
might_sleep(), as is vfree().

So I feel that making it safe for atomic context requires a bit more of a
general rework. Given we would be able to revisit lock types at the point
we do that (something that would fit very solidly into the context of any
such change), and given that this patch series establishes that we use an
iterator, I think it is useful to keep this as-is and defer that change
until later.
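
For readers unfamiliar with the annotation, a minimal illustration of what a
might_sleep() designation implies (the function below is invented for the
example):

<snip>
#include <linux/kernel.h>

/*
 * might_sleep() warns (with CONFIG_DEBUG_ATOMIC_SLEEP) if the annotated
 * function is ever entered from atomic context, so paths carrying it,
 * such as alloc_vmap_area() and vfree() today, already assume they may
 * block regardless of which lock type guards the vmap structures.
 */
static void example_sleeping_path(void)
{
        might_sleep();
        /* ... work that may block, e.g. a GFP_KERNEL allocation ... */
}
<snip>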
  
Uladzislau Rezki March 20, 2023, 7:54 a.m. UTC | #5
> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
> 
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
> 
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
> 
> A subset of test_vmalloc.sh performance results:-
> 
> fix_size_alloc_test             0.40%
> full_fit_alloc_test		2.08%
> long_busy_list_alloc_test	0.34%
> random_size_alloc_test		-0.25%
> random_size_align_alloc_test	0.06%
> ...
> all tests cycles                0.2%
> 
> This represents a tiny reduction in performance that sits barely above
> noise.
> 
How important is it to have many simultaneous users of vread()? I do not see a
big reason to switch to mutexes, due to the performance impact and making it
less atomic.

So, how important is this change for you?

--
Uladzislau Rezki
  
Lorenzo Stoakes March 20, 2023, 8:25 a.m. UTC | #6
On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> >
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> >
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> >
> > A subset of test_vmalloc.sh performance results:-
> >
> > fix_size_alloc_test             0.40%
> > full_fit_alloc_test		2.08%
> > long_busy_list_alloc_test	0.34%
> > random_size_alloc_test		-0.25%
> > random_size_align_alloc_test	0.06%
> > ...
> > all tests cycles                0.2%
> >
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> >
> How important to have many simultaneous users of vread()? I do not see a
> big reason to switch into mutexes due to performance impact and making it
> less atomic.

It's less about simultaneous users of vread() and more about being able to write
directly to user memory rather than via a bounce buffer, and not hold a spinlock
over possible page faults.

The performance impact is barely above noise (I got fairly widely varying
results), so I don't think it's really much of a cost at all. I can't imagine
there are many users critically dependent on a sub-single digit % reduction in
speed in vmalloc() allocation.

As I was saying to Willy, the code is already not atomic, or rather needs rework
to become atomic-safe (there is a smattering of might_sleep()'s throughout).

However, given your objection alongside Willy's, let me examine Willy's
suggestion that, instead of doing this, we prefault the user memory in advance
of the vread call.

>
> So, how important for you to have this change?
>

Personally, always very important :)

> --
> Uladzislau Rezki
  
Uladzislau Rezki March 20, 2023, 8:32 a.m. UTC | #7
On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > already contains components which may sleep, so avoiding spin locks is not
> > > a problem from the perspective of atomic context.
> > >
> > > The global vmap_area_lock is held when the red/black tree rooted in
> > > vmap_are_root is accessed and thus is rather long-held and under
> > > potentially high contention. It is likely to be under contention for reads
> > > rather than write, so replace it with a rwsem.
> > >
> > > Each individual vmap_block->lock is likely to be held for less time but
> > > under low contention, so a mutex is not an outrageous choice here.
> > >
> > > A subset of test_vmalloc.sh performance results:-
> > >
> > > fix_size_alloc_test             0.40%
> > > full_fit_alloc_test		2.08%
> > > long_busy_list_alloc_test	0.34%
> > > random_size_alloc_test		-0.25%
> > > random_size_align_alloc_test	0.06%
> > > ...
> > > all tests cycles                0.2%
> > >
> > > This represents a tiny reduction in performance that sits barely above
> > > noise.
> > >
> > How important to have many simultaneous users of vread()? I do not see a
> > big reason to switch into mutexes due to performance impact and making it
> > less atomic.
> 
> It's less about simultaneous users of vread() and more about being able to write
> direct to user memory rather than via a bounce buffer and not hold a spinlock
> over possible page faults.
> 
> The performance impact is barely above noise (I got fairly widely varying
> results), so I don't think it's really much of a cost at all. I can't imagine
> there are many users critically dependent on a sub-single digit % reduction in
> speed in vmalloc() allocation.
> 
> As I was saying to Willy, the code is already not atomic, or rather needs rework
> to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> 
> However, given your objection alongside Willy's, let me examine Willy's
> suggestion that we instead of doing this, prefault the user memory in advance of
> the vread call.
> 
Just a quick perf test shows a regression of around 6%. 10 workers, test_mask is 31:

# default
[  140.349731] All test took worker0=485061693537 cycles
[  140.386065] All test took worker1=486504572954 cycles
[  140.418452] All test took worker2=467204082542 cycles
[  140.435895] All test took worker3=512591010219 cycles
[  140.458316] All test took worker4=448583324125 cycles
[  140.494244] All test took worker5=501018129647 cycles
[  140.518144] All test took worker6=516224787767 cycles
[  140.535472] All test took worker7=442025617137 cycles
[  140.558249] All test took worker8=503337286539 cycles
[  140.590571] All test took worker9=494369561574 cycles

# patch
[  144.464916] All test took worker0=530373399067 cycles
[  144.492904] All test took worker1=522641540924 cycles
[  144.528999] All test took worker2=529711158267 cycles
[  144.552963] All test took worker3=527389011775 cycles
[  144.592951] All test took worker4=529583252449 cycles
[  144.610286] All test took worker5=523605706016 cycles
[  144.627690] All test took worker6=531494777011 cycles
[  144.653046] All test took worker7=527150114726 cycles
[  144.669818] All test took worker8=526599712235 cycles
[  144.693428] All test took worker9=526057490851 cycles

> >
> > So, how important for you to have this change?
> >
> 
> Personally, always very important :)
> 
This is good. Personal opinion always wins :)

--
Uladzislau Rezki
  
Lorenzo Stoakes March 20, 2023, 8:35 a.m. UTC | #8
On Mon, Mar 20, 2023 at 09:32:06AM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> > On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test             0.40%
> > > > full_fit_alloc_test		2.08%
> > > > long_busy_list_alloc_test	0.34%
> > > > random_size_alloc_test		-0.25%
> > > > random_size_align_alloc_test	0.06%
> > > > ...
> > > > all tests cycles                0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > > >
> > > How important to have many simultaneous users of vread()? I do not see a
> > > big reason to switch into mutexes due to performance impact and making it
> > > less atomic.
> >
> > It's less about simultaneous users of vread() and more about being able to write
> > direct to user memory rather than via a bounce buffer and not hold a spinlock
> > over possible page faults.
> >
> > The performance impact is barely above noise (I got fairly widely varying
> > results), so I don't think it's really much of a cost at all. I can't imagine
> > there are many users critically dependent on a sub-single digit % reduction in
> > speed in vmalloc() allocation.
> >
> > As I was saying to Willy, the code is already not atomic, or rather needs rework
> > to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> >
> > However, given your objection alongside Willy's, let me examine Willy's
> > suggestion that we instead of doing this, prefault the user memory in advance of
> > the vread call.
> >
> Just a quick perf tests shows regression around 6%. 10 workers test_mask is 31:
>
> # default
> [  140.349731] All test took worker0=485061693537 cycles
> [  140.386065] All test took worker1=486504572954 cycles
> [  140.418452] All test took worker2=467204082542 cycles
> [  140.435895] All test took worker3=512591010219 cycles
> [  140.458316] All test took worker4=448583324125 cycles
> [  140.494244] All test took worker5=501018129647 cycles
> [  140.518144] All test took worker6=516224787767 cycles
> [  140.535472] All test took worker7=442025617137 cycles
> [  140.558249] All test took worker8=503337286539 cycles
> [  140.590571] All test took worker9=494369561574 cycles
>
> # patch
> [  144.464916] All test took worker0=530373399067 cycles
> [  144.492904] All test took worker1=522641540924 cycles
> [  144.528999] All test took worker2=529711158267 cycles
> [  144.552963] All test took worker3=527389011775 cycles
> [  144.592951] All test took worker4=529583252449 cycles
> [  144.610286] All test took worker5=523605706016 cycles
> [  144.627690] All test took worker6=531494777011 cycles
> [  144.653046] All test took worker7=527150114726 cycles
> [  144.669818] All test took worker8=526599712235 cycles
> [  144.693428] All test took worker9=526057490851 cycles
>

OK ouch, that's worse than I observed! Let me try this prefault approach and
then we can revert to spinlocks.

> > >
> > > So, how important for you to have this change?
> > >
> >
> > Personally, always very important :)
> >
> This is good. Personal opinion always wins :)
>

The heart always wins ;) well, an adaptation here can make everybody's hearts
happy, I think.

> --
> Uladzislau Rezki
  
Lorenzo Stoakes March 20, 2023, 8:40 a.m. UTC | #9
On Sun, Mar 19, 2023 at 09:16:18PM +0000, Lorenzo Stoakes wrote:
> On Sun, Mar 19, 2023 at 08:47:14PM +0000, Matthew Wilcox wrote:
> > I wonder if we can't do something like prefaulting the page before
> > taking the spinlock, then use copy_page_to_iter_atomic()
>
<snip>

On second thoughts, and after discussing with Uladzislau, this seems like a more
sensible approach.

I'm going to respin with prefaulting and revert to the previous locking.
  
Uladzislau Rezki March 20, 2023, 11:20 a.m. UTC | #10
On Mon, Mar 20, 2023 at 08:35:11AM +0000, Lorenzo Stoakes wrote:
> On Mon, Mar 20, 2023 at 09:32:06AM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> > > On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > > already contains components which may sleep, so avoiding spin locks is not
> > > > > a problem from the perspective of atomic context.
> > > > >
> > > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > > potentially high contention. It is likely to be under contention for reads
> > > > > rather than write, so replace it with a rwsem.
> > > > >
> > > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > > under low contention, so a mutex is not an outrageous choice here.
> > > > >
> > > > > A subset of test_vmalloc.sh performance results:-
> > > > >
> > > > > fix_size_alloc_test             0.40%
> > > > > full_fit_alloc_test		2.08%
> > > > > long_busy_list_alloc_test	0.34%
> > > > > random_size_alloc_test		-0.25%
> > > > > random_size_align_alloc_test	0.06%
> > > > > ...
> > > > > all tests cycles                0.2%
> > > > >
> > > > > This represents a tiny reduction in performance that sits barely above
> > > > > noise.
> > > > >
> > > > How important to have many simultaneous users of vread()? I do not see a
> > > > big reason to switch into mutexes due to performance impact and making it
> > > > less atomic.
> > >
> > > It's less about simultaneous users of vread() and more about being able to write
> > > direct to user memory rather than via a bounce buffer and not hold a spinlock
> > > over possible page faults.
> > >
> > > The performance impact is barely above noise (I got fairly widely varying
> > > results), so I don't think it's really much of a cost at all. I can't imagine
> > > there are many users critically dependent on a sub-single digit % reduction in
> > > speed in vmalloc() allocation.
> > >
> > > As I was saying to Willy, the code is already not atomic, or rather needs rework
> > > to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> > >
> > > However, given your objection alongside Willy's, let me examine Willy's
> > > suggestion that we instead of doing this, prefault the user memory in advance of
> > > the vread call.
> > >
> > Just a quick perf tests shows regression around 6%. 10 workers test_mask is 31:
> >
> > # default
> > [  140.349731] All test took worker0=485061693537 cycles
> > [  140.386065] All test took worker1=486504572954 cycles
> > [  140.418452] All test took worker2=467204082542 cycles
> > [  140.435895] All test took worker3=512591010219 cycles
> > [  140.458316] All test took worker4=448583324125 cycles
> > [  140.494244] All test took worker5=501018129647 cycles
> > [  140.518144] All test took worker6=516224787767 cycles
> > [  140.535472] All test took worker7=442025617137 cycles
> > [  140.558249] All test took worker8=503337286539 cycles
> > [  140.590571] All test took worker9=494369561574 cycles
> >
> > # patch
> > [  144.464916] All test took worker0=530373399067 cycles
> > [  144.492904] All test took worker1=522641540924 cycles
> > [  144.528999] All test took worker2=529711158267 cycles
> > [  144.552963] All test took worker3=527389011775 cycles
> > [  144.592951] All test took worker4=529583252449 cycles
> > [  144.610286] All test took worker5=523605706016 cycles
> > [  144.627690] All test took worker6=531494777011 cycles
> > [  144.653046] All test took worker7=527150114726 cycles
> > [  144.669818] All test took worker8=526599712235 cycles
> > [  144.693428] All test took worker9=526057490851 cycles
> >
> 
> OK ouch, that's worse than I observed! Let me try this prefault approach and
> then we can revert back to spinlocks.
> 
> > > >
> > > > So, how important for you to have this change?
> > > >
> > >
> > > Personally, always very important :)
> > >
> > This is good. Personal opinion always wins :)
> >
> 
> The heart always wins ;) well, an adaption here can make everybody's hearts
> happy I think.
> 
Totally agree :)

--
Uladzislau Rezki
  
Uladzislau Rezki March 21, 2023, 5:23 a.m. UTC | #11
On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> > 
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> > 
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> > 
> > A subset of test_vmalloc.sh performance results:-
> > 
> > fix_size_alloc_test             0.40%
> > full_fit_alloc_test		2.08%
> > long_busy_list_alloc_test	0.34%
> > random_size_alloc_test		-0.25%
> > random_size_align_alloc_test	0.06%
> > ...
> > all tests cycles                0.2%
> > 
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> 
> I'm travelling right now, but give me a few days and I'll test this
> against the XFS workloads that hammer the global vmalloc spin lock
> really, really badly. XFS can use vm_map_ram and vmalloc really
> heavily for metadata buffers and hit the global spin lock from every
> CPU in the system at the same time (i.e. highly concurrent
> workloads). vmalloc is also heavily used in the hottest path
> throught the journal where we process and calculate delta changes to
> several million items every second, again spread across every CPU in
> the system at the same time.
> 
> We really need the global spinlock to go away completely, but in the
> mean time a shared read lock should help a little bit....
> 
I am working on it. I submitted a proposal for how to eliminate it:


<snip>
Hello, LSF.

Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention

Description:
 Currently the vmap code does not scale to the number of CPU cores in a system
 because the global vmap space is protected by a single spinlock. Such an
 approach has a clear bottleneck if many CPUs simultaneously access one resource.

 In this talk I would like to describe the drawback, and show some data related
 to contentions and the places where those occur in the code. Apart from that I
 would like to share ideas on how to eliminate it, providing a few approaches
 and comparing them.

Requirements:
 * It should be a per-cpu approach;
 * Search of freed ptrs should not interfere with other freeing (as much as we can);
 *   - offload allocated areas (busy ones) per-cpu;
 * Cache ready-sized objects or merge them into one big per-cpu space (split on demand);
 * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
 * Prefetch a fixed size in front and allocate per-cpu

Goals:
 * Implement a per-cpu way of allocation to eliminate a contention.

Thanks!
<snip>
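
Purely as an illustration of the direction described above, a sketch of what
a per-cpu cache structure could look like (every name below is hypothetical,
not an existing or proposed kernel API):

<snip>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

struct vmap_pcpu_cache {
        spinlock_t lock;             /* local lock: no cross-CPU contention   */
        struct rb_root free_space;   /* this CPU's slice of free vmap space   */
        struct list_head lazy_purge; /* lazily-freed areas, drained locally   */
        unsigned long prefetched;    /* space prefetched "in front", as above */
};
static DEFINE_PER_CPU(struct vmap_pcpu_cache, vmap_cache);
<snip>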

--
Uladzislau Rezki
  
Lorenzo Stoakes March 21, 2023, 7:45 a.m. UTC | #12
On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > already contains components which may sleep, so avoiding spin locks is not
> > > a problem from the perspective of atomic context.
> > >
> > > The global vmap_area_lock is held when the red/black tree rooted in
> > > vmap_are_root is accessed and thus is rather long-held and under
> > > potentially high contention. It is likely to be under contention for reads
> > > rather than write, so replace it with a rwsem.
> > >
> > > Each individual vmap_block->lock is likely to be held for less time but
> > > under low contention, so a mutex is not an outrageous choice here.
> > >
> > > A subset of test_vmalloc.sh performance results:-
> > >
> > > fix_size_alloc_test             0.40%
> > > full_fit_alloc_test		2.08%
> > > long_busy_list_alloc_test	0.34%
> > > random_size_alloc_test		-0.25%
> > > random_size_align_alloc_test	0.06%
> > > ...
> > > all tests cycles                0.2%
> > >
> > > This represents a tiny reduction in performance that sits barely above
> > > noise.
> >
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > throught the journal where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> >
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> >

Hugely appreciated, Dave; however, I must disappoint on the rwsem, as I have now
reworked my patch set to use the original locks in order to satisfy Willy's
desire to make vmalloc usable in atomic context in future, and Uladzislau's
desire to not have a ~6% performance hit -
https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/

> I am working on it. I submitted a proposal how to eliminate it:
>
>
> <snip>
> Hello, LSF.
>
> Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
>
> Description:
>  Currently the vmap code is not scaled to number of CPU cores in a system
>  because a global vmap space is protected by a single spinlock. Such approach
>  has a clear bottleneck if many CPUs simultaneously access to one resource.
>
>  In this talk i would like to describe a drawback, show some data related
>  to contentions and places where those occur in a code. Apart of that i
>  would like to share ideas how to eliminate it providing a few approaches
>  and compare them.
>
> Requirements:
>  * It should be a per-cpu approach;
>  * Search of freed ptrs should not interfere with other freeing(as much as we can);
>  *   - offload allocated areas(buzy ones) per-cpu;
>  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
>  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
>  * Prefetch a fixed size in front and allocate per-cpu
>
> Goals:
>  * Implement a per-cpu way of allocation to eliminate a contention.
>
> Thanks!
> <snip>
>
> --
> Uladzislau Rezki
>

That's really awesome! I will come to that talk at LSF/MM :) being able to
sustain the lock in atomic context seems to be an aspect that is important going
forward also.
  
Uladzislau Rezki March 21, 2023, 8:54 a.m. UTC | #13
On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test             0.40%
> > > > full_fit_alloc_test		2.08%
> > > > long_busy_list_alloc_test	0.34%
> > > > random_size_alloc_test		-0.25%
> > > > random_size_align_alloc_test	0.06%
> > > > ...
> > > > all tests cycles                0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > >
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > >
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > >
> 
> Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> reworked my patch set to use the original locks in order to satisfy Willy's
> desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> ~6% performance hit -
> https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/
> 
> > I am working on it. I submitted a proposal how to eliminate it:
> >
> >
> > <snip>
> > Hello, LSF.
> >
> > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> >
> > Description:
> >  Currently the vmap code is not scaled to number of CPU cores in a system
> >  because a global vmap space is protected by a single spinlock. Such approach
> >  has a clear bottleneck if many CPUs simultaneously access to one resource.
> >
> >  In this talk i would like to describe a drawback, show some data related
> >  to contentions and places where those occur in a code. Apart of that i
> >  would like to share ideas how to eliminate it providing a few approaches
> >  and compare them.
> >
> > Requirements:
> >  * It should be a per-cpu approach;
> >  * Search of freed ptrs should not interfere with other freeing(as much as we can);
> >  *   - offload allocated areas(buzy ones) per-cpu;
> >  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> >  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> >  * Prefetch a fixed size in front and allocate per-cpu
> >
> > Goals:
> >  * Implement a per-cpu way of allocation to eliminate a contention.
> >
> > Thanks!
> > <snip>
> >
> > --
> > Uladzislau Rezki
> >
> 
> That's really awesome! I will come to that talk at LSF/MM :) being able to
> sustain the lock in atomic context seems to be an aspect that is important going
> forward also.
>
Uhh... So i need to prepare then :)))

--
Uladzislau Rezki
  
Uladzislau Rezki March 21, 2023, 10:24 a.m. UTC | #14
On Tue, Mar 21, 2023 at 09:05:26PM +1100, Dave Chinner wrote:
> On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> > On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > > already contains components which may sleep, so avoiding spin locks is not
> > > > > a problem from the perspective of atomic context.
> > > > >
> > > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > > potentially high contention. It is likely to be under contention for reads
> > > > > rather than write, so replace it with a rwsem.
> > > > >
> > > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > > under low contention, so a mutex is not an outrageous choice here.
> > > > >
> > > > > A subset of test_vmalloc.sh performance results:-
> > > > >
> > > > > fix_size_alloc_test             0.40%
> > > > > full_fit_alloc_test		2.08%
> > > > > long_busy_list_alloc_test	0.34%
> > > > > random_size_alloc_test		-0.25%
> > > > > random_size_align_alloc_test	0.06%
> > > > > ...
> > > > > all tests cycles                0.2%
> > > > >
> > > > > This represents a tiny reduction in performance that sits barely above
> > > > > noise.
> > > >
> > > > I'm travelling right now, but give me a few days and I'll test this
> > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > heavily for metadata buffers and hit the global spin lock from every
> > > > CPU in the system at the same time (i.e. highly concurrent
> > > > workloads). vmalloc is also heavily used in the hottest path
> > > > throught the journal where we process and calculate delta changes to
> > > > several million items every second, again spread across every CPU in
> > > > the system at the same time.
> > > >
> > > > We really need the global spinlock to go away completely, but in the
> > > > mean time a shared read lock should help a little bit....
> > > >
> > 
> > Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> > reworked my patch set to use the original locks in order to satisfy Willy's
> > desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> > ~6% performance hit -
> > https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/
> 
> Yeah, I'd already read that.
> 
> What I want to do, though, is to determine whether the problem
> shared access contention or exclusive access contention. If it's
> exclusive access contention, then an rwsem will do nothing to
> alleviate the problem, and that's kinda critical to know before any
> fix for the contention problems are worked out...
> 
> > > I am working on it. I submitted a proposal how to eliminate it:
> > >
> > >
> > > <snip>
> > > Hello, LSF.
> > >
> > > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> > >
> > > Description:
> > >  Currently the vmap code is not scaled to number of CPU cores in a system
> > >  because a global vmap space is protected by a single spinlock. Such approach
> > >  has a clear bottleneck if many CPUs simultaneously access to one resource.
> > >
> > >  In this talk i would like to describe a drawback, show some data related
> > >  to contentions and places where those occur in a code. Apart of that i
> > >  would like to share ideas how to eliminate it providing a few approaches
> > >  and compare them.
> 
> If you want data about contention problems with vmalloc
> 
> > > Requirements:
> > >  * It should be a per-cpu approach;
> 
> Hmmmm. My 2c worth on this: That is not a requirement.
> 
> That's a -solution-.
> 
> The requirement is that independent concurrent vmalloc/vfree
> operations do not severely contend with each other.
> 
> Yes, the solution will probably involve sharding the resource space
> across mulitple independent structures (as we do in filesystems with
> block groups, allocations groups, etc) but that does not necessarily
> need the structures to be per-cpu.
> 
> e.g per-node vmalloc arenas might be sufficient and allow more
> expensive but more efficient indexing structures to be used because
> we don't have to care about the explosion of memory that
> fine-grained per-cpu indexing generally entails.  This may also fit
> in to the existing per-node structure of the memory reclaim
> infrastructure to manage things like compaction, balancing, etc of
> vmalloc space assigned to the given node.
> 
> Hence I think saying "per-cpu is a requirement" kinda prevents
> exploration of other novel solutions that may have advantages other
> than "just solves the concurrency problem"...
> 
> > >  * Search of freed ptrs should not interfere with other freeing(as much as we can);
> > >  *   - offload allocated areas(buzy ones) per-cpu;
> > >  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> > >  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> > >  * Prefetch a fixed size in front and allocate per-cpu
> 
> I'd call these desired traits and/or potential optimisations, not
> hard requirements.
> 
> > > Goals:
> > >  * Implement a per-cpu way of allocation to eliminate a contention.
> 
> The goal should be to "allow contention-free vmalloc operations", not
> that we implement a specific solution.
> 
I think we are on the same page. I do not see us diverging on anything.
Probably I was a bit more specific in the requirements, but this is how I
personally see it, based on different kinds of experiments with it.

Thank you for your 2c!

--
Uladzislau Rezki
  
Uladzislau Rezki March 22, 2023, 1:18 p.m. UTC | #15
Hello, Dave.

> 
> I'm travelling right now, but give me a few days and I'll test this
> against the XFS workloads that hammer the global vmalloc spin lock
> really, really badly. XFS can use vm_map_ram and vmalloc really
> heavily for metadata buffers and hit the global spin lock from every
> CPU in the system at the same time (i.e. highly concurrent
> workloads). vmalloc is also heavily used in the hottest path
> throught the journal where we process and calculate delta changes to
> several million items every second, again spread across every CPU in
> the system at the same time.
> 
> We really need the global spinlock to go away completely, but in the
> mean time a shared read lock should help a little bit....
> 
Could you please share some steps for how to run your workloads in order to
exercise the vmalloc() code? I would like to have a look at it in more detail
just to understand the workloads.

Meanwhile my grep against xfs shows:

<snip>
urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
./xfs_log_priv.h:675: * Log vector and shadow buffers can be large, so we need to use kvmalloc() here
./xfs_log_priv.h:676: * to ensure success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts
./xfs_log_priv.h:677: * to fall back to vmalloc, so we can't actually do anything useful with gfp
./xfs_log_priv.h:678: * flags to control the kmalloc() behaviour within kvmalloc(). Hence kmalloc()
./xfs_log_priv.h:681: * vmalloc if it can't get somethign straight away from the free lists or
./xfs_log_priv.h:682: * buddy allocator. Hence we have to open code kvmalloc outselves here.
./xfs_log_priv.h:686: * allocations. This is actually the only way to make vmalloc() do GFP_NOFS
./xfs_log_priv.h:691:xlog_kvmalloc(
./xfs_log_priv.h:702:                   p = vmalloc(buf_size);
./xfs_bio_io.c:21:      unsigned int            is_vmalloc = is_vmalloc_addr(data);
./xfs_bio_io.c:26:      if (is_vmalloc && op == REQ_OP_WRITE)
./xfs_bio_io.c:56:      if (is_vmalloc && op == REQ_OP_READ)
./xfs_log.c:1976:       if (is_vmalloc_addr(iclog->ic_data))
./xfs_log_cil.c:338:                    lv = xlog_kvmalloc(buf_size);
./libxfs/xfs_attr_leaf.c:522:           args->value = kvmalloc(valuelen, GFP_KERNEL | __GFP_NOLOCKDEP);
./kmem.h:12:#include <linux/vmalloc.h>
./kmem.h:78:    if (is_vmalloc_addr(addr))
./kmem.h:79:            return vmalloc_to_page(addr);
./xfs_attr_item.c:84:    * This could be over 64kB in length, so we have to use kvmalloc() for
./xfs_attr_item.c:85:    * this. But kvmalloc() utterly sucks, so we use our own version.
./xfs_attr_item.c:87:   nv = xlog_kvmalloc(sizeof(struct xfs_attri_log_nameval) +
./scrub/attr.c:60:      ab = kvmalloc(sizeof(*ab) + sz, flags);
urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$
<snip>

Thanks!

--
Uladzislau Rezki
  
Matthew Wilcox March 22, 2023, 5:47 p.m. UTC | #16
On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> Hello, Dave.
> 
> > 
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > throught the journal where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> > 
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> > 
> Could you please share some steps how to run your workloads in order to
> touch vmalloc() code. I would like to have a look at it in more detail
> just for understanding the workloads.
> 
> Meanwhile my grep agains xfs shows:
> 
> <snip>
> urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./

You're missing:

fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,

which i suspect is the majority of Dave's workload.  That will almost
certainly take the vb_alloc() path.
  
Uladzislau Rezki March 22, 2023, 6:01 p.m. UTC | #17
On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > Hello, Dave.
> > 
> > > 
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > > 
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > > 
> > Could you please share some steps how to run your workloads in order to
> > touch vmalloc() code. I would like to have a look at it in more detail
> > just for understanding the workloads.
> > 
> > Meanwhile my grep agains xfs shows:
> > 
> > <snip>
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> 
> You're missing:
> 
> fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> 
> which i suspect is the majority of Dave's workload.  That will almost
> certainly take the vb_alloc() path.
>
Then it has nothing to do with vmalloc contention (I mean the global KVA allocator), IMHO.
Unless:

<snip>
void *vm_map_ram(struct page **pages, unsigned int count, int node)
{
	unsigned long size = (unsigned long)count << PAGE_SHIFT;
	unsigned long addr;
	void *mem;

	if (likely(count <= VMAP_MAX_ALLOC)) {
		mem = vb_alloc(size, GFP_KERNEL);
		if (IS_ERR(mem))
			return NULL;
		addr = (unsigned long)mem;
	} else {
		struct vmap_area *va;
		va = alloc_vmap_area(size, PAGE_SIZE,
				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
		if (IS_ERR(va))
			return NULL;
<snip>

the number of pages is > VMAP_MAX_ALLOC.

That is why I have asked about the workloads, because I would like to understand
where the "problem" is. vm_map_ram() accesses the global vmap space, but that happens
when a new vmap block is required, and I also think it is not a problem.

But who knows; therefore it makes sense to have a look at the workload.
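
For concreteness, a minimal sketch of the vm_map_ram()/vm_unmap_ram() usage
pattern being discussed (the wrapper names are invented; the cutoff behaviour
is as quoted in the snippet above):

<snip>
#include <linux/numa.h>
#include <linux/vmalloc.h>

/*
 * Mappings of up to VMAP_MAX_ALLOC pages are served by vb_alloc() from
 * per-CPU vmap blocks; larger requests (and the creation of a new vmap
 * block) go through alloc_vmap_area() and hence the global lock.
 */
static void *map_metadata_buffer(struct page **pages, unsigned int nr_pages)
{
        return vm_map_ram(pages, nr_pages, NUMA_NO_NODE);
}

static void unmap_metadata_buffer(void *addr, unsigned int nr_pages)
{
        vm_unmap_ram(addr, nr_pages);
}
<snip>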

--
Uladzislau Rezki
  
Uladzislau Rezki March 22, 2023, 7:15 p.m. UTC | #18
On Wed, Mar 22, 2023 at 07:01:59PM +0100, Uladzislau Rezki wrote:
> On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> > On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > > Hello, Dave.
> > > 
> > > > 
> > > > I'm travelling right now, but give me a few days and I'll test this
> > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > heavily for metadata buffers and hit the global spin lock from every
> > > > CPU in the system at the same time (i.e. highly concurrent
> > > > workloads). vmalloc is also heavily used in the hottest path
> > > > throught the journal where we process and calculate delta changes to
> > > > several million items every second, again spread across every CPU in
> > > > the system at the same time.
> > > > 
> > > > We really need the global spinlock to go away completely, but in the
> > > > mean time a shared read lock should help a little bit....
> > > > 
> > > Could you please share some steps how to run your workloads in order to
> > > touch vmalloc() code. I would like to have a look at it in more detail
> > > just for understanding the workloads.
> > > 
> > > Meanwhile my grep agains xfs shows:
> > > 
> > > <snip>
> > > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> > 
> > You're missing:
> > 
> > fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > 
> > which i suspect is the majority of Dave's workload.  That will almost
> > certainly take the vb_alloc() path.
> >
> Then it has nothing to do with vmalloc contention(i mean global KVA allocator), IMHO.
> Unless:
> 
> <snip>
> void *vm_map_ram(struct page **pages, unsigned int count, int node)
> {
> 	unsigned long size = (unsigned long)count << PAGE_SHIFT;
> 	unsigned long addr;
> 	void *mem;
> 
> 	if (likely(count <= VMAP_MAX_ALLOC)) {
> 		mem = vb_alloc(size, GFP_KERNEL);
> 		if (IS_ERR(mem))
> 			return NULL;
> 		addr = (unsigned long)mem;
> 	} else {
> 		struct vmap_area *va;
> 		va = alloc_vmap_area(size, PAGE_SIZE,
> 				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> 		if (IS_ERR(va))
> 			return NULL;
> <snip>
> 
> number of pages > VMAP_MAX_ALLOC.
> 
> That is why i have asked about workloads because i would like to understand
> where a "problem" is. A vm_map_ram() access the global vmap space but it happens 
> when a new vmap block is required and i also think it is not a problem.
> 
> But who knows, therefore it makes sense to have a lock at workload.
> 
Here are lock_stat statistics for a vm_map_ram()/vm_unmap_ram() test.
I did it on a 64-CPU system, running 64 threads doing mapping/unmapping
of 1 page. Each thread does 10 000 000 mappings + unmappings in a loop:

<snip>
root@pc638:/home/urezki# cat /proc/lock_stat
lock_stat version 0.4
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

vmap_area_lock:       2554079        2554276           0.06         213.61    11719647.67           4.59        2986513        3005712           0.05          67.02     3573323.37           1.19
  --------------
  vmap_area_lock        1297948          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
  vmap_area_lock        1256330          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
  vmap_area_lock              1          [<00000000c95c05a7>] find_vm_area+0x16/0x70
  --------------
  vmap_area_lock        1738590          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
  vmap_area_lock         815688          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
  vmap_area_lock              1          [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170

.....................................................................................................................................................................................................

vmap_blocks.xa_lock:        862689         862698           0.05          77.74      849325.39           0.98        3005156        3005709           0.12          31.11     1920242.82           0.64
  -------------------
  vmap_blocks.xa_lock         378418          [<00000000625a5626>] vm_map_ram+0x359/0x4a0
  vmap_blocks.xa_lock         484280          [<00000000caa2ef03>] xa_erase+0xe/0x30
  -------------------
  vmap_blocks.xa_lock         576226          [<00000000caa2ef03>] xa_erase+0xe/0x30
  vmap_blocks.xa_lock         286472          [<00000000625a5626>] vm_map_ram+0x359/0x4a0

....................................................................................................................................................................................................

free_vmap_area_lock:        394960         394961           0.05         124.78      448241.23           1.13        1514508        1515077           0.12          30.48     1179167.01           0.78
  -------------------
  free_vmap_area_lock         385970          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
  free_vmap_area_lock           4692          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
  free_vmap_area_lock           4299          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
  -------------------
  free_vmap_area_lock         371734          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
  free_vmap_area_lock          17007          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
  free_vmap_area_lock           6220          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910

.....................................................................................................................................................................................................

purge_vmap_area_lock:        169307         169312           0.05          31.08       81655.21           0.48        1514794        1515078           0.05          67.73      912391.12           0.60
  --------------------
  purge_vmap_area_lock         166409          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
  purge_vmap_area_lock           2903          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
  --------------------
  purge_vmap_area_lock         167511          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
  purge_vmap_area_lock           1801          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
<snip>

alloc_vmap_area is the top one and the second is the xa_lock. But the test I have
done is a pretty highly concurrent scenario.

--
Uladzislau Rezki
  
Uladzislau Rezki March 23, 2023, 12:47 p.m. UTC | #19
On Wed, Mar 22, 2023 at 08:15:09PM +0100, Uladzislau Rezki wrote:
> On Wed, Mar 22, 2023 at 07:01:59PM +0100, Uladzislau Rezki wrote:
> > On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> > > On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > > > Hello, Dave.
> > > > 
> > > > > 
> > > > > I'm travelling right now, but give me a few days and I'll test this
> > > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > > heavily for metadata buffers and hit the global spin lock from every
> > > > > CPU in the system at the same time (i.e. highly concurrent
> > > > > workloads). vmalloc is also heavily used in the hottest path
> > > > > throught the journal where we process and calculate delta changes to
> > > > > several million items every second, again spread across every CPU in
> > > > > the system at the same time.
> > > > > 
> > > > > We really need the global spinlock to go away completely, but in the
> > > > > mean time a shared read lock should help a little bit....
> > > > > 
> > > > Could you please share some steps how to run your workloads in order to
> > > > touch vmalloc() code. I would like to have a look at it in more detail
> > > > just for understanding the workloads.
> > > > 
> > > > Meanwhile my grep agains xfs shows:
> > > > 
> > > > <snip>
> > > > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> > > 
> > > You're missing:
> > > 
> > > fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > > 
> > > which i suspect is the majority of Dave's workload.  That will almost
> > > certainly take the vb_alloc() path.
> > >
> > Then it has nothing to do with vmalloc contention(i mean global KVA allocator), IMHO.
> > Unless:
> > 
> > <snip>
> > void *vm_map_ram(struct page **pages, unsigned int count, int node)
> > {
> > 	unsigned long size = (unsigned long)count << PAGE_SHIFT;
> > 	unsigned long addr;
> > 	void *mem;
> > 
> > 	if (likely(count <= VMAP_MAX_ALLOC)) {
> > 		mem = vb_alloc(size, GFP_KERNEL);
> > 		if (IS_ERR(mem))
> > 			return NULL;
> > 		addr = (unsigned long)mem;
> > 	} else {
> > 		struct vmap_area *va;
> > 		va = alloc_vmap_area(size, PAGE_SIZE,
> > 				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > 		if (IS_ERR(va))
> > 			return NULL;
> > <snip>
> > 
> > number of pages > VMAP_MAX_ALLOC.
> > 
> > That is why i have asked about workloads because i would like to understand
> > where a "problem" is. A vm_map_ram() access the global vmap space but it happens 
> > when a new vmap block is required and i also think it is not a problem.
> > 
> > But who knows, therefore it makes sense to have a lock at workload.
> > 
> There is a lock-stat statistics for vm_map_ram()/vm_unmap_ram() test.
> I did it on 64 CPUs system with running 64 threads doing mapping/unmapping
> of 1 page. Each thread does 10 000 000 mapping + unmapping in a loop:
> 
> <snip>
> root@pc638:/home/urezki# cat /proc/lock_stat
> lock_stat version 0.4
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> vmap_area_lock:       2554079        2554276           0.06         213.61    11719647.67           4.59        2986513        3005712           0.05          67.02     3573323.37           1.19
>   --------------
>   vmap_area_lock        1297948          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
>   vmap_area_lock        1256330          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
>   vmap_area_lock              1          [<00000000c95c05a7>] find_vm_area+0x16/0x70
>   --------------
>   vmap_area_lock        1738590          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
>   vmap_area_lock         815688          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
>   vmap_area_lock              1          [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170
> 
> .....................................................................................................................................................................................................
> 
> vmap_blocks.xa_lock:        862689         862698           0.05          77.74      849325.39           0.98        3005156        3005709           0.12          31.11     1920242.82           0.64
>   -------------------
>   vmap_blocks.xa_lock         378418          [<00000000625a5626>] vm_map_ram+0x359/0x4a0
>   vmap_blocks.xa_lock         484280          [<00000000caa2ef03>] xa_erase+0xe/0x30
>   -------------------
>   vmap_blocks.xa_lock         576226          [<00000000caa2ef03>] xa_erase+0xe/0x30
>   vmap_blocks.xa_lock         286472          [<00000000625a5626>] vm_map_ram+0x359/0x4a0
> 
> ....................................................................................................................................................................................................
> 
> free_vmap_area_lock:        394960         394961           0.05         124.78      448241.23           1.13        1514508        1515077           0.12          30.48     1179167.01           0.78
>   -------------------
>   free_vmap_area_lock         385970          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
>   free_vmap_area_lock           4692          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
>   free_vmap_area_lock           4299          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
>   -------------------
>   free_vmap_area_lock         371734          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
>   free_vmap_area_lock          17007          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
>   free_vmap_area_lock           6220          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
> 
> .....................................................................................................................................................................................................
> 
> purge_vmap_area_lock:        169307         169312           0.05          31.08       81655.21           0.48        1514794        1515078           0.05          67.73      912391.12           0.60
>   --------------------
>   purge_vmap_area_lock         166409          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
>   purge_vmap_area_lock           2903          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
>   --------------------
>   purge_vmap_area_lock         167511          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
>   purge_vmap_area_lock           1801          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
> <snip>
> 
> alloc_vmap_area is a top and second one is xa_lock. But the test i have
> done is pretty high concurrent scenario.
> 
<snip>
From 32c38d239c6de3f1d3accf97d9d4944ecaa4bccd Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Date: Thu, 23 Mar 2023 13:07:27 +0100
Subject: [PATCH] mm: vmalloc: Remove global vmap_blocks xarray

The global vmap_blocks xarray can be contended under heavy
usage of the vm_map_ram()/vm_unmap_ram() APIs. Under a
stress test, lock-stat shows that the "vmap_blocks.xa_lock"
lock is second in the list when it comes to contention:

<snip>
----------------------------------------
class name con-bounces contentions ...
----------------------------------------
...
vmap_blocks.xa_lock:    862689 862698 ...
  -------------------
  vmap_blocks.xa_lock   378418    [<00000000625a5626>] vm_map_ram+0x359/0x4a0
  vmap_blocks.xa_lock   484280    [<00000000caa2ef03>] xa_erase+0xe/0x30
  -------------------
  vmap_blocks.xa_lock   576226    [<00000000caa2ef03>] xa_erase+0xe/0x30
  vmap_blocks.xa_lock   286472    [<00000000625a5626>] vm_map_ram+0x359/0x4a0
...
<snip>

That is a result of running vm_map_ram()/vm_unmap_ram() in
a loop. The test creates 64 threads (on a 64-CPU system) and
each one maps/unmaps 1 page.

After this change the xa_lock is reduced to noise under the
same test conditions:

<snip>
...
&xa->xa_lock#1:         10333 10394 ...
  --------------
  &xa->xa_lock#1        5349      [<00000000bbbc9751>] xa_erase+0xe/0x30
  &xa->xa_lock#1        5045      [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
  --------------
  &xa->xa_lock#1        7326      [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
  &xa->xa_lock#1        3068      [<00000000bbbc9751>] xa_erase+0xe/0x30
...
<snip>

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 54 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 978194dc2bb8..b1e549d152b2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1911,6 +1911,7 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 struct vmap_block_queue {
 	spinlock_t lock;
 	struct list_head free;
+	struct xarray vmap_blocks;
 };
 
 struct vmap_block {
@@ -1927,25 +1928,22 @@ struct vmap_block {
 /* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
 static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 
-/*
- * XArray of vmap blocks, indexed by address, to quickly find a vmap block
- * in the free path. Could get rid of this if we change the API to return a
- * "cookie" from alloc, to be passed to free. But no big deal yet.
- */
-static DEFINE_XARRAY(vmap_blocks);
-
-/*
- * We should probably have a fallback mechanism to allocate virtual memory
- * out of partially filled vmap blocks. However vmap block sizing should be
- * fairly reasonable according to the vmalloc size, so it shouldn't be a
- * big problem.
- */
+static struct vmap_block_queue *
+addr_to_vbq(unsigned long addr)
+{
+	int cpu = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+	return &per_cpu(vmap_block_queue, cpu);
+}
 
-static unsigned long addr_to_vb_idx(unsigned long addr)
+static unsigned long
+addr_to_vb_va_start(unsigned long addr)
 {
-	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
-	addr /= VMAP_BLOCK_SIZE;
-	return addr;
+	/* Check if aligned. */
+	if (IS_ALIGNED(addr, VMAP_BLOCK_SIZE))
+		return addr;
+
+	/* Return the start address of the block this address belongs to. */
+	return rounddown(addr, VMAP_BLOCK_SIZE);
 }
 
 static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
@@ -1953,7 +1951,7 @@ static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
 	unsigned long addr;
 
 	addr = va_start + (pages_off << PAGE_SHIFT);
-	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
+	BUG_ON(addr_to_vb_va_start(addr) != addr_to_vb_va_start(va_start));
 	return (void *)addr;
 }
 
@@ -1970,7 +1968,6 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	struct vmap_block_queue *vbq;
 	struct vmap_block *vb;
 	struct vmap_area *va;
-	unsigned long vb_idx;
 	int node, err;
 	void *vaddr;
 
@@ -2003,8 +2000,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	bitmap_set(vb->used_map, 0, (1UL << order));
 	INIT_LIST_HEAD(&vb->free_list);
 
-	vb_idx = addr_to_vb_idx(va->va_start);
-	err = xa_insert(&vmap_blocks, vb_idx, vb, gfp_mask);
+	vbq = addr_to_vbq(va->va_start);
+	err = xa_insert(&vbq->vmap_blocks, va->va_start, vb, gfp_mask);
 	if (err) {
 		kfree(vb);
 		free_vmap_area(va);
@@ -2021,9 +2018,11 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 static void free_vmap_block(struct vmap_block *vb)
 {
+	struct vmap_block_queue *vbq;
 	struct vmap_block *tmp;
 
-	tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start));
+	vbq = addr_to_vbq(vb->va->va_start);
+	tmp = xa_erase(&vbq->vmap_blocks, vb->va->va_start);
 	BUG_ON(tmp != vb);
 
 	spin_lock(&vmap_area_lock);
@@ -2135,6 +2134,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	unsigned long offset;
 	unsigned int order;
 	struct vmap_block *vb;
+	struct vmap_block_queue *vbq;
 
 	BUG_ON(offset_in_page(size));
 	BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC);
@@ -2143,7 +2143,10 @@ static void vb_free(unsigned long addr, unsigned long size)
 
 	order = get_order(size);
 	offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
-	vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
+
+	vbq = addr_to_vbq(addr);
+	vb = xa_load(&vbq->vmap_blocks, addr_to_vb_va_start(addr));
+
 	spin_lock(&vb->lock);
 	bitmap_clear(vb->used_map, offset, (1UL << order));
 	spin_unlock(&vb->lock);
@@ -3486,6 +3489,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 {
 	char *start;
 	struct vmap_block *vb;
+	struct vmap_block_queue *vbq;
 	unsigned long offset;
 	unsigned int rs, re, n;
 
@@ -3503,7 +3507,8 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 	 * Area is split into regions and tracked with vmap_block, read out
 	 * each region and zero fill the hole between regions.
 	 */
-	vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
+	vbq = addr_to_vbq((unsigned long) addr);
+	vb = xa_load(&vbq->vmap_blocks, addr_to_vb_va_start((unsigned long) addr));
 	if (!vb)
 		goto finished;
 
@@ -4272,6 +4277,7 @@ void __init vmalloc_init(void)
 		p = &per_cpu(vfree_deferred, i);
 		init_llist_head(&p->list);
 		INIT_WORK(&p->wq, delayed_vfree_work);
+		xa_init(&vbq->vmap_blocks);
 	}
 
 	/* Import existing vmlist entries. */
  
Matthew Wilcox March 24, 2023, 5:31 a.m. UTC | #20
On Fri, Mar 24, 2023 at 04:25:39PM +1100, Dave Chinner wrote:
> Did you read the comment above this function? I mean, it's all about
> how poorly kvmalloc() works for the highly concurrent, fail-fast
> context that occurs in the journal commit fast path, and how we open
> code it with kmalloc and vmalloc to work "ok" in this path.
> 
> Then if you go look at the commits related to it, you might find
> that XFS developers tend to write properly useful changelogs to
> document things like "it's better, but vmalloc will soon have lock
> contention problems if we hit it any harder"....

The problem with writing whinges like this is that mm developers don't
read XFS changelogs.  I certainly had no idea this was a problem, and
I doubt anybody else who could make a start at fixing this problem had
any idea either.  Why go to all this effort instead of sending an email
to linux-mm?
  
Uladzislau Rezki March 27, 2023, 5:22 p.m. UTC | #21
>     So, this patch open codes the kvmalloc() in the commit path to have
>     the above described behaviour. The result is we more than halve the
>     CPU time spend doing kvmalloc() in this path and transaction commits
>     with 64kB objects in them more than doubles. i.e. we get ~5x
>     reduction in CPU usage per costly-sized kvmalloc() invocation and
>     the profile looks like this:
>     
>       - 37.60% xlog_cil_commit
>             16.01% memcpy_erms
>           - 8.45% __kmalloc
>              - 8.04% kmalloc_order_trace
>                 - 8.03% kmalloc_order
>                    - 7.93% alloc_pages
>                       - 7.90% __alloc_pages
>                          - 4.05% __alloc_pages_slowpath.constprop.0
>                             - 2.18% get_page_from_freelist
>                             - 1.77% wake_all_kswapds
>     ....
>                                         - __wake_up_common_lock
>                                            - 0.94% _raw_spin_lock_irqsave
>                          - 3.72% get_page_from_freelist
>                             - 2.43% _raw_spin_lock_irqsave
>           - 5.72% vmalloc
>              - 5.72% __vmalloc_node_range
>                 - 4.81% __get_vm_area_node.constprop.0
>                    - 3.26% alloc_vmap_area
>                       - 2.52% _raw_spin_lock
>                    - 1.46% _raw_spin_lock
>                   0.56% __alloc_pages_bulk
>           - 4.66% kvfree
>              - 3.25% vfree
OK, I see. I tried fs_mark in different configurations. For
example:

<snip>
time fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d ./scratch/0 -d ./scratch/1 -d ./scratch/2  \
-d ./scratch/3 -d ./scratch/4 -d ./scratch/5 -d ./scratch/6 -d ./scratch/7 -d ./scratch/8 \
-d ./scratch/9 -d ./scratch/10 -d ./scratch/11 -d ./scratch/12 -d ./scratch/13 \
-d ./scratch/14 -d ./scratch/15 -t 64 -F
<snip>

But I did not manage to trigger xlog_cil_commit() to fall back to the
vmalloc code. I think I should reduce the amount of memory on my kvm-pc
and repeat the tests!
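
For context, the open-coded pattern described in the quoted changelog is
roughly the fail-fast kmalloc-then-vmalloc fallback below. This is only a
sketch of the allocation behaviour under discussion, not the actual
xlog_cil_commit() code:

<snip>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *fail_fast_kvmalloc(size_t size)
{
	/* Try the cheap path first, but do not retry or enter heavy reclaim. */
	void *p = kmalloc(size, GFP_NOFS | __GFP_NORETRY | __GFP_NOWARN);

	if (p)
		return p;

	/* The costly allocation failed quickly: fall back to vmalloc(). */
	return __vmalloc(size, GFP_NOFS);
}
<snip>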

--
Uladzislau Rezki
  
Uladzislau Rezki March 28, 2023, 12:40 p.m. UTC | #22
On Tue, Mar 28, 2023 at 01:53:27PM +1100, Dave Chinner wrote:
> On Mon, Mar 27, 2023 at 07:22:44PM +0200, Uladzislau Rezki wrote:
> > >     So, this patch open codes the kvmalloc() in the commit path to have
> > >     the above described behaviour. The result is we more than halve the
> > >     CPU time spend doing kvmalloc() in this path and transaction commits
> > >     with 64kB objects in them more than doubles. i.e. we get ~5x
> > >     reduction in CPU usage per costly-sized kvmalloc() invocation and
> > >     the profile looks like this:
> > >     
> > >       - 37.60% xlog_cil_commit
> > >             16.01% memcpy_erms
> > >           - 8.45% __kmalloc
> > >              - 8.04% kmalloc_order_trace
> > >                 - 8.03% kmalloc_order
> > >                    - 7.93% alloc_pages
> > >                       - 7.90% __alloc_pages
> > >                          - 4.05% __alloc_pages_slowpath.constprop.0
> > >                             - 2.18% get_page_from_freelist
> > >                             - 1.77% wake_all_kswapds
> > >     ....
> > >                                         - __wake_up_common_lock
> > >                                            - 0.94% _raw_spin_lock_irqsave
> > >                          - 3.72% get_page_from_freelist
> > >                             - 2.43% _raw_spin_lock_irqsave
> > >           - 5.72% vmalloc
> > >              - 5.72% __vmalloc_node_range
> > >                 - 4.81% __get_vm_area_node.constprop.0
> > >                    - 3.26% alloc_vmap_area
> > >                       - 2.52% _raw_spin_lock
> > >                    - 1.46% _raw_spin_lock
> > >                   0.56% __alloc_pages_bulk
> > >           - 4.66% kvfree
> > >              - 3.25% vfree
> > OK, i see. I tried to use the fs_mark in different configurations. For
> > example:
> > 
> > <snip>
> > time fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d ./scratch/0 -d ./scratch/1 -d ./scratch/2  \
> > -d ./scratch/3 -d ./scratch/4 -d ./scratch/5 -d ./scratch/6 -d ./scratch/7 -d ./scratch/8 \
> > -d ./scratch/9 -d ./scratch/10 -d ./scratch/11 -d ./scratch/12 -d ./scratch/13 \
> > -d ./scratch/14 -d ./scratch/15 -t 64 -F
> > <snip>
> > 
> > But i did not manage to trigger xlog_cil_commit() to fallback to vmalloc
> > code. I think i should reduce an amount of memory on my kvm-pc and
> > repeat the tests!
> 
> Simple way of doing is to use directory blocks that are larger than
> page size:
> 
> mkfs.xfs -n size=64k ....
> 
> We can hit that path in other ways - large attributes will hit it in
> the attr buffer allocation path, enabling the new attribute
> intent-based logging mode will hit it in the xlog_cil_commit path as
> well. IIRC, the above profile comes from the latter case, creating
> lots of zero length files with 64kB xattrs attached via fsmark.
> 
Good. Thank you, that is useful.

--
Uladzislau Rezki
  

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 978194dc2bb8..c24b27664a97 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -40,6 +40,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/mm.h>
+#include <linux/rwsem.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -725,7 +726,7 @@  EXPORT_SYMBOL(vmalloc_to_pfn);
 #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
 
 
-static DEFINE_SPINLOCK(vmap_area_lock);
+static DECLARE_RWSEM(vmap_area_lock);
 static DEFINE_SPINLOCK(free_vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
@@ -1537,9 +1538,9 @@  static void free_vmap_area(struct vmap_area *va)
 	/*
 	 * Remove from the busy tree/list.
 	 */
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	/*
 	 * Insert/Merge it back to the free tree/list.
@@ -1627,9 +1628,9 @@  static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = va_flags;
 
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	BUG_ON(!IS_ALIGNED(va->va_start, align));
 	BUG_ON(va->va_start < vstart);
@@ -1854,9 +1855,9 @@  struct vmap_area *find_vmap_area(unsigned long addr)
 {
 	struct vmap_area *va;
 
-	spin_lock(&vmap_area_lock);
+	down_read(&vmap_area_lock);
 	va = __find_vmap_area(addr, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_read(&vmap_area_lock);
 
 	return va;
 }
@@ -1865,11 +1866,11 @@  static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 {
 	struct vmap_area *va;
 
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	va = __find_vmap_area(addr, &vmap_area_root);
 	if (va)
 		unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	return va;
 }
@@ -1914,7 +1915,7 @@  struct vmap_block_queue {
 };
 
 struct vmap_block {
-	spinlock_t lock;
+	struct mutex lock;
 	struct vmap_area *va;
 	unsigned long free, dirty;
 	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
@@ -1991,7 +1992,7 @@  static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	}
 
 	vaddr = vmap_block_vaddr(va->va_start, 0);
-	spin_lock_init(&vb->lock);
+	mutex_init(&vb->lock);
 	vb->va = va;
 	/* At least something should be left free */
 	BUG_ON(VMAP_BBMAP_BITS <= (1UL << order));
@@ -2026,9 +2027,9 @@  static void free_vmap_block(struct vmap_block *vb)
 	tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start));
 	BUG_ON(tmp != vb);
 
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	unlink_va(vb->va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	free_vmap_area_noflush(vb->va);
 	kfree_rcu(vb, rcu_head);
@@ -2047,7 +2048,7 @@  static void purge_fragmented_blocks(int cpu)
 		if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
 			continue;
 
-		spin_lock(&vb->lock);
+		mutex_lock(&vb->lock);
 		if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {
 			vb->free = 0; /* prevent further allocs after releasing lock */
 			vb->dirty = VMAP_BBMAP_BITS; /* prevent purging it again */
@@ -2056,10 +2057,10 @@  static void purge_fragmented_blocks(int cpu)
 			spin_lock(&vbq->lock);
 			list_del_rcu(&vb->free_list);
 			spin_unlock(&vbq->lock);
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 			list_add_tail(&vb->purge, &purge);
 		} else
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 	}
 	rcu_read_unlock();
 
@@ -2101,9 +2102,9 @@  static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
 		unsigned long pages_off;
 
-		spin_lock(&vb->lock);
+		mutex_lock(&vb->lock);
 		if (vb->free < (1UL << order)) {
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 			continue;
 		}
 
@@ -2117,7 +2118,7 @@  static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 			spin_unlock(&vbq->lock);
 		}
 
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 		break;
 	}
 
@@ -2144,16 +2145,16 @@  static void vb_free(unsigned long addr, unsigned long size)
 	order = get_order(size);
 	offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
 	vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
-	spin_lock(&vb->lock);
+	mutex_lock(&vb->lock);
 	bitmap_clear(vb->used_map, offset, (1UL << order));
-	spin_unlock(&vb->lock);
+	mutex_unlock(&vb->lock);
 
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
 		flush_tlb_kernel_range(addr, addr + size);
 
-	spin_lock(&vb->lock);
+	mutex_lock(&vb->lock);
 
 	/* Expand dirty range */
 	vb->dirty_min = min(vb->dirty_min, offset);
@@ -2162,10 +2163,10 @@  static void vb_free(unsigned long addr, unsigned long size)
 	vb->dirty += 1UL << order;
 	if (vb->dirty == VMAP_BBMAP_BITS) {
 		BUG_ON(vb->free);
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 		free_vmap_block(vb);
 	} else
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 }
 
 static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
@@ -2183,7 +2184,7 @@  static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 
 		rcu_read_lock();
 		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
-			spin_lock(&vb->lock);
+			mutex_lock(&vb->lock);
 			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
 				unsigned long va_start = vb->va->va_start;
 				unsigned long s, e;
@@ -2196,7 +2197,7 @@  static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 
 				flush = 1;
 			}
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 		}
 		rcu_read_unlock();
 	}
@@ -2451,9 +2452,9 @@  static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
 			      unsigned long flags, const void *caller)
 {
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 }
 
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -3507,9 +3508,9 @@  static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 	if (!vb)
 		goto finished;
 
-	spin_lock(&vb->lock);
+	mutex_lock(&vb->lock);
 	if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 		goto finished;
 	}
 	for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
@@ -3536,7 +3537,7 @@  static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 		count -= n;
 	}
 unlock:
-	spin_unlock(&vb->lock);
+	mutex_unlock(&vb->lock);
 
 finished:
 	/* zero-fill the left dirty or free regions */
@@ -3576,13 +3577,15 @@  long vread(char *buf, char *addr, unsigned long count)
 	unsigned long buflen = count;
 	unsigned long n, size, flags;
 
+	might_sleep();
+
 	addr = kasan_reset_tag(addr);
 
 	/* Don't allow overflow */
 	if ((unsigned long) addr + count < count)
 		count = -(unsigned long) addr;
 
-	spin_lock(&vmap_area_lock);
+	down_read(&vmap_area_lock);
 	va = find_vmap_area_exceed_addr((unsigned long)addr);
 	if (!va)
 		goto finished;
@@ -3639,7 +3642,7 @@  long vread(char *buf, char *addr, unsigned long count)
 		count -= n;
 	}
 finished:
-	spin_unlock(&vmap_area_lock);
+	up_read(&vmap_area_lock);
 
 	if (buf == buf_start)
 		return 0;
@@ -3980,14 +3983,14 @@  struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	}
 
 	/* insert all vm's */
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	for (area = 0; area < nr_vms; area++) {
 		insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
 
 		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
 	}
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	/*
 	 * Mark allocated areas as accessible. Do it now as a best-effort
@@ -4114,7 +4117,7 @@  static void *s_start(struct seq_file *m, loff_t *pos)
 	__acquires(&vmap_area_lock)
 {
 	mutex_lock(&vmap_purge_lock);
-	spin_lock(&vmap_area_lock);
+	down_read(&vmap_area_lock);
 
 	return seq_list_start(&vmap_area_list, *pos);
 }
@@ -4128,7 +4131,7 @@  static void s_stop(struct seq_file *m, void *p)
 	__releases(&vmap_area_lock)
 	__releases(&vmap_purge_lock)
 {
-	spin_unlock(&vmap_area_lock);
+	up_read(&vmap_area_lock);
 	mutex_unlock(&vmap_purge_lock);
 }