Message ID | 20230118150703.4024-1-ubizjak@gmail.com |
---|---|
State | New |
Headers |
From: Uros Bizjak <ubizjak@gmail.com>
To: linux-kernel@vger.kernel.org
Cc: Uros Bizjak <ubizjak@gmail.com>, Andrew Morton <akpm@linux-foundation.org>
Subject: [PATCH] lib/genalloc: use try_cmpxchg in {set,clear}_bits_ll
Date: Wed, 18 Jan 2023 16:07:03 +0100
Message-Id: <20230118150703.4024-1-ubizjak@gmail.com> |
Series |
lib/genalloc: use try_cmpxchg in {set,clear}_bits_ll
|
|
Commit Message
Uros Bizjak
Jan. 18, 2023, 3:07 p.m. UTC
Use try_cmpxchg instead of cmpxchg(*ptr, old, new) == old in
{set,clear}_bits_ll. The x86 CMPXCHG instruction reports success in the
ZF flag, so this change saves a compare after cmpxchg (and the related
move instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns the old value of *ptr to "old" when
cmpxchg fails.

Note that the value of *ptr should be read using READ_ONCE to prevent
the compiler from merging, refetching or reordering the read.

The patch also declares these two functions inline, to ensure inlining.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
lib/genalloc.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
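For readers unfamiliar with the API: try_cmpxchg() returns a boolean
and, on failure, writes the value it observed back through its "old"
pointer, which is what makes the shorter loops in this patch possible.
A minimal sketch of those semantics, modeled on the kernel's generic
fallback (simplified, not the exact macro):

	/*
	 * Simplified sketch of try_cmpxchg() semantics, along the lines of
	 * the kernel's generic atomic fallbacks; not the exact definition.
	 */
	#define try_cmpxchg_sketch(ptr, oldp, new)			\
	({								\
		typeof(*(ptr)) __old = *(oldp);				\
		typeof(*(ptr)) __ret = cmpxchg((ptr), __old, (new));	\
		if (__ret != __old)					\
			*(oldp) = __ret; /* hand fresh value back */	\
		__ret == __old;		 /* true on success */		\
	})

On x86 this maps directly onto CMPXCHG's ZF result, so the caller's
"did it succeed" test needs no extra compare instruction.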
Comments
On Wed, 18 Jan 2023 16:07:03 +0100 Uros Bizjak <ubizjak@gmail.com> wrote:

> [snip]
> The patch also declares these two functions inline, to ensure inlining.

But why is that better? This adds a few hundred bytes more text, which
has a cost.
On Wed, Jan 18, 2023 at 10:18 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > The patch also declares these two functions inline, to ensure inlining.
>
> But why is that better? This adds a few hundred bytes more text, which
> has a cost.

With the original code, both functions are inlined and the size of the
object file is (gcc version 12.2.1, x86_64):

   text    data     bss     dec     hex filename
   4661     480       0    5141    1415 genalloc-orig.o

When try_cmpxchg is used, gcc chooses not to inline set_bits_ll (its
estimate of code size is not very precise when multi-line assembly is
involved), resulting in:

   text    data     bss     dec     hex filename
   4705     488       0    5193    1449 genalloc-noinline.o

And with inline added to avoid gcc's quirks:

   text    data     bss     dec     hex filename
   4629     480       0    5109    13f5 genalloc.o

Considering that these two changed functions are used only in
genalloc.o, adding the inline qualifier is a win, even compared to the
original size.

Uros.
On Wed, Jan 18, 2023 at 10:47 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> [size comparison snipped]
>
> Considering that these two changed functions are used only in
> genalloc.o, adding the inline qualifier is a win, even compared to the
> original size.

BTW: Recently, it was determined [1] that the usage of cpu_relax()
inside the cmpxchg loop can be harmful for performance. We actually have
the same situation here, so perhaps cpu_relax() should be removed in the
same way it was removed from the lockref.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f5fe24ef17b5fbe6db49534163e77499fb10ae8c

Uros.
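For concreteness, a variant of the patched loop with cpu_relax() dropped
in the way the lockref commit above did would look roughly like the
sketch below; this was not actually posted in the thread:

	/* Sketch only: set_bits_ll without cpu_relax(), mirroring the
	 * lockref change referenced in [1]; not code from this thread. */
	static inline int set_bits_ll(unsigned long *addr,
				      unsigned long mask_to_set)
	{
		unsigned long val = READ_ONCE(*addr);

		do {
			if (val & mask_to_set)
				return -EBUSY;
			/* retry immediately; try_cmpxchg refreshed val */
		} while (!try_cmpxchg(addr, &val, val | mask_to_set));

		return 0;
	}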
On Wed, Jan 18, 2023 at 10:55 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> BTW: Recently, it was determined [1] that the usage of cpu_relax()
> inside the cmpxchg loop can be harmful for performance. [snip]

I forgot to add some CCs that may be interested in the above.

Uros.
> BTW: Recently, it was determined [1] that the usage of cpu_relax()
> inside the cmpxchg loop can be harmful for performance. We actually
> have the same situation here, so perhaps cpu_relax() should be removed
> in the same way it was removed from the lockref.

I'm not sure you ever want a cpu_relax() in a loop that is implementing
an atomic operation. Even the ia64 (die...) issue was with a loop that
was waiting for another cpu to change the location (eg a spinlock).

For an atomic operation an immediate retry is likely to succeed. Any
kind of deferral to another cpu can only make it worse.

Clearly if you have 100s of cpus looping doing atomic operations on the
same cache line it is likely that some get starved. But to fix that you
need to increase the time between successful operations, not delay on
failure.

	David
On Thu, Jan 19, 2023 at 1:47 PM David Laight <David.Laight@aculab.com> wrote:
>
> I'm not sure you ever want a cpu_relax() in a loop that is implementing
> an atomic operation. [snip]

I would like to point out that the wikipedia article on compare-and-swap
claims [1] that:

    Instead of immediately retrying after a CAS operation fails,
    researchers have found that total system performance can be improved
    in multiprocessor systems—where many threads constantly update some
    particular shared variable—if threads that see their CAS fail use
    exponential backoff—in other words, wait a little before retrying
    the CAS [2].

[1] https://en.wikipedia.org/wiki/Compare-and-swap#Overview
[2] https://arxiv.org/pdf/1305.5800.pdf

Uros.
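To make the cited backoff idea concrete, here is a sketch of what such a
loop might look like if applied to set_bits_ll(); the cap of 64 and the
placement of the backoff are invented for illustration, not something
proposed in the thread:

	/* Illustrative sketch: CAS retry with capped exponential
	 * backoff. Function name and constants are hypothetical. */
	static inline int set_bits_ll_backoff(unsigned long *addr,
					      unsigned long mask_to_set)
	{
		unsigned long val = READ_ONCE(*addr);
		unsigned int spins = 1;

		do {
			unsigned int i;

			if (val & mask_to_set)
				return -EBUSY;
			for (i = 0; i < spins; i++)
				cpu_relax();	/* back off before retry */
			if (spins < 64)
				spins <<= 1;	/* exponential, capped */
		} while (!try_cmpxchg(addr, &val, val | mask_to_set));

		return 0;
	}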
From: Uros Bizjak
> Sent: 23 January 2023 15:05
...
>     Instead of immediately retrying after a CAS operation fails,
>     researchers have found that total system performance can be improved
>     in multiprocessor systems—where many threads constantly update some
>     particular shared variable—if threads that see their CAS fail use
>     exponential backoff—in other words, wait a little before retrying
>     the CAS [2].

Probably, but the real solution is 'don't do that'.

In any case I suspect the cpu_relax() explicitly lets the other
hyperthreading cpu run - which isn't useful at all.

What you actually want is for the cache logic to avoid losing
'exclusive' access to the cache line for enough clocks after a failed
compare+exchange to allow the cpu to re-issue the memory cycle with an
updated value.

You can't do anything about one cpu being starved, but a short delay
there is almost certainly beneficial. (Some hardware cache engineer will
probably say otherwise.)

	David
On 1/23/23, Uros Bizjak <ubizjak@gmail.com> wrote:
> I would like to point out that the wikipedia article on
> compare-and-swap claims [1] that:
> [snip]
> [1] https://en.wikipedia.org/wiki/Compare-and-swap#Overview
> [2] https://arxiv.org/pdf/1305.5800.pdf

I skimmed the paper and I think I got the gist of it, which boils down
to a very common idea for damage control when multiple CPUs modify the
same thing: make some of the contending CPUs bugger off for a time. How
specifically to do it is another matter. The key problem with it is that
CPUs which already failed to make any progress now artificially delay
the entire thing even more.

The fundamental observation is that 2 or more CPUs rolling with an
atomic op on the same cacheline are going to interfere with each other,
and that effect is only going to get worse as you keep adding more of
them. If you can make some of them temporarily bugger off, that is a win
in the sense that the ping-pong is reduced. If this is finely tuned, you
may get an overall win by achieving better throughput while maintaining
acceptable worst-case latency. Otherwise you are starving other
participants, which is not acceptable. tl;dr you are walking a very
tight rope here and I'm not sure it is worth it.

For a toy example one can imagine 2 CPUs running code to do an op n
times. If they both do it in a tight loop there is plenty of ping-pong,
but if one is simply made to wait until the other one is finished with
all the iterations, you register much shorter total real time. But then
again everything happening on the other CPU had to wait so much longer.

I don't know why, but many people love to throw away the value returned
by CAS and instead do another plain access. While I don't see the actual
code used, the description in the paper matches this problematic
behavior, and I guarantee it artificially makes a tight CAS loop appear
worse than it is. My own numbers from the cpu_relax removal commit show
that throughput flattens very early, and all the extra computing power
thrown at it simply goes to waste.

All that said: is a plain cpu_relax in the loop good? Definitely not on
x86-64. Is something more elaborate going to be good? Maybe; it highly
depends on the CPUs at hand and the workload, and if you mess it up some
of them get starved. This in turn can degrade performance elsewhere if
the starved thread is holding a lock waited on by someone else, and so
on. I would definitely not play any games of the sort in the code at
hand.

In that case what about lockref? It definitely is not the fastest it can
be; for one, some provisions could be made for NUMA-aware handling. The
basic idea would be to have competing CPUs aggregate the total change on
the lock and then have one of them make it happen across the
interconnect -- so instead of add/sub 1 you would add/sub n, where n can
very well be quite big.

Personally, for lockref, I think the way to go forward is to use it less
to begin with and to look for ways to convert it into a lock xadd-able
primitive instead. The "doing it less" thing could be done by
implementing an RCU-only mode for select syscalls, which defaults to
opportunistically avoiding refing squat. If the caller manages to do
what it needed to do, that is a massive win; otherwise refs get taken.
This could work well in the common case for syscalls like statx and
access, but would not for open. Frankly I'm surprised something like
this is not already implemented (maybe I missed a major showstopper?).
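The "throw away the value returned by CAS" habit mentioned above,
contrasted with reusing the returned value (illustrative sketches; the
function names are made up):

	/* Anti-pattern: discard cmpxchg's return value and re-read
	 * memory, adding an extra shared-cacheline access per failed
	 * attempt. */
	static inline void set_bit_discard(unsigned long *addr,
					   unsigned long bit)
	{
		unsigned long old;

		do {
			old = READ_ONCE(*addr);
		} while (cmpxchg(addr, old, old | bit) != old);
	}

	/* Better: reuse the value cmpxchg returns, which is exactly what
	 * the try_cmpxchg() form in this patch does for its caller. */
	static inline void set_bit_reuse(unsigned long *addr,
					 unsigned long bit)
	{
		unsigned long old = READ_ONCE(*addr), cur;

		while ((cur = cmpxchg(addr, old, old | bit)) != old)
			old = cur;
	}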
On Mon, Jan 23, 2023 at 7:59 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> In that case what about lockref? It definitely is not the fastest it
> can be; for one, some provisions could be made for NUMA-aware handling.

Note that the contention case for lockref really is fairly made-up. Yes,
yes, it's easy enough to trigger with microbenchmarks. In real life less
so.

> The basic idea would be to have competing CPUs aggregate the total
> change on the lock and then have one of them make it happen across the
> interconnect -- so instead of add/sub 1 you would add/sub n, where n
> can very well be quite big.

No. The use is literally serialized, and done one at a time, and needs
to be synchronous.

We're talking about looking up a *single* path component, while at the
same time making sure that that path component isn't being removed by
another CPU - and there is by definition no other locking going on,
because the lockref *is* the locking.

And it really is one single path component, because this is all very
much the equivalent of doing a list traversal (where the "list" is the
full path, and the list entries are the individual components), and
there is absolutely zero parallelism except for entirely independent
threads doing this in parallel on *different* paths (that may have
common components, of course, with the root and home directory in
particular being shared - but they are "shared" only in the sense that
lots of threads use them entirely independently).

Now, there are ways to do lockref more efficiently, but I suspect

 (a) most of them are about hardware optimizations

 (b) it's very seldom an issue in practice, because the code is actually
     very good at avoiding them already (ie the RCU pathname walk
     already avoids it)

Thanks to RCU pathname lookup, the lockref thing *should* come into play
only when you actually fall out of RCU mode, ie for the last component.
That's a huge win, because that's what avoids the whole "everybody
hammers on the root/pwd dentry counts".

Your benchmark that looked up the *same* last component in parallel all
the time is not really a real load.

As to the actual hardware optimizations, it would be interesting to see
if an LL/SC architecture might do a "native" lockref implementation
better than the "emulated" one that just uses LL/SC for the cmpxchg.

And Intel *has* talked about instruction set extensions for remote
atomics, and using CMPxxXADD *may* be a win. At least it would avoid the
*looping* on cmpxchg, and it *might* avoid bouncing the value too if the
op is actually done using L3 accesses or something.

(But I do also worry that CMPxxXADD might actually make things *worse*
because of any remote access, because caches work and contention is
rare, and in many common cases you might be better off with a local
cache than with a remote one.)

So I'm not claiming lockref is perfect, but I do suspect you're barking
up the wrong tree here. The optimizations you are talking about are
either not realistic in the first place (because serialization and
locking) or have already mostly been done (ie avoiding the op entirely
except for the last component).

HOWEVER. There is one special exception that might be interesting and
that has never been done: 'fstat()' and friends could possibly avoid the
"try_to_unlazy()" even for the last component.

IOW, we might be able to do fstatat() without ever even finalizing the
RCU state of the pathname, and actually looking up the stat information
while still in RCU mode, and instead of doing the actual final
lockref_get_not_dead() to turn an RCU path into a real ref-counted path,
we could just do the "get the stat info, then do read_seqcount_retry()
to verify that the RCU walk _and_ the stat info is still valid".

But I'm not convinced that final complexity would be worth it. It sounds
like a potentially fun and interesting exercise (Al Viro added to
participants so that he can say "No! That's not 'fun and exciting',
that's just crazy!") if somebody really wants to, but see above: the
last path component is very seldom something that sees contention. You
look up the same root/pwd over and over again, but nobody sane looks up
the same full pathname over and over again.

And extending the LOOKUP_RCU window all the way over the stat info
gathering would require a lot of care, and force us to expose the kinds
of things we do for LOOKUP_RCU in namei.c to fs/stat.c too.

So it might be fun and exciting, but it might also be quite nasty, and
it's not entirely clear that it would be worth the pain. But since you
clearly were looking at fstatat() performance, I can only point you at
this case and say "There's _potentially_ upside there".

                Linus
On 1/23/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, Jan 23, 2023 at 7:59 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> The basic idea would be to have competing CPUs aggregate the total
>> change on the lock and then have one of them make it happen across the
>> interconnect -- so instead of add/sub 1 you would add/sub n, where n
>> can very well be quite big.
>
> No. The use is literally serialized, and done one at a time, and needs
> to be synchronous.
>
> We're talking about looking up a *single* path component, while at the
> same time making sure that that path component isn't being removed by
> another CPU - and there is by definition no other locking going on,
> because the lockref *is* the locking.
> [snip]
>
> Thanks to RCU pathname lookup, the lockref thing *should* come into
> play only when you actually fall out of RCU mode, ie for the last
> component. That's a huge win, because that's what avoids the whole
> "everybody hammers on the root/pwd dentry counts".
>
> Your benchmark that looked up the *same* last component in parallel
> all the time is not really a real load.
> [snip2]
>
> So I'm not claiming lockref is perfect, but I do suspect you're
> barking up the wrong tree here. The optimizations you are talking
> about are either not realistic in the first place (because
> serialization and locking) or have already mostly been done (ie
> avoiding the op entirely except for the last component).

That was a minor remark in the spirit of attempting to reduce
ping-pong, only made because of the paper. It was not a serious
proposal, but perhaps I failed to make that clear enough. The more
serious remark was in the last paragraph.

> HOWEVER. There is one special exception that might be interesting and
> that has never been done: 'fstat()' and friends could possibly avoid
> the "try_to_unlazy()" even for the last component.
>
> IOW, we might be able to do fstatat() without ever even finalizing the
> RCU state of the pathname, and actually looking up the stat
> information while still in RCU mode, and instead of doing the actual
> final lockref_get_not_dead() to turn an RCU path into a real
> ref-counted path, we could just do the "get the stat info, then do
> read_seqcount_retry() to verify that the RCU walk _and_ the stat info
> is still valid".

Not mentioning specific routines by name, this is precisely what I
proposed in that e-mail. To quote myself:

> Personally, for lockref, I think the way to go forward is to use it
> less to begin with and to look for ways to convert it into a lock
> xadd-able primitive instead. The "doing it less" thing could be done
> by implementing an RCU-only mode for select syscalls, which defaults
> to opportunistically avoiding refing squat. If the caller manages to
> do what it needed to do, that is a massive win; otherwise refs get
> taken. This could work well in the common case for syscalls like statx
> and access, but would not for open. Frankly I'm surprised something
> like this is not already implemented (maybe I missed a major
> showstopper?).

Now back to your response:

> But I'm not convinced that final complexity would be worth it. It
> sounds like a potentially fun and interesting exercise (Al Viro added
> to participants so that he can say "No! That's not 'fun and exciting',
> that's just crazy!") if somebody really wants to, but see above: the
> last path component is very seldom something that sees contention. You
> look up the same root/pwd over and over again, but nobody sane looks
> up the same full pathname over and over again.
> [snip]
> But since you clearly were looking at fstatat() performance, I can
> only point you at this case and say "There's _potentially_ upside
> there".

So if you strace something like gcc compiling stuff you will find:

- some access calls on shared dirs, for example:

  78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
  78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
  78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0

- same with newfstatat:

  87428 newfstatat(AT_FDCWD, "./arch/x86/include", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
  87428 newfstatat(AT_FDCWD, "./arch/x86/include/generated", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
  87428 newfstatat(AT_FDCWD, "./include", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
  87428 newfstatat(AT_FDCWD, "./arch/x86/include/uapi", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0

- there is also quite a bit of readlink:

  87502 readlink("/tmp", 0x7ffe28847ac0, 1023) = -1 EINVAL (Invalid argument)
  87502 readlink("/tmp/ccTh37oI.s", 0x7ffe28847ac0, 1023) = -1 EINVAL (Invalid argument)

That last bit is glibc doing realpath(). A case can be made for making
realpath into a syscall instead, but I'm not going to flame over it for
the time being. :)

Anyhow, my point is that 2 gcc processes compiling different files do
invoke path lookups with the same terminal path components, and lockref
stuff gets bounced around when that happens.

On another point, on how to end up dealing with lockref less: I have to
note glibc switched fstat(fd, buf) to use newfstatat(fd, "", buf,
AT_EMPTY_PATH) internally. This adds a lot of work single-threaded, as
it forces a path lookup with an associated stac/clac trip; but more
importantly, the kernel no longer relies on the liveness of the dentry
provided by the file object, and uses lockref instead.

Something which allows getting the data produced by newfstatat without
forcing a lookup sounds like a welcome addition. Same goes for statx,
which seems to be the recommended syscall. I'll be writing about this
to -fsdevel some time later.
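For reference, the glibc change being described amounts to the following
userspace equivalence (a sketch; the wrapper name is made up):

	#define _GNU_SOURCE	/* for AT_EMPTY_PATH */
	#include <fcntl.h>
	#include <sys/stat.h>

	/* Sketch of what a glibc-style fstat(fd, st) expands to when
	 * built on newfstatat: an empty path plus AT_EMPTY_PATH makes
	 * the kernel stat the file the descriptor refers to, going
	 * through the path-walk machinery (and, per the description
	 * above, a dentry lockref) rather than just the struct file. */
	static int fstat_via_newfstatat(int fd, struct stat *st)
	{
		return fstatat(fd, "", st, AT_EMPTY_PATH);
	}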
On Mon, Jan 23, 2023 at 4:11 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On another point, on how to end up dealing with lockref less: I have
> to note glibc switched fstat(fd, buf) to use newfstatat(fd, "", buf,
> AT_EMPTY_PATH) internally.

Yeah, that's just plain stupid.

I don't know what the thinking (or rather, lack thereof) was inside
glibc for that one.

                Linus
On 1/24/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On another point, on how to end up dealing with lockref less: I have
>> to note glibc switched fstat(fd, buf) to use newfstatat(fd, "", buf,
>> AT_EMPTY_PATH) internally.
>
> Yeah, that's just plain stupid.
>
> I don't know what the thinking (or rather, lack thereof) was inside
> glibc for that one.

To my reading the fstat system call is now considered legacy, so they
switched to something else. There is a lot to say about the stat
situation (going past the fd vs lookup issue), which I'm going to do on
fsdevel (perhaps next week).

All flaming aside, I think Uros is still waiting for an answer on what
should be done in cmpxchg loops, and someone else will have to provide
it. I stated my case for the best course of action being to do nothing
(for x86-64 anyway), but I'm not an authority figure.
From: Mateusz Guzik
> Sent: 24 January 2023 00:11
...
> So if you strace something like gcc compiling stuff you will find:
>
> - some access calls on shared dirs, for example:
>
>   78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
>   78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
>   78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0

Are they back to back? Which is just stupid. Once per invocation of gcc
would be noise.

> [newfstatat and readlink examples snipped]
>
> That last bit is glibc doing realpath(). A case can be made for making
> realpath into a syscall instead, but I'm not going to flame over it
> for the time being. :)

I remember looking at syscall counts during a (NetBSD) build and
deciding that the dominant system call was actually failed opens from
the compiler searching long -I paths looking for headers.

You can speed things up by copying all the .h files from the fixed -I
path list into a single directory.

	David
On Tue, Jan 24, 2023 at 12:54 AM David Laight <David.Laight@aculab.com> wrote:
>
> I remember looking at syscall counts during a (NetBSD) build and
> deciding that the dominant system call was actually failed opens from
> the compiler searching long -I paths looking for headers.

For some loads, yes.

> You can speed things up by copying all the .h files from the fixed -I
> path list into a single directory.

Or, better yet, do a proper pathname cache that has negative entries in
it too. Like Linux does.

Because it's a common pattern, not just for include paths. You find it
for the regular PATH handling, and for various "look up my config file
in these directories" kinds of thing. So caching not just successful
pathname lookups, but the failed ones too, is actually important.

(Obviously it's still faster to not have to even search at all, and put
everything in one directory and not have a search path at all, but it
does have its own serious downsides, notably the whole "now we have to
keep that union directory in sync with all the source directories")

                Linus
> (Obviously it's still faster to not have to even search at all, and
> put everything in one directory and not have a search path at all, but
> it does have its own serious downsides, notably the whole "now we have
> to keep that union directory in sync with all the source directories")

I've used make to do the copies for a 'day job' project. Just a matter
of identifying the .h files that need copying to obj/include.

	David
On Mon, Jan 23, 2023 at 11:36:26AM -0800, Linus Torvalds wrote:
> HOWEVER. There is one special exception that might be interesting and
> that has never been done: 'fstat()' and friends could possibly avoid
> the "try_to_unlazy()" even for the last component.
> [snip]
> And extending the LOOKUP_RCU window all the way over the stat info
> gathering would require a lot of care, and force us to expose the
> kinds of things we do for LOOKUP_RCU in namei.c to fs/stat.c too.

Interesting... So basically a variant of filename_lookup() that fills
struct kstat instead of doing that to struct path?

Not impossible, but it will take some care; in particular, I'd consider
a non-default ->getattr() as "fall back to non-RCU *now*, we are not
going there" - otherwise it becomes too nasty. STATX_DIOALIGN on block
devices is even worse...

Need to check what struct user_namespace and associated stuff looks like
wrt RCU; OTOH, I wouldn't expect too much headache there, given that
things like may_follow_link() already cover most (if not everything) of
what we'd need in generic_fillattr().

Looks like the main obstacle is how to deal with duplication between
that thing and vfs_getattr{,_nosec}(); one possibility would be to teach
vfs_getattr about the possibility of being called from RCU mode, and
have it used by that new thing. However, I really want to avoid passing
struct nameidata to that sucker.

Hmmmm... OK, so what are the reasons for falling out of RCU mode there,
and how early can we detect those?

1) non-NULL ->getattr() (OK, corresponding new bit in ->i_opflags).
   Can be seen immediately.

2) STATX_DIOALIGN combined with S_ISBLK(). Also can be seen immediately.

3) security_inode_getattr(). That can get unpleasant - it exposes
   apparmor and tomoyo to operating in RCU mode. Until now they had
   been spared that. OTOH, they can just fail with -ECHILD if they see
   whatever indicator of "you are called in RCU mode" we end up
   choosing...

Hell knows - that just might be doable. vfs_getattr() grows a flag (not
sure where - passing the inode separately looks like a plausible
approach, with non-NULL meaning "we are in RCU mode"). If it sees the
indication of being called in RCU mode, it should return -ECHILD ASAP if
it can't handle things. Old callers just pass it NULL for the inode;
filename_statx() (in RCU mode) does something like

	err = vfs_getattr(nd->path, nd->inode, ....);
	if (err == -ECHILD)
		if (!try_to_unlazy(nd))
			sod off and restart in non-RCU mode
		err = vfs_getattr(nd->path, NULL, ....);
	if (!err && (nd->flags & LOOKUP_RCU)) {
		do a trimmed-down variant of try_to_unlazy, just checking
		the seq counts and not grabbing anything.
	}

Oh, wait - there's the rest of the stuff done by complete_walk()...
Ugh... The ->d_weak_revalidate() part obviously means "no RCU for you",
but it needs to be handled for non-RCU modes, and there's the
LOOKUP_IS_SCOPED part...

I'll poke around at some point, but I've a bunch of other pending crap
at the moment. It _might_ be doable without too much ugliness, but then
this is just from a cursory look - there might be dragons.
On Thu, Jan 26, 2023 at 7:54 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Interesting... So basically a variant of filename_lookup() that fills
> struct kstat instead of doing that to struct path?

Well, kinda. You still want the fallback to struct path in case you
can't fill in the kstat without it (because the filesystem needs to do
something under RCU).

So I think what you'd really want is something like a special version of
filename_lookup() that is then given an optimistic "call this function
under RCU before you finalize the path".

And then, *if* the filesystem can do the kstat lookup etc under RCU, it
can just fill it in there, and instead of finalizing the path, we can
just terminate the walk without ever doing the try_to_unlazy() that
legitimizes the path.

I suspect it's fairly close to how we do d_revalidate() for the path
component, except we'd do this not per-component, but as a "for the
final result, let's do one last thing under RCU, and if it succeeded
there, we don't actually need the path at all after all, because we've
already done everything we needed".

I think the only really relevant situation where this is the case is
basically the stat() family of functions, since those don't actually
care about the path after the operation. But there are other cases that
have the

	error = filename_lookup(dfd, filename, lookup_flags, &path, NULL);
	... do something simple once ...
	path_put(&path);

pattern, where we just do the 'put' on the path and don't have any other
use for it. The io_uring xattr code matches that same pattern, for
example - but it may simply not be worth worrying about.

So either some generic "callback under RCU before we finalize it", or we
could make it very specific: "fill in kstat and don't bother finalizing
the path when this flag is set".

> Looks like the main obstacle is how to deal with duplication between
> that thing and vfs_getattr{,_nosec}();

I think we'd need something like a new ->rcu_getattr() function, and
filesystems that can do getattr under RCU would just set that function
pointer.

I dunno. I didn't look at it at all, but it *feels* like you could have
something that just says "if I have this function, and calling it
returns success, then we can just do terminate_walk() without ever doing
try_to_unlazy()".

                Linus
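Reading Al's and Linus's proposals together, the flow under discussion
might look something like the sketch below; every name other than
vfs_getattr() and try_to_unlazy() is hypothetical, and this is a reading
of the thread rather than a real patch:

	/* Hypothetical only: RCU-mode stat flow combining Al Viro's
	 * vfs_getattr fallback idea with Linus's ->rcu_getattr()
	 * suggestion. Neither ->rcu_getattr() nor
	 * seq_counts_still_valid() exists in the kernel. */
	static int filename_statx_rcu(struct nameidata *nd,
				      struct kstat *stat, u32 request_mask)
	{
		int err = -ECHILD;

		if (nd->inode->i_op->rcu_getattr)	/* hypothetical */
			err = nd->inode->i_op->rcu_getattr(&nd->path, stat,
							   request_mask);

		/* If it succeeded, the RCU walk must still be valid. */
		if (!err && !seq_counts_still_valid(nd)) /* made-up helper */
			err = -ECHILD;

		if (err == -ECHILD) {
			/* Fall back: legitimize the path, normal getattr. */
			if (!try_to_unlazy(nd))
				return -ECHILD;	/* restart in ref-walk */
			err = vfs_getattr(&nd->path, stat, request_mask, 0);
		}
		return err;
	}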
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 00fc50d0a640..0c883d6fbd44 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -40,32 +40,30 @@ static inline size_t chunk_size(const struct gen_pool_chunk *chunk)
 	return chunk->end_addr - chunk->start_addr + 1;
 }
 
-static int set_bits_ll(unsigned long *addr, unsigned long mask_to_set)
+static inline int
+set_bits_ll(unsigned long *addr, unsigned long mask_to_set)
 {
-	unsigned long val, nval;
+	unsigned long val = READ_ONCE(*addr);
 
-	nval = *addr;
 	do {
-		val = nval;
 		if (val & mask_to_set)
 			return -EBUSY;
 		cpu_relax();
-	} while ((nval = cmpxchg(addr, val, val | mask_to_set)) != val);
+	} while (!try_cmpxchg(addr, &val, val | mask_to_set));
 
 	return 0;
 }
 
-static int clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear)
+static inline int
+clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear)
 {
-	unsigned long val, nval;
+	unsigned long val = READ_ONCE(*addr);
 
-	nval = *addr;
 	do {
-		val = nval;
 		if ((val & mask_to_clear) != mask_to_clear)
 			return -EBUSY;
 		cpu_relax();
-	} while ((nval = cmpxchg(addr, val, val & ~mask_to_clear)) != val);
+	} while (!try_cmpxchg(addr, &val, val & ~mask_to_clear));
 
 	return 0;
 }