[next,4/5] locking/osq_lock: Optimise per-cpu data accesses.

Message ID: bddb6b00434d4492abca4725c10f8d5a@AcuMS.aculab.com
State: New
Series: locking/osq_lock: Optimisations to osq_lock code

Commit Message

David Laight Dec. 29, 2023, 8:57 p.m. UTC
this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(); it is initialised on the first call.

Signed-off-by: David Laight <david.laight@aculab.com>
---
 kernel/locking/osq_lock.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)
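
For illustration, a minimal side-by-side of the two per-cpu access forms
involved (a sketch only; the generated code shown later in this thread depends
on compiler version and on options such as CONFIG_DEBUG_PREEMPT):

	struct optimistic_spin_node *node;

	/* Existing form: compute the per-cpu address at run time. */
	node = this_cpu_ptr(&osq_node);

	/* Form added by this patch: read a pre-initialised per-cpu 'self'
	 * pointer, a single %gs-relative load on x86-64. */
	node = raw_cpu_read(osq_node.self);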
  

Comments

Waiman Long Dec. 30, 2023, 3:08 a.m. UTC | #1
On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
>   kernel/locking/osq_lock.c | 14 +++++++++-----
>   1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
>    */
>   
>   struct optimistic_spin_node {
> -	struct optimistic_spin_node *next, *prev;
> +	struct optimistic_spin_node *self, *next, *prev;
>   	int locked; /* 1 if lock acquired */
>   	int cpu; /* encoded CPU # + 1 value */
>   };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>   
>   bool osq_lock(struct optimistic_spin_queue *lock)
>   {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
    0x0000000000000029 <+25>:    mov    %rcx,%rdx
    0x000000000000002c <+28>:    add    %gs:0x0(%rip),%rdx        # 0x34 <osq_lock+36>

Which looks pretty optimized to me. An older compiler may generate
more complex code. However, I do have some doubt as to the benefit of
this patch at the expense of making the code a bit more complex.

Cheers,
Longman
  
Ingo Molnar Dec. 30, 2023, 11:09 a.m. UTC | #2
* Waiman Long <longman@redhat.com> wrote:

> On 12/29/23 15:57, David Laight wrote:
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> > 
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> > 
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >   kernel/locking/osq_lock.c | 14 +++++++++-----
> >   1 file changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 9bb3a077ba92..b60b0add0161 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -13,7 +13,7 @@
> >    */
> >   struct optimistic_spin_node {
> > -	struct optimistic_spin_node *next, *prev;
> > +	struct optimistic_spin_node *self, *next, *prev;
> >   	int locked; /* 1 if lock acquired */
> >   	int cpu; /* encoded CPU # + 1 value */
> >   };
> > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> >   bool osq_lock(struct optimistic_spin_queue *lock)
> >   {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> 
> My gcc 11 compiler produces the following x86-64 code:
> 
> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>    0x0000000000000029 <+25>:    mov    %rcx,%rdx
>    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> <osq_lock+36>
> 
> Which looks pretty optimized for me. Maybe older compiler may generate more
> complex code. However, I do have some doubt as to the benefit of this patch
> at the expense of making the code a bit more complex.

GCC-11 is plenty of a look-back window in terms of compiler efficiency: 
latest enterprise distros use GCC-11 or newer, while recent desktop 
distros use GCC-13. Anything older won't matter, because no major 
distribution is going to use new kernels with old compilers.

Thanks,

	Ingo
  
David Laight Dec. 30, 2023, 11:35 a.m. UTC | #3
From: Ingo Molnar
> Sent: 30 December 2023 11:09
> 
> 
> * Waiman Long <longman@redhat.com> wrote:
> 
> > On 12/29/23 15:57, David Laight wrote:
> > > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > > the latter can use an 'offset from register' (%gs for x86-64).
> > >
> > > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > > read with raw_cpu_read(), initialise on first call.
> > >
> > > Signed-off-by: David Laight <david.laight@aculab.com>
> > > ---
> > >   kernel/locking/osq_lock.c | 14 +++++++++-----
> > >   1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > index 9bb3a077ba92..b60b0add0161 100644
> > > --- a/kernel/locking/osq_lock.c
> > > +++ b/kernel/locking/osq_lock.c
> > > @@ -13,7 +13,7 @@
> > >    */
> > >   struct optimistic_spin_node {
> > > -	struct optimistic_spin_node *next, *prev;
> > > +	struct optimistic_spin_node *self, *next, *prev;
> > >   	int locked; /* 1 if lock acquired */
> > >   	int cpu; /* encoded CPU # + 1 value */
> > >   };
> > > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > >   bool osq_lock(struct optimistic_spin_queue *lock)
> > >   {
> > > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >
> > My gcc 11 compiler produces the following x86-64 code:
> >
> > 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> >    0x0000000000000029 <+25>:    mov    %rcx,%rdx
> >    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> > <osq_lock+36>
> >
> > Which looks pretty optimized for me. Maybe older compiler may generate more
> > complex code. However, I do have some doubt as to the benefit of this patch
> > at the expense of making the code a bit more complex.

My changed code is one instruction shorter!
  18:   65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # 20 <osq_lock+0x20>
  1f:   00
                        1c: R_X86_64_PC32       .data..percpu..shared_aligned-0x4
However it might have one less cache line miss.

> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
> latest enterprise distros use GCC-11 or newer, while recent desktop
> distros use GCC-13. Anything older won't matter, because no major
> distribution is going to use new kernels with old compilers.

There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.

I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
  raw_cpu_read(per_cpu_data_base) + offset_to(xxx)

So the old code generates (after the prologue):
  10:   49 89 fd                mov    %rdi,%r13
  13:   49 c7 c4 00 00 00 00    mov    $0x0,%r12
                        16: R_X86_64_32S        .data..percpu..shared_aligned
  1a:   e8 00 00 00 00          callq  1f <osq_lock+0x1f>
                        1b: R_X86_64_PC32       debug_smp_processor_id-0x4
  1f:   89 c0                   mov    %eax,%eax
  21:   48 8b 1c c5 00 00 00    mov    0x0(,%rax,8),%rbx
  28:   00
                        25: R_X86_64_32S        __per_cpu_offset
  29:   e8 00 00 00 00          callq  2e <osq_lock+0x2e>
                        2a: R_X86_64_PC32       debug_smp_processor_id-0x4
  2e:   4c 01 e3                add    %r12,%rbx
  31:   83 c0 01                add    $0x1,%eax
  34:   c7 43 10 00 00 00 00    movl   $0x0,0x10(%rbx)
  3b:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
  42:   89 43 14                mov    %eax,0x14(%rbx)
  45:   41 87 45 00             xchg   %eax,0x0(%r13)

I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.

	David

  
Ingo Molnar Dec. 30, 2023, 8:37 p.m. UTC | #4
* David Laight <David.Laight@ACULAB.COM> wrote:

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>  	struct optimistic_spin_node *prev, *next;
>  	int old;
>  
> -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -		node->cpu = encode_cpu(smp_processor_id());
> +	if (unlikely(!node)) {
> +		int cpu = encode_cpu(smp_processor_id());
> +		node = decode_cpu(cpu);
> +		node->self = node;
> +		node->cpu = cpu;

This whole initialization sequence is suboptimal and needs to be 
cleaned up first: the node->cpu field is constant once initialized, so 
it should be initialized from appropriate init methods, not at runtime in 
osq_lock(), right?

Eliminating that initialization branch is a useful micro-optimization 
in itself for the hot path.
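
A minimal sketch of what such an init method could look like, assuming an
early initcall in osq_lock.c (osq_node_init() is a hypothetical name; no such
initcall exists today):

	static int __init osq_node_init(void)
	{
		int cpu;

		/* node->cpu is constant once set, so fill it in for every
		 * possible CPU up front rather than on the osq_lock() fast path. */
		for_each_possible_cpu(cpu)
			per_cpu(osq_node.cpu, cpu) = encode_cpu(cpu);

		return 0;
	}
	early_initcall(osq_node_init);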

Thanks,

	Ingo
  
Linus Torvalds Dec. 30, 2023, 8:41 p.m. UTC | #5
On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -       struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +       struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

>         struct optimistic_spin_node *prev, *next;
>         int old;
>
> -       if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -               node->cpu = encode_cpu(smp_processor_id());
> +       if (unlikely(!node)) {
> +               int cpu = encode_cpu(smp_processor_id());
> +               node = decode_cpu(cpu);
> +               node->self = node;
> +               node->cpu = cpu;
> +       }

The proper fix here is to not do that silly

        node = this_cpu_ptr(&osq_node);
        ..
        node->next = NULL;

dance at all, but to simply do

        this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.

Yes, you'll eventually end up doing that

        node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

        WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.
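
For concreteness, a rough sketch of that shape of prologue, assuming the rest
of the function stays as it is (a sketch only, not the UNTESTED patch
mentioned at the end of this mail, which is attached to the original and not
shown on this page):

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		struct optimistic_spin_node *node, *prev;
		int curr = encode_cpu(smp_processor_id());
		int old;

		/* Single %gs-relative stores; no per-cpu pointer needed yet. */
		this_cpu_write(osq_node.next, NULL);
		this_cpu_write(osq_node.locked, 0);
		this_cpu_write(osq_node.cpu, curr);

		old = atomic_xchg(&lock->tail, curr);
		if (old == OSQ_UNLOCKED_VAL)
			return true;

		/* Only the contended path needs the actual node pointer. */
		node = this_cpu_ptr(&osq_node);
		prev = decode_cpu(old);
		node->prev = prev;

		/* ... remainder of the slow path unchanged ... */
	}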

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

        prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

  #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:

        subl    $1, %eax        #, cpu_nr
...
        cltq
        addq    __per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

        addq    __per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

        unsigned int cpu_nr = encoded_cpu_val;

        return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?
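
As a concrete sketch of that suggestion (a proposal only, not the current
code), the helpers could become:

	#define OSQ_UNLOCKED_VAL	(-1)
	#define IS_OSQ_UNLOCKED(x)	((int)(x) < 0)

	/* CPU numbers are stored directly: no +1 on encode, no -1 on decode. */
	static inline int encode_cpu(int cpu_nr)
	{
		return cpu_nr;
	}

	static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
	{
		/* Unsigned index avoids the sign-extending cltq. */
		unsigned int cpu_nr = encoded_cpu_val;

		return per_cpu_ptr(&osq_node, cpu_nr);
	}

Callers would then test IS_OSQ_UNLOCKED() instead of comparing against zero,
and, as David notes further down the thread, this only works if the queue tail
is explicitly initialised to -1 rather than relying on zeroed memory.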

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

                 Linus
  
Linus Torvalds Dec. 30, 2023, 8:59 p.m. UTC | #6
On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.

Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory.

At least the memory location that 'this_cpu_ptr()' accesses is
slightly more likely to be hot (and is right next to the cpu number,
iirc).

That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness. I don't quite know how to do it, but this:

  -#define per_cpu_offset(x) (__per_cpu_offset[x])
  +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])

at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.

And gcc still generates

        subl    $1, %eax        #, cpu_nr
        addq    __per_cpu_offset(,%rax,8), %rcx

instead of just doing

        addq    __per_cpu_offset-8(,%rax,8), %rcx

because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.

Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary.

I don't think we have any reasonable way to tell the compiler that the
register output of our xchg() inline asm has the upper 32 bits clear.

              Linus
  
David Laight Dec. 30, 2023, 10:47 p.m. UTC | #7
From: Ingo Molnar
> Sent: 30 December 2023 20:38
> 
> * David Laight <David.Laight@ACULAB.COM> wrote:
> 
> >  bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> > -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >  	struct optimistic_spin_node *prev, *next;
> >  	int old;
> >
> > -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > -		node->cpu = encode_cpu(smp_processor_id());
> > +	if (unlikely(!node)) {
> > +		int cpu = encode_cpu(smp_processor_id());
> > +		node = decode_cpu(cpu);
> > +		node->self = node;
> > +		node->cpu = cpu;
> 
> This whole initialization sequence is suboptimal and needs to be
> cleaned up first: the node->cpu field is constant once initialized, so
> it should be initialized from appropriate init methods, not runtime in
> osq_lock(), right?

I thought that as well, but there would need to be a list of 'init'
functions for the per-cpu data. I didn't spot one.

	David

  
Waiman Long Dec. 31, 2023, 3:04 a.m. UTC | #8
On 12/30/23 06:35, David Laight wrote:
> From: Ingo Molnar
>> Sent: 30 December 2023 11:09
>>
>>
>> * Waiman Long <longman@redhat.com> wrote:
>>
>>> On 12/29/23 15:57, David Laight wrote:
>>>> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
>>>> the latter can use an 'offset from register' (%gs for x86-64).
>>>>
>>>> Add a 'self' field to 'struct optimistic_spin_node' that can be
>>>> read with raw_cpu_read(), initialise on first call.
>>>>
>>>> Signed-off-by: David Laight <david.laight@aculab.com>
>>>> ---
>>>>    kernel/locking/osq_lock.c | 14 +++++++++-----
>>>>    1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>>> index 9bb3a077ba92..b60b0add0161 100644
>>>> --- a/kernel/locking/osq_lock.c
>>>> +++ b/kernel/locking/osq_lock.c
>>>> @@ -13,7 +13,7 @@
>>>>     */
>>>>    struct optimistic_spin_node {
>>>> -	struct optimistic_spin_node *next, *prev;
>>>> +	struct optimistic_spin_node *self, *next, *prev;
>>>>    	int locked; /* 1 if lock acquired */
>>>>    	int cpu; /* encoded CPU # + 1 value */
>>>>    };
>>>> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>>>    bool osq_lock(struct optimistic_spin_queue *lock)
>>>>    {
>>>> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>>> My gcc 11 compiler produces the following x86-64 code:
>>>
>>> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>     0x0000000000000029 <+25>:    mov    %rcx,%rdx
>>>     0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
>>> <osq_lock+36>
>>>
>>> Which looks pretty optimized for me. Maybe older compiler may generate more
>>> complex code. However, I do have some doubt as to the benefit of this patch
>>> at the expense of making the code a bit more complex.
> My changed code is one instruction shorter!
>    18:   65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # 20 <osq_lock+0x20>
>    1f:   00
>                          1c: R_X86_64_PC32       .data..percpu..shared_aligned-0x4
> However it might have one less cache line miss.
>
>> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
>> latest enterprise distros use GCC-11 or newer, while recent desktop
>> distros use GCC-13. Anything older won't matter, because no major
>> distribution is going to use new kernels with old compilers.
> There must be a difference in the header files as well.
> Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
> But maybe based on some config option.
>
> I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
> which necessitates an array lookup (indexed by cpu number).
> Whereas I think you are seeing it implemented as
>    raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
>
> So the old code generates (after the prologue):
>    10:   49 89 fd                mov    %rdi,%r13
>    13:   49 c7 c4 00 00 00 00    mov    $0x0,%r12
>                          16: R_X86_64_32S        .data..percpu..shared_aligned
>    1a:   e8 00 00 00 00          callq  1f <osq_lock+0x1f>
>                          1b: R_X86_64_PC32       debug_smp_processor_id-0x4
>    1f:   89 c0                   mov    %eax,%eax
>    21:   48 8b 1c c5 00 00 00    mov    0x0(,%rax,8),%rbx
>    28:   00
>                          25: R_X86_64_32S        __per_cpu_offset
>    29:   e8 00 00 00 00          callq  2e <osq_lock+0x2e>
>                          2a: R_X86_64_PC32       debug_smp_processor_id-0x4
>    2e:   4c 01 e3                add    %r12,%rbx
>    31:   83 c0 01                add    $0x1,%eax
>    34:   c7 43 10 00 00 00 00    movl   $0x0,0x10(%rbx)
>    3b:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
>    42:   89 43 14                mov    %eax,0x14(%rbx)
>    45:   41 87 45 00             xchg   %eax,0x0(%r13)
>
> I was also surprised that smp_processor_id() is a real function rather
> than an offset from %gs.

I have looked up the definition of this_cpu_ptr() and gotten the following 
results:

this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()

/*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
  */
#define arch_raw_cpu_ptr(ptr)                           \
({                                                      \
         unsigned long tcp_ptr__;                        \
         asm ("add " __percpu_arg(1) ", %0"              \
              : "=r" (tcp_ptr__)                         \
              : "m" (this_cpu_off), "0" (ptr));          \
         (typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
})

The presence of debug_smp_processor_id in your compiled code is likely 
due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.

#ifdef CONFIG_DEBUG_PREEMPT
   extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif

I don't have that config entry in my kernel config and so I only get 2 
instructions for this_cpu_ptr(). We are not going to optimize the code 
specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Cheers,
Longman
  
David Laight Dec. 31, 2023, 10:36 a.m. UTC | #9
From: Waiman Long
> Sent: 31 December 2023 03:04
....
> The presence of debug_smp_processor_id in your compiled code is likely
> due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.
> 
> #ifdef CONFIG_DEBUG_PREEMPT
>    extern unsigned int debug_smp_processor_id(void);
> # define smp_processor_id() debug_smp_processor_id()
> #else
> # define smp_processor_id() __smp_processor_id()
> #endif
> 
> I don't have that config entry in my kernel config and so I only get 2
> instructions for this_cpu_ptr(). We are not going to optimize the code
> specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Yes, I discovered that late last night.
I've no idea why it is set.
It might even be inherited from the Ubuntu 18.04 (I think) .config
I started with (mostly removing extra modules to reduce compile
time and the size of uramdisk).

I'll look at the patches again.
Saving node->prev->cpu as node->prev_cpu will definitely save
the cache-line bounce.
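
For illustration only, one way to express that idea ('prev_cpu' is a
hypothetical field name here, not something shown in the patches on this
page):

	struct optimistic_spin_node {
		struct optimistic_spin_node *next, *prev;
		int locked;	/* 1 if lock acquired */
		int cpu;	/* encoded CPU # + 1 value */
		int prev_cpu;	/* copy of prev->cpu, readable without touching
				 * the previous CPU's node cache line */
	};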

	David

  
David Laight Dec. 31, 2023, 11:41 a.m. UTC | #10
From: Linus Torvalds
> Sent: 30 December 2023 20:41
> 
> On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
> >
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> 
> No, this is horrible.
> 
> The problem isn't the "this_cpu_ptr()", it's the rest of the code.
> 
> >  bool osq_lock(struct optimistic_spin_queue *lock)
> >  {
> > -       struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > +       struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> 
> No. Both of these are crap.
> 
> >         struct optimistic_spin_node *prev, *next;
> >         int old;
> >
> > -       if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > -               node->cpu = encode_cpu(smp_processor_id());
> > +       if (unlikely(!node)) {
> > +               int cpu = encode_cpu(smp_processor_id());
> > +               node = decode_cpu(cpu);
> > +               node->self = node;
> > +               node->cpu = cpu;
> > +       }
> 
> The proper fix here is to not do that silly
> 
>         node = this_cpu_ptr(&osq_node);
>         ..
>         node->next = NULL;
> 
> dance at all, but to simply do
> 
>         this_cpu_write(osq_node.next, NULL);
> 
> in the first place. That makes the whole thing just a single store off
> the segment descriptor.

Is the equivalent true (ie offset from fixed register) for all SMP archs?
Or do some have to do something rather more complicated?

> Yes, you'll eventually end up doing that
> 
>         node = this_cpu_ptr(&osq_node);

For some reason I've had CONFIG_DEBUG_PREEMPT set. I don't remember
setting it, and can't imagine why I might have.
Best guess is it was inherited from the Ubuntu .config I started with.
In any case it changes smp_processor_id() into a real function
in order to check that preemption is disabled.
I'd guess something like BUG_ON(!raw_cpu_read(preempt_disable_count))
would be faster and more obvious!

> thing because it then wants to use that raw pointer to do
> 
>         WRITE_ONCE(prev->next, node);
> 
> but that's a separate issue and still does not make it worth it to
> create a pointless self-pointer.

I could claim that loading it is one instruction shorter and that
if the cache line containing 'node' is needed and 'pcpu_hot'
is (unexpectedly) not cached it saves a cache line load.
But I probably won't!

> 
> Btw, if you *really* want to solve that separate issue, then make the
> optimistic_spin_node struct not contain the pointers at all, but the
> CPU numbers, and then turn those numbers into the pointers the exact
> same way it does for the "lock->tail" thing, ie doing that whole
> 
>         prev = decode_cpu(old);
> 
> dance. That *may* then result in avoiding turning them into pointers
> at all in some cases.

Don't think so.
Pretty much all the uses need to dereference the next/prev pointers.

> Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
> be -1 instead, and add something like
> 
>   #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

I did think about that (but not for these patches).
But it is a lot more dangerous because it absolutely requires
the structure be correctly initialised, not just be all zero.
That might show up some very strange bugs.

> and that would then avoid the +1 / -1 games in encoding/decoding the
> CPU numbers. It causes silly code generated like this:
> 
>         subl    $1, %eax        #, cpu_nr
> ...
>         cltq
>         addq    __per_cpu_offset(,%rax,8), %rcx
> 
> which seems honestly stupid. The cltq is there for sign-extension,
> which is because all these things are "int", and the "subl" will
> zero-extend to 64-bit, not sign-extend.

Changing all the variables to 'unsigned int' will remove the sign
extension and, after the decrement, gcc will know the high bits
are zero so shouldn't need to zero extend.

> At that point, I think gcc might be able to just generate
> 
>         addq    __per_cpu_offset-8(,%rax,8), %rcx

That would need the decrement to be 64bit.
A quick test failed to make that work.
Probably (as you mentioned in the next email) because gcc
doesn't know that the high bits from atomic_xchg() are zero.

> but honestly, I think it would be nicer to just have decode_cpu() do
> 
>         unsigned int cpu_nr = encoded_cpu_val;
> 
>         return per_cpu_ptr(&osq_node, cpu_nr);
> 
> and not have the -1/+1 at all.

Unless you can persuade gcc that the high bits from atomic_xchg()
are zero that will require a zero extend.

	David

  
David Laight Dec. 31, 2023, 11:56 a.m. UTC | #11
From: Linus Torvalds
> Sent: 30 December 2023 20:59
> 
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
> 
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
> 
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).

I was only going to access the 'self' field in code that required
the 'node' cache line be present.

> 
> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
> 
>   -#define per_cpu_offset(x) (__per_cpu_offset[x])
>   +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
> 
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.

Doing the conversion using arithmetic might help, so:
		__per_cpu_offset[(x) + 0u]

> And gcc still generates
> 
>         subl    $1, %eax        #, cpu_nr
>         addq    __per_cpu_offset(,%rax,8), %rcx
> 
> instead of just doing
> 
>         addq    __per_cpu_offset-8(,%rax,8), %rcx
> 
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.

Not only that, you need to do the 'subl' after converting to 64 bits.
Otherwise the wrong location is read were cpu_nr to be zero.
I've tried that - but it still failed.

> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
> 
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.

It could be done for a 32bit unsigned xchg() - just make the return
type unsigned 64bit.
But that won't work for the signed exchange - and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?

	David

  

Patch

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@ 
  */
 
 struct optimistic_spin_node {
-	struct optimistic_spin_node *next, *prev;
+	struct optimistic_spin_node *self, *next, *prev;
 	int locked; /* 1 if lock acquired */
 	int cpu; /* encoded CPU # + 1 value */
 };
@@ -93,12 +93,16 @@  osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
 	struct optimistic_spin_node *prev, *next;
 	int old;
 
-	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
-		node->cpu = encode_cpu(smp_processor_id());
+	if (unlikely(!node)) {
+		int cpu = encode_cpu(smp_processor_id());
+		node = decode_cpu(cpu);
+		node->self = node;
+		node->cpu = cpu;
+	}
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -222,7 +226,7 @@  void osq_unlock(struct optimistic_spin_queue *lock)
 	/*
 	 * Second most likely case.
 	 */
-	node = this_cpu_ptr(&osq_node);
+	node = raw_cpu_read(osq_node.self);
 	next = xchg(&node->next, NULL);
 	if (next) {
 		WRITE_ONCE(next->locked, 1);