[V2] asm-generic: ticket-lock: Optimize arch_spin_value_unlocked

Message ID 20230731023308.3748432-1-guoren@kernel.org
State New
Series [V2] asm-generic: ticket-lock: Optimize arch_spin_value_unlocked

Commit Message

Guo Ren July 31, 2023, 2:33 a.m. UTC
  From: Guo Ren <guoren@linux.alibaba.com>

arch_spin_value_unlocked() would cause an unnecessary memory
access to the contended value. Although it won't cause a significant
performance gap on most architectures, the arch_spin_value_unlocked()
argument already contains enough information. Thus, remove the
unnecessary atomic_read() in arch_spin_value_unlocked().

The caller of arch_spin_value_unlocked() could benefit from this
change. Currently, the only caller is lockref.

Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
---
Changelog
V2:
 - Fixup commit log with Waiman advice.
 - Add Waiman comment in the commit msg.
---
 include/asm-generic/spinlock.h | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)
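
For context, here is a minimal userspace sketch (not part of the patch,
names are illustrative) of what the check relies on, assuming the
asm-generic ticket-lock encoding: the "next" ticket in the upper 16 bits
and the "owner" ticket in the lower 16 bits. The lock is free exactly
when the two halves match, which can be decided from a plain copy of the
32-bit value with no further memory access:

#include <stdint.h>
#include <stdbool.h>

/* Stand-in for arch_spinlock_t.counter: next ticket | owner ticket. */
static inline bool ticket_value_unlocked(uint32_t val)
{
	return (val >> 16) == (val & 0xffff);	/* next == owner */
}

/* e.g. 0x00050005 is unlocked (next == owner); 0x00060005 is held. */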
  

Comments

Will Deacon July 31, 2023, 9:56 a.m. UTC | #1
On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
> 
> The arch_spin_value_unlocked would cause an unnecessary memory
> access to the contended value. Although it won't cause a significant
> performance gap in most architectures, the arch_spin_value_unlocked
> argument contains enough information. Thus, remove unnecessary
> atomic_read in arch_spin_value_unlocked().
> 
> The caller of arch_spin_value_unlocked() could benefit from this
> change. Currently, the only caller is lockref.
> 
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: David Laight <David.Laight@ACULAB.COM>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> ---
> Changelog
> V2:
>  - Fixup commit log with Waiman advice.
>  - Add Waiman comment in the commit msg.
> ---
>  include/asm-generic/spinlock.h | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

It would be nice to see some numbers showing this actually helps, though.

Will
  
maobibo July 31, 2023, 11:27 a.m. UTC | #2
On 2023/7/31 10:33, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
> 
> The arch_spin_value_unlocked would cause an unnecessary memory
> access to the contended value. Although it won't cause a significant
> performance gap in most architectures, the arch_spin_value_unlocked
> argument contains enough information. Thus, remove unnecessary
> atomic_read in arch_spin_value_unlocked().
> 
> The caller of arch_spin_value_unlocked() could benefit from this
> change. Currently, the only caller is lockref.
> 
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: David Laight <David.Laight@ACULAB.COM>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> ---
> Changelog
> V2:
>  - Fixup commit log with Waiman advice.
>  - Add Waiman comment in the commit msg.
> ---
>  include/asm-generic/spinlock.h | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
> index fdfebcb050f4..90803a826ba0 100644
> --- a/include/asm-generic/spinlock.h
> +++ b/include/asm-generic/spinlock.h
> @@ -68,11 +68,18 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>  	smp_store_release(ptr, (u16)val + 1);
>  }
>  
> +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> +{
> +	u32 val = lock.counter;
> +
> +	return ((val >> 16) == (val & 0xffff));
> +}
I do not know much about locking; will it be cached in a register,
without another memory access like READ_ONCE() or atomic_read()?

Regards
Bibo Mao
> +
>  static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>  {
> -	u32 val = atomic_read(lock);
> +	arch_spinlock_t val = READ_ONCE(*lock);
>  
> -	return ((val >> 16) != (val & 0xffff));
> +	return !arch_spin_value_unlocked(val);
>  }
>  
>  static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
> @@ -82,11 +89,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>  	return (s16)((val >> 16) - (val & 0xffff)) > 1;
>  }
>  
> -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> -{
> -	return !arch_spin_is_locked(&lock);
> -}
> -
>  #include <asm/qrwlock.h>
>  
>  #endif /* __ASM_GENERIC_SPINLOCK_H */
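
A minimal userspace sketch of the distinction being asked about (the
type and names below are simplified stand-ins, not the kernel
definitions): the by-value helper only does arithmetic on the copy it
was handed, which the compiler can keep in a register, while the
pointer-based helper has to load the shared lock word once:

#include <stdint.h>

typedef struct { uint32_t counter; } fake_spinlock_t;

/* Operates purely on its by-value argument: no memory access needed. */
static inline int fake_value_unlocked(fake_spinlock_t lock)
{
	uint32_t val = lock.counter;

	return (val >> 16) == (val & 0xffff);
}

/* Operates on the shared lock word: must read it from memory once. */
static inline int fake_is_locked(volatile fake_spinlock_t *lock)
{
	fake_spinlock_t val = *lock;	/* stands in for READ_ONCE(*lock) */

	return !fake_value_unlocked(val);
}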
  
Waiman Long July 31, 2023, 3:16 p.m. UTC | #3
On 7/30/23 22:33, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The arch_spin_value_unlocked would cause an unnecessary memory
> access to the contended value. Although it won't cause a significant
> performance gap in most architectures, the arch_spin_value_unlocked
> argument contains enough information. Thus, remove unnecessary
> atomic_read in arch_spin_value_unlocked().
>
> The caller of arch_spin_value_unlocked() could benefit from this
> change. Currently, the only caller is lockref.
>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: David Laight <David.Laight@ACULAB.COM>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> ---
> Changelog
> V2:
>   - Fixup commit log with Waiman advice.
>   - Add Waiman comment in the commit msg.
> ---
>   include/asm-generic/spinlock.h | 16 +++++++++-------
>   1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
> index fdfebcb050f4..90803a826ba0 100644
> --- a/include/asm-generic/spinlock.h
> +++ b/include/asm-generic/spinlock.h
> @@ -68,11 +68,18 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>   	smp_store_release(ptr, (u16)val + 1);
>   }
>   
> +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> +{
> +	u32 val = lock.counter;
> +
> +	return ((val >> 16) == (val & 0xffff));
> +}
> +
>   static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>   {
> -	u32 val = atomic_read(lock);
> +	arch_spinlock_t val = READ_ONCE(*lock);
>   
> -	return ((val >> 16) != (val & 0xffff));
> +	return !arch_spin_value_unlocked(val);
>   }
>   
>   static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
> @@ -82,11 +89,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>   	return (s16)((val >> 16) - (val & 0xffff)) > 1;
>   }
>   
> -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> -{
> -	return !arch_spin_is_locked(&lock);
> -}
> -
>   #include <asm/qrwlock.h>
>   
>   #endif /* __ASM_GENERIC_SPINLOCK_H */

I am fine with the current change. However, a modern optimizing compiler
should be able to avoid the redundant memory read anyway. So this patch
may not have an impact from a performance point of view.

Acked-by: Waiman Long <longman@redhat.com>
  
maobibo Aug. 1, 2023, 1:37 a.m. UTC | #4
On 2023/7/31 23:16, Waiman Long wrote:
> On 7/30/23 22:33, guoren@kernel.org wrote:
>> From: Guo Ren <guoren@linux.alibaba.com>
>>
>> The arch_spin_value_unlocked would cause an unnecessary memory
>> access to the contended value. Although it won't cause a significant
>> performance gap in most architectures, the arch_spin_value_unlocked
>> argument contains enough information. Thus, remove unnecessary
>> atomic_read in arch_spin_value_unlocked().
>>
>> The caller of arch_spin_value_unlocked() could benefit from this
>> change. Currently, the only caller is lockref.
>>
>> Signed-off-by: Guo Ren <guoren@kernel.org>
>> Cc: Waiman Long <longman@redhat.com>
>> Cc: David Laight <David.Laight@ACULAB.COM>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>> ---
>> Changelog
>> V2:
>>   - Fixup commit log with Waiman advice.
>>   - Add Waiman comment in the commit msg.
>> ---
>>   include/asm-generic/spinlock.h | 16 +++++++++-------
>>   1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
>> index fdfebcb050f4..90803a826ba0 100644
>> --- a/include/asm-generic/spinlock.h
>> +++ b/include/asm-generic/spinlock.h
>> @@ -68,11 +68,18 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>>       smp_store_release(ptr, (u16)val + 1);
>>   }
>>   +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>> +{
>> +    u32 val = lock.counter;
>> +
>> +    return ((val >> 16) == (val & 0xffff));
>> +}
>> +
>>   static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>>   {
>> -    u32 val = atomic_read(lock);
>> +    arch_spinlock_t val = READ_ONCE(*lock);
>>   -    return ((val >> 16) != (val & 0xffff));
>> +    return !arch_spin_value_unlocked(val);
>>   }
>>     static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>> @@ -82,11 +89,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>>       return (s16)((val >> 16) - (val & 0xffff)) > 1;
>>   }
>>   -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>> -{
>> -    return !arch_spin_is_locked(&lock);
>> -}
>> -
>>   #include <asm/qrwlock.h>
>>     #endif /* __ASM_GENERIC_SPINLOCK_H */
> 
> I am fine with the current change. However, modern optimizing compiler should be able to avoid the redundant memory read anyway. So this patch may not have an impact from the performance point of view.

arch_spin_value_unlocked() is called from lockref like this:

#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
        int retry = 100;                                                        \
        struct lockref old;                                                     \
        BUILD_BUG_ON(sizeof(old) != 8);                                         \
        old.lock_count = READ_ONCE(lockref->lock_count);                        \
        while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \

                struct lockref new = old;                                       \
                CODE                                                            \
                if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
                                                 &old.lock_count,               \
                                                 new.lock_count))) {            \
                        SUCCESS;                                                \
                }                                                               \
                if (!--retry)                                                   \
                        break;                                                  \
        }                                                                       \
} while (0)

With a modern optimizing compiler, is it possible that the old value of
old.lock.rlock.raw_lock is cached in a register, despite
try_cmpxchg64_relaxed() modifying the memory of old.lock_count with a
new value?

Regards
Bibo Mao

> 
> Acked-by: Waiman Long <longman@redhat.com>
  
Waiman Long Aug. 1, 2023, 1:59 a.m. UTC | #5
On 7/31/23 21:37, bibo mao wrote:
>
> On 2023/7/31 23:16, Waiman Long wrote:
>> On 7/30/23 22:33, guoren@kernel.org wrote:
>>> From: Guo Ren <guoren@linux.alibaba.com>
>>>
>>> The arch_spin_value_unlocked would cause an unnecessary memory
>>> access to the contended value. Although it won't cause a significant
>>> performance gap in most architectures, the arch_spin_value_unlocked
>>> argument contains enough information. Thus, remove unnecessary
>>> atomic_read in arch_spin_value_unlocked().
>>>
>>> The caller of arch_spin_value_unlocked() could benefit from this
>>> change. Currently, the only caller is lockref.
>>>
>>> Signed-off-by: Guo Ren <guoren@kernel.org>
>>> Cc: Waiman Long <longman@redhat.com>
>>> Cc: David Laight <David.Laight@ACULAB.COM>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>>> ---
>>> Changelog
>>> V2:
>>>    - Fixup commit log with Waiman advice.
>>>    - Add Waiman comment in the commit msg.
>>> ---
>>>    include/asm-generic/spinlock.h | 16 +++++++++-------
>>>    1 file changed, 9 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
>>> index fdfebcb050f4..90803a826ba0 100644
>>> --- a/include/asm-generic/spinlock.h
>>> +++ b/include/asm-generic/spinlock.h
>>> @@ -68,11 +68,18 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>>>        smp_store_release(ptr, (u16)val + 1);
>>>    }
>>>    +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>>> +{
>>> +    u32 val = lock.counter;
>>> +
>>> +    return ((val >> 16) == (val & 0xffff));
>>> +}
>>> +
>>>    static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>>>    {
>>> -    u32 val = atomic_read(lock);
>>> +    arch_spinlock_t val = READ_ONCE(*lock);
>>>    -    return ((val >> 16) != (val & 0xffff));
>>> +    return !arch_spin_value_unlocked(val);
>>>    }
>>>      static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>>> @@ -82,11 +89,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>>>        return (s16)((val >> 16) - (val & 0xffff)) > 1;
>>>    }
>>>    -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>>> -{
>>> -    return !arch_spin_is_locked(&lock);
>>> -}
>>> -
>>>    #include <asm/qrwlock.h>
>>>      #endif /* __ASM_GENERIC_SPINLOCK_H */
>> I am fine with the current change. However, modern optimizing compiler should be able to avoid the redundant memory read anyway. So this patch may not have an impact from the performance point of view.
> arch_spin_value_unlocked is called with lockref like this:
>
> #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
>          int retry = 100;                                                        \
>          struct lockref old;                                                     \
>          BUILD_BUG_ON(sizeof(old) != 8);                                         \
>          old.lock_count = READ_ONCE(lockref->lock_count);                        \
>          while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
>
> With modern optimizing compiler, Is it possible that old value of
> old.lock.rlock.raw_lock is cached in register, despite that try_cmpxchg64_relaxed
> modifies the memory of old.lock_count with new value?

What I meant is that the call to arch_spin_value_unlocked() as it is 
today will not generate 2 memory reads of the same location with or 
without the patch. Of course, a new memory read will be needed after a 
failed cmpxchg().

Cheers,
Longman
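
A userspace sketch of that retry pattern, for illustration only: the
struct layout and names are simplified stand-ins for lockref, and
__atomic_compare_exchange_n() plays the role of try_cmpxchg64_relaxed()
here because, like it, the builtin writes the value it found in memory
back into "old" on failure, so the next unlocked check always sees
fresh data:

#include <stdint.h>
#include <stdbool.h>

struct fake_lockref {
	uint64_t lock_count;	/* low 32 bits: lock word, high 32 bits: count */
};

static inline bool fake_value_unlocked(uint32_t lockword)
{
	return (lockword >> 16) == (lockword & 0xffff);
}

static bool fake_lockref_get(struct fake_lockref *ref)
{
	uint64_t old = __atomic_load_n(&ref->lock_count, __ATOMIC_RELAXED);

	for (int retry = 100; retry; retry--) {
		if (!fake_value_unlocked((uint32_t)old))
			break;				/* fall back to taking the lock */

		uint64_t new = old + (1ULL << 32);	/* count++ in the high half */

		if (__atomic_compare_exchange_n(&ref->lock_count, &old, new,
						false, __ATOMIC_RELAXED,
						__ATOMIC_RELAXED))
			return true;
		/* Failure: "old" now holds the value the cmpxchg read. */
	}
	return false;
}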
  
maobibo Aug. 1, 2023, 4:05 a.m. UTC | #6
On 2023/8/1 09:59, Waiman Long wrote:
> On 7/31/23 21:37, bibo mao wrote:
>>
>> On 2023/7/31 23:16, Waiman Long wrote:
>>> On 7/30/23 22:33, guoren@kernel.org wrote:
>>>> From: Guo Ren <guoren@linux.alibaba.com>
>>>>
>>>> The arch_spin_value_unlocked would cause an unnecessary memory
>>>> access to the contended value. Although it won't cause a significant
>>>> performance gap in most architectures, the arch_spin_value_unlocked
>>>> argument contains enough information. Thus, remove unnecessary
>>>> atomic_read in arch_spin_value_unlocked().
>>>>
>>>> The caller of arch_spin_value_unlocked() could benefit from this
>>>> change. Currently, the only caller is lockref.
>>>>
>>>> Signed-off-by: Guo Ren <guoren@kernel.org>
>>>> Cc: Waiman Long <longman@redhat.com>
>>>> Cc: David Laight <David.Laight@ACULAB.COM>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>>>> ---
>>>> Changelog
>>>> V2:
>>>>    - Fixup commit log with Waiman advice.
>>>>    - Add Waiman comment in the commit msg.
>>>> ---
>>>>    include/asm-generic/spinlock.h | 16 +++++++++-------
>>>>    1 file changed, 9 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
>>>> index fdfebcb050f4..90803a826ba0 100644
>>>> --- a/include/asm-generic/spinlock.h
>>>> +++ b/include/asm-generic/spinlock.h
>>>> @@ -68,11 +68,18 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>>>>        smp_store_release(ptr, (u16)val + 1);
>>>>    }
>>>>    +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>>>> +{
>>>> +    u32 val = lock.counter;
>>>> +
>>>> +    return ((val >> 16) == (val & 0xffff));
>>>> +}
>>>> +
>>>>    static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>>>>    {
>>>> -    u32 val = atomic_read(lock);
>>>> +    arch_spinlock_t val = READ_ONCE(*lock);
>>>>    -    return ((val >> 16) != (val & 0xffff));
>>>> +    return !arch_spin_value_unlocked(val);
>>>>    }
>>>>      static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>>>> @@ -82,11 +89,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>>>>        return (s16)((val >> 16) - (val & 0xffff)) > 1;
>>>>    }
>>>>    -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>>>> -{
>>>> -    return !arch_spin_is_locked(&lock);
>>>> -}
>>>> -
>>>>    #include <asm/qrwlock.h>
>>>>      #endif /* __ASM_GENERIC_SPINLOCK_H */
>>> I am fine with the current change. However, modern optimizing compiler should be able to avoid the redundant memory read anyway. So this patch may not have an impact from the performance point of view.
>> arch_spin_value_unlocked is called with lockref like this:
>>
>> #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
>>          int retry = 100;                                                        \
>>          struct lockref old;                                                     \
>>          BUILD_BUG_ON(sizeof(old) != 8);                                         \
>>          old.lock_count = READ_ONCE(lockref->lock_count);                        \
>>          while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
>>
>> With modern optimizing compiler, Is it possible that old value of
>> old.lock.rlock.raw_lock is cached in register, despite that try_cmpxchg64_relaxed
>> modifies the memory of old.lock_count with new value?
> 
> What I meant is that the call to arch_spin_value_unlocked() as it is today will not generate 2 memory reads of the same location with or without the patch. Of course, a new memory read will be needed after a failed cmpxchg().
Yes, that solves the issue: a new memory read is done after a failed cmpxchg().

Regards
Bibo Mao
> 
> Cheers,
> Longman
  
Mateusz Guzik Aug. 7, 2023, 6:36 p.m. UTC | #7
On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
> 
> The arch_spin_value_unlocked would cause an unnecessary memory
> access to the contended value. Although it won't cause a significant
> performance gap in most architectures, the arch_spin_value_unlocked
> argument contains enough information. Thus, remove unnecessary
> atomic_read in arch_spin_value_unlocked().
> 
> The caller of arch_spin_value_unlocked() could benefit from this
> change. Currently, the only caller is lockref.
> 

Have you verified you are getting an extra memory access from this in
lockref? What architecture is it?

I have no opinion about the patch itself, I will note though that the
argument to the routine is *not* the actual memory-shared lockref,
instead it's something from the local copy obtained with READ_ONCE
from the real thing. So I would be surprised if the stock routine was
generating accesses to that sucker.

Nonetheless, if the patched routine adds nasty asm, that would be nice
to sort out.

FWIW on x86-64 qspinlock is used (i.e. not the stuff you are patching)
and I verified there are only 2 memory accesses -- the initial READ_ONCE
and later cmpxchg. I don't know which archs *don't* use qspinlock.

It also turns out generated asm is quite atrocious and cleaning it up
may yield a small win under more traffic. Maybe I'll see about it later
this week.

For example, disassembling lockref_put_return:
<+0>:     mov    (%rdi),%rax            <-- initial load, expected
<+3>:     mov    $0x64,%r8d
<+9>:     mov    %rax,%rdx
<+12>:    test   %eax,%eax              <-- retries loop back here
					<-- this is also the unlocked
					    check
<+14>:    jne    0xffffffff8157aba3 <lockref_put_return+67>
<+16>:    mov    %rdx,%rsi
<+19>:    mov    %edx,%edx
<+21>:    sar    $0x20,%rsi
<+25>:    lea    -0x1(%rsi),%ecx        <-- new.count--;
<+28>:    shl    $0x20,%rcx
<+32>:    or     %rcx,%rdx
<+35>:    test   %esi,%esi
<+37>:    jle    0xffffffff8157aba3 <lockref_put_return+67>
<+39>:    lock cmpxchg %rdx,(%rdi)      <-- the attempt to change
<+44>:    jne    0xffffffff8157ab9a <lockref_put_return+58>
<+46>:    shr    $0x20,%rdx
<+50>:    mov    %rdx,%rax
<+53>:    jmp    0xffffffff81af8540 <__x86_return_thunk>
<+58>:    mov    %rax,%rdx
<+61>:    sub    $0x1,%r8d              <-- retry count check
<+65>:    jne    0xffffffff8157ab6c <lockref_put_return+12> <-- go back
<+67>:    mov    $0xffffffff,%eax
<+72>:    jmp    0xffffffff81af8540 <__x86_return_thunk>
  
Guo Ren Aug. 8, 2023, 3:29 a.m. UTC | #8
On Mon, Aug 07, 2023 at 08:36:55PM +0200, Mateusz Guzik wrote:
> On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> > 
> > The arch_spin_value_unlocked would cause an unnecessary memory
> > access to the contended value. Although it won't cause a significant
> > performance gap in most architectures, the arch_spin_value_unlocked
> > argument contains enough information. Thus, remove unnecessary
> > atomic_read in arch_spin_value_unlocked().
> > 
> > The caller of arch_spin_value_unlocked() could benefit from this
> > change. Currently, the only caller is lockref.
> > 
> 
> Have you verified you are getting an extra memory access from this in
> lockref? What architecture is it?
For riscv, this patch optimizes lockref under the same compile
conditions:
 - After lifting the data dependency, the compiler optimizes the
   prologue, so the callee-saved-register save/restore path becomes
   optional. This is a significant improvement for lockref itself.
 - Compare the "98: & 9c:" lines before the patch with the "88:" line
   after the patch: we save two memory accesses, not just one load.

========================================================================
Before the patch:
void lockref_get(struct lockref *lockref)
{
  78:   fd010113                add     sp,sp,-48
  7c:   02813023                sd      s0,32(sp)
  80:   02113423                sd      ra,40(sp)
  84:   03010413                add     s0,sp,48

0000000000000088 <.LBB296>:
        CMPXCHG_LOOP(
  88:   00053783                ld      a5,0(a0)

000000000000008c <.LBB265>:
}

static __always_inline int ticket_spin_is_locked(arch_spinlock_t *lock)
{
        u32 val = atomic_read(&lock->val);
        return ((val >> 16) != (val & 0xffff));
  8c:   00010637                lui     a2,0x10

0000000000000090 <.LBE265>:
  90:   06400593                li      a1,100

0000000000000094 <.LBB274>:
  94:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf49a>

0000000000000098 <.L8>:
  98:   fef42423                sw      a5,-24(s0)

000000000000009c <.LBB269>:
  9c:   fe842703                lw      a4,-24(s0)

00000000000000a0 <.LBE269>:
  a0:   0107569b                srlw    a3,a4,0x10
  a4:   00c77733                and     a4,a4,a2
  a8:   04e69063                bne     a3,a4,e8 <.L12>

00000000000000ac <.LBB282>:
  ac:   4207d693                sra     a3,a5,0x20
  b0:   02079713                sll     a4,a5,0x20
  b4:   0016869b                addw    a3,a3,1
  b8:   02069693                sll     a3,a3,0x20
  bc:   02075713                srl     a4,a4,0x20
  c0:   00d76733                or      a4,a4,a3

00000000000000c4 <.L0^B1>:
  c4:   100536af                lr.d    a3,(a0)
  c8:   00f69863                bne     a3,a5,d8 <.L1^B1>
  cc:   1ae5382f                sc.d.rl a6,a4,(a0)
  d0:   fe081ae3                bnez    a6,c4 <.L0^B1>
  d4:   0330000f                fence   rw,rw

00000000000000d8 <.L1^B1>:
  d8:   02d78a63                beq     a5,a3,10c <.L7>

00000000000000dc <.LBE292>:
  dc:   fff5859b                addw    a1,a1,-1

00000000000000e0 <.LBB293>:
  e0:   00068793                mv      a5,a3

00000000000000e4 <.LBE293>:
  e4:   fa059ae3                bnez    a1,98 <.L8>

00000000000000e8 <.L12>:

========================================================================
After the patch:
void lockref_get(struct lockref *lockref)
{
        CMPXCHG_LOOP(
  78:   00053783                ld      a5,0(a0)

000000000000007c <.LBB212>:

static __always_inline int ticket_spin_value_unlocked(arch_spinlock_t
lock)
{
        u32 val = lock.val.counter;

        return ((val >> 16) == (val & 0xffff));
  7c:   00010637                lui     a2,0x10

0000000000000080 <.LBE212>:
  80:   06400593                li      a1,100

0000000000000084 <.LBB216>:
  84:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf4aa>

0000000000000088 <.L8>:
  88:   0007871b                sext.w  a4,a5

000000000000008c <.LBB217>:
  8c:   0107d69b                srlw    a3,a5,0x10
  90:   00c77733                and     a4,a4,a2
  94:   04e69063                bne     a3,a4,d4 <.L12>

0000000000000098 <.LBB218>:
  98:   4207d693                sra     a3,a5,0x20
  9c:   02079713                sll     a4,a5,0x20
  a0:   0016869b                addw    a3,a3,1
  a4:   02069693                sll     a3,a3,0x20
  a8:   02075713                srl     a4,a4,0x20
  ac:   00d76733                or      a4,a4,a3

00000000000000b0 <.L0^B1>:
  b0:   100536af                lr.d    a3,(a0)
  b4:   00f69863                bne     a3,a5,c4 <.L1^B1>
  b8:   1ae5382f                sc.d.rl a6,a4,(a0)
  bc:   fe081ae3                bnez    a6,b0 <.L0^B1>
  c0:   0330000f                fence   rw,rw

00000000000000c4 <.L1^B1>:
  c4:   04d78a63                beq     a5,a3,118 <.L18>

00000000000000c8 <.LBE228>:
  c8:   fff5859b                addw    a1,a1,-1

00000000000000cc <.LBB229>:
  cc:   00068793                mv      a5,a3

00000000000000d0 <.LBE229>:
  d0:   fa059ce3                bnez    a1,88 <.L8>

00000000000000d4 <.L12>:
{
  d4:   fe010113                add     sp,sp,-32
  d8:   00113c23                sd      ra,24(sp)
  dc:   00813823                sd      s0,16(sp)
  e0:   02010413                add     s0,sp,32
========================================================================
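
The "98: & 9c:" store/load pair in the before-the-patch listing comes
from the old helper taking the address of its by-value argument and
doing a volatile read through it. Roughly, the pre-patch inlined form
looks like this (a sketch of the asm-generic code; the riscv copy is
the ticket_spin_* variant shown above, but the shape is the same):

static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
	/*
	 * &lock gives the local copy a stack address, and atomic_read()
	 * is a READ_ONCE() (volatile load) through that address, so the
	 * compiler emits a store to the stack slot plus a reload instead
	 * of plain register arithmetic on the value it already has.
	 */
	u32 val = atomic_read(&lock);

	return !((val >> 16) != (val & 0xffff));
}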

> 
> I have no opinion about the patch itself, I will note though that the
> argument to the routine is *not* the actual memory-shared lockref,
> instead it's something from the local copy obtained with READ_ONCE
> from the real thing. So I would be surprised if the stock routine was
> generating accesses to that sucker.
> 
> Nonetheless, if the patched routine adds nasty asm, that would be nice
> to sort out.
> 
> FWIW on x86-64 qspinlock is used (i.e. not the stuff you are patching)
> and I verified there are only 2 memory accesses -- the initial READ_ONCE
> and later cmpxchg. I don't know which archs *don't* use qspinlock.
> 
> It also turns out generated asm is quite atrocious and cleaning it up
> may yield a small win under more traffic. Maybe I'll see about it later
> this week.
> 
> For example, disassembling lockref_put_return:
> <+0>:     mov    (%rdi),%rax            <-- initial load, expected
> <+3>:     mov    $0x64,%r8d
> <+9>:     mov    %rax,%rdx
> <+12>:    test   %eax,%eax              <-- retries loop back here
> 					<-- this is also the unlocked
> 					    check
> <+14>:    jne    0xffffffff8157aba3 <lockref_put_return+67>
> <+16>:    mov    %rdx,%rsi
> <+19>:    mov    %edx,%edx
> <+21>:    sar    $0x20,%rsi
> <+25>:    lea    -0x1(%rsi),%ecx        <-- new.count--;
> <+28>:    shl    $0x20,%rcx
> <+32>:    or     %rcx,%rdx
> <+35>:    test   %esi,%esi
> <+37>:    jle    0xffffffff8157aba3 <lockref_put_return+67>
> <+39>:    lock cmpxchg %rdx,(%rdi)      <-- the attempt to change
> <+44>:    jne    0xffffffff8157ab9a <lockref_put_return+58>
> <+46>:    shr    $0x20,%rdx
> <+50>:    mov    %rdx,%rax
> <+53>:    jmp    0xffffffff81af8540 <__x86_return_thunk>
> <+58>:    mov    %rax,%rdx
> <+61>:    sub    $0x1,%r8d              <-- retry count check
> <+65>:    jne    0xffffffff8157ab6c <lockref_put_return+12> <-- go back
> <+67>:    mov    $0xffffffff,%eax
> <+72>:    jmp    0xffffffff81af8540 <__x86_return_thunk>
>
  
Mateusz Guzik Aug. 8, 2023, 8:02 a.m. UTC | #9
On 8/8/23, Guo Ren <guoren@kernel.org> wrote:
> On Mon, Aug 07, 2023 at 08:36:55PM +0200, Mateusz Guzik wrote:
>> On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote:
>> > From: Guo Ren <guoren@linux.alibaba.com>
>> >
>> > The arch_spin_value_unlocked would cause an unnecessary memory
>> > access to the contended value. Although it won't cause a significant
>> > performance gap in most architectures, the arch_spin_value_unlocked
>> > argument contains enough information. Thus, remove unnecessary
>> > atomic_read in arch_spin_value_unlocked().
>> >
>> > The caller of arch_spin_value_unlocked() could benefit from this
>> > change. Currently, the only caller is lockref.
>> >
>>
>> Have you verified you are getting an extra memory access from this in
>> lockref? What architecture is it?
> For riscv, this patch optimizes lockref under the same compile
> conditions:
>  - After lifting the data dependency, the compiler optimizes the
>    prologue, so the callee-saved-register save/restore path becomes
>    optional. This is a significant improvement for lockref itself.
>  - Compare the "98: & 9c:" lines before the patch with the "88:" line
>    after the patch: we save two memory accesses, not just one load.
>

Now that you mention it, I see riscv in cc. ;)

Your commit message states "arch_spin_value_unlocked would cause an
unnecessary memory access to the contended value" and that lockref
uses it. Perhaps incorrectly, I took it to claim lockref is suffering
extra loads from the area it modifies with cmpxchg -- as I verified,
this is not happening, as the argument to arch_spin_value_unlocked is a
copy of the target lockref struct. With this not being a problem, the
potential scalability impact goes down a lot. And as it happens, with
the qspinlock code on x86-64 there are no extra memory accesses to
anything anyway.

I don't speak riscv asm so can't comment on the result. I'll note
again that extra work for single-threaded use is definitely worth
shaving and may or may not affect throughput in face of other CPUs
messing with lockref.

You can easily test lockref with will-it-scale; I would suggest the
access() system call, which afaics has the least amount of unrelated
overhead. You can find the bench here:
https://github.com/antonblanchard/will-it-scale/pull/36/files
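
For reference, the per-worker loop in such a benchmark looks roughly
like this (a sketch, not the actual will-it-scale source; the test path
is hypothetical): hammering access(2) on the same path exercises the
dentry lockref in the path-lookup code:

#include <unistd.h>
#include <stdio.h>

int main(void)
{
	unsigned long iters = 0;

	/* Print progress occasionally; the real benchmark instead exports
	 * a shared per-worker counter that the harness samples. */
	for (;;) {
		access("/tmp/lockref-testfile", F_OK);	/* hypothetical path */
		if (++iters % 100000000UL == 0)
			fprintf(stderr, "%lu iterations\n", iters);
	}
}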

> ========================================================================
> Before the patch:
> void lockref_get(struct lockref *lockref)
> {
>   78:   fd010113                add     sp,sp,-48
>   7c:   02813023                sd      s0,32(sp)
>   80:   02113423                sd      ra,40(sp)
>   84:   03010413                add     s0,sp,48
>
> 0000000000000088 <.LBB296>:
>         CMPXCHG_LOOP(
>   88:   00053783                ld      a5,0(a0)
>
> 000000000000008c <.LBB265>:
> }
>
> static __always_inline int ticket_spin_is_locked(arch_spinlock_t *lock)
> {
>         u32 val = atomic_read(&lock->val);
>         return ((val >> 16) != (val & 0xffff));
>   8c:   00010637                lui     a2,0x10
>
> 0000000000000090 <.LBE265>:
>   90:   06400593                li      a1,100
>
> 0000000000000094 <.LBB274>:
>   94:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf49a>
>
> 0000000000000098 <.L8>:
>   98:   fef42423                sw      a5,-24(s0)
>
> 000000000000009c <.LBB269>:
>   9c:   fe842703                lw      a4,-24(s0)
>
> 00000000000000a0 <.LBE269>:
>   a0:   0107569b                srlw    a3,a4,0x10
>   a4:   00c77733                and     a4,a4,a2
>   a8:   04e69063                bne     a3,a4,e8 <.L12>
>
> 00000000000000ac <.LBB282>:
>   ac:   4207d693                sra     a3,a5,0x20
>   b0:   02079713                sll     a4,a5,0x20
>   b4:   0016869b                addw    a3,a3,1
>   b8:   02069693                sll     a3,a3,0x20
>   bc:   02075713                srl     a4,a4,0x20
>   c0:   00d76733                or      a4,a4,a3
>
> 00000000000000c4 <.L0^B1>:
>   c4:   100536af                lr.d    a3,(a0)
>   c8:   00f69863                bne     a3,a5,d8 <.L1^B1>
>   cc:   1ae5382f                sc.d.rl a6,a4,(a0)
>   d0:   fe081ae3                bnez    a6,c4 <.L0^B1>
>   d4:   0330000f                fence   rw,rw
>
> 00000000000000d8 <.L1^B1>:
>   d8:   02d78a63                beq     a5,a3,10c <.L7>
>
> 00000000000000dc <.LBE292>:
>   dc:   fff5859b                addw    a1,a1,-1
>
> 00000000000000e0 <.LBB293>:
>   e0:   00068793                mv      a5,a3
>
> 00000000000000e4 <.LBE293>:
>   e4:   fa059ae3                bnez    a1,98 <.L8>
>
> 00000000000000e8 <.L12>:
>
> ========================================================================
> After the patch:
> void lockref_get(struct lockref *lockref)
> {
>         CMPXCHG_LOOP(
>   78:   00053783                ld      a5,0(a0)
>
> 000000000000007c <.LBB212>:
>
> static __always_inline int ticket_spin_value_unlocked(arch_spinlock_t
> lock)
> {
>         u32 val = lock.val.counter;
>
>         return ((val >> 16) == (val & 0xffff));
>   7c:   00010637                lui     a2,0x10
>
> 0000000000000080 <.LBE212>:
>   80:   06400593                li      a1,100
>
> 0000000000000084 <.LBB216>:
>   84:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf4aa>
>
> 0000000000000088 <.L8>:
>   88:   0007871b                sext.w  a4,a5
>
> 000000000000008c <.LBB217>:
>   8c:   0107d69b                srlw    a3,a5,0x10
>   90:   00c77733                and     a4,a4,a2
>   94:   04e69063                bne     a3,a4,d4 <.L12>
>
> 0000000000000098 <.LBB218>:
>   98:   4207d693                sra     a3,a5,0x20
>   9c:   02079713                sll     a4,a5,0x20
>   a0:   0016869b                addw    a3,a3,1
>   a4:   02069693                sll     a3,a3,0x20
>   a8:   02075713                srl     a4,a4,0x20
>   ac:   00d76733                or      a4,a4,a3
>
> 00000000000000b0 <.L0^B1>:
>   b0:   100536af                lr.d    a3,(a0)
>   b4:   00f69863                bne     a3,a5,c4 <.L1^B1>
>   b8:   1ae5382f                sc.d.rl a6,a4,(a0)
>   bc:   fe081ae3                bnez    a6,b0 <.L0^B1>
>   c0:   0330000f                fence   rw,rw
>
> 00000000000000c4 <.L1^B1>:
>   c4:   04d78a63                beq     a5,a3,118 <.L18>
>
> 00000000000000c8 <.LBE228>:
>   c8:   fff5859b                addw    a1,a1,-1
>
> 00000000000000cc <.LBB229>:
>   cc:   00068793                mv      a5,a3
>
> 00000000000000d0 <.LBE229>:
>   d0:   fa059ce3                bnez    a1,88 <.L8>
>
> 00000000000000d4 <.L12>:
> {
>   d4:   fe010113                add     sp,sp,-32
>   d8:   00113c23                sd      ra,24(sp)
>   dc:   00813823                sd      s0,16(sp)
>   e0:   02010413                add     s0,sp,32
> ========================================================================
>
>>
>> I have no opinion about the patch itself, I will note though that the
>> argument to the routine is *not* the actual memory-shared lockref,
>> instead it's something from the local copy obtained with READ_ONCE
>> from the real thing. So I would be surprised if the stock routine was
>> generating accesses to that sucker.
>>
>> Nonetheless, if the patched routine adds nasty asm, that would be nice
>> to sort out.
>>
>> FWIW on x86-64 qspinlock is used (i.e. not the stuff you are patching)
>> and I verified there are only 2 memory accesses -- the initial READ_ONCE
>> and later cmpxchg. I don't know which archs *don't* use qspinlock.
>>
>> It also turns out generated asm is quite atrocious and cleaning it up
>> may yield a small win under more traffic. Maybe I'll see about it later
>> this week.
>>
>> For example, disassembling lockref_put_return:
>> <+0>:     mov    (%rdi),%rax            <-- initial load, expected
>> <+3>:     mov    $0x64,%r8d
>> <+9>:     mov    %rax,%rdx
>> <+12>:    test   %eax,%eax              <-- retries loop back here
>> 					<-- this is also the unlocked
>> 					    check
>> <+14>:    jne    0xffffffff8157aba3 <lockref_put_return+67>
>> <+16>:    mov    %rdx,%rsi
>> <+19>:    mov    %edx,%edx
>> <+21>:    sar    $0x20,%rsi
>> <+25>:    lea    -0x1(%rsi),%ecx        <-- new.count--;
>> <+28>:    shl    $0x20,%rcx
>> <+32>:    or     %rcx,%rdx
>> <+35>:    test   %esi,%esi
>> <+37>:    jle    0xffffffff8157aba3 <lockref_put_return+67>
>> <+39>:    lock cmpxchg %rdx,(%rdi)      <-- the attempt to change
>> <+44>:    jne    0xffffffff8157ab9a <lockref_put_return+58>
>> <+46>:    shr    $0x20,%rdx
>> <+50>:    mov    %rdx,%rax
>> <+53>:    jmp    0xffffffff81af8540 <__x86_return_thunk>
>> <+58>:    mov    %rax,%rdx
>> <+61>:    sub    $0x1,%r8d              <-- retry count check
>> <+65>:    jne    0xffffffff8157ab6c <lockref_put_return+12> <-- go back
>> <+67>:    mov    $0xffffffff,%eax
>> <+72>:    jmp    0xffffffff81af8540 <__x86_return_thunk>
>>
>
  

Patch

diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
index fdfebcb050f4..90803a826ba0 100644
--- a/include/asm-generic/spinlock.h
+++ b/include/asm-generic/spinlock.h
@@ -68,11 +68,18 @@  static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 	smp_store_release(ptr, (u16)val + 1);
 }
 
+static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+	u32 val = lock.counter;
+
+	return ((val >> 16) == (val & 0xffff));
+}
+
 static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-	u32 val = atomic_read(lock);
+	arch_spinlock_t val = READ_ONCE(*lock);
 
-	return ((val >> 16) != (val & 0xffff));
+	return !arch_spin_value_unlocked(val);
 }
 
 static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
@@ -82,11 +89,6 @@  static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
 	return (s16)((val >> 16) - (val & 0xffff)) > 1;
 }
 
-static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
-{
-	return !arch_spin_is_locked(&lock);
-}
-
 #include <asm/qrwlock.h>
 
 #endif /* __ASM_GENERIC_SPINLOCK_H */