[1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits

Message ID 20230508081532.36379-1-qiuxu.zhuo@intel.com
State New
Series [1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits

Commit Message

Qiuxu Zhuo May 8, 2023, 8:15 a.m. UTC
  The 1st spinner (header of the MCS queue) spins on the whole qspinlock
variable to check whether the lock is released. For a contended qspinlock,
this spinning is a hotspot as each CPU queued in the MCS queue performs
the spinning when it becomes the 1st spinner (header of the MCS queue).

The granularity at which the Load-Store Unit (LSU) inside a core tracks
accesses by its SMT h/w threads could be as fine as a byte. Making the 1st
spinner spin only on the locked_pending bits (not the whole qspinlock)
avoids the false dependency between the tail field and the locked_pending
field. This micro-optimization helps the h/w thread (the 1st spinner)
stay in a low power state and prevents it from being woken up by other
h/w threads in the same core when they perform xchg_tail() to update
only the tail field. Please see a similar discussion in link [1].

[1] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
 kernel/locking/qspinlock.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
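
For reference, the lock word layout that the patch relies on, paraphrased
from the kernel's include/asm-generic/qspinlock_types.h for little-endian
machines with _Q_PENDING_BITS == 8 (the byte order of the sub-fields is
swapped on big-endian), looks roughly like this:

typedef struct qspinlock {
	union {
		atomic_t val;			/* the whole 32-bit lock word */
		struct {
			u8	locked;		/* bits  0-7 : lock owner byte */
			u8	pending;	/* bits  8-15: pending bit */
		};
		struct {
			u16	locked_pending;	/* bits  0-15: what the patch spins on */
			u16	tail;		/* bits 16-31: MCS tail (CPU + node index) */
		};
	};
} arch_spinlock_t;

xchg_tail() writes only the upper 16 bits (tail), so a 1st spinner that
watches only locked_pending would not, in principle, be disturbed by
those writes.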
  

Comments

Waiman Long May 8, 2023, 3:31 p.m. UTC | #1
On 5/8/23 04:15, Qiuxu Zhuo wrote:
> The 1st spinner (header of the MCS queue) spins on the whole qspinlock
> variable to check whether the lock is released. For a contended qspinlock,
> this spinning is a hotspot as each CPU queued in the MCS queue performs
> the spinning when it becomes the 1st spinner (header of the MCS queue).
>
> The granularity among SMT h/w threads in the same core could be "byte"
> which the Load-Store Unit (LSU) inside the core handles. Making the 1st
> spinner only spin on locked_pending bits (not the whole qspinlock) can
> avoid the false dependency between the tail field and the locked_pending
> field. So this micro-optimization helps the h/w thread (the 1st spinner)
> stay in a low power state and prevents it from being woken up by other
> h/w threads in the same core when they perform xchg_tail() to update
> the tail field. Please see a similar discussion in the link [1].
>
> [1] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
>   kernel/locking/qspinlock.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index efebbf19f887..e7b990b28610 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -513,7 +513,20 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   	if ((val = pv_wait_head_or_lock(lock, node)))
>   		goto locked;
>   
> +#if _Q_PENDING_BITS == 8
> +	/*
> +	 * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock
> +	 * variable can avoid the false dependency between the tail field and
> +	 * the locked_pending field. This helps the h/w thread (the 1st spinner)
> +	 * stay in a low power state and prevents it from being woken up by other
> +	 * h/w threads in the same core when they perform xchg_tail() to update
> +	 * the tail field only.
> +	 */
> +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
> +	val = atomic_read_acquire(&lock->val);
> +#else
>   	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
> +#endif
>   
>   locked:
>   	/*

What hardware can benefit from this change? Do you have any 
micro-benchmark that can show the performance benefit?

Cheers,
Longman
  
Qiuxu Zhuo May 9, 2023, 2:45 a.m. UTC | #2
Hi Waiman,

Thanks for your review of this patch. 
Please see the comments below.

> From: Waiman Long <longman@redhat.com>
> Sent: Monday, May 8, 2023 11:31 PM
> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra
> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon
> <will@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
> locked_pending bits
> 
> 
> On 5/8/23 04:15, Qiuxu Zhuo wrote:
> > The 1st spinner (header of the MCS queue) spins on the whole qspinlock
> > variable to check whether the lock is released. For a contended
> > qspinlock, this spinning is a hotspot as each CPU queued in the MCS
> > queue performs the spinning when it becomes the 1st spinner (header of
> the MCS queue).
> >
> > The granularity among SMT h/w threads in the same core could be "byte"
> > which the Load-Store Unit (LSU) inside the core handles. Making the
> > 1st spinner only spin on locked_pending bits (not the whole qspinlock)
> > can avoid the false dependency between the tail field and the
> > locked_pending field. So this micro-optimization helps the h/w thread
> > (the 1st spinner) stay in a low power state and prevents it from being
> > woken up by other h/w threads in the same core when they perform
> > xchg_tail() to update the tail field. Please see a similar discussion in the link
> [1].
> >
> > [1]
> > https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
> >
> > Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> > ---
> >   kernel/locking/qspinlock.c | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> >
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index efebbf19f887..e7b990b28610 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -513,7 +513,20 @@ void __lockfunc
> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >   	if ((val = pv_wait_head_or_lock(lock, node)))
> >   		goto locked;
> >
> > +#if _Q_PENDING_BITS == 8
> > +	/*
> > +	 * Spinning on the 2-byte locked_pending instead of the 4-byte
> qspinlock
> > +	 * variable can avoid the false dependency between the tail field and
> > +	 * the locked_pending field. This helps the h/w thread (the 1st
> spinner)
> > +	 * stay in a low power state and prevents it from being woken up by
> other
> > +	 * h/w threads in the same core when they perform xchg_tail() to
> update
> > +	 * the tail field only.
> > +	 */
> > +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
> > +	val = atomic_read_acquire(&lock->val); #else
> >   	val = atomic_cond_read_acquire(&lock->val, !(VAL &
> > _Q_LOCKED_PENDING_MASK));
> > +#endif
> >
> >   locked:
> >   	/*
> 
> What hardware can benefit from this change? Do you have any micro-
> benchmark that can show the performance benefit?

i)  I don't have the hardware to measure the benefit directly.
    But I ran a benchmark [1] for a contended spinlock on an Intel Sapphire Rapids
    server (192 h/w threads, 2 sockets) which showed that the 1st spinner's spinning was
    a hotspot (it contributed ~55% of the cache-bouncing traffic measured by perf c2c;
    I don't analyze the cache bouncing here, only note that the spinning is a hotspot).

ii) A similar micro-optimization was discussed in [2] (and looked like it was accepted by you 😉):
    avoiding the false dependency (between the tail field and the locked_pending field)
    should help the h/w thread (spinner) on some arches (e.g., some ARM64?) stay in a
    low-power state without being disturbed by its sibling h/w threads in the same core
    (see the xchg_tail() sketch below).

[1] https://github.com/yamahata/ipi_benchmark.git
[2] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
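
For reference, when _Q_PENDING_BITS == 8 the tail update mentioned above is
confined to the upper 16 bits of the lock word; the kernel's implementation
in kernel/locking/qspinlock.c is roughly:

/*
 * xchg_tail - Put in the new queue tail code word & retrieve the previous one.
 * Only the 16-bit lock->tail half is touched, not locked_pending.
 */
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	/*
	 * Relaxed semantics are sufficient: the caller ensures the MCS node
	 * is properly initialized before publishing it via the tail.
	 */
	return (u32)xchg_relaxed(&lock->tail,
				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}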

Thanks
-Qiuxu

> Cheers,
> Longman
  
Waiman Long May 9, 2023, 2:57 a.m. UTC | #3
On 5/8/23 22:45, Zhuo, Qiuxu wrote:
> Hi Waiman,
>
> Thanks for your review of this patch.
> Please see the comments below.
>
>> From: Waiman Long <longman@redhat.com>
>> Sent: Monday, May 8, 2023 11:31 PM
>> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra
>> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon
>> <will@kernel.org>
>> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
>> locked_pending bits
>>
>>
>> On 5/8/23 04:15, Qiuxu Zhuo wrote:
>>> The 1st spinner (header of the MCS queue) spins on the whole qspinlock
>>> variable to check whether the lock is released. For a contended
>>> qspinlock, this spinning is a hotspot as each CPU queued in the MCS
>>> queue performs the spinning when it becomes the 1st spinner (header of
>> the MCS queue).
>>> The granularity among SMT h/w threads in the same core could be "byte"
>>> which the Load-Store Unit (LSU) inside the core handles. Making the
>>> 1st spinner only spin on locked_pending bits (not the whole qspinlock)
>>> can avoid the false dependency between the tail field and the
>>> locked_pending field. So this micro-optimization helps the h/w thread
>>> (the 1st spinner) stay in a low power state and prevents it from being
>>> woken up by other h/w threads in the same core when they perform
>>> xchg_tail() to update the tail field. Please see a similar discussion in the link
>> [1].
>>> [1]
>>> https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
>>>
>>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>>> ---
>>>    kernel/locking/qspinlock.c | 13 +++++++++++++
>>>    1 file changed, 13 insertions(+)
>>>
>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>>> index efebbf19f887..e7b990b28610 100644
>>> --- a/kernel/locking/qspinlock.c
>>> +++ b/kernel/locking/qspinlock.c
>>> @@ -513,7 +513,20 @@ void __lockfunc
>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>>    	if ((val = pv_wait_head_or_lock(lock, node)))
>>>    		goto locked;
>>>
>>> +#if _Q_PENDING_BITS == 8
>>> +	/*
>>> +	 * Spinning on the 2-byte locked_pending instead of the 4-byte
>> qspinlock
>>> +	 * variable can avoid the false dependency between the tail field and
>>> +	 * the locked_pending field. This helps the h/w thread (the 1st
>> spinner)
>>> +	 * stay in a low power state and prevents it from being woken up by
>> other
>>> +	 * h/w threads in the same core when they perform xchg_tail() to
>> update
>>> +	 * the tail field only.
>>> +	 */
>>> +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
>>> +	val = atomic_read_acquire(&lock->val); #else
>>>    	val = atomic_cond_read_acquire(&lock->val, !(VAL &
>>> _Q_LOCKED_PENDING_MASK));
>>> +#endif
>>>
>>>    locked:
>>>    	/*
>> What hardware can benefit from this change? Do you have any micro-
>> benchmark that can show the performance benefit?
> i)  I don't have the hardware to measure the data.
>      But I run a benchmark [1] for the contended spinlock on an Intel Sapphire Rapids
>      server (192 h/w threads, 2sockets) that showed that the 1st spinner spinning was
>      a hotspot (contributed ~55% cache bouncing traffic measured by the perf C2C.
>       I don't analyze the cache bouncing here, but just say the spinning is a hotspot).
I believe the amount of cacheline bouncing will be the same whether you 
read 32 or 16 bits from the lock word. At least this is my understanding 
of the x86 arch. Please correct me if my assumption is incorrect.

>
> ii) The similar micro-optimization discussion [2] (looked like it was accepted by you 😉) that
>      avoiding the false dependency (between the tail field and the locked_pending field)
>      should help some arches (e.g., some ARM64???) the h/w thread (spinner) stay in a
>      low-power state without the disruption by its sibling h/w threads in the same core.

That is true. However, this patch causes one more read from the lock 
cacheline which isn't necessary for arches that won't benefit from it.  
So I am less inclined to accept it unless there is evidence of the 
benefit it can bring.

Cheers,
Longman
  
Qiuxu Zhuo May 9, 2023, 3:30 a.m. UTC | #4
> From: Waiman Long <longman@redhat.com>
> Sent: Tuesday, May 9, 2023 10:58 AM
> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra
> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon
> <will@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
> locked_pending bits
> 
> 
> On 5/8/23 22:45, Zhuo, Qiuxu wrote:
> > Hi Waiman,
> >
> > Thanks for your review of this patch.
> > Please see the comments below.
> >
> >> From: Waiman Long <longman@redhat.com>
> >> Sent: Monday, May 8, 2023 11:31 PM
> >> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra
> >> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon
> >> <will@kernel.org>
> >> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only
> >> spin on locked_pending bits
> >>
> >>
> >> On 5/8/23 04:15, Qiuxu Zhuo wrote:
> >>> The 1st spinner (header of the MCS queue) spins on the whole
> >>> qspinlock variable to check whether the lock is released. For a
> >>> contended qspinlock, this spinning is a hotspot as each CPU queued
> >>> in the MCS queue performs the spinning when it becomes the 1st
> >>> spinner (header of
> >> the MCS queue).
> >>> The granularity among SMT h/w threads in the same core could be
> "byte"
> >>> which the Load-Store Unit (LSU) inside the core handles. Making the
> >>> 1st spinner only spin on locked_pending bits (not the whole
> >>> qspinlock) can avoid the false dependency between the tail field and
> >>> the locked_pending field. So this micro-optimization helps the h/w
> >>> thread (the 1st spinner) stay in a low power state and prevents it
> >>> from being woken up by other h/w threads in the same core when they
> >>> perform
> >>> xchg_tail() to update the tail field. Please see a similar
> >>> discussion in the link
> >> [1].
> >>> [1]
> >>> https://lore.kernel.org/r/20230105021952.3090070-1-
> guoren@kernel.org
> >>>
> >>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> >>> ---
> >>>    kernel/locking/qspinlock.c | 13 +++++++++++++
> >>>    1 file changed, 13 insertions(+)
> >>>
> >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> >>> index efebbf19f887..e7b990b28610 100644
> >>> --- a/kernel/locking/qspinlock.c
> >>> +++ b/kernel/locking/qspinlock.c
> >>> @@ -513,7 +513,20 @@ void __lockfunc
> >> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >>>    	if ((val = pv_wait_head_or_lock(lock, node)))
> >>>    		goto locked;
> >>>
> >>> +#if _Q_PENDING_BITS == 8
> >>> +	/*
> >>> +	 * Spinning on the 2-byte locked_pending instead of the 4-byte
> >> qspinlock
> >>> +	 * variable can avoid the false dependency between the tail field and
> >>> +	 * the locked_pending field. This helps the h/w thread (the 1st
> >> spinner)
> >>> +	 * stay in a low power state and prevents it from being woken up
> >>> +by
> >> other
> >>> +	 * h/w threads in the same core when they perform xchg_tail() to
> >> update
> >>> +	 * the tail field only.
> >>> +	 */
> >>> +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
> >>> +	val = atomic_read_acquire(&lock->val); #else
> >>>    	val = atomic_cond_read_acquire(&lock->val, !(VAL &
> >>> _Q_LOCKED_PENDING_MASK));
> >>> +#endif
> >>>
> >>>    locked:
> >>>    	/*
> >> What hardware can benefit from this change? Do you have any micro-
> >> benchmark that can show the performance benefit?
> > i)  I don't have the hardware to measure the data.
> >      But I run a benchmark [1] for the contended spinlock on an Intel
> Sapphire Rapids
> >      server (192 h/w threads, 2sockets) that showed that the 1st spinner
> spinning was
> >      a hotspot (contributed ~55% cache bouncing traffic measured by the
> perf C2C.
> >       I don't analyze the cache bouncing here, but just say the spinning is a
> hotspot).
> I believe the amount of cacheline bouncing will be the same whether you
> read 32 or 16 bits from the lock word. At least this is my understanding of the
> x86 arch. Please correct me if my assumption is incorrect.

You're right. 
The amount of cache line bouncing was nearly the same whether spinning on 32 or 16 bits 
(according to the perf c2c data I measured on an x86 server). 
 
> >
> > ii) The similar micro-optimization discussion [2] (looked like it was accepted
> by you 😉) that
> >      avoiding the false dependency (between the tail field and the
> locked_pending field)
> >      should help some arches (e.g., some ARM64???) the h/w thread
> (spinner) stay in a
> >      low-power state without the disruption by its sibling h/w threads in the
> same core.
> 
> That is true. However, this patch causes one more read from the lock
> cacheline which isn't necessary for arches that won't benefit from it.
> So I am less incline to accept it unless there is evidence of the
> benefit it can bring.

This patch removes a bitwise AND operation on the VAL value.
Does that compensate for the extra read from the cache line? 
 
Thanks!
- Qiuxu

> Cheers,
> Longman
  
Waiman Long May 9, 2023, 2:33 p.m. UTC | #5
On 5/8/23 23:30, Zhuo, Qiuxu wrote:
>> From: Waiman Long <longman@redhat.com>
>> Sent: Tuesday, May 9, 2023 10:58 AM
>> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra
>> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon
>> <will@kernel.org>
>> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
>> locked_pending bits
>>
>>
>> On 5/8/23 22:45, Zhuo, Qiuxu wrote:
>>> Hi Waiman,
>>>
>>> Thanks for your review of this patch.
>>> Please see the comments below.
>>>
>>>> From: Waiman Long <longman@redhat.com>
>>>> Sent: Monday, May 8, 2023 11:31 PM
>>>> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra
>>>> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon
>>>> <will@kernel.org>
>>>> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org
>>>> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only
>>>> spin on locked_pending bits
>>>>
>>>>
>>>> On 5/8/23 04:15, Qiuxu Zhuo wrote:
>>>>> The 1st spinner (header of the MCS queue) spins on the whole
>>>>> qspinlock variable to check whether the lock is released. For a
>>>>> contended qspinlock, this spinning is a hotspot as each CPU queued
>>>>> in the MCS queue performs the spinning when it becomes the 1st
>>>>> spinner (header of
>>>> the MCS queue).
>>>>> The granularity among SMT h/w threads in the same core could be
>> "byte"
>>>>> which the Load-Store Unit (LSU) inside the core handles. Making the
>>>>> 1st spinner only spin on locked_pending bits (not the whole
>>>>> qspinlock) can avoid the false dependency between the tail field and
>>>>> the locked_pending field. So this micro-optimization helps the h/w
>>>>> thread (the 1st spinner) stay in a low power state and prevents it
>>>>> from being woken up by other h/w threads in the same core when they
>>>>> perform
>>>>> xchg_tail() to update the tail field. Please see a similar
>>>>> discussion in the link
>>>> [1].
>>>>> [1]
>>>>> https://lore.kernel.org/r/20230105021952.3090070-1-
>> guoren@kernel.org
>>>>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>>>>> ---
>>>>>     kernel/locking/qspinlock.c | 13 +++++++++++++
>>>>>     1 file changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>>>>> index efebbf19f887..e7b990b28610 100644
>>>>> --- a/kernel/locking/qspinlock.c
>>>>> +++ b/kernel/locking/qspinlock.c
>>>>> @@ -513,7 +513,20 @@ void __lockfunc
>>>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>>>>     	if ((val = pv_wait_head_or_lock(lock, node)))
>>>>>     		goto locked;
>>>>>
>>>>> +#if _Q_PENDING_BITS == 8
>>>>> +	/*
>>>>> +	 * Spinning on the 2-byte locked_pending instead of the 4-byte
>>>> qspinlock
>>>>> +	 * variable can avoid the false dependency between the tail field and
>>>>> +	 * the locked_pending field. This helps the h/w thread (the 1st
>>>> spinner)
>>>>> +	 * stay in a low power state and prevents it from being woken up
>>>>> +by
>>>> other
>>>>> +	 * h/w threads in the same core when they perform xchg_tail() to
>>>> update
>>>>> +	 * the tail field only.
>>>>> +	 */
>>>>> +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
>>>>> +	val = atomic_read_acquire(&lock->val); #else
>>>>>     	val = atomic_cond_read_acquire(&lock->val, !(VAL &
>>>>> _Q_LOCKED_PENDING_MASK));
>>>>> +#endif
>>>>>
>>>>>     locked:
>>>>>     	/*
>>>> What hardware can benefit from this change? Do you have any micro-
>>>> benchmark that can show the performance benefit?
>>> i)  I don't have the hardware to measure the data.
>>>       But I run a benchmark [1] for the contended spinlock on an Intel
>> Sapphire Rapids
>>>       server (192 h/w threads, 2sockets) that showed that the 1st spinner
>> spinning was
>>>       a hotspot (contributed ~55% cache bouncing traffic measured by the
>> perf C2C.
>>>        I don't analyze the cache bouncing here, but just say the spinning is a
>> hotspot).
>> I believe the amount of cacheline bouncing will be the same whether you
>> read 32 or 16 bits from the lock word. At least this is my understanding of the
>> x86 arch. Please correct me if my assumption is incorrect.
> You're right.
> The amount of cache line bouncing was nearly the same either spinning 32 or 16 bits
> (according to my measured perf C2C data on an x86 server).
>   
>>> ii) The similar micro-optimization discussion [2] (looked like it was accepted
>> by you 😉) that
>>>       avoiding the false dependency (between the tail field and the
>> locked_pending field)
>>>       should help some arches (e.g., some ARM64???) the h/w thread
>> (spinner) stay in a
>>>       low-power state without the disruption by its sibling h/w threads in the
>> same core.
>>
>> That is true. However, this patch causes one more read from the lock
>> cacheline which isn't necessary for arches that won't benefit from it.
>> So I am less incline to accept it unless there is evidence of the
>> benefit it can bring.
> This patch removes a bitwise AND operation on the VAL value.
> Does it compensate for the one more read from the cache line?

A register-to-register operation such as the bitwise AND doesn't cost 
much; it is the potential read from a hot cacheline that causes most 
of the delay. Besides, there is also an additional acquire barrier. 
Again, if there is no concrete proof of a benefit, there is no point 
in making the code more complicated.

Cheers,
Longman
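
For reference, the additional acquire Longman mentions comes from pairing
smp_cond_load_acquire() with a second acquire load of lock->val. The generic
fallback of smp_cond_load_acquire(), paraphrased from
include/asm-generic/barrier.h (arm64, for instance, overrides it with an
LDXR/WFE-based wait that lets the h/w thread idle until the watched location
is written), already ends with an acquire barrier:

#ifndef smp_cond_load_acquire
#define smp_cond_load_acquire(ptr, cond_expr) ({		\
	__unqual_scalar_typeof(*ptr) _val;			\
	/* spin with relaxed loads until cond_expr is true */	\
	_val = smp_cond_load_relaxed(ptr, cond_expr);		\
	/* then upgrade the spin's final load to acquire */	\
	smp_acquire__after_ctrl_dep();				\
	(typeof(*ptr))_val;					\
})
#endif

So the proposed #if branch issues two acquire operations (the conditional
load plus atomic_read_acquire()) where the existing code issues one.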
  
kernel test robot May 15, 2023, 6:42 a.m. UTC | #6
Hello,

kernel test robot noticed "kernel_BUG_at_lib/list_debug.c" on:

commit: 3fa78d3cb9dfb63f66f86b76a7621c9d1c1ee302 ("[PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits")
url: https://github.com/intel-lab-lkp/linux/commits/Qiuxu-Zhuo/locking-qspinlock-Make-the-1st-spinner-only-spin-on-locked_pending-bits/20230508-161751
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ec570320b09f76d52819e60abdccf372658216b6
patch link: https://lore.kernel.org/all/20230508081532.36379-1-qiuxu.zhuo@intel.com/
patch subject: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits

in testcase: phoronix-test-suite
version: 
with following parameters:

	test: sqlite-2.1.0
	option_a: 128
	cpufreq_governor: performance

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/


compiler: gcc-11
test machine: 16 threads 1 sockets Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz (Coffee Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


+----------------+------------+------------+
|                | ec570320b0 | 3fa78d3cb9 |
+----------------+------------+------------+
| boot_successes | 0          | 2          |
+----------------+------------+------------+


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202305151330.ca5d459d-oliver.sang@intel.com


[   54.397344][ T2693] ------------[ cut here ]------------
[   54.402911][ T2693] kernel BUG at lib/list_debug.c:59!
[   54.408314][ T2693] invalid opcode: 0000 [#1] SMP NOPTI
[   54.413773][ T2693] CPU: 2 PID: 2693 Comm: sqlite3 Tainted: G S                 6.3.0-rc1-00010-g3fa78d3cb9df #1
[   54.424166][ T2693] Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.R00.X188.B13.1903250419 03/25/2019
[ 54.437758][ T2693] RIP: 0010:__list_del_entry_valid (lib/list_debug.c:59 (discriminator 3)) 
[ 54.443830][ T2693] Code: 36 6f 82 e8 87 98 a8 ff 0f 0b 48 89 ca 48 c7 c7 20 37 6f 82 e8 76 98 a8 ff 0f 0b 4c 89 c2 48 c7 c7 58 37 6f 82 e8 65 98 a8 ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 a8 37 6f 82 e8 4e 98 a8
All code
========
   0:	36 6f                	outsl  %ss:(%rsi),(%dx)
   2:	82                   	(bad)  
   3:	e8 87 98 a8 ff       	callq  0xffffffffffa8988f
   8:	0f 0b                	ud2    
   a:	48 89 ca             	mov    %rcx,%rdx
   d:	48 c7 c7 20 37 6f 82 	mov    $0xffffffff826f3720,%rdi
  14:	e8 76 98 a8 ff       	callq  0xffffffffffa8988f
  19:	0f 0b                	ud2    
  1b:	4c 89 c2             	mov    %r8,%rdx
  1e:	48 c7 c7 58 37 6f 82 	mov    $0xffffffff826f3758,%rdi
  25:	e8 65 98 a8 ff       	callq  0xffffffffffa8988f
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	48 89 d1             	mov    %rdx,%rcx
  2f:	4c 89 c6             	mov    %r8,%rsi
  32:	4c 89 ca             	mov    %r9,%rdx
  35:	48 c7 c7 a8 37 6f 82 	mov    $0xffffffff826f37a8,%rdi
  3c:	e8                   	.byte 0xe8
  3d:	4e 98                	rex.WRX cltq 
  3f:	a8                   	.byte 0xa8

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	48 89 d1             	mov    %rdx,%rcx
   5:	4c 89 c6             	mov    %r8,%rsi
   8:	4c 89 ca             	mov    %r9,%rdx
   b:	48 c7 c7 a8 37 6f 82 	mov    $0xffffffff826f37a8,%rdi
  12:	e8                   	.byte 0xe8
  13:	4e 98                	rex.WRX cltq 
  15:	a8                   	.byte 0xa8
[   54.463787][ T2693] RSP: 0018:ffffc90004ad3e50 EFLAGS: 00010246
[   54.469954][ T2693] RAX: 000000000000006d RBX: ffff88887567c6c0 RCX: 0000000000000000
[   54.478031][ T2693] RDX: 0000000000000000 RSI: ffff888854e9c700 RDI: ffff888854e9c700
[   54.486105][ T2693] RBP: ffff888108530300 R08: 0000000000000000 R09: 00000000ffff7fff
[   54.494179][ T2693] R10: ffffc90004ad3d08 R11: ffffffff82bd8d88 R12: ffff888108530358
[   54.502264][ T2693] R13: 00000000ffffff9c R14: ffff8881086e96a0 R15: ffff888108530300
[   54.510352][ T2693] FS:  00007f9d97379740(0000) GS:ffff888854e80000(0000) knlGS:0000000000000000
[   54.519388][ T2693] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.526081][ T2693] CR2: 0000556700799178 CR3: 000000018215c001 CR4: 00000000003706e0
[   54.534170][ T2693] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   54.542247][ T2693] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   54.550328][ T2693] Call Trace:
[   54.553758][ T2693]  <TASK>
[ 54.556798][ T2693] __dentry_kill (include/linux/list.h:134 fs/dcache.c:550 fs/dcache.c:603) 
[ 54.561399][ T2693] dentry_kill (fs/dcache.c:755) 
[ 54.565830][ T2693] dput (fs/dcache.c:913) 
[ 54.569770][ T2693] do_unlinkat (fs/namei.c:4321) 
[ 54.574289][ T2693] __x64_sys_unlink (fs/namei.c:4364 fs/namei.c:4362 fs/namei.c:4362) 
[ 54.579056][ T2693] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) 
[ 54.583593][ T2693] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) 
[   54.589613][ T2693] RIP: 0033:0x7f9d9749d247
[ 54.594121][ T2693] Code: f0 ff ff 73 01 c3 48 8b 0d 46 bc 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 bc 0c 00 f7 d8 64 89 01 48
All code
========
   0:	f0 ff                	lock (bad) 
   2:	ff 73 01             	pushq  0x1(%rbx)
   5:	c3                   	retq   
   6:	48 8b 0d 46 bc 0c 00 	mov    0xcbc46(%rip),%rcx        # 0xcbc53
   d:	f7 d8                	neg    %eax
   f:	64 89 01             	mov    %eax,%fs:(%rcx)
  12:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
  16:	c3                   	retq   
  17:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1e:	00 00 00 
  21:	66 90                	xchg   %ax,%ax
  23:	b8 57 00 00 00       	mov    $0x57,%eax
  28:	0f 05                	syscall 
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	retq   
  33:	48 8b 0d 19 bc 0c 00 	mov    0xcbc19(%rip),%rcx        # 0xcbc53
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	retq   
   9:	48 8b 0d 19 bc 0c 00 	mov    0xcbc19(%rip),%rcx        # 0xcbc29
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W
[   54.614042][ T2693] RSP: 002b:00007fff66911a48 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[   54.622596][ T2693] RAX: ffffffffffffffda RBX: 00005637c6657f58 RCX: 00007f9d9749d247
[   54.630716][ T2693] RDX: 0000000000000000 RSI: 00005637c6658287 RDI: 00005637c6658287
[   54.638820][ T2693] RBP: 0000000000000001 R08: 00005637c6657b98 R09: 0000000000000000
[   54.646908][ T2693] R10: 00007fff66911a20 R11: 0000000000000206 R12: 0000000000000000
[   54.654990][ T2693] R13: 00005637c6658287 R14: 00005637c6657a98 R15: 00005637c6673008
[   54.663083][ T2693]  </TASK>
[   54.666234][ T2693] Modules linked in: sg ip_tables overlay rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver btrfs blake2b_generic xor raid6_pq libcrc32c intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp sd_mod coretemp t10_pi crc64_rocksoft_generic kvm_intel crc64_rocksoft crc64 kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ast ghash_clmulni_intel drm_shmem_helper drm_kms_helper sha512_ssse3 syscopyarea sysfillrect sysimgblt ahci rapl libahci ppdev intel_cstate parport_pc wmi_bmof intel_wmi_thunderbolt mei_me video i2c_designware_platform intel_uncore drm libata intel_pmc_core mei i2c_designware_core idma64 intel_pch_thermal ie31200_edac wmi parport acpi_tad acpi_power_meter acpi_pad
[   54.731069][ T2693] ---[ end trace 0000000000000000 ]---
[ 54.736720][ T2693] RIP: 0010:__list_del_entry_valid (lib/list_debug.c:59 (discriminator 3)) 
[ 54.742881][ T2693] Code: 36 6f 82 e8 87 98 a8 ff 0f 0b 48 89 ca 48 c7 c7 20 37 6f 82 e8 76 98 a8 ff 0f 0b 4c 89 c2 48 c7 c7 58 37 6f 82 e8 65 98 a8 ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 a8 37 6f 82 e8 4e 98 a8
All code
========
   0:	36 6f                	outsl  %ss:(%rsi),(%dx)
   2:	82                   	(bad)  
   3:	e8 87 98 a8 ff       	callq  0xffffffffffa8988f
   8:	0f 0b                	ud2    
   a:	48 89 ca             	mov    %rcx,%rdx
   d:	48 c7 c7 20 37 6f 82 	mov    $0xffffffff826f3720,%rdi
  14:	e8 76 98 a8 ff       	callq  0xffffffffffa8988f
  19:	0f 0b                	ud2    
  1b:	4c 89 c2             	mov    %r8,%rdx
  1e:	48 c7 c7 58 37 6f 82 	mov    $0xffffffff826f3758,%rdi
  25:	e8 65 98 a8 ff       	callq  0xffffffffffa8988f
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	48 89 d1             	mov    %rdx,%rcx
  2f:	4c 89 c6             	mov    %r8,%rsi
  32:	4c 89 ca             	mov    %r9,%rdx
  35:	48 c7 c7 a8 37 6f 82 	mov    $0xffffffff826f37a8,%rdi
  3c:	e8                   	.byte 0xe8
  3d:	4e 98                	rex.WRX cltq 
  3f:	a8                   	.byte 0xa8

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	48 89 d1             	mov    %rdx,%rcx
   5:	4c 89 c6             	mov    %r8,%rsi
   8:	4c 89 ca             	mov    %r9,%rdx
   b:	48 c7 c7 a8 37 6f 82 	mov    $0xffffffff826f37a8,%rdi
  12:	e8                   	.byte 0xe8
  13:	4e 98                	rex.WRX cltq 
  15:	a8                   	.byte 0xa8


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
  

Patch

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index efebbf19f887..e7b990b28610 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -513,7 +513,20 @@  void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
+#if _Q_PENDING_BITS == 8
+	/*
+	 * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock
+	 * variable can avoid the false dependency between the tail field and
+	 * the locked_pending field. This helps the h/w thread (the 1st spinner)
+	 * stay in a low power state and prevents it from being woken up by other
+	 * h/w threads in the same core when they perform xchg_tail() to update
+	 * the tail field only.
+	 */
+	smp_cond_load_acquire(&lock->locked_pending, !VAL);
+	val = atomic_read_acquire(&lock->val);
+#else
 	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+#endif
 
 locked:
 	/*