Message ID | 20230508081532.36379-1-qiuxu.zhuo@intel.com |
---|---|
State | New |
Headers |
From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
To: Peter Zijlstra <peterz@infradead.org>, Waiman Long <longman@redhat.com>, Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>, Boqun Feng <boqun.feng@gmail.com>, linux-kernel@vger.kernel.org
Subject: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits
Date: Mon, 8 May 2023 16:15:32 +0800
Message-Id: <20230508081532.36379-1-qiuxu.zhuo@intel.com>
X-Mailer: git-send-email 2.17.1 |
Series |
[1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits
|
|
Commit Message
Qiuxu Zhuo
May 8, 2023, 8:15 a.m. UTC
The 1st spinner (the head of the MCS queue) spins on the whole qspinlock
variable to check whether the lock has been released. For a contended
qspinlock, this spinning is a hotspot, as each CPU queued in the MCS queue
performs the spinning once it becomes the 1st spinner (the head of the MCS
queue). The Load-Store Unit (LSU) inside a core may handle accesses from
the SMT h/w threads of that core at byte granularity. Making the 1st
spinner spin only on the locked_pending bits (not the whole qspinlock)
therefore avoids a false dependency between the tail field and the
locked_pending field. This micro-optimization helps the h/w thread (the
1st spinner) stay in a low-power state and prevents it from being woken up
by other h/w threads in the same core when they perform xchg_tail() to
update only the tail field. Please see a similar discussion in the link [1].
[1] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
kernel/locking/qspinlock.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
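For readers unfamiliar with the lock-word layout the patch relies on: with a
little-endian configuration and _Q_PENDING_BITS == 8 (i.e. NR_CPUS < 16K), the
4-byte qspinlock word overlays a 2-byte locked_pending half and a 2-byte tail
half. The sketch below is a trimmed illustration of that layout (based on
include/asm-generic/qspinlock_types.h; the field comments are editorial, not
from the source):

/* Trimmed sketch of the qspinlock word layout (little-endian, _Q_PENDING_BITS == 8). */
typedef struct qspinlock {
	union {
		atomic_t val;                   /* the whole 4-byte lock word          */
		struct {
			u8  locked;             /* bits  0-7 : lock holder byte        */
			u8  pending;            /* bits  8-15: pending byte            */
		};
		struct {
			u16 locked_pending;     /* bits  0-15: what the patch spins on */
			u16 tail;               /* bits 16-31: MCS tail (CPU + index)  */
		};
	};
} arch_spinlock_t;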
Comments
On 5/8/23 04:15, Qiuxu Zhuo wrote: > The 1st spinner (header of the MCS queue) spins on the whole qspinlock > variable to check whether the lock is released. For a contended qspinlock, > this spinning is a hotspot as each CPU queued in the MCS queue performs > the spinning when it becomes the 1st spinner (header of the MCS queue). > > The granularity among SMT h/w threads in the same core could be "byte" > which the Load-Store Unit (LSU) inside the core handles. Making the 1st > spinner only spin on locked_pending bits (not the whole qspinlock) can > avoid the false dependency between the tail field and the locked_pending > field. So this micro-optimization helps the h/w thread (the 1st spinner) > stay in a low power state and prevents it from being woken up by other > h/w threads in the same core when they perform xchg_tail() to update > the tail field. Please see a similar discussion in the link [1]. > > [1] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org > > Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > --- > kernel/locking/qspinlock.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c > index efebbf19f887..e7b990b28610 100644 > --- a/kernel/locking/qspinlock.c > +++ b/kernel/locking/qspinlock.c > @@ -513,7 +513,20 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) > if ((val = pv_wait_head_or_lock(lock, node))) > goto locked; > > +#if _Q_PENDING_BITS == 8 > + /* > + * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock > + * variable can avoid the false dependency between the tail field and > + * the locked_pending field. This helps the h/w thread (the 1st spinner) > + * stay in a low power state and prevents it from being woken up by other > + * h/w threads in the same core when they perform xchg_tail() to update > + * the tail field only. > + */ > + smp_cond_load_acquire(&lock->locked_pending, !VAL); > + val = atomic_read_acquire(&lock->val); > +#else > val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); > +#endif > > locked: > /* What hardware can benefit from this change? Do you have any micro-benchmark that can show the performance benefit? Cheers, Longman
Hi Waiman, Thanks for your review of this patch. Please see the comments below. > From: Waiman Long <longman@redhat.com> > Sent: Monday, May 8, 2023 11:31 PM > To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra > <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon > <will@kernel.org> > Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org > Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on > locked_pending bits > > > On 5/8/23 04:15, Qiuxu Zhuo wrote: > > The 1st spinner (header of the MCS queue) spins on the whole qspinlock > > variable to check whether the lock is released. For a contended > > qspinlock, this spinning is a hotspot as each CPU queued in the MCS > > queue performs the spinning when it becomes the 1st spinner (header of > the MCS queue). > > > > The granularity among SMT h/w threads in the same core could be "byte" > > which the Load-Store Unit (LSU) inside the core handles. Making the > > 1st spinner only spin on locked_pending bits (not the whole qspinlock) > > can avoid the false dependency between the tail field and the > > locked_pending field. So this micro-optimization helps the h/w thread > > (the 1st spinner) stay in a low power state and prevents it from being > > woken up by other h/w threads in the same core when they perform > > xchg_tail() to update the tail field. Please see a similar discussion in the link > [1]. > > > > [1] > > https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org > > > > Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > > --- > > kernel/locking/qspinlock.c | 13 +++++++++++++ > > 1 file changed, 13 insertions(+) > > > > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c > > index efebbf19f887..e7b990b28610 100644 > > --- a/kernel/locking/qspinlock.c > > +++ b/kernel/locking/qspinlock.c > > @@ -513,7 +513,20 @@ void __lockfunc > queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) > > if ((val = pv_wait_head_or_lock(lock, node))) > > goto locked; > > > > +#if _Q_PENDING_BITS == 8 > > + /* > > + * Spinning on the 2-byte locked_pending instead of the 4-byte > qspinlock > > + * variable can avoid the false dependency between the tail field and > > + * the locked_pending field. This helps the h/w thread (the 1st > spinner) > > + * stay in a low power state and prevents it from being woken up by > other > > + * h/w threads in the same core when they perform xchg_tail() to > update > > + * the tail field only. > > + */ > > + smp_cond_load_acquire(&lock->locked_pending, !VAL); > > + val = atomic_read_acquire(&lock->val); #else > > val = atomic_cond_read_acquire(&lock->val, !(VAL & > > _Q_LOCKED_PENDING_MASK)); > > +#endif > > > > locked: > > /* > > What hardware can benefit from this change? Do you have any micro- > benchmark that can show the performance benefit? i) I don't have the hardware to measure the data. But I run a benchmark [1] for the contended spinlock on an Intel Sapphire Rapids server (192 h/w threads, 2sockets) that showed that the 1st spinner spinning was a hotspot (contributed ~55% cache bouncing traffic measured by the perf C2C. I don't analyze the cache bouncing here, but just say the spinning is a hotspot). ii) The similar micro-optimization discussion [2] (looked like it was accepted by you 😉) that avoiding the false dependency (between the tail field and the locked_pending field) should help some arches (e.g., some ARM64???) 
the h/w thread (spinner) stay in a low-power state without the disruption by its sibling h/w threads in the same core. [1] https://github.com/yamahata/ipi_benchmark.git [2] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org Thanks -Qiuxu > Cheers, > Longman
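For reference, the false-dependency argument above hinges on xchg_tail()
touching only the tail halfword when _Q_PENDING_BITS == 8. A lightly trimmed
sketch of that variant (adapted from kernel/locking/qspinlock.c; the comment
is editorial) looks roughly like this:

/*
 * Sketch of the _Q_PENDING_BITS == 8 tail update (adapted from
 * kernel/locking/qspinlock.c): only the upper 16-bit tail halfword is
 * exchanged, so in principle a spinner watching just locked_pending need
 * not be woken by CPUs queueing themselves behind it.
 */
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	return (u32)xchg_relaxed(&lock->tail,
				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}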
On 5/8/23 22:45, Zhuo, Qiuxu wrote: > Hi Waiman, > > Thanks for your review of this patch. > Please see the comments below. > >> From: Waiman Long <longman@redhat.com> >> Sent: Monday, May 8, 2023 11:31 PM >> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra >> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon >> <will@kernel.org> >> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org >> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on >> locked_pending bits >> >> >> On 5/8/23 04:15, Qiuxu Zhuo wrote: >>> The 1st spinner (header of the MCS queue) spins on the whole qspinlock >>> variable to check whether the lock is released. For a contended >>> qspinlock, this spinning is a hotspot as each CPU queued in the MCS >>> queue performs the spinning when it becomes the 1st spinner (header of >> the MCS queue). >>> The granularity among SMT h/w threads in the same core could be "byte" >>> which the Load-Store Unit (LSU) inside the core handles. Making the >>> 1st spinner only spin on locked_pending bits (not the whole qspinlock) >>> can avoid the false dependency between the tail field and the >>> locked_pending field. So this micro-optimization helps the h/w thread >>> (the 1st spinner) stay in a low power state and prevents it from being >>> woken up by other h/w threads in the same core when they perform >>> xchg_tail() to update the tail field. Please see a similar discussion in the link >> [1]. >>> [1] >>> https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org >>> >>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> >>> --- >>> kernel/locking/qspinlock.c | 13 +++++++++++++ >>> 1 file changed, 13 insertions(+) >>> >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c >>> index efebbf19f887..e7b990b28610 100644 >>> --- a/kernel/locking/qspinlock.c >>> +++ b/kernel/locking/qspinlock.c >>> @@ -513,7 +513,20 @@ void __lockfunc >> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) >>> if ((val = pv_wait_head_or_lock(lock, node))) >>> goto locked; >>> >>> +#if _Q_PENDING_BITS == 8 >>> + /* >>> + * Spinning on the 2-byte locked_pending instead of the 4-byte >> qspinlock >>> + * variable can avoid the false dependency between the tail field and >>> + * the locked_pending field. This helps the h/w thread (the 1st >> spinner) >>> + * stay in a low power state and prevents it from being woken up by >> other >>> + * h/w threads in the same core when they perform xchg_tail() to >> update >>> + * the tail field only. >>> + */ >>> + smp_cond_load_acquire(&lock->locked_pending, !VAL); >>> + val = atomic_read_acquire(&lock->val); #else >>> val = atomic_cond_read_acquire(&lock->val, !(VAL & >>> _Q_LOCKED_PENDING_MASK)); >>> +#endif >>> >>> locked: >>> /* >> What hardware can benefit from this change? Do you have any micro- >> benchmark that can show the performance benefit? > i) I don't have the hardware to measure the data. > But I run a benchmark [1] for the contended spinlock on an Intel Sapphire Rapids > server (192 h/w threads, 2sockets) that showed that the 1st spinner spinning was > a hotspot (contributed ~55% cache bouncing traffic measured by the perf C2C. > I don't analyze the cache bouncing here, but just say the spinning is a hotspot). I believe the amount of cacheline bouncing will be the same whether you read 32 or 16 bits from the lock word. At least this is my understanding of the x86 arch. Please correct me if my assumption is incorrect. 
>
> ii) The similar micro-optimization discussion [2] (looked like it was accepted by you 😉) that
> avoiding the false dependency (between the tail field and the locked_pending field)
> should help some arches (e.g., some ARM64???) the h/w thread (spinner) stay in a
> low-power state without the disruption by its sibling h/w threads in the same core.

That is true. However, this patch causes one more read from the lock cacheline which
isn't necessary for arches that won't benefit from it. So I am less inclined to accept
it unless there is evidence of the benefit it can bring.

Cheers,
Longman
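To make the trade-off being debated here concrete, the two spin sequences
being compared look roughly as follows (taken from the patch hunk, with
editorial comments; not an authoritative reading of either side's argument):

/* Current mainline: a single acquire-spin on the full 4-byte lock word. */
val = atomic_cond_read_acquire(&lock->val,
			       !(VAL & _Q_LOCKED_PENDING_MASK));

/*
 * With the patch (_Q_PENDING_BITS == 8): spin on the 2-byte locked_pending
 * half only, then do a second acquire load to refresh the full lock word --
 * one extra read of the (potentially contended) lock cacheline plus an
 * extra acquire barrier, in exchange for dropping the VAL mask test.
 */
smp_cond_load_acquire(&lock->locked_pending, !VAL);
val = atomic_read_acquire(&lock->val);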
> From: Waiman Long <longman@redhat.com> > Sent: Tuesday, May 9, 2023 10:58 AM > To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra > <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon > <will@kernel.org> > Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org > Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on > locked_pending bits > > > On 5/8/23 22:45, Zhuo, Qiuxu wrote: > > Hi Waiman, > > > > Thanks for your review of this patch. > > Please see the comments below. > > > >> From: Waiman Long <longman@redhat.com> > >> Sent: Monday, May 8, 2023 11:31 PM > >> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra > >> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon > >> <will@kernel.org> > >> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org > >> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only > >> spin on locked_pending bits > >> > >> > >> On 5/8/23 04:15, Qiuxu Zhuo wrote: > >>> The 1st spinner (header of the MCS queue) spins on the whole > >>> qspinlock variable to check whether the lock is released. For a > >>> contended qspinlock, this spinning is a hotspot as each CPU queued > >>> in the MCS queue performs the spinning when it becomes the 1st > >>> spinner (header of > >> the MCS queue). > >>> The granularity among SMT h/w threads in the same core could be > "byte" > >>> which the Load-Store Unit (LSU) inside the core handles. Making the > >>> 1st spinner only spin on locked_pending bits (not the whole > >>> qspinlock) can avoid the false dependency between the tail field and > >>> the locked_pending field. So this micro-optimization helps the h/w > >>> thread (the 1st spinner) stay in a low power state and prevents it > >>> from being woken up by other h/w threads in the same core when they > >>> perform > >>> xchg_tail() to update the tail field. Please see a similar > >>> discussion in the link > >> [1]. > >>> [1] > >>> https://lore.kernel.org/r/20230105021952.3090070-1- > guoren@kernel.org > >>> > >>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> > >>> --- > >>> kernel/locking/qspinlock.c | 13 +++++++++++++ > >>> 1 file changed, 13 insertions(+) > >>> > >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c > >>> index efebbf19f887..e7b990b28610 100644 > >>> --- a/kernel/locking/qspinlock.c > >>> +++ b/kernel/locking/qspinlock.c > >>> @@ -513,7 +513,20 @@ void __lockfunc > >> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) > >>> if ((val = pv_wait_head_or_lock(lock, node))) > >>> goto locked; > >>> > >>> +#if _Q_PENDING_BITS == 8 > >>> + /* > >>> + * Spinning on the 2-byte locked_pending instead of the 4-byte > >> qspinlock > >>> + * variable can avoid the false dependency between the tail field and > >>> + * the locked_pending field. This helps the h/w thread (the 1st > >> spinner) > >>> + * stay in a low power state and prevents it from being woken up > >>> +by > >> other > >>> + * h/w threads in the same core when they perform xchg_tail() to > >> update > >>> + * the tail field only. > >>> + */ > >>> + smp_cond_load_acquire(&lock->locked_pending, !VAL); > >>> + val = atomic_read_acquire(&lock->val); #else > >>> val = atomic_cond_read_acquire(&lock->val, !(VAL & > >>> _Q_LOCKED_PENDING_MASK)); > >>> +#endif > >>> > >>> locked: > >>> /* > >> What hardware can benefit from this change? Do you have any micro- > >> benchmark that can show the performance benefit? > > i) I don't have the hardware to measure the data. 
> > But I run a benchmark [1] for the contended spinlock on an Intel > Sapphire Rapids > > server (192 h/w threads, 2sockets) that showed that the 1st spinner > spinning was > > a hotspot (contributed ~55% cache bouncing traffic measured by the > perf C2C. > > I don't analyze the cache bouncing here, but just say the spinning is a > hotspot). > I believe the amount of cacheline bouncing will be the same whether you > read 32 or 16 bits from the lock word. At least this is my understanding of the > x86 arch. Please correct me if my assumption is incorrect. You're right. The amount of cache line bouncing was nearly the same either spinning 32 or 16 bits (according to my measured perf C2C data on an x86 server). > > > > ii) The similar micro-optimization discussion [2] (looked like it was accepted > by you 😉) that > > avoiding the false dependency (between the tail field and the > locked_pending field) > > should help some arches (e.g., some ARM64???) the h/w thread > (spinner) stay in a > > low-power state without the disruption by its sibling h/w threads in the > same core. > > That is true. However, this patch causes one more read from the lock > cacheline which isn't necessary for arches that won't benefit from it. > So I am less incline to accept it unless there is evidence of the > benefit it can bring. This patch removes a bitwise AND operation on the VAL value. Does it compensate for the one more read from the cache line? Thanks! - Qiuxu > Cheers, > Longman
On 5/8/23 23:30, Zhuo, Qiuxu wrote: >> From: Waiman Long <longman@redhat.com> >> Sent: Tuesday, May 9, 2023 10:58 AM >> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra >> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon >> <will@kernel.org> >> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org >> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on >> locked_pending bits >> >> >> On 5/8/23 22:45, Zhuo, Qiuxu wrote: >>> Hi Waiman, >>> >>> Thanks for your review of this patch. >>> Please see the comments below. >>> >>>> From: Waiman Long <longman@redhat.com> >>>> Sent: Monday, May 8, 2023 11:31 PM >>>> To: Zhuo, Qiuxu <qiuxu.zhuo@intel.com>; Peter Zijlstra >>>> <peterz@infradead.org>; Ingo Molnar <mingo@redhat.com>; Will Deacon >>>> <will@kernel.org> >>>> Cc: Boqun Feng <boqun.feng@gmail.com>; linux-kernel@vger.kernel.org >>>> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only >>>> spin on locked_pending bits >>>> >>>> >>>> On 5/8/23 04:15, Qiuxu Zhuo wrote: >>>>> The 1st spinner (header of the MCS queue) spins on the whole >>>>> qspinlock variable to check whether the lock is released. For a >>>>> contended qspinlock, this spinning is a hotspot as each CPU queued >>>>> in the MCS queue performs the spinning when it becomes the 1st >>>>> spinner (header of >>>> the MCS queue). >>>>> The granularity among SMT h/w threads in the same core could be >> "byte" >>>>> which the Load-Store Unit (LSU) inside the core handles. Making the >>>>> 1st spinner only spin on locked_pending bits (not the whole >>>>> qspinlock) can avoid the false dependency between the tail field and >>>>> the locked_pending field. So this micro-optimization helps the h/w >>>>> thread (the 1st spinner) stay in a low power state and prevents it >>>>> from being woken up by other h/w threads in the same core when they >>>>> perform >>>>> xchg_tail() to update the tail field. Please see a similar >>>>> discussion in the link >>>> [1]. >>>>> [1] >>>>> https://lore.kernel.org/r/20230105021952.3090070-1- >> guoren@kernel.org >>>>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> >>>>> --- >>>>> kernel/locking/qspinlock.c | 13 +++++++++++++ >>>>> 1 file changed, 13 insertions(+) >>>>> >>>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c >>>>> index efebbf19f887..e7b990b28610 100644 >>>>> --- a/kernel/locking/qspinlock.c >>>>> +++ b/kernel/locking/qspinlock.c >>>>> @@ -513,7 +513,20 @@ void __lockfunc >>>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) >>>>> if ((val = pv_wait_head_or_lock(lock, node))) >>>>> goto locked; >>>>> >>>>> +#if _Q_PENDING_BITS == 8 >>>>> + /* >>>>> + * Spinning on the 2-byte locked_pending instead of the 4-byte >>>> qspinlock >>>>> + * variable can avoid the false dependency between the tail field and >>>>> + * the locked_pending field. This helps the h/w thread (the 1st >>>> spinner) >>>>> + * stay in a low power state and prevents it from being woken up >>>>> +by >>>> other >>>>> + * h/w threads in the same core when they perform xchg_tail() to >>>> update >>>>> + * the tail field only. >>>>> + */ >>>>> + smp_cond_load_acquire(&lock->locked_pending, !VAL); >>>>> + val = atomic_read_acquire(&lock->val); #else >>>>> val = atomic_cond_read_acquire(&lock->val, !(VAL & >>>>> _Q_LOCKED_PENDING_MASK)); >>>>> +#endif >>>>> >>>>> locked: >>>>> /* >>>> What hardware can benefit from this change? Do you have any micro- >>>> benchmark that can show the performance benefit? 
>>> i) I don't have the hardware to measure the data. >>> But I run a benchmark [1] for the contended spinlock on an Intel >> Sapphire Rapids >>> server (192 h/w threads, 2sockets) that showed that the 1st spinner >> spinning was >>> a hotspot (contributed ~55% cache bouncing traffic measured by the >> perf C2C. >>> I don't analyze the cache bouncing here, but just say the spinning is a >> hotspot). >> I believe the amount of cacheline bouncing will be the same whether you >> read 32 or 16 bits from the lock word. At least this is my understanding of the >> x86 arch. Please correct me if my assumption is incorrect. > You're right. > The amount of cache line bouncing was nearly the same either spinning 32 or 16 bits > (according to my measured perf C2C data on an x86 server). > >>> ii) The similar micro-optimization discussion [2] (looked like it was accepted >> by you 😉) that >>> avoiding the false dependency (between the tail field and the >> locked_pending field) >>> should help some arches (e.g., some ARM64???) the h/w thread >> (spinner) stay in a >>> low-power state without the disruption by its sibling h/w threads in the >> same core. >> >> That is true. However, this patch causes one more read from the lock >> cacheline which isn't necessary for arches that won't benefit from it. >> So I am less incline to accept it unless there is evidence of the >> benefit it can bring. > This patch removes a bitwise AND operation on the VAL value. > Does it compensate for the one more read from the cache line? Register to register operation in the case of bitwise AND doesn't cost much, it is the potential reading from a hot cacheline that cause most of the delay. Besides there is also an additional acquire barrier. Again, if there is no concrete proof of a benefit, there is no point to make the code more complicated. Cheers, Longman
Hello, kernel test robot noticed "kernel_BUG_at_lib/list_debug.c" on: commit: 3fa78d3cb9dfb63f66f86b76a7621c9d1c1ee302 ("[PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits") url: https://github.com/intel-lab-lkp/linux/commits/Qiuxu-Zhuo/locking-qspinlock-Make-the-1st-spinner-only-spin-on-locked_pending-bits/20230508-161751 base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ec570320b09f76d52819e60abdccf372658216b6 patch link: https://lore.kernel.org/all/20230508081532.36379-1-qiuxu.zhuo@intel.com/ patch subject: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits in testcase: phoronix-test-suite version: with following parameters: test: sqlite-2.1.0 option_a: 128 cpufreq_governor: performance test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added. test-url: http://www.phoronix-test-suite.com/ compiler: gcc-11 test machine: 16 threads 1 sockets Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz (Coffee Lake) with 32G memory (please refer to attached dmesg/kmsg for entire log/backtrace) +----------------+------------+------------+ | | ec570320b0 | 3fa78d3cb9 | +----------------+------------+------------+ | boot_successes | 0 | 2 | +----------------+------------+------------+ If you fix the issue, kindly add following tag | Reported-by: kernel test robot <oliver.sang@intel.com> | Link: https://lore.kernel.org/oe-lkp/202305151330.ca5d459d-oliver.sang@intel.com [ 54.397344][ T2693] ------------[ cut here ]------------ [ 54.402911][ T2693] kernel BUG at lib/list_debug.c:59! [ 54.408314][ T2693] invalid opcode: 0000 [#1] SMP NOPTI [ 54.413773][ T2693] CPU: 2 PID: 2693 Comm: sqlite3 Tainted: G S 6.3.0-rc1-00010-g3fa78d3cb9df #1 [ 54.424166][ T2693] Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.R00.X188.B13.1903250419 03/25/2019 [ 54.437758][ T2693] RIP: 0010:__list_del_entry_valid (lib/list_debug.c:59 (discriminator 3)) [ 54.443830][ T2693] Code: 36 6f 82 e8 87 98 a8 ff 0f 0b 48 89 ca 48 c7 c7 20 37 6f 82 e8 76 98 a8 ff 0f 0b 4c 89 c2 48 c7 c7 58 37 6f 82 e8 65 98 a8 ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 a8 37 6f 82 e8 4e 98 a8 All code ======== 0: 36 6f outsl %ss:(%rsi),(%dx) 2: 82 (bad) 3: e8 87 98 a8 ff callq 0xffffffffffa8988f 8: 0f 0b ud2 a: 48 89 ca mov %rcx,%rdx d: 48 c7 c7 20 37 6f 82 mov $0xffffffff826f3720,%rdi 14: e8 76 98 a8 ff callq 0xffffffffffa8988f 19: 0f 0b ud2 1b: 4c 89 c2 mov %r8,%rdx 1e: 48 c7 c7 58 37 6f 82 mov $0xffffffff826f3758,%rdi 25: e8 65 98 a8 ff callq 0xffffffffffa8988f 2a:* 0f 0b ud2 <-- trapping instruction 2c: 48 89 d1 mov %rdx,%rcx 2f: 4c 89 c6 mov %r8,%rsi 32: 4c 89 ca mov %r9,%rdx 35: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi 3c: e8 .byte 0xe8 3d: 4e 98 rex.WRX cltq 3f: a8 .byte 0xa8 Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: 48 89 d1 mov %rdx,%rcx 5: 4c 89 c6 mov %r8,%rsi 8: 4c 89 ca mov %r9,%rdx b: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi 12: e8 .byte 0xe8 13: 4e 98 rex.WRX cltq 15: a8 .byte 0xa8 [ 54.463787][ T2693] RSP: 0018:ffffc90004ad3e50 EFLAGS: 00010246 [ 54.469954][ T2693] RAX: 000000000000006d RBX: ffff88887567c6c0 RCX: 0000000000000000 [ 54.478031][ T2693] RDX: 0000000000000000 RSI: ffff888854e9c700 RDI: ffff888854e9c700 [ 54.486105][ T2693] RBP: ffff888108530300 R08: 0000000000000000 R09: 00000000ffff7fff [ 54.494179][ 
T2693] R10: ffffc90004ad3d08 R11: ffffffff82bd8d88 R12: ffff888108530358 [ 54.502264][ T2693] R13: 00000000ffffff9c R14: ffff8881086e96a0 R15: ffff888108530300 [ 54.510352][ T2693] FS: 00007f9d97379740(0000) GS:ffff888854e80000(0000) knlGS:0000000000000000 [ 54.519388][ T2693] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 54.526081][ T2693] CR2: 0000556700799178 CR3: 000000018215c001 CR4: 00000000003706e0 [ 54.534170][ T2693] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 54.542247][ T2693] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 54.550328][ T2693] Call Trace: [ 54.553758][ T2693] <TASK> [ 54.556798][ T2693] __dentry_kill (include/linux/list.h:134 fs/dcache.c:550 fs/dcache.c:603) [ 54.561399][ T2693] dentry_kill (fs/dcache.c:755) [ 54.565830][ T2693] dput (fs/dcache.c:913) [ 54.569770][ T2693] do_unlinkat (fs/namei.c:4321) [ 54.574289][ T2693] __x64_sys_unlink (fs/namei.c:4364 fs/namei.c:4362 fs/namei.c:4362) [ 54.579056][ T2693] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 54.583593][ T2693] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [ 54.589613][ T2693] RIP: 0033:0x7f9d9749d247 [ 54.594121][ T2693] Code: f0 ff ff 73 01 c3 48 8b 0d 46 bc 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 bc 0c 00 f7 d8 64 89 01 48 All code ======== 0: f0 ff lock (bad) 2: ff 73 01 pushq 0x1(%rbx) 5: c3 retq 6: 48 8b 0d 46 bc 0c 00 mov 0xcbc46(%rip),%rcx # 0xcbc53 d: f7 d8 neg %eax f: 64 89 01 mov %eax,%fs:(%rcx) 12: 48 83 c8 ff or $0xffffffffffffffff,%rax 16: c3 retq 17: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) 1e: 00 00 00 21: 66 90 xchg %ax,%ax 23: b8 57 00 00 00 mov $0x57,%eax 28: 0f 05 syscall 2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction 30: 73 01 jae 0x33 32: c3 retq 33: 48 8b 0d 19 bc 0c 00 mov 0xcbc19(%rip),%rcx # 0xcbc53 3a: f7 d8 neg %eax 3c: 64 89 01 mov %eax,%fs:(%rcx) 3f: 48 rex.W Code starting with the faulting instruction =========================================== 0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax 6: 73 01 jae 0x9 8: c3 retq 9: 48 8b 0d 19 bc 0c 00 mov 0xcbc19(%rip),%rcx # 0xcbc29 10: f7 d8 neg %eax 12: 64 89 01 mov %eax,%fs:(%rcx) 15: 48 rex.W [ 54.614042][ T2693] RSP: 002b:00007fff66911a48 EFLAGS: 00000206 ORIG_RAX: 0000000000000057 [ 54.622596][ T2693] RAX: ffffffffffffffda RBX: 00005637c6657f58 RCX: 00007f9d9749d247 [ 54.630716][ T2693] RDX: 0000000000000000 RSI: 00005637c6658287 RDI: 00005637c6658287 [ 54.638820][ T2693] RBP: 0000000000000001 R08: 00005637c6657b98 R09: 0000000000000000 [ 54.646908][ T2693] R10: 00007fff66911a20 R11: 0000000000000206 R12: 0000000000000000 [ 54.654990][ T2693] R13: 00005637c6658287 R14: 00005637c6657a98 R15: 00005637c6673008 [ 54.663083][ T2693] </TASK> [ 54.666234][ T2693] Modules linked in: sg ip_tables overlay rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver btrfs blake2b_generic xor raid6_pq libcrc32c intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp sd_mod coretemp t10_pi crc64_rocksoft_generic kvm_intel crc64_rocksoft crc64 kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ast ghash_clmulni_intel drm_shmem_helper drm_kms_helper sha512_ssse3 syscopyarea sysfillrect sysimgblt ahci rapl libahci ppdev intel_cstate parport_pc wmi_bmof intel_wmi_thunderbolt mei_me video i2c_designware_platform intel_uncore drm libata intel_pmc_core mei i2c_designware_core idma64 intel_pch_thermal ie31200_edac wmi 
parport acpi_tad acpi_power_meter acpi_pad [ 54.731069][ T2693] ---[ end trace 0000000000000000 ]--- [ 54.736720][ T2693] RIP: 0010:__list_del_entry_valid (lib/list_debug.c:59 (discriminator 3)) [ 54.742881][ T2693] Code: 36 6f 82 e8 87 98 a8 ff 0f 0b 48 89 ca 48 c7 c7 20 37 6f 82 e8 76 98 a8 ff 0f 0b 4c 89 c2 48 c7 c7 58 37 6f 82 e8 65 98 a8 ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 a8 37 6f 82 e8 4e 98 a8 All code ======== 0: 36 6f outsl %ss:(%rsi),(%dx) 2: 82 (bad) 3: e8 87 98 a8 ff callq 0xffffffffffa8988f 8: 0f 0b ud2 a: 48 89 ca mov %rcx,%rdx d: 48 c7 c7 20 37 6f 82 mov $0xffffffff826f3720,%rdi 14: e8 76 98 a8 ff callq 0xffffffffffa8988f 19: 0f 0b ud2 1b: 4c 89 c2 mov %r8,%rdx 1e: 48 c7 c7 58 37 6f 82 mov $0xffffffff826f3758,%rdi 25: e8 65 98 a8 ff callq 0xffffffffffa8988f 2a:* 0f 0b ud2 <-- trapping instruction 2c: 48 89 d1 mov %rdx,%rcx 2f: 4c 89 c6 mov %r8,%rsi 32: 4c 89 ca mov %r9,%rdx 35: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi 3c: e8 .byte 0xe8 3d: 4e 98 rex.WRX cltq 3f: a8 .byte 0xa8 Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: 48 89 d1 mov %rdx,%rcx 5: 4c 89 c6 mov %r8,%rsi 8: 4c 89 ca mov %r9,%rdx b: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi 12: e8 .byte 0xe8 13: 4e 98 rex.WRX cltq 15: a8 .byte 0xa8 To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests sudo bin/lkp install job.yaml # job file is attached in this email bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run sudo bin/lkp run generated-yaml-file # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index efebbf19f887..e7b990b28610 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -513,7 +513,20 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
+#if _Q_PENDING_BITS == 8
+	/*
+	 * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock
+	 * variable can avoid the false dependency between the tail field and
+	 * the locked_pending field. This helps the h/w thread (the 1st spinner)
+	 * stay in a low power state and prevents it from being woken up by other
+	 * h/w threads in the same core when they perform xchg_tail() to update
+	 * the tail field only.
+	 */
+	smp_cond_load_acquire(&lock->locked_pending, !VAL);
+	val = atomic_read_acquire(&lock->val);
+#else
 	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+#endif
 
 locked:
 	/*