Message ID | 20221224120545.262989-1-guoren@kernel.org |
---|---|
State | New |
Headers | From: guoren@kernel.org; To: peterz@infradead.org, longman@redhat.com; Cc: linux-kernel@vger.kernel.org, Guo Ren <guoren@linux.alibaba.com>, Boqun Feng <boqun.feng@gmail.com>, Will Deacon <will@kernel.org>, Ingo Molnar <mingo@redhat.com>; Subject: [PATCH] locking/qspinlock: Optimize pending state waiting for unlock; Date: Sat, 24 Dec 2022 07:05:45 -0500; Message-Id: <20221224120545.262989-1-guoren@kernel.org> |
Series | locking/qspinlock: Optimize pending state waiting for unlock |
Commit Message
Guo Ren
Dec. 24, 2022, 12:05 p.m. UTC
From: Guo Ren <guoren@linux.alibaba.com>

When we're pending, we only care about lock value. The xchg_tail
wouldn't affect the pending state. That means the hardware thread
could stay in a sleep state and leaves the rest execution units'
resources of pipeline to other hardware threads. This optimization
may work only for SMT scenarios because the granularity between
cores is cache-block.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/qspinlock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
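The reason a byte-sized wait can replace the full-word wait is the qspinlock word layout. Below is a simplified sketch of the little-endian, NR_CPUS < 16K layout, following include/asm-generic/qspinlock_types.h (field names are from that header; this is an abbreviated illustration, not the full definition):

/*
 * Simplified sketch of the qspinlock word (little-endian, NR_CPUS < 16K):
 *
 *  bits  0- 7: locked byte    <- what the pending waiter now watches
 *  bits  8-15: pending        <- held by the waiter in the pending state
 *  bits 16-31: tail           <- what xchg_tail() exchanges
 */
typedef struct qspinlock {
	union {
		atomic_t val;			/* whole 32-bit word */
		struct {
			u8	locked;		/* byte 0 */
			u8	pending;	/* byte 1 */
		};
		struct {
			u16	locked_pending;
			u16	tail;
		};
	};
} arch_spinlock_t;

Because locked and tail occupy disjoint bytes, a load-cond loop on lock->locked is, at the software level, not terminated by concurrent tail updates.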
Comments
On 12/24/22 07:05, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> When we're pending, we only care about lock value. The xchg_tail
> wouldn't affect the pending state. That means the hardware thread
> could stay in a sleep state and leaves the rest execution units'
> resources of pipeline to other hardware threads. This optimization
> may work only for SMT scenarios because the granularity between
> cores is cache-block.
>
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
>  kernel/locking/qspinlock.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 2b23378775fe..ebe6b8ec7cb3 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	/*
>  	 * We're pending, wait for the owner to go away.
>  	 *
> -	 * 0,1,1 -> 0,1,0
> +	 * 0,1,1 -> *,1,0
>  	 *
>  	 * this wait loop must be a load-acquire such that we match the
>  	 * store-release that clears the locked bit and create lock

Yes, we don't care about the tail.

> @@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 * barriers.
>  	 */
>  	if (val & _Q_LOCKED_MASK)
> -		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
> +		smp_cond_load_acquire(&lock->locked, !VAL);
>
>  	/*
>  	 * take ownership and clear the pending bit.

We may save an AND operation here which may be a cycle or two. I
remember that it may be more costly to load a byte instead of an
integer in some arches. So it doesn't seem like that much of an
optimization from my point of view.

I know that arm64 will enter a low power state in this
*cond_load_acquire() loop, but I believe any change in the state of
the lock cacheline will wake it up. So it doesn't really matter if
you are checking a byte or an int.

Do you have any other data point to support your optimization claim?

Cheers,
Longman
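For readers unfamiliar with the primitive under discussion: where an architecture provides no wait-on-load facility, smp_cond_load_acquire() degenerates into a polling loop. The following is a simplified sketch of the generic fallback in include/asm-generic/barrier.h, not a verbatim copy:

/* Sketch of the generic fallback; VAL names the value just (re)loaded. */
#define smp_cond_load_relaxed_sketch(ptr, cond_expr) ({	\
	typeof(*(ptr)) VAL;				\
	for (;;) {					\
		VAL = READ_ONCE(*(ptr));		\
		if (cond_expr)				\
			break;				\
		cpu_relax();				\
	}						\
	VAL;						\
})

/*
 * smp_cond_load_acquire() additionally gives the final load acquire
 * ordering; arm64 instead builds it on LDXR/WFE so the CPU sleeps until
 * the watched location is written, which is the behaviour Waiman refers
 * to above.
 */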
On Sun, Dec 25, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
>
> On 12/24/22 07:05, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > When we're pending, we only care about lock value. The xchg_tail
> > wouldn't affect the pending state. That means the hardware thread
> > could stay in a sleep state and leaves the rest execution units'
> > resources of pipeline to other hardware threads. This optimization
> > may work only for SMT scenarios because the granularity between
> > cores is cache-block.

Please have a look at the comment I've written.

> >
> > Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> > Signed-off-by: Guo Ren <guoren@kernel.org>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Boqun Feng <boqun.feng@gmail.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > ---
> >  kernel/locking/qspinlock.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index 2b23378775fe..ebe6b8ec7cb3 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	/*
> >  	 * We're pending, wait for the owner to go away.
> >  	 *
> > -	 * 0,1,1 -> 0,1,0
> > +	 * 0,1,1 -> *,1,0
> >  	 *
> >  	 * this wait loop must be a load-acquire such that we match the
> >  	 * store-release that clears the locked bit and create lock
> Yes, we don't care about the tail.
> > @@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >  	 * barriers.
> >  	 */
> >  	if (val & _Q_LOCKED_MASK)
> > -		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
> > +		smp_cond_load_acquire(&lock->locked, !VAL);
> >
> >  	/*
> >  	 * take ownership and clear the pending bit.
>
> We may save an AND operation here which may be a cycle or two. I
> remember that it may be more costly to load a byte instead of an
> integer in some arches. So it doesn't seem like that much of an
> optimization from my point of view.

The reason is, of course, not here. See my commit comment.

> I know that arm64 will enter a low power state in this
> *cond_load_acquire() loop, but I believe any change in the state of
> the lock cacheline will wake it up. So it doesn't really matter if
> you are checking a byte or an int.

The situation is the SMT scenarios in the same core. Not an entering
low-power state situation. Of course, the granularity between cores is
"cacheline", but the granularity between SMT hw threads of the same
core could be "byte" which internal LSU handles. For example, when a
hw-thread yields the resources of the core to other hw-threads, this
patch could help the hw-thread stay in the sleep state and prevent it
from being woken up by other hw-threads xchg_tail.

Finally, from the software semantic view, does the patch make it more
accurate? (We don't care about the tail here.)

>
> Do you have any other data point to support your optimization claim?
>
> Cheers,
> Longman
>
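To make the xchg_tail() point concrete: in the _Q_PENDING_BITS == 8 (NR_CPUS < 16K) configuration, xchg_tail() is a 16-bit exchange on lock->tail alone, so it writes a different part of the word than the byte the pending waiter now watches. A simplified sketch based on kernel/locking/qspinlock.c (whether a store to another byte of the same cacheline actually avoids disturbing an SMT sibling is the hardware-specific claim made above, not something the code itself guarantees):

/*
 * Sketch of xchg_tail() for NR_CPUS < 16K: exchange only the upper 16-bit
 * tail halfword. With this patch the pending waiter loads lock->locked
 * (byte 0), which this store does not touch.
 */
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	/*
	 * Relaxed semantics are sufficient; the caller publishes the MCS
	 * node before updating the tail.
	 */
	return (u32)xchg_relaxed(&lock->tail,
				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}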
On 12/24/22 21:57, Guo Ren wrote: > On Sun, Dec 25, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote: >> On 12/24/22 07:05, guoren@kernel.org wrote: >>> From: Guo Ren <guoren@linux.alibaba.com> >>> >>> When we're pending, we only care about lock value. The xchg_tail >>> wouldn't affect the pending state. That means the hardware thread >>> could stay in a sleep state and leaves the rest execution units' >>> resources of pipeline to other hardware threads. This optimization >>> may work only for SMT scenarios because the granularity between >>> cores is cache-block. > Please have a look at the comment I've written. > >>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com> >>> Signed-off-by: Guo Ren <guoren@kernel.org> >>> Cc: Waiman Long <longman@redhat.com> >>> Cc: Peter Zijlstra <peterz@infradead.org> >>> Cc: Boqun Feng <boqun.feng@gmail.com> >>> Cc: Will Deacon <will@kernel.org> >>> Cc: Ingo Molnar <mingo@redhat.com> >>> --- >>> kernel/locking/qspinlock.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c >>> index 2b23378775fe..ebe6b8ec7cb3 100644 >>> --- a/kernel/locking/qspinlock.c >>> +++ b/kernel/locking/qspinlock.c >>> @@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) >>> /* >>> * We're pending, wait for the owner to go away. >>> * >>> - * 0,1,1 -> 0,1,0 >>> + * 0,1,1 -> *,1,0 >>> * >>> * this wait loop must be a load-acquire such that we match the >>> * store-release that clears the locked bit and create lock >> Yes, we don't care about the tail. >>> @@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) >>> * barriers. >>> */ >>> if (val & _Q_LOCKED_MASK) >>> - atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK)); >>> + smp_cond_load_acquire(&lock->locked, !VAL); >>> >>> /* >>> * take ownership and clear the pending bit. >> We may save an AND operation here which may be a cycle or two. I >> remember that it may be more costly to load a byte instead of an integer >> in some arches. So it doesn't seem like that much of an optimization >> from my point of view. > The reason is, of course, not here. See my commit comment. > >> I know that arm64 will enter a low power state in >> this *cond_load_acquire() loop, but I believe any change in the state of >> the the lock cacheline will wake it up. So it doesn't really matter if >> you are checking a byte or an int. > The situation is the SMT scenarios in the same core. Not an entering > low-power state situation. Of course, the granularity between cores is > "cacheline", but the granularity between SMT hw threads of the same > core could be "byte" which internal LSU handles. For example, when a > hw-thread yields the resources of the core to other hw-threads, this > patch could help the hw-thread stay in the sleep state and prevent it > from being woken up by other hw-threads xchg_tail. > > Finally, from the software semantic view, does the patch make it more > accurate? (We don't care about the tail here.) Thanks for the clarification. I am not arguing for the simplification part. I just want to clarify my limited understanding of how the CPU hardware are actually dealing with these conditions. With that, I am fine with this patch. It would be nice if you can elaborate a bit more in your commit log. Acked-by: Waiman Long <longman@redhat.com>
On 12/24/22 22:29, Waiman Long wrote: > On 12/24/22 21:57, Guo Ren wrote: >> On Sun, Dec 25, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote: >>> On 12/24/22 07:05, guoren@kernel.org wrote: >>>> From: Guo Ren <guoren@linux.alibaba.com> >>>> >>>> When we're pending, we only care about lock value. The xchg_tail >>>> wouldn't affect the pending state. That means the hardware thread >>>> could stay in a sleep state and leaves the rest execution units' >>>> resources of pipeline to other hardware threads. This optimization >>>> may work only for SMT scenarios because the granularity between >>>> cores is cache-block. >> Please have a look at the comment I've written. >> >>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com> >>>> Signed-off-by: Guo Ren <guoren@kernel.org> >>>> Cc: Waiman Long <longman@redhat.com> >>>> Cc: Peter Zijlstra <peterz@infradead.org> >>>> Cc: Boqun Feng <boqun.feng@gmail.com> >>>> Cc: Will Deacon <will@kernel.org> >>>> Cc: Ingo Molnar <mingo@redhat.com> >>>> --- >>>> kernel/locking/qspinlock.c | 4 ++-- >>>> 1 file changed, 2 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c >>>> index 2b23378775fe..ebe6b8ec7cb3 100644 >>>> --- a/kernel/locking/qspinlock.c >>>> +++ b/kernel/locking/qspinlock.c >>>> @@ -371,7 +371,7 @@ void __lockfunc >>>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) >>>> /* >>>> * We're pending, wait for the owner to go away. >>>> * >>>> - * 0,1,1 -> 0,1,0 >>>> + * 0,1,1 -> *,1,0 >>>> * >>>> * this wait loop must be a load-acquire such that we match the >>>> * store-release that clears the locked bit and create lock >>> Yes, we don't care about the tail. >>>> @@ -380,7 +380,7 @@ void __lockfunc >>>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) >>>> * barriers. >>>> */ >>>> if (val & _Q_LOCKED_MASK) >>>> - atomic_cond_read_acquire(&lock->val, !(VAL & >>>> _Q_LOCKED_MASK)); >>>> + smp_cond_load_acquire(&lock->locked, !VAL); >>>> >>>> /* >>>> * take ownership and clear the pending bit. >>> We may save an AND operation here which may be a cycle or two. I >>> remember that it may be more costly to load a byte instead of an >>> integer >>> in some arches. So it doesn't seem like that much of an optimization >>> from my point of view. >> The reason is, of course, not here. See my commit comment. >> >>> I know that arm64 will enter a low power state in >>> this *cond_load_acquire() loop, but I believe any change in the >>> state of >>> the the lock cacheline will wake it up. So it doesn't really matter if >>> you are checking a byte or an int. >> The situation is the SMT scenarios in the same core. Not an entering >> low-power state situation. Of course, the granularity between cores is >> "cacheline", but the granularity between SMT hw threads of the same >> core could be "byte" which internal LSU handles. For example, when a >> hw-thread yields the resources of the core to other hw-threads, this >> patch could help the hw-thread stay in the sleep state and prevent it >> from being woken up by other hw-threads xchg_tail. >> >> Finally, from the software semantic view, does the patch make it more >> accurate? (We don't care about the tail here.) > > Thanks for the clarification. > > I am not arguing for the simplification part. I just want to clarify > my limited understanding of how the CPU hardware are actually dealing > with these conditions. > > With that, I am fine with this patch. It would be nice if you can > elaborate a bit more in your commit log. 
> > Acked-by: Waiman Long <longman@redhat.com> > BTW, have you actually observe any performance improvement with this patch? Cheers, Longman
On Sun, Dec 25, 2022 at 11:31 AM Waiman Long <longman@redhat.com> wrote: > > On 12/24/22 22:29, Waiman Long wrote: > > On 12/24/22 21:57, Guo Ren wrote: > >> On Sun, Dec 25, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote: > >>> On 12/24/22 07:05, guoren@kernel.org wrote: > >>>> From: Guo Ren <guoren@linux.alibaba.com> > >>>> > >>>> When we're pending, we only care about lock value. The xchg_tail > >>>> wouldn't affect the pending state. That means the hardware thread > >>>> could stay in a sleep state and leaves the rest execution units' > >>>> resources of pipeline to other hardware threads. This optimization > >>>> may work only for SMT scenarios because the granularity between > >>>> cores is cache-block. > >> Please have a look at the comment I've written. > >> > >>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com> > >>>> Signed-off-by: Guo Ren <guoren@kernel.org> > >>>> Cc: Waiman Long <longman@redhat.com> > >>>> Cc: Peter Zijlstra <peterz@infradead.org> > >>>> Cc: Boqun Feng <boqun.feng@gmail.com> > >>>> Cc: Will Deacon <will@kernel.org> > >>>> Cc: Ingo Molnar <mingo@redhat.com> > >>>> --- > >>>> kernel/locking/qspinlock.c | 4 ++-- > >>>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c > >>>> index 2b23378775fe..ebe6b8ec7cb3 100644 > >>>> --- a/kernel/locking/qspinlock.c > >>>> +++ b/kernel/locking/qspinlock.c > >>>> @@ -371,7 +371,7 @@ void __lockfunc > >>>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) > >>>> /* > >>>> * We're pending, wait for the owner to go away. > >>>> * > >>>> - * 0,1,1 -> 0,1,0 > >>>> + * 0,1,1 -> *,1,0 > >>>> * > >>>> * this wait loop must be a load-acquire such that we match the > >>>> * store-release that clears the locked bit and create lock > >>> Yes, we don't care about the tail. > >>>> @@ -380,7 +380,7 @@ void __lockfunc > >>>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) > >>>> * barriers. > >>>> */ > >>>> if (val & _Q_LOCKED_MASK) > >>>> - atomic_cond_read_acquire(&lock->val, !(VAL & > >>>> _Q_LOCKED_MASK)); > >>>> + smp_cond_load_acquire(&lock->locked, !VAL); > >>>> > >>>> /* > >>>> * take ownership and clear the pending bit. > >>> We may save an AND operation here which may be a cycle or two. I > >>> remember that it may be more costly to load a byte instead of an > >>> integer > >>> in some arches. So it doesn't seem like that much of an optimization > >>> from my point of view. > >> The reason is, of course, not here. See my commit comment. > >> > >>> I know that arm64 will enter a low power state in > >>> this *cond_load_acquire() loop, but I believe any change in the > >>> state of > >>> the the lock cacheline will wake it up. So it doesn't really matter if > >>> you are checking a byte or an int. > >> The situation is the SMT scenarios in the same core. Not an entering > >> low-power state situation. Of course, the granularity between cores is > >> "cacheline", but the granularity between SMT hw threads of the same > >> core could be "byte" which internal LSU handles. For example, when a > >> hw-thread yields the resources of the core to other hw-threads, this > >> patch could help the hw-thread stay in the sleep state and prevent it > >> from being woken up by other hw-threads xchg_tail. > >> > >> Finally, from the software semantic view, does the patch make it more > >> accurate? (We don't care about the tail here.) > > > > Thanks for the clarification. > > > > I am not arguing for the simplification part. 
I just want to clarify > > my limited understanding of how the CPU hardware are actually dealing > > with these conditions. > > > > With that, I am fine with this patch. It would be nice if you can > > elaborate a bit more in your commit log. > > > > Acked-by: Waiman Long <longman@redhat.com> > > > BTW, have you actually observe any performance improvement with this patch? Not yet. I'm researching how the hardware could satisfy qspinlock better. Here are three points I concluded: 1. Atomic forward progress guarantee: Prevent unnecessary LL/SC retry, which may cause expensive bus transactions when crossing the NUMA nodes. 2. Sub-word atomic primitive: Enable freedom from interference between locked, pending, and tail. 3. Load-cond primitive: Prevent processor from wasting loop operations for detection. For points 2 & 3, I have a continuous proposal to add new atomic_read_cond_mask & smp_load_cond_mask for Linux atomic primitives [1]. [1]: https://lore.kernel.org/lkml/20221225115529.490378-1-guoren@kernel.org/ > > Cheers, > Longman > -- Best Regards Guo Ren
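The exact interface for point 3 is the one defined in the proposal at [1]. Purely as an illustration of the idea — the name and signature below are hypothetical, not the proposed API — such a primitive would let a caller wait on a masked portion of an atomic word, which an architecture could map onto a wait-on-load instruction instead of a spin:

/*
 * Hypothetical illustration only -- not the interface proposed in [1].
 * The portable fallback is still a polling loop; the intent is that
 * hardware with a wait-on-load facility could sleep until the masked
 * bits change instead of burning loop iterations.
 */
static inline u32 load_cond_mask_acquire(atomic_t *v, u32 mask, u32 want)
{
	u32 val;

	for (;;) {
		val = atomic_read_acquire(v);
		if ((val & mask) == want)
			return val;
		cpu_relax();
	}
}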
* Guo Ren <guoren@kernel.org> wrote: > > >> The situation is the SMT scenarios in the same core. Not an entering > > >> low-power state situation. Of course, the granularity between cores is > > >> "cacheline", but the granularity between SMT hw threads of the same > > >> core could be "byte" which internal LSU handles. For example, when a > > >> hw-thread yields the resources of the core to other hw-threads, this > > >> patch could help the hw-thread stay in the sleep state and prevent it > > >> from being woken up by other hw-threads xchg_tail. > > >> > > >> Finally, from the software semantic view, does the patch make it more > > >> accurate? (We don't care about the tail here.) > > > > > > Thanks for the clarification. > > > > > > I am not arguing for the simplification part. I just want to clarify > > > my limited understanding of how the CPU hardware are actually dealing > > > with these conditions. > > > > > > With that, I am fine with this patch. It would be nice if you can > > > elaborate a bit more in your commit log. > > > > > > Acked-by: Waiman Long <longman@redhat.com> > > > > > BTW, have you actually observe any performance improvement with this patch? > Not yet. I'm researching how the hardware could satisfy qspinlock > better. Here are three points I concluded: > 1. Atomic forward progress guarantee: Prevent unnecessary LL/SC > retry, which may cause expensive bus transactions when crossing the > NUMA nodes. > 2. Sub-word atomic primitive: Enable freedom from interference > between locked, pending, and tail. > 3. Load-cond primitive: Prevent processor from wasting loop > operations for detection. As to this patch, please send a -v2 version of this patch that has this discussion & explanation included in the changelog, as requested by Waiman. Thanks, Ingo
On Thu, Jan 5, 2023 at 4:19 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Guo Ren <guoren@kernel.org> wrote: > > > > >> The situation is the SMT scenarios in the same core. Not an entering > > > >> low-power state situation. Of course, the granularity between cores is > > > >> "cacheline", but the granularity between SMT hw threads of the same > > > >> core could be "byte" which internal LSU handles. For example, when a > > > >> hw-thread yields the resources of the core to other hw-threads, this > > > >> patch could help the hw-thread stay in the sleep state and prevent it > > > >> from being woken up by other hw-threads xchg_tail. > > > >> > > > >> Finally, from the software semantic view, does the patch make it more > > > >> accurate? (We don't care about the tail here.) > > > > > > > > Thanks for the clarification. > > > > > > > > I am not arguing for the simplification part. I just want to clarify > > > > my limited understanding of how the CPU hardware are actually dealing > > > > with these conditions. > > > > > > > > With that, I am fine with this patch. It would be nice if you can > > > > elaborate a bit more in your commit log. > > > > > > > > Acked-by: Waiman Long <longman@redhat.com> > > > > > > > BTW, have you actually observe any performance improvement with this patch? > > Not yet. I'm researching how the hardware could satisfy qspinlock > > better. Here are three points I concluded: > > 1. Atomic forward progress guarantee: Prevent unnecessary LL/SC > > retry, which may cause expensive bus transactions when crossing the > > NUMA nodes. > > 2. Sub-word atomic primitive: Enable freedom from interference > > between locked, pending, and tail. > > 3. Load-cond primitive: Prevent processor from wasting loop > > operations for detection. > > As to this patch, please send a -v2 version of this patch that has this > discussion & explanation included in the changelog, as requested by Waiman. Done https://lore.kernel.org/lkml/20230105021952.3090070-1-guoren@kernel.org/ > > Thanks, > > Ingo
* Guo Ren <guoren@kernel.org> wrote: > On Thu, Jan 5, 2023 at 4:19 AM Ingo Molnar <mingo@kernel.org> wrote: > > > > > > * Guo Ren <guoren@kernel.org> wrote: > > > > > > >> The situation is the SMT scenarios in the same core. Not an entering > > > > >> low-power state situation. Of course, the granularity between cores is > > > > >> "cacheline", but the granularity between SMT hw threads of the same > > > > >> core could be "byte" which internal LSU handles. For example, when a > > > > >> hw-thread yields the resources of the core to other hw-threads, this > > > > >> patch could help the hw-thread stay in the sleep state and prevent it > > > > >> from being woken up by other hw-threads xchg_tail. > > > > >> > > > > >> Finally, from the software semantic view, does the patch make it more > > > > >> accurate? (We don't care about the tail here.) > > > > > > > > > > Thanks for the clarification. > > > > > > > > > > I am not arguing for the simplification part. I just want to clarify > > > > > my limited understanding of how the CPU hardware are actually dealing > > > > > with these conditions. > > > > > > > > > > With that, I am fine with this patch. It would be nice if you can > > > > > elaborate a bit more in your commit log. > > > > > > > > > > Acked-by: Waiman Long <longman@redhat.com> > > > > > > > > > BTW, have you actually observe any performance improvement with this patch? > > > Not yet. I'm researching how the hardware could satisfy qspinlock > > > better. Here are three points I concluded: > > > 1. Atomic forward progress guarantee: Prevent unnecessary LL/SC > > > retry, which may cause expensive bus transactions when crossing the > > > NUMA nodes. > > > 2. Sub-word atomic primitive: Enable freedom from interference > > > between locked, pending, and tail. > > > 3. Load-cond primitive: Prevent processor from wasting loop > > > operations for detection. > > > > As to this patch, please send a -v2 version of this patch that has this > > discussion & explanation included in the changelog, as requested by Waiman. > Done > > https://lore.kernel.org/lkml/20230105021952.3090070-1-guoren@kernel.org/ Applied to tip:locking/core for a v6.3 merge, thanks! Ingo
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2b23378775fe..ebe6b8ec7cb3 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * We're pending, wait for the owner to go away.
 	 *
-	 * 0,1,1 -> 0,1,0
+	 * 0,1,1 -> *,1,0
 	 *
 	 * this wait loop must be a load-acquire such that we match the
 	 * store-release that clears the locked bit and create lock
@@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * barriers.
 	 */
 	if (val & _Q_LOCKED_MASK)
-		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+		smp_cond_load_acquire(&lock->locked, !VAL);
 
 	/*
 	 * take ownership and clear the pending bit.
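For completeness, the store-release that this load-acquire pairs with also targets only the locked byte. A sketch of the generic unlock fast path, as in include/asm-generic/qspinlock.h:

static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	/*
	 * Release semantics: this store-release of 0 to the locked byte is
	 * exactly what the smp_cond_load_acquire() above waits to observe.
	 */
	smp_store_release(&lock->locked, 0);
}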