Message ID | 20230731023308.3748432-1-guoren@kernel.org
---|---
State | New
Headers |
From: guoren@kernel.org
To: guoren@kernel.org, David.Laight@ACULAB.COM, will@kernel.org, peterz@infradead.org, mingo@redhat.com, longman@redhat.com
Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org, Guo Ren <guoren@linux.alibaba.com>
Subject: [PATCH V2] asm-generic: ticket-lock: Optimize arch_spin_value_unlocked
Date: Sun, 30 Jul 2023 22:33:08 -0400
Message-Id: <20230731023308.3748432-1-guoren@kernel.org>
X-Mailer: git-send-email 2.36.1
Series | [V2] asm-generic: ticket-lock: Optimize arch_spin_value_unlocked
Commit Message
Guo Ren
July 31, 2023, 2:33 a.m. UTC
From: Guo Ren <guoren@linux.alibaba.com>

arch_spin_value_unlocked() would cause an unnecessary memory access to the contended value. Although this won't create a significant performance gap on most architectures, the arch_spin_value_unlocked() argument already contains enough information. Thus, remove the unnecessary atomic_read() from arch_spin_value_unlocked().

The caller of arch_spin_value_unlocked() could benefit from this change. Currently, the only caller is lockref.

Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
---
Changelog
V2:
 - Fix up the commit log per Waiman's advice.
 - Add Waiman's comment in the commit msg.
---
 include/asm-generic/spinlock.h | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)
Comments
On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The arch_spin_value_unlocked would cause an unnecessary memory
> access to the contended value. Although it won't cause a significant
> performance gap in most architectures, the arch_spin_value_unlocked
> argument contains enough information. Thus, remove unnecessary
> atomic_read in arch_spin_value_unlocked().
>
> The caller of arch_spin_value_unlocked() could benefit from this
> change. Currently, the only caller is lockref.
>
> [...]

Acked-by: Will Deacon <will@kernel.org>

It would be nice to see some numbers showing this actually helps, though.

Will
On 2023/7/31 10:33, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> [...]
>
> diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
> index fdfebcb050f4..90803a826ba0 100644
> --- a/include/asm-generic/spinlock.h
> +++ b/include/asm-generic/spinlock.h
> @@ -68,11 +68,18 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>  	smp_store_release(ptr, (u16)val + 1);
>  }
>  
> +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> +{
> +	u32 val = lock.counter;
> +
> +	return ((val >> 16) == (val & 0xffff));
> +}

I do not know much about locking; will `lock` be cached in a register,
without another memory access, the way a READ_ONCE or atomic_read result
is?

Regards
Bibo Mao

> +
>  static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
>  {
> -	u32 val = atomic_read(lock);
> +	arch_spinlock_t val = READ_ONCE(*lock);
>  
> -	return ((val >> 16) != (val & 0xffff));
> +	return !arch_spin_value_unlocked(val);
>  }
>  
>  static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
> @@ -82,11 +89,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
>  	return (s16)((val >> 16) - (val & 0xffff)) > 1;
>  }
>  
> -static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> -{
> -	return !arch_spin_is_locked(&lock);
> -}
> -
>  #include <asm/qrwlock.h>
>  
>  #endif /* __ASM_GENERIC_SPINLOCK_H */
On 7/30/23 22:33, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The arch_spin_value_unlocked would cause an unnecessary memory
> access to the contended value. Although it won't cause a significant
> performance gap in most architectures, the arch_spin_value_unlocked
> argument contains enough information. Thus, remove unnecessary
> atomic_read in arch_spin_value_unlocked().
>
> The caller of arch_spin_value_unlocked() could benefit from this
> change. Currently, the only caller is lockref.
>
> [...]

I am fine with the current change. However, a modern optimizing compiler should be able to avoid the redundant memory read anyway, so this patch may not have an impact from a performance point of view.

Acked-by: Waiman Long <longman@redhat.com>
On 2023/7/31 23:16, Waiman Long wrote:
> On 7/30/23 22:33, guoren@kernel.org wrote:
>> [...]
>
> I am fine with the current change. However, a modern optimizing
> compiler should be able to avoid the redundant memory read anyway, so
> this patch may not have an impact from a performance point of view.

arch_spin_value_unlocked is called from lockref like this:

#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
	int retry = 100;							\
	struct lockref old;							\
	BUILD_BUG_ON(sizeof(old) != 8);						\
	old.lock_count = READ_ONCE(lockref->lock_count);			\
	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
		struct lockref new = old;					\
		CODE								\
		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
						 &old.lock_count,		\
						 new.lock_count))) {		\
			SUCCESS;						\
		}								\
		if (!--retry)							\
			break;							\
	}									\
} while (0)

With a modern optimizing compiler, is it possible that the old value of
old.lock.rlock.raw_lock is cached in a register, despite that
try_cmpxchg64_relaxed modifies the memory of old.lock_count with a new
value?

Regards
Bibo Mao

> Acked-by: Waiman Long <longman@redhat.com>
On 7/31/23 21:37, bibo mao wrote:
> On 2023/7/31 23:16, Waiman Long wrote:
>> [...]
>
> arch_spin_value_unlocked is called from lockref like this:
>
> [...]
>
> With a modern optimizing compiler, is it possible that the old value of
> old.lock.rlock.raw_lock is cached in a register, despite that
> try_cmpxchg64_relaxed modifies the memory of old.lock_count with a new
> value?

What I meant is that the call to arch_spin_value_unlocked() as it is today will not generate two memory reads of the same location, with or without the patch. Of course, a new memory read will be needed after a failed cmpxchg().

Cheers,
Longman
On 2023/8/1 09:59, Waiman Long wrote:
> On 7/31/23 21:37, bibo mao wrote:
>> [...]
>>
>> With a modern optimizing compiler, is it possible that the old value of
>> old.lock.rlock.raw_lock is cached in a register, despite that
>> try_cmpxchg64_relaxed modifies the memory of old.lock_count with a new
>> value?
>
> What I meant is that the call to arch_spin_value_unlocked() as it is
> today will not generate two memory reads of the same location, with or
> without the patch. Of course, a new memory read will be needed after a
> failed cmpxchg().

Yes, that solves the issue: a new memory read happens after a failed cmpxchg().

Regards
Bibo Mao
On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The arch_spin_value_unlocked would cause an unnecessary memory
> access to the contended value. Although it won't cause a significant
> performance gap in most architectures, the arch_spin_value_unlocked
> argument contains enough information. Thus, remove unnecessary
> atomic_read in arch_spin_value_unlocked().
>
> The caller of arch_spin_value_unlocked() could benefit from this
> change. Currently, the only caller is lockref.

Have you verified you are getting an extra memory access from this in
lockref? What architecture is it?

I have no opinion about the patch itself. I will note, though, that the
argument to the routine is *not* the actual memory-shared lockref;
instead, it is a local copy obtained with READ_ONCE from the real thing.
So I would be surprised if the stock routine were generating accesses to
that sucker. Nonetheless, if the patched routine adds nasty asm, that
would be nice to sort out.

FWIW, on x86-64 qspinlock is used (i.e. not the stuff you are patching),
and I verified there are only two memory accesses -- the initial
READ_ONCE and the later cmpxchg. I don't know which archs *don't* use
qspinlock.

It also turns out the generated asm is quite atrocious, and cleaning it
up may yield a small win under more traffic. Maybe I'll see about it
later this week.

For example, disassembling lockref_put_return:

   <+0>:  mov    (%rdi),%rax         <-- initial load, expected
   <+3>:  mov    $0x64,%r8d
   <+9>:  mov    %rax,%rdx
  <+12>:  test   %eax,%eax           <-- retries loop back here;
                                         this is also the unlocked check
  <+14>:  jne    0xffffffff8157aba3 <lockref_put_return+67>
  <+16>:  mov    %rdx,%rsi
  <+19>:  mov    %edx,%edx
  <+21>:  sar    $0x20,%rsi
  <+25>:  lea    -0x1(%rsi),%ecx     <-- new.count--
  <+28>:  shl    $0x20,%rcx
  <+32>:  or     %rcx,%rdx
  <+35>:  test   %esi,%esi
  <+37>:  jle    0xffffffff8157aba3 <lockref_put_return+67>
  <+39>:  lock cmpxchg %rdx,(%rdi)   <-- the attempt to change
  <+44>:  jne    0xffffffff8157ab9a <lockref_put_return+58>
  <+46>:  shr    $0x20,%rdx
  <+50>:  mov    %rdx,%rax
  <+53>:  jmp    0xffffffff81af8540 <__x86_return_thunk>
  <+58>:  mov    %rax,%rdx
  <+61>:  sub    $0x1,%r8d           <-- retry count check
  <+65>:  jne    0xffffffff8157ab6c <lockref_put_return+12>  <-- go back
  <+67>:  mov    $0xffffffff,%eax
  <+72>:  jmp    0xffffffff81af8540 <__x86_return_thunk>
On Mon, Aug 07, 2023 at 08:36:55PM +0200, Mateusz Guzik wrote:
> On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote:
> > [...]
>
> Have you verified you are getting an extra memory access from this in
> lockref? What architecture is it?

For riscv, this patch optimizes lockref under the same compile conditions:

 - After lifting the data dependency, the compiler optimizes the
   prologue, so the callee-saved-register path becomes optional. This is
   a significant improvement in lockref itself.
 - Compare the "98:" and "9c:" lines before the patch with the "88:"
   line after the patch: we saved two memory accesses, not just one load.

========================================================================
Before the patch:

void lockref_get(struct lockref *lockref)
{
  78:	fd010113	add	sp,sp,-48
  7c:	02813023	sd	s0,32(sp)
  80:	02113423	sd	ra,40(sp)
  84:	03010413	add	s0,sp,48

0000000000000088 <.LBB296>:
	CMPXCHG_LOOP(
  88:	00053783	ld	a5,0(a0)

000000000000008c <.LBB265>:
}

static __always_inline int ticket_spin_is_locked(arch_spinlock_t *lock)
{
	u32 val = atomic_read(&lock->val);
	return ((val >> 16) != (val & 0xffff));
  8c:	00010637	lui	a2,0x10

0000000000000090 <.LBE265>:
  90:	06400593	li	a1,100

0000000000000094 <.LBB274>:
  94:	fff60613	add	a2,a2,-1 # ffff <.LLST8+0xf49a>

0000000000000098 <.L8>:
  98:	fef42423	sw	a5,-24(s0)

000000000000009c <.LBB269>:
  9c:	fe842703	lw	a4,-24(s0)

00000000000000a0 <.LBE269>:
  a0:	0107569b	srlw	a3,a4,0x10
  a4:	00c77733	and	a4,a4,a2
  a8:	04e69063	bne	a3,a4,e8 <.L12>

00000000000000ac <.LBB282>:
  ac:	4207d693	sra	a3,a5,0x20
  b0:	02079713	sll	a4,a5,0x20
  b4:	0016869b	addw	a3,a3,1
  b8:	02069693	sll	a3,a3,0x20
  bc:	02075713	srl	a4,a4,0x20
  c0:	00d76733	or	a4,a4,a3

00000000000000c4 <.L0^B1>:
  c4:	100536af	lr.d	a3,(a0)
  c8:	00f69863	bne	a3,a5,d8 <.L1^B1>
  cc:	1ae5382f	sc.d.rl	a6,a4,(a0)
  d0:	fe081ae3	bnez	a6,c4 <.L0^B1>
  d4:	0330000f	fence	rw,rw

00000000000000d8 <.L1^B1>:
  d8:	02d78a63	beq	a5,a3,10c <.L7>

00000000000000dc <.LBE292>:
  dc:	fff5859b	addw	a1,a1,-1

00000000000000e0 <.LBB293>:
  e0:	00068793	mv	a5,a3

00000000000000e4 <.LBE293>:
  e4:	fa059ae3	bnez	a1,98 <.L8>

00000000000000e8 <.L12>:

========================================================================
After the patch:

void lockref_get(struct lockref *lockref)
{
	CMPXCHG_LOOP(
  78:	00053783	ld	a5,0(a0)

000000000000007c <.LBB212>:

static __always_inline int ticket_spin_value_unlocked(arch_spinlock_t lock)
{
	u32 val = lock.val.counter;

	return ((val >> 16) == (val & 0xffff));
  7c:	00010637	lui	a2,0x10

0000000000000080 <.LBE212>:
  80:	06400593	li	a1,100

0000000000000084 <.LBB216>:
  84:	fff60613	add	a2,a2,-1 # ffff <.LLST8+0xf4aa>

0000000000000088 <.L8>:
  88:	0007871b	sext.w	a4,a5

000000000000008c <.LBB217>:
  8c:	0107d69b	srlw	a3,a5,0x10
  90:	00c77733	and	a4,a4,a2
  94:	04e69063	bne	a3,a4,d4 <.L12>

0000000000000098 <.LBB218>:
  98:	4207d693	sra	a3,a5,0x20
  9c:	02079713	sll	a4,a5,0x20
  a0:	0016869b	addw	a3,a3,1
  a4:	02069693	sll	a3,a3,0x20
  a8:	02075713	srl	a4,a4,0x20
  ac:	00d76733	or	a4,a4,a3

00000000000000b0 <.L0^B1>:
  b0:	100536af	lr.d	a3,(a0)
  b4:	00f69863	bne	a3,a5,c4 <.L1^B1>
  b8:	1ae5382f	sc.d.rl	a6,a4,(a0)
  bc:	fe081ae3	bnez	a6,b0 <.L0^B1>
  c0:	0330000f	fence	rw,rw

00000000000000c4 <.L1^B1>:
  c4:	04d78a63	beq	a5,a3,118 <.L18>

00000000000000c8 <.LBE228>:
  c8:	fff5859b	addw	a1,a1,-1

00000000000000cc <.LBB229>:
  cc:	00068793	mv	a5,a3

00000000000000d0 <.LBE229>:
  d0:	fa059ce3	bnez	a1,88 <.L8>

00000000000000d4 <.L12>:
{
  d4:	fe010113	add	sp,sp,-32
  d8:	00113c23	sd	ra,24(sp)
  dc:	00813823	sd	s0,16(sp)
  e0:	02010413	add	s0,sp,32
========================================================================

> I have no opinion about the patch itself, I will note though that the
> argument to the routine is *not* the actual memory-shared lockref,
> instead it's something from the local copy obtained with READ_ONCE
> from the real thing. So I would be surprised if the stock routine was
> generating accesses to that sucker.
>
> [...]
On 8/8/23, Guo Ren <guoren@kernel.org> wrote: > On Mon, Aug 07, 2023 at 08:36:55PM +0200, Mateusz Guzik wrote: >> On Sun, Jul 30, 2023 at 10:33:08PM -0400, guoren@kernel.org wrote: >> > From: Guo Ren <guoren@linux.alibaba.com> >> > >> > The arch_spin_value_unlocked would cause an unnecessary memory >> > access to the contended value. Although it won't cause a significant >> > performance gap in most architectures, the arch_spin_value_unlocked >> > argument contains enough information. Thus, remove unnecessary >> > atomic_read in arch_spin_value_unlocked(). >> > >> > The caller of arch_spin_value_unlocked() could benefit from this >> > change. Currently, the only caller is lockref. >> > >> >> Have you verified you are getting an extra memory access from this in >> lockref? What architecture is it? > For riscv, this patch could optimize the lock_ref on the same compiling > condition: > - After lifting data dependencies, the compiler optimizes the prologue > behavior, thus the callee-register-saved path becomes optional. This > is a significant improvement on the lock_ref() self. > - Compare the "98: & 9c:" lines before the patch and the "88:" line > after the patch. We saved two memory accesses not only one load. > Now that you mention it, I see riscv in cc. ;) Your commit message states "arch_spin_value_unlocked would cause an unnecessary memory access to the contended value" and that lockref uses it. Perhaps incorrectly I took it to claim lockref is suffering extra loads from the area it modifies with cmpxchg -- as I verified, this is not happening as the argument to arch_spin_value_unlocked is a copy of the target lockref struct. With this not being a problem, potential scalability impact goes down a lot. And so happens with the code from qspinlock on x86-64 there are no extra memory accesses to anything anyway. I don't speak riscv asm so can't comment on the result. 
I'll note again that extra work for the single-threaded case is
definitely worth shaving, and it may or may not affect throughput in the
face of other CPUs messing with the lockref.

You can easily test lockref with will-it-scale; I would suggest the
access() system call, which afaics has the least amount of unrelated
overhead. You can find the bench here:
https://github.com/antonblanchard/will-it-scale/pull/36/files

> ========================================================================
> Before the patch:
> void lockref_get(struct lockref *lockref)
> {
>   78: fd010113  add     sp,sp,-48
>   7c: 02813023  sd      s0,32(sp)
>   80: 02113423  sd      ra,40(sp)
>   84: 03010413  add     s0,sp,48
>
> 0000000000000088 <.LBB296>:
>         CMPXCHG_LOOP(
>   88: 00053783  ld      a5,0(a0)
>
> 000000000000008c <.LBB265>:
> }
>
> static __always_inline int ticket_spin_is_locked(arch_spinlock_t *lock)
> {
>         u32 val = atomic_read(&lock->val);
>         return ((val >> 16) != (val & 0xffff));
>   8c: 00010637  lui     a2,0x10
>
> 0000000000000090 <.LBE265>:
>   90: 06400593  li      a1,100
>
> 0000000000000094 <.LBB274>:
>   94: fff60613  add     a2,a2,-1 # ffff <.LLST8+0xf49a>
>
> 0000000000000098 <.L8>:
>   98: fef42423  sw      a5,-24(s0)
>
> 000000000000009c <.LBB269>:
>   9c: fe842703  lw      a4,-24(s0)
>
> 00000000000000a0 <.LBE269>:
>   a0: 0107569b  srlw    a3,a4,0x10
>   a4: 00c77733  and     a4,a4,a2
>   a8: 04e69063  bne     a3,a4,e8 <.L12>
>
> 00000000000000ac <.LBB282>:
>   ac: 4207d693  sra     a3,a5,0x20
>   b0: 02079713  sll     a4,a5,0x20
>   b4: 0016869b  addw    a3,a3,1
>   b8: 02069693  sll     a3,a3,0x20
>   bc: 02075713  srl     a4,a4,0x20
>   c0: 00d76733  or      a4,a4,a3
>
> 00000000000000c4 <.L0^B1>:
>   c4: 100536af  lr.d    a3,(a0)
>   c8: 00f69863  bne     a3,a5,d8 <.L1^B1>
>   cc: 1ae5382f  sc.d.rl a6,a4,(a0)
>   d0: fe081ae3  bnez    a6,c4 <.L0^B1>
>   d4: 0330000f  fence   rw,rw
>
> 00000000000000d8 <.L1^B1>:
>   d8: 02d78a63  beq     a5,a3,10c <.L7>
>
> 00000000000000dc <.LBE292>:
>   dc: fff5859b  addw    a1,a1,-1
>
> 00000000000000e0 <.LBB293>:
>   e0: 00068793  mv      a5,a3
>
> 00000000000000e4 <.LBE293>:
>   e4: fa059ae3  bnez    a1,98 <.L8>
>
> 00000000000000e8 <.L12>:
>
> ========================================================================
> After the patch:
> void lockref_get(struct lockref *lockref)
> {
>         CMPXCHG_LOOP(
>   78: 00053783  ld      a5,0(a0)
>
> 000000000000007c <.LBB212>:
>
> static __always_inline int ticket_spin_value_unlocked(arch_spinlock_t lock)
> {
>         u32 val = lock.val.counter;
>
>         return ((val >> 16) == (val & 0xffff));
>   7c: 00010637  lui     a2,0x10
>
> 0000000000000080 <.LBE212>:
>   80: 06400593  li      a1,100
>
> 0000000000000084 <.LBB216>:
>   84: fff60613  add     a2,a2,-1 # ffff <.LLST8+0xf4aa>
>
> 0000000000000088 <.L8>:
>   88: 0007871b  sext.w  a4,a5
>
> 000000000000008c <.LBB217>:
>   8c: 0107d69b  srlw    a3,a5,0x10
>   90: 00c77733  and     a4,a4,a2
>   94: 04e69063  bne     a3,a4,d4 <.L12>
>
> 0000000000000098 <.LBB218>:
>   98: 4207d693  sra     a3,a5,0x20
>   9c: 02079713  sll     a4,a5,0x20
>   a0: 0016869b  addw    a3,a3,1
>   a4: 02069693  sll     a3,a3,0x20
>   a8: 02075713  srl     a4,a4,0x20
>   ac: 00d76733  or      a4,a4,a3
>
> 00000000000000b0 <.L0^B1>:
>   b0: 100536af  lr.d    a3,(a0)
>   b4: 00f69863  bne     a3,a5,c4 <.L1^B1>
>   b8: 1ae5382f  sc.d.rl a6,a4,(a0)
>   bc: fe081ae3  bnez    a6,b0 <.L0^B1>
>   c0: 0330000f  fence   rw,rw
>
> 00000000000000c4 <.L1^B1>:
>   c4: 04d78a63  beq     a5,a3,118 <.L18>
>
> 00000000000000c8 <.LBE228>:
>   c8: fff5859b  addw    a1,a1,-1
>
> 00000000000000cc <.LBB229>:
>   cc: 00068793  mv      a5,a3
>
> 00000000000000d0 <.LBE229>:
>   d0: fa059ce3  bnez    a1,88 <.L8>
>
> 00000000000000d4 <.L12>:
> {
>   d4: fe010113  add     sp,sp,-32
>   d8: 00113c23  sd      ra,24(sp)
>   dc: 00813823  sd      s0,16(sp)
>   e0: 02010413  add     s0,sp,32
> ========================================================================
>
>> I have no opinion about the patch itself, I will note though that the
>> argument to the routine is *not* the actual memory-shared lockref,
>> instead it's something from the local copy obtained with READ_ONCE
>> from the real thing. So I would be surprised if the stock routine was
>> generating accesses to that sucker.
>>
>> Nonetheless, if the patched routine adds nasty asm, that would be nice
>> to sort out.
>>
>> FWIW on x86-64 qspinlock is used (i.e. not the stuff you are patching)
>> and I verified there are only 2 memory accesses -- the initial READ_ONCE
>> and later cmpxchg. I don't know which archs *don't* use qspinlock.
>>
>> It also turns out generated asm is quite atrocious and cleaning it up
>> may yield a small win under more traffic. Maybe I'll see about it later
>> this week.
>>
>> For example, disassembling lockref_put_return:
>> <+0>:  mov    (%rdi),%rax         <-- initial load, expected
>> <+3>:  mov    $0x64,%r8d
>> <+9>:  mov    %rax,%rdx
>> <+12>: test   %eax,%eax           <-- retries loop back here
>>                                   <-- this is also the unlocked check
>> <+14>: jne    0xffffffff8157aba3 <lockref_put_return+67>
>> <+16>: mov    %rdx,%rsi
>> <+19>: mov    %edx,%edx
>> <+21>: sar    $0x20,%rsi
>> <+25>: lea    -0x1(%rsi),%ecx     <-- new.count--;
>> <+28>: shl    $0x20,%rcx
>> <+32>: or     %rcx,%rdx
>> <+35>: test   %esi,%esi
>> <+37>: jle    0xffffffff8157aba3 <lockref_put_return+67>
>> <+39>: lock cmpxchg %rdx,(%rdi)   <-- the attempt to change
>> <+44>: jne    0xffffffff8157ab9a <lockref_put_return+58>
>> <+46>: shr    $0x20,%rdx
>> <+50>: mov    %rdx,%rax
>> <+53>: jmp    0xffffffff81af8540 <__x86_return_thunk>
>> <+58>: mov    %rax,%rdx
>> <+61>: sub    $0x1,%r8d           <-- retry count check
>> <+65>: jne    0xffffffff8157ab6c <lockref_put_return+12>  <-- go back
>> <+67>: mov    $0xffffffff,%eax
>> <+72>: jmp    0xffffffff81af8540 <__x86_return_thunk>
>>
>
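For readers following the annotated disassembly above, the control flow
it implements can be sketched in portable C11 (a hypothetical model, not
the kernel's lockref_put_return): the lock word sits in the low 32 bits
(qspinlock convention: 0 means free), the count in the high 32 bits, and
a bounded loop retries the compare-and-swap:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of lockref_put_return's logic. Returns the new count on
 * success, or -1 if the lock is taken, the count would drop to or
 * below zero, or the retry budget (0x64 == 100) runs out. */
static int64_t put_return_sketch(_Atomic uint64_t *lock_count)
{
	uint64_t old = atomic_load(lock_count);       /* initial load */

	for (int retries = 100; retries > 0; retries--) {
		uint32_t lockval = (uint32_t)old;     /* low 32 bits: lock */
		int32_t count = (int32_t)(old >> 32); /* high 32 bits: count */

		if (lockval != 0)                     /* unlocked check */
			break;
		if (count <= 0)                       /* jle: refuse to go <= -1 */
			break;

		/* new.count--: repack the decremented count above the lock */
		uint64_t new = ((uint64_t)(uint32_t)(count - 1) << 32) | lockval;

		/* On failure, 'old' is refreshed and we go around again. */
		if (atomic_compare_exchange_strong(lock_count, &old, new))
			return count - 1;
	}
	return -1;                                    /* the 0xffffffff path */
}
```

The single load before the loop plus the cmpxchg are the only accesses
to the shared word, matching the two-access claim in the quoted mail.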
diff --git a/include/asm-generic/spinlock.h b/include/asm-generic/spinlock.h
index fdfebcb050f4..90803a826ba0 100644
--- a/include/asm-generic/spinlock.h
+++ b/include/asm-generic/spinlock.h
@@ -68,11 +68,18 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 	smp_store_release(ptr, (u16)val + 1);
 }
 
+static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+	u32 val = lock.counter;
+
+	return ((val >> 16) == (val & 0xffff));
+}
+
 static __always_inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-	u32 val = atomic_read(lock);
+	arch_spinlock_t val = READ_ONCE(*lock);
 
-	return ((val >> 16) != (val & 0xffff));
+	return !arch_spin_value_unlocked(val);
 }
 
 static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
@@ -82,11 +89,6 @@ static __always_inline int arch_spin_is_contended(arch_spinlock_t *lock)
 	return (s16)((val >> 16) - (val & 0xffff)) > 1;
 }
 
-static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
-{
-	return !arch_spin_is_locked(&lock);
-}
-
 #include <asm/qrwlock.h>
 #endif /* __ASM_GENERIC_SPINLOCK_H */
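The two shapes the diff swaps between can be compared in userspace C
(illustrative types, not the kernel's): the old helper forwarded the
by-value lock to an is_locked() that re-read it through a pointer (the
atomic_read), while the new one reads the argument directly, so the
compiler is free to keep the whole check in registers. Both compute the
same ticket-lock predicate, next == owner:

```c
#include <stdint.h>

typedef struct { uint32_t counter; } fake_arch_spinlock_t;

/* Old shape: take the address of the by-value copy and force a memory
 * read through it, modeling atomic_read() on the stack temporary. */
static int value_unlocked_old(fake_arch_spinlock_t lock)
{
	uint32_t val = *(volatile uint32_t *)&lock.counter;

	return !((val >> 16) != (val & 0xffff));
}

/* New shape: use the argument value directly; no forced memory access,
 * so the check can stay entirely in registers. */
static int value_unlocked_new(fake_arch_spinlock_t lock)
{
	uint32_t val = lock.counter;

	return ((val >> 16) == (val & 0xffff));
}
```

The results are identical for every lock word; only the code the
compiler is allowed to emit differs, which is what the riscv listings
earlier in the thread show.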