Message ID | 20230615120152.20836-1-guohui@uniontech.com |
---|---|
State | New |
Headers |
From: Guo Hui <guohui@uniontech.com> To: peterz@infradead.org, mingo@redhat.com, will@kernel.org, longman@redhat.com, boqun.feng@gmail.com Cc: linux-kernel@vger.kernel.org, wangxiaohua@uniontech.com, Guo Hui <guohui@uniontech.com> Subject: [PATCH v2] locking/osq_lock: Fix false sharing of optimistic_spin_node in osq_lock Date: Thu, 15 Jun 2023 20:01:52 +0800 Message-Id: <20230615120152.20836-1-guohui@uniontech.com> |
Series |
[v2] locking/osq_lock: Fix false sharing of optimistic_spin_node in osq_lock
|
|
Commit Message
Guo Hui
June 15, 2023, 12:01 p.m. UTC
I previously posted a patch to improve the performance of osq_lock:

https://lore.kernel.org/lkml/20220628161251.21950-1-guohui@uniontech.com/

The analysis there concluded that the performance degradation is caused
by the memory access in the following code:

    cpu = node->cpu - 1;

On x86, the instructions corresponding to this C code are:

    mov 0x14(%rax),%edi
    sub $0x1,%edi

These loads suffer a high cache-miss rate and degrade performance.
Further analysis of these memory-access instructions shows that the
cache misses are caused by a large amount of cache-line false sharing
when accessing non-local-CPU variables, as follows:
---------------------------------
| struct optimistic_spin_node |
---------------------------------
| next | prev | locked | cpu |
---------------------------------
| cache line |
---------------------------------
| CPU0 |
---------------------------------
When a CPU other than CPU0 reads optimistic_spin_node->cpu of CPU0's
node, CPU0 is frequently modifying other data in the same cache line,
which causes false sharing on the reading CPU. The
optimistic_spin_node variables are defined as cacheline-aligned
per-CPU variables, each bound to its own CPU core:

    DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

Therefore the value of optimistic_spin_node->cpu rarely changes, and
the false sharing observed when reading optimistic_spin_node->cpu is
caused by frequent modification of the other three fields of
optimistic_spin_node.
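To make the effect concrete, here is a minimal userspace sketch of the
same pattern (illustrative only, not part of the patch; all names are
made up): a reader polling a read-mostly field keeps missing in cache
because a writer dirties a neighboring field in the same cache line.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    struct node {
        atomic_int locked;   /* written constantly by the "owner" thread */
        volatile int cpu;    /* read-mostly, but shares the cache line;
                                volatile so the demo load is not hoisted */
    };

    static struct node n = { .cpu = 1 };

    static void *writer(void *arg)
    {
        /* Keeps the cache line dirty on the writer's core. */
        for (long i = 0; i < 100000000; i++)
            atomic_store_explicit(&n.locked, i & 1, memory_order_relaxed);
        return NULL;
    }

    static void *reader(void *arg)
    {
        long sum = 0;

        /* Each load can miss whenever the writer has dirtied the line. */
        for (long i = 0; i < 100000000; i++)
            sum += n.cpu;
        return (void *)sum;
    }

    int main(void)
    {
        pthread_t w, r;
        void *ret;

        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, &ret);
        printf("sum=%ld\n", (long)ret);
        return 0;
    }

Padding 'cpu' onto its own cache line (or caching it locally, as this
patch does) removes those misses.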
There are two possible solutions:

Solution 1:
Place the cpu field of optimistic_spin_node in its own cache line.
The patch is as follows:
    struct optimistic_spin_node {
            struct optimistic_spin_node *next, *prev;
            int locked;    /* 1 if lock acquired */
    -       int cpu;       /* encoded CPU # + 1 value */
    +       int cpu ____cacheline_aligned; /* encoded CPU # + 1 value */
    };
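For reference, a layout sketch of what solution 1 does (assuming
64-bit pointers and 64-byte cache lines; the struct name here is
hypothetical): the aligned cpu field starts a new cache line, doubling
the per-node footprint.

    #include <stddef.h>

    struct osq_node_v1 {
        struct osq_node_v1 *next, *prev;       /* bytes 0..15  */
        int locked;                            /* bytes 16..19 */
        int cpu __attribute__((aligned(64)));  /* starts at byte 64 */
    };

    _Static_assert(offsetof(struct osq_node_v1, cpu) == 64,
                   "cpu lives in its own cache line");
    _Static_assert(sizeof(struct osq_node_v1) == 128,
                   "the node now occupies two cache lines");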
Unixbench full-core performance data is as follows:
Machine: Hygon X86, 128 cores
                                       with patch  without patch  change
Dhrystone 2 using register variables    194923.07         195091  -0.09%
Double-Precision Whetstone               79885.47       79838.87  +0.06%
Execl Throughput                          2327.17         2272.1  +2.42%
File Copy 1024 bufsize 2000 maxblocks       742.1         687.53  +7.94%
File Copy 256 bufsize 500 maxblocks        462.73         428.03  +8.11%
File Copy 4096 bufsize 8000 maxblocks     1600.37        1520.53  +5.25%
Pipe Throughput                          79815.33       79522.13  +0.37%
Pipe-based Context Switching              28962.9        27987.8  +3.48%
Process Creation                           3084.4         2999.1  +2.84%
Shell Scripts 1 concurrent                11687.1       11394.67  +2.57%
Shell Scripts 8 concurrent                10787.1       10496.17  +2.77%
System Call Overhead                      4322.77        4322.23  +0.01%
System Benchmarks Index Score              8079.4        7848.37  +3.0%
Solution 2:
The core idea of the osq lock is that each lock applicant spins on a
local-CPU variable to eliminate cache-line bouncing, so the same
method is applied to the degradation above. For the current CPU's
optimistic_spin_node, the cpu field of its predecessor node is a
non-local-CPU variable; the patch caches that value in the current
CPU's own node, eliminating the degradation caused by the non-local
access, as follows:
    bool osq_lock(struct optimistic_spin_queue *lock)
    {
        [... ...]

        node->prev = prev;
        node->prev_cpu = prev->cpu;             --------------- A

        [... ...]

        WRITE_ONCE(next->prev, prev);
        WRITE_ONCE(prev->next, next);
        WRITE_ONCE(next->prev_cpu, prev->cpu);  -------------- B

        [... ...]
    }

    static inline int node_cpu(struct optimistic_spin_node *node)
    {
        return node->prev_cpu - 1;              -------------- C
    }
While setting the prev field of the current CPU's
optimistic_spin_node, the patch also caches prev->cpu in the node's
prev_cpu field, as at code lines A and B above. Since node is a
per-CPU variable, each node corresponds to one CPU core and its own
cpu field never changes. node->prev_cpu can only change when
node->prev is updated; at all other times node->prev does not change,
so node->prev_cpu does not change either. This greatly reduces the
non-local-CPU accesses at code line C and improves performance.
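Condensed, the patched wait loop looks roughly like this (a sketch,
not the literal patch; osq_spin_local is a hypothetical helper): both
values tested while spinning live in the spinner's own per-CPU node.

    DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

    /* Returns the final value of node->locked, as in osq_lock(). */
    static bool osq_spin_local(void)
    {
        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);

        /*
         * node->locked and node->prev_cpu are in this CPU's own
         * cache line; the remote prev node is no longer dereferenced
         * on every spin iteration.
         */
        return smp_cond_load_relaxed(&node->locked,
                VAL || need_resched() ||
                vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 1));
    }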
Unixbench full-core performance data is as follows:
Machine: Hygon X86, 128 cores
                                       with patch  without patch  change
Dhrystone 2 using register variables     194818.7         195091  -0.14%
Double-Precision Whetstone               79847.57       79838.87  +0.01%
Execl Throughput                          2372.83         2272.1  +4.43%
File Copy 1024 bufsize 2000 maxblocks         765         687.53  +11.27%
File Copy 256 bufsize 500 maxblocks        472.13         428.03  +10.30%
File Copy 4096 bufsize 8000 maxblocks     1658.13        1520.53  +9.05%
Pipe Throughput                          79634.17       79522.13  +0.14%
Pipe-based Context Switching              28584.7        27987.8  +2.13%
Process Creation                          3020.27         2999.1  +0.71%
Shell Scripts 1 concurrent               11890.87       11394.67  +4.35%
Shell Scripts 8 concurrent                10912.9       10496.17  +3.97%
System Call Overhead                      4320.63        4322.23  -0.04%
System Benchmarks Index Score             8144.43        7848.37  +4.0%
In summary, solution 2 performs better than solution 1, with
especially large gains in the execl, file copy, shell1 and shell8
cases. Solution 1 can still incur remote memory accesses across NUMA
nodes, whereas solution 2 accesses only local-CPU variables, so
solution 2 is the better choice. Both solutions also show a large
improvement in x86 virtual machines.

This patch implements solution 2.
Signed-off-by: Guo Hui <guohui@uniontech.com>
---
include/linux/osq_lock.h | 1 +
kernel/locking/osq_lock.c | 8 +++++---
2 files changed, 6 insertions(+), 3 deletions(-)
Comments
On 6/15/23 08:01, Guo Hui wrote:
> [... ...]
After a further look, I have one more comment.

First of all, prev->next is NULL at this point. If you look at the
beginning of osq_lock():

    node->prev = prev;
    smp_wmb();
    WRITE_ONCE(prev->next, node);

To maintain the same memory-ordering relationship, you may do:

    WRITE_ONCE(next->prev, prev);
    WRITE_ONCE(next->prev_cpu, prev->cpu);
    smp_wmb();
    WRITE_ONCE(prev->next, next);

That should avoid the corner case where the next CPU fetches the wrong
prev_cpu when it tries to dequeue itself concurrently. Alternatively,
we can even remove prev from the node, since we can easily compute
prev from prev_cpu using decode_cpu(). That would eliminate any
potential inconsistency between prev and prev_cpu.

Cheers,
Longman
On 6/15/23 9:30 PM, Waiman Long wrote:
> [... ...]

Ok thanks, you're right, I hadn't considered that.
On 6/15/23 9:30 PM, Waiman Long wrote:
> [... ...]

Hi Longman, thanks for your suggestion. I analyzed it and came to the
following conclusions.

    node->prev = prev;
    smp_wmb();                      ----------------- A
    WRITE_ONCE(prev->next, node);

The smp_wmb() at A ensures the ordering between osq_lock() and
unqueue:

    osq_lock()                      unqueue
    node->prev = prev               osq_wait_next()
    WMB                             MB
    prev->next = node               next->prev = prev  // unqueue-C

Here 'node->prev' and 'next->prev' are the same variable, and we need
to ensure these stores happen in order to avoid corrupting the list.

    WRITE_ONCE(next->prev, prev);           ----------------- B
    WRITE_ONCE(next->prev_cpu, prev->cpu);
    smp_wmb();                              ----------------- C
    WRITE_ONCE(prev->next, next);

The smp_wmb() at C ensures the consistency of next's prev and
prev_cpu, so that next gets the correct prev_cpu; this is slightly
different from A. But if prev_cpu is read concurrently at B, an
inconsistency may still occur. In summary, I think the following
modification is more appropriate:

    WRITE_ONCE(next->prev_cpu, prev->cpu);
    smp_wmb();
    WRITE_ONCE(next->prev, prev);
    WRITE_ONCE(prev->next, next);

Am I right? Thanks.
diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..8a1bb36f4a07 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -10,6 +10,7 @@ struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
 	int cpu; /* encoded CPU # + 1 value */
+	int prev_cpu; /* Only for optimizing false sharing */
 };
 
 struct optimistic_spin_queue {
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..ea1fdd10ab3e 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -22,9 +22,9 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
-static inline int node_cpu(struct optimistic_spin_node *node)
+static inline int node_prev_cpu(struct optimistic_spin_node *node)
 {
-	return node->cpu - 1;
+	return READ_ONCE(node->prev_cpu) - 1;
 }
 
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
@@ -110,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	prev = decode_cpu(old);
 	node->prev = prev;
+	node->prev_cpu = prev->cpu;
 
 	/*
 	 * osq_lock() unqueue
@@ -141,7 +142,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * polling, be careful.
 	 */
 	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
-				  vcpu_is_preempted(node_cpu(node->prev))))
+				  vcpu_is_preempted(node_prev_cpu(node))))
 		return true;
 
 	/* unqueue */
@@ -200,6 +201,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
 	WRITE_ONCE(next->prev, prev);
 	WRITE_ONCE(prev->next, next);
+	WRITE_ONCE(next->prev_cpu, prev->cpu);
 
 	return false;
 }