Message ID | 20230128035902.1758726-1-joel@joelfernandes.org |
---|---|
State | New |
Headers |
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
To: linux-kernel@vger.kernel.org
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>, Frederic Weisbecker <frederic@kernel.org>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Boqun Feng <boqun.feng@gmail.com>, Josh Triplett <josh@joshtriplett.org>, Lai Jiangshan <jiangshanlai@gmail.com>, "Paul E. McKenney" <paulmck@kernel.org>, rcu@vger.kernel.org, Steven Rostedt <rostedt@goodmis.org>
Subject: [PATCH v4] srcu: Clarify comments on memory barrier "E"
Date: Sat, 28 Jan 2023 03:59:01 +0000
Message-Id: <20230128035902.1758726-1-joel@joelfernandes.org> |
Series | [v4] srcu: Clarify comments on memory barrier "E" |
Commit Message
Joel Fernandes
Jan. 28, 2023, 3:59 a.m. UTC
During a flip, we have a full memory barrier before srcu_idx is incremented.

The idea is that we intend to order the first phase scan's read of the lock
counters with the flipping of the index.

However, such ordering is already enforced by the control dependency
between the two scans: we flip the index only if the lock and unlock
counts match.

And such a match will not happen if there was a pending reader before the
flip in the first place (observation courtesy of Mathieu Desnoyers).

The litmus test below shows this (test courtesy of Frederic Weisbecker;
changes for the control dependency by Boqun and me):
C srcu

(*
 * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
 *
 * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
 * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
 * relations. Combining the ->ppo with ->rf, a cycle is impossible.
 *)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);

	// FLIP
	if (lock1 == unlock1) { // Control dep
		smp_mb(); // E // Remove E and still passes.
		WRITE_ONCE(*IDX, 1);
		smp_mb(); // D

		// SCAN2
		unlock0 = READ_ONCE(*UNLOCK0);
		smp_mb(); // A
		lock0 = READ_ONCE(*LOCK0);
	}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx1;
	int idx2;

	// 1st reader
	idx1 = READ_ONCE(*IDX);
	if (idx1 == 0) { // Control dep
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:lock1=1 /\ 1:idx1=1)
More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.

This commit therefore clarifies the comment on memory barrier E, while
keeping the barrier itself for extra safety (it is on a slow path anyway).
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v1->v2: Update changelog, keep old comments.
v2->v3: Moar changelog updates.
v3->v4: Keep smp_mb() and just update comments
cc_list | 8 ++++++++
kernel/rcu/srcutree.c | 20 ++++++++++++++------
send.sh | 5 +++++
to_list | 1 +
4 files changed, 28 insertions(+), 6 deletions(-)
create mode 100644 cc_list
create mode 100755 send.sh
create mode 100644 to_list
Comments
On Sat, Jan 28, 2023 at 03:59:01AM +0000, Joel Fernandes (Google) wrote:
> During a flip, we have a full memory barrier before srcu_idx is incremented.
>
> The idea is we intend to order the first phase scan's read of lock
> counters with the flipping of the index.
>
> However, such ordering is already enforced because of the
> control-dependency between the 2 scans. We would be flipping the index
> only if lock and unlock counts matched.
>
> But such match will not happen if there was a pending reader before the flip
> in the first place (observation courtesy Mathieu Desnoyers).
>
> The litmus test below shows this:
> (test courtesy Frederic Weisbecker, Changes for ctrldep by Boqun/me):

Much better, thank you!

I of course did the usual wordsmithing, as shown below.  Does this
version capture your intent and understanding?

							Thanx, Paul

------------------------------------------------------------------------

commit 963f34624beb2af1ec08527e637d16ab6a1dacbd
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Sat Jan 28 03:59:01 2023 +0000

    srcu: Clarify comments on memory barrier "E"

    There is an smp_mb() named "E" in srcu_flip() immediately before the
    increment (flip) of the srcu_struct structure's ->srcu_idx.

    The purpose of E is to order the preceding scan's read of lock counters
    against the flipping of the ->srcu_idx, in order to prevent new readers
    from continuing to use the old ->srcu_idx value, which might needlessly
    extend the grace period.

    However, this ordering is already enforced because of the control
    dependency between the preceding scan and the ->srcu_idx flip.
    This control dependency exists because atomic_long_read() is used
    to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
    and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
    ->srcu_unlock_count[] counts match.  And such a match cannot happen when
    there is an in-flight reader that started before the flip (observation
    courtesy Mathieu Desnoyers).

    The litmus test below (courtesy of Frederic Weisbecker, with changes
    for ctrldep by Boqun and Joel) shows this:

    C srcu
    (*
     * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
     *
     * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
     * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
     * relations. Combining the ->ppo with ->rf, a cycle is impossible.
     *)

    {}

    // updater
    P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
    	int lock1;
    	int unlock1;
    	int lock0;
    	int unlock0;

    	// SCAN1
    	unlock1 = READ_ONCE(*UNLOCK1);
    	smp_mb(); // A
    	lock1 = READ_ONCE(*LOCK1);

    	// FLIP
    	if (lock1 == unlock1) { // Control dep
    		smp_mb(); // E // Remove E and still passes.
    		WRITE_ONCE(*IDX, 1);
    		smp_mb(); // D

    		// SCAN2
    		unlock0 = READ_ONCE(*UNLOCK0);
    		smp_mb(); // A
    		lock0 = READ_ONCE(*LOCK0);
    	}
    }

    // reader
    P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
    	int tmp;
    	int idx1;
    	int idx2;

    	// 1st reader
    	idx1 = READ_ONCE(*IDX);
    	if (idx1 == 0) { // Control dep
    		tmp = READ_ONCE(*LOCK0);
    		WRITE_ONCE(*LOCK0, tmp + 1);
    		smp_mb(); /* B and C */
    		tmp = READ_ONCE(*UNLOCK0);
    		WRITE_ONCE(*UNLOCK0, tmp + 1);
    	} else {
    		tmp = READ_ONCE(*LOCK1);
    		WRITE_ONCE(*LOCK1, tmp + 1);
    		smp_mb(); /* B and C */
    		tmp = READ_ONCE(*UNLOCK1);
    		WRITE_ONCE(*UNLOCK1, tmp + 1);
    	}
    }

    exists (0:lock1=1 /\ 1:idx1=1)

    More complicated litmus tests with multiple SRCU readers also show that
    memory barrier E is not needed.

    This commit therefore clarifies the comment on memory barrier E.

    Why not also remove that redundant smp_mb()?

    Because control dependencies are quite fragile due to their not being
    recognized by most compilers and tools.  Control dependencies therefore
    exact an ongoing maintenance burden, and such a burden cannot be
    justified in this slowpath.  Therefore, that smp_mb() stays until such
    time as its overhead becomes a measurable problem in a real workload
    running on a real production system, or until such time as compilers
    start paying attention to this sort of control dependency.

    Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
    Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index c541b82646b63..cd46fe063e50f 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1085,16 +1085,36 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 static void srcu_flip(struct srcu_struct *ssp)
 {
 	/*
-	 * Ensure that if this updater saw a given reader's increment
-	 * from __srcu_read_lock(), that reader was using an old value
-	 * of ->srcu_idx.  Also ensure that if a given reader sees the
-	 * new value of ->srcu_idx, this updater's earlier scans cannot
-	 * have seen that reader's increments (which is OK, because this
-	 * grace period need not wait on that reader).
+	 * Because the flip of ->srcu_idx is executed only if the
+	 * preceding call to srcu_readers_active_idx_check() found that
+	 * the ->srcu_unlock_count[] and ->srcu_lock_count[] sums matched
+	 * and because that summing uses atomic_long_read(), there is
+	 * ordering due to a control dependency between that summing and
+	 * the WRITE_ONCE() in this call to srcu_flip().  This ordering
+	 * ensures that if this updater saw a given reader's increment from
+	 * __srcu_read_lock(), that reader was using a value of ->srcu_idx
+	 * from before the previous call to srcu_flip(), which should be
+	 * quite rare.  This ordering thus helps forward progress because
+	 * the grace period could otherwise be delayed by additional
+	 * calls to __srcu_read_lock() using that old (soon to be new)
+	 * value of ->srcu_idx.
+	 *
+	 * This sum-equality check and ordering also ensures that if
+	 * a given call to __srcu_read_lock() uses the new value of
+	 * ->srcu_idx, this updater's earlier scans cannot have seen
+	 * that reader's increments, which is all to the good, because
+	 * this grace period need not wait on that reader.  After all,
+	 * if those earlier scans had seen that reader, there would have
+	 * been a sum mismatch and this code would not be reached.
+	 *
+	 * This means that the following smp_mb() is redundant, but
+	 * it stays until either (1) Compilers learn about this sort of
+	 * control dependency or (2) Some production workload running on
+	 * a production system is unduly delayed by this slowpath smp_mb().
 	 */
 	smp_mb(); /* E */  /* Pairs with B and C. */

-	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1);
+	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); // Flip the counter.

 	/*
 	 * Ensure that if the updater misses an __srcu_read_unlock()
On Sat, Jan 28, 2023 at 1:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Sat, Jan 28, 2023 at 03:59:01AM +0000, Joel Fernandes (Google) wrote:
> > During a flip, we have a full memory barrier before srcu_idx is incremented.
> >
> > [ . . . ]
> >
> > The litmus test below shows this:
> > (test courtesy Frederic Weisbecker, Changes for ctrldep by Boqun/me):
>
> Much better, thank you!
>
> I of course did the usual wordsmithing, as shown below.  Does this
> version capture your intent and understanding?

It looks good to me.

According to [1], the architecture at least should not be reordering
read-write control dependency.  Only read-read is problematic.  But I am
not 100% sure, is that not true?

For the compiler, you are saying that read-write control dependency
can be reordered even with *ONCE() accesses?  In other words, the
flipping of idx can happen in ->po order before the locks and unlock
counts match?  That sounds sort of like a broken compiler.

[1] https://lpc.events/event/7/contributions/821/attachments/598/1075/LPC_2020_--_Dependency_ordering.pdf

More comments below:

> ------------------------------------------------------------------------
>
> commit 963f34624beb2af1ec08527e637d16ab6a1dacbd
> Author: Joel Fernandes (Google) <joel@joelfernandes.org>
> Date:   Sat Jan 28 03:59:01 2023 +0000
>
>     srcu: Clarify comments on memory barrier "E"
>
>     There is an smp_mb() named "E" in srcu_flip() immediately before the
>     increment (flip) of the srcu_struct structure's ->srcu_idx.
>
>     The purpose of E is to order the preceding scan's read of lock counters
>     against the flipping of the ->srcu_idx, in order to prevent new readers
>     from continuing to use the old ->srcu_idx value, which might needlessly
>     extend the grace period.
>
>     However, this ordering is already enforced because of the control
>     dependency between the preceding scan and the ->srcu_idx flip.
>     This control dependency exists because atomic_long_read() is used
>     to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
>     and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
>     ->srcu_unlock_count[] counts match.  And such a match cannot happen when
>     there is an in-flight reader that started before the flip (observation
>     courtesy Mathieu Desnoyers).

Agreed.

> [ . . . litmus test and updated changelog as above . . . ]
>
> +	 * This means that the following smp_mb() is redundant, but
> +	 * it stays until either (1) Compilers learn about this sort of
> +	 * control dependency or (2) Some production workload running on
> +	 * a production system is unduly delayed by this slowpath smp_mb().
>  	 */

I agree that a read-write control dependency reordering by the
compiler can cause a reader to observe the flipped srcu_idx too soon,
thus perhaps delaying the grace period from ending (because the second
scan will now end up waiting on that reader..).

Thanks,

 - Joel

>  	smp_mb(); /* E */  /* Pairs with B and C. */
>
> -	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1);
> +	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); // Flip the counter.
>
>  	/*
>  	 * Ensure that if the updater misses an __srcu_read_unlock()
On Sat, Jan 28, 2023 at 04:16:34PM -0500, Joel Fernandes wrote:
> On Sat, Jan 28, 2023 at 1:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > [ . . . ]
> >
> > I of course did the usual wordsmithing, as shown below.  Does this
> > version capture your intent and understanding?
>
> It looks good to me.
>
> According to [1], the architecture at least should not be reordering
> read-write control dependency.  Only read-read is problematic.  But I am
> not 100% sure, is that not true?

Agreed, READ_ONCE() or stronger through condition to WRITE_ONCE()
or stronger is ordered.  Replace that WRITE_ONCE() with any type of
unordered read and all bets are off.

And now that the ARM folks chimed in, this is a solid guarantee at
the hardware level.

Not so much at the compiler level.  Oddly enough, compilers do provide
ordering for plain C-language stores, but that is helpful only if no
other CPU or thread is concurrently accessing that variable.

> For the compiler, you are saying that read-write control dependency
> can be reordered even with *ONCE() accesses?  In other words, the
> flipping of idx can happen in ->po order before the locks and unlock
> counts match?  That sounds sort of like a broken compiler.

One case where a sane compiler can reasonably enable the hardware to
do the reordering is where you have the same WRITE_ONCE() in both the
then-clause and else-clause of an "if" statement.  Another is if it can
somehow prove something about the value returned from that READ_ONCE(),
for example, if that value is used to index a singleton array, then the
compiler has to do the READ_ONCE(), but it can then assume that the
value returned was zero, throwing away the real value returned.

Fun with undefined behavior!

> [1] https://lpc.events/event/7/contributions/821/attachments/598/1075/LPC_2020_--_Dependency_ordering.pdf
>
> More comments below:
>
> [ . . . quoted commit, litmus test, and diff as above . . . ]
>
> I agree that a read-write control dependency reordering by the
> compiler can cause a reader to observe the flipped srcu_idx too soon,
> thus perhaps delaying the grace period from ending (because the second
> scan will now end up waiting on that reader..).

Very good!  I will push the commit out on -rcu.

							Thanx, Paul
On Sat, Jan 28, 2023 at 09:09:04PM -0800, Paul E. McKenney wrote:
> On Sat, Jan 28, 2023 at 04:16:34PM -0500, Joel Fernandes wrote:
> > On Sat, Jan 28, 2023 at 1:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Sat, Jan 28, 2023 at 03:59:01AM +0000, Joel Fernandes (Google) wrote:
> > > > During a flip, we have a full memory barrier before srcu_idx is incremented.
> > > >
> > > > The idea is we intend to order the first phase scan's read of lock
> > > > counters with the flipping of the index.
> > > >
> > > > However, such ordering is already enforced because of the
> > > > control-dependency between the 2 scans. We would be flipping the index
> > > > only if lock and unlock counts matched.
> > > >
> > > > But such match will not happen if there was a pending reader before the flip
> > > > in the first place (observation courtesy Mathieu Desnoyers).
> > > >
> > > > The litmus test below shows this:
> > > > (test courtesy Frederic Weisbecker, Changes for ctrldep by Boqun/me):
> > >
> > > Much better, thank you!
> > >
> > > I of course did the usual wordsmithing, as shown below.  Does this
> > > version capture your intent and understanding?
> >
> > It looks good to me.
> > According to [1], the architecture at least should not be reordering
> > read-write control dependency. Only read-read is problematic. But I am
> > not 100% sure, is that not true?
> 
> Agreed, READ_ONCE() or stronger through condition to WRITE_ONCE()
> or stronger is ordered.  Replace that WRITE_ONCE() with any type of
> unordered read and all bets are off.
> 
> And now that the ARM folks chimed in, this is a solid guarantee at
> the hardware level.
> 
> Not so much at the compiler level.  Oddly enough, compilers do provide
> ordering for plain C-language stores, but that is helpful only if no
> other CPU or thread is concurrently accessing that variable.
> 
> > For the compiler, you are saying that read-write control dependency
> > can be reordered even with *ONCE() accesses?
> > In other words, the
> > flipping of idx can happen in ->po order before the locks and unlock
> > counts match? That sounds sort of like a broken compiler.
> 
> One case where a sane compiler can reasonably enable the hardware to
> do the reordering is where you have the same WRITE_ONCE() in both the
> then-clause and else-clause of an "if" statement.  Another is if it can
> somehow prove something about the value returned from that READ_ONCE(),
> for example, if that value is used to index a singleton array, then the
> compiler has to do the READ_ONCE(), but it can then assume that the
> value returned was zero, throwing away the real value returned.
> 
> Fun with undefined behavior!
> 
> > [1] https://lpc.events/event/7/contributions/821/attachments/598/1075/LPC_2020_--_Dependency_ordering.pdf
> > 
> > More comments below:

Except that it was pointed out to me that the Co-developed-by tags also
need Signed-off-by tags.  If there are no objections to the update shown
below, I will fix this on my next rebase.

							Thanx, Paul

------------------------------------------------------------------------

commit 6c135bb38c55d354527a6659cbf2f4e7e20b4360
Author: Joel Fernandes (Google) <joel@joelfernandes.org>
Date:   Sat Jan 28 03:59:01 2023 +0000

    srcu: Clarify comments on memory barrier "E"

    There is an smp_mb() named "E" in srcu_flip() immediately before the
    increment (flip) of the srcu_struct structure's ->srcu_idx.

    The purpose of E is to order the preceding scan's read of lock counters
    against the flipping of the ->srcu_idx, in order to prevent new readers
    from continuing to use the old ->srcu_idx value, which might needlessly
    extend the grace period.

    However, this ordering is already enforced because of the control
    dependency between the preceding scan and the ->srcu_idx flip.
    This control dependency exists because atomic_long_read() is used
    to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
    and because ->srcu_idx is not flipped until the ->srcu_lock_count[]
    and ->srcu_unlock_count[] counts match.  And such a match cannot
    happen when there is an in-flight reader that started before the flip
    (observation courtesy Mathieu Desnoyers).

    The litmus test below (courtesy of Frederic Weisbecker, with changes
    for ctrldep by Boqun and Joel) shows this:

    C srcu
    (*
     * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
     *
     * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
     * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
     * relations. Combining the ->ppo with ->rf, a cycle is impossible.
     *)

    {}

    // updater
    P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
    	int lock1;
    	int unlock1;
    	int lock0;
    	int unlock0;

    	// SCAN1
    	unlock1 = READ_ONCE(*UNLOCK1);
    	smp_mb(); // A
    	lock1 = READ_ONCE(*LOCK1);

    	// FLIP
    	if (lock1 == unlock1) { // Control dep
    		smp_mb(); // E	// Remove E and still passes.
    		WRITE_ONCE(*IDX, 1);
    		smp_mb(); // D

    		// SCAN2
    		unlock0 = READ_ONCE(*UNLOCK0);
    		smp_mb(); // A
    		lock0 = READ_ONCE(*LOCK0);
    	}
    }

    // reader
    P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
    {
    	int tmp;
    	int idx1;
    	int idx2;

    	// 1st reader
    	idx1 = READ_ONCE(*IDX);
    	if (idx1 == 0) { // Control dep
    		tmp = READ_ONCE(*LOCK0);
    		WRITE_ONCE(*LOCK0, tmp + 1);
    		smp_mb(); /* B and C */
    		tmp = READ_ONCE(*UNLOCK0);
    		WRITE_ONCE(*UNLOCK0, tmp + 1);
    	} else {
    		tmp = READ_ONCE(*LOCK1);
    		WRITE_ONCE(*LOCK1, tmp + 1);
    		smp_mb(); /* B and C */
    		tmp = READ_ONCE(*UNLOCK1);
    		WRITE_ONCE(*UNLOCK1, tmp + 1);
    	}
    }

    exists (0:lock1=1 /\ 1:idx1=1)

    More complicated litmus tests with multiple SRCU readers also show that
    memory barrier E is not needed.

    This commit therefore clarifies the comment on memory barrier E.

    Why not also remove that redundant smp_mb()?
    Because control dependencies are quite fragile due to their not being
    recognized by most compilers and tools.  Control dependencies therefore
    exact an ongoing maintenance burden, and such a burden cannot be
    justified in this slowpath.  Therefore, that smp_mb() stays until such
    time as its overhead becomes a measurable problem in a real workload
    running on a real production system, or until such time as compilers
    start paying attention to this sort of control dependency.

    Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index c541b82646b63..cd46fe063e50f 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1085,16 +1085,36 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 static void srcu_flip(struct srcu_struct *ssp)
 {
 	/*
-	 * Ensure that if this updater saw a given reader's increment
-	 * from __srcu_read_lock(), that reader was using an old value
-	 * of ->srcu_idx.  Also ensure that if a given reader sees the
-	 * new value of ->srcu_idx, this updater's earlier scans cannot
-	 * have seen that reader's increments (which is OK, because this
-	 * grace period need not wait on that reader).
+	 * Because the flip of ->srcu_idx is executed only if the
+	 * preceding call to srcu_readers_active_idx_check() found that
+	 * the ->srcu_unlock_count[] and ->srcu_lock_count[] sums matched
+	 * and because that summing uses atomic_long_read(), there is
+	 * ordering due to a control dependency between that summing and
+	 * the WRITE_ONCE() in this call to srcu_flip().
+	 * This ordering
+	 * ensures that if this updater saw a given reader's increment from
+	 * __srcu_read_lock(), that reader was using a value of ->srcu_idx
+	 * from before the previous call to srcu_flip(), which should be
+	 * quite rare.  This ordering thus helps forward progress because
+	 * the grace period could otherwise be delayed by additional
+	 * calls to __srcu_read_lock() using that old (soon to be new)
+	 * value of ->srcu_idx.
+	 *
+	 * This sum-equality check and ordering also ensures that if
+	 * a given call to __srcu_read_lock() uses the new value of
+	 * ->srcu_idx, this updater's earlier scans cannot have seen
+	 * that reader's increments, which is all to the good, because
+	 * this grace period need not wait on that reader.  After all,
+	 * if those earlier scans had seen that reader, there would have
+	 * been a sum mismatch and this code would not be reached.
+	 *
+	 * This means that the following smp_mb() is redundant, but
+	 * it stays until either (1) Compilers learn about this sort of
+	 * control dependency or (2) Some production workload running on
+	 * a production system is unduly delayed by this slowpath smp_mb().
 	 */
 	smp_mb(); /* E */  /* Pairs with B and C. */
 
-	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1);
+	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); // Flip the counter.
 
 	/*
 	 * Ensure that if the updater misses an __srcu_read_unlock()
On 2023-02-07 22:38, Paul E. McKenney wrote:
[...]
> 
> Except that it was pointed out to me that the Co-developed-by tags also
> need Signed-off-by tags.  If there are no objections to the update shown
> below, I will fix this on my next rebase.

No objection from me, thanks!

Mathieu
On Tue, Feb 07, 2023 at 10:48:42PM -0500, Mathieu Desnoyers wrote:
> On 2023-02-07 22:38, Paul E. McKenney wrote:
> [...]
> > 
> > Except that it was pointed out to me that the Co-developed-by tags also
> > need Signed-off-by tags.  If there are no objections to the update shown
> > below, I will fix this on my next rebase.
> 
> No objection from me, thanks!

And I have updated this commit, thank you all!

							Thanx, Paul
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 1c304fec89c0..2872998edbb7 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -983,12 +983,20 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 static void srcu_flip(struct srcu_struct *ssp)
 {
 	/*
-	 * Ensure that if this updater saw a given reader's increment
-	 * from __srcu_read_lock(), that reader was using an old value
-	 * of ->srcu_idx.  Also ensure that if a given reader sees the
-	 * new value of ->srcu_idx, this updater's earlier scans cannot
-	 * have seen that reader's increments (which is OK, because this
-	 * grace period need not wait on that reader).
+	 * Control dependency (causality) between the before-flip
+	 * srcu_readers_active_idx_check() and a call to srcu_flip() ensures
+	 * that we end up here only if lock and unlock counts match. This fact
+	 * ensures that if this updater saw a given reader's increment from
+	 * __srcu_read_lock(), that reader was using an old value of
+	 * ->srcu_idx. That is why the lock and unlock counts matched in the
+	 * first place. The causality also ensures that if a given reader sees
+	 * the new value of ->srcu_idx, this updater's earlier scans cannot
+	 * have seen that reader's increments (which is OK, because this grace
+	 * period need not wait on that reader), because again, that would
+	 * cause a lock/unlock count mismatch and we would not end up here.
+	 *
+	 * So we don't really need the following smp_mb() before incrementing
+	 * srcu_idx, however we have it anyway for additional safety.
 	 */
 	smp_mb(); /* E */  /* Pairs with B and C. */