Message ID | 20230224135251.24989-1-jonas.oberhauser@huaweicloud.com |
---|---|
State | New |
Series | [v3] tools/memory-model: Make ppo a subrelation of po |
Commit Message
Jonas Oberhauser
Feb. 24, 2023, 1:52 p.m. UTC
As stated in the documentation and implied by its name, the ppo
(preserved program order) relation is intended to link po-earlier
to po-later instructions under certain conditions. However, a
corner case currently allows instructions to be linked by ppo that
are not executed by the same thread, i.e., instructions are being
linked that have no po relation.
This happens due to the mb/strong-fence/fence relations, which (as
one case) provide order when locks are passed between threads
followed by an smp_mb__after_unlock_lock() fence. This is
illustrated in the following litmus test (as can be seen when using
herd7 with `doshow ppo`):
P0(int *x, int *y)
{
        spin_lock(x);
        spin_unlock(x);
}

P1(int *x, int *y)
{
        spin_lock(x);
        smp_mb__after_unlock_lock();
        *y = 1;
}
The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
P0 passes a lock to P1 which then uses this fence.
The patch makes ppo a subrelation of po by letting fence contribute
to ppo only in case the fence links events of the same thread.
Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
tools/memory-model/linux-kernel.cat | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
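For reference, the litmus fragment in the commit message can be completed into a runnable test along the following lines. This is a sketch: the test name, the spinlock_t typing, P1's closing spin_unlock(), and the trivial exists clause are additions, not part of the patch description.

C after-unlock-lock-cross-ppo

(*
 * Sketch based on the fragment in the commit message above.  Running
 * herd7 with the LKMM configuration (herd7 -conf linux-kernel.cfg)
 * and `doshow ppo` enabled should, before this patch, draw a ppo edge
 * from P0's lock event to P1's store, which is exactly the cross-thread
 * edge the patch removes.
 *)

{}

P0(spinlock_t *x, int *y)
{
        spin_lock(x);
        spin_unlock(x);
}

P1(spinlock_t *x, int *y)
{
        spin_lock(x);
        smp_mb__after_unlock_lock();
        *y = 1;
        spin_unlock(x);
}

exists (y=1)

The exists clause is deliberately trivial (it always holds); the point of the test is the ppo graph, not the verdict.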
Comments
On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> [...]
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> 
>  (* Propagation: Ordering from release operations and strong fences. *)
>  let A-cumul(r) = (rfe ; [Marked])? ; r

Acked-by: Alan Stern <stern@rowland.harvard.edu>
On Fri, Feb 24, 2023 at 10:32:43AM -0500, Alan Stern wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > [...]
> 
> Acked-by: Alan Stern <stern@rowland.harvard.edu>

Queued for the v6.4 merge window (not the current one), thank you both!

                                                        Thanx, Paul
On Fri, Feb 24, 2023 at 10:37:58AM -0800, Paul E. McKenney wrote:
> [...]
> 
> Queued for the v6.4 merge window (not the current one), thank you both!

I tested both Alan's and Jonas's commits.  These do not seem to produce
any significant differences in behavior, which is of course a good thing.

Here are the differences and a few oddities:

auto/C-RR-G+RR-R+RR-G+RR-G+RR-R+RR-R+RR-R+RR-R.litmus

        Timed out with changes, completed without them.  But it completed
        in 558.29 seconds against a limit of 600 seconds, so never mind.

auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus

        Timed out with changes, completed without them.  But it completed
        in 580.01 seconds against a limit of 600 seconds, so never mind. *

auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-R+RR-R.litmus

        Timed out with changes, completed without them.  But it completed
        in 522.29 seconds against a limit of 600 seconds, so never mind.

auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-G+RR-R.litmus

        Timed out with changes, completed without them.  But it completed
        in 588.70 seconds against a limit of 600 seconds, so never mind.

All tests that didn't time out matched their Results comments.

The reason I am so cavalier about the times is that I was foolishly
running rcutorture concurrently with the new-version testing.  I re-ran,
and of them only auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus
timed out the second time.  I re-ran it again, but without a time limit,
and it completed properly in 364.8 seconds compared to 580.  A rerun
took 360.1 seconds.  So things have slowed down a bit.

A few other oddities:

litmus/auto/C-LB-Lww+R-OC.litmus

        Both versions flag a data race, which I am not seeing.  It appears
        to me that P1's store to u0 cannot happen unless P0's store
        has completed.  So what am I missing here?

litmus/auto/C-LB-Lrw+R-OC.litmus
litmus/auto/C-LB-Lww+R-Oc.litmus
litmus/auto/C-LB-Lrw+R-Oc.litmus
litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
litmus/auto/C-LB-Lww+R-A+R-OC.litmus

        Ditto.  (There are likely more.)

Thoughts?

                                                        Thanx, Paul
On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> A few other oddities:
> 
> litmus/auto/C-LB-Lww+R-OC.litmus
> 
>         Both versions flag a data race, which I am not seeing.  It appears
>         to me that P1's store to u0 cannot happen unless P0's store
>         has completed.  So what am I missing here?

The LKMM doesn't believe that a control or data dependency orders a
plain write after a marked read.  Hence in this test it thinks that P1's
store to u0 can happen before the load of x1.  I don't remember why we
did it this way -- probably we just wanted to minimize the restrictions
on when plain accesses can execute.  (I do remember the reason for
making address dependencies induce order; it was so RCU would work.)

The patch below will change what the LKMM believes.  It eliminates the
positive outcome of the litmus test and the data race.  Should it be
adopted into the memory model?

> litmus/auto/C-LB-Lrw+R-OC.litmus
> litmus/auto/C-LB-Lww+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> 
>         Ditto.  (There are likely more.)

I haven't looked at these but they're probably similar.

Alan

--- usb-devel.orig/tools/memory-model/linux-kernel.cat
+++ usb-devel/tools/memory-model/linux-kernel.cat
@@ -172,7 +172,7 @@ let vis = cumul-fence* ; rfe? ; [Marked]
 		((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
 
 (* Boundaries for lifetimes of plain accesses *)
-let w-pre-bounded = [Marked] ; (addr | fence)?
+let w-pre-bounded = [Marked] ; (rwdep | fence)?
 let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
 	([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
 let w-post-bounded = fence? ; [Marked] ; rmw-sequence
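The auto-generated test itself is not reproduced in the thread, but a minimal litmus test exhibiting the mechanism Alan describes could look like the sketch below. The test name, variables, and exact shape are a reconstruction for illustration, not the contents of C-LB-Lww+R-OC.litmus.

C LB-marked-read-ctrl-plain-write

(*
 * Reconstruction for illustration.  P0's plain store to y hangs off
 * its marked read of x only by a control dependency, which the current
 * LKMM does not count as ordering for plain writes: herd7 should allow
 * the LB outcome below and flag a data race between P0's plain store
 * and P1's marked read.  With the w-pre-bounded patch above (addr ->
 * rwdep), both the outcome and the race should disappear.
 *)

{}

P0(int *x, int *y)
{
        int r0;

        r0 = READ_ONCE(*x);
        if (r0)
                *y = 1;
}

P1(int *x, int *y)
{
        int r1;

        r1 = READ_ONCE(*y);
        smp_mb();
        WRITE_ONCE(*x, 1);
}

exists (0:r0=1 /\ 1:r1=1)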
On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> [...]
> 
> A few other oddities:
> 
> litmus/auto/C-LB-Lww+R-OC.litmus
> 
>         Both versions flag a data race, which I am not seeing.  It appears
>         to me that P1's store to u0 cannot happen unless P0's store
>         has completed.  So what am I missing here?
> 
> litmus/auto/C-LB-Lrw+R-OC.litmus
> litmus/auto/C-LB-Lww+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> 
>         Ditto.  (There are likely more.)
> 
> Thoughts?

And what happened here was that I conflated LKMM with the C++ memory
model, producing something stronger than either.  Never mind!!!

                                                        Thanx, Paul
On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > A few other oddities:
> > 
> > litmus/auto/C-LB-Lww+R-OC.litmus
> > 
> >         Both versions flag a data race, which I am not seeing.  It appears
> >         to me that P1's store to u0 cannot happen unless P0's store
> >         has completed.  So what am I missing here?
> 
> The LKMM doesn't believe that a control or data dependency orders a
> plain write after a marked read. [...]
> 
> The patch below will change what the LKMM believes.  It eliminates the
> positive outcome of the litmus test and the data race.  Should it be
> adopted into the memory model?

Excellent question!

As noted separately, I was conflating the C++ memory model and LKMM.

> > litmus/auto/C-LB-Lrw+R-OC.litmus
> > litmus/auto/C-LB-Lww+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> > litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> > 
> >         Ditto.  (There are likely more.)
> 
> I haven't looked at these but they're probably similar.

Let me give this patch a go and see what it does.

                                                        Thanx, Paul
On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > [...]
> 
> The LKMM doesn't believe that a control or data dependency orders a
> plain write after a marked read.  Hence in this test it thinks that P1's
> store to u0 can happen before the load of x1.  I don't remember why we
> did it this way -- probably we just wanted to minimize the restrictions
> on when plain accesses can execute.  (I do remember the reason for
> making address dependencies induce order; it was so RCU would work.)

Because a plain store can be optimized into a "store only if not equal"?
As in the following sentences from explanation.txt:

        The need to distinguish between r- and w-bounding raises yet another
        issue.  When the source code contains a plain store, the compiler is
        allowed to put plain loads of the same location into the object code.
        For example, given the source code:

                x = 1;

        the compiler is theoretically allowed to generate object code that
        looks like:

                if (x != 1)
                        x = 1;

        thereby adding a load (and possibly replacing the store entirely).
        For this reason, whenever the LKMM requires a plain store to be
        w-pre-bounded or w-post-bounded by a marked access, it also requires
        the store to be r-pre-bounded or r-post-bounded, so as to handle cases
        where the compiler adds a load.

Regards,
Boqun

> The patch below will change what the LKMM believes.  It eliminates the
> positive outcome of the litmus test and the data race.  Should it be
> adopted into the memory model?
> 
> [...]
On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > [...]
> 
> Because a plain store can be optimized into a "store only if not equal"?
> As in the following sentences from explanation.txt:
> 
> [...]

Good guess; maybe that was the reason.  Ironically, in this case the
store _is_ r-pre-bounded, because there's an smp_rmb() in the litmus
test just before it.  So perhaps the original reason is not valid now
that the memory model explicitly includes tests for stores being
r-pre/post-bounded.

Alan
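In terms of the model text quoted in Alan's patch above, this remark refers to the smp_rmb() clause in the definition of r-pre-bounded ([R4rmb] ; fencerel(Rmb) ; [~Noreturn]). A reconstruction of the shape he means, with our own variable names rather than the actual text of the auto-generated test:

P1(int *x1, int *u0)
{
        int r1;

        r1 = READ_ONCE(*x1);    /* marked read */
        smp_rmb();              /* any compiler-added load for the plain
                                   store below is ordered: r-pre-bounded */
        if (r1)
                *u0 = 1;        /* plain store: r-pre-bounded via the rmb,
                                   but not w-pre-bounded, since a ctrl
                                   dependency is neither addr nor fence */
}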
On 2/26/2023 4:30 AM, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> > [...]
> 
> Good guess; maybe that was the reason. [...] So perhaps the original
> reason is not valid now that the memory model explicitly includes tests
> for stores being r-pre/post-bounded.

I agree, I think you could relax that condition.

Note there's also rw-xbstar (used with fr) which doesn't check for
r-pre-bounded, but it should be ok.  That's because only reads would be
unordered; as a result the read (in the "if (x != ..) x = ..") should
provide the correct value.  The store would be issued as necessary, and
the issued store would still be ordered correctly w.r.t. the read.

Best wishes,
jonas
On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> [...]
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)

Alternatively, could it be the following appended diff?  It requires
only a single 'int' in ->ppo and prevents future similar issues caused
by subrelations.  It also makes clear that ->ppo can only be
CPU-internal.  Or would that not work for some reason?  For the test you
shared at least, the graphs are the same.  Either way:

Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---8<-----------------------

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 07f884f9b2bf..63052d1628e9 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
 let to-r = addr | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
+let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
 
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = (rfe ; [Marked])? ; r
On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
> On 2/26/2023 4:30 AM, Alan Stern wrote:
> > [...]
> 
> I agree, I think you could relax that condition.

Here's a related question to think about.  Suppose a compiler does make
this change, adding a load-and-test in front of a store.  Can that load
cause a data race?

Normally I'd say no, because compilers aren't allowed to create data
races where one didn't already exist.  But that restriction is part of
the C/C++ standard, and what we consider to be a data race differs from
what the standard considers.

So what's the answer?  Is the compiler allowed to translate:

        r1 = READ_ONCE(*x);
        if (r1)
                *y = 1;

into something resembling:

        r1 = READ_ONCE(*x);
        rtemp = *y;
        if (r1) {
                if (rtemp != 1)
                        *y = 1;
        }

(Note that whether the load to rtemp occurs inside the "if (r1)"
conditional or not makes no difference; either way the CPU can execute
it before testing the condition.  Even before reading the value of *x.)

_If_ we assume that these manufactured loads can never cause a data race
then it should be safe to remove the r-pre/post-bounded tests for plain
writes.

But what if rtemp reads from a plain write that was torn, and the
intermediate value it observes happens to be 1, even though neither the
initial nor the final value of *y was 1?

> Note there's also rw-xbstar (used with fr) which doesn't check for
> r-pre-bounded, but it should be ok.  That's because only reads would be
> unordered; as a result the read (in the "if (x != ..) x = ..") should
> provide the correct value.  The store would be issued as necessary, and
> the issued store would still be ordered correctly w.r.t. the read.

That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
write gets changed to a read there's no need for rw-xbstar to check
r-pre-bounded, because then rw-race would be comparing a read with
another read (instead of with a write) and so there would be no
possibility of a race in any case.

Alan
On Sun, Feb 26, 2023 at 11:51:15AM -0500, Alan Stern wrote:
> On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
> > [...]
> 
> Here's a related question to think about.  Suppose a compiler does make
> this change, adding a load-and-test in front of a store.  Can that load
> cause a data race?
> 
> [...]
> 
> _If_ we assume that these manufactured loads can never cause a data race
> then it should be safe to remove the r-pre/post-bounded tests for plain
> writes.
> 
> But what if rtemp reads from a plain write that was torn, and the
> intermediate value it observes happens to be 1, even though neither the
> initial nor the final value of *y was 1?

I am not worried about compilers creating data races, so that assignment
to rtemp would be within the "if (r1)" statement.  Not that this matters,
as you say, from a hardware ordering viewpoint.

However, tearing is a concern.  Just to make sure I understand, one
scenario might be where the initial value of *y was zero and the final
value was 0x10001, correct?  In that case, we have seen compilers that
would write that constant 16 bits at a time, resulting in a transitory
value of 0x1.

But in this case, we would need the value to -not- be 1 for bad things
to happen, correct?

And in that case, we would need the value to initially be 1 and the
desired value to be 1 and some other store to redundantly set it to 1,
but tear in such a way that the transitory value is not 1, correct?
Plus we should detect the data race in that case, no?

Or am I missing yet another opportunity for a mischievous compiler?

> > Note there's also rw-xbstar (used with fr) which doesn't check for
> > r-pre-bounded, but it should be ok. [...]
> 
> That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
> write gets changed to a read there's no need for rw-xbstar to check
> r-pre-bounded, because then rw-race would be comparing a read with
> another read (instead of with a write) and so there would be no
> possibility of a race in any case.

True, and if there was a racing write, it would be a data race in any
case.

                                                        Thanx, Paul
On Sat, Feb 25, 2023 at 07:03:11PM -0800, Paul E. McKenney wrote:
> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > [...]
> > 
> > The patch below will change what the LKMM believes.  It eliminates the
> > positive outcome of the litmus test and the data race.  Should it be
> > adopted into the memory model?
> 
> Excellent question!
> 
> As noted separately, I was conflating the C++ memory model and LKMM.
> 
> [...]
> 
> Let me give this patch a go and see what it does.

And it operates as expected, converting Sometimes/data-race results
into Never.  Leaving the question of whether that is what we need.  ;-)

                                                        Thanx, Paul
On Sun, Feb 26, 2023 at 10:45:28AM -0800, Paul E. McKenney wrote:
> On Sun, Feb 26, 2023 at 11:51:15AM -0500, Alan Stern wrote:
> > [...]
> 
> I am not worried about compilers creating data races, so that assignment
> to rtemp would be within the "if (r1)" statement. [...]
> 
> Or am I missing yet another opportunity for a mischievous compiler?

Let's try a better example:

P0(int *x, int *y)
{
        *y = 0x10001;
        smp_store_release(x, 1);
}

P1(int *x, int *y)
{
        int r1;

        r1 = READ_ONCE(*x);
        if (r1)
                *y = 1;
}

exists (1:r1=1 /\ y=0x10001)

Assume the compiler translates "*y = 1;" to:

        {
                rtemp = *y;
                if (rtemp != 1)
                        *y = 1;
        }

Then there is nothing preventing P1's CPU from loading *y into rtemp
before it reads *x.  If the plain write in P0 is torn then rtemp could
end up getting set to 1, so P1 wouldn't write anything to *y and the
litmus test would succeed.

If the LKMM should think this litmus test has no data race then it
should also think the test will never succeed.  But I've just shown how
it might succeed.  Ergo the LKMM should say this test has a data race,
and thus we shouldn't remove the r-pre-bounded tests for plain writes.

Of course, this is a separate question from whether w-pre-bounded
should be changed to use rwdep instead of addr.

Alan
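To make the tearing step concrete: a compiler could in principle split P0's plain 32-bit store into two 16-bit stores, and on a little-endian machine the transient value after the first half is exactly the 1 that rtemp must observe. A hypothetical sketch of such object code (no particular compiler is claimed to emit this):

#include <stdint.h>

/*
 * Hypothetical torn translation of P0's "*y = 0x10001;" for a 32-bit
 * little-endian int.  Between the two half-stores, *y transiently
 * equals 1, even though neither the initial value (0) nor the final
 * value (0x10001) is 1.
 */
static void p0_store_torn(uint32_t *y)
{
        uint16_t *half = (uint16_t *)y;

        half[0] = 0x0001;       /* now *y == 0x00000001, i.e. 1 */
        half[1] = 0x0001;       /* now *y == 0x00010001, the intended value */
}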
On 2/26/2023 5:51 PM, Alan Stern wrote:
> On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
> > [...]
> > I agree, I think you could relax that condition.
> 
> Here's a related question to think about.  Suppose a compiler does make
> this change, adding a load-and-test in front of a store.  Can that load
> cause a data race?
> 
> Normally I'd say no, because compilers aren't allowed to create data
> races where one didn't already exist.

I don't think that's completely true.  At least N1570 says:

        Transformations that introduce a speculative read of a potentially
        shared memory location may not preserve the semantics of the program
        as defined in this standard, since they potentially introduce a data
        race.  However, they are typically valid in the context of an
        optimizing compiler that targets a specific machine with well-defined
        semantics for data races.

In fact I vaguely recall certain optimizations that do introduce data
races, like load hoisting.  The issue is that some optimizations are
only valid in the absence of data races, and therefore these
optimizations and the optimizations that introduce data races can't be
used on the same portion of code at the same time.
https://cs.nju.edu.cn/hongjin/teaching/concurrency/03_C11MM_ViktorVafeiadis.pdf

I think in general it's not about whether the compiler introduces races
or not, but whether those races create new behaviors that you don't see
in LKMM.

> But that restriction is part of
> the C/C++ standard, and what we consider to be a data race differs from
> what the standard considers.
> 
> [...]
> 
> _If_ we assume that these manufactured loads can never cause a data race
> then it should be safe to remove the r-pre/post-bounded tests for plain
> writes.

Note that I don't want to remove the r-pre/post-bounded tests.
What I agreed to is that the restriction to only addr for plain writes
is an overly conservative "r-pre/post-bounded-style" test which is made
redundant by the existence of the actual r-pre/post-bounded test.

> But what if rtemp reads from a plain write that was torn, and the
> intermediate value it observes happens to be 1, even though neither the
> initial nor the final value of *y was 1?

Of course, and you don't even need a torn write to cause problems
without requiring r-bounding.  For example, if the other side does
"*y = 1; *y = 2; mb(); *x = &y;", then r1 == &x seems to imply that y
always ends up as == 1 from a compiler-oblivious perspective, but
because of the data race it could be == 2 instead.

> > Note there's also rw-xbstar (used with fr) which doesn't check for
> > r-pre-bounded, but it should be ok. [...]
> 
> That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
> write gets changed to a read there's no need for rw-xbstar to check
> r-pre-bounded, because then rw-race would be comparing a read with
> another read (instead of with a write) and so there would be no
> possibility of a race in any case.

That is the first part of my explanation (only reads would be unordered)
but I don't think it's sufficient in general.

Imagine a hypothetical memory model with a release fence that upgrades
the next memory operation only (and only stores) to release (e.g., to
save some opcode design space you do "weird_release_fence; str x1 x2"
instead of "stlr x1 x2").

Then in the message passing pattern

T1 {
        u = a;
        release(&x, 1);
}

T2 {
        t = READ_ONCE(&x);
        weird_release_fence;
        a = 1;
}

where T2 is changed by the compiler to

T2 {
        t = READ_ONCE(&x);
        weird_release_fence();
        if (a != 1)
                a = 1;
}

in the specific executions where t==1, there wouldn't be a data race
when just considering the equivalent of rw-xbstar, but u==1 and t==1
would be possible (even though it might not seem so from the first
example).

Of course in LKMM there's no such fence, but I think to make the
argument complete you still need to go through every case that provides
the w-pre-bounding and make sure it still provides the necessary order
in the compiler-generated version.  (Or you can try a more complicated
argument of the form "there would be another execution of the same
program that would have a data race", which works at least for this
example; I'm not sure about the general case.)

Have fun,
jonas
On 2/26/2023 5:23 PM, Joel Fernandes wrote: > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote: >> As stated in the documentation and implied by its name, the ppo >> (preserved program order) relation is intended to link po-earlier >> to po-later instructions under certain conditions. However, a >> corner case currently allows instructions to be linked by ppo that >> are not executed by the same thread, i.e., instructions are being >> linked that have no po relation. >> >> This happens due to the mb/strong-fence/fence relations, which (as >> one case) provide order when locks are passed between threads >> followed by an smp_mb__after_unlock_lock() fence. This is >> illustrated in the following litmus test (as can be seen when using >> herd7 with `doshow ppo`): >> >> P0(int *x, int *y) >> { >> spin_lock(x); >> spin_unlock(x); >> } >> >> P1(int *x, int *y) >> { >> spin_lock(x); >> smp_mb__after_unlock_lock(); >> *y = 1; >> } >> >> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because >> P0 passes a lock to P1 which then uses this fence. >> >> The patch makes ppo a subrelation of po by letting fence contribute >> to ppo only in case the fence links events of the same thread. >> >> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> >> --- >> tools/memory-model/linux-kernel.cat | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat >> index cfc1b8fd46da..adf3c4f41229 100644 >> --- a/tools/memory-model/linux-kernel.cat >> +++ b/tools/memory-model/linux-kernel.cat >> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W] >> let overwrite = co | fr >> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb) >> let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi) >> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int) >> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int) > Alternatively can be the following appended diff? Requires only single 'int' > in ->ppo then and prevents future similar issues caused by sub relations. > Also makes clear that ->ppo can only be CPU-internal. I had thought about going in that direction, but it doesn't prevent future similar issues, at best makes them less likely. For example, you could have an xfence that somehow goes back to the original thread, but to a po-earlier event (e.g., like prop). Given that to-r and to-w are unlikely to ever become become inconsistent with po, I am not sure it even really helps much. Personally I'm not too happy with the ad-hoc '& int' because it's like adding some unused stuff (via ... | unused-stuff) and then cutting it back out with &int, unlike prop & int which has a real semantic meaning (propagate back to the thread). The fastest move is the move we avoid doing, so I rather wouldn't add those parts in the first place. However fixing the fence relation turned out to be a lot trickier, both because of the missed data race and also rmw-sequences, essentially I would have had to disambiguate between xfences and fences already in this patch. So I did this minimal local fix for now and we can discuss whether it makes sense to get rid of the '& int' once/if we have xfence etc. 
Best wishes,
jonas

PS:
> ---8<-----------------------

haha that's so clever :D

>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 07f884f9b2bf..63052d1628e9 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>  let to-r = addr | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = (rfe ; [Marked])? ; r
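Fleshed out into a complete litmus file, the test from the patch might
look as follows (a sketch: the name and exists clause are invented
here, and the lock parameters are given a spinlock_t type as is usual
in the tools/memory-model litmus tests):

	C unlock-lock-mb-ppo

	(* Sketch only: name and exists clause are illustrative. *)

	{}

	P0(spinlock_t *x, int *y)
	{
		spin_lock(x);
		spin_unlock(x);
	}

	P1(spinlock_t *x, int *y)
	{
		spin_lock(x);
		smp_mb__after_unlock_lock();
		*y = 1;
	}

	exists (y=1)

Running herd7 on such a file with the ppo relation added to the
generated graph (e.g., via -doshow ppo together with graphical output)
should exhibit the cross-thread ppo edge described in the patch.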
On Mon, Feb 27, 2023 at 03:03:16PM +0100, Jonas Oberhauser wrote:
> Note that I don't want to remove the r-pre/post-bounded tests.
> What I agreed to is that the restriction to only addr for plain writes
> is an overly conservative "r-pre/post-bounded-style" test which is made
> redundant by the existence of the actual r-pre/post-bounded test.

Good, that agrees with what I've been thinking.

> > > Note there's also rw-xbstar (used with fr) which doesn't check for
> > > r-pre-bounded, but it should be ok. That's because only reads would be
> > > unordered, as a result the read (in the if (x != ..) x=..) should provide
> > > the correct value. The store would be issued as necessary, and the issued
> > > store would still be ordered correctly w.r.t. the read.
> > That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
> > write gets changed to a read there's no need for rw-xbstar to check
> > r-pre-bounded, because then rw-race would be comparing a read with
> > another read (instead of with a write) and so there would be no
> > possibility of a race in any case.
>
> That is the first part of my explanation (only reads would be
> unordered), but

It is?  I couldn't tell from what you wrote that this was supposed to
imply we didn't have to worry about a data race.

> I don't think it's sufficient in general.
> Imagine a hypothetical memory model with a release fence that upgrades
> the next memory operation only (and only stores) to release (e.g., to
> save some opcode design space you do weird_release_fence; str x1 x2
> instead of stlr x1 x2).
> Then in the message passing pattern
>
>	T1 {
>		u = a;
>		release(&x, 1);
>	}
>
>	T2 {
>		t = READ_ONCE(&x);
>		weird_release_fence();
>		a = 1;
>	}

[That isn't the Message Passing pattern.  In the MP pattern, one thread
does two writes and the other thread does two reads.  This is the Read
Buffering (RB) pattern: Each thread does a read followed by a write.]

> where T2 is changed by the compiler to
>
>	T2 {
>		t = READ_ONCE(&x);
>		weird_release_fence();
>		if (a != 1)
>			a = 1;
>	}
>
> In the specific executions where t==1, there wouldn't be a data race
> when just considering the equivalent of rw-xbstar, but u==1 and t==1
> would be possible (even though it might not seem so from the first
> example).

If such a fence existed in the LKMM, we would handle this case by saying
that weird_release_fence() does not give release semantics to an
immediately following plain store; only to an immediately following
marked store.  The reason being that the compiler is allowed to muck
around with the code generated for plain accesses, so there's no
guarantee that the first machine instruction generated for "a = 1;" will
be a store.

As a result, there would not be an rw-xbstar link from T1 to T2.

> Of course in LKMM there's no such fence, but I think to make the
> argument complete you still need to go through every case that
> provides the w-pre-bounding and make sure it still provides the
> necessary order in the compiler-generated version.  (Or you can try a
> more complicated argument of the form "there would be another
> execution of the same program that would have a data race", which
> works at least for this example, not sure in general.)

So I don't see this as a valid argument for not using rw-xbstar in
rw-race.  Even theoretically.

Alan
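For reference, the two shapes contrasted in the bracketed remark can be
sketched as follows (schematic only; READ_ONCE/WRITE_ONCE stand in for
whatever suitably ordered accesses a concrete test would use):

	MP (message passing): two writes vs. two reads

		P0				P1
		WRITE_ONCE(*x, 1);		r0 = READ_ONCE(*y);
		WRITE_ONCE(*y, 1);		r1 = READ_ONCE(*x);

		exists (1:r0=1 /\ 1:r1=0)	(* forbidden when both pairs are ordered *)

	RB (read buffering): each thread reads, then writes

		P0				P1
		r0 = READ_ONCE(*y);		r1 = READ_ONCE(*x);
		WRITE_ONCE(*x, 1);		WRITE_ONCE(*y, 1);

		exists (0:r0=1 /\ 1:r1=1)	(* forbidden when both pairs are ordered *)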
On 2/27/2023 5:16 PM, Alan Stern wrote:
> On Mon, Feb 27, 2023 at 03:03:16PM +0100, Jonas Oberhauser wrote:
>
>>>> Note there's also rw-xbstar (used with fr) which doesn't check for
>>>> r-pre-bounded, but it should be ok. That's because only reads would be
>>>> unordered, as a result the read (in the if (x != ..) x=..) should provide
>>>> the correct value. The store would be issued as necessary, and the issued
>>>> store would still be ordered correctly w.r.t. the read.
>>> That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
>>> write gets changed to a read there's no need for rw-xbstar to check
>>> r-pre-bounded, because then rw-race would be comparing a read with
>>> another read (instead of with a write) and so there would be no
>>> possibility of a race in any case.
>> That is the first part of my explanation (only reads would be
>> unordered), but
> It is?  I couldn't tell from what you wrote that this was supposed to
> imply we didn't have to worry about a data race.

Because it wasn't supposed to imply that, in the sense that by itself,
the fact that the things that are left unordered are reads does not
immediately imply that we don't have to worry about a data race.
It's just a step in the sequence of reasoning.
But of course I won't claim that I laid out that sequence in full
enough detail to make sense.

>> I don't think it's sufficient in general.
>> Imagine a hypothetical memory model with a release fence that upgrades
>> the next memory operation only (and only stores) to release (e.g., to
>> save some opcode design space you do weird_release_fence; str x1 x2
>> instead of stlr x1 x2).
>> Then in the message passing pattern
>>
>>	T1 {
>>		u = a;
>>		release(&x, 1);
>>	}
>>
>>	T2 {
>>		t = READ_ONCE(&x);
>>		weird_release_fence();
>>		a = 1;
>>	}
> [That isn't the Message Passing pattern.  In the MP pattern, one thread
> does two writes and the other thread does two reads.  This is the Read
> Buffering (RB) pattern: Each thread does a read followed by a write.]

Ah, thanks.  Somehow I misremembered MP to be any case of
[do stuff on a]; x = ... || read that update from x; [do stuff on a].

>> where T2 is changed by the compiler to
>>
>>	T2 {
>>		t = READ_ONCE(&x);
>>		weird_release_fence();
>>		if (a != 1)
>>			a = 1;
>>	}
>>
>> In the specific executions where t==1, there wouldn't be a data race
>> when just considering the equivalent of rw-xbstar, but u==1 and t==1
>> would be possible (even though it might not seem so from the first
>> example).
> If such a fence existed in the LKMM, we would handle this case by saying
> that weird_release_fence() does not give release semantics to an
> immediately following plain store; only to an immediately following
> marked store.  The reason being that the compiler is allowed to muck
> around with the code generated for plain accesses, so there's no
> guarantee that the first machine instruction generated for "a = 1;" will
> be a store.
>
> As a result, there would not be an rw-xbstar link from T1 to T2.

Sure, but IMHO a well-designed high-level model like LKMM shouldn't
have such a fence at all :D
This is something you might use in the implementation of a release
store in inline assembly; it doesn't belong in the source code.
But it's just for the sake of argument anyway.

>> Of course in LKMM there's no such fence, but I think to make the
>> argument complete you still need to go through every case that
>> provides the w-pre-bounding and make sure it still provides the
>> necessary order in the compiler-generated version.
>> (Or you can try a more complicated argument of the form "there would
>> be another execution of the same program that would have a data
>> race", which works at least for this example, not sure in general.)
> So I don't see this as a valid argument for not using rw-xbstar in
> rw-race.  Even theoretically.

There's nothing wrong with using rw-xbstar in rw-race, especially in
the current LKMM, and I'm not arguing against that.

I'm saying that the argument
"if rw-xbstar links a read R to a plain write W, and the compiler
replaces that plain write with a read R' followed, in case R' reads a
value different from the one W would store, by a store W' (with some
dependency from R' to W'), then the fact that R and R' can't have a
data race means that it's safe to use rw-xbstar in rw-race"
is incomplete.  (Of course that doesn't mean the claim is wrong.)
To make the argument complete, you also need that W' is generated when
necessary, and more crucially that W' is still ordered behind R!
Otherwise you would now have a data race between R and W', like in the
hypothetical example I mentioned, even though R and R' don't race.

And if you do that second step in LKMM (even with the change of
w-pre-bounded we are discussing) you quickly find that W' is indeed
still ordered, so rw-xbstar is perfectly fine.

Perhaps that step is so trivial to you that you don't feel it needs
mentioning : )  But speaking about LKMM-like models in general, some
might have some subtle case where rw-xbstar links R and W but would
not link R and W'.

jonas
On Mon, Feb 27, 2023 at 9:39 AM Jonas Oberhauser
<jonas.oberhauser@huaweicloud.com> wrote:
>
> On 2/26/2023 5:23 PM, Joel Fernandes wrote:
> > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> >> As stated in the documentation and implied by its name, the ppo
> >> (preserved program order) relation is intended to link po-earlier
> >> to po-later instructions under certain conditions.  However, a
> >> corner case currently allows instructions to be linked by ppo that
> >> are not executed by the same thread, i.e., instructions are being
> >> linked that have no po relation.
> >>
> >> This happens due to the mb/strong-fence/fence relations, which (as
> >> one case) provide order when locks are passed between threads
> >> followed by an smp_mb__after_unlock_lock() fence.  This is
> >> illustrated in the following litmus test (as can be seen when using
> >> herd7 with `doshow ppo`):
> >>
> >> P0(int *x, int *y)
> >> {
> >>	spin_lock(x);
> >>	spin_unlock(x);
> >> }
> >>
> >> P1(int *x, int *y)
> >> {
> >>	spin_lock(x);
> >>	smp_mb__after_unlock_lock();
> >>	*y = 1;
> >> }
> >>
> >> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> >> P0 passes a lock to P1 which then uses this fence.
> >>
> >> The patch makes ppo a subrelation of po by letting fence contribute
> >> to ppo only in case the fence links events of the same thread.
> >>
> >> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> >> ---
> >>  tools/memory-model/linux-kernel.cat | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> >> index cfc1b8fd46da..adf3c4f41229 100644
> >> --- a/tools/memory-model/linux-kernel.cat
> >> +++ b/tools/memory-model/linux-kernel.cat
> >> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> >>  let overwrite = co | fr
> >>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >>  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> >> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> >> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> > Alternatively, can it be the following appended diff?  It requires only
> > a single 'int' in ->ppo then and prevents future similar issues caused
> > by sub relations.  Also makes clear that ->ppo can only be
> > CPU-internal.
>
> I had thought about going in that direction, but it doesn't prevent
> future similar issues, at best makes them less likely.

Making it less likely still sounds like a win to me.

> For example, you could have an xfence that somehow goes back to the
> original thread, but to a po-earlier event (e.g., like prop).
>
> Given that to-r and to-w are unlikely to ever become inconsistent with
> po, I am not sure it even really helps much.

I am not sure I understand: what is the problem with enforcing that
ppo is only ever supposed to be in ->int?  Sounds like it makes it
super clear.

> Personally I'm not too happy with the ad-hoc '& int' because it's like

So, with the idea I suggest, you will have fewer ints so you should be
happy ;-)

> adding some unused stuff (via ... | unused-stuff) and then cutting it
> back out with & int, unlike prop & int which has a real semantic
> meaning (propagate back to the thread).  The fastest move is the move
> we avoid doing, so I'd rather not add those parts in the first place.
>
> However, fixing the fence relation turned out to be a lot trickier,
> both because of the missed data race and also rmw-sequences;
> essentially I would have had to disambiguate between xfences and
> fences already in this patch.  So I did this minimal local fix for
> now, and we can discuss whether it makes sense to get rid of the
> '& int' once/if we have xfence etc.

I see.  Ok, I'll defer to your expertise on this since you know more
than I do; I am relatively new to even opening up the CAT code.

Cheers,

 - Joel


> Best wishes,
> jonas
>
> PS:
> > ---8<-----------------------
>
> haha that's so clever :D
>
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 07f884f9b2bf..63052d1628e9 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
> >  let overwrite = co | fr
> >  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >  let to-r = addr | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
> >
> > (* Propagation: Ordering from release operations and strong fences. *)
> > let A-cumul(r) = (rfe ; [Marked])? ; r
>
On Mon, Feb 27, 2023 at 05:50:15PM +0100, Jonas Oberhauser wrote:
> > So I don't see this as a valid argument for not using rw-xbstar in
> > rw-race.  Even theoretically.
>
> There's nothing wrong with using rw-xbstar in rw-race, especially in
> the current LKMM, and I'm not arguing against that.
>
> I'm saying that the argument
> "if rw-xbstar links a read R to a plain write W, and the compiler
> replaces that plain write with a read R' followed, in case R' reads a
> value different from the one W would store, by a store W' (with some
> dependency from R' to W'), then the fact that R and R' can't have a
> data race means that it's safe to use rw-xbstar in rw-race"
> is incomplete.  (Of course that doesn't mean the claim is wrong.)
> To make the argument complete, you also need that W' is generated when
> necessary, and more crucially that W' is still ordered behind R!
> Otherwise you would now have a data race between R and W', like in the
> hypothetical example I mentioned, even though R and R' don't race.
>
> And if you do that second step in LKMM (even with the change of
> w-pre-bounded we are discussing) you quickly find that W' is indeed
> still ordered, so rw-xbstar is perfectly fine.
>
> Perhaps that step is so trivial to you that you don't feel it needs
> mentioning : )  But speaking about LKMM-like models in general, some
> might have some subtle case where rw-xbstar links R and W but would
> not link R and W'.

Ah, okay.  Yes, it is a subtle point.  And by the reasoning I just
used, if such a case did exist then one could conclude it would be an
example demonstrating that rw-xbstar should not have linked R and W in
the first place.

And it looks like I should write up and submit a patch allowing more
dependencies in the definition of w-pre-bounded.

Alan
On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions.  However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
>
> This happens due to the mb/strong-fence/fence relations, which (as
> one case) provide order when locks are passed between threads
> followed by an smp_mb__after_unlock_lock() fence.  This is
> illustrated in the following litmus test (as can be seen when using
> herd7 with `doshow ppo`):
>
> P0(int *x, int *y)
> {
>	spin_lock(x);
>	spin_unlock(x);
> }
>
> P1(int *x, int *y)
> {
>	spin_lock(x);
>	smp_mb__after_unlock_lock();
>	*y = 1;
> }

nit: for readability if for no other reason, s/int *x/spinlock_t *x/.

> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
>
> The patch makes ppo a subrelation of po by letting fence contribute
> to ppo only in case the fence links events of the same thread.
>
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>

Acked-by: Andrea Parri <parri.andrea@gmail.com>

  Andrea


> ---
>  tools/memory-model/linux-kernel.cat | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = (rfe ; [Marked])? ; r
> --
> 2.17.1
>
> The LKMM doesn't believe that a control or data dependency orders a
> plain write after a marked read.  Hence in this test it thinks that P1's
> store to u0 can happen before the load of x1.  I don't remember why we
> did it this way -- probably we just wanted to minimize the restrictions
> on when plain accesses can execute.  (I do remember the reason for
> making address dependencies induce order; it was so RCU would work.)
>
> The patch below will change what the LKMM believes.  It eliminates the
> positive outcome of the litmus test and the data race.  Should it be
> adopted into the memory model?

(Unpopular opinion, I know:) it should drop dependency ordering, not
add/promote it.

  Andrea
On 2/27/2023 8:40 PM, Andrea Parri wrote:
>> The LKMM doesn't believe that a control or data dependency orders a
>> plain write after a marked read.  Hence in this test it thinks that P1's
>> store to u0 can happen before the load of x1.  I don't remember why we
>> did it this way -- probably we just wanted to minimize the restrictions
>> on when plain accesses can execute.  (I do remember the reason for
>> making address dependencies induce order; it was so RCU would work.)
>>
>> The patch below will change what the LKMM believes.  It eliminates the
>> positive outcome of the litmus test and the data race.  Should it be
>> adopted into the memory model?
> (Unpopular opinion, I know:) it should drop dependency ordering, not
> add/promote it.
>
>   Andrea

Maybe not as unpopular as you think... :)
But either way, IMHO it should be consistent; either take all the
dependencies that are true and add them, or drop them all.
In the latter case, RCU should change to an acquire barrier.  (Also,
one would have to deal with OOTA in some yet different way.)

Generally my position is that unless there's a real-world benchmark
with proven performance benefits of relying on dependency ordering,
one should use an acquire barrier.  I haven't yet met such a case, but
maybe one of you has...

Best wishes,
jonas
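For illustration, the choice being weighed here looks roughly like the
following in kernel-style C (a sketch; gp, struct foo, and reader()
are placeholder names, not code from the thread):

	#include <linux/rcupdate.h>

	struct foo { int a; };
	struct foo __rcu *gp;	/* placeholder global */

	int reader(void)
	{
		struct foo *p, *q;
		int r1 = 0, r2 = 0;

		/*
		 * Dependency ordering: only accesses that carry an
		 * address dependency from the rcu_dereference() are
		 * ordered after it; on most architectures this costs
		 * nothing.
		 */
		p = rcu_dereference(gp);
		if (p)
			r1 = p->a;

		/*
		 * Acquire ordering: smp_load_acquire() orders all
		 * po-later accesses, dependent or not, but may cost an
		 * acquire instruction or even a fence (e.g., lwsync on
		 * PowerPC).
		 */
		q = smp_load_acquire(&gp);
		if (q)
			r2 = q->a;

		return r1 + r2;
	}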
On 2/27/2023 6:57 PM, Joel Fernandes wrote:
> On Mon, Feb 27, 2023 at 9:39 AM Jonas Oberhauser
> <jonas.oberhauser@huaweicloud.com> wrote:
>>
>> On 2/26/2023 5:23 PM, Joel Fernandes wrote:
>>> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
>>>> [...]
>>> Alternatively, can it be the following appended diff?  It requires only
>>> a single 'int' in ->ppo then and prevents future similar issues caused
>>> by sub relations.  Also makes clear that ->ppo can only be
>>> CPU-internal.
>> I had thought about going in that direction, but it doesn't prevent
>> future similar issues, at best makes them less likely.
> Making it less likely still sounds like a win to me.
>
>> For example, you could have an xfence that somehow goes back to the
>> original thread, but to a po-earlier event (e.g., like prop).
>>
>> Given that to-r and to-w are unlikely to ever become inconsistent with
>> po, I am not sure it even really helps much.
> I am not sure I understand: what is the problem with enforcing that
> ppo is only ever supposed to be in ->int?  Sounds like it makes it
> super clear.

You could go further and do ... & po.  But it would still make me feel
that it's a plaster used to hold together a steam pipe.
It reminds me a bit of college, when some of my classmates passed the
nightly tests in the programming lab by doing "if input == (the input
of the specific test case they were having problems with) return
(value expected by testcase)".  Or making everything atomic so that
tsan does not complain about data races anymore.
If there's something in one of these relations tying together events
of different threads, is it intentional or a bug?  I prefer to be very
conscious about what is being tied together by the relations.
I'd rather take Boqun's suggestion to add some "debug/testing" flags
to see if a litmus test violates a property assumed by LKMM.
Yet I think the ideal way is to have a mechanized proof that LKMM
satisfies these properties and use that to avoid regressions.

>> Personally I'm not too happy with the ad-hoc '& int' because it's like
> So, with the idea I suggest, you will have fewer ints so you should be
> happy ;-)

haha : )
An alternative perspective is that the & int now covers more cases of
the relation ; )

>> adding some unused stuff (via ... | unused-stuff) and then cutting it
>> back out with & int, unlike prop & int which has a real semantic
>> meaning (propagate back to the thread).  The fastest move is the move
>> we avoid doing, so I'd rather not add those parts in the first place.
>>
>> However, fixing the fence relation turned out to be a lot trickier,
>> both because of the missed data race and also rmw-sequences;
>> essentially I would have had to disambiguate between xfences and
>> fences already in this patch.  So I did this minimal local fix for
>> now, and we can discuss whether it makes sense to get rid of the
>> '& int' once/if we have xfence etc.
> I see.  Ok, I'll defer to your expertise on this since you know more
> than I do; I am relatively new to even opening up the CAT code.

Enjoy : )
jonas
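For reference, the stricter variant alluded to with "... & po" would
presumably read, schematically (an assumption about what the ellipsis
stands for, following Joel's proposed diff):

	(* Sketch: intersect with po itself rather than with int. *)
	let ppo = (to-r | to-w | fence | po-unlock-lock-po) & po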
On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>
> On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > The LKMM doesn't believe that a control or data dependency orders a
> > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > store to u0 can happen before the load of x1.  I don't remember why we
> > > did it this way -- probably we just wanted to minimize the restrictions
> > > on when plain accesses can execute.  (I do remember the reason for
> > > making address dependencies induce order; it was so RCU would work.)
> > >
> > > The patch below will change what the LKMM believes.  It eliminates the
> > > positive outcome of the litmus test and the data race.  Should it be
> > > adopted into the memory model?
> > (Unpopular opinion, I know:) it should drop dependency ordering, not
> > add/promote it.
> >
> >   Andrea
>
> Maybe not as unpopular as you think... :)
> But either way, IMHO it should be consistent; either take all the
> dependencies that are true and add them, or drop them all.
> In the latter case, RCU should change to an acquire barrier.  (Also,
> one would have to deal with OOTA in some yet different way.)
>
> Generally my position is that unless there's a real-world benchmark
> with proven performance benefits of relying on dependency ordering,
> one should use an acquire barrier.  I haven't yet met such a case, but
> maybe one of you has...

https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).

Though this is admittedly for ARMv7 and PowerPC.

							Thanx, Paul
On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>> On 2/27/2023 8:40 PM, Andrea Parri wrote:
>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>> plain write after a marked read.  Hence in this test it thinks that P1's
>>>> store to u0 can happen before the load of x1.  I don't remember why we
>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>> on when plain accesses can execute.  (I do remember the reason for
>>>> making address dependencies induce order; it was so RCU would work.)
>>>>
>>>> The patch below will change what the LKMM believes.  It eliminates the
>>>> positive outcome of the litmus test and the data race.  Should it be
>>>> adopted into the memory model?
>>> (Unpopular opinion, I know:) it should drop dependency ordering, not
>>> add/promote it.
>>>
>>> Andrea
>> Maybe not as unpopular as you think... :)
>> But either way, IMHO it should be consistent; either take all the
>> dependencies that are true and add them, or drop them all.
>> In the latter case, RCU should change to an acquire barrier.  (Also,
>> one would have to deal with OOTA in some yet different way.)
>>
>> Generally my position is that unless there's a real-world benchmark
>> with proven performance benefits of relying on dependency ordering,
>> one should use an acquire barrier.  I haven't yet met such a case, but
>> maybe one of you has...
> https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
>
> Though this is admittedly for ARMv7 and PowerPC.

Thanks for the link.

It's true that on architectures that don't have an acquire load (and
have to use a fence), the penalty will be bigger.

But the more obvious discussion would be what constitutes a real-world
benchmark : )
In my experience you can get a lot of performance benefits out of
optimizing barriers in code if all you execute is that code.
But once you embed that into a real-world application, often 90%-99%
of the time spent will be in the business logic, not in the data
structure.

And then the benefits suddenly disappear.
Note that a lot of barriers are a lot cheaper as well when there's no
contention.

Because of that, making optimization decisions based on
microbenchmarks can sometimes lead to a very poor "time invested" vs.
"total product improvement" ratio.

Best wishes,
jonas
On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
>
> On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > > > store to u0 can happen before the load of x1.  I don't remember why we
> > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > on when plain accesses can execute.  (I do remember the reason for
> > > > > making address dependencies induce order; it was so RCU would work.)
> > > > >
> > > > > The patch below will change what the LKMM believes.  It eliminates the
> > > > > positive outcome of the litmus test and the data race.  Should it be
> > > > > adopted into the memory model?
> > > > (Unpopular opinion, I know:) it should drop dependency ordering, not
> > > > add/promote it.
> > > >
> > > > Andrea
> > > Maybe not as unpopular as you think... :)
> > > But either way, IMHO it should be consistent; either take all the
> > > dependencies that are true and add them, or drop them all.
> > > In the latter case, RCU should change to an acquire barrier.  (Also,
> > > one would have to deal with OOTA in some yet different way.)
> > >
> > > Generally my position is that unless there's a real-world benchmark
> > > with proven performance benefits of relying on dependency ordering,
> > > one should use an acquire barrier.  I haven't yet met such a case, but
> > > maybe one of you has...
> > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> >
> > Though this is admittedly for ARMv7 and PowerPC.
>
> Thanks for the link.
>
> It's true that on architectures that don't have an acquire load (and
> have to use a fence), the penalty will be bigger.
>
> But the more obvious discussion would be what constitutes a real-world
> benchmark : )
> In my experience you can get a lot of performance benefits out of
> optimizing barriers in code if all you execute is that code.
> But once you embed that into a real-world application, often 90%-99%
> of the time spent will be in the business logic, not in the data
> structure.
>
> And then the benefits suddenly disappear.
> Note that a lot of barriers are a lot cheaper as well when there's no
> contention.
>
> Because of that, making optimization decisions based on
> microbenchmarks can sometimes lead to a very poor "time invested" vs.
> "total product improvement" ratio.

All true, though that 2x and 4x should be worth something.

The real-world examples I know of involved garbage collectors, and the
improvement was said to be a few percent system-wide.  But that was a
verbal exchange, so I don't have a citation for you.

							Thanx, Paul
On Mon, Feb 27, 2023 at 08:35:11PM +0100, Andrea Parri wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > As stated in the documentation and implied by its name, the ppo
> > (preserved program order) relation is intended to link po-earlier
> > to po-later instructions under certain conditions.  However, a
> > corner case currently allows instructions to be linked by ppo that
> > are not executed by the same thread, i.e., instructions are being
> > linked that have no po relation.
> >
> > This happens due to the mb/strong-fence/fence relations, which (as
> > one case) provide order when locks are passed between threads
> > followed by an smp_mb__after_unlock_lock() fence.  This is
> > illustrated in the following litmus test (as can be seen when using
> > herd7 with `doshow ppo`):
> >
> > P0(int *x, int *y)
> > {
> >	spin_lock(x);
> >	spin_unlock(x);
> > }
> >
> > P1(int *x, int *y)
> > {
> >	spin_lock(x);
> >	smp_mb__after_unlock_lock();
> >	*y = 1;
> > }
>
> nit: for readability if for no other reason, s/int *x/spinlock_t *x/.

Unless there are objections, I will make this change on my next rebase.

> > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > P0 passes a lock to P1 which then uses this fence.
> >
> > The patch makes ppo a subrelation of po by letting fence contribute
> > to ppo only in case the fence links events of the same thread.
> >
> > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>
> Acked-by: Andrea Parri <parri.andrea@gmail.com>

And add this as well, thank you!

							Thanx, Paul

> Andrea
>
>
> > ---
> >  tools/memory-model/linux-kernel.cat | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index cfc1b8fd46da..adf3c4f41229 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> >  let overwrite = co | fr
> >  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> >
> > (* Propagation: Ordering from release operations and strong fences. *)
> > let A-cumul(r) = (rfe ; [Marked])? ; r
> > --
> > 2.17.1
> >
On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
>>
>> On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
>>> On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>>>> On 2/27/2023 8:40 PM, Andrea Parri wrote:
>>>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>>>> plain write after a marked read.  Hence in this test it thinks that P1's
>>>>>> store to u0 can happen before the load of x1.  I don't remember why we
>>>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>>>> on when plain accesses can execute.  (I do remember the reason for
>>>>>> making address dependencies induce order; it was so RCU would work.)
>>>>>>
>>>>>> The patch below will change what the LKMM believes.  It eliminates the
>>>>>> positive outcome of the litmus test and the data race.  Should it be
>>>>>> adopted into the memory model?
>>>>> (Unpopular opinion, I know:) it should drop dependency ordering, not
>>>>> add/promote it.
>>>>>
>>>>> Andrea
>>>> Maybe not as unpopular as you think... :)
>>>> But either way, IMHO it should be consistent; either take all the
>>>> dependencies that are true and add them, or drop them all.
>>>> In the latter case, RCU should change to an acquire barrier.  (Also,
>>>> one would have to deal with OOTA in some yet different way.)
>>>>
>>>> Generally my position is that unless there's a real-world benchmark
>>>> with proven performance benefits of relying on dependency ordering,
>>>> one should use an acquire barrier.  I haven't yet met such a case, but
>>>> maybe one of you has...
>>> https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
>>>
>>> Though this is admittedly for ARMv7 and PowerPC.
>>>
>> Thanks for the link.
>>
>> It's true that on architectures that don't have an acquire load (and
>> have to use a fence), the penalty will be bigger.
>>
>> But the more obvious discussion would be what constitutes a real-world
>> benchmark : )
>> In my experience you can get a lot of performance benefits out of
>> optimizing barriers in code if all you execute is that code.
>> But once you embed that into a real-world application, often 90%-99%
>> of the time spent will be in the business logic, not in the data
>> structure.
>>
>> And then the benefits suddenly disappear.
>> Note that a lot of barriers are a lot cheaper as well when there's no
>> contention.
>>
>> Because of that, making optimization decisions based on
>> microbenchmarks can sometimes lead to a very poor "time invested" vs.
>> "total product improvement" ratio.
> All true, though that 2x and 4x should be worth something.

I think the most egregious case we had (not barrier related, but cache
related) was something like ~100x in specific benchmarks and then I
think somewhere around 1% system-wide.  I think the issue was that in
the larger system, we couldn't keep the cache hot, so our greatly
improved data locality was being diluted.
But of course it always depends on how much that component actually
contributes to the overall system performance.
Which IMHO should be one of the measurements taken before starting to
invest heavily into optimizations.
Surprisingly, many people don't want to do that.
I think it's something about the misleading calculus of

  "2 months implementing the optimization + 2 weeks building robust
   benchmarks & profiling infrastructure > 2 months implementing the
   optimization"

instead of

  "2 weeks building robust benchmarks & profiling infrastructure +
   0 months implementing a useless optimization < 2 months
   implementing the optimization",

which seems to be the more common case.

> The real-world examples I know of involved garbage collectors, and the
> improvement was said to be a few percent system-wide.  But that was a
> verbal exchange, so I don't have a citation for you.

Thanks for the example, it sounds very reasonable (at least for
platforms like PowerPC).
GC has all the hallmarks of a good optimization target: measurable
impact on system-wide throughput and latency, and pointer chasing
(= dependency ordering) being a hot part inside the algorithm.

Best wishes,
jonas
On Wed, Mar 01, 2023 at 11:52:09AM +0100, Jonas Oberhauser wrote:
>
> On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
> > >
> > > On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > > > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > > > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > > > > > store to u0 can happen before the load of x1.  I don't remember why we
> > > > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > > > on when plain accesses can execute.  (I do remember the reason for
> > > > > > > making address dependencies induce order; it was so RCU would work.)
> > > > > > >
> > > > > > > The patch below will change what the LKMM believes.  It eliminates the
> > > > > > > positive outcome of the litmus test and the data race.  Should it be
> > > > > > > adopted into the memory model?
> > > > > > (Unpopular opinion, I know:) it should drop dependency ordering, not
> > > > > > add/promote it.
> > > > > >
> > > > > > Andrea
> > > > > Maybe not as unpopular as you think... :)
> > > > > But either way, IMHO it should be consistent; either take all the
> > > > > dependencies that are true and add them, or drop them all.
> > > > > In the latter case, RCU should change to an acquire barrier.  (Also,
> > > > > one would have to deal with OOTA in some yet different way.)
> > > > >
> > > > > Generally my position is that unless there's a real-world benchmark
> > > > > with proven performance benefits of relying on dependency ordering,
> > > > > one should use an acquire barrier.  I haven't yet met such a case, but
> > > > > maybe one of you has...
> > > > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> > > >
> > > > Though this is admittedly for ARMv7 and PowerPC.
> > > >
> > > Thanks for the link.
> > >
> > > It's true that on architectures that don't have an acquire load (and
> > > have to use a fence), the penalty will be bigger.
> > >
> > > But the more obvious discussion would be what constitutes a real-world
> > > benchmark : )
> > > In my experience you can get a lot of performance benefits out of
> > > optimizing barriers in code if all you execute is that code.
> > > But once you embed that into a real-world application, often 90%-99%
> > > of the time spent will be in the business logic, not in the data
> > > structure.
> > >
> > > And then the benefits suddenly disappear.
> > > Note that a lot of barriers are a lot cheaper as well when there's no
> > > contention.
> > >
> > > Because of that, making optimization decisions based on
> > > microbenchmarks can sometimes lead to a very poor "time invested" vs.
> > > "total product improvement" ratio.
> > All true, though that 2x and 4x should be worth something.
>
> I think the most egregious case we had (not barrier related, but cache
> related) was something like ~100x in specific benchmarks and then I
> think somewhere around 1% system-wide.  I think the issue was that in
> the larger system, we couldn't keep the cache hot, so our greatly
> improved data locality was being diluted.
> But of course it always depends on how much that component actually
> contributes to the overall system performance.
> Which IMHO should be one of the measurements taken before starting to
> invest heavily into optimizations.
Agreed, it is all too easy for profound local optimizations to have
near-zero global effect.

> Surprisingly, many people don't want to do that.  I think it's
> something about the misleading calculus of "2 months implementing the
> optimization + 2 weeks building robust benchmarks & profiling
> infrastructure > 2 months implementing the optimization" instead of
> "2 weeks building robust benchmarks & profiling infrastructure +
> 0 months implementing a useless optimization < 2 months implementing
> the optimization", which seems to be the more common case.

Or, for that matter, having the investigation locate a five-minute
change that produces large benefits.

> > The real-world examples I know of involved garbage collectors, and the
> > improvement was said to be a few percent system-wide.  But that was a
> > verbal exchange, so I don't have a citation for you.
>
> Thanks for the example, it sounds very reasonable (at least for
> platforms like PowerPC).
> GC has all the hallmarks of a good optimization target: measurable
> impact on system-wide throughput and latency, and pointer chasing
> (= dependency ordering) being a hot part inside the algorithm.

I believe that the target was ARM.  It was definitely not PowerPC.

But yes, GCs are indeed an intense exercise in pointer chasing.

							Thanx, Paul
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index cfc1b8fd46da..adf3c4f41229 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
 let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
+let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)

 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = (rfe ; [Marked])? ; r
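A litmus test can be checked against the patched model in the usual
way, from the tools/memory-model directory (test.litmus being a
placeholder name for one of the tests above):

	$ herd7 -conf linux-kernel.cfg test.litmus

adding, e.g., -show all -gv -doshow ppo to render the ppo relation in
the generated execution graph, as mentioned in the patch description.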