Message ID | 20230126134604.2160-3-jonas.oberhauser@huaweicloud.com
State | New
Headers |
From: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
To: paulmck@kernel.org
Cc: stern@rowland.harvard.edu, parri.andrea@gmail.com, will@kernel.org, peterz@infradead.org, boqun.feng@gmail.com, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, urezki@gmail.com, quic_neeraju@quicinc.com, frederic@kernel.org, linux-kernel@vger.kernel.org, Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Subject: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po
Date: Thu, 26 Jan 2023 14:46:04 +0100
Message-Id: <20230126134604.2160-3-jonas.oberhauser@huaweicloud.com>
In-Reply-To: <20230126134604.2160-1-jonas.oberhauser@huaweicloud.com>
References: <20230126134604.2160-1-jonas.oberhauser@huaweicloud.com>
Series | Streamlining treatment of smp_mb__after_unlock_lock
Commit Message
Jonas Oberhauser
Jan. 26, 2023, 1:46 p.m. UTC
As stated in the documentation and implied by its name, the ppo
(preserved program order) relation is intended to link po-earlier
to po-later instructions under certain conditions. However, a
corner case currently allows instructions to be linked by ppo that
are not executed by the same thread, i.e., instructions are being
linked that have no po relation.
This happens due to the mb/strong-fence relations, which (in one
case) provide ordering when a lock is passed between threads and the
receiving thread then executes an smp_mb__after_unlock_lock() fence.
This is illustrated in the following litmus test (as can be seen when
using herd7 with `doshow ppo`):
P0(int *x, int *y)
{
	spin_lock(x);
	spin_unlock(x);
}

P1(int *x, int *y)
{
	spin_lock(x);
	smp_mb__after_unlock_lock();
	*y = 1;
}
The ppo relation will link P0's spin_lock(x) and P1's *y = 1, because
P0 passes the lock to P1, which then uses this fence.
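(For reference, a complete litmus file accepted by herd7 could look like the sketch below; the test name, the spinlock_t declaration for the lock, the init block, and the placeholder exists clause are illustrative additions that are not part of the snippet above.)

C after-unlock-lock-ppo

{}

P0(spinlock_t *x, int *y)
{
	spin_lock(x);
	spin_unlock(x);
}

P1(spinlock_t *x, int *y)
{
	spin_lock(x);
	smp_mb__after_unlock_lock();
	*y = 1;
}

exists (y=1)

The ppo edge can then be inspected with an invocation along the lines of the ones Andrea gives later in the thread, e.g.

herd7 -conf linux-kernel.cfg after-unlock-lock-ppo.litmus -show prop -o /tmp -doshow ppo -showraw ppo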
The patch makes ppo a subrelation of po by eliminating this possibility
from mb (but not strong-fence) and relying explicitly on mb|gp instead
of strong-fence when defining ppo.
Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
tools/memory-model/linux-kernel.cat | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
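The patch body itself is not reproduced on this page; reconstructed from the copy quoted in Alan Stern's reply below (indentation approximate), the change reads:

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 6e531457bb73..815fdafacaef 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
 	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
-	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
+	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
+let gp = po ; [Sync-rcu | Sync-srcu] ; po?
+let strong-fence = mb | gp |
 	(*
 	 * Note: The po-unlock-lock-po relation only passes the lock to the direct
 	 * successor, perhaps giving the impression that the ordering of the
@@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
 	 *)
 	([M] ; po-unlock-lock-po ;
 		[After-unlock-lock] ; po ; [M])
-let gp = po ; [Sync-rcu | Sync-srcu] ; po?
-let strong-fence = mb | gp
 
-let nonrw-fence = strong-fence | po-rel | acq-po
+
+let nonrw-fence = mb | gp | po-rel | acq-po
 let fence = nonrw-fence | wmb | rmb
 let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
 		Before-atomic | After-atomic | Acquire | Release |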
Comments
On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote: > As stated in the documentation and implied by its name, the ppo > (preserved program order) relation is intended to link po-earlier > to po-later instructions under certain conditions. However, a > corner case currently allows instructions to be linked by ppo that > are not executed by the same thread, i.e., instructions are being > linked that have no po relation. > > This happens due to the mb/strong-fence relations, which (as one > case) provide order when locks are passed between threads followed > by an smp_mb__after_unlock_lock() fence. This is illustrated in > the following litmus test (as can be seen when using herd7 with > `doshow ppo`): > > P0(int *x, int *y) > { > spin_lock(x); > spin_unlock(x); > } > > P1(int *x, int *y) > { > spin_lock(x); > smp_mb__after_unlock_lock(); > *y = 1; > } > > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because > P0 passes a lock to P1 which then uses this fence. > > The patch makes ppo a subrelation of po by eliminating this possibility > from mb (but not strong-fence) and relying explicitly on mb|gp instead > of strong-fence when defining ppo. > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> > --- This changes the meaning of the fence relation, which is used in w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar. Have you checked that they won't be affected by the change? > tools/memory-model/linux-kernel.cat | 9 +++++---- > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat > index 6e531457bb73..815fdafacaef 100644 > --- a/tools/memory-model/linux-kernel.cat > +++ b/tools/memory-model/linux-kernel.cat > @@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W] > let mb = ([M] ; fencerel(Mb) ; [M]) | > ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) | > ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) | > - ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) | > + ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) > +let gp = po ; [Sync-rcu | Sync-srcu] ; po? > +let strong-fence = mb | gp | > (* > * Note: The po-unlock-lock-po relation only passes the lock to the direct > * successor, perhaps giving the impression that the ordering of the > @@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) | > *) > ([M] ; po-unlock-lock-po ; > [After-unlock-lock] ; po ; [M]) > -let gp = po ; [Sync-rcu | Sync-srcu] ; po? > -let strong-fence = mb | gp > > -let nonrw-fence = strong-fence | po-rel | acq-po > + Extra blank line. > +let nonrw-fence = mb | gp | po-rel | acq-po > let fence = nonrw-fence | wmb | rmb > let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu | > Before-atomic | After-atomic | Acquire | Release | > -- > 2.17.1 Alan
On 1/26/2023 5:36 PM, Alan Stern wrote: > On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote: >> As stated in the documentation and implied by its name, the ppo >> (preserved program order) relation is intended to link po-earlier >> to po-later instructions under certain conditions. However, a >> corner case currently allows instructions to be linked by ppo that >> are not executed by the same thread, i.e., instructions are being >> linked that have no po relation. >> >> This happens due to the mb/strong-fence relations, which (as one >> case) provide order when locks are passed between threads followed >> by an smp_mb__after_unlock_lock() fence. This is illustrated in >> the following litmus test (as can be seen when using herd7 with >> `doshow ppo`): >> >> P0(int *x, int *y) >> { >> spin_lock(x); >> spin_unlock(x); >> } >> >> P1(int *x, int *y) >> { >> spin_lock(x); >> smp_mb__after_unlock_lock(); >> *y = 1; >> } >> >> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because >> P0 passes a lock to P1 which then uses this fence. >> >> The patch makes ppo a subrelation of po by eliminating this possibility >> from mb (but not strong-fence) and relying explicitly on mb|gp instead >> of strong-fence when defining ppo. >> >> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> >> --- > This changes the meaning of the fence relation, which is used in > w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar. Have you > checked that they won't be affected by the change? Good point, in fact it changes the nonrw-fence as well which is used in the r-* relations. I had missed this completely. That's what I get for not looking at the data race relations! Let's go through the individual cases. If we had e.g. *-post-bounded because of the po-unlock-lock-po;[After-unlock-lock];po edge, this may be either due to the fence rule or due to (...fence ; vis ; *-pre-bounded). In the first case we have po-rel ; rfe ; acq-po and get fence;rfe;(xbstar&int);fence which gives us *-post-bounded. In the second case we now have strong-fence, with vis <= xbstar we see that **-vis is preserved here by switching from the fence rule to the strong-fence;xbstar;... case. For *-pre-bounded, the situtation is tricky because of the xbstar&int that can be at the end of vis, when *-pre-bounded is used to define w*-vis through (w-post-bounded;vis;*-pre-bounded). In this case, the xbstar&int can point in the opposite direction of po, which means that the unlock that creates the strong fence might not be po-after the instruction that starts the link. Here's a litmus test illustrating the difference, where P1 has a backwards-pointing xbstar&int. Currently there's no data race, but with the proposed patch there is. P0(int *x, int *y) { *x = 1; smp_store_release(y, 1); } P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) { spin_lock(l); int r1 = READ_ONCE(*dy); if (r1==1) spin_unlock(l); int r0 = smp_load_acquire(y); if (r0 == 1) { WRITE_ONCE(*dx,1); } } P2(int *dx, int *dy) { WRITE_ONCE(*dy,READ_ONCE(*dx)); } P3(int *x, spinlock_t *l) { spin_lock(l); smp_mb__after_unlock_lock(); *x = 2; } This actually makes me wonder. I thought the reason for the xbstar & int is that it ensures that the overall relation, after shuffling around a little bit, becomes prop&int ; hb*. Like in case the *x=2 is co-before the *x=1, we get Wx2 ->overwrite Wx1 ->cumul-fence*;rfe (some event on the same CPU as Wx2) ->fence Wx2 which is Wx2 -> prop&int some other event ->hb Wx2 which must be irreflexive. 
However, that's not the case at all, because the fence relation currently doesn't actually have to relate events of the same CPU. So we get Wx2 ->overwrite Wx1 ->cumul-fence*;rfe (some event on some other CPU than Wx2's) ->(hb*&int);fence Wx2 i.e., Wx2 ->prop&ext;hb*;strong-fence Wx2 which shouldn't provide any ordering in general. In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs, (pilot errors notwithstanding) both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in the current LKMM (in graphs where all other edges are equal). Shouldn't this actually *be* a data race? And potentially the same with rcu-fence? The other cases of *-pre-bounded seem to work out: they all link stuff via xbstar to the instruction that is linked via po-unlock-lock-po ; [After-unlock-lock] ; po to the potentially racy access, and po-unlock-lock-po;po is xbstar ; acq-po, which allows closing the gap. Best wishes, jonas
On Fri, Jan 27, 2023 at 03:31:25PM +0100, Jonas Oberhauser wrote: > Here's a litmus test illustrating the difference, where P1 has a > backwards-pointing xbstar&int. Currently there's no data race, but with the > proposed patch there is. > > P0(int *x, int *y) > { > *x = 1; > smp_store_release(y, 1); > } > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) > { > spin_lock(l); > int r1 = READ_ONCE(*dy); > if (r1==1) > spin_unlock(l); > > int r0 = smp_load_acquire(y); > if (r0 == 1) { > WRITE_ONCE(*dx,1); > } > } > > P2(int *dx, int *dy) > { > WRITE_ONCE(*dy,READ_ONCE(*dx)); > } > > > P3(int *x, spinlock_t *l) > { > spin_lock(l); > smp_mb__after_unlock_lock(); > *x = 2; > } I don't understand why the current LKMM doesn't say there is a data race. In fact, I don't understand what herd7 is doing with this litmus test at all. Evidently the plain-coherence check rules out x=1 at the end, because when I relax that check, x=1 becomes a possible result. Furthermore, the graphical output confirms that this execution has a ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 to Wx=2! How can this be possible? It seems like a bug in herd7. Furthermore, the execution with x=2 at the end doesn't have either a ww-vis or a ww-nonrace edge betwen Wx=1 and Wx=2. So why isn't there a ww-race edge? > This actually makes me wonder. I thought the reason for the xbstar & int is > that it ensures that the overall relation, after shuffling around a little > bit, becomes prop&int ; hb*. No, that is not the reason for it. See below. > Like in case the *x=2 is co-before the *x=1, we get > Wx2 ->overwrite Wx1 ->cumul-fence*;rfe (some event on the same CPU as > Wx2) ->fence Wx2 > which is > Wx2 -> prop&int some other event ->hb Wx2 > which must be irreflexive. > > However, that's not the case at all, because the fence relation currently > doesn't actually have to relate events of the same CPU. > So we get > Wx2 ->overwrite Wx1 ->cumul-fence*;rfe (some event on some other CPU than > Wx2's) ->(hb*&int);fence Wx2 > i.e., > Wx2 ->prop&ext;hb*;strong-fence Wx2 > > which shouldn't provide any ordering in general. > > In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs, (pilot errors > notwithstanding) both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in the > current LKMM (in graphs where all other edges are equal). > > Shouldn't this actually *be* a data race? And potentially the same with > rcu-fence? I think that herd7 _should_ say there is a data race. On the other hand, I also think that the operational model says there isn't. This is a case where the formal model fails to match the operational model. The operational model says that if W is a release-store on CPU C and W' is another store which propagates to C before W executes, then W' propagates to every CPU before W does. (In other words, releases are A-cumulative). But the formal model enforces this rule only in cases the event reading W' on C is po-before W. In your litmus test, the event reading y=1 on P1 is po-after the spin_unlock (which is a release). Nevertheless, any feasible execution requires that P1 must execute Ry=1 before the unlock. So the operational model says that y=1 must propagate to P3 before the unlock does, i.e., before P3 executes the spin_lock(). But the formal model doesn't require this. Ideally we would fix this by changing the definition of po-rel to: [M] ; (xbstar & int) ; [Release] (This is closely related to the use of (xbstar & int) in the definition of vis that you asked about.) 
Unfortunately we can't do this, because po-rel has to be defined long before xbstar. > The other cases of *-pre-bounded seem to work out: they all link stuff via > xbstar to the instruction that is linked via po-unlock-lock-po ; > [After-unlock-lock] ; po to the potentially racy access, and > po-unlock-lock-po;po is xbstar ; acq-po, which allows closing the gap. I could not follow your arguments at all; the writing was too confusing. Alan
> Evidently the plain-coherence check rules out x=1 at the > end, because when I relax that check, x=1 becomes a possible result. > Furthermore, the graphical output confirms that this execution has a > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 > to Wx=2! How can this be possible? It seems like a bug in herd7. By default, herd7 performs some edges removal when generating the graphical outputs. The option -showraw can be useful to increase the "verbosity", for example, [with "exists (x=2)", output in /tmp/T.dot] $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis > Furthermore, the execution with x=2 at the end doesn't have either a > ww-vis or a ww-nonrace edge betwen Wx=1 and Wx=2. So why isn't there a > ww-race edge? And similarly [with "exists (x=2)"] $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace Andrea
On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote: > > Evidently the plain-coherence check rules out x=1 at the > > end, because when I relax that check, x=1 becomes a possible result. > > Furthermore, the graphical output confirms that this execution has a > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 > > to Wx=2! How can this be possible? It seems like a bug in herd7. > > By default, herd7 performs some edges removal when generating the > graphical outputs. The option -showraw can be useful to increase > the "verbosity", for example, > > [with "exists (x=2)", output in /tmp/T.dot] This was meant to be "exists (x=1)". Andrea > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis > > > > Furthermore, the execution with x=2 at the end doesn't have either a > > ww-vis or a ww-nonrace edge betwen Wx=1 and Wx=2. So why isn't there a > > ww-race edge? > > And similarly > > [with "exists (x=2)"] > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace > > Andrea
On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote: > > Evidently the plain-coherence check rules out x=1 at the > > end, because when I relax that check, x=1 becomes a possible result. > > Furthermore, the graphical output confirms that this execution has a > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 > > to Wx=2! How can this be possible? It seems like a bug in herd7. > > By default, herd7 performs some edges removal when generating the > graphical outputs. The option -showraw can be useful to increase > the "verbosity", for example, > > [with "exists (x=2)", output in /tmp/T.dot] > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis Okay, thanks, that helps a lot. So here's what we've got. The litmus test: C hb-and-int {} P0(int *x, int *y) { *x = 1; smp_store_release(y, 1); } P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) { spin_lock(l); int r1 = READ_ONCE(*dy); if (r1==1) spin_unlock(l); int r0 = smp_load_acquire(y); if (r0 == 1) { WRITE_ONCE(*dx,1); } } P2(int *dx, int *dy) { WRITE_ONCE(*dy,READ_ONCE(*dx)); } P3(int *x, spinlock_t *l) { spin_lock(l); smp_mb__after_unlock_lock(); *x = 2; } exists (x=2) The reason why Wx=1 ->ww-vis Wx=2: 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis. This explains why the memory model says there isn't a data race. This doesn't use the smp_mb__after_unlock_lock at all. Alan
On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote: > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote: > > > Evidently the plain-coherence check rules out x=1 at the > > > end, because when I relax that check, x=1 becomes a possible result. > > > Furthermore, the graphical output confirms that this execution has a > > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 > > > to Wx=2! How can this be possible? It seems like a bug in herd7. > > > > By default, herd7 performs some edges removal when generating the > > graphical outputs. The option -showraw can be useful to increase > > the "verbosity", for example, > > > > [with "exists (x=2)", output in /tmp/T.dot] > > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis > > Okay, thanks, that helps a lot. > > So here's what we've got. The litmus test: > > > C hb-and-int > {} > > P0(int *x, int *y) > { > *x = 1; > smp_store_release(y, 1); > } > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) > { > spin_lock(l); > int r1 = READ_ONCE(*dy); > if (r1==1) > spin_unlock(l); > > int r0 = smp_load_acquire(y); > if (r0 == 1) { > WRITE_ONCE(*dx,1); > } The lack of a spin_unlock() when r1!=1 is intentional? It is admittedly a cute way to prevent P3 from doing anything when r1!=1. And P1 won't do anything if P3 runs first. > } > > P2(int *dx, int *dy) > { > WRITE_ONCE(*dy,READ_ONCE(*dx)); > } > > > P3(int *x, spinlock_t *l) > { > spin_lock(l); > smp_mb__after_unlock_lock(); > *x = 2; > } > > exists (x=2) > > > The reason why Wx=1 ->ww-vis Wx=2: > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis. > > This explains why the memory model says there isn't a data race. This > doesn't use the smp_mb__after_unlock_lock at all. You lost me on this one. Suppose that P3 starts first, then P0. P1 is then stuck at the spin_lock() because P3 does not release that lock. P2 goes out for a pizza. Why can't the two stores to x by P0 and P3 conflict, resulting in a data race? Thanx, Paul
On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote: > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote: > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote: > > > > Evidently the plain-coherence check rules out x=1 at the > > > > end, because when I relax that check, x=1 becomes a possible result. > > > > Furthermore, the graphical output confirms that this execution has a > > > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 > > > > to Wx=2! How can this be possible? It seems like a bug in herd7. > > > > > > By default, herd7 performs some edges removal when generating the > > > graphical outputs. The option -showraw can be useful to increase > > > the "verbosity", for example, > > > > > > [with "exists (x=2)", output in /tmp/T.dot] > > > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis > > > > Okay, thanks, that helps a lot. > > > > So here's what we've got. The litmus test: > > > > > > C hb-and-int > > {} > > > > P0(int *x, int *y) > > { > > *x = 1; > > smp_store_release(y, 1); > > } > > > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) > > { > > spin_lock(l); > > int r1 = READ_ONCE(*dy); > > if (r1==1) > > spin_unlock(l); > > > > int r0 = smp_load_acquire(y); > > if (r0 == 1) { > > WRITE_ONCE(*dx,1); > > } > > The lack of a spin_unlock() when r1!=1 is intentional? I assume so. > It is admittedly a cute way to prevent P3 from doing anything > when r1!=1. And P1 won't do anything if P3 runs first. Right. > > } > > > > P2(int *dx, int *dy) > > { > > WRITE_ONCE(*dy,READ_ONCE(*dx)); > > } > > > > > > P3(int *x, spinlock_t *l) > > { > > spin_lock(l); > > smp_mb__after_unlock_lock(); > > *x = 2; > > } > > > > exists (x=2) > > > > > > The reason why Wx=1 ->ww-vis Wx=2: > > > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. > > > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and > > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. > > > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 > > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 > > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. > > > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis. > > > > This explains why the memory model says there isn't a data race. This > > doesn't use the smp_mb__after_unlock_lock at all. > > You lost me on this one. > > Suppose that P3 starts first, then P0. P1 is then stuck at the > spin_lock() because P3 does not release that lock. P2 goes out for a > pizza. That wouldn't be a valid execution. One of the rules in lock.cat says that a spin_lock() call must read from a spin_unlock() or from an initial write, which rules out executions in which P3 acquires the lock first. > Why can't the two stores to x by P0 and P3 conflict, resulting in a > data race? That can't happen in executions where P1 acquires the lock first for the reason outlined above (P0's store to x propagates to P3 before P3 writes to x). And there are no other executions -- basically, herd7 ignores deadlock scenarios. Alan
On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote: > On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote: > > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote: > > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote: > > > > > Evidently the plain-coherence check rules out x=1 at the > > > > > end, because when I relax that check, x=1 becomes a possible result. > > > > > Furthermore, the graphical output confirms that this execution has a > > > > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 > > > > > to Wx=2! How can this be possible? It seems like a bug in herd7. > > > > > > > > By default, herd7 performs some edges removal when generating the > > > > graphical outputs. The option -showraw can be useful to increase > > > > the "verbosity", for example, > > > > > > > > [with "exists (x=2)", output in /tmp/T.dot] > > > > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis > > > > > > Okay, thanks, that helps a lot. > > > > > > So here's what we've got. The litmus test: > > > > > > > > > C hb-and-int > > > {} > > > > > > P0(int *x, int *y) > > > { > > > *x = 1; > > > smp_store_release(y, 1); > > > } > > > > > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) > > > { > > > spin_lock(l); > > > int r1 = READ_ONCE(*dy); > > > if (r1==1) > > > spin_unlock(l); > > > > > > int r0 = smp_load_acquire(y); > > > if (r0 == 1) { > > > WRITE_ONCE(*dx,1); > > > } > > > > The lack of a spin_unlock() when r1!=1 is intentional? > > I assume so. > > > It is admittedly a cute way to prevent P3 from doing anything > > when r1!=1. And P1 won't do anything if P3 runs first. > > Right. > > > > } > > > > > > P2(int *dx, int *dy) > > > { > > > WRITE_ONCE(*dy,READ_ONCE(*dx)); > > > } > > > > > > > > > P3(int *x, spinlock_t *l) > > > { > > > spin_lock(l); > > > smp_mb__after_unlock_lock(); > > > *x = 2; > > > } > > > > > > exists (x=2) > > > > > > > > > The reason why Wx=1 ->ww-vis Wx=2: > > > > > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. > > > > > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and > > > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. > > > > > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 > > > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 > > > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. > > > > > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis. > > > > > > This explains why the memory model says there isn't a data race. This > > > doesn't use the smp_mb__after_unlock_lock at all. > > > > You lost me on this one. > > > > Suppose that P3 starts first, then P0. P1 is then stuck at the > > spin_lock() because P3 does not release that lock. P2 goes out for a > > pizza. > > That wouldn't be a valid execution. One of the rules in lock.cat says > that a spin_lock() call must read from a spin_unlock() or from an > initial write, which rules out executions in which P3 acquires the lock > first. OK, I will bite... Why can't P3's spin_lock() read from that initial write? > > Why can't the two stores to x by P0 and P3 conflict, resulting in a > > data race? > > That can't happen in executions where P1 acquires the lock first for the > reason outlined above (P0's store to x propagates to P3 before P3 writes > to x). And there are no other executions -- basically, herd7 ignores > deadlock scenarios. True enough, if P1 gets there first, then P3 never stores to x. What I don't understand is why P1 must always get there first. Thanx, Paul
> The reason why Wx=1 ->ww-vis Wx=2: > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis. To clarify, po-unlock-lock-po is not a subrelation of mb; see what happens without the smp_mb__after_unlock_lock(). Andrea
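Concretely, "without the smp_mb__after_unlock_lock()" means rerunning the hb-and-int test above with P3 reduced to the sketch below; the mb link from P1's critical section into P3's write then disappears, and the ww-vis chain described above no longer goes through:

P3(int *x, spinlock_t *l)
{
	spin_lock(l);
	*x = 2;	/* smp_mb__after_unlock_lock() removed */
}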
> Why can't P3's spin_lock() read from that initial write?
Mmh, sounds like you want to play with something like below?
Andrea
diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
index 6b52f365d73ac..20c3af4511255 100644
--- a/tools/memory-model/lock.cat
+++ b/tools/memory-model/lock.cat
@@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
(* Allow up to one unmatched LKW per location; more must deadlock *)
let UNMATCHED-LKW = LKW \ domain(critical)
-empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
(* rfi for LF events: link each LKW to the LF events in its critical section *)
let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
@@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
let rf = rf | rf-lf | rf-ru
(* Generate all co relations, including LKW events but not UL *)
-let co0 = co0 | ([IW] ; loc ; [LKW]) |
- (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
+let co0 = co0 | ([IW] ; loc ; [LKW])
include "cos-opt.cat"
let W = W | UL
let M = R | W
On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote: > > Why can't P3's spin_lock() read from that initial write? > > Mmh, sounds like you want to play with something like below? > > Andrea > > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat > index 6b52f365d73ac..20c3af4511255 100644 > --- a/tools/memory-model/lock.cat > +++ b/tools/memory-model/lock.cat > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock > > (* Allow up to one unmatched LKW per location; more must deadlock *) > let UNMATCHED-LKW = LKW \ domain(critical) > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks > > (* rfi for LF events: link each LKW to the LF events in its critical section *) > let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc) > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru > let rf = rf | rf-lf | rf-ru > > (* Generate all co relations, including LKW events but not UL *) > -let co0 = co0 | ([IW] ; loc ; [LKW]) | > - (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW]) > +let co0 = co0 | ([IW] ; loc ; [LKW]) > include "cos-opt.cat" > let W = W | UL > let M = R | W No idea. But the following litmus test gets no executions whatsoever, so point taken about my missing at least one corner case. ;-) Adding a spin_unlock() to the end of either process allows both to run. One could argue that this is a bug, but one could equally well argue that if you have a deadlock, you have a deadlock. Thoughts? Thanx, Paul ------------------------------------------------------------------------ C lock { } P0(int *a, int *b, spinlock_t *l) { spin_lock(l); WRITE_ONCE(*a, 1); } P1(int *a, int *b, spinlock_t *l) { spin_lock(l); WRITE_ONCE(*b, 1); } exists (a=1 /\ b=1)
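For instance, the variant Paul describes, with the unlocks added (shown here on both processes; the test name is an illustrative choice), lets both processes run and makes the exists clause reachable:

C lock-unlocked

{
}

P0(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*a, 1);
	spin_unlock(l);
}

P1(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*b, 1);
	spin_unlock(l);
}

exists (a=1 /\ b=1)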
On Sun, Jan 29, 2023 at 08:21:56AM -0800, Paul E. McKenney wrote: > On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote: > > On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote: > > > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote: > > > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote: > > > > > > Evidently the plain-coherence check rules out x=1 at the > > > > > > end, because when I relax that check, x=1 becomes a possible result. > > > > > > Furthermore, the graphical output confirms that this execution has a > > > > > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 > > > > > > to Wx=2! How can this be possible? It seems like a bug in herd7. > > > > > > > > > > By default, herd7 performs some edges removal when generating the > > > > > graphical outputs. The option -showraw can be useful to increase > > > > > the "verbosity", for example, > > > > > > > > > > [with "exists (x=2)", output in /tmp/T.dot] > > > > > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis > > > > > > > > Okay, thanks, that helps a lot. > > > > > > > > So here's what we've got. The litmus test: > > > > > > > > > > > > C hb-and-int > > > > {} > > > > > > > > P0(int *x, int *y) > > > > { > > > > *x = 1; > > > > smp_store_release(y, 1); > > > > } > > > > > > > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) > > > > { > > > > spin_lock(l); > > > > int r1 = READ_ONCE(*dy); > > > > if (r1==1) > > > > spin_unlock(l); > > > > > > > > int r0 = smp_load_acquire(y); > > > > if (r0 == 1) { > > > > WRITE_ONCE(*dx,1); > > > > } > > > > > > The lack of a spin_unlock() when r1!=1 is intentional? > > > > I assume so. > > > > > It is admittedly a cute way to prevent P3 from doing anything > > > when r1!=1. And P1 won't do anything if P3 runs first. > > > > Right. > > > > > > } > > > > > > > > P2(int *dx, int *dy) > > > > { > > > > WRITE_ONCE(*dy,READ_ONCE(*dx)); > > > > } > > > > > > > > > > > > P3(int *x, spinlock_t *l) > > > > { > > > > spin_lock(l); > > > > smp_mb__after_unlock_lock(); > > > > *x = 2; > > > > } > > > > > > > > exists (x=2) > > > > > > > > > > > > The reason why Wx=1 ->ww-vis Wx=2: > > > > > > > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. > > > > > > > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and > > > > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. > > > > > > > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 > > > > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 > > > > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. > > > > > > > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis. > > > > > > > > This explains why the memory model says there isn't a data race. This > > > > doesn't use the smp_mb__after_unlock_lock at all. > > > > > > You lost me on this one. > > > > > > Suppose that P3 starts first, then P0. P1 is then stuck at the > > > spin_lock() because P3 does not release that lock. P2 goes out for a > > > pizza. > > > > That wouldn't be a valid execution. One of the rules in lock.cat says > > that a spin_lock() call must read from a spin_unlock() or from an > > initial write, which rules out executions in which P3 acquires the lock > > first. > > OK, I will bite... > > Why can't P3's spin_lock() read from that initial write? > > > > Why can't the two stores to x by P0 and P3 conflict, resulting in a > > > data race? 
> > > > That can't happen in executions where P1 acquires the lock first for the > > reason outlined above (P0's store to x propagates to P3 before P3 writes > > to x). And there are no other executions -- basically, herd7 ignores > > deadlock scenarios. > > True enough, if P1 gets there first, then P3 never stores to x. > > What I don't understand is why P1 must always get there first. Ah, is the rule that all processes must complete? If so, then the only way all processes can complete is if P1 loads 1 from dy, thus releasing the lock. But that dy=1 load can only happen after P2 has copied dx to dy, and has stored a 1. Which in turn only happens if P1's write to dx is ordered before the lock release. Which only executes if P1's load from y returned 1, which only happens after P0 stored 1 to y. Which means that P3 gets the lock only after P0 completes its plain write to x, so there is no data race. The reason that P3 cannot go first is that this will prevent P1 from completing, in turn preventing herd7 from counting that execution. Or am I still missing a turn in there somewhere? Thanx, Paul
On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote: > On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote: > > > Why can't P3's spin_lock() read from that initial write? > > > > Mmh, sounds like you want to play with something like below? > > > > Andrea > > > > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat > > index 6b52f365d73ac..20c3af4511255 100644 > > --- a/tools/memory-model/lock.cat > > +++ b/tools/memory-model/lock.cat > > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock > > > > (* Allow up to one unmatched LKW per location; more must deadlock *) > > let UNMATCHED-LKW = LKW \ domain(critical) > > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks > > > > (* rfi for LF events: link each LKW to the LF events in its critical section *) > > let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc) > > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru > > let rf = rf | rf-lf | rf-ru > > > > (* Generate all co relations, including LKW events but not UL *) > > -let co0 = co0 | ([IW] ; loc ; [LKW]) | > > - (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW]) > > +let co0 = co0 | ([IW] ; loc ; [LKW]) > > include "cos-opt.cat" > > let W = W | UL > > let M = R | W > > No idea. But the following litmus test gets no executions whatsoever, > so point taken about my missing at least one corner case. ;-) > > Adding a spin_unlock() to the end of either process allows both to > run. > > One could argue that this is a bug, but one could equally well argue > that if you have a deadlock, you have a deadlock. > in lock.cat: (* Allow up to one unmatched LKW per location; more must deadlock *) let UNMATCHED-LKW = LKW \ domain(critical) empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks we rule out deadlocks from the execution candidates we care about. Regards, Boqun > Thoughts? > > Thanx, Paul > > ------------------------------------------------------------------------ > > C lock > > { > } > > > P0(int *a, int *b, spinlock_t *l) > { > spin_lock(l); > WRITE_ONCE(*a, 1); > } > > P1(int *a, int *b, spinlock_t *l) > { > spin_lock(l); > WRITE_ONCE(*b, 1); > } > > exists (a=1 /\ b=1)
On Sun, Jan 29, 2023 at 06:11:12PM +0100, Andrea Parri wrote: > > The reason why Wx=1 ->ww-vis Wx=2: > > > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. > > > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and > > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. > > > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 > > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 > > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. > > > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis. > > To clarify, po-unlock-lock-po is not a subrelation of mb; see what > happens without the smp_mb__after_unlock_lock(). Ah, thank you again. That was what I got wrong, and it explains why the data race appears with Jonas's patch. This also points out an important unstated fact: All of the inter-CPU extended fences in the memory model are A-cumulative. In this case the fence links Rdy=1 on P1 to Wx=3 on P3. We know that 0:Wx=1 is visible to P1 before the Rdy executes, but what we need is that it is visible to P3 before Wx=3 executes. The fact that the extended fence is strong (and therefore A-cumulative) provides this extra guarantee. Alan
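For reference, the mb disjunct under discussion (as it appears in the diff quoted earlier) and the definition of po-unlock-lock-po it builds on read roughly as follows:

let po-unlock-lock-po = po ; [UL] ; (po|rf) ; [LKR] ; po

(* the relevant disjunct of mb *)
	... | ([M] ; po-unlock-lock-po ; [After-unlock-lock] ; po ; [M])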
Hi all, apologies on the confusion about the litmus test. I should have explained it better but it seems you mostly figured it out. As Alan said I'm tricking a little bit by not unlocking in certain places to filter out all executions that aren't what I'm looking for. I didn't have much time when I sent it (hence also the lack of explanation and why I haven't responded earlier), so I didn't have time to play around with the filter feature to do this the "proper"/non-cute way. As such it really isn't about deadlocks. I think one question is whether the distinction between the models could be reproduced without using any kind of filtering at all. I have a feeling it should be possible but I haven't had time to think up a litmus test that does that. On 1/28/2023 11:59 PM, Alan Stern wrote: > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote: >>> Evidently the plain-coherence check rules out x=1 at the >>> end, because when I relax that check, x=1 becomes a possible result. >>> Furthermore, the graphical output confirms that this execution has a >>> ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1 >>> to Wx=2! How can this be possible? It seems like a bug in herd7. >> By default, herd7 performs some edges removal when generating the >> graphical outputs. The option -showraw can be useful to increase >> the "verbosity", for example, >> >> [with "exists (x=2)", output in /tmp/T.dot] >> $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis > Okay, thanks, that helps a lot. > > So here's what we've got. The litmus test: > > > C hb-and-int > {} > > P0(int *x, int *y) > { > *x = 1; > smp_store_release(y, 1); > } > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l) > { > spin_lock(l); > int r1 = READ_ONCE(*dy); > if (r1==1) > spin_unlock(l); > > int r0 = smp_load_acquire(y); > if (r0 == 1) { > WRITE_ONCE(*dx,1); > } > } > > P2(int *dx, int *dy) > { > WRITE_ONCE(*dy,READ_ONCE(*dx)); > } > > > P3(int *x, spinlock_t *l) > { > spin_lock(l); > smp_mb__after_unlock_lock(); > *x = 2; > } > > exists (x=2) > > > The reason why Wx=1 ->ww-vis Wx=2: > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded. > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis. > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2 > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2 > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded. > > [...] > > This explains why the memory model says there isn't a data race. This > doesn't use the smp_mb__after_unlock_lock at all. Note as Andrea said that po-unlock-lock-po is generally not in mb (and also not otherwise in fence). Only po-unlock-lock-po;[After-unlock-lock];po is in mb (resp. ...&int in fence). While the example uses smp_mb__after_unlock_lock, the anomaly should generally cover any extended fences. [Snipping in a part of an earlier e-mail you sent] > I think that herd7_should_ say there is a data race. > > On the other hand, I also think that the operational model says there > isn't. This is a case where the formal model fails to match the > operational model. > > The operational model says that if W is a release-store on CPU C and W' > is another store which propagates to C before W executes, then W' > propagates to every CPU before W does. (In other words, releases are > A-cumulative). But the formal model enforces this rule only in cases > the event reading W' on C is po-before W. 
> > In your litmus test, the event reading y=1 on P1 is po-after the > spin_unlock (which is a release). Nevertheless, any feasible execution > requires that P1 must execute Ry=1 before the unlock. So the > operational model says that y=1 must propagate to P3 before the unlock > does, i.e., before P3 executes the spin_lock(). But the formal model > doesn't require this. I see now. Somehow I thought stores must execute in program order, but I guess it doesn't make sense. In that sense, W ->xbstar&int X always means W propagates to X's CPU before X executes. > Ideally we would fix this by changing the definition of po-rel to: > > [M] ; (xbstar & int) ; [Release] > > (This is closely related to the use of (xbstar & int) in the definition > of vis that you asked about.) This misses the property of release stores that any po-earlier store must also execute before the release store. Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ; [Release] but then one could instead move this into the definition of cumul-fence. In fact you'd probably want this for all the propagation fences, so cumul-fence and pb should be the right place. > Unfortunately we can't do this, because > po-rel has to be defined long before xbstar. You could do it, by turning the relation into one massive recursive definition. Thinking about what the options are: 1) accept the difference and run with it by making it consistent inside the axiomatic model 2) fix it through the recursive definition, which seems to be quite ugly but also consistent with the power operational model as far as I can tell 3) weaken the operational model... somehow 4) just ignore the anomaly 5) ??? Currently my least favorite option is 4) since it seems a bit off that the reasoning applies in one specific case of LKMM, more specifically the data race definition which should be equivalent to "the order of the two races isn't fixed", but here the order isn't fixed but it's a data race. I think the patch happens to almost do 1) because the xbstar&int at the end should already imply ordering through the prop&int <= hb rule. What would remain is to also exclude rcu-fence somehow. 2) seems the most correct choice but also to complicate LKMM a lot. Seems like being between a rock and hard place. jonas PS: >> The other cases of *-pre-bounded seem to work out: they all link stuff via >> xbstar to the instruction that is linked via po-unlock-lock-po ; >> [After-unlock-lock] ; po to the potentially racy access, and >> po-unlock-lock-po;po is xbstar ; acq-po, which allows closing the gap. > I could not follow your arguments at all; the writing was too confusing. Sorry, I collapsed some cases. I'll show an example. I think all the other cases are the same. Let's pick an edge that links two events with ww-vis through w-post-bounded ; vis ; w-pre-bounded where the vis comes from cumul-fence* ; rfe? ; [Marked] ; (strong-fence ; [Marked] ; xbstar) <= vis and the w-pre-bounded came from po-unlock-lock-po;[After-unlock-lock];po but not po-unlock-lock-po & int. Note that such po-unlock-lock-po;[After-unlock-lock];po must come from po-rel ; rfe ; acq-po So we have w-post-bounded ; cumul-fence* ; rfe? ; [Marked] ; strong-fence ; [Marked] ; xbstar ; po-rel ; rfe ; acq-po <= w-post-bounded ; cumul-fence* ; rfe? ; [Marked] ; strong-fence ; [Marked] ; xbstar ; hb ; hb ; acq-po <= w-post-bounded ; cumul-fence* ; rfe? ; [Marked] ; strong-fence ; [Marked] ; xbstar ; fence <= ww-vis All the other cases where w-pre-bounded are used have the shape ... 
; xbstar ; w-pre-bounded which can all be handled in the same manner.
On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote: > On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote: > > On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote: > > > > Why can't P3's spin_lock() read from that initial write? > > > > > > Mmh, sounds like you want to play with something like below? > > > > > > Andrea > > > > > > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat > > > index 6b52f365d73ac..20c3af4511255 100644 > > > --- a/tools/memory-model/lock.cat > > > +++ b/tools/memory-model/lock.cat > > > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock > > > > > > (* Allow up to one unmatched LKW per location; more must deadlock *) > > > let UNMATCHED-LKW = LKW \ domain(critical) > > > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks > > > > > > (* rfi for LF events: link each LKW to the LF events in its critical section *) > > > let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc) > > > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru > > > let rf = rf | rf-lf | rf-ru > > > > > > (* Generate all co relations, including LKW events but not UL *) > > > -let co0 = co0 | ([IW] ; loc ; [LKW]) | > > > - (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW]) > > > +let co0 = co0 | ([IW] ; loc ; [LKW]) > > > include "cos-opt.cat" > > > let W = W | UL > > > let M = R | W > > > > No idea. But the following litmus test gets no executions whatsoever, > > so point taken about my missing at least one corner case. ;-) > > > > Adding a spin_unlock() to the end of either process allows both to > > run. > > > > One could argue that this is a bug, but one could equally well argue > > that if you have a deadlock, you have a deadlock. > > > > in lock.cat: > > (* Allow up to one unmatched LKW per location; more must deadlock *) > let UNMATCHED-LKW = LKW \ domain(critical) > empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks > > we rule out deadlocks from the execution candidates we care about. Thank you, Boqun! Thanx, Paul > Regards, > Boqun > > > Thoughts? > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > C lock > > > > { > > } > > > > > > P0(int *a, int *b, spinlock_t *l) > > { > > spin_lock(l); > > WRITE_ONCE(*a, 1); > > } > > > > P1(int *a, int *b, spinlock_t *l) > > { > > spin_lock(l); > > WRITE_ONCE(*b, 1); > > } > > > > exists (a=1 /\ b=1)
On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote: > On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote: > > in lock.cat: > > > > (* Allow up to one unmatched LKW per location; more must deadlock *) > > let UNMATCHED-LKW = LKW \ domain(critical) > > empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks > > > > we rule out deadlocks from the execution candidates we care about. > > Thank you, Boqun! Actually that's only part of it. The other part is rather obscure: (* Generate all co relations, including LKW events but not UL *) let co0 = co0 | ([IW] ; loc ; [LKW]) | (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW]) Implicitly this says that any lock with no corresponding unlock must come last in the coherence order, which implies the unmatched-locks rule (since only one lock event can be last). By itself, the unmatched-locks rule would not prevent P3 from executing before P1, provided P1 executes both its lock and unlock. Alan
On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote: > I see now. Somehow I thought stores must execute in program order, but I > guess it doesn't make sense. > In that sense, W ->xbstar&int X always means W propagates to X's CPU before > X executes. It also means any write that propagates to W's CPU before W executes also propagates to X's CPU before X executes (because it's the same CPU and W executes before X). > > Ideally we would fix this by changing the definition of po-rel to: > > > > [M] ; (xbstar & int) ; [Release] > > > > (This is closely related to the use of (xbstar & int) in the definition > > of vis that you asked about.) > > This misses the property of release stores that any po-earlier store must > also execute before the release store. I should have written: [M] ; (po | (xbstar & int)) ; [Release] > Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ; > [Release] but then one could instead move this into the definition of > cumul-fence. > In fact you'd probably want this for all the propagation fences, so > cumul-fence and pb should be the right place. > > > Unfortunately we can't do this, because > > po-rel has to be defined long before xbstar. > > You could do it, by turning the relation into one massive recursive > definition. Which would make pretty much the entire memory model one big recursion. I do not want to do that. > Thinking about what the options are: > 1) accept the difference and run with it by making it consistent inside the > axiomatic model > 2) fix it through the recursive definition, which seems to be quite ugly but > also consistent with the power operational model as far as I can tell > 3) weaken the operational model... somehow > 4) just ignore the anomaly > 5) ??? > > Currently my least favorite option is 4) since it seems a bit off that the > reasoning applies in one specific case of LKMM, more specifically the data > race definition which should be equivalent to "the order of the two races > isn't fixed", but here the order isn't fixed but it's a data race. > I think the patch happens to almost do 1) because the xbstar&int at the end > should already imply ordering through the prop&int <= hb rule. > What would remain is to also exclude rcu-fence somehow. IMO 1) is the best choice. Alan PS: For the record, here's a simpler litmus test to illustrates the failing. The idea is that Wz=1 is reordered before the store-release, so it ought to propagate before Wy=1. The LKMM does not require this. C before-release {} P0(int *x, int *y, int *z) { int r1; r1 = READ_ONCE(*x); smp_store_release(y, 1); WRITE_ONCE(*z, 1); } P1(int *x, int *y, int *z) { int r2; r2 = READ_ONCE(*z); WRITE_ONCE(*x, r2); } P2(int *x, int *y, int *z) { int r3; int r4; r3 = READ_ONCE(*y); smp_rmb(); r4 = READ_ONCE(*z); } exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
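For reference, po-rel is currently defined in linux-kernel.cat roughly as the first line below; the hypothetical variant Alan describes would be the second, which cannot actually be written at that point in the file because xbstar is only defined much later:

let po-rel = [M] ; po ; [Release]

(* hypothetical variant from the discussion; not expressible this early in the model *)
let po-rel = [M] ; (po | (xbstar & int)) ; [Release]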
On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote: > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote: > > I see now. Somehow I thought stores must execute in program order, but I > > guess it doesn't make sense. > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before > > X executes. > > It also means any write that propagates to W's CPU before W executes > also propagates to X's CPU before X executes (because it's the same CPU > and W executes before X). > > > > Ideally we would fix this by changing the definition of po-rel to: > > > > > > [M] ; (xbstar & int) ; [Release] > > > > > > (This is closely related to the use of (xbstar & int) in the definition > > > of vis that you asked about.) > > > > This misses the property of release stores that any po-earlier store must > > also execute before the release store. > > I should have written: > > [M] ; (po | (xbstar & int)) ; [Release] > > > Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ; > > [Release] but then one could instead move this into the definition of > > cumul-fence. > > In fact you'd probably want this for all the propagation fences, so > > cumul-fence and pb should be the right place. > > > > > Unfortunately we can't do this, because > > > po-rel has to be defined long before xbstar. > > > > You could do it, by turning the relation into one massive recursive > > definition. > > Which would make pretty much the entire memory model one big recursion. > I do not want to do that. > > > Thinking about what the options are: > > 1) accept the difference and run with it by making it consistent inside the > > axiomatic model > > 2) fix it through the recursive definition, which seems to be quite ugly but > > also consistent with the power operational model as far as I can tell > > 3) weaken the operational model... somehow > > 4) just ignore the anomaly > > 5) ??? > > > > Currently my least favorite option is 4) since it seems a bit off that the > > reasoning applies in one specific case of LKMM, more specifically the data > > race definition which should be equivalent to "the order of the two races > > isn't fixed", but here the order isn't fixed but it's a data race. > > I think the patch happens to almost do 1) because the xbstar&int at the end > > should already imply ordering through the prop&int <= hb rule. > > What would remain is to also exclude rcu-fence somehow. > > IMO 1) is the best choice. > > Alan > > PS: For the record, here's a simpler litmus test to illustrates the > failing. The idea is that Wz=1 is reordered before the store-release, > so it ought to propagate before Wy=1. The LKMM does not require this. In PowerPC terms, would this be like having the Wz=1 being reorders before the Wy=1, but not before the lwsync instruction preceding the Wy=1 that made it be a release store? If so, we might have to keep this quirk. Thanx, Paul > C before-release > > {} > > P0(int *x, int *y, int *z) > { > int r1; > > r1 = READ_ONCE(*x); > smp_store_release(y, 1); > WRITE_ONCE(*z, 1); > } > > P1(int *x, int *y, int *z) > { > int r2; > > r2 = READ_ONCE(*z); > WRITE_ONCE(*x, r2); > } > > P2(int *x, int *y, int *z) > { > int r3; > int r4; > > r3 = READ_ONCE(*y); > smp_rmb(); > r4 = READ_ONCE(*z); > } > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
On Sun, Jan 29, 2023 at 09:18:24PM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> > > in lock.cat:
> > >
> > > 	(* Allow up to one unmatched LKW per location; more must deadlock *)
> > > 	let UNMATCHED-LKW = LKW \ domain(critical)
> > > 	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > >
> > > we rule out deadlocks from the execution candidates we care about.
> >
> > Thank you, Boqun!
>
> Actually that's only part of it.  The other part is rather obscure:
>
> 	(* Generate all co relations, including LKW events but not UL *)
> 	let co0 = co0 | ([IW] ; loc ; [LKW]) |
> 		(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
>
> Implicitly this says that any lock with no corresponding unlock must come last in the coherence order, which implies the unmatched-locks rule (since only one lock event can be last).  By itself, the unmatched-locks rule would not prevent P3 from executing before P1, provided P1 executes both its lock and unlock.

And thank you, Alan, as well!

And RCU looks to operate in a similar manner:

------------------------------------------------------------------------

C rcudeadlock

{
}

P0(int *a, int *b)
{
	rcu_read_lock();
	WRITE_ONCE(*a, 1);
	synchronize_rcu();
	WRITE_ONCE(*b, 1);
	rcu_read_unlock();
}

P1(int *a, int *b)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*b);
	smp_mb();
	r2 = READ_ONCE(*a);
}

exists (1:r1=1 /\ 1:r2=0)

------------------------------------------------------------------------

$ herd7 -conf linux-kernel.cfg rcudeadlock.litmus
Test rcudeadlock Allowed
States 0
No
Witnesses
Positive: 0 Negative: 0
Condition exists (1:r1=1 /\ 1:r2=0)
Observation rcudeadlock Never 0 0
Time rcudeadlock 0.00
Hash=4f7f336ad39d724d93b089133b00d1e2

------------------------------------------------------------------------

So good enough!  ;-)

							Thanx, Paul
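For comparison, a spinlock sketch of the same idea (my own example, not from the thread; the test name is made up): both processes take the lock and neither releases it, so every execution candidate has two unmatched LKW events on the same location, the unmatched-locks check quoted above should discard them all, and herd7 should report zero states, just as for rcudeadlock.

C spinlock-deadlock

{}

P0(spinlock_t *s, int *x)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
}

P1(spinlock_t *s, int *x)
{
	spin_lock(s);
	WRITE_ONCE(*x, 2);
}

exists (x=2)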
On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
>
> Hi all, apologies on the confusion about the litmus test.
> I should have explained it better but it seems you mostly figured it out.
> As Alan said I'm tricking a little bit by not unlocking in certain places to filter out all executions that aren't what I'm looking for.
> I didn't have much time when I sent it (hence also the lack of explanation and why I haven't responded earlier), so I didn't have time to play around with the filter feature to do this the "proper"/non-cute way.
> As such it really isn't about deadlocks.

Not a problem!

> I think one question is whether the distinction between the models could be reproduced without using any kind of filtering at all.
> I have a feeling it should be possible but I haven't had time to think up a litmus test that does that.

Here is an example litmus test using filter, if this helps.  You put it right before the "exists" clause, and its condition uses the same syntax as the "exists" clause.  If an execution does not satisfy the condition in the filter clause, it is tossed.

							Thanx, Paul

------------------------------------------------------------------------

C C-srcu-nest-6

(*
 * Result: Never
 *
 * This would be valid for srcu_down_read() and srcu_up_read().
 *)

{}

P0(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
	int r2;
	int r3;

	r3 = srcu_down_read(s1);
	WRITE_ONCE(*idx, r3);
	r2 = READ_ONCE(*y);
	smp_store_release(f, 1);
}

P1(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
	int r1;
	int r3;
	int r4;

	r4 = smp_load_acquire(f);
	r1 = READ_ONCE(*x);
	r3 = READ_ONCE(*idx);
	srcu_up_read(s1, r3);
}

P2(int *x, int *y, struct srcu_struct *s1)
{
	WRITE_ONCE(*y, 1);
	synchronize_srcu(s1);
	WRITE_ONCE(*x, 1);
}

locations [0:r1]
filter (1:r4=1)
exists (1:r1=1 /\ 0:r2=0)
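An even smaller sketch of the filter clause (my own example, not from the thread; the test name is made up): the filter discards every execution in which P1's acquire load does not read the release store, and in the remaining executions the LKMM forbids the stale read of *x, so the exists clause should never be satisfied.

C filter-minimal

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(y, 1);
}

P1(int *x, int *y)
{
	int r1;
	int r2;

	r1 = smp_load_acquire(y);
	r2 = READ_ONCE(*x);
}

filter (1:r1=1)
exists (1:r2=0)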
On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > I see now. Somehow I thought stores must execute in program order, but I guess it doesn't make sense.
> > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before X executes.
> >
> > It also means any write that propagates to W's CPU before W executes also propagates to X's CPU before X executes (because it's the same CPU and W executes before X).
> >
> > > > Ideally we would fix this by changing the definition of po-rel to:
> > > >
> > > > 	[M] ; (xbstar & int) ; [Release]
> > > >
> > > > (This is closely related to the use of (xbstar & int) in the definition of vis that you asked about.)
> > >
> > > This misses the property of release stores that any po-earlier store must also execute before the release store.
> >
> > I should have written:
> >
> > 	[M] ; (po | (xbstar & int)) ; [Release]
> >
> > > Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ; [Release] but then one could instead move this into the definition of cumul-fence.
> > > In fact you'd probably want this for all the propagation fences, so cumul-fence and pb should be the right place.
> > >
> > > > Unfortunately we can't do this, because po-rel has to be defined long before xbstar.
> > >
> > > You could do it, by turning the relation into one massive recursive definition.
> >
> > Which would make pretty much the entire memory model one big recursion.  I do not want to do that.
> >
> > > Thinking about what the options are:
> > > 1) accept the difference and run with it by making it consistent inside the axiomatic model
> > > 2) fix it through the recursive definition, which seems to be quite ugly but also consistent with the power operational model as far as I can tell
> > > 3) weaken the operational model... somehow
> > > 4) just ignore the anomaly
> > > 5) ???
> > >
> > > Currently my least favorite option is 4) since it seems a bit off that the reasoning applies in one specific case of LKMM, more specifically the data race definition which should be equivalent to "the order of the two races isn't fixed", but here the order isn't fixed but it's a data race.
> > > I think the patch happens to almost do 1) because the xbstar&int at the end should already imply ordering through the prop&int <= hb rule.
> > > What would remain is to also exclude rcu-fence somehow.
> >
> > IMO 1) is the best choice.
> >
> > Alan
> >
> > PS: For the record, here's a simpler litmus test to illustrate the failure.  The idea is that Wz=1 is reordered before the store-release, so it ought to propagate before Wy=1.  The LKMM does not require this.
>
> In PowerPC terms, would this be like having the Wz=1 being reordered before the Wy=1, but not before the lwsync instruction preceding the Wy=1 that made it be a release store?

No, it would be like having the Wz=1 reordered before the Rx=1, therefore before the lwsync.  Obviously this can't ever happen on PowerPC.

Alan

> If so, we might have to keep this quirk.
>
> 							Thanx, Paul
>
> > C before-release
> >
> > {}
> >
> > P0(int *x, int *y, int *z)
> > {
> > 	int r1;
> >
> > 	r1 = READ_ONCE(*x);
> > 	smp_store_release(y, 1);
> > 	WRITE_ONCE(*z, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > 	int r2;
> >
> > 	r2 = READ_ONCE(*z);
> > 	WRITE_ONCE(*x, r2);
> > }
> >
> > P2(int *x, int *y, int *z)
> > {
> > 	int r3;
> > 	int r4;
> >
> > 	r3 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r4 = READ_ONCE(*z);
> > }
> >
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
On Mon, Jan 30, 2023 at 11:47:50AM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > > I see now. Somehow I thought stores must execute in program order, but I guess it doesn't make sense.
> > > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before X executes.
> > >
> > > It also means any write that propagates to W's CPU before W executes also propagates to X's CPU before X executes (because it's the same CPU and W executes before X).
> > >
> > > > > Ideally we would fix this by changing the definition of po-rel to:
> > > > >
> > > > > 	[M] ; (xbstar & int) ; [Release]
> > > > >
> > > > > (This is closely related to the use of (xbstar & int) in the definition of vis that you asked about.)
> > > >
> > > > This misses the property of release stores that any po-earlier store must also execute before the release store.
> > >
> > > I should have written:
> > >
> > > 	[M] ; (po | (xbstar & int)) ; [Release]
> > >
> > > > Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ; [Release] but then one could instead move this into the definition of cumul-fence.
> > > > In fact you'd probably want this for all the propagation fences, so cumul-fence and pb should be the right place.
> > > >
> > > > > Unfortunately we can't do this, because po-rel has to be defined long before xbstar.
> > > >
> > > > You could do it, by turning the relation into one massive recursive definition.
> > >
> > > Which would make pretty much the entire memory model one big recursion.  I do not want to do that.
> > >
> > > > Thinking about what the options are:
> > > > 1) accept the difference and run with it by making it consistent inside the axiomatic model
> > > > 2) fix it through the recursive definition, which seems to be quite ugly but also consistent with the power operational model as far as I can tell
> > > > 3) weaken the operational model... somehow
> > > > 4) just ignore the anomaly
> > > > 5) ???
> > > >
> > > > Currently my least favorite option is 4) since it seems a bit off that the reasoning applies in one specific case of LKMM, more specifically the data race definition which should be equivalent to "the order of the two races isn't fixed", but here the order isn't fixed but it's a data race.
> > > > I think the patch happens to almost do 1) because the xbstar&int at the end should already imply ordering through the prop&int <= hb rule.
> > > > What would remain is to also exclude rcu-fence somehow.
> > >
> > > IMO 1) is the best choice.
> > >
> > > Alan
> > >
> > > PS: For the record, here's a simpler litmus test to illustrate the failure.  The idea is that Wz=1 is reordered before the store-release, so it ought to propagate before Wy=1.  The LKMM does not require this.
> >
> > In PowerPC terms, would this be like having the Wz=1 being reordered before the Wy=1, but not before the lwsync instruction preceding the Wy=1 that made it be a release store?
>
> No, it would be like having the Wz=1 reordered before the Rx=1, therefore before the lwsync.  Obviously this can't ever happen on PowerPC.

Whew!!!  ;-)

							Thanx, Paul

> Alan
>
> > If so, we might have to keep this quirk.
> >
> > 							Thanx, Paul
> >
> > > C before-release
> > >
> > > {}
> > >
> > > P0(int *x, int *y, int *z)
> > > {
> > > 	int r1;
> > >
> > > 	r1 = READ_ONCE(*x);
> > > 	smp_store_release(y, 1);
> > > 	WRITE_ONCE(*z, 1);
> > > }
> > >
> > > P1(int *x, int *y, int *z)
> > > {
> > > 	int r2;
> > >
> > > 	r2 = READ_ONCE(*z);
> > > 	WRITE_ONCE(*x, r2);
> > > }
> > >
> > > P2(int *x, int *y, int *z)
> > > {
> > > 	int r3;
> > > 	int r4;
> > >
> > > 	r3 = READ_ONCE(*y);
> > > 	smp_rmb();
> > > 	r4 = READ_ONCE(*z);
> > > }
> > >
> > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
On 1/30/2023 3:39 AM, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
>> You could do it, by turning the relation into one massive recursive definition.
> Which would make pretty much the entire memory model one big recursion.
> I do not want to do that.

Neither do I :D

>> Thinking about what the options are:
>> 1) accept the difference and run with it by making it consistent inside the axiomatic model
>> 2) fix it through the recursive definition, which seems to be quite ugly but also consistent with the power operational model as far as I can tell
>> 3) weaken the operational model... somehow
>> 4) just ignore the anomaly
>> 5) ???
>>
>> Currently my least favorite option is 4) since it seems a bit off that the reasoning applies in one specific case of LKMM, more specifically the data race definition which should be equivalent to "the order of the two races isn't fixed", but here the order isn't fixed but it's a data race.
>> I think the patch happens to almost do 1) because the xbstar&int at the end should already imply ordering through the prop&int <= hb rule.
>> What would remain is to also exclude rcu-fence somehow.
> IMO 1) is the best choice.

I have some additional thoughts now. It seems that you could weaken the operational model by stating that an A-cumulative fence orders propagation of all *external* stores (in addition to all po-earlier stores) that propagated to you before the fence is executed.

It seems that on power, from an operational model perspective, there's currently no difference between propagation fences ordering all stores vs only external stores that propagated to the CPU before the fence is executed, because they only have bidirectional (*->W) fences (sync, lwsync) and not uni-directional (acquire, release), and so it is not possible for a store that is po-later than the barrier to be executed before the barrier; i.e., on power, every internal store that propagates to a CPU before the fence executes is also po-earlier than the fence.

If power did introduce release stores, I think you could potentially create implementations that allow the behavior in the example you have given, but I don't think they are the most natural ones:

> {}
>
> P0(int *x, int *y, int *z)
> {
> 	int r1;
>
> 	r1 = READ_ONCE(*x);
> 	smp_store_release(y, 1);
> 	WRITE_ONCE(*z, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> 	int r2;
>
> 	r2 = READ_ONCE(*z);
> 	WRITE_ONCE(*x, r2);
> }
>
> P2(int *x, int *y, int *z)
> {
> 	int r3;
> 	int r4;
>
> 	r3 = READ_ONCE(*y);
> 	smp_rmb();
> 	r4 = READ_ONCE(*z);
> }
>
> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

I could imagine that P0 posts both of its stores in a shared store buffer before reading *x, but marks the release store as "not ready".
Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads, and subsequently marks its release store as "ready".
Then the release store is sent to the cache, where P2 reads *y=1 and then *z=0.
Finally P0 sends its *z=1 store to the cache.

However, a perhaps more natural implementation would not post the release store to the store buffer until it is "ready", in which case the order in the store buffer would be *z=1 before *y=1, and in this case the release ordering would presumably work like your current operational model.

Nevertheless, perhaps this slightly weaker operational model isn't as absurd as it sounds. And I think many people wouldn't be shocked if the release store didn't provide ordering with *z=1.

Best wishes,
jonas
On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> I have some additional thoughts now. It seems that you could weaken the operational model by stating that an A-cumulative fence orders propagation of all *external* stores (in addition to all po-earlier stores) that propagated to you before the fence is executed.

How is that a weakening of the operational model?  It's what the operational model says right now.  From explanation.txt:

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store associated
	with the fence (e.g., the store part of an smp_store_release()).

	...

	When a fence instruction is executed on CPU C:

	...

		For each other CPU C', any store which propagates to C
		before a release fence is executed (including all
		po-earlier stores executed on C) is forced to propagate
		to C' before the store associated with the release
		fence does.

In theory, we could weaken the operational model by saying that pfences order propagation of stores from other CPUs only when those stores are read-from by instructions po-before the fence.  But I suspect that's not such a good idea.

> It seems that on power, from an operational model perspective, there's currently no difference between propagation fences ordering all stores vs only external stores that propagated to the CPU before the fence is executed, because they only have bidirectional (*->W) fences (sync, lwsync) and not uni-directional (acquire, release), and so it is not possible for a store that is po-later than the barrier to be executed before the barrier; i.e., on power, every internal store that propagates to a CPU before the fence executes is also po-earlier than the fence.
>
> If power did introduce release stores, I think you could potentially create implementations that allow the behavior in the example you have given, but I don't think they are the most natural ones:

Maybe so.  In any case, it's a moot point.  In fact, I don't know if any architecture supporting Linux allows a write that is po-after a release store to be reordered before the release store.

> > P0(int *x, int *y, int *z)
> > {
> > 	int r1;
> >
> > 	r1 = READ_ONCE(*x);
> > 	smp_store_release(y, 1);
> > 	WRITE_ONCE(*z, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > 	int r2;
> >
> > 	r2 = READ_ONCE(*z);
> > 	WRITE_ONCE(*x, r2);
> > }
> >
> > P2(int *x, int *y, int *z)
> > {
> > 	int r3;
> > 	int r4;
> >
> > 	r3 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r4 = READ_ONCE(*z);
> > }
> >
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
>
> I could imagine that P0 posts both of its stores in a shared store buffer before reading *x, but marks the release store as "not ready".
> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads, and subsequently marks its release store as "ready".

That isn't how release stores are meant to work.  The read of x is supposed to be complete before the release store becomes visible to any other CPU.  This is true even in C11.

> Then the release store is sent to the cache, where P2 reads *y=1 and then *z=0.
> Finally P0 sends its *z=1 store to the cache.
>
> However, a perhaps more natural implementation would not post the release store to the store buffer until it is "ready", in which case the order in the store buffer would be *z=1 before *y=1, and in this case the release ordering would presumably work like your current operational model.
>
> Nevertheless, perhaps this slightly weaker operational model isn't as absurd as it sounds. And I think many people wouldn't be shocked if the release store didn't provide ordering with *z=1.

This issue is one we should discuss with all the other people involved in maintaining the LKMM.

Alan
On 1/31/2023 4:06 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
>> I have some additional thoughts now. It seems that you could weaken the operational model by stating that an A-cumulative fence orders propagation of all *external* stores (in addition to all po-earlier stores) that propagated to you before the fence is executed.
> How is that a weakening of the operational model?  It's what the operational model says right now.

No, as in the part that you have quoted, it is stated that an A-cumulative fence orders propagation of *all* stores that propagated to you before the fence is executed.
I'm saying you could weaken this to only cover all *external* stores.

More precisely, I would change

> 	For each other CPU C', any store which propagates to C
> 	before a release fence is executed (including all
> 	po-earlier stores executed on C) is forced to propagate
> 	to C' before the store associated with the release
> 	fence does.

Into something like

	For each other CPU C', any *external* store which propagates to C
	before a release fence is executed, as well as any po-earlier
	store executed on C, is forced to propagate to C' before the
	store associated with the release fence does.

The difference is that po-later stores that happen to propagate to C before the release fence is executed would no longer be ordered.
That should be consistent with the axiomatic model.

> In theory, we could weaken the operational model by saying that pfences order propagation of stores from other CPUs only when those stores are read-from by instructions po-before the fence.  But I suspect that's not such a good idea.

That indeed looks too confusing.

>> It seems that on power, from an operational model perspective, there's currently no difference between propagation fences ordering all stores vs only external stores that propagated to the CPU before the fence is executed, because they only have bidirectional (*->W) fences (sync, lwsync) and not uni-directional (acquire, release), and so it is not possible for a store that is po-later than the barrier to be executed before the barrier; i.e., on power, every internal store that propagates to a CPU before the fence executes is also po-earlier than the fence.
>>
>> If power did introduce release stores, I think you could potentially create implementations that allow the behavior in the example you have given, but I don't think they are the most natural ones:
> Maybe so.  In any case, it's a moot point.  In fact, I don't know if any architecture supporting Linux allows a write that is po-after a release store to be reordered before the release store.

Arm and RISC-V do, but they are multi-copy-atomic anyways.

>>> P0(int *x, int *y, int *z)
>>> {
>>> 	int r1;
>>>
>>> 	r1 = READ_ONCE(*x);
>>> 	smp_store_release(y, 1);
>>> 	WRITE_ONCE(*z, 1);
>>> }
>>>
>>> P1(int *x, int *y, int *z)
>>> {
>>> 	int r2;
>>>
>>> 	r2 = READ_ONCE(*z);
>>> 	WRITE_ONCE(*x, r2);
>>> }
>>>
>>> P2(int *x, int *y, int *z)
>>> {
>>> 	int r3;
>>> 	int r4;
>>>
>>> 	r3 = READ_ONCE(*y);
>>> 	smp_rmb();
>>> 	r4 = READ_ONCE(*z);
>>> }
>>>
>>> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
>> I could imagine that P0 posts both of its stores in a shared store buffer before reading *x, but marks the release store as "not ready".
>> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads, and subsequently marks its release store as "ready".
> That isn't how release stores are meant to work.  The read of x is supposed to be complete before the release store becomes visible to any other CPU.

Note that the release store isn't observed until it becomes "ready", so it is really indistinguishable whether it had become visible to any other CPU.
Indeed stores that aren't marked "ready" would be ignored during forwarding, and not allowed to be pushed to the cache.

The reason this kind of implementation seems less natural to me is that such a "not ready" store would need to be pushed back in the buffer (if it is the head of the buffer and the cache is ready to take a store), stall the later stores, or be aborted until it becomes ready.
That just seems to create a lot of hassle for no discernible benefit.
A "not ready" store probably shouldn't be put into a store queue, even if the only reason it is not ready is that there are some otherwise unrelated reads that haven't completed yet.

> This is true even in C11.

Arguable... The following pseudo-code litmus test should demonstrate this:

P0 {
	int r = read_relaxed(&x);
	store_release(&y, 1);
}

P1 {
	int s = read_relaxed(&y);
	store_release(&x, 1);
}

In C11, it should be possible to read r==s==1.

>> Then the release store is sent to the cache, where P2 reads *y=1 and then *z=0.
>> Finally P0 sends its *z=1 store to the cache.
>>
>> However, a perhaps more natural implementation would not post the release store to the store buffer until it is "ready", in which case the order in the store buffer would be *z=1 before *y=1, and in this case the release ordering would presumably work like your current operational model.
>>
>> Nevertheless, perhaps this slightly weaker operational model isn't as absurd as it sounds. And I think many people wouldn't be shocked if the release store didn't provide ordering with *z=1.
> This issue is one we should discuss with all the other people involved in maintaining the LKMM.
>
> Alan

Sure.
Btw, how to proceed for your SRCU patch and this one?
Are you planning to make any changes? I think the version you have is ok if you don't think the patch is improved by anything I brought up.

Any additional concerns/changes for this patch?

Best wishes,
jonas
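For what it's worth, here is a C11 rendering of the pseudo-code above (an editorial sketch, not from the thread; the thread-function names and harness are illustrative).  Both loads are relaxed and both stores are releases, so no synchronizes-with edge arises and the C11 model permits the outcome r == s == 1, even though it is rarely observed in practice.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static atomic_int x, y;		/* both start at 0 */
static int r, s;

static int t0(void *arg)
{
	(void)arg;
	/* P0: relaxed load of x, then release store to y */
	r = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, 1, memory_order_release);
	return 0;
}

static int t1(void *arg)
{
	(void)arg;
	/* P1: relaxed load of y, then release store to x */
	s = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, 1, memory_order_release);
	return 0;
}

int main(void)
{
	thrd_t a, b;

	thrd_create(&a, t0, NULL);
	thrd_create(&b, t1, NULL);
	thrd_join(a, NULL);
	thrd_join(b, NULL);
	/* r == 1 && s == 1 is allowed by the C11 memory model */
	printf("r=%d s=%d\n", r, s);
	return 0;
}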
On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
>
> On 1/31/2023 4:06 PM, Alan Stern wrote:
> > On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> > > I have some additional thoughts now. It seems that you could weaken the operational model by stating that an A-cumulative fence orders propagation of all *external* stores (in addition to all po-earlier stores) that propagated to you before the fence is executed.
> > How is that a weakening of the operational model?  It's what the operational model says right now.
>
> No, as in the part that you have quoted, it is stated that an A-cumulative fence orders propagation of *all* stores that propagated to you before the fence is executed.
> I'm saying you could weaken this to only cover all *external* stores.

Okay, now I understand.

> More precisely, I would change
>
> > 	For each other CPU C', any store which propagates to C
> > 	before a release fence is executed (including all
> > 	po-earlier stores executed on C) is forced to propagate
> > 	to C' before the store associated with the release
> > 	fence does.
>
> Into something like
>
> 	For each other CPU C', any *external* store which propagates to C
> 	before a release fence is executed, as well as any po-earlier
> 	store executed on C, is forced to propagate to C' before the
> 	store associated with the release fence does.
>
> The difference is that po-later stores that happen to propagate to C before the release fence is executed would no longer be ordered.
> That should be consistent with the axiomatic model.

I had to check that it wouldn't affect the (xbstar & int) part of vis, but it looks all right.  This seems like a reasonable change.

However, it only fixes part of the problem.  Suppose an external write is read by an instruction po-after the release-store, but the read executes before the release-store.  The operational model would still say the external write has to obey the propagation ordering, whereas the formal model doesn't require it.

> > > > P0(int *x, int *y, int *z)
> > > > {
> > > > 	int r1;
> > > >
> > > > 	r1 = READ_ONCE(*x);
> > > > 	smp_store_release(y, 1);
> > > > 	WRITE_ONCE(*z, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y, int *z)
> > > > {
> > > > 	int r2;
> > > >
> > > > 	r2 = READ_ONCE(*z);
> > > > 	WRITE_ONCE(*x, r2);
> > > > }
> > > >
> > > > P2(int *x, int *y, int *z)
> > > > {
> > > > 	int r3;
> > > > 	int r4;
> > > >
> > > > 	r3 = READ_ONCE(*y);
> > > > 	smp_rmb();
> > > > 	r4 = READ_ONCE(*z);
> > > > }
> > > >
> > > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
> > > I could imagine that P0 posts both of its stores in a shared store buffer before reading *x, but marks the release store as "not ready".
> > > Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads, and subsequently marks its release store as "ready".
> > That isn't how release stores are meant to work.  The read of x is supposed to be complete before the release store becomes visible to any other CPU.
>
> Note that the release store isn't observed until it becomes "ready", so it is really indistinguishable whether it had become visible to any other CPU.
> Indeed stores that aren't marked "ready" would be ignored during forwarding, and not allowed to be pushed to the cache.

Oops, I mixed up a couple of the accesses.  Okay, yes, this mechanism will allow writes that are po-after a release store but execute before it to evade the propagation restriction.

> The reason this kind of implementation seems less natural to me is that such a "not ready" store would need to be pushed back in the buffer (if it is the head of the buffer and the cache is ready to take a store), stall the later stores, or be aborted until it becomes ready.
> That just seems to create a lot of hassle for no discernible benefit.
> A "not ready" store probably shouldn't be put into a store queue, even if the only reason it is not ready is that there are some otherwise unrelated reads that haven't completed yet.
>
> > This is true even in C11.
>
> Arguable... The following pseudo-code litmus test should demonstrate this:
>
> P0 {
> 	int r = read_relaxed(&x);
> 	store_release(&y, 1);
> }
>
> P1 {
> 	int s = read_relaxed(&y);
> 	store_release(&x, 1);
> }
>
> In C11, it should be possible to read r==s==1.

True, in C11 releases don't mean anything unless they're paired with acquires.  But if your P1 had been

	int s = read_acquire(&y);
	write_relaxed(&x, 1);

then r = s = 1 would not be allowed.  And presumably the same object code would be generated for P0 either way, particularly if P1 was in a separate compilation unit (link-time optimization notwithstanding).

> Btw, how to proceed for your SRCU patch and this one?
> Are you planning to make any changes? I think the version you have is ok if you don't think the patch is improved by anything I brought up.

I don't see any need to change the SRCU patch at this point, other than to improve the attributions.

> Any additional concerns/changes for this patch?

It should give the same data-race diagnostics as the current LKMM.  This probably means the patch will need to punch up the definitions of *-pre-bounded and *-post-bounded, unless you can think of a better approach.

Alan
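To make the acquire-paired variant concrete, here is a sketch in the same C11 style as the earlier sketch (an editorial addition, not from the thread; it is a drop-in replacement for t1() there, reusing the same globals and harness).  Once the load of y is an acquire that reads the release store of y, P0's read of x happens before P1's store to x, so r == s == 1 becomes impossible.

static int t1_acquire(void *arg)
{
	(void)arg;
	/*
	 * P1 variant from Alan's reply: acquire load of y, relaxed
	 * store to x.  If s == 1, this acquire synchronizes with P0's
	 * release store to y, so P0's read of x happens before the
	 * store below and cannot observe it; hence r and s cannot
	 * both be 1.
	 */
	s = atomic_load_explicit(&y, memory_order_acquire);
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return 0;
}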
On 1/31/2023 5:55 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
>>
>> More precisely, I would change
>>
>>> 	For each other CPU C', any store which propagates to C
>>> 	before a release fence is executed (including all
>>> 	po-earlier stores executed on C) is forced to propagate
>>> 	to C' before the store associated with the release
>>> 	fence does.
>> Into something like
>>
>> 	For each other CPU C', any *external* store which propagates to C
>> 	before a release fence is executed, as well as any po-earlier
>> 	store executed on C, is forced to propagate to C' before the
>> 	store associated with the release fence does.
>>
>> The difference is that po-later stores that happen to propagate to C before the release fence is executed would no longer be ordered.
>> That should be consistent with the axiomatic model.
> I had to check that it wouldn't affect the (xbstar & int) part of vis, but it looks all right.  This seems like a reasonable change.
>
> However, it only fixes part of the problem.  Suppose an external write is read by an instruction po-after the release-store, but the read executes before the release-store.  The operational model would still say the external write has to obey the propagation ordering, whereas the formal model doesn't require it.

Ouch. I had only thought about the [W];(xbstar & int);[Release] case, but there's also the rfe;(xbstar & int);[Release] case....

>> Any additional concerns/changes for this patch?
> It should give the same data-race diagnostics as the current LKMM.  This probably means the patch will need to punch up the definitions of *-pre-bounded and *-post-bounded, unless you can think of a better approach.
>
> Alan

Seems the one-thing-per-patch mentality hasn't yet become ingrained in me.
Thanks!
jonas
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 6e531457bb73..815fdafacaef 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
 	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
-	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
+	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
+let gp = po ; [Sync-rcu | Sync-srcu] ; po?
+let strong-fence = mb | gp |
 (*
  * Note: The po-unlock-lock-po relation only passes the lock to the direct
  * successor, perhaps giving the impression that the ordering of the
@@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
  *)
 	([M] ; po-unlock-lock-po ; [After-unlock-lock] ; po ; [M])
 
-let gp = po ; [Sync-rcu | Sync-srcu] ; po?
-let strong-fence = mb | gp
-let nonrw-fence = strong-fence | po-rel | acq-po
+
+
+let nonrw-fence = mb | gp | po-rel | acq-po
 let fence = nonrw-fence | wmb | rmb
 let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
 		Before-atomic | After-atomic | Acquire | Release |