[v2,2/2] tools/memory-model: Make ppo a subrelation of po

Message ID 20230126134604.2160-3-jonas.oberhauser@huaweicloud.com
State New
Series Streamlining treatment of smp_mb__after_unlock_lock

Commit Message

Jonas Oberhauser Jan. 26, 2023, 1:46 p.m. UTC
  As stated in the documentation and implied by its name, the ppo
(preserved program order) relation is intended to link po-earlier
to po-later instructions under certain conditions.  However, a
corner case currently allows instructions to be linked by ppo that
are not executed by the same thread, i.e., instructions are being
linked that have no po relation.

This happens due to the mb/strong-fence relations, which (as one
case) provide order when locks are passed between threads followed
by an smp_mb__after_unlock_lock() fence.  This is illustrated in
the following litmus test (as can be seen when using herd7 with
`doshow ppo`):

P0(int *x, int *y)
{
    spin_lock(x);
    spin_unlock(x);
}

P1(int *x, int *y)
{
    spin_lock(x);
    smp_mb__after_unlock_lock();
    *y = 1;
}

The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
P0 passes a lock to P1 which then uses this fence.
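
The spurious edge can be displayed by completing the snippet above into
a full litmus test (adding a name, an init block and an exists clause)
and asking herd7 to draw ppo; for example (the file name here is only a
placeholder, the options are the usual ones for graphical output):

  $ herd7 -conf linux-kernel.cfg ppo-test.litmus -show prop -o /tmp -doshow ppo -showraw ppo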

The patch makes ppo a subrelation of po by eliminating this possibility
from mb (but not strong-fence) and relying explicitly on mb|gp instead
of strong-fence when defining ppo.

Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
 tools/memory-model/linux-kernel.cat | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
  

Comments

Alan Stern Jan. 26, 2023, 4:36 p.m. UTC | #1
On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions.  However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
> 
> This happens due to the mb/strong-fence relations, which (as one
> case) provide order when locks are passed between threads followed
> by an smp_mb__after_unlock_lock() fence.  This is illustrated in
> the following litmus test (as can be seen when using herd7 with
> `doshow ppo`):
> 
> P0(int *x, int *y)
> {
>     spin_lock(x);
>     spin_unlock(x);
> }
> 
> P1(int *x, int *y)
> {
>     spin_lock(x);
>     smp_mb__after_unlock_lock();
>     *y = 1;
> }
> 
> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
> 
> The patch makes ppo a subrelation of po by eliminating this possibility
> from mb (but not strong-fence) and relying explicitly on mb|gp instead
> of strong-fence when defining ppo.
> 
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> ---

This changes the meaning of the fence relation, which is used in 
w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar.  Have you 
checked that they won't be affected by the change?

>  tools/memory-model/linux-kernel.cat | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 6e531457bb73..815fdafacaef 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
>  let mb = ([M] ; fencerel(Mb) ; [M]) |
>  	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
>  	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> -	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> +	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
> +let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> +let strong-fence = mb | gp |
>  (*
>   * Note: The po-unlock-lock-po relation only passes the lock to the direct
>   * successor, perhaps giving the impression that the ordering of the
> @@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
>   *)
>  	([M] ; po-unlock-lock-po ;
>  		[After-unlock-lock] ; po ; [M])
> -let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> -let strong-fence = mb | gp
>  
> -let nonrw-fence = strong-fence | po-rel | acq-po
> +

Extra blank line.

> +let nonrw-fence = mb | gp | po-rel | acq-po
>  let fence = nonrw-fence | wmb | rmb
>  let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
>  		Before-atomic | After-atomic | Acquire | Release |
> -- 
> 2.17.1

Alan
  
Jonas Oberhauser Jan. 27, 2023, 2:31 p.m. UTC | #2
On 1/26/2023 5:36 PM, Alan Stern wrote:
> On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote:
>> As stated in the documentation and implied by its name, the ppo
>> (preserved program order) relation is intended to link po-earlier
>> to po-later instructions under certain conditions.  However, a
>> corner case currently allows instructions to be linked by ppo that
>> are not executed by the same thread, i.e., instructions are being
>> linked that have no po relation.
>>
>> This happens due to the mb/strong-fence relations, which (as one
>> case) provide order when locks are passed between threads followed
>> by an smp_mb__after_unlock_lock() fence.  This is illustrated in
>> the following litmus test (as can be seen when using herd7 with
>> `doshow ppo`):
>>
>> P0(int *x, int *y)
>> {
>>      spin_lock(x);
>>      spin_unlock(x);
>> }
>>
>> P1(int *x, int *y)
>> {
>>      spin_lock(x);
>>      smp_mb__after_unlock_lock();
>>      *y = 1;
>> }
>>
>> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
>> P0 passes a lock to P1 which then uses this fence.
>>
>> The patch makes ppo a subrelation of po by eliminating this possibility
>> from mb (but not strong-fence) and relying explicitly on mb|gp instead
>> of strong-fence when defining ppo.
>>
>> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>> ---
> This changes the meaning of the fence relation, which is used in
> w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar.  Have you
> checked that they won't be affected by the change?

Good point; in fact it changes the nonrw-fence as well, which is used in 
the r-* relations.
I had missed this completely. That's what I get for not looking at the 
data race relations!


Let's go through the individual cases.
If we had e.g. *-post-bounded because of the 
po-unlock-lock-po;[After-unlock-lock];po edge, this may be either due to 
the fence rule or due to (...fence ; vis ; *-pre-bounded).
In the first case we have po-rel ; rfe ; acq-po and get 
fence;rfe;(xbstar&int);fence which gives us *-post-bounded.
In the second case we now have strong-fence; since vis <= xbstar, we see 
that **-vis is preserved here by switching from the fence rule to the 
strong-fence;xbstar;... case.

For *-pre-bounded, the situation is tricky because of the xbstar&int 
that can be at the end of vis, when *-pre-bounded is used to define 
w*-vis through (w-post-bounded;vis;*-pre-bounded). In this case, the 
xbstar&int can point in the opposite direction of po, which means that 
the unlock that creates the strong fence might not be po-after the 
instruction that starts the link.

Here's a litmus test illustrating the difference, where P1 has a 
backwards-pointing xbstar&int. Currently there's no data race, but with 
the proposed patch there is.

P0(int *x, int *y)
{
     *x = 1;
     smp_store_release(y, 1);
}

P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
{
     spin_lock(l);
     int r1 = READ_ONCE(*dy);
     if (r1==1)
         spin_unlock(l);

     int r0 = smp_load_acquire(y);
     if (r0 == 1) {
         WRITE_ONCE(*dx,1);
     }
}

P2(int *dx, int *dy)
{
     WRITE_ONCE(*dy,READ_ONCE(*dx));
}


P3(int *x, spinlock_t *l)
{
     spin_lock(l);
     smp_mb__after_unlock_lock();
     *x = 2;
}


This actually makes me wonder. I thought the reason for the xbstar & int 
is that it ensures that the overall relation, after shuffling around a 
little bit, becomes prop&int ; hb*.

Like in case the *x=2 is co-before the *x=1, we get
   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on the same CPU 
as Wx2)  ->fence Wx2
which is
   Wx2 -> prop&int some other event ->hb Wx2
which must be irreflexive.

However, that's not the case at all, because the fence relation 
currently doesn't actually have to relate events of the same CPU.
So we get
   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on some other CPU 
than Wx2's) ->(hb*&int);fence Wx2
i.e.,
   Wx2 ->prop&ext;hb*;strong-fence Wx2

which shouldn't provide any ordering in general.

In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs, (pilot errors 
notwithstanding) both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in 
the current LKMM (in graphs where all other edges are equal).

Shouldn't this actually *be* a data race? And potentially the same with 
rcu-fence?

The other cases of *-pre-bounded seem to work out: they all link stuff 
via xbstar to the instruction that is linked via po-unlock-lock-po ; 
[After-unlock-lock] ; po to the potentially racy access, and 
po-unlock-lock-po;po   is xbstar ; acq-po, which allows closing the gap.

Best wishes, jonas
  
Alan Stern Jan. 28, 2023, 7:56 p.m. UTC | #3
On Fri, Jan 27, 2023 at 03:31:25PM +0100, Jonas Oberhauser wrote:
> Here's a litmus test illustrating the difference, where P1 has a
> backwards-pointing xbstar&int. Currently there's no data race, but with the
> proposed patch there is.
> 
> P0(int *x, int *y)
> {
>     *x = 1;
>     smp_store_release(y, 1);
> }
> 
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
>     spin_lock(l);
>     int r1 = READ_ONCE(*dy);
>     if (r1==1)
>         spin_unlock(l);
> 
>     int r0 = smp_load_acquire(y);
>     if (r0 == 1) {
>         WRITE_ONCE(*dx,1);
>     }
> }
> 
> P2(int *dx, int *dy)
> {
>     WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
> 
> 
> P3(int *x, spinlock_t *l)
> {
>     spin_lock(l);
>     smp_mb__after_unlock_lock();
>     *x = 2;
> }

I don't understand why the current LKMM doesn't say there is a data 
race.  In fact, I don't understand what herd7 is doing with this litmus 
test at all.  Evidently the plain-coherence check rules out x=1 at the 
end, because when I relax that check, x=1 becomes a possible result.  
Furthermore, the graphical output confirms that this execution has a 
ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
to Wx=2!  How can this be possible?  It seems like a bug in herd7.

Furthermore, the execution with x=2 at the end doesn't have either a 
ww-vis or a ww-nonrace edge between Wx=1 and Wx=2.  So why isn't there a 
ww-race edge?

> This actually makes me wonder. I thought the reason for the xbstar & int is
> that it ensures that the overall relation, after shuffling around a little
> bit, becomes prop&int ; hb*.

No, that is not the reason for it.  See below.

> Like in case the *x=2 is co-before the *x=1, we get
>   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on the same CPU as
> Wx2)  ->fence Wx2
> which is
>   Wx2 -> prop&int some other event ->hb Wx2
> which must be irreflexive.
> 
> However, that's not the case at all, because the fence relation currently
> doesn't actually have to relate events of the same CPU.
> So we get
>   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on some other CPU than
> Wx2's) ->(hb*&int);fence Wx2
> i.e.,
>   Wx2 ->prop&ext;hb*;strong-fence Wx2
> 
> which shouldn't provide any ordering in general.
> 
> In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs, (pilot errors
> notwithstanding) both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in the
> current LKMM (in graphs where all other edges are equal).
> 
> Shouldn't this actually *be* a data race? And potentially the same with
> rcu-fence?

I think that herd7 _should_ say there is a data race.

On the other hand, I also think that the operational model says there 
isn't.  This is a case where the formal model fails to match the 
operational model.

The operational model says that if W is a release-store on CPU C and W' 
is another store which propagates to C before W executes, then W' 
propagates to every CPU before W does.  (In other words, releases are 
A-cumulative).  But the formal model enforces this rule only in cases 
the event reading W' on C is po-before W.

In your litmus test, the event reading y=1 on P1 is po-after the 
spin_unlock (which is a release).  Nevertheless, any feasible execution 
requires that P1 must execute Ry=1 before the unlock.  So the 
operational model says that y=1 must propagate to P3 before the unlock 
does, i.e., before P3 executes the spin_lock().  But the formal model 
doesn't require this.

Ideally we would fix this by changing the definition of po-rel to:

	[M] ; (xbstar & int) ; [Release]

(This is closely related to the use of (xbstar & int) in the definition 
of vis that you asked about.)  Unfortunately we can't do this, because 
po-rel has to be defined long before xbstar.

> The other cases of *-pre-bounded seem to work out: they all link stuff via
> xbstar to the instruction that is linked via po-unlock-lock-po ;
> [After-unlock-lock] ; po to the potentially racy access, and
> po-unlock-lock-po;po   is xbstar ; acq-po, which allows closing the gap.

I could not follow your arguments at all; the writing was too confusing.

Alan
  
Andrea Parri Jan. 28, 2023, 10:14 p.m. UTC | #4
> Evidently the plain-coherence check rules out x=1 at the 
> end, because when I relax that check, x=1 becomes a possible result.  
> Furthermore, the graphical output confirms that this execution has a 
> ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> to Wx=2!  How can this be possible?  It seems like a bug in herd7.

By default, herd7 performs some edge removal when generating the
graphical outputs.  The option -showraw can be useful to increase
the "verbosity", for example,

  [with "exists (x=2)", output in /tmp/T.dot]
  $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis


> Furthermore, the execution with x=2 at the end doesn't have either a 
> ww-vis or a ww-nonrace edge between Wx=1 and Wx=2.  So why isn't there a 
> ww-race edge?

And similarly

  [with "exists (x=2)"]
  $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace

  Andrea
  
Andrea Parri Jan. 28, 2023, 10:21 p.m. UTC | #5
On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > Evidently the plain-coherence check rules out x=1 at the 
> > end, because when I relax that check, x=1 becomes a possible result.  
> > Furthermore, the graphical output confirms that this execution has a 
> > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> 
> By default, herd7 performs some edges removal when generating the
> graphical outputs.  The option -showraw can be useful to increase
> the "verbosity", for example,
> 
>   [with "exists (x=2)", output in /tmp/T.dot]

This was meant to be "exists (x=1)".

  Andrea


>   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> 
> 
> > Furthermore, the execution with x=2 at the end doesn't have either a 
> > ww-vis or a ww-nonrace edge between Wx=1 and Wx=2.  So why isn't there a 
> > ww-race edge?
> 
> And similarly
> 
>   [with "exists (x=2)"]
>   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace
> 
>   Andrea
  
Alan Stern Jan. 28, 2023, 10:59 p.m. UTC | #6
On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > Evidently the plain-coherence check rules out x=1 at the 
> > end, because when I relax that check, x=1 becomes a possible result.  
> > Furthermore, the graphical output confirms that this execution has a 
> > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> 
> By default, herd7 performs some edges removal when generating the
> graphical outputs.  The option -showraw can be useful to increase
> the "verbosity", for example,
> 
>   [with "exists (x=2)", output in /tmp/T.dot]
>   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis

Okay, thanks, that helps a lot.

So here's what we've got.  The litmus test:


C hb-and-int
{}

P0(int *x, int *y)
{
    *x = 1;
    smp_store_release(y, 1);
}

P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
{
    spin_lock(l);
    int r1 = READ_ONCE(*dy);
    if (r1==1)
        spin_unlock(l);

    int r0 = smp_load_acquire(y);
    if (r0 == 1) {
        WRITE_ONCE(*dx,1);
    }
}

P2(int *dx, int *dy)
{
    WRITE_ONCE(*dy,READ_ONCE(*dx));
}


P3(int *x, spinlock_t *l)
{
    spin_lock(l);
    smp_mb__after_unlock_lock();
    *x = 2;
}

exists (x=2)


The reason why Wx=1 ->ww-vis Wx=2:

	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < w-post-bounded.

	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.

	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.

Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.

This explains why the memory model says there isn't a data race.  This 
doesn't use the smp_mb__after_unlock_lock at all.
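
(For reference, the two definitions from the plain-accesses part of
linux-kernel.cat that this derivation leans on are, roughly,

	let vis = cumul-fence* ; rfe? ; [Marked] ;
		((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
	let ww-vis = fence | (strong-fence ; xbstar ; w-pre-bounded) |
		(w-post-bounded ; vis ; w-pre-bounded)

quoted from memory, so modulo the exact wording in the current tree.)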

Alan
  
Paul E. McKenney Jan. 29, 2023, 5:17 a.m. UTC | #7
On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > Evidently the plain-coherence check rules out x=1 at the 
> > > end, because when I relax that check, x=1 becomes a possible result.  
> > > Furthermore, the graphical output confirms that this execution has a 
> > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > 
> > By default, herd7 performs some edges removal when generating the
> > graphical outputs.  The option -showraw can be useful to increase
> > the "verbosity", for example,
> > 
> >   [with "exists (x=2)", output in /tmp/T.dot]
> >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> 
> Okay, thanks, that helps a lot.
> 
> So here's what we've got.  The litmus test:
> 
> 
> C hb-and-int
> {}
> 
> P0(int *x, int *y)
> {
>     *x = 1;
>     smp_store_release(y, 1);
> }
> 
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
>     spin_lock(l);
>     int r1 = READ_ONCE(*dy);
>     if (r1==1)
>         spin_unlock(l);
> 
>     int r0 = smp_load_acquire(y);
>     if (r0 == 1) {
>         WRITE_ONCE(*dx,1);
>     }

The lack of a spin_unlock() when r1!=1 is intentional?

It is admittedly a cute way to prevent P3 from doing anything
when r1!=1.  And P1 won't do anything if P3 runs first.

> }
> 
> P2(int *dx, int *dy)
> {
>     WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
> 
> 
> P3(int *x, spinlock_t *l)
> {
>     spin_lock(l);
>     smp_mb__after_unlock_lock();
>     *x = 2;
> }
> 
> exists (x=2)
> 
> 
> The reason why Wx=1 ->ww-vis Wx=2:
> 
> 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> 
> 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> 
> 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> 
> Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> 
> This explains why the memory model says there isn't a data race.  This 
> doesn't use the smp_mb__after_unlock_lock at all.

You lost me on this one.

Suppose that P3 starts first, then P0.  P1 is then stuck at the
spin_lock() because P3 does not release that lock.  P2 goes out for a
pizza.

Why can't the two stores to x by P0 and P3 conflict, resulting in a
data race?

							Thanx, Paul
  
Alan Stern Jan. 29, 2023, 4:03 p.m. UTC | #8
On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > Evidently the plain-coherence check rules out x=1 at the 
> > > > end, because when I relax that check, x=1 becomes a possible result.  
> > > > Furthermore, the graphical output confirms that this execution has a 
> > > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > > 
> > > By default, herd7 performs some edges removal when generating the
> > > graphical outputs.  The option -showraw can be useful to increase
> > > the "verbosity", for example,
> > > 
> > >   [with "exists (x=2)", output in /tmp/T.dot]
> > >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > 
> > Okay, thanks, that helps a lot.
> > 
> > So here's what we've got.  The litmus test:
> > 
> > 
> > C hb-and-int
> > {}
> > 
> > P0(int *x, int *y)
> > {
> >     *x = 1;
> >     smp_store_release(y, 1);
> > }
> > 
> > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > {
> >     spin_lock(l);
> >     int r1 = READ_ONCE(*dy);
> >     if (r1==1)
> >         spin_unlock(l);
> > 
> >     int r0 = smp_load_acquire(y);
> >     if (r0 == 1) {
> >         WRITE_ONCE(*dx,1);
> >     }
> 
> The lack of a spin_unlock() when r1!=1 is intentional?

I assume so.

> It is admittedly a cute way to prevent P3 from doing anything
> when r1!=1.  And P1 won't do anything if P3 runs first.

Right.

> > }
> > 
> > P2(int *dx, int *dy)
> > {
> >     WRITE_ONCE(*dy,READ_ONCE(*dx));
> > }
> > 
> > 
> > P3(int *x, spinlock_t *l)
> > {
> >     spin_lock(l);
> >     smp_mb__after_unlock_lock();
> >     *x = 2;
> > }
> > 
> > exists (x=2)
> > 
> > 
> > The reason why Wx=1 ->ww-vis Wx=2:
> > 
> > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > 
> > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > 
> > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > 
> > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > 
> > This explains why the memory model says there isn't a data race.  This 
> > doesn't use the smp_mb__after_unlock_lock at all.
> 
> You lost me on this one.
> 
> Suppose that P3 starts first, then P0.  P1 is then stuck at the
> spin_lock() because P3 does not release that lock.  P2 goes out for a
> pizza.

That wouldn't be a valid execution.  One of the rules in lock.cat says 
that a spin_lock() call must read from a spin_unlock() or from an 
initial write, which rules out executions in which P3 acquires the lock 
first.

> Why can't the two stores to x by P0 and P3 conflict, resulting in a
> data race?

That can't happen in executions where P1 acquires the lock first for the 
reason outlined above (P0's store to x propagates to P3 before P3 writes 
to x).  And there are no other executions -- basically, herd7 ignores 
deadlock scenarios.

Alan
  
Paul E. McKenney Jan. 29, 2023, 4:21 p.m. UTC | #9
On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > > Evidently the plain-coherence check rules out x=1 at the 
> > > > > end, because when I relax that check, x=1 becomes a possible result.  
> > > > > Furthermore, the graphical output confirms that this execution has a 
> > > > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > > > 
> > > > By default, herd7 performs some edges removal when generating the
> > > > graphical outputs.  The option -showraw can be useful to increase
> > > > the "verbosity", for example,
> > > > 
> > > >   [with "exists (x=2)", output in /tmp/T.dot]
> > > >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > > 
> > > Okay, thanks, that helps a lot.
> > > 
> > > So here's what we've got.  The litmus test:
> > > 
> > > 
> > > C hb-and-int
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >     *x = 1;
> > >     smp_store_release(y, 1);
> > > }
> > > 
> > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > > {
> > >     spin_lock(l);
> > >     int r1 = READ_ONCE(*dy);
> > >     if (r1==1)
> > >         spin_unlock(l);
> > > 
> > >     int r0 = smp_load_acquire(y);
> > >     if (r0 == 1) {
> > >         WRITE_ONCE(*dx,1);
> > >     }
> > 
> > The lack of a spin_unlock() when r1!=1 is intentional?
> 
> I assume so.
> 
> > It is admittedly a cute way to prevent P3 from doing anything
> > when r1!=1.  And P1 won't do anything if P3 runs first.
> 
> Right.
> 
> > > }
> > > 
> > > P2(int *dx, int *dy)
> > > {
> > >     WRITE_ONCE(*dy,READ_ONCE(*dx));
> > > }
> > > 
> > > 
> > > P3(int *x, spinlock_t *l)
> > > {
> > >     spin_lock(l);
> > >     smp_mb__after_unlock_lock();
> > >     *x = 2;
> > > }
> > > 
> > > exists (x=2)
> > > 
> > > 
> > > The reason why Wx=1 ->ww-vis Wx=2:
> > > 
> > > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > > 
> > > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > > 
> > > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > > 
> > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > > 
> > > This explains why the memory model says there isn't a data race.  This 
> > > doesn't use the smp_mb__after_unlock_lock at all.
> > 
> > You lost me on this one.
> > 
> > Suppose that P3 starts first, then P0.  P1 is then stuck at the
> > spin_lock() because P3 does not release that lock.  P2 goes out for a
> > pizza.
> 
> That wouldn't be a valid execution.  One of the rules in lock.cat says 
> that a spin_lock() call must read from a spin_unlock() or from an 
> initial write, which rules out executions in which P3 acquires the lock 
> first.

OK, I will bite...

Why can't P3's spin_lock() read from that initial write?

> > Why can't the two stores to x by P0 and P3 conflict, resulting in a
> > data race?
> 
> That can't happen in executions where P1 acquires the lock first for the 
> reason outlined above (P0's store to x propagates to P3 before P3 writes 
> to x).  And there are no other executions -- basically, herd7 ignores 
> deadlock scenarios.

True enough, if P1 gets there first, then P3 never stores to x.

What I don't understand is why P1 must always get there first.

							Thanx, Paul
  
Andrea Parri Jan. 29, 2023, 5:11 p.m. UTC | #10
> The reason why Wx=1 ->ww-vis Wx=2:
> 
> 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> 
> 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> 
> 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> 
> Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.

To clarify, po-unlock-lock-po is not a subrelation of mb; see what
happens without the smp_mb__after_unlock_lock().

  Andrea
  
Andrea Parri Jan. 29, 2023, 5:28 p.m. UTC | #11
> Why can't P3's spin_lock() read from that initial write?

Mmh, sounds like you want to play with something like below?

  Andrea

diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
index 6b52f365d73ac..20c3af4511255 100644
--- a/tools/memory-model/lock.cat
+++ b/tools/memory-model/lock.cat
@@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
 
 (* Allow up to one unmatched LKW per location; more must deadlock *)
 let UNMATCHED-LKW = LKW \ domain(critical)
-empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
 
 (* rfi for LF events: link each LKW to the LF events in its critical section *)
 let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
@@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
 let rf = rf | rf-lf | rf-ru
 
 (* Generate all co relations, including LKW events but not UL *)
-let co0 = co0 | ([IW] ; loc ; [LKW]) |
-	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
+let co0 = co0 | ([IW] ; loc ; [LKW])
 include "cos-opt.cat"
 let W = W | UL
 let M = R | W
  
Paul E. McKenney Jan. 29, 2023, 6:44 p.m. UTC | #12
On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > Why can't P3's spin_lock() read from that initial write?
> 
> Mmh, sounds like you want to play with something like below?
> 
>   Andrea
> 
> diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> index 6b52f365d73ac..20c3af4511255 100644
> --- a/tools/memory-model/lock.cat
> +++ b/tools/memory-model/lock.cat
> @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
>  
>  (* Allow up to one unmatched LKW per location; more must deadlock *)
>  let UNMATCHED-LKW = LKW \ domain(critical)
> -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
>  
>  (* rfi for LF events: link each LKW to the LF events in its critical section *)
>  let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
>  let rf = rf | rf-lf | rf-ru
>  
>  (* Generate all co relations, including LKW events but not UL *)
> -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> -	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> +let co0 = co0 | ([IW] ; loc ; [LKW])
>  include "cos-opt.cat"
>  let W = W | UL
>  let M = R | W

No idea.  But the following litmus test gets no executions whatsoever,
so point taken about my missing at least one corner case.  ;-)

Adding a spin_unlock() to the end of either process allows both to
run.

One could argue that this is a bug, but one could equally well argue
that if you have a deadlock, you have a deadlock.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

C lock

{
}


P0(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*a, 1);
}

P1(int *a, int *b, spinlock_t *l)
{
	spin_lock(l);
	WRITE_ONCE(*b, 1);
}

exists (a=1 /\ b=1)
  
Paul E. McKenney Jan. 29, 2023, 7:17 p.m. UTC | #13
On Sun, Jan 29, 2023 at 08:21:56AM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote:
> > On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> > > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > > > Evidently the plain-coherence check rules out x=1 at the 
> > > > > > end, because when I relax that check, x=1 becomes a possible result.  
> > > > > > Furthermore, the graphical output confirms that this execution has a 
> > > > > > ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1 
> > > > > > to Wx=2!  How can this be possible?  It seems like a bug in herd7.
> > > > > 
> > > > > By default, herd7 performs some edges removal when generating the
> > > > > graphical outputs.  The option -showraw can be useful to increase
> > > > > the "verbosity", for example,
> > > > > 
> > > > >   [with "exists (x=2)", output in /tmp/T.dot]
> > > > >   $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > > > 
> > > > Okay, thanks, that helps a lot.
> > > > 
> > > > So here's what we've got.  The litmus test:
> > > > 
> > > > 
> > > > C hb-and-int
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > >     *x = 1;
> > > >     smp_store_release(y, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > > > {
> > > >     spin_lock(l);
> > > >     int r1 = READ_ONCE(*dy);
> > > >     if (r1==1)
> > > >         spin_unlock(l);
> > > > 
> > > >     int r0 = smp_load_acquire(y);
> > > >     if (r0 == 1) {
> > > >         WRITE_ONCE(*dx,1);
> > > >     }
> > > 
> > > The lack of a spin_unlock() when r1!=1 is intentional?
> > 
> > I assume so.
> > 
> > > It is admittedly a cute way to prevent P3 from doing anything
> > > when r1!=1.  And P1 won't do anything if P3 runs first.
> > 
> > Right.
> > 
> > > > }
> > > > 
> > > > P2(int *dx, int *dy)
> > > > {
> > > >     WRITE_ONCE(*dy,READ_ONCE(*dx));
> > > > }
> > > > 
> > > > 
> > > > P3(int *x, spinlock_t *l)
> > > > {
> > > >     spin_lock(l);
> > > >     smp_mb__after_unlock_lock();
> > > >     *x = 2;
> > > > }
> > > > 
> > > > exists (x=2)
> > > > 
> > > > 
> > > > The reason why Wx=1 ->ww-vis Wx=2:
> > > > 
> > > > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > > > 
> > > > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > > > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > > > 
> > > > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > > > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > > > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > > > 
> > > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > > > 
> > > > This explains why the memory model says there isn't a data race.  This 
> > > > doesn't use the smp_mb__after_unlock_lock at all.
> > > 
> > > You lost me on this one.
> > > 
> > > Suppose that P3 starts first, then P0.  P1 is then stuck at the
> > > spin_lock() because P3 does not release that lock.  P2 goes out for a
> > > pizza.
> > 
> > That wouldn't be a valid execution.  One of the rules in lock.cat says 
> > that a spin_lock() call must read from a spin_unlock() or from an 
> > initial write, which rules out executions in which P3 acquires the lock 
> > first.
> 
> OK, I will bite...
> 
> Why can't P3's spin_lock() read from that initial write?
> 
> > > Why can't the two stores to x by P0 and P3 conflict, resulting in a
> > > data race?
> > 
> > That can't happen in executions where P1 acquires the lock first for the 
> > reason outlined above (P0's store to x propagates to P3 before P3 writes 
> > to x).  And there are no other executions -- basically, herd7 ignores 
> > deadlock scenarios.
> 
> True enough, if P1 gets there first, then P3 never stores to x.
> 
> What I don't understand is why P1 must always get there first.

Ah, is the rule that all processes must complete?

If so, then the only way all processes can complete is if P1 loads 1
from dy, thus releasing the lock.

But that dy=1 load can only happen after P2 has copied dx to dy, and has
stored a 1.  Which in turn only happens if P1's write to dx is ordered
before the lock release.  Which only executes if P1's load from y returned
1, which only happens after P0 stored 1 to y.

Which means that P3 gets the lock only after P0 completes its plain
write to x, so there is no data race.

The reason that P3 cannot go first is that this will prevent P1 from
completing, in turn preventing herd7 from counting that execution.

Or am I still missing a turn in there somewhere?

							Thanx, Paul
  
Boqun Feng Jan. 29, 2023, 9:43 p.m. UTC | #14
On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > > Why can't P3's spin_lock() read from that initial write?
> > 
> > Mmh, sounds like you want to play with something like below?
> > 
> >   Andrea
> > 
> > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> > index 6b52f365d73ac..20c3af4511255 100644
> > --- a/tools/memory-model/lock.cat
> > +++ b/tools/memory-model/lock.cat
> > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
> >  
> >  (* Allow up to one unmatched LKW per location; more must deadlock *)
> >  let UNMATCHED-LKW = LKW \ domain(critical)
> > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> >  
> >  (* rfi for LF events: link each LKW to the LF events in its critical section *)
> >  let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
> >  let rf = rf | rf-lf | rf-ru
> >  
> >  (* Generate all co relations, including LKW events but not UL *)
> > -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> > -	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> > +let co0 = co0 | ([IW] ; loc ; [LKW])
> >  include "cos-opt.cat"
> >  let W = W | UL
> >  let M = R | W
> 
> No idea.  But the following litmus test gets no executions whatsoever,
> so point taken about my missing at least one corner case.  ;-)
> 
> Adding a spin_unlock() to the end of either process allows both to
> run.
> 
> One could argue that this is a bug, but one could equally well argue
> that if you have a deadlock, you have a deadlock.
> 

in lock.cat: 

	(* Allow up to one unmatched LKW per location; more must deadlock *)
	let UNMATCHED-LKW = LKW \ domain(critical)
	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks

we rule out deadlocks from the execution candidates we care about.

Regards,
Boqun

> Thoughts?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> C lock
> 
> {
> }
> 
> 
> P0(int *a, int *b, spinlock_t *l)
> {
> 	spin_lock(l);
> 	WRITE_ONCE(*a, 1);
> }
> 
> P1(int *a, int *b, spinlock_t *l)
> {
> 	spin_lock(l);
> 	WRITE_ONCE(*b, 1);
> }
> 
> exists (a=1 /\ b=1)
  
Alan Stern Jan. 29, 2023, 10:10 p.m. UTC | #15
On Sun, Jan 29, 2023 at 06:11:12PM +0100, Andrea Parri wrote:
> > The reason why Wx=1 ->ww-vis Wx=2:
> > 
> > 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > 
> > 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > 
> > 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > 
> > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> 
> To clarify, po-unlock-lock-po is not a subrelation of mb; see what
> happens without the smp_mb__after_unlock_lock().

Ah, thank you again.  That was what I got wrong, and it explains why the 
data race appears with Jonas's patch.

This also points out an important unstated fact: All of the inter-CPU 
extended fences in the memory model are A-cumulative.  In this case the 
fence links Rdy=1 on P1 to Wx=2 on P3.  We know that 0:Wx=1 is visible 
to P1 before the Rdy executes, but what we need is that it is visible to 
P3 before Wx=2 executes.  The fact that the extended fence is strong 
(and therefore A-cumulative) provides this extra guarantee.

Alan
  
Jonas Oberhauser Jan. 29, 2023, 10:19 p.m. UTC | #16
Hi all, apologies on the confusion about the litmus test.
I should have explained it better but it seems you mostly figured it out.
As Alan said I'm tricking a little bit by not unlocking in certain 
places to filter out all executions that aren't what I'm looking for.
I didn't have much time when I sent it (hence the lack of explanation 
and the delay in responding), so I didn't get a chance to play around with 
the filter feature to do this the "proper"/non-cute way.
As such it really isn't about deadlocks.

I think one question is whether the distinction between the models could 
be reproduced without using any kind of filtering at all.
I have a feeling it should be possible but I haven't had time to think 
up a litmus test that does that.


On 1/28/2023 11:59 PM, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
>>> Evidently the plain-coherence check rules out x=1 at the
>>> end, because when I relax that check, x=1 becomes a possible result.
>>> Furthermore, the graphical output confirms that this execution has a
>>> ww-incoh edge from Wx=2 to Wx=1.  But there is no ww-vis edge from Wx=1
>>> to Wx=2!  How can this be possible?  It seems like a bug in herd7.
>> By default, herd7 performs some edges removal when generating the
>> graphical outputs.  The option -showraw can be useful to increase
>> the "verbosity", for example,
>>
>>    [with "exists (x=2)", output in /tmp/T.dot]
>>    $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> Okay, thanks, that helps a lot.
>
> So here's what we've got.  The litmus test:
>
>
> C hb-and-int
> {}
>
> P0(int *x, int *y)
> {
>      *x = 1;
>      smp_store_release(y, 1);
> }
>
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
>      spin_lock(l);
>      int r1 = READ_ONCE(*dy);
>      if (r1==1)
>          spin_unlock(l);
>
>      int r0 = smp_load_acquire(y);
>      if (r0 == 1) {
>          WRITE_ONCE(*dx,1);
>      }
> }
>
> P2(int *dx, int *dy)
> {
>      WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
>
>
> P3(int *x, spinlock_t *l)
> {
>      spin_lock(l);
>      smp_mb__after_unlock_lock();
>      *x = 2;
> }
>
> exists (x=2)
>
>
> The reason why Wx=1 ->ww-vis Wx=2:
>
> 	0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
>
> 	0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> 		(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
>
> 	1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> 		so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> 		and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
>
> [...]
>
> This explains why the memory model says there isn't a data race.  This
> doesn't use the smp_mb__after_unlock_lock at all.

Note as Andrea said that po-unlock-lock-po is generally not in mb (and 
also not otherwise in fence).
Only po-unlock-lock-po;[After-unlock-lock];po is in mb (resp. ...&int in 
fence).
While the example uses smp_mb__after_unlock_lock, the anomaly should 
generally cover any extended fences.


[Snipping in a part of an earlier e-mail you sent]



> I think that herd7_should_  say there is a data race.
>
> On the other hand, I also think that the operational model says there
> isn't.  This is a case where the formal model fails to match the
> operational model.
>
> The operational model says that if W is a release-store on CPU C and W'
> is another store which propagates to C before W executes, then W'
> propagates to every CPU before W does.  (In other words, releases are
> A-cumulative).  But the formal model enforces this rule only in cases
> the event reading W' on C is po-before W.
>
> In your litmus test, the event reading y=1 on P1 is po-after the
> spin_unlock (which is a release).  Nevertheless, any feasible execution
> requires that P1 must execute Ry=1 before the unlock.  So the
> operational model says that y=1 must propagate to P3 before the unlock
> does, i.e., before P3 executes the spin_lock().  But the formal model
> doesn't require this.


I see now. Somehow I thought stores must execute in program order, but I 
guess it doesn't make sense.
In that sense, W ->xbstar&int X always means W propagates to X's CPU 
before X executes.


> Ideally we would fix this by changing the definition of po-rel to:
>
> 	[M] ; (xbstar & int) ; [Release]
>
> (This is closely related to the use of (xbstar & int) in the definition
> of vis that you asked about.)

This misses the property of release stores that any po-earlier store 
must also execute before the release store.
Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ; 
[Release] but then one could instead move this into the definition of 
cumul-fence.
In fact you'd probably want this for all the propagation fences, so 
cumul-fence and pb should be the right place.

> Unfortunately we can't do this, because
> po-rel has to be defined long before xbstar.

You could do it, by turning the relation into one massive recursive 
definition.

Thinking about what the options are:
1) accept the difference and run with it by making it consistent inside 
the axiomatic model
2) fix it through the recursive definition, which seems to be quite ugly 
but also consistent with the power operational model as far as I can tell
3) weaken the operational model... somehow
4) just ignore the anomaly
5) ???

Currently my least favorite option is 4) since it seems a bit off that 
the reasoning applies in one specific case of LKMM, more specifically 
the data race definition which should be equivalent to "the order of the 
two races isn't fixed", but here the order isn't fixed but it's a data race.
I think the patch happens to almost do 1) because the xbstar&int at the 
end should already imply ordering through the prop&int <= hb rule.
What would remain is to also exclude rcu-fence somehow.
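
(By the prop&int <= hb rule I mean the happens-before axiom of
linux-kernel.cat, which reads roughly

  let hb = ppo | rfe | ((prop \ id) & int)
  acyclic hb as happens-before

i.e. a prop edge between two events on the same CPU already acts as an
ordering edge.)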

2) seems the most correct choice but also to complicate LKMM a lot.

Seems like being between a rock and a hard place.
jonas

PS:
>> The other cases of *-pre-bounded seem to work out: they all link stuff via
>> xbstar to the instruction that is linked via po-unlock-lock-po ;
>> [After-unlock-lock] ; po to the potentially racy access, and
>> po-unlock-lock-po;po   is xbstar ; acq-po, which allows closing the gap.
> I could not follow your arguments at all; the writing was too confusing.

Sorry, I collapsed some cases. I'll show an example. I think all the 
other cases are the same.
Let's pick an edge that links two events with ww-vis through
   w-post-bounded ; vis ; w-pre-bounded
where the vis comes from
   cumul-fence* ; rfe? ; [Marked] ;
     (strong-fence ; [Marked] ; xbstar)  <= vis
and the w-pre-bounded came from po-unlock-lock-po;[After-unlock-lock];po 
but not po-unlock-lock-po & int.

Note that such po-unlock-lock-po;[After-unlock-lock];po must come from
   po-rel ; rfe ; acq-po

So we have
   w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
      strong-fence ; [Marked] ; xbstar ; po-rel ; rfe ; acq-po
<=
   w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
      strong-fence ; [Marked] ; xbstar ; hb ; hb ;  acq-po
<=
   w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
      strong-fence ; [Marked] ; xbstar ;                fence
<= ww-vis

All the other cases where w-pre-bounded are used have the shape
     ... ; xbstar ; w-pre-bounded
which can all be handled in the same manner.
  
Paul E. McKenney Jan. 29, 2023, 11:09 p.m. UTC | #17
On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > > > Why can't P3's spin_lock() read from that initial write?
> > > 
> > > Mmh, sounds like you want to play with something like below?
> > > 
> > >   Andrea
> > > 
> > > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> > > index 6b52f365d73ac..20c3af4511255 100644
> > > --- a/tools/memory-model/lock.cat
> > > +++ b/tools/memory-model/lock.cat
> > > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
> > >  
> > >  (* Allow up to one unmatched LKW per location; more must deadlock *)
> > >  let UNMATCHED-LKW = LKW \ domain(critical)
> > > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > >  
> > >  (* rfi for LF events: link each LKW to the LF events in its critical section *)
> > >  let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> > > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
> > >  let rf = rf | rf-lf | rf-ru
> > >  
> > >  (* Generate all co relations, including LKW events but not UL *)
> > > -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> > > -	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> > > +let co0 = co0 | ([IW] ; loc ; [LKW])
> > >  include "cos-opt.cat"
> > >  let W = W | UL
> > >  let M = R | W
> > 
> > No idea.  But the following litmus test gets no executions whatsoever,
> > so point taken about my missing at least one corner case.  ;-)
> > 
> > Adding a spin_unlock() to the end of either process allows both to
> > run.
> > 
> > One could argue that this is a bug, but one could equally well argue
> > that if you have a deadlock, you have a deadlock.
> > 
> 
> in lock.cat: 
> 
> 	(* Allow up to one unmatched LKW per location; more must deadlock *)
> 	let UNMATCHED-LKW = LKW \ domain(critical)
> 	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> 
> we rule out deadlocks from the execution candidates we care about.

Thank you, Boqun!

							Thanx, Paul

> Regards,
> Boqun
> 
> > Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > C lock
> > 
> > {
> > }
> > 
> > 
> > P0(int *a, int *b, spinlock_t *l)
> > {
> > 	spin_lock(l);
> > 	WRITE_ONCE(*a, 1);
> > }
> > 
> > P1(int *a, int *b, spinlock_t *l)
> > {
> > 	spin_lock(l);
> > 	WRITE_ONCE(*b, 1);
> > }
> > 
> > exists (a=1 /\ b=1)
  
Alan Stern Jan. 30, 2023, 2:18 a.m. UTC | #18
On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> > in lock.cat: 
> > 
> > 	(* Allow up to one unmatched LKW per location; more must deadlock *)
> > 	let UNMATCHED-LKW = LKW \ domain(critical)
> > 	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > 
> > we rule out deadlocks from the execution candidates we care about.
> 
> Thank you, Boqun!

Actually that's only part of it.  The other part is rather obscure:

(* Generate all co relations, including LKW events but not UL *)
let co0 = co0 | ([IW] ; loc ; [LKW]) |
	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])

Implicitly this says that any lock with no corresponding unlock must 
come last in the coherence order, which implies the unmatched-locks rule 
(since only one lock event can be last).  By itself, the unmatched-locks 
rule would not prevent P3 from executing before P1, provided P1 executes 
both its lock and unlock.

Alan
  
Alan Stern Jan. 30, 2023, 2:39 a.m. UTC | #19
On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> I see now. Somehow I thought stores must execute in program order, but I
> guess it doesn't make sense.
> In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> X executes.

It also means any write that propagates to W's CPU before W executes 
also propagates to X's CPU before X executes (because it's the same CPU 
and W executes before X).

> > Ideally we would fix this by changing the definition of po-rel to:
> > 
> > 	[M] ; (xbstar & int) ; [Release]
> > 
> > (This is closely related to the use of (xbstar & int) in the definition
> > of vis that you asked about.)
> 
> This misses the property of release stores that any po-earlier store must
> also execute before the release store.

I should have written:

	[M] ; (po | (xbstar & int)) ; [Release]
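
(For comparison, the current definition is, roughly,

	let po-rel = [M] ; po ; [Release]

so the proposal amounts to letting the "executes before the release"
part come from xbstar & int as well as from po.)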

> Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> [Release] but then one could instead move this into the definition of
> cumul-fence.
> In fact you'd probably want this for all the propagation fences, so
> cumul-fence and pb should be the right place.
> 
> > Unfortunately we can't do this, because
> > po-rel has to be defined long before xbstar.
> 
> You could do it, by turning the relation into one massive recursive
> definition.

Which would make pretty much the entire memory model one big recursion.  
I do not want to do that.

> Thinking about what the options are:
> 1) accept the difference and run with it by making it consistent inside the
> axiomatic model
> 2) fix it through the recursive definition, which seems to be quite ugly but
> also consistent with the power operational model as far as I can tell
> 3) weaken the operational model... somehow
> 4) just ignore the anomaly
> 5) ???
> 
> Currently my least favorite option is 4) since it seems a bit off that the
> reasoning applies in one specific case of LKMM, more specifically the data
> race definition which should be equivalent to "the order of the two races
> isn't fixed", but here the order isn't fixed but it's a data race.
> I think the patch happens to almost do 1) because the xbstar&int at the end
> should already imply ordering through the prop&int <= hb rule.
> What would remain is to also exclude rcu-fence somehow.

IMO 1) is the best choice.

Alan

PS: For the record, here's a simpler litmus test to illustrate the 
failing.  The idea is that Wz=1 is reordered before the store-release, 
so it ought to propagate before Wy=1.  The LKMM does not require this.


C before-release

{}

P0(int *x, int *y, int *z)
{
	int r1;

	r1 = READ_ONCE(*x);
	smp_store_release(y, 1);
	WRITE_ONCE(*z, 1);
}

P1(int *x, int *y, int *z)
{
	int r2;

	r2 = READ_ONCE(*z);
	WRITE_ONCE(*x, r2);
}

P2(int *x, int *y, int *z)
{
	int r3;
	int r4;

	r3 = READ_ONCE(*y);
	smp_rmb();
	r4 = READ_ONCE(*z);
}

exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
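
(Assuming the test above is saved as before-release.litmus, it can be
run the usual way:

  $ herd7 -conf linux-kernel.cfg before-release.litmus

and, per the reasoning above, the LKMM is expected to allow the listed
outcome even though the operational model forbids it.)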
  
Paul E. McKenney Jan. 30, 2023, 4:36 a.m. UTC | #20
On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > I see now. Somehow I thought stores must execute in program order, but I
> > guess it doesn't make sense.
> > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > X executes.
> 
> It also means any write that propagates to W's CPU before W executes 
> also propagates to X's CPU before X executes (because it's the same CPU 
> and W executes before X).
> 
> > > Ideally we would fix this by changing the definition of po-rel to:
> > > 
> > > 	[M] ; (xbstar & int) ; [Release]
> > > 
> > > (This is closely related to the use of (xbstar & int) in the definition
> > > of vis that you asked about.)
> > 
> > This misses the property of release stores that any po-earlier store must
> > also execute before the release store.
> 
> I should have written:
> 
> 	[M] ; (po | (xbstar & int)) ; [Release]
> 
> > Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> > [Release] but then one could instead move this into the definition of
> > cumul-fence.
> > In fact you'd probably want this for all the propagation fences, so
> > cumul-fence and pb should be the right place.
> > 
> > > Unfortunately we can't do this, because
> > > po-rel has to be defined long before xbstar.
> > 
> > You could do it, by turning the relation into one massive recursive
> > definition.
> 
> Which would make pretty much the entire memory model one big recursion.  
> I do not want to do that.
> 
> > Thinking about what the options are:
> > 1) accept the difference and run with it by making it consistent inside the
> > axiomatic model
> > 2) fix it through the recursive definition, which seems to be quite ugly but
> > also consistent with the power operational model as far as I can tell
> > 3) weaken the operational model... somehow
> > 4) just ignore the anomaly
> > 5) ???
> > 
> > Currently my least favorite option is 4) since it seems a bit off that the
> > reasoning applies in one specific case of LKMM, more specifically the data
> > race definition which should be equivalent to "the order of the two races
> > isn't fixed", but here the order isn't fixed but it's a data race.
> > I think the patch happens to almost do 1) because the xbstar&int at the end
> > should already imply ordering through the prop&int <= hb rule.
> > What would remain is to also exclude rcu-fence somehow.
> 
> IMO 1) is the best choice.
> 
> Alan
> 
> PS: For the record, here's a simpler litmus test to illustrates the 
> failing.  The idea is that Wz=1 is reordered before the store-release, 
> so it ought to propagate before Wy=1.  The LKMM does not require this.

In PowerPC terms, would this be like having the Wz=1 be reordered
before the Wy=1, but not before the lwsync instruction preceding the
Wy=1 that made it a release store?

If so, we might have to keep this quirk.

							Thanx, Paul

> C before-release
> 
> {}
> 
> P0(int *x, int *y, int *z)
> {
> 	int r1;
> 
> 	r1 = READ_ONCE(*x);
> 	smp_store_release(y, 1);
> 	WRITE_ONCE(*z, 1);
> }
> 
> P1(int *x, int *y, int *z)
> {
> 	int r2;
> 
> 	r2 = READ_ONCE(*z);
> 	WRITE_ONCE(*x, r2);
> }
> 
> P2(int *x, int *y, int *z)
> {
> 	int r3;
> 	int r4;
> 
> 	r3 = READ_ONCE(*y);
> 	smp_rmb();
> 	r4 = READ_ONCE(*z);
> }
> 
> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
  
Paul E. McKenney Jan. 30, 2023, 4:43 a.m. UTC | #21
On Sun, Jan 29, 2023 at 09:18:24PM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> > > in lock.cat: 
> > > 
> > > 	(* Allow up to one unmatched LKW per location; more must deadlock *)
> > > 	let UNMATCHED-LKW = LKW \ domain(critical)
> > > 	empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > > 
> > > we rule out deadlocks from the execution candidates we care about.
> > 
> > Thank you, Boqun!
> 
> Actually that's only part of it.  The other part is rather obscure:
> 
> (* Generate all co relations, including LKW events but not UL *)
> let co0 = co0 | ([IW] ; loc ; [LKW]) |
> 	(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> 
> Implicitly this says that any lock with no corresponding unlock must 
> come last in the coherence order, which implies the unmatched-locks rule 
> (since only one lock event can be last).  By itself, the unmatched-locks 
> rule would not prevent P3 from executing before P1, provided P1 executes 
> both its lock and unlock.

And thank you, Alan, as well!

And RCU looks to operate in a similar manner:

------------------------------------------------------------------------

C rcudeadlock

{
}


P0(int *a, int *b)
{
	rcu_read_lock();
	WRITE_ONCE(*a, 1);
	synchronize_rcu();
	WRITE_ONCE(*b, 1);
	rcu_read_unlock();
}

P1(int *a, int *b)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*b);
	smp_mb();
	r2 = READ_ONCE(*a);
}

exists (1:r1=1 /\ 1:r2=0)

------------------------------------------------------------------------

$ herd7 -conf linux-kernel.cfg rcudeadlock.litmus 
Test rcudeadlock Allowed
States 0
No
Witnesses
Positive: 0 Negative: 0
Condition exists (1:r1=1 /\ 1:r2=0)
Observation rcudeadlock Never 0 0
Time rcudeadlock 0.00
Hash=4f7f336ad39d724d93b089133b00d1e2

------------------------------------------------------------------------

So good enough!  ;-)

							Thanx, Paul
  
Paul E. McKenney Jan. 30, 2023, 4:46 a.m. UTC | #22
On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> 
> Hi all, apologies on the confusion about the litmus test.
> I should have explained it better but it seems you mostly figured it out.
> As Alan said I'm tricking a little bit by not unlocking in certain places to
> filter out all executions that aren't what I'm looking for.
> I didn't have much time when I sent it (hence also the lack of explanation
> and why I haven't responded earlier), so I didn't have time to play around
> with the filter feature to do this the "proper"/non-cute way.
> As such it really isn't about deadlocks.

Not a problem!

> I think one question is whether the distinction between the models could be
> reproduced without using any kind of filtering at all.
> I have a feeling it should be possible but I haven't had time to think up a
> litmus test that does that.

Here is an example litmus test using filter, if this helps.

You put it right before the "exists" clause, and the condition is the
same as in the "exists" clause.  If an execution does not satisfy the
condition in the filter clause, it is tossed.

							Thanx, Paul

------------------------------------------------------------------------

C C-srcu-nest-6

(*
 * Result: Never
 *
 * This would be valid for srcu_down_read() and srcu_up_read().
 *)

{}

P0(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
	int r2;
	int r3;

	r3 = srcu_down_read(s1);
	WRITE_ONCE(*idx, r3);
	r2 = READ_ONCE(*y);
	smp_store_release(f, 1);
}

P1(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
	int r1;
	int r3;
	int r4;

	r4 = smp_load_acquire(f);
	r1 = READ_ONCE(*x);
	r3 = READ_ONCE(*idx);
	srcu_up_read(s1, r3);
}

P2(int *x, int *y, struct srcu_struct *s1)
{
	WRITE_ONCE(*y, 1);
	synchronize_srcu(s1);
	WRITE_ONCE(*x, 1);
}

locations [0:r1]
filter (1:r4=1)
exists (1:r1=1 /\ 0:r2=0)
  
Alan Stern Jan. 30, 2023, 4:47 p.m. UTC | #23
On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > I see now. Somehow I thought stores must execute in program order, but I
> > > guess it doesn't make sense.
> > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > > X executes.
> > 
> > It also means any write that propagates to W's CPU before W executes 
> > also propagates to X's CPU before X executes (because it's the same CPU 
> > and W executes before X).
> > 
> > > > Ideally we would fix this by changing the definition of po-rel to:
> > > > 
> > > > 	[M] ; (xbstar & int) ; [Release]
> > > > 
> > > > (This is closely related to the use of (xbstar & int) in the definition
> > > > of vis that you asked about.)
> > > 
> > > This misses the property of release stores that any po-earlier store must
> > > also execute before the release store.
> > 
> > I should have written:
> > 
> > 	[M] ; (po | (xbstar & int)) ; [Release]
> > 
> > > Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> > > [Release] but then one could instead move this into the definition of
> > > cumul-fence.
> > > In fact you'd probably want this for all the propagation fences, so
> > > cumul-fence and pb should be the right place.
> > > 
> > > > Unfortunately we can't do this, because
> > > > po-rel has to be defined long before xbstar.
> > > 
> > > You could do it, by turning the relation into one massive recursive
> > > definition.
> > 
> > Which would make pretty much the entire memory model one big recursion.  
> > I do not want to do that.
> > 
> > > Thinking about what the options are:
> > > 1) accept the difference and run with it by making it consistent inside the
> > > axiomatic model
> > > 2) fix it through the recursive definition, which seems to be quite ugly but
> > > also consistent with the power operational model as far as I can tell
> > > 3) weaken the operational model... somehow
> > > 4) just ignore the anomaly
> > > 5) ???
> > > 
> > > Currently my least favorite option is 4) since it seems a bit off that the
> > > reasoning applies in one specific case of LKMM, more specifically the data
> > > race definition which should be equivalent to "the order of the two races
> > > isn't fixed", but here the order isn't fixed but it's a data race.
> > > I think the patch happens to almost do 1) because the xbstar&int at the end
> > > should already imply ordering through the prop&int <= hb rule.
> > > What would remain is to also exclude rcu-fence somehow.
> > 
> > IMO 1) is the best choice.
> > 
> > Alan
> > 
> > PS: For the record, here's a simpler litmus test to illustrate the 
> > failing.  The idea is that Wz=1 is reordered before the store-release, 
> > so it ought to propagate before Wy=1.  The LKMM does not require this.
> 
> In PowerPC terms, would this be like having the Wz=1 being reordered
> before the Wy=1, but not before the lwsync instruction preceding the
> Wy=1 that makes it a release store?

No, it would be like having the Wz=1 reordered before the Rx=1, 
therefore before the lwsync.  Obviously this can't ever happen on 
PowerPC.

Alan

> If so, we might have to keep this quirk.
> 
> 							Thanx, Paul
> 
> > C before-release
> > 
> > {}
> > 
> > P0(int *x, int *y, int *z)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*x);
> > 	smp_store_release(y, 1);
> > 	WRITE_ONCE(*z, 1);
> > }
> > 
> > P1(int *x, int *y, int *z)
> > {
> > 	int r2;
> > 
> > 	r2 = READ_ONCE(*z);
> > 	WRITE_ONCE(*x, r2);
> > }
> > 
> > P2(int *x, int *y, int *z)
> > {
> > 	int r3;
> > 	int r4;
> > 
> > 	r3 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r4 = READ_ONCE(*z);
> > }
> > 
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
  
Paul E. McKenney Jan. 30, 2023, 4:50 p.m. UTC | #24
On Mon, Jan 30, 2023 at 11:47:50AM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > > I see now. Somehow I thought stores must execute in program order, but I
> > > > guess it doesn't make sense.
> > > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > > > X executes.
> > > 
> > > It also means any write that propagates to W's CPU before W executes 
> > > also propagates to X's CPU before X executes (because it's the same CPU 
> > > and W executes before X).
> > > 
> > > > > Ideally we would fix this by changing the definition of po-rel to:
> > > > > 
> > > > > 	[M] ; (xbstar & int) ; [Release]
> > > > > 
> > > > > (This is closely related to the use of (xbstar & int) in the definition
> > > > > of vis that you asked about.)
> > > > 
> > > > This misses the property of release stores that any po-earlier store must
> > > > also execute before the release store.
> > > 
> > > I should have written:
> > > 
> > > 	[M] ; (po | (xbstar & int)) ; [Release]
> > > 
> > > > Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
> > > > [Release] but then one could instead move this into the definition of
> > > > cumul-fence.
> > > > In fact you'd probably want this for all the propagation fences, so
> > > > cumul-fence and pb should be the right place.
> > > > 
> > > > > Unfortunately we can't do this, because
> > > > > po-rel has to be defined long before xbstar.
> > > > 
> > > > You could do it, by turning the relation into one massive recursive
> > > > definition.
> > > 
> > > Which would make pretty much the entire memory model one big recursion.  
> > > I do not want to do that.
> > > 
> > > > Thinking about what the options are:
> > > > 1) accept the difference and run with it by making it consistent inside the
> > > > axiomatic model
> > > > 2) fix it through the recursive definition, which seems to be quite ugly but
> > > > also consistent with the power operational model as far as I can tell
> > > > 3) weaken the operational model... somehow
> > > > 4) just ignore the anomaly
> > > > 5) ???
> > > > 
> > > > Currently my least favorite option is 4) since it seems a bit off that the
> > > > reasoning applies in one specific case of LKMM, more specifically the data
> > > > race definition which should be equivalent to "the order of the two races
> > > > isn't fixed", but here the order isn't fixed but it's a data race.
> > > > I think the patch happens to almost do 1) because the xbstar&int at the end
> > > > should already imply ordering through the prop&int <= hb rule.
> > > > What would remain is to also exclude rcu-fence somehow.
> > > 
> > > IMO 1) is the best choice.
> > > 
> > > Alan
> > > 
> > > PS: For the record, here's a simpler litmus test to illustrate the 
> > > failing.  The idea is that Wz=1 is reordered before the store-release, 
> > > so it ought to propagate before Wy=1.  The LKMM does not require this.
> > 
> > In PowerPC terms, would this be like having the Wz=1 being reordered
> > before the Wy=1, but not before the lwsync instruction preceding the
> > Wy=1 that makes it a release store?
> 
> No, it would be like having the Wz=1 reordered before the Rx=1, 
> therefore before the lwsync.  Obviously this can't ever happen on 
> PowerPC.

Whew!!!  ;-)

							Thanx, Paul

> Alan
> 
> > If so, we might have to keep this quirk.
> > 
> > 							Thanx, Paul
> > 
> > > C before-release
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y, int *z)
> > > {
> > > 	int r1;
> > > 
> > > 	r1 = READ_ONCE(*x);
> > > 	smp_store_release(y, 1);
> > > 	WRITE_ONCE(*z, 1);
> > > }
> > > 
> > > P1(int *x, int *y, int *z)
> > > {
> > > 	int r2;
> > > 
> > > 	r2 = READ_ONCE(*z);
> > > 	WRITE_ONCE(*x, r2);
> > > }
> > > 
> > > P2(int *x, int *y, int *z)
> > > {
> > > 	int r3;
> > > 	int r4;
> > > 
> > > 	r3 = READ_ONCE(*y);
> > > 	smp_rmb();
> > > 	r4 = READ_ONCE(*z);
> > > }
> > > 
> > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
  
Jonas Oberhauser Jan. 31, 2023, 1:56 p.m. UTC | #25
On 1/30/2023 3:39 AM, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
>> You could do it, by turning the relation into one massive recursive
>> definition.
> Which would make pretty much the entire memory model one big recursion.
> I do not want to do that.

Neither do I :D

>
>> Thinking about what the options are:
>> 1) accept the difference and run with it by making it consistent inside the
>> axiomatic model
>> 2) fix it through the recursive definition, which seems to be quite ugly but
>> also consistent with the power operational model as far as I can tell
>> 3) weaken the operational model... somehow
>> 4) just ignore the anomaly
>> 5) ???
>>
>> Currently my least favorite option is 4) since it seems a bit off that the
>> reasoning applies in one specific case of LKMM, more specifically the data
>> race definition which should be equivalent to "the order of the two races
>> isn't fixed", but here the order isn't fixed but it's a data race.
>> I think the patch happens to almost do 1) because the xbstar&int at the end
>> should already imply ordering through the prop&int <= hb rule.
>> What would remain is to also exclude rcu-fence somehow.
> IMO 1) is the best choice.

I have some additional thoughts now. It seems that you could weaken the 
operational model by stating that an A-cumulative fence orders 
propagation of all *external* stores (in addition to all po-earlier 
stores) that propagated to you before the fence is executed.

It seems that on power, from an operational model perspective, there's 
currently no difference between propagation fences ordering all stores 
vs only external stores that propagated to the CPU before the fence is 
executed, because they only have bidirectional (*->W) fences (sync, 
lwsync) and not uni-directional (acquire, release), and so it is not 
possible for a store that is po-later than the barrier to be executed 
before the barrier; i.e., on power, every internal store that propagates 
to a CPU before the fence executes is also po-earlier than the fence.

If power did introduce release stores, I think you could potentially 
create implementations that allow the behavior in the example you have 
given, but I don't think they are the most natural ones:

> {}
>
> P0(int *x, int *y, int *z)
> {
> 	int r1;
>
> 	r1 = READ_ONCE(*x);
> 	smp_store_release(y, 1);
> 	WRITE_ONCE(*z, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> 	int r2;
>
> 	r2 = READ_ONCE(*z);
> 	WRITE_ONCE(*x, r2);
> }
>
> P2(int *x, int *y, int *z)
> {
> 	int r3;
> 	int r4;
>
> 	r3 = READ_ONCE(*y);
> 	smp_rmb();
> 	r4 = READ_ONCE(*z);
> }
>
> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

I could imagine that P0 posts both of its stores in a shared store 
buffer before reading *x, but marks the release store as "not ready".
Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 
reads, and subsequently marks its release store as "ready".
Then the release store is sent to the cache, where P2 reads *y=1 and 
then *z=0.
Finally P0 sends its *z=1 store to the cache.

However, a perhaps more natural implementation would not post the 
release store to the store buffer until it is "ready", in which case the 
order in the store buffer would be *z=1 before *y=1, and in this case 
the release ordering would presumably work like your current operational 
model.

Nevertheless, perhaps this slightly weaker operational model isn't as 
absurd as it sounds. And I think many people wouldn't be shocked if the 
release store didn't provide ordering with *z=1.

Best wishes, jonas
  
Alan Stern Jan. 31, 2023, 3:06 p.m. UTC | #26
On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> I have some additional thoughts now. It seems that you could weaken the
> operational model by stating that an A-cumulative fence orders propagation
> of all *external* stores (in addition to all po-earlier stores) that
> propagated to you before the fence is executed.

How is that a weakening of the operational model?  It's what the 
operational model says right now.  From explanation.txt:

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store
	associated with the fence (e.g., the store part of an
	smp_store_release()).

...  When a fence instruction is executed on CPU C:

	...

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

In theory, we could weaken the operational model by saying that pfences 
order propagation of stores from other CPUs only when those stores are 
read-from by instructions po-before the fence.  But I suspect that's not 
such a good idea.

> It seems that on power, from an operational model perspective, there's
> currently no difference between propagation fences ordering all stores vs
> only external stores that propagated to the CPU before the fence is
> executed, because they only have bidirectional (*->W) fences (sync, lwsync)
> and not uni-directional (acquire, release), and so it is not possible for a
> store that is po-later than the barrier to be executed before the barrier;
> i.e., on power, every internal store that propagates to a CPU before the
> fence executes is also po-earlier than the fence.
> 
> If power did introduce release stores, I think you could potentially create
> implementations that allow the behavior in the example you have given, but I
> don't think they are the most natural ones:

Maybe so.  In any case, it's a moot point.  In fact, I don't know if any 
architecture supporting Linux allows a write that is po-after a release 
store to be reordered before the release store.

> > P0(int *x, int *y, int *z)
> > {
> > 	int r1;
> > 
> > 	r1 = READ_ONCE(*x);
> > 	smp_store_release(y, 1);
> > 	WRITE_ONCE(*z, 1);
> > }
> > 
> > P1(int *x, int *y, int *z)
> > {
> > 	int r2;
> > 
> > 	r2 = READ_ONCE(*z);
> > 	WRITE_ONCE(*x, r2);
> > }
> > 
> > P2(int *x, int *y, int *z)
> > {
> > 	int r3;
> > 	int r4;
> > 
> > 	r3 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r4 = READ_ONCE(*z);
> > }
> > 
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
> 
> I could imagine that P0 posts both of its stores in a shared store buffer
> before reading *x, but marks the release store as "not ready".
> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> and subsequently marks its release store as "ready".

That isn't how release stores are meant to work.  The read of x is 
supposed to be complete before the release store becomes visible to any 
other CPU.  This is true even in C11.

> Then the release store is sent to the cache, where P2 reads *y=1 and then
> *z=0.
> Finally P0 sends its *z=1 store to the cache.
> 
> However, a perhaps more natural implementation would not post the release
> store to the store buffer until it is "ready", in which case the order in
> the store buffer would be *z=1 before *y=1, and in this case the release
> ordering would presumably work like your current operational model.
> 
> Nevertheless, perhaps this slightly weaker operational model isn't as absurd
> as it sounds. And I think many people wouldn't be shocked if the release
> store didn't provide ordering with *z=1.

This issue is one we should discuss with all the other people involved 
in maintaining the LKMM.

Alan
  
Jonas Oberhauser Jan. 31, 2023, 3:33 p.m. UTC | #27
On 1/31/2023 4:06 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
>> I have some additional thoughts now. It seems that you could weaken the
>> operational model by stating that an A-cumulative fence orders propagation
>> of all *external* stores (in addition to all po-earlier stores) that
>> propagated to you before the fence is executed.
> How is that a weakening of the operational model?  It's what the
> operational model says right now.

No, as in the part that you have quoted, it is stated that an 
A-cumulative fence orders propagation of *all* stores that propagated 
to you before the fence is executed.
I'm saying you could weaken this to only cover all *external* stores.

More precisely, I would change

> 	For each other CPU C', any store which propagates to C before
> 	a release fence is executed (including all po-earlier
> 	stores executed on C) is forced to propagate to C' before the
> 	store associated with the release fence does.

Into something like


      For each other CPU C', any *external* store which propagates to C
      before a release fence is executed, as well as any po-earlier
      store executed on C, is forced to propagate to C' before the
      store associated with the release fence does.

The difference is that po-later stores that happen to propagate to C 
before the release fence is executed would no longer be ordered.
That should be consistent with the axiomatic model.


>
> In theory, we could weaken the operational model by saying that pfences
> order propagation of stores from other CPUs only when those stores are
> read-from by instructions po-before the fence.  But I suspect that's not
> such a good idea.

That indeed looks too confusing.


>> It seems that on power, from an operational model perspective, there's
>> currently no difference between propagation fences ordering all stores vs
>> only external stores that propagated to the CPU before the fence is
>> executed, because they only have bidirectional (*->W) fences (sync, lwsync)
>> and not uni-directional (acquire, release), and so it is not possible for a
>> store that is po-later than the barrier to be executed before the barrier;
>> i.e., on power, every internal store that propagates to a CPU before the
>> fence executes is also po-earlier than the fence.
>>
>> If power did introduce release stores, I think you could potentially create
>> implementations that allow the behavior in the example you have given, but I
>> don't think they are the most natural ones:
> Maybe so.  In any case, it's a moot point.  In fact, I don't know if any
> architecture supporting Linux allows a write that is po-after a release
> store to be reordered before the release store.

Arm and RISC-V do, but they are multi-copy-atomic anyway.

>
>>> P0(int *x, int *y, int *z)
>>> {
>>> 	int r1;
>>>
>>> 	r1 = READ_ONCE(*x);
>>> 	smp_store_release(y, 1);
>>> 	WRITE_ONCE(*z, 1);
>>> }
>>>
>>> P1(int *x, int *y, int *z)
>>> {
>>> 	int r2;
>>>
>>> 	r2 = READ_ONCE(*z);
>>> 	WRITE_ONCE(*x, r2);
>>> }
>>>
>>> P2(int *x, int *y, int *z)
>>> {
>>> 	int r3;
>>> 	int r4;
>>>
>>> 	r3 = READ_ONCE(*y);
>>> 	smp_rmb();
>>> 	r4 = READ_ONCE(*z);
>>> }
>>>
>>> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
>> I could imagine that P0 posts both of its stores in a shared store buffer
>> before reading *x, but marks the release store as "not ready".
>> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
>> and subsequently marks its release store as "ready".
> That isn't how release stores are meant to work.  The read of x is
> supposed to be complete before the release store becomes visible to any
> other CPU.

Note that the release store isn't observed until it becomes "ready", so 
it is really indistinguishable whether it had become visible to any 
other CPU.
Indeed stores that aren't marked "ready" would be ignored during 
forwarding, and not allowed to be pushed to the cache.

The reason this kind of implementation seems less natural to me is that 
such an "not ready" store would need to be pushed back in the buffer (if 
it is the head of the buffer and the cache is ready to take a store), 
stall the later stores, or be aborted until it becomes ready.
That just seems to create a lot of hassle for no discernible benefit.
A "not ready" store probably shouldn't be put into a store queue, even 
if the only reason it is not ready is that there are some otherwise 
unrelated reads that haven't completed yet.



> This is true even in C11.

Arguable... The following pseudo-code litmus test should demonstrate this:

P0 {
    int r = read_relaxed(&x);
    store_release(&y,1);
}


P1 {
    int s = read_relaxed(&y);
    store_release(&x,1);
}

In C11, it should be possible to read r==s==1.
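
For concreteness, here is one possible C11 rendering of the pseudo-code
above, assuming read_relaxed/store_release correspond to
memory_order_relaxed/memory_order_release (the p0/p1 helper names are only
illustrative; they are meant to run concurrently on two threads):

#include <stdatomic.h>

static atomic_int x, y;

void p0(void)
{
	/* Relaxed load followed by a release store; no acquire pairing. */
	int r = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, 1, memory_order_release);
	(void)r;
}

void p1(void)
{
	int s = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, 1, memory_order_release);
	(void)s;
}

/*
 * With both loads relaxed there is no synchronizes-with edge between the
 * threads, so the formal C11 model permits the outcome r == 1 && s == 1.
 */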


>> Then the release store is sent to the cache, where P2 reads *y=1 and then
>> *z=0.
>> Finally P0 sends its *z=1 store to the cache.
>>
>> However, a perhaps more natural implementation would not post the release
>> store to the store buffer until it is "ready", in which case the order in
>> the store buffer would be *z=1 before *y=1, and in this case the release
>> ordering would presumably work like your current operational model.
>>
>> Nevertheless, perhaps this slightly weaker operational model isn't as absurd
>> as it sounds. And I think many people wouldn't be shocked if the release
>> store didn't provide ordering with *z=1.
> This issue is one we should discuss with all the other people involved
> in maintaining the LKMM.
>
> Alan

Sure.

Btw, how to proceed for your SRCU patch and this one?
Are you planning to make any changes? I think the version you have is ok 
if you don't think the patch is improved by anything I brought up.

Any additional concerns/changes for this patch?

Best wishes, jonas
  
Alan Stern Jan. 31, 2023, 4:55 p.m. UTC | #28
On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
> 
> 
> On 1/31/2023 4:06 PM, Alan Stern wrote:
> > On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> > > I have some additional thoughts now. It seems that you could weaken the
> > > operational model by stating that an A-cumulative fence orders propagation
> > > of all *external* stores (in addition to all po-earlier stores) that
> > > propagated to you before the fence is executed.
> > How is that a weakening of the operational model?  It's what the
> > operational model says right now.
> 
> No, as in the part that you have quoted, it is stated that an A-cumulative
> fence orders propagation of *all* stores that propagated to you before the
> fence is executed.
> I'm saying you could weaken this to only cover all *external* stores.

Okay, now I understand.

> More precisely, I would change
> 
> > 	For each other CPU C', any store which propagates to C before
> > 	a release fence is executed (including all po-earlier
> > 	stores executed on C) is forced to propagate to C' before the
> > 	store associated with the release fence does.
> 
> Into something like
> 
> 
>      For each other CPU C', any *external* store which propagates to C
>      before a release fence is executed, as well as any po-earlier
>      store executed on C, is forced to propagate to C' before the
>      store associated with the release fence does.
> 
> The difference is that po-later stores that happen to propagate to C before
> the release fence is executed would no longer be ordered.
> That should be consistent with the axiomatic model.

I had to check that it wouldn't affect the (xbstar & int) part of vis, 
but it looks all right.  This seems like a reasonable change.

However, it only fixes part of the problem.  Suppose an external write 
is read by an instruction po-after the release-store, but the read 
executes before the release-store.  The operational model would still 
say the external write has to obey the propagation ordering, whereas the 
formal model doesn't require it.

> > > > P0(int *x, int *y, int *z)
> > > > {
> > > > 	int r1;
> > > > 
> > > > 	r1 = READ_ONCE(*x);
> > > > 	smp_store_release(y, 1);
> > > > 	WRITE_ONCE(*z, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y, int *z)
> > > > {
> > > > 	int r2;
> > > > 
> > > > 	r2 = READ_ONCE(*z);
> > > > 	WRITE_ONCE(*x, r2);
> > > > }
> > > > 
> > > > P2(int *x, int *y, int *z)
> > > > {
> > > > 	int r3;
> > > > 	int r4;
> > > > 
> > > > 	r3 = READ_ONCE(*y);
> > > > 	smp_rmb();
> > > > 	r4 = READ_ONCE(*z);
> > > > }
> > > > 
> > > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
> > > I could imagine that P0 posts both of its stores in a shared store buffer
> > > before reading *x, but marks the release store as "not ready".
> > > Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> > > and subsequently marks its release store as "ready".
> > That isn't how release stores are meant to work.  The read of x is
> > supposed to be complete before the release store becomes visible to any
> > other CPU.
> 
> Note that the release store isn't observed until it becomes "ready", so it
> is really indistinguishable whether it had become visible to any other
> CPU.
> Indeed stores that aren't marked "ready" would be ignored during forwarding,
> and not allowed to be pushed to the cache.

Oops, I mixed up a couple of the accesses.  Okay, yes, this mechanism 
will allow writes that are po-after a release store but execute before 
it to evade the propagation restriction.

> The reason this kind of implementation seems less natural to me is that such
> an "not ready" store would need to be pushed back in the buffer (if it is
> the head of the buffer and the cache is ready to take a store), stall the
> later stores, or be aborted until it becomes ready.
> That just seems to create a lot of hassle for no discernible benefit.
> A "not ready" store probably shouldn't be put into a store queue, even if
> the only reason it is not ready is that there are some otherwise unrelated
> reads that haven't completed yet.
> 
> 
> 
> > This is true even in C11.
> 
> Arguable... The following pseudo-code litmus test should demonstrate this:
> 
> P0 {
>    int r = read_relaxed(&x);
>    store_release(&y,1);
> }
> 
> 
> P1 {
>    int s = read_relaxed(&y);
>    store_release(&x,1);
> }
> 
> In C11, it should be possible to read r==s==1.

True, in C11 releases don't mean anything unless they're paired with 
acquires.  But if your P1 had been

	int s = read_acquire(&y);
	write_relaxed(&x, 1);

then r = s = 1 would not be allowed.  And presumably the same object 
code would be generated for P0 either way, particularly if P1 was in a 
separate compilation unit (link-time optimization notwithstanding).
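
For comparison, a C11 sketch of this acquire-paired variant, again assuming
the read_relaxed/store_release operations map to the corresponding
memory_order_* flavors (function names are only illustrative):

#include <stdatomic.h>

static atomic_int x, y;

void p0(void)
{
	int r = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, 1, memory_order_release);
	(void)r;
}

void p1(void)
{
	/* Acquire load pairs with p0's release store. */
	int s = atomic_load_explicit(&y, memory_order_acquire);
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	(void)s;
}

/*
 * If s == 1, the acquire load synchronizes with p0's release store, so
 * p0's load of x happens before p1's store to x and cannot read from it;
 * hence r == 1 && s == 1 is forbidden.
 */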

> Btw, how to proceed for your SRCU patch and this one?
> Are you planning to make any changes? I think the version you have is ok if
> you don't think the patch is improved by anything I brought up.

I don't see any need to change the SRCU patch at this point, other than 
to improve the attributions.

> Any additional concerns/changes for this patch?

It should give the same data-race diagnostics as the current LKMM.  This 
probably means the patch will need to punch up the definitions of 
*-pre-bounded and *-post-bounded, unless you can think of a better 
approach.

Alan
  
Jonas Oberhauser Feb. 1, 2023, 10:37 a.m. UTC | #29
On 1/31/2023 5:55 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
>>
>> More precisely, I would change
>>
>>> 	For each other CPU C', any store which propagates to C before
>>> 	a release fence is executed (including all po-earlier
>>> 	stores executed on C) is forced to propagate to C' before the
>>> 	store associated with the release fence does.
>> Into something like
>>
>>
>>       For each other CPU C', any *external* store which propagates to C
>>       before a release fence is executed, as well as any po-earlier
>>       store executed on C, is forced to propagate to C' before the
>>       store associated with the release fence does.
>>
>> The difference is that po-later stores that happen to propagate to C before
>> the release fence is executed would no longer be ordered.
>> That should be consistent with the axiomatic model.
> I had to check that it wouldn't affect the (xbstar & int) part of vis,
> but it looks all right.  This seems like a reasonable change.
>
> However, it only fixes part of the problem.  Suppose an external write
> is read by an instruction po-after the release-store, but the read
> executes before the release-store.  The operational model would still
> say the external write has to obey the propagation ordering, whereas the
> formal model doesn't require it.


Ouch. I had only thought about the [W];(xbstar & int);[Release] case, 
but there's also the rfe;(xbstar & int);[Release] case....


>> Any additional concerns/changes for this patch?
> It should give the same data-race diagnostics as the current LKMM.  This
> probably means the patch will need to punch up the definitions of
> *-pre-bounded and *-post-bounded, unless you can think of a better
> approach.
>
> Alan

Seems the one-thing-per-patch mentality hasn't yet become ingrained in me. 
Thanks!

jonas
  

Patch

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 6e531457bb73..815fdafacaef 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -36,7 +36,9 @@  let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
 	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
-	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
+	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
+let gp = po ; [Sync-rcu | Sync-srcu] ; po?
+let strong-fence = mb | gp |
 (*
  * Note: The po-unlock-lock-po relation only passes the lock to the direct
  * successor, perhaps giving the impression that the ordering of the
@@ -50,10 +52,9 @@  let mb = ([M] ; fencerel(Mb) ; [M]) |
  *)
 	([M] ; po-unlock-lock-po ;
 		[After-unlock-lock] ; po ; [M])
-let gp = po ; [Sync-rcu | Sync-srcu] ; po?
-let strong-fence = mb | gp
 
-let nonrw-fence = strong-fence | po-rel | acq-po
+
+let nonrw-fence = mb | gp | po-rel | acq-po
 let fence = nonrw-fence | wmb | rmb
 let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
 		Before-atomic | After-atomic | Acquire | Release |