[v3] tools/memory-model: Make ppo a subrelation of po

Message ID 20230224135251.24989-1-jonas.oberhauser@huaweicloud.com
State New
Series [v3] tools/memory-model: Make ppo a subrelation of po

Commit Message

Jonas Oberhauser Feb. 24, 2023, 1:52 p.m. UTC
  As stated in the documentation and implied by its name, the ppo
(preserved program order) relation is intended to link po-earlier
to po-later instructions under certain conditions.  However, a
corner case currently allows instructions to be linked by ppo that
are not executed by the same thread, i.e., instructions are being
linked that have no po relation.

This happens due to the mb/strong-fence/fence relations, which (as
one case) provide order when locks are passed between threads
followed by an smp_mb__after_unlock_lock() fence.  This is
illustrated in the following litmus test (as can be seen when using
herd7 with `doshow ppo`):

P0(int *x, int *y)
{
    spin_lock(x);
    spin_unlock(x);
}

P1(int *x, int *y)
{
    spin_lock(x);
    smp_mb__after_unlock_lock();
    *y = 1;
}

The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
P0 passes a lock to P1 which then uses this fence.
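
For reference, a complete litmus file along these lines could look as
follows (a sketch: the test name, the spinlock_t parameter type, and the
exists clause are illustrative additions, not part of this patch):

C unlock-lock-ppo-sketch

{}

P0(spinlock_t *x, int *y)
{
    spin_lock(x);
    spin_unlock(x);
}

P1(spinlock_t *x, int *y)
{
    spin_lock(x);
    smp_mb__after_unlock_lock();
    *y = 1;
}

exists (y=1)

Running this through herd7 with tools/memory-model/linux-kernel.cfg and
something like `-doshow ppo -gv` (exact flags depend on the herd7
version) draws the cross-thread ppo edge described above.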

The patch makes ppo a subrelation of po by letting fence contribute
to ppo only in case the fence links events of the same thread.

Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
 tools/memory-model/linux-kernel.cat | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Alan Stern Feb. 24, 2023, 3:32 p.m. UTC | #1
On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions.  However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
> 
> This happens due to the mb/strong-fence/fence relations, which (as
> one case) provide order when locks are passed between threads
> followed by an smp_mb__after_unlock_lock() fence.  This is
> illustrated in the following litmus test (as can be seen when using
> herd7 with `doshow ppo`):
> 
> P0(int *x, int *y)
> {
>     spin_lock(x);
>     spin_unlock(x);
> }
> 
> P1(int *x, int *y)
> {
>     spin_lock(x);
>     smp_mb__after_unlock_lock();
>     *y = 1;
> }
> 
> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
> 
> The patch makes ppo a subrelation of po by letting fence contribute
> to ppo only in case the fence links events of the same thread.
> 
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> ---
>  tools/memory-model/linux-kernel.cat | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
>  
>  (* Propagation: Ordering from release operations and strong fences. *)
>  let A-cumul(r) = (rfe ; [Marked])? ; r

Acked-by: Alan Stern <stern@rowland.harvard.edu>
  
Paul E. McKenney Feb. 24, 2023, 6:37 p.m. UTC | #2
On Fri, Feb 24, 2023 at 10:32:43AM -0500, Alan Stern wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > As stated in the documentation and implied by its name, the ppo
> > (preserved program order) relation is intended to link po-earlier
> > to po-later instructions under certain conditions.  However, a
> > corner case currently allows instructions to be linked by ppo that
> > are not executed by the same thread, i.e., instructions are being
> > linked that have no po relation.
> > 
> > This happens due to the mb/strong-fence/fence relations, which (as
> > one case) provide order when locks are passed between threads
> > followed by an smp_mb__after_unlock_lock() fence.  This is
> > illustrated in the following litmus test (as can be seen when using
> > herd7 with `doshow ppo`):
> > 
> > P0(int *x, int *y)
> > {
> >     spin_lock(x);
> >     spin_unlock(x);
> > }
> > 
> > P1(int *x, int *y)
> > {
> >     spin_lock(x);
> >     smp_mb__after_unlock_lock();
> >     *y = 1;
> > }
> > 
> > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > P0 passes a lock to P1 which then uses this fence.
> > 
> > The patch makes ppo a subrelation of po by letting fence contribute
> > to ppo only in case the fence links events of the same thread.
> > 
> > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > ---
> >  tools/memory-model/linux-kernel.cat | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index cfc1b8fd46da..adf3c4f41229 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> >  let overwrite = co | fr
> >  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> >  
> >  (* Propagation: Ordering from release operations and strong fences. *)
> >  let A-cumul(r) = (rfe ; [Marked])? ; r
> 
> Acked-by: Alan Stern <stern@rowland.harvard.edu>

Queued for the v6.4 merge window (not the current one), thank you both!

							Thanx, Paul
  
Paul E. McKenney Feb. 26, 2023, 1:01 a.m. UTC | #3
On Fri, Feb 24, 2023 at 10:37:58AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 24, 2023 at 10:32:43AM -0500, Alan Stern wrote:
> > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > > As stated in the documentation and implied by its name, the ppo
> > > (preserved program order) relation is intended to link po-earlier
> > > to po-later instructions under certain conditions.  However, a
> > > corner case currently allows instructions to be linked by ppo that
> > > are not executed by the same thread, i.e., instructions are being
> > > linked that have no po relation.
> > > 
> > > This happens due to the mb/strong-fence/fence relations, which (as
> > > one case) provide order when locks are passed between threads
> > > followed by an smp_mb__after_unlock_lock() fence.  This is
> > > illustrated in the following litmus test (as can be seen when using
> > > herd7 with `doshow ppo`):
> > > 
> > > P0(int *x, int *y)
> > > {
> > >     spin_lock(x);
> > >     spin_unlock(x);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >     spin_lock(x);
> > >     smp_mb__after_unlock_lock();
> > >     *y = 1;
> > > }
> > > 
> > > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > > P0 passes a lock to P1 which then uses this fence.
> > > 
> > > The patch makes ppo a subrelation of po by letting fence contribute
> > > to ppo only in case the fence links events of the same thread.
> > > 
> > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > ---
> > >  tools/memory-model/linux-kernel.cat | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > > index cfc1b8fd46da..adf3c4f41229 100644
> > > --- a/tools/memory-model/linux-kernel.cat
> > > +++ b/tools/memory-model/linux-kernel.cat
> > > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> > >  let overwrite = co | fr
> > >  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> > >  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> > >  
> > >  (* Propagation: Ordering from release operations and strong fences. *)
> > >  let A-cumul(r) = (rfe ; [Marked])? ; r
> > 
> > Acked-by: Alan Stern <stern@rowland.harvard.edu>
> 
> Queued for the v6.4 merge window (not the current one), thank you both!

I tested both Alan's and Jonas's commits.  These do not seem to produce
any significant differences in behavior, which is of course a good thing.

Here are the differences and a few oddities:

auto/C-RR-G+RR-R+RR-G+RR-G+RR-R+RR-R+RR-R+RR-R.litmus

	Timed out with changes, completed without them.  But it completed
	in 558.29 seconds against a limit of 600 seconds, so never mind.

auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus

	Timed out with changes, completed without them.  But it completed
	in 580.01 seconds against a limit of 600 seconds, so never mind. *

auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-R+RR-R.litmus

	Timed out with changes, completed without them.  But it completed
	in 522.29 seconds against a limit of 600 seconds, so never mind.

auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-G+RR-R.litmus

	Timed out with changes, completed without them.  But it completed
	in 588.70 seconds against a limit of 600 seconds, so never mind.

All tests that didn't time out matched Results comments.

The reason I am so cavalier about the times is that I was foolishly
running rcutorture concurrently with the new-version testing.  I re-ran
them, and only auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus
timed out the second time.  I re-ran it again, but without a time limit,
and it completed properly in 364.8 seconds compared to 580.  A rerun
took 360.1 seconds.  So things have slowed down a bit.

A few other oddities:

litmus/auto/C-LB-Lww+R-OC.litmus

	Both versions flag a data race, which I am not seeing.	It appears
	to me that P1's store to u0 cannot happen unless P0's store
	has completed.  So what am I missing here?

litmus/auto/C-LB-Lrw+R-OC.litmus
litmus/auto/C-LB-Lww+R-Oc.litmus
litmus/auto/C-LB-Lrw+R-Oc.litmus
litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
litmus/auto/C-LB-Lww+R-A+R-OC.litmus

	Ditto.  (There are likely more.)

Thoughts?

							Thanx, Paul
  
Alan Stern Feb. 26, 2023, 2:29 a.m. UTC | #4
On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> A few other oddities:
> 
> litmus/auto/C-LB-Lww+R-OC.litmus
> 
> 	Both versions flag a data race, which I am not seeing.	It appears
> 	to me that P1's store to u0 cannot happen unless P0's store
> 	has completed.  So what am I missing here?

The LKMM doesn't believe that a control or data dependency orders a 
plain write after a marked read.  Hence in this test it thinks that P1's 
store to u0 can happen before the load of x1.  I don't remember why we 
did it this way -- probably we just wanted to minimize the restrictions 
on when plain accesses can execute.  (I do remember the reason for 
making address dependencies induce order; it was so RCU would work.)
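
To make the shape concrete, here is a stripped-down sketch (a
hypothetical test, not one of the auto/ tests above) of a marked read
followed by a control dependency to a plain write:

C LB-ctrl-plain-sketch

{}

P0(int *x, int *y)
{
	*y = 0x10001;
	smp_store_release(x, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	if (r1)
		*y = 1;
}

exists (1:r1=1 /\ y=0x10001)

Since the control dependency from the READ_ONCE() to the plain "*y = 1"
is not treated as ordering, the LKMM allows P1's store to execute before
its load, and so it flags a data race between the two plain stores to *y
and allows the exists clause.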

The patch below will change what the LKMM believes.  It eliminates the 
positive outcome of the litmus test and the data race.  Should it be 
adopted into the memory model?

> litmus/auto/C-LB-Lrw+R-OC.litmus
> litmus/auto/C-LB-Lww+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> 
> 	Ditto.  (There are likely more.)

I haven't looked at these but they're probably similar.

Alan



--- usb-devel.orig/tools/memory-model/linux-kernel.cat
+++ usb-devel/tools/memory-model/linux-kernel.cat
@@ -172,7 +172,7 @@ let vis = cumul-fence* ; rfe? ; [Marked]
 	((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
 
 (* Boundaries for lifetimes of plain accesses *)
-let w-pre-bounded = [Marked] ; (addr | fence)?
+let w-pre-bounded = [Marked] ; (rwdep | fence)?
 let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
 	([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
 let w-post-bounded = fence? ; [Marked] ; rmw-sequence
  
Paul E. McKenney Feb. 26, 2023, 2:58 a.m. UTC | #5
On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 24, 2023 at 10:37:58AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 24, 2023 at 10:32:43AM -0500, Alan Stern wrote:
> > > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > > > As stated in the documentation and implied by its name, the ppo
> > > > (preserved program order) relation is intended to link po-earlier
> > > > to po-later instructions under certain conditions.  However, a
> > > > corner case currently allows instructions to be linked by ppo that
> > > > are not executed by the same thread, i.e., instructions are being
> > > > linked that have no po relation.
> > > > 
> > > > This happens due to the mb/strong-fence/fence relations, which (as
> > > > one case) provide order when locks are passed between threads
> > > > followed by an smp_mb__after_unlock_lock() fence.  This is
> > > > illustrated in the following litmus test (as can be seen when using
> > > > herd7 with `doshow ppo`):
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > >     spin_lock(x);
> > > >     spin_unlock(x);
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > >     spin_lock(x);
> > > >     smp_mb__after_unlock_lock();
> > > >     *y = 1;
> > > > }
> > > > 
> > > > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > > > P0 passes a lock to P1 which then uses this fence.
> > > > 
> > > > The patch makes ppo a subrelation of po by letting fence contribute
> > > > to ppo only in case the fence links events of the same thread.
> > > > 
> > > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > > ---
> > > >  tools/memory-model/linux-kernel.cat | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > > > index cfc1b8fd46da..adf3c4f41229 100644
> > > > --- a/tools/memory-model/linux-kernel.cat
> > > > +++ b/tools/memory-model/linux-kernel.cat
> > > > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> > > >  let overwrite = co | fr
> > > >  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> > > >  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > > > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > > > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> > > >  
> > > >  (* Propagation: Ordering from release operations and strong fences. *)
> > > >  let A-cumul(r) = (rfe ; [Marked])? ; r
> > > 
> > > Acked-by: Alan Stern <stern@rowland.harvard.edu>
> > 
> > Queued for the v6.4 merge window (not the current one), thank you both!
> 
> I tested both Alan's and Jonas's commits.  These do not seem to produce
> any significant differences in behavior, which is of course a good thing.
> 
> Here are the differences and a few oddities:
> 
> auto/C-RR-G+RR-R+RR-G+RR-G+RR-R+RR-R+RR-R+RR-R.litmus
> 
> 	Timed out with changes, completed without them.  But it completed
> 	in 558.29 seconds against a limit of 600 seconds, so never mind.
> 
> auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus
> 
> 	Timed out with changes, completed without them.  But it completed
> 	in 580.01 seconds against a limit of 600 seconds, so never mind. *
> 
> auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-R+RR-R.litmus
> 
> 	Timed out with changes, completed without them.  But it completed
> 	in 522.29 seconds against a limit of 600 seconds, so never mind.
> 
> auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-G+RR-R.litmus
> 
> 	Timed out with changes, completed without them.  But it completed
> 	in 588.70 seconds against a limit of 600 seconds, so never mind.
> 
> All tests that didn't time out matched Results comments.
> 
> The reason I am so cavalier about the times is that I was foolishly
> running rcutorture concurrently with the new-version testing.  I re-ran
> them, and only auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus
> timed out the second time.  I re-ran it again, but without a time limit,
> and it completed properly in 364.8 seconds compared to 580.  A rerun
> took 360.1 seconds.  So things have slowed down a bit.
> 
> A few other oddities:
> 
> litmus/auto/C-LB-Lww+R-OC.litmus
> 
> 	Both versions flag a data race, which I am not seeing.	It appears
> 	to me that P1's store to u0 cannot happen unless P0's store
> 	has completed.  So what am I missing here?
> 
> litmus/auto/C-LB-Lrw+R-OC.litmus
> litmus/auto/C-LB-Lww+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> 
> 	Ditto.  (There are likely more.)
> 
> Thoughts?

And what happened here was that I conflated LKMM with the C++ memory
model, producing something stronger than either.

Never mind!!!

							Thanx, Paul
  
Paul E. McKenney Feb. 26, 2023, 3:03 a.m. UTC | #6
On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > A few other oddities:
> > 
> > litmus/auto/C-LB-Lww+R-OC.litmus
> > 
> > 	Both versions flag a data race, which I am not seeing.	It appears
> > 	to me that P1's store to u0 cannot happen unless P0's store
> > 	has completed.  So what am I missing here?
> 
> The LKMM doesn't believe that a control or data dependency orders a 
> plain write after a marked read.  Hence in this test it thinks that P1's 
> store to u0 can happen before the load of x1.  I don't remember why we 
> did it this way -- probably we just wanted to minimize the restrictions 
> on when plain accesses can execute.  (I do remember the reason for 
> making address dependencies induce order; it was so RCU would work.)
> 
> The patch below will change what the LKMM believes.  It eliminates the 
> positive outcome of the litmus test and the data race.  Should it be 
> adopted into the memory model?

Excellent question!

As noted separately, I was conflating the C++ memory model and LKMM.

> > litmus/auto/C-LB-Lrw+R-OC.litmus
> > litmus/auto/C-LB-Lww+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> > litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> > 
> > 	Ditto.  (There are likely more.)
> 
> I haven't looked at these but they're probably similar.

Let me give this patch a go and see what it does.

							Thanx, Paul

> Alan
> 
> 
> 
> --- usb-devel.orig/tools/memory-model/linux-kernel.cat
> +++ usb-devel/tools/memory-model/linux-kernel.cat
> @@ -172,7 +172,7 @@ let vis = cumul-fence* ; rfe? ; [Marked]
>  	((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
>  
>  (* Boundaries for lifetimes of plain accesses *)
> -let w-pre-bounded = [Marked] ; (addr | fence)?
> +let w-pre-bounded = [Marked] ; (rwdep | fence)?
>  let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
>  	([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
>  let w-post-bounded = fence? ; [Marked] ; rmw-sequence
  
Boqun Feng Feb. 26, 2023, 3:09 a.m. UTC | #7
On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > A few other oddities:
> > 
> > litmus/auto/C-LB-Lww+R-OC.litmus
> > 
> > 	Both versions flag a data race, which I am not seeing.	It appears
> > 	to me that P1's store to u0 cannot happen unless P0's store
> > 	has completed.  So what am I missing here?
> 
> The LKMM doesn't believe that a control or data dependency orders a 
> plain write after a marked read.  Hence in this test it thinks that P1's 
> store to u0 can happen before the load of x1.  I don't remember why we 
> did it this way -- probably we just wanted to minimize the restrictions 
> on when plain accesses can execute.  (I do remember the reason for 
> making address dependencies induce order; it was so RCU would work.)
> 

Because a plain store can be optimized as a "store only if not equal"?
As the following sentences in explanation.txt say:

	The need to distinguish between r- and w-bounding raises yet another
	issue.  When the source code contains a plain store, the compiler is
	allowed to put plain loads of the same location into the object code.
	For example, given the source code:

		x = 1;

	the compiler is theoretically allowed to generate object code that
	looks like:

		if (x != 1)
			x = 1;

	thereby adding a load (and possibly replacing the store entirely).
	For this reason, whenever the LKMM requires a plain store to be
	w-pre-bounded or w-post-bounded by a marked access, it also requires
	the store to be r-pre-bounded or r-post-bounded, so as to handle cases
	where the compiler adds a load.

Regards,
Boqun

> The patch below will change what the LKMM believes.  It eliminates the 
> positive outcome of the litmus test and the data race.  Should it be 
> adopted into the memory model?
> 
> > litmus/auto/C-LB-Lrw+R-OC.litmus
> > litmus/auto/C-LB-Lww+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> > litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> > 
> > 	Ditto.  (There are likely more.)
> 
> I haven't looked at these but they're probably similar.
> 
> Alan
> 
> 
> 
> --- usb-devel.orig/tools/memory-model/linux-kernel.cat
> +++ usb-devel/tools/memory-model/linux-kernel.cat
> @@ -172,7 +172,7 @@ let vis = cumul-fence* ; rfe? ; [Marked]
>  	((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
>  
>  (* Boundaries for lifetimes of plain accesses *)
> -let w-pre-bounded = [Marked] ; (addr | fence)?
> +let w-pre-bounded = [Marked] ; (rwdep | fence)?
>  let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
>  	([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
>  let w-post-bounded = fence? ; [Marked] ; rmw-sequence
  
Alan Stern Feb. 26, 2023, 3:30 a.m. UTC | #8
On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > A few other oddities:
> > > 
> > > litmus/auto/C-LB-Lww+R-OC.litmus
> > > 
> > > 	Both versions flag a data race, which I am not seeing.	It appears
> > > 	to me that P1's store to u0 cannot happen unless P0's store
> > > 	has completed.  So what am I missing here?
> > 
> > The LKMM doesn't believe that a control or data dependency orders a 
> > plain write after a marked read.  Hence in this test it thinks that P1's 
> > store to u0 can happen before the load of x1.  I don't remember why we 
> > did it this way -- probably we just wanted to minimize the restrictions 
> > on when plain accesses can execute.  (I do remember the reason for 
> > making address dependencies induce order; it was so RCU would work.)
> > 
> 
> Because a plain store can be optimized as a "store only if not equal"?
> As the following sentences in explanation.txt say:
> 
> 	The need to distinguish between r- and w-bounding raises yet another
> 	issue.  When the source code contains a plain store, the compiler is
> 	allowed to put plain loads of the same location into the object code.
> 	For example, given the source code:
> 
> 		x = 1;
> 
> 	the compiler is theoretically allowed to generate object code that
> 	looks like:
> 
> 		if (x != 1)
> 			x = 1;
> 
> 	thereby adding a load (and possibly replacing the store entirely).
> 	For this reason, whenever the LKMM requires a plain store to be
> 	w-pre-bounded or w-post-bounded by a marked access, it also requires
> 	the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> 	where the compiler adds a load.

Good guess; maybe that was the reason.  Ironically, in this case the 
store _is_ r-pre-bounded, because there's an smp_rmb() in the litmus 
test just before it.  So perhaps the original reason is not valid now 
that the memory model explicitly includes tests for stores being 
r-pre/post-bounded.

Alan
  
Jonas Oberhauser Feb. 26, 2023, 11:17 a.m. UTC | #9
On 2/26/2023 4:30 AM, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
>> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
>>> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
>>>> A few other oddities:
>>>>
>>>> litmus/auto/C-LB-Lww+R-OC.litmus
>>>>
>>>> 	Both versions flag a data race, which I am not seeing.	It appears
>>>> 	to me that P1's store to u0 cannot happen unless P0's store
>>>> 	has completed.  So what am I missing here?
>>> The LKMM doesn't believe that a control or data dependency orders a
>>> plain write after a marked read.  Hence in this test it thinks that P1's
>>> store to u0 can happen before the load of x1.  I don't remember why we
>>> did it this way -- probably we just wanted to minimize the restrictions
>>> on when plain accesses can execute.  (I do remember the reason for
>>> making address dependencies induce order; it was so RCU would work.)
>>>
>> Because a plain store can be optimized as a "store only if not equal"?
>> As the following sentences in explanation.txt say:
>>
>> 	The need to distinguish between r- and w-bounding raises yet another
>> 	issue.  When the source code contains a plain store, the compiler is
>> 	allowed to put plain loads of the same location into the object code.
>> 	For example, given the source code:
>>
>> 		x = 1;
>>
>> 	the compiler is theoretically allowed to generate object code that
>> 	looks like:
>>
>> 		if (x != 1)
>> 			x = 1;
>>
>> 	thereby adding a load (and possibly replacing the store entirely).
>> 	For this reason, whenever the LKMM requires a plain store to be
>> 	w-pre-bounded or w-post-bounded by a marked access, it also requires
>> 	the store to be r-pre-bounded or r-post-bounded, so as to handle cases
>> 	where the compiler adds a load.
> Good guess; maybe that was the reason.  [...]
> So perhaps the original reason is not valid now
> that the memory model explicitly includes tests for stores being
> r-pre/post-bounded.
>
> Alan

I agree, I think you could relax that condition.

Note there's also rw-xbstar (used with fr) which doesn't check for
r-pre-bounded, but it should be ok. That's because only reads would be
unordered, as a result the read (in the if (x != ..) x=..) should provide
the correct value. The store would be issued as necessary, and the issued
store would still be ordered correctly w.r.t the read.

Best wishes,
jonas
  
Joel Fernandes Feb. 26, 2023, 4:23 p.m. UTC | #10
On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions.  However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
> 
> This happens due to the mb/strong-fence/fence relations, which (as
> one case) provide order when locks are passed between threads
> followed by an smp_mb__after_unlock_lock() fence.  This is
> illustrated in the following litmus test (as can be seen when using
> herd7 with `doshow ppo`):
> 
> P0(int *x, int *y)
> {
>     spin_lock(x);
>     spin_unlock(x);
> }
> 
> P1(int *x, int *y)
> {
>     spin_lock(x);
>     smp_mb__after_unlock_lock();
>     *y = 1;
> }
> 
> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
> 
> The patch makes ppo a subrelation of po by letting fence contribute
> to ppo only in case the fence links events of the same thread.
> 
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> ---
>  tools/memory-model/linux-kernel.cat | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)

Alternatively, could it be the following appended diff? It then requires
only a single 'int' in ->ppo and prevents future similar issues caused by
sub relations.  It also makes clear that ->ppo can only be CPU-internal.

Or would that not work for some reason? For the test you shared at least, the
graphs are the same.

Either way:

Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---8<-----------------------

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 07f884f9b2bf..63052d1628e9 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
 let to-r = addr | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
+let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
 
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = (rfe ; [Marked])? ; r
  
Alan Stern Feb. 26, 2023, 4:51 p.m. UTC | #11
On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
> 
> 
> On 2/26/2023 4:30 AM, Alan Stern wrote:
> > On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> > > On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > > > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > > > A few other oddities:
> > > > > 
> > > > > litmus/auto/C-LB-Lww+R-OC.litmus
> > > > > 
> > > > > 	Both versions flag a data race, which I am not seeing.	It appears
> > > > > 	to me that P1's store to u0 cannot happen unless P0's store
> > > > > 	has completed.  So what am I missing here?
> > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > > store to u0 can happen before the load of x1.  I don't remember why we
> > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > on when plain accesses can execute.  (I do remember the reason for
> > > > making address dependencies induce order; it was so RCU would work.)
> > > > 
> > > Because a plain store can be optimized as a "store only if not equal"?
> > > As the following sentences in explanation.txt say:
> > > 
> > > 	The need to distinguish between r- and w-bounding raises yet another
> > > 	issue.  When the source code contains a plain store, the compiler is
> > > 	allowed to put plain loads of the same location into the object code.
> > > 	For example, given the source code:
> > > 
> > > 		x = 1;
> > > 
> > > 	the compiler is theoretically allowed to generate object code that
> > > 	looks like:
> > > 
> > > 		if (x != 1)
> > > 			x = 1;
> > > 
> > > 	thereby adding a load (and possibly replacing the store entirely).
> > > 	For this reason, whenever the LKMM requires a plain store to be
> > > 	w-pre-bounded or w-post-bounded by a marked access, it also requires
> > > 	the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> > > 	where the compiler adds a load.
> > Good guess; maybe that was the reason.  [...]
> > So perhaps the original reason is not valid now
> > that the memory model explicitly includes tests for stores being
> > r-pre/post-bounded.
> > 
> > Alan
> 
> I agree, I think you could relax that condition.

Here's a related question to think about.  Suppose a compiler does make 
this change, adding a load-and-test in front of a store.  Can that load 
cause a data race?

Normally I'd say no, because compilers aren't allowed to create data 
races where one didn't already exist.  But that restriction is part of 
the C/C++ standard, and what we consider to be a data race differs from 
what the standard considers.

So what's the answer?  Is the compiler allowed to translate:

	r1 = READ_ONCE(*x);
	if (r1)
		*y = 1;

into something resembling:

	r1 = READ_ONCE(*x);
	rtemp = *y;
	if (r1) {
		if (rtemp != 1)
			*y = 1;
	}

(Note that whether the load to rtemp occurs inside the "if (r1)" 
conditional or not makes no difference; either way the CPU can execute 
it before testing the condition.  Even before reading the value of *x.)

_If_ we assume that these manufactured loads can never cause a data race 
then it should be safe to remove the r-pre/post-bounded tests for plain 
writes.

But what if rtemp reads from a plain write that was torn, and the 
intermediate value it observes happens to be 1, even though neither the 
initial nor the final value of *y was 1?

> Note there's also rw-xbstar (used with fr) which doesn't check for
> r-pre-bounded, but it should be ok. That's because only reads would be
> unordered, as a result the read (in the if (x != ..) x=..) should provide
> the correct value. The store would be issued as necessary, and the issued
> store would still be ordered correctly w.r.t the read.

That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the 
write gets changed to a read there's no need for rw-xbstar to check 
r-pre-bounded, because then rw-race would be comparing a read with 
another read (instead of with a write) and so there would be no 
possibility of a race in any case.

Alan
  
Paul E. McKenney Feb. 26, 2023, 6:45 p.m. UTC | #12
On Sun, Feb 26, 2023 at 11:51:15AM -0500, Alan Stern wrote:
> On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
> > On 2/26/2023 4:30 AM, Alan Stern wrote:
> > > On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> > > > On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > > > > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > > > > A few other oddities:
> > > > > > 
> > > > > > litmus/auto/C-LB-Lww+R-OC.litmus
> > > > > > 
> > > > > > 	Both versions flag a data race, which I am not seeing.	It appears
> > > > > > 	to me that P1's store to u0 cannot happen unless P0's store
> > > > > > 	has completed.  So what am I missing here?
> > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > > > store to u0 can happen before the load of x1.  I don't remember why we
> > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > on when plain accesses can execute.  (I do remember the reason for
> > > > > making address dependencies induce order; it was so RCU would work.)
> > > > > 
> > > > Because a plain store can be optimized as a "store only if not equal"?
> > > > As the following sentences in explanation.txt say:
> > > > 
> > > > 	The need to distinguish between r- and w-bounding raises yet another
> > > > 	issue.  When the source code contains a plain store, the compiler is
> > > > 	allowed to put plain loads of the same location into the object code.
> > > > 	For example, given the source code:
> > > > 
> > > > 		x = 1;
> > > > 
> > > > 	the compiler is theoretically allowed to generate object code that
> > > > 	looks like:
> > > > 
> > > > 		if (x != 1)
> > > > 			x = 1;
> > > > 
> > > > 	thereby adding a load (and possibly replacing the store entirely).
> > > > 	For this reason, whenever the LKMM requires a plain store to be
> > > > 	w-pre-bounded or w-post-bounded by a marked access, it also requires
> > > > 	the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> > > > 	where the compiler adds a load.
> > > Good guess; maybe that was the reason.  [...]
> > > So perhaps the original reason is not valid now
> > > that the memory model explicitly includes tests for stores being
> > > r-pre/post-bounded.
> > > 
> > > Alan
> > 
> > I agree, I think you could relax that condition.
> 
> Here's a related question to think about.  Suppose a compiler does make 
> this change, adding a load-and-test in front of a store.  Can that load 
> cause a data race?
> 
> Normally I'd say no, because compilers aren't allowed to create data 
> races where one didn't already exist.  But that restriction is part of 
> the C/C++ standard, and what we consider to be a data race differs from 
> what the standard considers.
> 
> So what's the answer?  Is the compiler allowed to translate:
> 
> 	r1 = READ_ONCE(*x);
> 	if (r1)
> 		*y = 1;
> 
> into something resembling:
> 
> 	r1 = READ_ONCE(*x);
> 	rtemp = *y;
> 	if (r1) {
> 		if (rtemp != 1)
> 			*y = 1;
> 	}
> 
> (Note that whether the load to rtemp occurs inside the "if (r1)" 
> conditional or not makes no difference; either way the CPU can execute 
> it before testing the condition.  Even before reading the value of *x.)
> 
> _If_ we assume that these manufactured loads can never cause a data race 
> then it should be safe to remove the r-pre/post-bounded tests for plain 
> writes.
> 
> But what if rtemp reads from a plain write that was torn, and the 
> intermediate value it observes happens to be 1, even though neither the 
> initial nor the final value of *y was 1?

I am not worried about compilers creating data races, so that assignment
to rtemp would be within the "if (r1)" statement.  Not that this matters,
as you say, from a hardware ordering viewpoint.

However, tearing is a concern.  Just to make sure I understand, one
scenario might be where the initial value of *y was zero and the final
value was 0x10001, correct?  In that case, we have seen compilers that
would write that constant 16 bits at a time, resulting in a transitory
value of 0x1.
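
For concreteness, the kind of torn codegen being described might look
like this when written out as C (a hedged sketch; it assumes a
little-endian layout and is not claimed to match any particular
compiler's output):

/*
 * Intended source: a single plain store, *y = 0x10001;
 * Hypothetical torn object code, expressed as C: the low half is
 * written first, so a concurrent reader can briefly observe *y == 0x1.
 */
static void torn_store(unsigned int *y)
{
	unsigned short *halves = (unsigned short *)y;

	halves[0] = 0x0001;	/* transitory value: *y == 0x1 */
	halves[1] = 0x0001;	/* final value: *y == 0x10001 */
}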

But in this case, we would need the value to -not- be 1 for bad things
to happen, correct?

And in that case, we would need the value to initially be 1 and the
desired value to be 1 and some other store to redundantly set it to
1, but tear in such a way that the transitory value is not 1, correct?
Plus we should detect the data race in that case, not?

Or am I missing yet another opportunity for a mischievous compiler?

> > Note there's also rw-xbstar (used with fr) which doesn't check for
> > r-pre-bounded, but it should be ok. That's because only reads would be
> > unordered, as a result the read (in the if (x != ..) x=..) should provide
> > the correct value. The store would be issued as necessary, and the issued
> > store would still be ordered correctly w.r.t the read.
> 
> That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the 
> write gets changed to a read there's no need for rw-xbstar to check 
> r-pre-bounded, because then rw-race would be comparing a read with 
> another read (instead of with a write) and so there would be no 
> possibility of a race in any case.

True, and if there was a racing write, it would be a data race in
any case.

							Thanx, Paul
  
Paul E. McKenney Feb. 26, 2023, 6:49 p.m. UTC | #13
On Sat, Feb 25, 2023 at 07:03:11PM -0800, Paul E. McKenney wrote:
> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > A few other oddities:
> > > 
> > > litmus/auto/C-LB-Lww+R-OC.litmus
> > > 
> > > 	Both versions flag a data race, which I am not seeing.	It appears
> > > 	to me that P1's store to u0 cannot happen unless P0's store
> > > 	has completed.  So what am I missing here?
> > 
> > The LKMM doesn't believe that a control or data dependency orders a 
> > plain write after a marked read.  Hence in this test it thinks that P1's 
> > store to u0 can happen before the load of x1.  I don't remember why we 
> > did it this way -- probably we just wanted to minimize the restrictions 
> > on when plain accesses can execute.  (I do remember the reason for 
> > making address dependencies induce order; it was so RCU would work.)
> > 
> > The patch below will change what the LKMM believes.  It eliminates the 
> > positive outcome of the litmus test and the data race.  Should it be 
> > adopted into the memory model?
> 
> Excellent question!
> 
> As noted separately, I was conflating the C++ memory model and LKMM.
> 
> > > litmus/auto/C-LB-Lrw+R-OC.litmus
> > > litmus/auto/C-LB-Lww+R-Oc.litmus
> > > litmus/auto/C-LB-Lrw+R-Oc.litmus
> > > litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> > > litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> > > 
> > > 	Ditto.  (There are likely more.)
> > 
> > I haven't looked at these but they're probably similar.
> 
> Let me give this patch a go and see what it does.

And it operates as expected, converting Sometimes/data-race results
into Never.

Leaving the question of whether that is what we need.  ;-)

							Thanx, Paul
  
Alan Stern Feb. 26, 2023, 7:32 p.m. UTC | #14
On Sun, Feb 26, 2023 at 10:45:28AM -0800, Paul E. McKenney wrote:
> On Sun, Feb 26, 2023 at 11:51:15AM -0500, Alan Stern wrote:
> > Here's a related question to think about.  Suppose a compiler does make 
> > this change, adding a load-and-test in front of a store.  Can that load 
> > cause a data race?
> > 
> > Normally I'd say no, because compilers aren't allowed to create data 
> > races where one didn't already exist.  But that restriction is part of 
> > the C/C++ standard, and what we consider to be a data race differs from 
> > what the standard considers.
> > 
> > So what's the answer?  Is the compiler allowed to translate:
> > 
> > 	r1 = READ_ONCE(*x);
> > 	if (r1)
> > 		*y = 1;
> > 
> > into something resembling:
> > 
> > 	r1 = READ_ONCE(*x);
> > 	rtemp = *y;
> > 	if (r1) {
> > 		if (rtemp != 1)
> > 			*y = 1;
> > 	}
> > 
> > (Note that whether the load to rtemp occurs inside the "if (r1)" 
> > conditional or not makes no difference; either way the CPU can execute 
> > it before testing the condition.  Even before reading the value of *x.)
> > 
> > _If_ we assume that these manufactured loads can never cause a data race 
> > then it should be safe to remove the r-pre/post-bounded tests for plain 
> > writes.
> > 
> > But what if rtemp reads from a plain write that was torn, and the 
> > intermediate value it observes happens to be 1, even though neither the 
> > initial nor the final value of *y was 1?
> 
> I am not worried about compilers creating data races, so that assignment
> to rtemp would be within the "if (r1)" statement.  Not that this matters,
> as you say, from a hardware ordering viewpoint.
> 
> However, tearing is a concern.  Just to make sure I understand, one
> scenario might be where the initial value of *y was zero and the final
> value was 0x10001, correct?  In that case, we have seen compilers that
> would write that constant 16 bits at a time, resulting in a transitory
> value of 0x1.
> 
> But in this case, we would need the value to -not- be 1 for bad things
> to happen, correct?
> 
> And in that case, we would need the value to initially be 1 and the
> desired value to be 1 and some other store to redundantly set it to
> 1, but tear in such a way that the transitory value is not 1, correct?
> Plus we should detect the data race in that case, not?
> 
> Or am I missing yet another opportunity for a mischievous compiler?

Let's try a better example:

P0(int *x, int *y)
{
	*y = 0x10001;
	smp_store_release(x, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	if (r1)
		*y = 1;
}

exists (1:r1=1 /\ y=0x10001)

Assume the compiler translates "*y = 1;" to:

	{
		rtemp = *y;
		if (rtemp != 1)
			*y = 1;
	}

Then there is nothing preventing P1's CPU from loading *y into rtemp 
before it reads *x.  If the plain write in P0 is torn then rtemp could 
end up getting set to 1, so P1 wouldn't write anything to *y and the 
litmus test would succeed.

If the LKMM should think this litmus test has no data race then it 
should also think the test will never succeed.  But I've just shown how 
it might succeed.  Ergo the LKMM should say this test has a data race, 
and thus we shouldn't remove the r-pre-bounded tests for plain writes.

Of course, this is a separate question from whether w-pre-bounded should 
be changed to use rwdep instead of addr.

Alan
  
Jonas Oberhauser Feb. 27, 2023, 2:03 p.m. UTC | #15
On 2/26/2023 5:51 PM, Alan Stern wrote:
> On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
>>
>> On 2/26/2023 4:30 AM, Alan Stern wrote:
>>> On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
>>>> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
>>>>> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
>>>>>> A few other oddities:
>>>>>>
>>>>>> litmus/auto/C-LB-Lww+R-OC.litmus
>>>>>>
>>>>>> 	Both versions flag a data race, which I am not seeing.	It appears
>>>>>> 	to me that P1's store to u0 cannot happen unless P0's store
>>>>>> 	has completed.  So what am I missing here?
>>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>>> plain write after a marked read.  Hence in this test it thinks that P1's
>>>>> store to u0 can happen before the load of x1.  I don't remember why we
>>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>>> on when plain accesses can execute.  (I do remember the reason for
>>>>> making address dependencies induce order; it was so RCU would work.)
>>>>>
>>>> Because a plain store can be optimized as a "store only if not equal"?
>>>> As the following sentences in explanation.txt say:
>>>>
>>>> 	The need to distinguish between r- and w-bounding raises yet another
>>>> 	issue.  When the source code contains a plain store, the compiler is
>>>> 	allowed to put plain loads of the same location into the object code.
>>>> 	For example, given the source code:
>>>>
>>>> 		x = 1;
>>>>
>>>> 	the compiler is theoretically allowed to generate object code that
>>>> 	looks like:
>>>>
>>>> 		if (x != 1)
>>>> 			x = 1;
>>>>
>>>> 	thereby adding a load (and possibly replacing the store entirely).
>>>> 	For this reason, whenever the LKMM requires a plain store to be
>>>> 	w-pre-bounded or w-post-bounded by a marked access, it also requires
>>>> 	the store to be r-pre-bounded or r-post-bounded, so as to handle cases
>>>> 	where the compiler adds a load.
>>> Good guess; maybe that was the reason.  [...]
>>> So perhaps the original reason is not valid now
>>> that the memory model explicitly includes tests for stores being
>>> r-pre/post-bounded.
>>>
>>> Alan
>> I agree, I think you could relax that condition.
> Here's a related question to think about.  Suppose a compiler does make
> this change, adding a load-and-test in front of a store.  Can that load
> cause a data race?
>
> Normally I'd say no, because compilers aren't allowed to create data
> races where one didn't already exist.

I don't think that's completely true. At least N1570 says

      Transformations that introduce a speculative read of a potentially shared memory location may
      not preserve the semantics of the program as defined in this standard, since they potentially introduce a data
      race. However, they are typically valid in the context of an optimizing compiler that targets a specific
      machine with well-defined semantics for data races.

In fact I vaguely recall certain optimizations that do introduce data races, like load hoisting.
The issue is that some optimizations are only valid in the absence of data races, and therefore these optimizations and the optimizations that introduce data races can't be used on the same portion of code at the same time.

https://cs.nju.edu.cn/hongjin/teaching/concurrency/03_C11MM_ViktorVafeiadis.pdf
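
As an illustration of the kind of transformation being referred to
(hypothetical codegen; the function and variable names are made up):

/*
 * Source form:
 *
 *	if (c)
 *		*r = *p;
 *
 * After a (hypothetical) load-hoisting transformation, the load of *p
 * is performed unconditionally.  The speculative load can race with a
 * concurrent writer of *p even in executions where c is zero and the
 * original program had no race.
 */
static void reader(int c, int *p, int *r)
{
	int tmp = *p;

	if (c)
		*r = tmp;
}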

I think in general it's not about whether the compiler introduces races or not, but whether those races create new behaviors that you don't see in LKMM.


>    But that restriction is part of
> the C/C++ standard, and what we consider to be a data race differs from
> what the standard considers.
>
> So what's the answer?  Is the compiler allowed to translate:
>
> 	r1 = READ_ONCE(*x);
> 	if (r1)
> 		*y = 1;
>
> into something resembling:
>
> 	r1 = READ_ONCE(*x);
> 	rtemp = *y;
> 	if (r1) {
> 		if (rtemp != 1)
> 			*y = 1;
> 	}
>
> (Note that whether the load to rtemp occurs inside the "if (r1)"
> conditional or not makes no difference; either way the CPU can execute
> it before testing the condition.  Even before reading the value of *x.)
>
> _If_ we assume that these manufactured loads can never cause a data race
> then it should be safe to remove the r-pre/post-bounded tests for plain
> writes.

Note that I don't want to remove the r-pre/post-bounded tests.
What I agreed to is that the restriction to only addr for plain writes 
is an overly conservative "r-pre/post-bounded-style" test which is made 
redundant by the existence of the actual r-pre/post-bounded test.

>
> But what if rtemp reads from a plain write that was torn, and the
> intermediate value it observes happens to be 1, even though neither the
> initial nor the final value of *y was 1?

Of course, and you don't even need a torn write to cause problems 
without requiring r-bounding.
For example, if the other side does *y=1; *y=2; mb(); *x=&y;, then 
r1==&y seems to imply that y always ends up as ==1 from a 
compiler-oblivious perspective, but because of the data race it could be 
==2 instead.


>> Note there's also rw-xbstar (used with fr) which doesn't check for
>> r-pre-bounded, but it should be ok. That's because only reads would be
>> unordered, as a result the read (in the if (x != ..) x=..) should provide
>> the correct value. The store would be issued as necessary, and the issued
>> store would still be ordered correctly w.r.t the read.
> That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
> write gets changed to a read there's no need for rw-xbstar to check
> r-pre-bounded, because then rw-race would be comparing a read with
> another read (instead of with a write) and so there would be no
> possibility of a race in any case.

That is the first part of my explanation (only reads would be unordered) 
but I don't think it's sufficient in general.
Imagine a hypothetical memory model with a release fence that upgrades 
the next memory operation only (and only stores) to release (e.g., to 
save some opcode design space you do weird_release_fence;str x1 x2 
instead of stlr x1 x2).
Then in the message passing pattern

T1 {
    u = a;
    release(&x, 1);
}

T2 {
   t = READ_ONCE(&x);
   weird_release_fence;
   a = 1;
}

where T2 is changed by the compiler to

T2 {
   t = READ_ONCE(&x);
   weird_release_fence();
   if (a!=1) a = 1;
}

In the specific executions where t==1, there wouldn't be a data race 
when just considering the equivalent of rw-xbstar, but u==1 and t==1 
would be possible (even though it might not seem so from the first example).

Of course in LKMM there's no such fence, but I think to make the 
argument complete you still need to go through every case that provides 
the w-pre-bounding and make sure it still provides the necessary order 
in the compiler-generated version. (or you can try a more complicated 
argument of the form "there would be another execution of the same 
program that would have a data race", which works at least for this 
example, not sure in general)

Have fun,
jonas
  
Jonas Oberhauser Feb. 27, 2023, 2:39 p.m. UTC | #16
On 2/26/2023 5:23 PM, Joel Fernandes wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
>> As stated in the documentation and implied by its name, the ppo
>> (preserved program order) relation is intended to link po-earlier
>> to po-later instructions under certain conditions.  However, a
>> corner case currently allows instructions to be linked by ppo that
>> are not executed by the same thread, i.e., instructions are being
>> linked that have no po relation.
>>
>> This happens due to the mb/strong-fence/fence relations, which (as
>> one case) provide order when locks are passed between threads
>> followed by an smp_mb__after_unlock_lock() fence.  This is
>> illustrated in the following litmus test (as can be seen when using
>> herd7 with `doshow ppo`):
>>
>> P0(int *x, int *y)
>> {
>>      spin_lock(x);
>>      spin_unlock(x);
>> }
>>
>> P1(int *x, int *y)
>> {
>>      spin_lock(x);
>>      smp_mb__after_unlock_lock();
>>      *y = 1;
>> }
>>
>> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
>> P0 passes a lock to P1 which then uses this fence.
>>
>> The patch makes ppo a subrelation of po by letting fence contribute
>> to ppo only in case the fence links events of the same thread.
>>
>> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>> ---
>>   tools/memory-model/linux-kernel.cat | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
>> index cfc1b8fd46da..adf3c4f41229 100644
>> --- a/tools/memory-model/linux-kernel.cat
>> +++ b/tools/memory-model/linux-kernel.cat
>> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
>>   let overwrite = co | fr
>>   let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>>   let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
>> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
>> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> Alternatively, could it be the following appended diff? It then requires
> only a single 'int' in ->ppo and prevents future similar issues caused by
> sub relations.  It also makes clear that ->ppo can only be CPU-internal.

I had thought about going in that direction, but it doesn't prevent
future similar issues; at best it makes them less likely.
For example, you could have an xfence that somehow goes back to the 
original thread, but to a po-earlier event (e.g., like prop).

Given that to-r and to-w are unlikely to ever become inconsistent 
with po, I am not sure it even really helps much.

Personally I'm not too happy with the ad-hoc '& int' because it's like 
adding some unused stuff (via ... | unused-stuff) and then cutting it 
back out with &int, unlike prop & int which has a real semantic meaning 
(propagate back to the thread). The fastest move is the move we avoid 
doing, so I would rather not add those parts in the first place.

However, fixing the fence relation turned out to be a lot trickier, both
because of the missed data race and also because of rmw-sequences;
essentially I would have had to disambiguate between xfences and fences
already in this patch. So I did this minimal local fix for now, and we can
discuss whether it makes sense to get rid of the '& int' once/if we have
xfence etc.

Best wishes,
jonas

PS:
> ---8<-----------------------

haha that's so clever :D

>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 07f884f9b2bf..63052d1628e9 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
>   let overwrite = co | fr
>   let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>   let to-r = addr | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
>   
>   (* Propagation: Ordering from release operations and strong fences. *)
>   let A-cumul(r) = (rfe ; [Marked])? ; r
  
Alan Stern Feb. 27, 2023, 4:16 p.m. UTC | #17
On Mon, Feb 27, 2023 at 03:03:16PM +0100, Jonas Oberhauser wrote:
> Note that I don't want to remove the r-pre/post-bounded tests.
> What I agreed to is that the restriction to only addr for plain writes is an
> overly conservative "r-pre/post-bounded-style" test which is made redundant
> by the existence of the actual r-pre/post-bounded test.

Good, that agrees with what I've been thinking.

> > > Note there's also rw-xbstar (used with fr) which doesn't check for
> > > r-pre-bounded, but it should be ok. That's because only reads would be
> > > unordered, as a result the read (in the if (x != ..) x=..) should provide
> > > the correct value. The store would be issued as necessary, and the issued
> > > store would still be ordered correctly w.r.t the read.
> > That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
> > write gets changed to a read there's no need for rw-xbstar to check
> > r-pre-bounded, because then rw-race would be comparing a read with
> > another read (instead of with a write) and so there would be no
> > possibility of a race in any case.
> 
> That is the first part of my explanation (only reads would be unordered) but

It is?  I couldn't tell from what you wrote that this was supposed to 
imply we didn't have to worry about a data race.

> I don't think it's sufficient in general.
> Imagine a hypothetical memory model with a release fence that upgrades the
> next memory operation only (and only stores) to release (e.g., to save some
> opcode design space you do weird_release_fence;str x1 x2 instead of stlr x1
> x2).
> Then in the message passing pattern
> 
> T1 {
>    u = a;
>    release(&x, 1);
> }
> 
> T2 {
>   t = READ_ONCE(&x);
>   weird_release_fence;
>   a = 1;
> }

[That isn't the Message Passing pattern.  In the MP pattern, one thread 
does two writes and the other thread does two reads.  This is the Read 
Buffering (RB) pattern: Each thread does a read followed by a write.]

> 
> where T2 is changed by the compiler to
> 
> T2 {
>   t = READ_ONCE(&x);
>   weird_release_fence();
>   if (a!=1) a = 1;
> }
> 
> In the specific executions where t==1, there wouldn't be a data race when
> just considering the equivalent of rw-xbstar, but u==1 and t==1 would be
> possible (even though it might not seem so from the first example).

If such a fence existed in the LKMM, we would handle this case by saying 
that weird_release_fence() does not give release semantics to an 
immediately following plain store; only to an immediately following 
marked store.  The reason being that the compiler is allowed to muck 
around with the code generated for plain accesses, so there's no 
guarantee that the first machine instruction generated for "a = 1;" will 
be a store.

As a result, there would not be an rw-xbstar link from T1 to T2.

> Of course in LKMM there's no such fence, but I think to make the argument
> complete you still need to go through every case that provides the
> w-pre-bounding and make sure it still provides the necessary order in the
> compiler-generated version. (or you can try a more complicated argument of
> the form "there would be another execution of the same program that would
> have a data race", which works at least for this example, not sure in
> general)

So I don't see this as a valid argument for not using rw-xbstar in 
rw-race.  Even theoretically.

Alan
  
Jonas Oberhauser Feb. 27, 2023, 4:50 p.m. UTC | #18
On 2/27/2023 5:16 PM, Alan Stern wrote:
> On Mon, Feb 27, 2023 at 03:03:16PM +0100, Jonas Oberhauser wrote:
>
>>>> Note there's also rw-xbstar (used with fr) which doesn't check for
>>>> r-pre-bounded, but it should be ok. That's because only reads would be
>>>> unordered, as a result the read (in the if (x != ..) x=..) should provide
>>>> the correct value. The store would be issued as necessary, and the issued
>>>> store would still be ordered correctly w.r.t the read.
>>> That isn't the reason I left r-pre-bounded out from rw-xbstar.  If the
>>> write gets changed to a read there's no need for rw-xbstar to check
>>> r-pre-bounded, because then rw-race would be comparing a read with
>>> another read (instead of with a write) and so there would be no
>>> possibility of a race in any case.
>> That is the first part of my explanation (only reads would be unordered) but
> It is?  I couldn't tell from what you wrote that this was supposed to
> imply we didn't have to worry about a data race.

Because it wasn't supposed to imply that: by itself, the fact that the 
things left unordered are reads does not immediately imply that we don't 
have to worry about a data race. It's just one step in the sequence of 
reasoning. But of course I won't claim that I laid out that sequence in 
enough detail for it to make sense.

>
>> I don't think it's sufficient in general.
>> Imagine a hypothetical memory model with a release fence that upgrades the
>> next memory operation only (and only stores) to release (e.g., to save some
>> opcode design space you do weird_release_fence;str x1 x2 instead of stlr x1
>> x2).
>> Then in the message passing pattern
>>
>> T1 {
>>     u = a;
>>     release(&x, 1);
>> }
>>
>> T2 {
>>    t = READ_ONCE(&x);
>>    weird_release_fence;
>>    a = 1;
>> }
> [That isn't the Message Passing pattern.  In the MP pattern, one thread
> does two writes and the other thread does two reads.  This is the Read
> Buffering (RB) pattern: Each thread does a read followed by a write.]

Ah, thanks. Somehow I misremembered MP to be any case of [do stuff on a] 
; x = ...  || read that update from x; [do stuff on a].

>
>> where T2 is changed by the compiler to
>>
>> T2 {
>>    t = READ_ONCE(&x);
>>    weird_release_fence();
>>    if (a!=1) a = 1;
>> }
>>
>> In the specific executions where t==1, there wouldn't be a data race when
>> just considering the equivalent of rw-xbstar, but u==1 and t==1 would be
>> possible (even though it might not seem so from the first example).
> If such a fence existed in the LKMM, we would handle this case by saying
> that weird_release_fence() does not give release semantics to an
> immediately following plain store; only to an immediately following
> marked store.  The reason being that the compiler is allowed to muck
> around with the code generated for plain accesses, so there's no
> guarantee that the first machine instruction generated for "a = 1;" will
> be a store.
>
> As a result, there would not be an rw-xbstar link from T1 to T2.

Sure, but IMHO a well-designed high-level model like LKMM shouldn't have 
such a fence at all :D This is something you might use in the 
implementation of a release store in inline assembly; it doesn't belong 
in the source code. But it's just for the sake of argument anyway.


>
>> Of course in LKMM there's no such fence, but I think to make the argument
>> complete you still need to go through every case that provides the
>> w-pre-bounding and make sure it still provides the necessary order in the
>> compiler-generated version. (or you can try a more complicated argument of
>> the form "there would be another execution of the same program that would
>> have a data race", which works at least for this example, not sure in
>> general)
> So I don't see this as a valid argument for not using rw-xbstar in
> rw-race.  Even theoretically.

There's nothing wrong with using rw-xbstar in rw-race, especially in 
current LKMM, and I'm not arguing against that.

I'm saying that the argument
"if rw-xbstar links a read R to a plain write W, and that plain write is 
replaced by a read R', and in case R' reads a value different from W, 
followed by a store W' (with some dependency from R' to W')  by the 
compiler, then the fact that R and R' can't have a data race means that 
it's safe to use rw-xbstar in rw-race"
is incomplete. (Of course that doesn't mean the claim is wrong.)
To make the argument complete, you also need that W' is generated if 
necessary, and more crucially that W' is still ordered behind R!
Otherwise you would now have a data race between R and W', like in the 
hypothetical example I mentioned, even though R and R' don't race.

And if you do that second step in LKMM (even with the change of 
w-pre-bounded we are discussing) you quickly find that W' is indeed 
still ordered, so rw-xbstar is perfectly fine.

Perhaps that step is so trivial to you that you don't feel it needs 
mentioning : ) But speaking about LKMM-like models in general, some 
might have some subtle case where rw-xbstar links R and W but would not 
link R and W'.

jonas
  
Joel Fernandes Feb. 27, 2023, 5:57 p.m. UTC | #19
On Mon, Feb 27, 2023 at 9:39 AM Jonas Oberhauser
<jonas.oberhauser@huaweicloud.com> wrote:
>
>
>
> On 2/26/2023 5:23 PM, Joel Fernandes wrote:
> > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> >> As stated in the documentation and implied by its name, the ppo
> >> (preserved program order) relation is intended to link po-earlier
> >> to po-later instructions under certain conditions.  However, a
> >> corner case currently allows instructions to be linked by ppo that
> >> are not executed by the same thread, i.e., instructions are being
> >> linked that have no po relation.
> >>
> >> This happens due to the mb/strong-fence/fence relations, which (as
> >> one case) provide order when locks are passed between threads
> >> followed by an smp_mb__after_unlock_lock() fence.  This is
> >> illustrated in the following litmus test (as can be seen when using
> >> herd7 with `doshow ppo`):
> >>
> >> P0(int *x, int *y)
> >> {
> >>      spin_lock(x);
> >>      spin_unlock(x);
> >> }
> >>
> >> P1(int *x, int *y)
> >> {
> >>      spin_lock(x);
> >>      smp_mb__after_unlock_lock();
> >>      *y = 1;
> >> }
> >>
> >> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> >> P0 passes a lock to P1 which then uses this fence.
> >>
> >> The patch makes ppo a subrelation of po by letting fence contribute
> >> to ppo only in case the fence links events of the same thread.
> >>
> >> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> >> ---
> >>   tools/memory-model/linux-kernel.cat | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> >> index cfc1b8fd46da..adf3c4f41229 100644
> >> --- a/tools/memory-model/linux-kernel.cat
> >> +++ b/tools/memory-model/linux-kernel.cat
> >> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> >>   let overwrite = co | fr
> >>   let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >>   let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> >> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> >> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> > Alternatively can be the following appended diff? Requires only single 'int'
> > in ->ppo then and prevents future similar issues caused by sub relations.
> > Also makes clear that ->ppo can only be CPU-internal.
>
> I had thought about going in that direction, but it doesn't prevent
> future similar issues, at best makes them less likely.

Making them less likely still sounds like a win to me.

> For example, you could have an xfence that somehow goes back to the
> original thread, but to a po-earlier event (e.g., like prop).
>
> Given that to-r and to-w are unlikely to ever become become inconsistent
> with po, I am not sure it even really helps much.

I am not sure I understand: what is the problem with enforcing that
ppo is only ever supposed to be -int? Sounds like it makes it super
clear.

> Personally I'm not too happy with the ad-hoc '& int' because it's like

So, with the idea I suggest, you will have fewer ints so you should be happy ;-)

> adding some unused stuff (via ... | unused-stuff) and then cutting it
> back out with &int, unlike prop & int which has a real semantic meaning
> (propagate back to the thread). The fastest move is the move we avoid
> doing, so I rather wouldn't add those parts in the first place.
>
> However fixing the fence relation turned out to be a lot trickier, both
> because of the missed data race and also rmw-sequences, essentially I
> would have had to disambiguate between xfences and fences already in
> this patch. So I did this minimal local fix for now and we can discuss
> whether it makes sense to get rid of the '& int' once/if we have xfence etc.

I see. Ok, I'll defer to your expertise on this since you know more
than I do; I have only relatively recently even started opening up the
CAT code.

Cheers,

 - Joel


>
> Best wishes,
> jonas
>
> PS:
> > ---8<-----------------------
>
> haha that's so clever :D
>
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 07f884f9b2bf..63052d1628e9 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
> >   let overwrite = co | fr
> >   let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >   let to-r = addr | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
> >
> >   (* Propagation: Ordering from release operations and strong fences. *)
> >   let A-cumul(r) = (rfe ; [Marked])? ; r
>
  
Alan Stern Feb. 27, 2023, 6:41 p.m. UTC | #20
On Mon, Feb 27, 2023 at 05:50:15PM +0100, Jonas Oberhauser wrote:
> > So I don't see this as a valid argument for not using rw-xbstar in
> > rw-race.  Even theoretically.
> 
> There's nothing wrong with using rw-xbstar in rw-race, especially in current
> LKMM, and I'm not arguing against that.
> 
> I'm saying that the argument
> "if rw-xbstar links a read R to a plain write W, and that plain write is
> replaced by a read R', and in case R' reads a value different from W,
> followed by a store W' (with some dependency from R' to W')  by the
> compiler, then the fact that R and R' can't have a data race means that it's
> safe to use rw-xbstar in rw-race"
> is incomplete. (Of course that doesn't mean the claim is wrong.)
> To make the argument complete, you also need that W' is generated if
> necessary, and more crucially that W' is still ordered behind R!
> Otherwise you would now have a data race between R and W', like in the
> hypothetical example I mentioned, even though R and R' don't race.
> 
> And if you do that second step in LKMM (even with the change of
> w-pre-bounded we are discussing) you quickly find that W' is indeed still
> ordered, so rw-xbstar is perfectly fine.
> 
> Perhaps that step is so trivial to you that you don't feel it needs
> mentioning : ) But speaking about LKMM-like models in general, some might
> have some subtle case where rw-xbstar links R and W but would not R and W'.

Ah, okay.  Yes, it is a subtle point.  And by the reasoning I just used, 
if such a case did exist then one could conclude it would be an example 
demonstrating that rw-xbstar should not have linked R and W in the first 
place.

And it looks like I should write up and submit a patch allowing more 
dependencies in the definition of w-pre-bounded.

Alan
  
Andrea Parri Feb. 27, 2023, 7:35 p.m. UTC | #21
On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions.  However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
> 
> This happens due to the mb/strong-fence/fence relations, which (as
> one case) provide order when locks are passed between threads
> followed by an smp_mb__after_unlock_lock() fence.  This is
> illustrated in the following litmus test (as can be seen when using
> herd7 with `doshow ppo`):
> 
> P0(int *x, int *y)
> {
>     spin_lock(x);
>     spin_unlock(x);
> }
> 
> P1(int *x, int *y)
> {
>     spin_lock(x);
>     smp_mb__after_unlock_lock();
>     *y = 1;
> }

nit: for readability if not for other reasons, s/int *x/spinlock_t *x/.


> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
> 
> The patch makes ppo a subrelation of po by letting fence contribute
> to ppo only in case the fence links events of the same thread.
> 
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>

Acked-by: Andrea Parri <parri.andrea@gmail.com>

  Andrea


> ---
>  tools/memory-model/linux-kernel.cat | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
>  
>  (* Propagation: Ordering from release operations and strong fences. *)
>  let A-cumul(r) = (rfe ; [Marked])? ; r
> -- 
> 2.17.1
>
  
Andrea Parri Feb. 27, 2023, 7:40 p.m. UTC | #22
> The LKMM doesn't believe that a control or data dependency orders a 
> plain write after a marked read.  Hence in this test it thinks that P1's 
> store to u0 can happen before the load of x1.  I don't remember why we 
> did it this way -- probably we just wanted to minimize the restrictions 
> on when plain accesses can execute.  (I do remember the reason for 
> making address dependencies induce order; it was so RCU would work.)
> 
> The patch below will change what the LKMM believes.  It eliminates the 
> positive outcome of the litmus test and the data race.  Should it be 
> adopted into the memory model?

(Unpopular opinion I know,) it should drop dependencies ordering, not
add/promote it.

  Andrea
  
Jonas Oberhauser Feb. 27, 2023, 8:13 p.m. UTC | #23
On 2/27/2023 8:40 PM, Andrea Parri wrote:
>> The LKMM doesn't believe that a control or data dependency orders a
>> plain write after a marked read.  Hence in this test it thinks that P1's
>> store to u0 can happen before the load of x1.  I don't remember why we
>> did it this way -- probably we just wanted to minimize the restrictions
>> on when plain accesses can execute.  (I do remember the reason for
>> making address dependencies induce order; it was so RCU would work.)
>>
>> The patch below will change what the LKMM believes.  It eliminates the
>> positive outcome of the litmus test and the data race.  Should it be
>> adopted into the memory model?
> (Unpopular opinion I know,) it should drop dependencies ordering, not
> add/promote it.
>
>    Andrea

Maybe not as unpopular as you think... :)
But either way IMHO it should be consistent; either take all the 
dependencies that are true and add them, or drop them all.
In the latter case, RCU should change to an acquire barrier. (also, one 
would have to deal with OOTA in some yet different way).

Generally my position is that unless there's a real-world benchmark with 
proven performance benefits of relying on dependency ordering, one 
should use an acquire barrier. I haven't yet met such a case, but maybe 
one of you has...

Best wishes,
jonas
  
Jonas Oberhauser Feb. 27, 2023, 8:24 p.m. UTC | #24
On 2/27/2023 6:57 PM, Joel Fernandes wrote:
> On Mon, Feb 27, 2023 at 9:39 AM Jonas Oberhauser
> <jonas.oberhauser@huaweicloud.com> wrote:
>>
>>
>> On 2/26/2023 5:23 PM, Joel Fernandes wrote:
>>> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
>>>> [...]
>>> Alternatively can be the following appended diff? Requires only single 'int'
>>> in ->ppo then and prevents future similar issues caused by sub relations.
>>> Also makes clear that ->ppo can only be CPU-internal.
>> I had thought about going in that direction, but it doesn't prevent
>> future similar issues, at best makes them less likely.
> Making less likely still sounds like a win to me.
>
>> For example, you could have an xfence that somehow goes back to the
>> original thread, but to a po-earlier event (e.g., like prop).
>>
>> Given that to-r and to-w are unlikely to ever become become inconsistent
>> with po, I am not sure it even really helps much.
> I am not sure I understand, what is the problem with enforcing that
> ppo is only supposed to ever be -int ? Sounds like it makes it super
> clear.

You could go further and do ... & po.
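
Concretely, that would look something like (just spelling out the 
stronger filter, not something I'm proposing):

let ppo = (to-r | to-w | fence | po-unlock-lock-po) & po

which makes the "ppo is a subrelation of po" property hold by 
construction.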

But it would still make me feel that it's a plaster used to hold 
together a steam pipe.
It reminds me a bit of college, when some of my classmates passed the 
nightly tests in the programming lab by doing
"if input == (the input of the specific test case they were having 
problems with) return (value expected by the test case)".
Or making everything atomic so that tsan does not complain about data 
races anymore.

If there's something in one of these relations tying together events of 
different threads, is it intentional or a bug?
I prefer to be very conscious about what is being tied together by the 
relations.

I'd rather take Boqun's suggestion to add some "debug/testing" flags to 
see if a litmus test violates a property assumed by LKMM.

Yet I think the ideal way is to have a mechanized proof that LKMM 
satisfies these properties and use that to avoid regressions.
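
(For example, assuming the flag syntax that herd7 already accepts in the 
cat files, a debug-only check along these lines could catch regressions 
of this particular property:

flag ~empty (ppo \ po) as ppo-not-po

That's just a sketch of the idea, not a patch.)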


>
>> Personally I'm not too happy with the ad-hoc '& int' because it's like
> So, with the idea I suggest, you will have fewer ints so you should be happy ;-)

haha : ) An alternative perspective is that the &int now covers more 
cases of the relation ; )


>
>> adding some unused stuff (via ... | unused-stuff) and then cutting it
>> back out with &int, unlike prop & int which has a real semantic meaning
>> (propagate back to the thread). The fastest move is the move we avoid
>> doing, so I rather wouldn't add those parts in the first place.
>>
>> However fixing the fence relation turned out to be a lot trickier, both
>> because of the missed data race and also rmw-sequences, essentially I
>> would have had to disambiguate between xfences and fences already in
>> this patch. So I did this minimal local fix for now and we can discuss
>> whether it makes sense to get rid of the '& int' once/if we have xfence etc.
> I see. Ok, I'll defer to your expertise on this since you know more
> than I. I am relatively only recent with even opening up the CAT code.

Enjoy : )

jonas
  
Paul E. McKenney Feb. 27, 2023, 10:21 p.m. UTC | #25
On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> 
> 
> On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > The LKMM doesn't believe that a control or data dependency orders a
> > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > store to u0 can happen before the load of x1.  I don't remember why we
> > > did it this way -- probably we just wanted to minimize the restrictions
> > > on when plain accesses can execute.  (I do remember the reason for
> > > making address dependencies induce order; it was so RCU would work.)
> > > 
> > > The patch below will change what the LKMM believes.  It eliminates the
> > > positive outcome of the litmus test and the data race.  Should it be
> > > adopted into the memory model?
> > (Unpopular opinion I know,) it should drop dependencies ordering, not
> > add/promote it.
> > 
> >    Andrea
> 
> Maybe not as unpopular as you think... :)
> But either way IMHO it should be consistent; either take all the
> dependencies that are true and add them, or drop them all.
> In the latter case, RCU should change to an acquire barrier. (also, one
> would have to deal with OOTA in some yet different way).
> 
> Generally my position is that unless there's a real-world benchmark with
> proven performance benefits of relying on dependency ordering, one should
> use an acquire barrier. I haven't yet met such a case, but maybe one of you
> has...

https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).

Though this is admittedly for ARMv7 and PowerPC.

							Thanx, Paul
  
Jonas Oberhauser Feb. 28, 2023, 8:49 a.m. UTC | #26
On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>>
>> On 2/27/2023 8:40 PM, Andrea Parri wrote:
>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>> plain write after a marked read.  Hence in this test it thinks that P1's
>>>> store to u0 can happen before the load of x1.  I don't remember why we
>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>> on when plain accesses can execute.  (I do remember the reason for
>>>> making address dependencies induce order; it was so RCU would work.)
>>>>
>>>> The patch below will change what the LKMM believes.  It eliminates the
>>>> positive outcome of the litmus test and the data race.  Should it be
>>>> adopted into the memory model?
>>> (Unpopular opinion I know,) it should drop dependencies ordering, not
>>> add/promote it.
>>>
>>>     Andrea
>> Maybe not as unpopular as you think... :)
>> But either way IMHO it should be consistent; either take all the
>> dependencies that are true and add them, or drop them all.
>> In the latter case, RCU should change to an acquire barrier. (also, one
>> would have to deal with OOTA in some yet different way).
>>
>> Generally my position is that unless there's a real-world benchmark with
>> proven performance benefits of relying on dependency ordering, one should
>> use an acquire barrier. I haven't yet met such a case, but maybe one of you
>> has...
> https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
>
> Though this is admittedly for ARMv7 and PowerPC.
>

Thanks for the link.

It's true that on architectures that don't have an acquire load (and 
have to use a fence), the penalty will be bigger.

But the more obvious discussion would be what constitutes a real-world 
benchmark : )
In my experience you can get a lot of performance benefits out of 
optimizing barriers in code if all you execute is that code.
But once you embed that into a real-world application, often 90%-99% of 
time spent will be in the business logic, not in the data structure.

And then the benefits suddenly disappear.
Note that a lot of barriers are a lot cheaper as well when there's no 
contention.

Because of that, making optimization decisions based on microbenchmarks 
can sometimes lead to a very poor "time invested" vs "total product 
improvement" ratio.

Best wishes,
jonas
  
Paul E. McKenney Feb. 28, 2023, 3:40 p.m. UTC | #27
On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
> 
> 
> On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > > 
> > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > > > store to u0 can happen before the load of x1.  I don't remember why we
> > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > on when plain accesses can execute.  (I do remember the reason for
> > > > > making address dependencies induce order; it was so RCU would work.)
> > > > > 
> > > > > The patch below will change what the LKMM believes.  It eliminates the
> > > > > positive outcome of the litmus test and the data race.  Should it be
> > > > > adopted into the memory model?
> > > > (Unpopular opinion I know,) it should drop dependencies ordering, not
> > > > add/promote it.
> > > > 
> > > >     Andrea
> > > Maybe not as unpopular as you think... :)
> > > But either way IMHO it should be consistent; either take all the
> > > dependencies that are true and add them, or drop them all.
> > > In the latter case, RCU should change to an acquire barrier. (also, one
> > > would have to deal with OOTA in some yet different way).
> > > 
> > > Generally my position is that unless there's a real-world benchmark with
> > > proven performance benefits of relying on dependency ordering, one should
> > > use an acquire barrier. I haven't yet met such a case, but maybe one of you
> > > has...
> > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> > 
> > Though this is admittedly for ARMv7 and PowerPC.
> > 
> 
> Thanks for the link.
> 
> It's true that on architectures that don't have an acquire load (and have to
> use a fence), the penalty will be bigger.
> 
> But the more obvious discussion would be what constitutes a real-world
> benchmark : )
> In my experience you can get a lot of performance benefits out of optimizing
> barriers in code if all you execute is that code.
> But once you embed that into a real-world application, often 90%-99% of time
> spent will be in the business logic, not in the data structure.
> 
> And then the benefits suddenly disappear.
> Note that a lot of barriers are a lot cheaper as well when there's no
> contention.
> 
> Because of that, making optimization decisions based on microbenchmarks can
> sometimes lead to a very poor "time invested" vs "total product improvement"
> ratio.

All true, though that 2x and 4x should be worth something.

The real-world examples I know of involved garbage collectors, and the
improvement was said to be a few percent system-wide.  But that was a
verbal exchange, so I don't have a citation for you.

							Thanx, Paul
  
Paul E. McKenney Feb. 28, 2023, 10:01 p.m. UTC | #28
On Mon, Feb 27, 2023 at 08:35:11PM +0100, Andrea Parri wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > As stated in the documentation and implied by its name, the ppo
> > (preserved program order) relation is intended to link po-earlier
> > to po-later instructions under certain conditions.  However, a
> > corner case currently allows instructions to be linked by ppo that
> > are not executed by the same thread, i.e., instructions are being
> > linked that have no po relation.
> > 
> > This happens due to the mb/strong-fence/fence relations, which (as
> > one case) provide order when locks are passed between threads
> > followed by an smp_mb__after_unlock_lock() fence.  This is
> > illustrated in the following litmus test (as can be seen when using
> > herd7 with `doshow ppo`):
> > 
> > P0(int *x, int *y)
> > {
> >     spin_lock(x);
> >     spin_unlock(x);
> > }
> > 
> > P1(int *x, int *y)
> > {
> >     spin_lock(x);
> >     smp_mb__after_unlock_lock();
> >     *y = 1;
> > }
> 
> nit: for readability if not for other, s/int *x/spinlock_t *x/.

Unless there are objections, I will make this change on my next
rebase.

> > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > P0 passes a lock to P1 which then uses this fence.
> > 
> > The patch makes ppo a subrelation of po by letting fence contribute
> > to ppo only in case the fence links events of the same thread.
> > 
> > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> 
> Acked-by: Andrea Parri <parri.andrea@gmail.com>

And add this as well, thank you!

							Thanx, Paul

>   Andrea
> 
> 
> > ---
> >  tools/memory-model/linux-kernel.cat | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index cfc1b8fd46da..adf3c4f41229 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> >  let overwrite = co | fr
> >  let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >  let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> >  
> >  (* Propagation: Ordering from release operations and strong fences. *)
> >  let A-cumul(r) = (rfe ; [Marked])? ; r
> > -- 
> > 2.17.1
> >
  
Jonas Oberhauser March 1, 2023, 10:52 a.m. UTC | #29
On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
>>
>> On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
>>> On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>>>> On 2/27/2023 8:40 PM, Andrea Parri wrote:
>>>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>>>> plain write after a marked read.  Hence in this test it thinks that P1's
>>>>>> store to u0 can happen before the load of x1.  I don't remember why we
>>>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>>>> on when plain accesses can execute.  (I do remember the reason for
>>>>>> making address dependencies induce order; it was so RCU would work.)
>>>>>>
>>>>>> The patch below will change what the LKMM believes.  It eliminates the
>>>>>> positive outcome of the litmus test and the data race.  Should it be
>>>>>> adopted into the memory model?
>>>>> (Unpopular opinion I know,) it should drop dependencies ordering, not
>>>>> add/promote it.
>>>>>
>>>>>      Andrea
>>>> Maybe not as unpopular as you think... :)
>>>> But either way IMHO it should be consistent; either take all the
>>>> dependencies that are true and add them, or drop them all.
>>>> In the latter case, RCU should change to an acquire barrier. (also, one
>>>> would have to deal with OOTA in some yet different way).
>>>>
>>>> Generally my position is that unless there's a real-world benchmark with
>>>> proven performance benefits of relying on dependency ordering, one should
>>>> use an acquire barrier. I haven't yet met such a case, but maybe one of you
>>>> has...
>>> https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
>>>
>>> Though this is admittedly for ARMv7 and PowerPC.
>>>
>> Thanks for the link.
>>
>> It's true that on architectures that don't have an acquire load (and have to
>> use a fence), the penalty will be bigger.
>>
>> But the more obvious discussion would be what constitutes a real-world
>> benchmark : )
>> In my experience you can get a lot of performance benefits out of optimizing
>> barriers in code if all you execute is that code.
>> But once you embed that into a real-world application, often 90%-99% of time
>> spent will be in the business logic, not in the data structure.
>>
>> And then the benefits suddenly disappear.
>> Note that a lot of barriers are a lot cheaper as well when there's no
>> contention.
>>
>> Because of that, making optimization decisions based on microbenchmarks can
>> sometimes lead to a very poor "time invested" vs "total product improvement"
>> ratio.
> All true, though that 2x and 4x should be worth something.

I think the most egregious case we had (not barrier related, but cache 
related) was something like ~100x in specific benchmarks and then I 
think somewhere around 1% system-wide. I think the issue was that in the 
larger system, we couldn't keep the cache hot, so our greatly improved 
data locality was being diluted.
But of course it always depends on how much that component actually 
contributes to the overall system performance, which IMHO should be one 
of the measurements taken before starting to invest heavily in 
optimizations.

Surprisingly, many people don't want to do that. I think it's something 
about the misleading calculus of "2 months implementing the optimization 
+ 2 weeks building robust benchmarks & profiling infrastructure > 2 
months implementing the optimization" instead of "2 weeks building 
robust benchmarks & profiling infrastructure + 0 months implementing a 
useless optimization < 2 months implementing the optimization", which 
seems to be the more common case.


> The real-world examples I know of involved garbage collectors, and the
> improvement was said to be a few percent system-wide.  But that was a
> verbal exchange, so I don't have a citation for you.

Thanks for the example, it sounds very reasonable (at least for 
platforms like PowerPC).
GC has all the hallmarks of a good optimization target: measurable 
impact on system-wide throughput and latency, and pointer chasing 
(=dependency ordering) being a hot part inside the algorithm.

Best wishes,
jonas
  
Paul E. McKenney March 2, 2023, 1:42 a.m. UTC | #30
On Wed, Mar 01, 2023 at 11:52:09AM +0100, Jonas Oberhauser wrote:
> 
> 
> On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
> > > 
> > > On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > > > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > > > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > > > plain write after a marked read.  Hence in this test it thinks that P1's
> > > > > > > store to u0 can happen before the load of x1.  I don't remember why we
> > > > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > > > on when plain accesses can execute.  (I do remember the reason for
> > > > > > > making address dependencies induce order; it was so RCU would work.)
> > > > > > > 
> > > > > > > The patch below will change what the LKMM believes.  It eliminates the
> > > > > > > positive outcome of the litmus test and the data race.  Should it be
> > > > > > > adopted into the memory model?
> > > > > > (Unpopular opinion I know,) it should drop dependencies ordering, not
> > > > > > add/promote it.
> > > > > > 
> > > > > >      Andrea
> > > > > Maybe not as unpopular as you think... :)
> > > > > But either way IMHO it should be consistent; either take all the
> > > > > dependencies that are true and add them, or drop them all.
> > > > > In the latter case, RCU should change to an acquire barrier. (also, one
> > > > > would have to deal with OOTA in some yet different way).
> > > > > 
> > > > > Generally my position is that unless there's a real-world benchmark with
> > > > > proven performance benefits of relying on dependency ordering, one should
> > > > > use an acquire barrier. I haven't yet met such a case, but maybe one of you
> > > > > has...
> > > > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> > > > 
> > > > Though this is admittedly for ARMv7 and PowerPC.
> > > > 
> > > Thanks for the link.
> > > 
> > > It's true that on architectures that don't have an acquire load (and have to
> > > use a fence), the penalty will be bigger.
> > > 
> > > But the more obvious discussion would be what constitutes a real-world
> > > benchmark : )
> > > In my experience you can get a lot of performance benefits out of optimizing
> > > barriers in code if all you execute is that code.
> > > But once you embed that into a real-world application, often 90%-99% of time
> > > spent will be in the business logic, not in the data structure.
> > > 
> > > And then the benefits suddenly disappear.
> > > Note that a lot of barriers are a lot cheaper as well when there's no
> > > contention.
> > > 
> > > Because of that, making optimization decisions based on microbenchmarks can
> > > sometimes lead to a very poor "time invested" vs "total product improvement"
> > > ratio.
> > All true, though that 2x and 4x should be worth something.
> 
> I think the most egregious case we had (not barrier related, but cache
> related) was something like ~100x in specific benchmarks and then I think
> somewhere around 1% system-wide. I think the issue was that in the larger
> system, we couldn't keep the cache hot, so our greatly improved data
> locality was being diluted.
> But of course it always depends on how much that component actually
> contributes to the overall system performance.
> Which IMHO should be one of the measurements taken before starting to invest
> heavily into optimizations.

Agreed, it is all too easy for profound local optimizations to have
near-zero global effect.

> Surprisingly, many people don't want to do that. I think it's something
> about the misleading calculus of "2 months implementing the optimization + 2
> weeks building robust benchmarks & profiling infrastructure > 2 months
> implementing the optimization" instead of "2 weeks building robust
> benchmarks & profiling infrastructure + 0 months implementing a useless
> optimization < 2 months implementing the optimization", which seems to be
> the more common case.

Or, for that matter, having the investigation locate a five-minute change
that produces large benefits.

> > The real-world examples I know of involved garbage collectors, and the
> > improvement was said to be a few percent system-wide.  But that was a
> > verbal exchange, so I don't have a citation for you.
> 
> Thanks for the example, it sounds very reasonable (at least for platforms
> like PowerPC).
> GC has all the hallmarks of a good optimization target: measurable impact on
> system wide throughput and latency, and pointer chasing (=dependency
> ordering) being a hot part inside the algorithm.

I believe that the target was ARM.  It was definitely not PowerPC.

But yes, GCs are indeed intense exercise for pointer chasing.

							Thanx, Paul
  

Patch

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index cfc1b8fd46da..adf3c4f41229 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -82,7 +82,7 @@  let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
 let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
+let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
 
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = (rfe ; [Marked])? ; r