[v2,1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po

Message ID 20230126134604.2160-2-jonas.oberhauser@huaweicloud.com
State New
Headers
Series Streamlining treatment of smp_mb__after_unlock_lock |

Commit Message

Jonas Oberhauser Jan. 26, 2023, 1:46 p.m. UTC
  LKMM uses two relations for talking about UNLOCK+LOCK pairings:

	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
	   on the same CPU or immediate lock handovers on the same
	   lock variable

	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
	   literally as described in rcupdate.h#L1002, i.e., even
	   after a sequence of handovers on the same lock variable.

The latter relation is used only once, to provide the guarantee
defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
makes any UNLOCK+LOCK pair followed by the fence behave like a full
barrier.

This patch drops this use in favor of using po-unlock-lock-po
everywhere, which unifies the way the model talks about UNLOCK+LOCK
pairings.  At first glance this seems to weaken the guarantee given
by LKMM: When considering a long sequence of lock handovers
such as below, where P0 hands the lock to P1, which hands it to P2,
which finally executes such an after_unlock_lock fence, the mb
relation currently links any stores in the critical section of P0
to instructions P2 executes after its fence, but not so after the
patch.

P0(int *x, int *y, spinlock_t *mylock)
{
        spin_lock(mylock);
        WRITE_ONCE(*x, 2);
        spin_unlock(mylock);
        WRITE_ONCE(*y, 1);
}

P1(int *y, int *z, spinlock_t *mylock)
{
        int r0 = READ_ONCE(*y); // reads 1
        spin_lock(mylock);
        spin_unlock(mylock);
        WRITE_ONCE(*z,1);
}

P2(int *z, int *d, spinlock_t *mylock)
{
        int r1 = READ_ONCE(*z); // reads 1
        spin_lock(mylock);
        spin_unlock(mylock);
        smp_mb__after_unlock_lock();
        WRITE_ONCE(*d,1);
}

P3(int *x, int *d)
{
        WRITE_ONCE(*d,2);
        smp_mb();
        WRITE_ONCE(*x,1);
}

exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)

Nevertheless, the ordering guarantee given in rcupdate.h is actually
not weakened.  This is because the unlock operations along the
sequence of handovers are A-cumulative fences.  They ensure that any
stores that propagate to the CPU performing the first unlock
operation in the sequence must also propagate to every CPU that
performs a subsequent lock operation in the sequence.  Therefore any
such stores will also be ordered correctly by the fence even if only
the final handover is considered a full barrier.

Indeed this patch does not affect the behaviors allowed by LKMM at
all.  The mb relation is used to define ordering through:
1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
   lock-release, rfe, and unlock-acquire orderings each provide hb
2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
   lock-release orderings simply add more fine-grained cumul-fence
   edges to substitute a single strong-fence edge provided by a long
   lock handover sequence
3) mb/strong-fence/pb and various similar uses in the definition of
   data races, where as discussed above any long handover sequence
   can be turned into a sequence of cumul-fence edges that provide
   the same ordering.

Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
 tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
  

Comments

Alan Stern Jan. 26, 2023, 4:36 p.m. UTC | #1
On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> 
> 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> 	   on the same CPU or immediate lock handovers on the same
> 	   lock variable
> 
> 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> 	   literally as described in rcupdate.h#L1002, i.e., even
> 	   after a sequence of handovers on the same lock variable.
> 
> The latter relation is used only once, to provide the guarantee
> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> makes any UNLOCK+LOCK pair followed by the fence behave like a full
> barrier.
> 
> This patch drops this use in favor of using po-unlock-lock-po
> everywhere, which unifies the way the model talks about UNLOCK+LOCK
> pairings.  At first glance this seems to weaken the guarantee given
> by LKMM: When considering a long sequence of lock handovers
> such as below, where P0 hands the lock to P1, which hands it to P2,
> which finally executes such an after_unlock_lock fence, the mb
> relation currently links any stores in the critical section of P0
> to instructions P2 executes after its fence, but not so after the
> patch.
> 
> P0(int *x, int *y, spinlock_t *mylock)
> {
>         spin_lock(mylock);
>         WRITE_ONCE(*x, 2);
>         spin_unlock(mylock);
>         WRITE_ONCE(*y, 1);
> }
> 
> P1(int *y, int *z, spinlock_t *mylock)
> {
>         int r0 = READ_ONCE(*y); // reads 1
>         spin_lock(mylock);
>         spin_unlock(mylock);
>         WRITE_ONCE(*z,1);
> }
> 
> P2(int *z, int *d, spinlock_t *mylock)
> {
>         int r1 = READ_ONCE(*z); // reads 1
>         spin_lock(mylock);
>         spin_unlock(mylock);
>         smp_mb__after_unlock_lock();
>         WRITE_ONCE(*d,1);
> }
> 
> P3(int *x, int *d)
> {
>         WRITE_ONCE(*d,2);
>         smp_mb();
>         WRITE_ONCE(*x,1);
> }
> 
> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> 
> Nevertheless, the ordering guarantee given in rcupdate.h is actually
> not weakened.  This is because the unlock operations along the
> sequence of handovers are A-cumulative fences.  They ensure that any
> stores that propagate to the CPU performing the first unlock
> operation in the sequence must also propagate to every CPU that
> performs a subsequent lock operation in the sequence.  Therefore any
> such stores will also be ordered correctly by the fence even if only
> the final handover is considered a full barrier.
> 
> Indeed this patch does not affect the behaviors allowed by LKMM at
> all.  The mb relation is used to define ordering through:
> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>    lock-release, rfe, and unlock-acquire orderings each provide hb
> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>    lock-release orderings simply add more fine-grained cumul-fence
>    edges to substitute a single strong-fence edge provided by a long
>    lock handover sequence
> 3) mb/strong-fence/pb and various similar uses in the definition of
>    data races, where as discussed above any long handover sequence
>    can be turned into a sequence of cumul-fence edges that provide
>    the same ordering.
> 
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> ---

Reviewed-by: Alan Stern <stern@rowland.harvard.edu>

>  tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 07f884f9b2bf..6e531457bb73 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
>  	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
>  	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
>  	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> -	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> -		fencerel(After-unlock-lock) ; [M])
> +(*
> + * Note: The po-unlock-lock-po relation only passes the lock to the direct
> + * successor, perhaps giving the impression that the ordering of the
> + * smp_mb__after_unlock_lock() fence only affects a single lock handover.
> + * However, in a longer sequence of lock handovers, the implicit
> + * A-cumulative release fences of lock-release ensure that any stores that
> + * propagate to one of the involved CPUs before it hands over the lock to
> + * the next CPU will also propagate to the final CPU handing over the lock
> + * to the CPU that executes the fence.  Therefore, all those stores are
> + * also affected by the fence.
> + *)
> +	([M] ; po-unlock-lock-po ;
> +		[After-unlock-lock] ; po ; [M])
>  let gp = po ; [Sync-rcu | Sync-srcu] ; po?
>  let strong-fence = mb | gp
>  
> -- 
> 2.17.1
>
  
Paul E. McKenney Jan. 26, 2023, 8:08 p.m. UTC | #2
On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > 
> > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > 	   on the same CPU or immediate lock handovers on the same
> > 	   lock variable
> > 
> > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > 	   literally as described in rcupdate.h#L1002, i.e., even
> > 	   after a sequence of handovers on the same lock variable.
> > 
> > The latter relation is used only once, to provide the guarantee
> > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > barrier.
> > 
> > This patch drops this use in favor of using po-unlock-lock-po
> > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > pairings.  At first glance this seems to weaken the guarantee given
> > by LKMM: When considering a long sequence of lock handovers
> > such as below, where P0 hands the lock to P1, which hands it to P2,
> > which finally executes such an after_unlock_lock fence, the mb
> > relation currently links any stores in the critical section of P0
> > to instructions P2 executes after its fence, but not so after the
> > patch.
> > 
> > P0(int *x, int *y, spinlock_t *mylock)
> > {
> >         spin_lock(mylock);
> >         WRITE_ONCE(*x, 2);
> >         spin_unlock(mylock);
> >         WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *y, int *z, spinlock_t *mylock)
> > {
> >         int r0 = READ_ONCE(*y); // reads 1
> >         spin_lock(mylock);
> >         spin_unlock(mylock);
> >         WRITE_ONCE(*z,1);
> > }
> > 
> > P2(int *z, int *d, spinlock_t *mylock)
> > {
> >         int r1 = READ_ONCE(*z); // reads 1
> >         spin_lock(mylock);
> >         spin_unlock(mylock);
> >         smp_mb__after_unlock_lock();
> >         WRITE_ONCE(*d,1);
> > }
> > 
> > P3(int *x, int *d)
> > {
> >         WRITE_ONCE(*d,2);
> >         smp_mb();
> >         WRITE_ONCE(*x,1);
> > }
> > 
> > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > 
> > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > not weakened.  This is because the unlock operations along the
> > sequence of handovers are A-cumulative fences.  They ensure that any
> > stores that propagate to the CPU performing the first unlock
> > operation in the sequence must also propagate to every CPU that
> > performs a subsequent lock operation in the sequence.  Therefore any
> > such stores will also be ordered correctly by the fence even if only
> > the final handover is considered a full barrier.
> > 
> > Indeed this patch does not affect the behaviors allowed by LKMM at
> > all.  The mb relation is used to define ordering through:
> > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> >    lock-release, rfe, and unlock-acquire orderings each provide hb
> > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> >    lock-release orderings simply add more fine-grained cumul-fence
> >    edges to substitute a single strong-fence edge provided by a long
> >    lock handover sequence
> > 3) mb/strong-fence/pb and various similar uses in the definition of
> >    data races, where as discussed above any long handover sequence
> >    can be turned into a sequence of cumul-fence edges that provide
> >    the same ordering.
> > 
> > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > ---
> 
> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>

A quick spot check showed no change in performance, so thank you both!

Queued for review and further testing.

							Thanx, Paul

> >  tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 07f884f9b2bf..6e531457bb73 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
> >  	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> >  	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> >  	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> > -	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> > -		fencerel(After-unlock-lock) ; [M])
> > +(*
> > + * Note: The po-unlock-lock-po relation only passes the lock to the direct
> > + * successor, perhaps giving the impression that the ordering of the
> > + * smp_mb__after_unlock_lock() fence only affects a single lock handover.
> > + * However, in a longer sequence of lock handovers, the implicit
> > + * A-cumulative release fences of lock-release ensure that any stores that
> > + * propagate to one of the involved CPUs before it hands over the lock to
> > + * the next CPU will also propagate to the final CPU handing over the lock
> > + * to the CPU that executes the fence.  Therefore, all those stores are
> > + * also affected by the fence.
> > + *)
> > +	([M] ; po-unlock-lock-po ;
> > +		[After-unlock-lock] ; po ; [M])
> >  let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> >  let strong-fence = mb | gp
> >  
> > -- 
> > 2.17.1
> >
  
Paul E. McKenney Jan. 26, 2023, 11:21 p.m. UTC | #3
On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > 
> > > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > 	   on the same CPU or immediate lock handovers on the same
> > > 	   lock variable
> > > 
> > > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > 	   literally as described in rcupdate.h#L1002, i.e., even
> > > 	   after a sequence of handovers on the same lock variable.
> > > 
> > > The latter relation is used only once, to provide the guarantee
> > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > barrier.
> > > 
> > > This patch drops this use in favor of using po-unlock-lock-po
> > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > pairings.  At first glance this seems to weaken the guarantee given
> > > by LKMM: When considering a long sequence of lock handovers
> > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > which finally executes such an after_unlock_lock fence, the mb
> > > relation currently links any stores in the critical section of P0
> > > to instructions P2 executes after its fence, but not so after the
> > > patch.
> > > 
> > > P0(int *x, int *y, spinlock_t *mylock)
> > > {
> > >         spin_lock(mylock);
> > >         WRITE_ONCE(*x, 2);
> > >         spin_unlock(mylock);
> > >         WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *y, int *z, spinlock_t *mylock)
> > > {
> > >         int r0 = READ_ONCE(*y); // reads 1
> > >         spin_lock(mylock);
> > >         spin_unlock(mylock);
> > >         WRITE_ONCE(*z,1);
> > > }
> > > 
> > > P2(int *z, int *d, spinlock_t *mylock)
> > > {
> > >         int r1 = READ_ONCE(*z); // reads 1
> > >         spin_lock(mylock);
> > >         spin_unlock(mylock);
> > >         smp_mb__after_unlock_lock();
> > >         WRITE_ONCE(*d,1);
> > > }
> > > 
> > > P3(int *x, int *d)
> > > {
> > >         WRITE_ONCE(*d,2);
> > >         smp_mb();
> > >         WRITE_ONCE(*x,1);
> > > }
> > > 
> > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > 
> > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > not weakened.  This is because the unlock operations along the
> > > sequence of handovers are A-cumulative fences.  They ensure that any
> > > stores that propagate to the CPU performing the first unlock
> > > operation in the sequence must also propagate to every CPU that
> > > performs a subsequent lock operation in the sequence.  Therefore any
> > > such stores will also be ordered correctly by the fence even if only
> > > the final handover is considered a full barrier.
> > > 
> > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > all.  The mb relation is used to define ordering through:
> > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > >    lock-release, rfe, and unlock-acquire orderings each provide hb
> > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > >    lock-release orderings simply add more fine-grained cumul-fence
> > >    edges to substitute a single strong-fence edge provided by a long
> > >    lock handover sequence
> > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > >    data races, where as discussed above any long handover sequence
> > >    can be turned into a sequence of cumul-fence edges that provide
> > >    the same ordering.
> > > 
> > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > ---
> > 
> > Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
> 
> A quick spot check showed no change in performance, so thank you both!
> 
> Queued for review and further testing.

And testing on https://github.com/paulmckrcu/litmus for litmus tests up
to ten processes and allowing 10 minutes per litmus test got this:

Exact output matches: 5208
!!! Timed out: 38
!!! Unknown primitive: 7

This test compared output with and without your patch.

For the tests with a Results clause, these failed:

	manual/kernel/C-srcu-nest-7.litmus
	manual/kernel/C-srcu-nest-5.litmus
	manual/kernel/C-srcu-nest-6.litmus
	manual/kernel/C-srcu-nest-8.litmus

But all of these will continue to fail until we get Alan's new-age SRCU
patch applied.

Therefore, this constitutes success, so good show thus far on testing!  ;-)

Also, I am going to be pushing the scripts I use to mainline.  They might
not be perfect, but they will be quite useful for this sort of change
to the memory model.

						Thanx, Paul
  
Jonas Oberhauser Jan. 27, 2023, 1:18 p.m. UTC | #4
On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
>> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
>>> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
>>>> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
>>>>
>>>> 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
>>>> 	   on the same CPU or immediate lock handovers on the same
>>>> 	   lock variable
>>>>
>>>> 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
>>>> 	   literally as described in rcupdate.h#L1002, i.e., even
>>>> 	   after a sequence of handovers on the same lock variable.
>>>>
>>>> The latter relation is used only once, to provide the guarantee
>>>> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
>>>> makes any UNLOCK+LOCK pair followed by the fence behave like a full
>>>> barrier.
>>>>
>>>> This patch drops this use in favor of using po-unlock-lock-po
>>>> everywhere, which unifies the way the model talks about UNLOCK+LOCK
>>>> pairings.  At first glance this seems to weaken the guarantee given
>>>> by LKMM: When considering a long sequence of lock handovers
>>>> such as below, where P0 hands the lock to P1, which hands it to P2,
>>>> which finally executes such an after_unlock_lock fence, the mb
>>>> relation currently links any stores in the critical section of P0
>>>> to instructions P2 executes after its fence, but not so after the
>>>> patch.
>>>>
>>>> P0(int *x, int *y, spinlock_t *mylock)
>>>> {
>>>>          spin_lock(mylock);
>>>>          WRITE_ONCE(*x, 2);
>>>>          spin_unlock(mylock);
>>>>          WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *y, int *z, spinlock_t *mylock)
>>>> {
>>>>          int r0 = READ_ONCE(*y); // reads 1
>>>>          spin_lock(mylock);
>>>>          spin_unlock(mylock);
>>>>          WRITE_ONCE(*z,1);
>>>> }
>>>>
>>>> P2(int *z, int *d, spinlock_t *mylock)
>>>> {
>>>>          int r1 = READ_ONCE(*z); // reads 1
>>>>          spin_lock(mylock);
>>>>          spin_unlock(mylock);
>>>>          smp_mb__after_unlock_lock();
>>>>          WRITE_ONCE(*d,1);
>>>> }
>>>>
>>>> P3(int *x, int *d)
>>>> {
>>>>          WRITE_ONCE(*d,2);
>>>>          smp_mb();
>>>>          WRITE_ONCE(*x,1);
>>>> }
>>>>
>>>> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
>>>>
>>>> Nevertheless, the ordering guarantee given in rcupdate.h is actually
>>>> not weakened.  This is because the unlock operations along the
>>>> sequence of handovers are A-cumulative fences.  They ensure that any
>>>> stores that propagate to the CPU performing the first unlock
>>>> operation in the sequence must also propagate to every CPU that
>>>> performs a subsequent lock operation in the sequence.  Therefore any
>>>> such stores will also be ordered correctly by the fence even if only
>>>> the final handover is considered a full barrier.
>>>>
>>>> Indeed this patch does not affect the behaviors allowed by LKMM at
>>>> all.  The mb relation is used to define ordering through:
>>>> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>>>>     lock-release, rfe, and unlock-acquire orderings each provide hb
>>>> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>>>>     lock-release orderings simply add more fine-grained cumul-fence
>>>>     edges to substitute a single strong-fence edge provided by a long
>>>>     lock handover sequence
>>>> 3) mb/strong-fence/pb and various similar uses in the definition of
>>>>     data races, where as discussed above any long handover sequence
>>>>     can be turned into a sequence of cumul-fence edges that provide
>>>>     the same ordering.
>>>>
>>>> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>>>> ---
>>> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
>> A quick spot check showed no change in performance, so thank you both!
>>
>> Queued for review and further testing.
> And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> to ten processes and allowing 10 minutes per litmus test got this:
>
> Exact output matches: 5208
> !!! Timed out: 38
> !!! Unknown primitive: 7
>
> This test compared output with and without your patch.
>
> For the tests with a Results clause, these failed:

Gave me a heart attack there for a second!

> Also, I am going to be pushing the scripts I use to mainline.  They might
> not be perfect, but they will be quite useful for this sort of change
> to the memory model.

I could also provide Coq proofs, although those are ignoring the 
srcu/data race parts at the moment.

Have fun, jonas
  
Paul E. McKenney Jan. 27, 2023, 3:13 p.m. UTC | #5
On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
> On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> > On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> > > On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > > > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > > > 
> > > > > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > > > 	   on the same CPU or immediate lock handovers on the same
> > > > > 	   lock variable
> > > > > 
> > > > > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > > > 	   literally as described in rcupdate.h#L1002, i.e., even
> > > > > 	   after a sequence of handovers on the same lock variable.
> > > > > 
> > > > > The latter relation is used only once, to provide the guarantee
> > > > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > > > barrier.
> > > > > 
> > > > > This patch drops this use in favor of using po-unlock-lock-po
> > > > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > > > pairings.  At first glance this seems to weaken the guarantee given
> > > > > by LKMM: When considering a long sequence of lock handovers
> > > > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > > > which finally executes such an after_unlock_lock fence, the mb
> > > > > relation currently links any stores in the critical section of P0
> > > > > to instructions P2 executes after its fence, but not so after the
> > > > > patch.
> > > > > 
> > > > > P0(int *x, int *y, spinlock_t *mylock)
> > > > > {
> > > > >          spin_lock(mylock);
> > > > >          WRITE_ONCE(*x, 2);
> > > > >          spin_unlock(mylock);
> > > > >          WRITE_ONCE(*y, 1);
> > > > > }
> > > > > 
> > > > > P1(int *y, int *z, spinlock_t *mylock)
> > > > > {
> > > > >          int r0 = READ_ONCE(*y); // reads 1
> > > > >          spin_lock(mylock);
> > > > >          spin_unlock(mylock);
> > > > >          WRITE_ONCE(*z,1);
> > > > > }
> > > > > 
> > > > > P2(int *z, int *d, spinlock_t *mylock)
> > > > > {
> > > > >          int r1 = READ_ONCE(*z); // reads 1
> > > > >          spin_lock(mylock);
> > > > >          spin_unlock(mylock);
> > > > >          smp_mb__after_unlock_lock();
> > > > >          WRITE_ONCE(*d,1);
> > > > > }
> > > > > 
> > > > > P3(int *x, int *d)
> > > > > {
> > > > >          WRITE_ONCE(*d,2);
> > > > >          smp_mb();
> > > > >          WRITE_ONCE(*x,1);
> > > > > }
> > > > > 
> > > > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > > > 
> > > > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > > > not weakened.  This is because the unlock operations along the
> > > > > sequence of handovers are A-cumulative fences.  They ensure that any
> > > > > stores that propagate to the CPU performing the first unlock
> > > > > operation in the sequence must also propagate to every CPU that
> > > > > performs a subsequent lock operation in the sequence.  Therefore any
> > > > > such stores will also be ordered correctly by the fence even if only
> > > > > the final handover is considered a full barrier.
> > > > > 
> > > > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > > > all.  The mb relation is used to define ordering through:
> > > > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > > > >     lock-release, rfe, and unlock-acquire orderings each provide hb
> > > > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > > > >     lock-release orderings simply add more fine-grained cumul-fence
> > > > >     edges to substitute a single strong-fence edge provided by a long
> > > > >     lock handover sequence
> > > > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > > > >     data races, where as discussed above any long handover sequence
> > > > >     can be turned into a sequence of cumul-fence edges that provide
> > > > >     the same ordering.
> > > > > 
> > > > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > > > ---
> > > > Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
> > > A quick spot check showed no change in performance, so thank you both!
> > > 
> > > Queued for review and further testing.
> > And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> > to ten processes and allowing 10 minutes per litmus test got this:
> > 
> > Exact output matches: 5208
> > !!! Timed out: 38
> > !!! Unknown primitive: 7
> > 
> > This test compared output with and without your patch.
> > 
> > For the tests with a Results clause, these failed:
> 
> Gave me a heart attack there for a second!

Sorry for the scare!!!

> > Also, I am going to be pushing the scripts I use to mainline.  They might
> > not be perfect, but they will be quite useful for this sort of change
> > to the memory model.
> 
> I could also provide Coq proofs, although those are ignoring the srcu/data
> race parts at the moment.

Can such proofs serve as regression tests for future changes?

							Thanx, Paul
  
Jonas Oberhauser Jan. 27, 2023, 3:57 p.m. UTC | #6
On 1/27/2023 4:13 PM, Paul E. McKenney wrote:
> On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
>> On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
>>> On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
>>>> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
>>>>> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
>>>>>> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
>>>>>>
>>>>>> 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
>>>>>> 	   on the same CPU or immediate lock handovers on the same
>>>>>> 	   lock variable
>>>>>>
>>>>>> 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
>>>>>> 	   literally as described in rcupdate.h#L1002, i.e., even
>>>>>> 	   after a sequence of handovers on the same lock variable.
>>>>>>
>>>>>> The latter relation is used only once, to provide the guarantee
>>>>>> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
>>>>>> makes any UNLOCK+LOCK pair followed by the fence behave like a full
>>>>>> barrier.
>>>>>>
>>>>>> This patch drops this use in favor of using po-unlock-lock-po
>>>>>> everywhere, which unifies the way the model talks about UNLOCK+LOCK
>>>>>> pairings.  At first glance this seems to weaken the guarantee given
>>>>>> by LKMM: When considering a long sequence of lock handovers
>>>>>> such as below, where P0 hands the lock to P1, which hands it to P2,
>>>>>> which finally executes such an after_unlock_lock fence, the mb
>>>>>> relation currently links any stores in the critical section of P0
>>>>>> to instructions P2 executes after its fence, but not so after the
>>>>>> patch.
>>>>>>
>>>>>> P0(int *x, int *y, spinlock_t *mylock)
>>>>>> {
>>>>>>           spin_lock(mylock);
>>>>>>           WRITE_ONCE(*x, 2);
>>>>>>           spin_unlock(mylock);
>>>>>>           WRITE_ONCE(*y, 1);
>>>>>> }
>>>>>>
>>>>>> P1(int *y, int *z, spinlock_t *mylock)
>>>>>> {
>>>>>>           int r0 = READ_ONCE(*y); // reads 1
>>>>>>           spin_lock(mylock);
>>>>>>           spin_unlock(mylock);
>>>>>>           WRITE_ONCE(*z,1);
>>>>>> }
>>>>>>
>>>>>> P2(int *z, int *d, spinlock_t *mylock)
>>>>>> {
>>>>>>           int r1 = READ_ONCE(*z); // reads 1
>>>>>>           spin_lock(mylock);
>>>>>>           spin_unlock(mylock);
>>>>>>           smp_mb__after_unlock_lock();
>>>>>>           WRITE_ONCE(*d,1);
>>>>>> }
>>>>>>
>>>>>> P3(int *x, int *d)
>>>>>> {
>>>>>>           WRITE_ONCE(*d,2);
>>>>>>           smp_mb();
>>>>>>           WRITE_ONCE(*x,1);
>>>>>> }
>>>>>>
>>>>>> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
>>>>>>
>>>>>> Nevertheless, the ordering guarantee given in rcupdate.h is actually
>>>>>> not weakened.  This is because the unlock operations along the
>>>>>> sequence of handovers are A-cumulative fences.  They ensure that any
>>>>>> stores that propagate to the CPU performing the first unlock
>>>>>> operation in the sequence must also propagate to every CPU that
>>>>>> performs a subsequent lock operation in the sequence.  Therefore any
>>>>>> such stores will also be ordered correctly by the fence even if only
>>>>>> the final handover is considered a full barrier.
>>>>>>
>>>>>> Indeed this patch does not affect the behaviors allowed by LKMM at
>>>>>> all.  The mb relation is used to define ordering through:
>>>>>> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>>>>>>      lock-release, rfe, and unlock-acquire orderings each provide hb
>>>>>> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>>>>>>      lock-release orderings simply add more fine-grained cumul-fence
>>>>>>      edges to substitute a single strong-fence edge provided by a long
>>>>>>      lock handover sequence
>>>>>> 3) mb/strong-fence/pb and various similar uses in the definition of
>>>>>>      data races, where as discussed above any long handover sequence
>>>>>>      can be turned into a sequence of cumul-fence edges that provide
>>>>>>      the same ordering.
>>>>>>
>>>>>> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>>>>>> ---
>>>>> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
>>>> A quick spot check showed no change in performance, so thank you both!
>>>>
>>>> Queued for review and further testing.
>>> And testing on https://github.com/paulmckrcu/litmus for litmus tests up
>>> to ten processes and allowing 10 minutes per litmus test got this:
>>>
>>> Exact output matches: 5208
>>> !!! Timed out: 38
>>> !!! Unknown primitive: 7
>>>
>>> This test compared output with and without your patch.
>>>
>>> For the tests with a Results clause, these failed:
>> Gave me a heart attack there for a second!
> Sorry for the scare!!!
>
>>> Also, I am going to be pushing the scripts I use to mainline.  They might
>>> not be perfect, but they will be quite useful for this sort of change
>>> to the memory model.
>> I could also provide Coq proofs, although those are ignoring the srcu/data
>> race parts at the moment.
> Can such proofs serve as regression tests for future changes?
>
> 							Thanx, Paul

So-so. On the upside, it would be easy to make them raise an alarm if 
the future change breaks stuff.
On the downside, they will often need maintenance together with any 
change. Sometimes a lot, sometimes very little.
I think for the proofs that show the equivalence between two models, the 
maintenance is quite a bit higher because every change needs to be 
reflected in both versions. So if you do 10 equivalent transformations 
and want to show that they remain equivalent with any future changes you 
do, you need to keep at least 10 additional models around ("current LKMM 
where ppo isn't in po, current LKMM where the unlock fence still relies 
on co, ...").

Right now, each equivalence proof I did (e.g., for using 
po-unlock-lock-po here, or the ppo<=po patch I originally proposed) is 
at average in the ballpark of 500 lines of proof script. And as 
evidenced by my discussion with Alan, these proofs only cover the "core 
model".

So for this kind of thing, I think it's better to look at them to have 
more confidence in the patch, and do the patch more based on which model 
is more reasonable (as Alan enforces). Then consider the simplified 
version as the more natural one, and not worry about future changes that 
break the equivalence (that would usually indicate a problem with the 
old model, rather than a problem with the patch).

For regressions, I would rather consider some desirable properties of 
LKMM, like "DRF-SC" or "monotonicity of barriers" or "ppo <= po" and try 
to prove those. This has the upside of not requiring to carry additional 
models around, so much less than half the maintenance effort, and if the 
property should be broken this usually would indicate a problem with the 
patch. So I think the bang for the buck is much higher there.

Those are my current thoughts anyways : )

have fun, jonas
  
Paul E. McKenney Jan. 27, 2023, 4:48 p.m. UTC | #7
On Fri, Jan 27, 2023 at 04:57:43PM +0100, Jonas Oberhauser wrote:
> 
> 
> On 1/27/2023 4:13 PM, Paul E. McKenney wrote:
> > On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
> > > On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> > > > On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > > > > > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > > > > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > > > > > 
> > > > > > > 	1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > > > > > 	   on the same CPU or immediate lock handovers on the same
> > > > > > > 	   lock variable
> > > > > > > 
> > > > > > > 	2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > > > > > 	   literally as described in rcupdate.h#L1002, i.e., even
> > > > > > > 	   after a sequence of handovers on the same lock variable.
> > > > > > > 
> > > > > > > The latter relation is used only once, to provide the guarantee
> > > > > > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > > > > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > > > > > barrier.
> > > > > > > 
> > > > > > > This patch drops this use in favor of using po-unlock-lock-po
> > > > > > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > > > > > pairings.  At first glance this seems to weaken the guarantee given
> > > > > > > by LKMM: When considering a long sequence of lock handovers
> > > > > > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > > > > > which finally executes such an after_unlock_lock fence, the mb
> > > > > > > relation currently links any stores in the critical section of P0
> > > > > > > to instructions P2 executes after its fence, but not so after the
> > > > > > > patch.
> > > > > > > 
> > > > > > > P0(int *x, int *y, spinlock_t *mylock)
> > > > > > > {
> > > > > > >           spin_lock(mylock);
> > > > > > >           WRITE_ONCE(*x, 2);
> > > > > > >           spin_unlock(mylock);
> > > > > > >           WRITE_ONCE(*y, 1);
> > > > > > > }
> > > > > > > 
> > > > > > > P1(int *y, int *z, spinlock_t *mylock)
> > > > > > > {
> > > > > > >           int r0 = READ_ONCE(*y); // reads 1
> > > > > > >           spin_lock(mylock);
> > > > > > >           spin_unlock(mylock);
> > > > > > >           WRITE_ONCE(*z,1);
> > > > > > > }
> > > > > > > 
> > > > > > > P2(int *z, int *d, spinlock_t *mylock)
> > > > > > > {
> > > > > > >           int r1 = READ_ONCE(*z); // reads 1
> > > > > > >           spin_lock(mylock);
> > > > > > >           spin_unlock(mylock);
> > > > > > >           smp_mb__after_unlock_lock();
> > > > > > >           WRITE_ONCE(*d,1);
> > > > > > > }
> > > > > > > 
> > > > > > > P3(int *x, int *d)
> > > > > > > {
> > > > > > >           WRITE_ONCE(*d,2);
> > > > > > >           smp_mb();
> > > > > > >           WRITE_ONCE(*x,1);
> > > > > > > }
> > > > > > > 
> > > > > > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > > > > > 
> > > > > > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > > > > > not weakened.  This is because the unlock operations along the
> > > > > > > sequence of handovers are A-cumulative fences.  They ensure that any
> > > > > > > stores that propagate to the CPU performing the first unlock
> > > > > > > operation in the sequence must also propagate to every CPU that
> > > > > > > performs a subsequent lock operation in the sequence.  Therefore any
> > > > > > > such stores will also be ordered correctly by the fence even if only
> > > > > > > the final handover is considered a full barrier.
> > > > > > > 
> > > > > > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > > > > > all.  The mb relation is used to define ordering through:
> > > > > > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > > > > > >      lock-release, rfe, and unlock-acquire orderings each provide hb
> > > > > > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > > > > > >      lock-release orderings simply add more fine-grained cumul-fence
> > > > > > >      edges to substitute a single strong-fence edge provided by a long
> > > > > > >      lock handover sequence
> > > > > > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > > > > > >      data races, where as discussed above any long handover sequence
> > > > > > >      can be turned into a sequence of cumul-fence edges that provide
> > > > > > >      the same ordering.
> > > > > > > 
> > > > > > > Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> > > > > > > ---
> > > > > > Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
> > > > > A quick spot check showed no change in performance, so thank you both!
> > > > > 
> > > > > Queued for review and further testing.
> > > > And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> > > > to ten processes and allowing 10 minutes per litmus test got this:
> > > > 
> > > > Exact output matches: 5208
> > > > !!! Timed out: 38
> > > > !!! Unknown primitive: 7
> > > > 
> > > > This test compared output with and without your patch.
> > > > 
> > > > For the tests with a Results clause, these failed:
> > > Gave me a heart attack there for a second!
> > Sorry for the scare!!!
> > 
> > > > Also, I am going to be pushing the scripts I use to mainline.  They might
> > > > not be perfect, but they will be quite useful for this sort of change
> > > > to the memory model.
> > > I could also provide Coq proofs, although those are ignoring the srcu/data
> > > race parts at the moment.
> > Can such proofs serve as regression tests for future changes?
> > 
> > 							Thanx, Paul
> 
> So-so. On the upside, it would be easy to make them raise an alarm if the
> future change breaks stuff.
> On the downside, they will often need maintenance together with any change.
> Sometimes a lot, sometimes very little.
> I think for the proofs that show the equivalence between two models, the
> maintenance is quite a bit higher because every change needs to be reflected
> in both versions. So if you do 10 equivalent transformations and want to
> show that they remain equivalent with any future changes you do, you need to
> keep at least 10 additional models around ("current LKMM where ppo isn't in
> po, current LKMM where the unlock fence still relies on co, ...").
> 
> Right now, each equivalence proof I did (e.g., for using po-unlock-lock-po
> here, or the ppo<=po patch I originally proposed) is at average in the
> ballpark of 500 lines of proof script. And as evidenced by my discussion
> with Alan, these proofs only cover the "core model".
> 
> So for this kind of thing, I think it's better to look at them to have more
> confidence in the patch, and do the patch more based on which model is more
> reasonable (as Alan enforces). Then consider the simplified version as the
> more natural one, and not worry about future changes that break the
> equivalence (that would usually indicate a problem with the old model,
> rather than a problem with the patch).
> 
> For regressions, I would rather consider some desirable properties of LKMM,
> like "DRF-SC" or "monotonicity of barriers" or "ppo <= po" and try to prove
> those. This has the upside of not requiring to carry additional models
> around, so much less than half the maintenance effort, and if the property
> should be broken this usually would indicate a problem with the patch. So I
> think the bang for the buck is much higher there.
> 
> Those are my current thoughts anyways : )

That matches my experience, for whatever that is worth.  (I first
used Promela/spin in the early 1990s, which proved to be an excellent
cautionary tale.)

							Thanx, Paul
  

Patch

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 07f884f9b2bf..6e531457bb73 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -37,8 +37,19 @@  let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
 	([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
 	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
-	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
-		fencerel(After-unlock-lock) ; [M])
+(*
+ * Note: The po-unlock-lock-po relation only passes the lock to the direct
+ * successor, perhaps giving the impression that the ordering of the
+ * smp_mb__after_unlock_lock() fence only affects a single lock handover.
+ * However, in a longer sequence of lock handovers, the implicit
+ * A-cumulative release fences of lock-release ensure that any stores that
+ * propagate to one of the involved CPUs before it hands over the lock to
+ * the next CPU will also propagate to the final CPU handing over the lock
+ * to the CPU that executes the fence.  Therefore, all those stores are
+ * also affected by the fence.
+ *)
+	([M] ; po-unlock-lock-po ;
+		[After-unlock-lock] ; po ; [M])
 let gp = po ; [Sync-rcu | Sync-srcu] ; po?
 let strong-fence = mb | gp