[RFC] srcu: Yet more detail for srcu_readers_active_idx_check() comments

Message ID 20221214191355.GA2596199@paulmck-ThinkPad-P17-Gen-1
State New
Headers
Series [RFC] srcu: Yet more detail for srcu_readers_active_idx_check() comments |

Commit Message

Paul E. McKenney Dec. 14, 2022, 7:13 p.m. UTC
  The comment in srcu_readers_active_idx_check() following the smp_mb()
is out of date, hailing from a simpler time when preemption was disabled
across the bulk of __srcu_read_lock().  The fact that preemption was
disabled meant that the number of tasks that had fetched the old index
but not yet incremented counters was limited by the number of CPUs.

In our more complex modern times, the number of CPUs is no longer a limit.
This commit therefore updates this comment, additionally giving more
memory-ordering detail.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Reported-by: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  

Comments

Joel Fernandes Dec. 14, 2022, 8:51 p.m. UTC | #1
Hi Paul,

On Wed, Dec 14, 2022 at 2:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> The comment in srcu_readers_active_idx_check() following the smp_mb()
> is out of date, hailing from a simpler time when preemption was disabled
> across the bulk of __srcu_read_lock().  The fact that preemption was
> disabled meant that the number of tasks that had fetched the old index
> but not yet incremented counters was limited by the number of CPUs.
>
> In our more complex modern times, the number of CPUs is no longer a limit.
> This commit therefore updates this comment, additionally giving more
> memory-ordering detail.
>
> Reported-by: Boqun Feng <boqun.feng@gmail.com>
> Reported-by: Frederic Weisbecker <frederic@kernel.org>
> Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Reported-by: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
> Reported-by: Uladzislau Rezki <urezki@gmail.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 68b8d8b150db1..ba12c50ee3658 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -469,24 +469,53 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
>
>         /*
>          * If the locks are the same as the unlocks, then there must have
> -        * been no readers on this index at some time in between. This does
> -        * not mean that there are no more readers, as one could have read
> -        * the current index but not have incremented the lock counter yet.
> +        * been no readers on this index at some point in this function.
> +        * But there might be more readers, as a task might have read
> +        * the current ->srcu_idx but not yet have incremented its CPU's
> +        * ->srcu_lock_count[idx] counter.  In fact, it is possible
> +        * that most of the tasks have been preempted between fetching
> +        * ->srcu_idx and incrementing ->srcu_lock_count[idx].  And there
> +        * could be almost (ULONG_MAX / sizeof(struct task_struct)) tasks
> +        * in a system whose address space was fully populated with memory.
> +        * Call this quantity Nt.

If I understand correctly, here we are saying ULONG_MAX bytes is the
total theoretical size of memory right? So we cannot have more than Nt
tasks preempted.

>          *
> -        * So suppose that the updater is preempted here for so long
> -        * that more than ULONG_MAX non-nested readers come and go in
> -        * the meantime.  It turns out that this cannot result in overflow
> -        * because if a reader modifies its unlock count after we read it
> -        * above, then that reader's next load of ->srcu_idx is guaranteed
> -        * to get the new value, which will cause it to operate on the
> -        * other bank of counters, where it cannot contribute to the
> -        * overflow of these counters.  This means that there is a maximum
> -        * of 2*NR_CPUS increments, which cannot overflow given current
> -        * systems, especially not on 64-bit systems.
> +        * So suppose that the updater is preempted at this point in the
> +        * code for a long time.  That now-preempted updater has already
> +        * flipped ->srcu_idx (possibly during the preceding grace period),
> +        * done an smp_mb() (again, possibly during the preceding grace
> +        * period), and summed up the ->srcu_unlock_count[idx] counters.
> +        * How many times can a given one of the aforementioned Nt tasks
> +        * increment the old ->srcu_idx value's ->srcu_lock_count[idx]
> +        * counter, in the absence of nesting?
>          *
> -        * OK, how about nesting?  This does impose a limit on nesting
> -        * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
> -        * especially on 64-bit systems.
> +        * It can clearly do so once, given that it has already fetched
> +        * the old value of ->srcu_idx and is just about to use that value
> +        * to index its increment of ->srcu_lock_count[idx].  But as soon as
> +        * it leaves that SRCU read-side critical section, it will increment
> +        * ->srcu_unlock_count[idx], which must follow the updater's above
> +        * read from that same value.  Thus, as soon the reading task does
> +        * an smp_mb() and a later fetch from ->srcu_idx, that task will be
> +        * guaranteed to get the new index.  Except that the increment of
> +        * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
> +        * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
> +        * is before the smp_mb().  Thus, that task might not see the new
> +        * value of ->srcu_idx until the -second- __srcu_read_lock(),
> +        * which in turn means that this task might well increment
> +        * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
> +        * not just once.
> +        *
> +        * That is, there can be almost 2 * Nt further increments of
> +        * ->srcu_lock_count[idx] for the old index.  But this is OK because

s/almost/atmost/ ?

and also, I think you need 1 smp_mb() per CPU to reflect the update to
the per-cpu counter, so it min(2 * Nt, number of CPUs) which I think
is much smaller, but that is ok to not mention.

> +        * the size of the task_struct structure limits the value of Nt.
> +        *
> +        * OK, but what about nesting?  This does impose a limit on
> +        * nesting of half of the size of the task_struct structure
> +        * (measured in bytes), which should be sufficient.  A late 2022

Does nesting here mean one SRCU reader section nested within another?
With this definition of nesting, is it not possible for a single task
to increment the 'lock' counter any number of times before getting
preempted?

> +        * TREE01 rcutorture run reported this size to be no less than
> +        * 9408 bytes, allowing up to 4704 levels of nesting, which is
> +        * comfortably beyond excessive.  Especially on 64-bit systems,
> +        * which are unlikely to be configured with an address space fully
> +        * populated with memory, at least not anytime soon.
>          */

Below is a summary from my point of view. Please correct me if I'm
wrong. I was trying to reason that we only need to care about waiting
for readers that sample idx *after* srcu_read_lock() issued smp_mb().

The case to consider a race between readers and
srcu_readers_active_idx_check() IMO is when a reader samples idx,
issues smp_mb() enters its RSCS. If it does not do smp_mb(), its RSCS
need not be waited on as it is not considered to be entered from a
global memory PoV.  Assuming it did issue the smp_mb() in
srcu_read_lock() and then got preempted (which IMO is the only case to
care about the reader for), and say the first scan failed to track
down this in-flight RSCS. The first scan can fail to track the RSCS
for 2 reasons:

#1. The idx being scanned in the first scan is the one that the reader
did not sample.
#2. The smp_mb() in the first scan's srcu_readers_active_idx_check()
happened *before* the smp_mb() post-counter increment in
srcu_read_lock().

In case of #2, the first scan was too early and the second scan will
not even look at this idx as it gets flipped. So we can safely assume
in #2 that this RSCS need not be waited on and was too early. IOW, the
grace period started before the RSCS, so tough luck for that RSCS.

So AFAICS, case #1 is the only one that matters for consideration of
race. In this case, we will rely on the second scan and assume that we
"need to do the right thing" for the case where the srcu_read_lock()'s
smp_mb() happened *before* the second scan's smp_mb() and the idx
being reader-occupied is supposed to be properly nailed down by the
second scan. In this case, the second scan *will* see the lock count
increment of all in-flight readers, preempted or otherwise, because of
the smp_mb() it issues prior to sampling all the lock counts of the
flipped idx.  And upto Nt number of increments can be "caught" by the
second scan, before a wrap around fools it into believing the Nt
readers don't need any love, quiet to their detriment.

I also did not get why you care about readers that come and ago (you
mentioned the first reader seeing incorrect idx and the second reader
seeing the right flipped one, etc). Those readers are irrelevant
AFAICS since they came and went, and need not be waited on , right?.

Thanks for the discussions today!

thanks,

  - Joel

>         return srcu_readers_lock_idx(ssp, idx) == unlocks;
>  }
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index ee8a6a711719a..399c818fe47ce 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4900,6 +4900,7 @@ void __init rcu_init(void)
>         // Kick-start any polled grace periods that started early.
>         if (!(per_cpu_ptr(&rcu_data, cpu)->mynode->exp_seq_poll_rq & 0x1))
>                 (void)start_poll_synchronize_rcu_expedited();
> +       pr_alert("sizeof(struct task_struct) = %lu\n", sizeof(struct task_struct));
>  }
>
>  #include "tree_stall.h"
  
Paul E. McKenney Dec. 14, 2022, 9:24 p.m. UTC | #2
On Wed, Dec 14, 2022 at 03:51:54PM -0500, Joel Fernandes wrote:
> Hi Paul,
> 
> On Wed, Dec 14, 2022 at 2:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > The comment in srcu_readers_active_idx_check() following the smp_mb()
> > is out of date, hailing from a simpler time when preemption was disabled
> > across the bulk of __srcu_read_lock().  The fact that preemption was
> > disabled meant that the number of tasks that had fetched the old index
> > but not yet incremented counters was limited by the number of CPUs.
> >
> > In our more complex modern times, the number of CPUs is no longer a limit.
> > This commit therefore updates this comment, additionally giving more
> > memory-ordering detail.
> >
> > Reported-by: Boqun Feng <boqun.feng@gmail.com>
> > Reported-by: Frederic Weisbecker <frederic@kernel.org>
> > Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > Reported-by: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
> > Reported-by: Uladzislau Rezki <urezki@gmail.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 68b8d8b150db1..ba12c50ee3658 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -469,24 +469,53 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
> >
> >         /*
> >          * If the locks are the same as the unlocks, then there must have
> > -        * been no readers on this index at some time in between. This does
> > -        * not mean that there are no more readers, as one could have read
> > -        * the current index but not have incremented the lock counter yet.
> > +        * been no readers on this index at some point in this function.
> > +        * But there might be more readers, as a task might have read
> > +        * the current ->srcu_idx but not yet have incremented its CPU's
> > +        * ->srcu_lock_count[idx] counter.  In fact, it is possible
> > +        * that most of the tasks have been preempted between fetching
> > +        * ->srcu_idx and incrementing ->srcu_lock_count[idx].  And there
> > +        * could be almost (ULONG_MAX / sizeof(struct task_struct)) tasks
> > +        * in a system whose address space was fully populated with memory.
> > +        * Call this quantity Nt.
> 
> If I understand correctly, here we are saying ULONG_MAX bytes is the
> total theoretical size of memory right? So we cannot have more than Nt
> tasks preempted.

Yes, that is the intent.

> >          *
> > -        * So suppose that the updater is preempted here for so long
> > -        * that more than ULONG_MAX non-nested readers come and go in
> > -        * the meantime.  It turns out that this cannot result in overflow
> > -        * because if a reader modifies its unlock count after we read it
> > -        * above, then that reader's next load of ->srcu_idx is guaranteed
> > -        * to get the new value, which will cause it to operate on the
> > -        * other bank of counters, where it cannot contribute to the
> > -        * overflow of these counters.  This means that there is a maximum
> > -        * of 2*NR_CPUS increments, which cannot overflow given current
> > -        * systems, especially not on 64-bit systems.
> > +        * So suppose that the updater is preempted at this point in the
> > +        * code for a long time.  That now-preempted updater has already
> > +        * flipped ->srcu_idx (possibly during the preceding grace period),
> > +        * done an smp_mb() (again, possibly during the preceding grace
> > +        * period), and summed up the ->srcu_unlock_count[idx] counters.
> > +        * How many times can a given one of the aforementioned Nt tasks
> > +        * increment the old ->srcu_idx value's ->srcu_lock_count[idx]
> > +        * counter, in the absence of nesting?
> >          *
> > -        * OK, how about nesting?  This does impose a limit on nesting
> > -        * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
> > -        * especially on 64-bit systems.
> > +        * It can clearly do so once, given that it has already fetched
> > +        * the old value of ->srcu_idx and is just about to use that value
> > +        * to index its increment of ->srcu_lock_count[idx].  But as soon as
> > +        * it leaves that SRCU read-side critical section, it will increment
> > +        * ->srcu_unlock_count[idx], which must follow the updater's above
> > +        * read from that same value.  Thus, as soon the reading task does
> > +        * an smp_mb() and a later fetch from ->srcu_idx, that task will be
> > +        * guaranteed to get the new index.  Except that the increment of
> > +        * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
> > +        * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
> > +        * is before the smp_mb().  Thus, that task might not see the new
> > +        * value of ->srcu_idx until the -second- __srcu_read_lock(),
> > +        * which in turn means that this task might well increment
> > +        * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
> > +        * not just once.
> > +        *
> > +        * That is, there can be almost 2 * Nt further increments of
> > +        * ->srcu_lock_count[idx] for the old index.  But this is OK because
> 
> s/almost/atmost/ ?

The advantage of "almost" is it accounts for the fact that the update-side
thread isn't going to be doing any SRCU reading.  ;-)

> and also, I think you need 1 smp_mb() per CPU to reflect the update to
> the per-cpu counter, so it min(2 * Nt, number of CPUs) which I think
> is much smaller, but that is ok to not mention.

You make a good point, but that min() would be too small.  It would
instead be Nt + Nc, where "Nc" is the number of CPUs.

I will update this on the next rebase.

> > +        * the size of the task_struct structure limits the value of Nt.
> > +        *
> > +        * OK, but what about nesting?  This does impose a limit on
> > +        * nesting of half of the size of the task_struct structure
> > +        * (measured in bytes), which should be sufficient.  A late 2022
> 
> Does nesting here mean one SRCU reader section nested within another?
> With this definition of nesting, is it not possible for a single task
> to increment the 'lock' counter any number of times before getting
> preempted?

That is exactly why nesting depth must be limited.  And let's face it,
if you have code that nests 4704 times within a given SRCU read-side
critical section, that code needs some serious help!

> > +        * TREE01 rcutorture run reported this size to be no less than
> > +        * 9408 bytes, allowing up to 4704 levels of nesting, which is
> > +        * comfortably beyond excessive.  Especially on 64-bit systems,
> > +        * which are unlikely to be configured with an address space fully
> > +        * populated with memory, at least not anytime soon.
> >          */
> 
> Below is a summary from my point of view. Please correct me if I'm
> wrong. I was trying to reason that we only need to care about waiting
> for readers that sample idx *after* srcu_read_lock() issued smp_mb().
> 
> The case to consider a race between readers and
> srcu_readers_active_idx_check() IMO is when a reader samples idx,

I would instead say "when a reader has sampled ->srcu_idx, but has not
yet executed the smp_mb() or incremented ->srcu_lock_count".

> issues smp_mb() enters its RSCS. If it does not do smp_mb(), its RSCS
> need not be waited on as it is not considered to be entered from a
> global memory PoV.  Assuming it did issue the smp_mb() in
> srcu_read_lock() and then got preempted (which IMO is the only case to
> care about the reader for), and say the first scan failed to track
> down this in-flight RSCS.

Except that this smp_mb() is not externally visible to software.
Other CPUs have to have seen and access following that smp_mb() for it
to matter from a software viewpoint.

>                           The first scan can fail to track the RSCS
> for 2 reasons:
> 
> #1. The idx being scanned in the first scan is the one that the reader
> did not sample.
> #2. The smp_mb() in the first scan's srcu_readers_active_idx_check()
> happened *before* the smp_mb() post-counter increment in
> srcu_read_lock().

Again, software cannot see the smp_mb() in and of itself.  What
matters is the increment of ->srcu_lock_count and the updater's
scan of this same counter.

#3. The reader still hasn't gotten around to incrementing
->srcu_lock_count.

> In case of #2, the first scan was too early and the second scan will
> not even look at this idx as it gets flipped. So we can safely assume
> in #2 that this RSCS need not be waited on and was too early. IOW, the
> grace period started before the RSCS, so tough luck for that RSCS.

And the point of a number of the memory barriers is to ensure that when
this happens, the critical section is guaranteed to see anything that
happened before the start of the current grace period.

> So AFAICS, case #1 is the only one that matters for consideration of
> race. In this case, we will rely on the second scan and assume that we
> "need to do the right thing" for the case where the srcu_read_lock()'s
> smp_mb() happened *before* the second scan's smp_mb() and the idx
> being reader-occupied is supposed to be properly nailed down by the
> second scan. In this case, the second scan *will* see the lock count
> increment of all in-flight readers, preempted or otherwise, because of
> the smp_mb() it issues prior to sampling all the lock counts of the
> flipped idx.  And upto Nt number of increments can be "caught" by the
> second scan, before a wrap around fools it into believing the Nt
> readers don't need any love, quiet to their detriment.

Both #1 and #3 must be handled, right?

> I also did not get why you care about readers that come and ago (you
> mentioned the first reader seeing incorrect idx and the second reader
> seeing the right flipped one, etc). Those readers are irrelevant
> AFAICS since they came and went, and need not be waited on , right?.

The comment is attempting to show (among other things) that we don't
need to care about readers that come and go more than twice during that
critical interval of time during the counter scans.

> Thanks for the discussions today!

Good to have you guys there!

							Thanx, Paul

> thanks,
> 
>   - Joel
> 
> >         return srcu_readers_lock_idx(ssp, idx) == unlocks;
> >  }
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index ee8a6a711719a..399c818fe47ce 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4900,6 +4900,7 @@ void __init rcu_init(void)
> >         // Kick-start any polled grace periods that started early.
> >         if (!(per_cpu_ptr(&rcu_data, cpu)->mynode->exp_seq_poll_rq & 0x1))
> >                 (void)start_poll_synchronize_rcu_expedited();
> > +       pr_alert("sizeof(struct task_struct) = %lu\n", sizeof(struct task_struct));
> >  }
> >
> >  #include "tree_stall.h"
  
Joel Fernandes Dec. 14, 2022, 11:07 p.m. UTC | #3
On Wed, Dec 14, 2022 at 9:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Dec 14, 2022 at 03:51:54PM -0500, Joel Fernandes wrote:
> > Hi Paul,
> >
> > On Wed, Dec 14, 2022 at 2:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > The comment in srcu_readers_active_idx_check() following the smp_mb()
> > > is out of date, hailing from a simpler time when preemption was disabled
> > > across the bulk of __srcu_read_lock().  The fact that preemption was
> > > disabled meant that the number of tasks that had fetched the old index
> > > but not yet incremented counters was limited by the number of CPUs.
[...]
> > > +        * TREE01 rcutorture run reported this size to be no less than
> > > +        * 9408 bytes, allowing up to 4704 levels of nesting, which is
> > > +        * comfortably beyond excessive.  Especially on 64-bit systems,
> > > +        * which are unlikely to be configured with an address space fully
> > > +        * populated with memory, at least not anytime soon.
> > >          */
> >
> > Below is a summary from my point of view. Please correct me if I'm
> > wrong. I was trying to reason that we only need to care about waiting
> > for readers that sample idx *after* srcu_read_lock() issued smp_mb().
> >
> > The case to consider a race between readers and
> > srcu_readers_active_idx_check() IMO is when a reader samples idx,
>
> I would instead say "when a reader has sampled ->srcu_idx, but has not
> yet executed the smp_mb() or incremented ->srcu_lock_count".

If it has not yet executed smp_mb(), then I am missing why this
read-side critical section matters as far as being waited for. If it
is waited for due to a race, great, just a slightly higher delay. If
it is not waited for, then no one should care AFAICS, it is too late
and the next grace period will anyway scan both the idx to track it.

> > issues smp_mb() enters its RSCS. If it does not do smp_mb(), its RSCS
> > need not be waited on as it is not considered to be entered from a
> > global memory PoV.  Assuming it did issue the smp_mb() in
> > srcu_read_lock() and then got preempted (which IMO is the only case to
> > care about the reader for), and say the first scan failed to track
> > down this in-flight RSCS.
>
> Except that this smp_mb() is not externally visible to software.
> Other CPUs have to have seen and access following that smp_mb() for it
> to matter from a software viewpoint.

Sure, that second pairing smp_mb() will be in
srcu_readers_active_idx_check(). I am definitely considering it in
pairs here in the reasoning, and not on its own.

> >                           The first scan can fail to track the RSCS
> > for 2 reasons:
> >
> > #1. The idx being scanned in the first scan is the one that the reader
> > did not sample.
> > #2. The smp_mb() in the first scan's srcu_readers_active_idx_check()
> > happened *before* the smp_mb() post-counter increment in
> > srcu_read_lock().
>
> Again, software cannot see the smp_mb() in and of itself.  What
> matters is the increment of ->srcu_lock_count and the updater's
> scan of this same counter.

Yes, and that scan of the counter happens after a write-side smp_mb() AFAICS.

> #3. The reader still hasn't gotten around to incrementing
> ->srcu_lock_count.

Then it has not executed an smp_mb() on the read-side yet, so it
should not be taken into consideration AFAICS.

> > In case of #2, the first scan was too early and the second scan will
> > not even look at this idx as it gets flipped. So we can safely assume
> > in #2 that this RSCS need not be waited on and was too early. IOW, the
> > grace period started before the RSCS, so tough luck for that RSCS.
>
> And the point of a number of the memory barriers is to ensure that when
> this happens, the critical section is guaranteed to see anything that
> happened before the start of the current grace period.

Sure.

> > So AFAICS, case #1 is the only one that matters for consideration of
> > race. In this case, we will rely on the second scan and assume that we
> > "need to do the right thing" for the case where the srcu_read_lock()'s
> > smp_mb() happened *before* the second scan's smp_mb() and the idx
> > being reader-occupied is supposed to be properly nailed down by the
> > second scan. In this case, the second scan *will* see the lock count
> > increment of all in-flight readers, preempted or otherwise, because of
> > the smp_mb() it issues prior to sampling all the lock counts of the
> > flipped idx.  And upto Nt number of increments can be "caught" by the
> > second scan, before a wrap around fools it into believing the Nt
> > readers don't need any love, quiet to their detriment.
>
> Both #1 and #3 must be handled, right?

This is the part I am not sure, that #3 matters, but I could be
missing something.

> > I also did not get why you care about readers that come and ago (you
> > mentioned the first reader seeing incorrect idx and the second reader
> > seeing the right flipped one, etc). Those readers are irrelevant
> > AFAICS since they came and went, and need not be waited on , right?.
>
> The comment is attempting to show (among other things) that we don't
> need to care about readers that come and go more than twice during that
> critical interval of time during the counter scans.

Why do we need to care about readers that come and go even once? Once
they are gone, they have already done an unlock() and their RSCS is
over, so they need to be considered AFAICS.

Again, sorry if my comments are nonsense, I will try to reason more.
The goal of asking questions is to learn ;-)

Thanks.
  
Joel Fernandes Dec. 14, 2022, 11:10 p.m. UTC | #4
On Wed, Dec 14, 2022 at 11:07 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Dec 14, 2022 at 9:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > I also did not get why you care about readers that come and ago (you
> > > mentioned the first reader seeing incorrect idx and the second reader
> > > seeing the right flipped one, etc). Those readers are irrelevant
> > > AFAICS since they came and went, and need not be waited on , right?.
> >
> > The comment is attempting to show (among other things) that we don't
> > need to care about readers that come and go more than twice during that
> > critical interval of time during the counter scans.
>
> Why do we need to care about readers that come and go even once? Once
> they are gone, they have already done an unlock() and their RSCS is
> over, so they need to be considered AFAICS.
>

Aargh, I meant: "so they need to be considered AFAICS".

Thanks.
  
Joel Fernandes Dec. 14, 2022, 11:14 p.m. UTC | #5
On Wed, Dec 14, 2022 at 11:10 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Dec 14, 2022 at 11:07 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Wed, Dec 14, 2022 at 9:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > I also did not get why you care about readers that come and ago (you
> > > > mentioned the first reader seeing incorrect idx and the second reader
> > > > seeing the right flipped one, etc). Those readers are irrelevant
> > > > AFAICS since they came and went, and need not be waited on , right?.
> > >
> > > The comment is attempting to show (among other things) that we don't
> > > need to care about readers that come and go more than twice during that
> > > critical interval of time during the counter scans.
> >
> > Why do we need to care about readers that come and go even once? Once
> > they are gone, they have already done an unlock() and their RSCS is
> > over, so they need to be considered AFAICS.
> >
>
> Aargh, I meant: "so they need to be considered AFAICS".

Trying again: "so they need not be considered AFAICS".

Anyway, my 1 year old son is sick so signing off for now. Thanks.


 - Joel
  
Paul E. McKenney Dec. 15, 2022, 12:01 a.m. UTC | #6
On Wed, Dec 14, 2022 at 11:07:52PM +0000, Joel Fernandes wrote:
> On Wed, Dec 14, 2022 at 9:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Dec 14, 2022 at 03:51:54PM -0500, Joel Fernandes wrote:
> > > Hi Paul,
> > >
> > > On Wed, Dec 14, 2022 at 2:13 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > The comment in srcu_readers_active_idx_check() following the smp_mb()
> > > > is out of date, hailing from a simpler time when preemption was disabled
> > > > across the bulk of __srcu_read_lock().  The fact that preemption was
> > > > disabled meant that the number of tasks that had fetched the old index
> > > > but not yet incremented counters was limited by the number of CPUs.
> [...]
> > > > +        * TREE01 rcutorture run reported this size to be no less than
> > > > +        * 9408 bytes, allowing up to 4704 levels of nesting, which is
> > > > +        * comfortably beyond excessive.  Especially on 64-bit systems,
> > > > +        * which are unlikely to be configured with an address space fully
> > > > +        * populated with memory, at least not anytime soon.
> > > >          */
> > >
> > > Below is a summary from my point of view. Please correct me if I'm
> > > wrong. I was trying to reason that we only need to care about waiting
> > > for readers that sample idx *after* srcu_read_lock() issued smp_mb().
> > >
> > > The case to consider a race between readers and
> > > srcu_readers_active_idx_check() IMO is when a reader samples idx,
> >
> > I would instead say "when a reader has sampled ->srcu_idx, but has not
> > yet executed the smp_mb() or incremented ->srcu_lock_count".
> 
> If it has not yet executed smp_mb(), then I am missing why this
> read-side critical section matters as far as being waited for. If it
> is waited for due to a race, great, just a slightly higher delay. If
> it is not waited for, then no one should care AFAICS, it is too late
> and the next grace period will anyway scan both the idx to track it.

From the viewpoints of other CPUs, it matters not whether that task has
or has not executed smp_mb().  Unless and until it also executes something
following that smp_mb(), that smp_mb() has zero software-visible effect.

To see this, try creating an LKMM litmus test in which it matters whether
a given process has or doesn't have an smp_mb() at the very end of that
process, that is, that smp_mb(), if present is the very last thing that
the process executes.

> > > issues smp_mb() enters its RSCS. If it does not do smp_mb(), its RSCS
> > > need not be waited on as it is not considered to be entered from a
> > > global memory PoV.  Assuming it did issue the smp_mb() in
> > > srcu_read_lock() and then got preempted (which IMO is the only case to
> > > care about the reader for), and say the first scan failed to track
> > > down this in-flight RSCS.
> >
> > Except that this smp_mb() is not externally visible to software.
> > Other CPUs have to have seen and access following that smp_mb() for it
> > to matter from a software viewpoint.
> 
> Sure, that second pairing smp_mb() will be in
> srcu_readers_active_idx_check(). I am definitely considering it in
> pairs here in the reasoning, and not on its own.

Very good.

But you need more than pairs.  You also need memory accesses on both
sides of each smp_mb() in that pairing.

> > >                           The first scan can fail to track the RSCS
> > > for 2 reasons:
> > >
> > > #1. The idx being scanned in the first scan is the one that the reader
> > > did not sample.
> > > #2. The smp_mb() in the first scan's srcu_readers_active_idx_check()
> > > happened *before* the smp_mb() post-counter increment in
> > > srcu_read_lock().
> >
> > Again, software cannot see the smp_mb() in and of itself.  What
> > matters is the increment of ->srcu_lock_count and the updater's
> > scan of this same counter.
> 
> Yes, and that scan of the counter happens after a write-side smp_mb() AFAICS.
> 
> > #3. The reader still hasn't gotten around to incrementing
> > ->srcu_lock_count.
> 
> Then it has not executed an smp_mb() on the read-side yet, so it
> should not be taken into consideration AFAICS.
> 
> > > In case of #2, the first scan was too early and the second scan will
> > > not even look at this idx as it gets flipped. So we can safely assume
> > > in #2 that this RSCS need not be waited on and was too early. IOW, the
> > > grace period started before the RSCS, so tough luck for that RSCS.
> >
> > And the point of a number of the memory barriers is to ensure that when
> > this happens, the critical section is guaranteed to see anything that
> > happened before the start of the current grace period.
> 
> Sure.
> 
> > > So AFAICS, case #1 is the only one that matters for consideration of
> > > race. In this case, we will rely on the second scan and assume that we
> > > "need to do the right thing" for the case where the srcu_read_lock()'s
> > > smp_mb() happened *before* the second scan's smp_mb() and the idx
> > > being reader-occupied is supposed to be properly nailed down by the
> > > second scan. In this case, the second scan *will* see the lock count
> > > increment of all in-flight readers, preempted or otherwise, because of
> > > the smp_mb() it issues prior to sampling all the lock counts of the
> > > flipped idx.  And upto Nt number of increments can be "caught" by the
> > > second scan, before a wrap around fools it into believing the Nt
> > > readers don't need any love, quiet to their detriment.
> >
> > Both #1 and #3 must be handled, right?
> 
> This is the part I am not sure, that #3 matters, but I could be
> missing something.

My kneejerk reaction, right or wrong, is that you are thinking in terms
of a globally agreed-upon timeline.

> > > I also did not get why you care about readers that come and ago (you
> > > mentioned the first reader seeing incorrect idx and the second reader
> > > seeing the right flipped one, etc). Those readers are irrelevant
> > > AFAICS since they came and went, and need not be waited on , right?.
> >
> > The comment is attempting to show (among other things) that we don't
> > need to care about readers that come and go more than twice during that
> > critical interval of time during the counter scans.
> 
> Why do we need to care about readers that come and go even once? Once
> they are gone, they have already done an unlock() and their RSCS is
> over, so they need to be considered AFAICS.

Because if a given reader could come and go 2^32-1 times while always
using the same ->srcu_idx, then the updater could incorrectly lose track
of some other SRCU read-side critical section, and could thus end the
grace period prematurely.

> Again, sorry if my comments are nonsense, I will try to reason more.
> The goal of asking questions is to learn ;-)

Try setting up some LKMM litmus tests.  Those could be good documentation
in any case.  (Note that you have to cheat to make counter wrap happen,
and you need really small counters to avoid overflowing herd7's
capabilities.)

							Thanx, Paul
  
Paul E. McKenney Dec. 15, 2022, 12:04 a.m. UTC | #7
On Wed, Dec 14, 2022 at 11:14:48PM +0000, Joel Fernandes wrote:
> On Wed, Dec 14, 2022 at 11:10 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Wed, Dec 14, 2022 at 11:07 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Wed, Dec 14, 2022 at 9:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > I also did not get why you care about readers that come and ago (you
> > > > > mentioned the first reader seeing incorrect idx and the second reader
> > > > > seeing the right flipped one, etc). Those readers are irrelevant
> > > > > AFAICS since they came and went, and need not be waited on , right?.
> > > >
> > > > The comment is attempting to show (among other things) that we don't
> > > > need to care about readers that come and go more than twice during that
> > > > critical interval of time during the counter scans.
> > >
> > > Why do we need to care about readers that come and go even once? Once
> > > they are gone, they have already done an unlock() and their RSCS is
> > > over, so they need to be considered AFAICS.
> > >
> >
> > Aargh, I meant: "so they need to be considered AFAICS".
> 
> Trying again: "so they need not be considered AFAICS".

Give or take counter wrap, which can make it appear that still-present
readers have finished.

> Anyway, my 1 year old son is sick so signing off for now. Thanks.

Ouch!  I hope he recovers quickly and completely!!!

							Thanx, Paul
  
Joel Fernandes Dec. 15, 2022, 1:34 a.m. UTC | #8
On Wed, Dec 14, 2022 at 7:04 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Dec 14, 2022 at 11:14:48PM +0000, Joel Fernandes wrote:
> > On Wed, Dec 14, 2022 at 11:10 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Wed, Dec 14, 2022 at 11:07 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > On Wed, Dec 14, 2022 at 9:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > I also did not get why you care about readers that come and ago (you
> > > > > > mentioned the first reader seeing incorrect idx and the second reader
> > > > > > seeing the right flipped one, etc). Those readers are irrelevant
> > > > > > AFAICS since they came and went, and need not be waited on , right?.
> > > > >
> > > > > The comment is attempting to show (among other things) that we don't
> > > > > need to care about readers that come and go more than twice during that
> > > > > critical interval of time during the counter scans.
> > > >
> > > > Why do we need to care about readers that come and go even once? Once
> > > > they are gone, they have already done an unlock() and their RSCS is
> > > > over, so they need to be considered AFAICS.
> > > >
> > >
> > > Aargh, I meant: "so they need to be considered AFAICS".
> >
> > Trying again: "so they need not be considered AFAICS".
>
> Give or take counter wrap, which can make it appear that still-present
> readers have finished.

Ah you mean those flood of readers affect the counter wrapping and not
that those readers have to be waited on or anything, they just happen
to have a side-effect on *existing readers* which need to be waited
on.

Thanks a lot for this explanation, this part I agree. Readers that
sampled the idx before the flip happened, and then did their
lock+unlock counter increments both after the flip, and after the
second unlock counter scan (second scan), can mess up the lock
counters such that the second scan found lock==unlock, even though it
is not to be due to pre-existing readers. But as you pointed out,
there have to be a substantially large number of these to cause the
equality check to match. This might be another reason why it is
important to scan the unlocks first, because the locks are what have
to cause the wrap around of the lock counter. Instead if you counted
locks first, then the unlocks would have to do the catching up to the
locks which are much fewer than a full wrap around.

I still don't see why this affects only the first reader. There could
be more than 1 reader that needs to be waited on (the readers that
started before the grace period started). Say there are 5 of them.
When the grace period starts, the interfering readers (2^32 of them or
so) could have sampled the old idx before the flip, and then do
lock+unlock (on that old pre-flip() idx) in quick succession after the
smp_mb() in the second srcu_readers_active_idx_check(). That causes
those 5 poor readers to not be waited on. Granted, any new readers
after this thundering herd should see the new idx and will not be
affected, thanks to the memory barriers.

Still confused, but hey I'll take it little at a time ;-) Also thanks
for the suggestions for litmus tests.

Cheers,

 - Joel

> > Anyway, my 1 year old son is sick so signing off for now. Thanks.
>
> Ouch!  I hope he recovers quickly and completely!!!
>
>                                                         Thanx, Paul
  
Paul E. McKenney Dec. 15, 2022, 1:39 a.m. UTC | #9
On Wed, Dec 14, 2022 at 08:34:03PM -0500, Joel Fernandes wrote:
> On Wed, Dec 14, 2022 at 7:04 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Dec 14, 2022 at 11:14:48PM +0000, Joel Fernandes wrote:
> > > On Wed, Dec 14, 2022 at 11:10 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > On Wed, Dec 14, 2022 at 11:07 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > > On Wed, Dec 14, 2022 at 9:24 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > I also did not get why you care about readers that come and ago (you
> > > > > > > mentioned the first reader seeing incorrect idx and the second reader
> > > > > > > seeing the right flipped one, etc). Those readers are irrelevant
> > > > > > > AFAICS since they came and went, and need not be waited on , right?.
> > > > > >
> > > > > > The comment is attempting to show (among other things) that we don't
> > > > > > need to care about readers that come and go more than twice during that
> > > > > > critical interval of time during the counter scans.
> > > > >
> > > > > Why do we need to care about readers that come and go even once? Once
> > > > > they are gone, they have already done an unlock() and their RSCS is
> > > > > over, so they need to be considered AFAICS.
> > > > >
> > > >
> > > > Aargh, I meant: "so they need to be considered AFAICS".
> > >
> > > Trying again: "so they need not be considered AFAICS".
> >
> > Give or take counter wrap, which can make it appear that still-present
> > readers have finished.
> 
> Ah you mean those flood of readers affect the counter wrapping and not
> that those readers have to be waited on or anything, they just happen
> to have a side-effect on *existing readers* which need to be waited
> on.

Exactly, that flood of readers could potentially result in a
counter-wrap-induced too-short SRCU grace period, and it is SRCU's job
to avoid that, and specifically the job of the code that this comment
lives in.

> Thanks a lot for this explanation, this part I agree. Readers that
> sampled the idx before the flip happened, and then did their
> lock+unlock counter increments both after the flip, and after the
> second unlock counter scan (second scan), can mess up the lock
> counters such that the second scan found lock==unlock, even though it
> is not to be due to pre-existing readers. But as you pointed out,
> there have to be a substantially large number of these to cause the
> equality check to match. This might be another reason why it is
> important to scan the unlocks first, because the locks are what have
> to cause the wrap around of the lock counter. Instead if you counted
> locks first, then the unlocks would have to do the catching up to the
> locks which are much fewer than a full wrap around.

True enough!

> I still don't see why this affects only the first reader. There could
> be more than 1 reader that needs to be waited on (the readers that
> started before the grace period started). Say there are 5 of them.
> When the grace period starts, the interfering readers (2^32 of them or
> so) could have sampled the old idx before the flip, and then do
> lock+unlock (on that old pre-flip() idx) in quick succession after the
> smp_mb() in the second srcu_readers_active_idx_check(). That causes
> those 5 poor readers to not be waited on. Granted, any new readers
> after this thundering herd should see the new idx and will not be
> affected, thanks to the memory barriers.

Yes, there could be quite a few such readers, but it only takes one
messed-up reader to constitute a bug in SRCU.  ;-)

> Still confused, but hey I'll take it little at a time ;-) Also thanks
> for the suggestions for litmus tests.

Agreed, setting this sort of thing aside for a bit and then coming back
to it can be helpful.

							Thanx, Paul

> Cheers,
> 
>  - Joel
> 
> > > Anyway, my 1 year old son is sick so signing off for now. Thanks.
> >
> > Ouch!  I hope he recovers quickly and completely!!!
> >
> >                                                         Thanx, Paul
  
Frederic Weisbecker Dec. 15, 2022, 4:54 p.m. UTC | #10
On Wed, Dec 14, 2022 at 11:13:55AM -0800, Paul E. McKenney wrote:
> The comment in srcu_readers_active_idx_check() following the smp_mb()
> is out of date, hailing from a simpler time when preemption was disabled
> across the bulk of __srcu_read_lock().  The fact that preemption was
> disabled meant that the number of tasks that had fetched the old index
> but not yet incremented counters was limited by the number of CPUs.
> 
> In our more complex modern times, the number of CPUs is no longer a limit.
> This commit therefore updates this comment, additionally giving more
> memory-ordering detail.
> 
> Reported-by: Boqun Feng <boqun.feng@gmail.com>
> Reported-by: Frederic Weisbecker <frederic@kernel.org>

Not really, while you guys were debating on that comment, I was still starring
at the previous one (as usual).

Or to put it in an SRCU way, while you guys saw the flipped idx, I was still
using the old one :)

> -	 * OK, how about nesting?  This does impose a limit on nesting
> -	 * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
> -	 * especially on 64-bit systems.
> +	 * It can clearly do so once, given that it has already fetched
> +	 * the old value of ->srcu_idx and is just about to use that value
> +	 * to index its increment of ->srcu_lock_count[idx].  But as soon as
> +	 * it leaves that SRCU read-side critical section, it will increment
> +	 * ->srcu_unlock_count[idx], which must follow the updater's above
> +	 * read from that same value.  Thus, as soon the reading task does
> +	 * an smp_mb() and a later fetch from ->srcu_idx, that task will be
> +	 * guaranteed to get the new index.  Except that the increment of
> +	 * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
> +	 * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
> +	 * is before the smp_mb().  Thus, that task might not see the new
> +	 * value of ->srcu_idx until the -second- __srcu_read_lock(),
> +	 * which in turn means that this task might well increment
> +	 * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
> +	 * not just once.

You lost me on that one.

      UPDATER                               READER
      -------                               ------
      //srcu_readers_lock_idx               //srcu_read_lock
      idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
      READ srcu_lock_count[idx ^ 1]         srcu_lock_count[idx]++
      smp_mb();                             smp_mb();
      //flip_index                          /* srcu_read_unlock (ignoring on purpose) */
      ssp->srcu_idx++;                      /* smp_mb(); */
      smp_mb();                             /* srcu_unlock_count[old_idx]++ */
      //srcu_readers_lock_idx               //srcu_read_lock again
      idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
      READ srcu_lock_count[idx ^ 1]
                                            

Scenario for the reader to increment the old idx once:

_ Assume ssp->srcu_idx is initially 0.
_ The READER reads idx that is 0
_ The updater runs and flips the idx that is now 1
_ The reader resumes with 0 as an index but on the next srcu_read_lock()
  it will see the new idx which is 1

What could be the scenario for it to increment the old idx twice?

Thanks.
  
Paul E. McKenney Dec. 15, 2022, 5:08 p.m. UTC | #11
On Thu, Dec 15, 2022 at 05:54:52PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 14, 2022 at 11:13:55AM -0800, Paul E. McKenney wrote:
> > The comment in srcu_readers_active_idx_check() following the smp_mb()
> > is out of date, hailing from a simpler time when preemption was disabled
> > across the bulk of __srcu_read_lock().  The fact that preemption was
> > disabled meant that the number of tasks that had fetched the old index
> > but not yet incremented counters was limited by the number of CPUs.
> > 
> > In our more complex modern times, the number of CPUs is no longer a limit.
> > This commit therefore updates this comment, additionally giving more
> > memory-ordering detail.
> > 
> > Reported-by: Boqun Feng <boqun.feng@gmail.com>
> > Reported-by: Frederic Weisbecker <frederic@kernel.org>
> 
> Not really, while you guys were debating on that comment, I was still starring
> at the previous one (as usual).
> 
> Or to put it in an SRCU way, while you guys saw the flipped idx, I was still
> using the old one :)
> 
> > -	 * OK, how about nesting?  This does impose a limit on nesting
> > -	 * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
> > -	 * especially on 64-bit systems.
> > +	 * It can clearly do so once, given that it has already fetched
> > +	 * the old value of ->srcu_idx and is just about to use that value
> > +	 * to index its increment of ->srcu_lock_count[idx].  But as soon as
> > +	 * it leaves that SRCU read-side critical section, it will increment
> > +	 * ->srcu_unlock_count[idx], which must follow the updater's above
> > +	 * read from that same value.  Thus, as soon the reading task does
> > +	 * an smp_mb() and a later fetch from ->srcu_idx, that task will be
> > +	 * guaranteed to get the new index.  Except that the increment of
> > +	 * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
> > +	 * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
> > +	 * is before the smp_mb().  Thus, that task might not see the new
> > +	 * value of ->srcu_idx until the -second- __srcu_read_lock(),
> > +	 * which in turn means that this task might well increment
> > +	 * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
> > +	 * not just once.
> 
> You lost me on that one.
> 
>       UPDATER                               READER
>       -------                               ------
>       //srcu_readers_lock_idx               //srcu_read_lock
>       idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
>       READ srcu_lock_count[idx ^ 1]         srcu_lock_count[idx]++

Shouldn't this be "READ srcu_unlock_count[idx ^ 1]"?

And then the above paragraph assumes that the updater gets stuck here...

>       smp_mb();                             smp_mb();

...or here.

And only then do we do the read of srcu_lock_count[idx ^ 1], correct?

>       //flip_index                          /* srcu_read_unlock (ignoring on purpose) */
>       ssp->srcu_idx++;                      /* smp_mb(); */
>       smp_mb();                             /* srcu_unlock_count[old_idx]++ */
>       //srcu_readers_lock_idx               //srcu_read_lock again
>       idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
>       READ srcu_lock_count[idx ^ 1]

And likewise here?

> Scenario for the reader to increment the old idx once:
> 
> _ Assume ssp->srcu_idx is initially 0.
> _ The READER reads idx that is 0
> _ The updater runs and flips the idx that is now 1
> _ The reader resumes with 0 as an index but on the next srcu_read_lock()
>   it will see the new idx which is 1
> 
> What could be the scenario for it to increment the old idx twice?

Unless I am missing something, the reader must reference the
srcu_unlock_count[old_idx] and then do smp_mb() before it will be
absolutely guaranteed of seeing the new value of ->srcu_idx.

So what am I missing?

						Thanx, Paul
  
Joel Fernandes Dec. 15, 2022, 5:48 p.m. UTC | #12
On Thu, Dec 15, 2022 at 5:08 PM Paul E. McKenney <paulmck@kernel.org> wrote:

> > Scenario for the reader to increment the old idx once:
> >
> > _ Assume ssp->srcu_idx is initially 0.
> > _ The READER reads idx that is 0
> > _ The updater runs and flips the idx that is now 1
> > _ The reader resumes with 0 as an index but on the next srcu_read_lock()
> >   it will see the new idx which is 1
> >
> > What could be the scenario for it to increment the old idx twice?
>
> Unless I am missing something, the reader must reference the
> srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> absolutely guaranteed of seeing the new value of ->srcu_idx.

I think both of you are right depending on how the flip raced with the
first reader's unlock in that specific task.

If the first read section's srcu_read_unlock() and its corresponding
smp_mb()  happened before the flip, then the increment of old idx
would happen only once. The next srcu_read_lock() will read the new
index. If the srcu_read_unlock() and it's corresponding smp_mb()
happened after the flip, the old_idx will be sampled again and can be
incremented twice. So it depends on how the flip races with
srcu_read_unlock().

Also, since this is all hard to reason about I started making some
diagrams, LOL. For your amusement, here is why need to scan both idx
during grace period detection: https://i.imgur.com/jz4bNKd.png

thanks,

 - Joel
  
Joel Fernandes Dec. 15, 2022, 5:58 p.m. UTC | #13
On Thu, Dec 15, 2022 at 5:48 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Thu, Dec 15, 2022 at 5:08 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> > > Scenario for the reader to increment the old idx once:
> > >
> > > _ Assume ssp->srcu_idx is initially 0.
> > > _ The READER reads idx that is 0
> > > _ The updater runs and flips the idx that is now 1
> > > _ The reader resumes with 0 as an index but on the next srcu_read_lock()
> > >   it will see the new idx which is 1
> > >
> > > What could be the scenario for it to increment the old idx twice?
> >
> > Unless I am missing something, the reader must reference the
> > srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> > absolutely guaranteed of seeing the new value of ->srcu_idx.
>
> I think both of you are right depending on how the flip raced with the
> first reader's unlock in that specific task.
>
> If the first read section's srcu_read_unlock() and its corresponding
> smp_mb()  happened before the flip, then the increment of old idx
> would happen only once. The next srcu_read_lock() will read the new
> index. If the srcu_read_unlock() and it's corresponding smp_mb()
> happened after the flip, the old_idx will be sampled again and can be
> incremented twice. So it depends on how the flip races with
> srcu_read_unlock().

I am sorry this is inverted, but my statement's gist stands I believe:

1. Flip+smp_mb() happened before unlock's smp_mb() -- reader will not
increment old_idx the second time.

2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
has no new smp_mb() that happens AFTER the flip happened. So it can
totally sample the old idx again -- that particular reader will
increment twice, but the next time, it will see the flipped one.

Did I get that right? Thanks.
  
Frederic Weisbecker Dec. 15, 2022, 5:58 p.m. UTC | #14
On Thu, Dec 15, 2022 at 09:08:34AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 15, 2022 at 05:54:52PM +0100, Frederic Weisbecker wrote:
> > On Wed, Dec 14, 2022 at 11:13:55AM -0800, Paul E. McKenney wrote:
> > > The comment in srcu_readers_active_idx_check() following the smp_mb()
> > > is out of date, hailing from a simpler time when preemption was disabled
> > > across the bulk of __srcu_read_lock().  The fact that preemption was
> > > disabled meant that the number of tasks that had fetched the old index
> > > but not yet incremented counters was limited by the number of CPUs.
> > > 
> > > In our more complex modern times, the number of CPUs is no longer a limit.
> > > This commit therefore updates this comment, additionally giving more
> > > memory-ordering detail.
> > > 
> > > Reported-by: Boqun Feng <boqun.feng@gmail.com>
> > > Reported-by: Frederic Weisbecker <frederic@kernel.org>
> > 
> > Not really, while you guys were debating on that comment, I was still starring
> > at the previous one (as usual).
> > 
> > Or to put it in an SRCU way, while you guys saw the flipped idx, I was still
> > using the old one :)
> > 
> > > -	 * OK, how about nesting?  This does impose a limit on nesting
> > > -	 * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
> > > -	 * especially on 64-bit systems.
> > > +	 * It can clearly do so once, given that it has already fetched
> > > +	 * the old value of ->srcu_idx and is just about to use that value
> > > +	 * to index its increment of ->srcu_lock_count[idx].  But as soon as
> > > +	 * it leaves that SRCU read-side critical section, it will increment
> > > +	 * ->srcu_unlock_count[idx], which must follow the updater's above
> > > +	 * read from that same value.  Thus, as soon the reading task does
> > > +	 * an smp_mb() and a later fetch from ->srcu_idx, that task will be
> > > +	 * guaranteed to get the new index.  Except that the increment of
> > > +	 * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
> > > +	 * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
> > > +	 * is before the smp_mb().  Thus, that task might not see the new
> > > +	 * value of ->srcu_idx until the -second- __srcu_read_lock(),
> > > +	 * which in turn means that this task might well increment
> > > +	 * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
> > > +	 * not just once.
> > 
> > You lost me on that one.
> > 
> >       UPDATER                               READER
> >       -------                               ------
> >       //srcu_readers_lock_idx               //srcu_read_lock
> >       idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
> >       READ srcu_lock_count[idx ^ 1]         srcu_lock_count[idx]++
> 
> Shouldn't this be "READ srcu_unlock_count[idx ^ 1]"?
> 
> And then the above paragraph assumes that the updater gets stuck here...

Right I ignored the unlock part on purpose. But ok let's add it (later note: just switch
directly to the next paragraph to see how I realize I'm wrong)


      UPDATER                               READER
      -------                               ------
      idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
      READ srcu_unlock_count[idx ^ 1]       srcu_lock_count[idx]++
      smp_mb();                             smp_mb();
      READ srcu_lock_count[idx ^ 1]         // read side crit
      smp_mb();                             smp_mb();
      idx = ssp->srcu_idx;                  srcu_unlock_count[old_idx]++
      ssp->srcu_idx++;                      idx = ssp->srcu_idx;
      smp_mb();                             
      READ srcu_unlock_count[idx ^ 1]
      smp_mb();
      READ srcu_lock_count[idx ^ 1]  

> Unless I am missing something, the reader must reference the
> srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> absolutely guaranteed of seeing the new value of ->srcu_idx.
> 
> So what am I missing?

But there is the smp_mb() between the srcu_lock_count[idx]++ of the 1st
srcu_read_lock() and the idx READ from the second srcu_read_lock():

         WRITER                                READER
         -----                                 -------
         WRITE idx                             WRITE srcu_lock_count[old_idx]
         smp_mb()                              smp_mb()
         READ srcu_lock_count[new_idx]         READ idx

Ah wait! On SCAN2 we are reading the count from the _new_ idx, not the old one, ok
that's why it doesn't work. So then for it to write twice on the old idx we have:

_ idx is initially 0
_ READER fetches idx (idx=0) and is preempted
_ New GP: Updater goes through its whole thing and flips idx
_ Yet another new GP: Updater goes again but is preempted in the middle of
SCAN1: it has read unlock_count but not yet lock_count
_ READER increments lock_count, then unlock_count, for the old idx (0).
_ New READER: indeed we don't have a barrier between unlock_count and idx read.
  So we read again the idx unordered against the previous WRITE to unlock_count.
  So this may be still the old idx (0): we increment lock_count, there goes the
  desired smp_mb(), we increment unlock_count of the old idx (0).
_ Yet another READER: finally we see the new idx (1).

Phew! Did I get it right this time? :))
  
Paul E. McKenney Dec. 15, 2022, 6:53 p.m. UTC | #15
On Thu, Dec 15, 2022 at 06:58:30PM +0100, Frederic Weisbecker wrote:
> On Thu, Dec 15, 2022 at 09:08:34AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 15, 2022 at 05:54:52PM +0100, Frederic Weisbecker wrote:
> > > On Wed, Dec 14, 2022 at 11:13:55AM -0800, Paul E. McKenney wrote:
> > > > The comment in srcu_readers_active_idx_check() following the smp_mb()
> > > > is out of date, hailing from a simpler time when preemption was disabled
> > > > across the bulk of __srcu_read_lock().  The fact that preemption was
> > > > disabled meant that the number of tasks that had fetched the old index
> > > > but not yet incremented counters was limited by the number of CPUs.
> > > > 
> > > > In our more complex modern times, the number of CPUs is no longer a limit.
> > > > This commit therefore updates this comment, additionally giving more
> > > > memory-ordering detail.
> > > > 
> > > > Reported-by: Boqun Feng <boqun.feng@gmail.com>
> > > > Reported-by: Frederic Weisbecker <frederic@kernel.org>
> > > 
> > > Not really, while you guys were debating on that comment, I was still starring
> > > at the previous one (as usual).
> > > 
> > > Or to put it in an SRCU way, while you guys saw the flipped idx, I was still
> > > using the old one :)
> > > 
> > > > -	 * OK, how about nesting?  This does impose a limit on nesting
> > > > -	 * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
> > > > -	 * especially on 64-bit systems.
> > > > +	 * It can clearly do so once, given that it has already fetched
> > > > +	 * the old value of ->srcu_idx and is just about to use that value
> > > > +	 * to index its increment of ->srcu_lock_count[idx].  But as soon as
> > > > +	 * it leaves that SRCU read-side critical section, it will increment
> > > > +	 * ->srcu_unlock_count[idx], which must follow the updater's above
> > > > +	 * read from that same value.  Thus, as soon the reading task does
> > > > +	 * an smp_mb() and a later fetch from ->srcu_idx, that task will be
> > > > +	 * guaranteed to get the new index.  Except that the increment of
> > > > +	 * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
> > > > +	 * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
> > > > +	 * is before the smp_mb().  Thus, that task might not see the new
> > > > +	 * value of ->srcu_idx until the -second- __srcu_read_lock(),
> > > > +	 * which in turn means that this task might well increment
> > > > +	 * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
> > > > +	 * not just once.
> > > 
> > > You lost me on that one.
> > > 
> > >       UPDATER                               READER
> > >       -------                               ------
> > >       //srcu_readers_lock_idx               //srcu_read_lock
> > >       idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
> > >       READ srcu_lock_count[idx ^ 1]         srcu_lock_count[idx]++
> > 
> > Shouldn't this be "READ srcu_unlock_count[idx ^ 1]"?
> > 
> > And then the above paragraph assumes that the updater gets stuck here...
> 
> Right I ignored the unlock part on purpose. But ok let's add it (later note: just switch
> directly to the next paragraph to see how I realize I'm wrong)

I do know that feeling!  There are very few things that instill a healthy
sense of humility quite like working with concurrent code.  ;-)

>       UPDATER                               READER
>       -------                               ------
>       idx = ssp->srcu_idx;                  idx = ssp->srcu_idx;
>       READ srcu_unlock_count[idx ^ 1]       srcu_lock_count[idx]++
>       smp_mb();                             smp_mb();
>       READ srcu_lock_count[idx ^ 1]         // read side crit
>       smp_mb();                             smp_mb();
>       idx = ssp->srcu_idx;                  srcu_unlock_count[old_idx]++
>       ssp->srcu_idx++;                      idx = ssp->srcu_idx;
>       smp_mb();                             
>       READ srcu_unlock_count[idx ^ 1]
>       smp_mb();
>       READ srcu_lock_count[idx ^ 1]  
> 
> > Unless I am missing something, the reader must reference the
> > srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> > absolutely guaranteed of seeing the new value of ->srcu_idx.
> > 
> > So what am I missing?
> 
> But there is the smp_mb() between the srcu_lock_count[idx]++ of the 1st
> srcu_read_lock() and the idx READ from the second srcu_read_lock():
> 
>          WRITER                                READER
>          -----                                 -------
>          WRITE idx                             WRITE srcu_lock_count[old_idx]
>          smp_mb()                              smp_mb()
>          READ srcu_lock_count[new_idx]         READ idx
> 
> Ah wait! On SCAN2 we are reading the count from the _new_ idx, not the old one, ok
> that's why it doesn't work. So then for it to write twice on the old idx we have:
> 
> _ idx is initially 0
> _ READER fetches idx (idx=0) and is preempted
> _ New GP: Updater goes through its whole thing and flips idx
> _ Yet another new GP: Updater goes again but is preempted in the middle of
> SCAN1: it has read unlock_count but not yet lock_count
> _ READER increments lock_count, then unlock_count, for the old idx (0).
> _ New READER: indeed we don't have a barrier between unlock_count and idx read.
>   So we read again the idx unordered against the previous WRITE to unlock_count.
>   So this may be still the old idx (0): we increment lock_count, there goes the
>   desired smp_mb(), we increment unlock_count of the old idx (0).
> _ Yet another READER: finally we see the new idx (1).
> 
> Phew! Did I get it right this time? :))

Either you got it right or we both got it wrong in the same way.  ;-)

							Thanx, Paul
  
Paul E. McKenney Dec. 15, 2022, 7:58 p.m. UTC | #16
On Thu, Dec 15, 2022 at 05:48:46PM +0000, Joel Fernandes wrote:
> On Thu, Dec 15, 2022 at 5:08 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> > > Scenario for the reader to increment the old idx once:
> > >
> > > _ Assume ssp->srcu_idx is initially 0.
> > > _ The READER reads idx that is 0
> > > _ The updater runs and flips the idx that is now 1
> > > _ The reader resumes with 0 as an index but on the next srcu_read_lock()
> > >   it will see the new idx which is 1
> > >
> > > What could be the scenario for it to increment the old idx twice?
> >
> > Unless I am missing something, the reader must reference the
> > srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> > absolutely guaranteed of seeing the new value of ->srcu_idx.
> 
> I think both of you are right depending on how the flip raced with the
> first reader's unlock in that specific task.

There are indeed a variety of scenarios and also a variety of failure
cases.

> If the first read section's srcu_read_unlock() and its corresponding
> smp_mb()  happened before the flip, then the increment of old idx
> would happen only once. The next srcu_read_lock() will read the new
> index. If the srcu_read_unlock() and it's corresponding smp_mb()
> happened after the flip, the old_idx will be sampled again and can be
> incremented twice. So it depends on how the flip races with
> srcu_read_unlock().

I do understand that a number of people like reasoning about
memory-barrier ordering, courtesy of the sequentially consistent portions
of the C and C++ memory models, but thinking in terms of the accesses
surrounding the memory barriers has been far less error-prone.

> Also, since this is all hard to reason about I started making some
> diagrams, LOL. For your amusement, here is why need to scan both idx
> during grace period detection: https://i.imgur.com/jz4bNKd.png

Nice!

I suggest placing a gap between GP 2 and GP 3.  That way, you can make it
very clear that Reader 1's critical section starts after the end of GP 2
(thus clearly never blocking GP 2) and before GP 3 (thus possibly having
a reference to some data that is going to be freed at the end of GP 3).

I also suggest coloring Reader 1 red and Reader 2 green, given that the
color red generally indicates danger.

							Thanx, Paul
  
Joel Fernandes Dec. 15, 2022, 8:03 p.m. UTC | #17
Hi Paul,

On Thu, Dec 15, 2022 at 2:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > If the first read section's srcu_read_unlock() and its corresponding
> > smp_mb()  happened before the flip, then the increment of old idx
> > would happen only once. The next srcu_read_lock() will read the new
> > index. If the srcu_read_unlock() and it's corresponding smp_mb()
> > happened after the flip, the old_idx will be sampled again and can be
> > incremented twice. So it depends on how the flip races with
> > srcu_read_unlock().
>
> I do understand that a number of people like reasoning about
> memory-barrier ordering, courtesy of the sequentially consistent portions
> of the C and C++ memory models, but thinking in terms of the accesses
> surrounding the memory barriers has been far less error-prone.

Sure, but we are already talking in terms of the access to idx right?
That's what we're saying is visible by memory barriers and we are
trying to reason here about the ordering (flip does the write to idx
and followed by smp_mb(), and there is corresponding read of idx on
the srcu_read_lock() side. So we are indeed talking in terms of
access, but let me know if I missed something.

> > Also, since this is all hard to reason about I started making some
> > diagrams, LOL. For your amusement, here is why need to scan both idx
> > during grace period detection: https://i.imgur.com/jz4bNKd.png
>
> Nice!
>
> I suggest placing a gap between GP 2 and GP 3.  That way, you can make it
> very clear that Reader 1's critical section starts after the end of GP 2
> (thus clearly never blocking GP 2) and before GP 3 (thus possibly having
> a reference to some data that is going to be freed at the end of GP 3).
>
> I also suggest coloring Reader 1 red and Reader 2 green, given that the
> color red generally indicates danger.

Thanks for these suggestions! I will make the update. I am planning to
make a number of diagrams for other scenarios as well, as it helps
visualize. Google drawing is nice for these. I am happy to share these
with you all if there is interest :).

Thanks!

 - Joel
  
Paul E. McKenney Dec. 15, 2022, 8:13 p.m. UTC | #18
On Thu, Dec 15, 2022 at 05:58:14PM +0000, Joel Fernandes wrote:
> On Thu, Dec 15, 2022 at 5:48 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Thu, Dec 15, 2022 at 5:08 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > > > Scenario for the reader to increment the old idx once:
> > > >
> > > > _ Assume ssp->srcu_idx is initially 0.
> > > > _ The READER reads idx that is 0
> > > > _ The updater runs and flips the idx that is now 1
> > > > _ The reader resumes with 0 as an index but on the next srcu_read_lock()
> > > >   it will see the new idx which is 1
> > > >
> > > > What could be the scenario for it to increment the old idx twice?
> > >
> > > Unless I am missing something, the reader must reference the
> > > srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> > > absolutely guaranteed of seeing the new value of ->srcu_idx.
> >
> > I think both of you are right depending on how the flip raced with the
> > first reader's unlock in that specific task.
> >
> > If the first read section's srcu_read_unlock() and its corresponding
> > smp_mb()  happened before the flip, then the increment of old idx
> > would happen only once. The next srcu_read_lock() will read the new
> > index. If the srcu_read_unlock() and it's corresponding smp_mb()
> > happened after the flip, the old_idx will be sampled again and can be
> > incremented twice. So it depends on how the flip races with
> > srcu_read_unlock().
> 
> I am sorry this is inverted, but my statement's gist stands I believe:
> 
> 1. Flip+smp_mb() happened before unlock's smp_mb() -- reader will not
> increment old_idx the second time.

By "increment old_idx" you mean "increment ->srcu_lock_count[old_idx]",
correct?

Again, the important ordering isn't the smp_mb(), but the accesses,
in this case, the accesses to ->srcu_unlock_count[idx].

> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> has no new smp_mb() that happens AFTER the flip happened. So it can
> totally sample the old idx again -- that particular reader will
> increment twice, but the next time, it will see the flipped one.

I will let you transliterate both.  ;-)

> Did I get that right? Thanks.

So why am I unhappy with orderings of smp_mb()?

To see this, let's take the usual store-buffering litmus test:

	CPU 0			CPU 1
	WRITE_ONCE(x, 1);	WRITE_ONCE(y, 1);
	smp_mb();		smp_mb();
	r0 = READ_ONCE(y);	r1 = READ_ONCE(x);

Suppose CPU 0's smp_mb() happens before that of CPU 1:

	CPU 0			CPU 1
	WRITE_ONCE(x, 1);	WRITE_ONCE(y, 1);
	smp_mb();
				smp_mb();
	r0 = READ_ONCE(y);	r1 = READ_ONCE(x);

We get r0 == r1 == 1.

Compare this to CPU 1's smp_mb() happening before that of CPU 0:

	CPU 0			CPU 1
	WRITE_ONCE(x, 1);	WRITE_ONCE(y, 1);
				smp_mb();
	smp_mb();
	r0 = READ_ONCE(y);	r1 = READ_ONCE(x);

We still get r0 == r1 == 1.  Reversing the order of the two smp_mb()
calls changed nothing.

But, if we order CPU 1's write to follow CPU 0's read, then we have
this:

	CPU 0			CPU 1
	WRITE_ONCE(x, 1);
	smp_mb();
	r0 = READ_ONCE(y);
				WRITE_ONCE(y, 1);
				smp_mb();
				r1 = READ_ONCE(x);

Here, given that r0 had the final value of zero, we know that
r1 must have a final value of 1.

And suppose we reverse this:

	CPU 0			CPU 1
				WRITE_ONCE(y, 1);
				smp_mb();
				r1 = READ_ONCE(x);
	WRITE_ONCE(x, 1);
	smp_mb();
	r0 = READ_ONCE(y);

Now there is a software-visible difference in behavior.  The value of
r0 is now 1 instead of zero and the value of r1 is now 0 instead of 1.

Does this make sense?

							Thanx, Paul
  
Joel Fernandes Dec. 15, 2022, 8:33 p.m. UTC | #19
On Thu, Dec 15, 2022 at 3:03 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Paul,
>
> On Thu, Dec 15, 2022 at 2:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> [...]
> > > If the first read section's srcu_read_unlock() and its corresponding
> > > smp_mb()  happened before the flip, then the increment of old idx
> > > would happen only once. The next srcu_read_lock() will read the new
> > > index. If the srcu_read_unlock() and it's corresponding smp_mb()
> > > happened after the flip, the old_idx will be sampled again and can be
> > > incremented twice. So it depends on how the flip races with
> > > srcu_read_unlock().
> >
> > I do understand that a number of people like reasoning about
> > memory-barrier ordering, courtesy of the sequentially consistent portions
> > of the C and C++ memory models, but thinking in terms of the accesses
> > surrounding the memory barriers has been far less error-prone.
>
> Sure, but we are already talking in terms of the access to idx right?
> That's what we're saying is visible by memory barriers and we are
> trying to reason here about the ordering (flip does the write to idx
> and followed by smp_mb(), and there is corresponding read of idx on
> the srcu_read_lock() side. So we are indeed talking in terms of
> access, but let me know if I missed something.
>
> > > Also, since this is all hard to reason about I started making some
> > > diagrams, LOL. For your amusement, here is why need to scan both idx
> > > during grace period detection: https://i.imgur.com/jz4bNKd.png
> >
> > Nice!
> >
> > I suggest placing a gap between GP 2 and GP 3.  That way, you can make it
> > very clear that Reader 1's critical section starts after the end of GP 2
> > (thus clearly never blocking GP 2) and before GP 3 (thus possibly having
> > a reference to some data that is going to be freed at the end of GP 3).
> >
> > I also suggest coloring Reader 1 red and Reader 2 green, given that the
> > color red generally indicates danger.
>
> Thanks for these suggestions! I will make the update. I am planning to
> make a number of diagrams for other scenarios as well, as it helps
> visualize. Google drawing is nice for these. I am happy to share these
> with you all if there is interest :).

I made these updates, please see: https://i.imgur.com/hoKLvtt.png

Feel free to use the image for any purpose and thanks ;-)

 - Joel
  
Paul E. McKenney Dec. 15, 2022, 9:39 p.m. UTC | #20
On Thu, Dec 15, 2022 at 03:33:39PM -0500, Joel Fernandes wrote:
> On Thu, Dec 15, 2022 at 3:03 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi Paul,
> >
> > On Thu, Dec 15, 2022 at 2:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > [...]
> > > > If the first read section's srcu_read_unlock() and its corresponding
> > > > smp_mb()  happened before the flip, then the increment of old idx
> > > > would happen only once. The next srcu_read_lock() will read the new
> > > > index. If the srcu_read_unlock() and it's corresponding smp_mb()
> > > > happened after the flip, the old_idx will be sampled again and can be
> > > > incremented twice. So it depends on how the flip races with
> > > > srcu_read_unlock().
> > >
> > > I do understand that a number of people like reasoning about
> > > memory-barrier ordering, courtesy of the sequentially consistent portions
> > > of the C and C++ memory models, but thinking in terms of the accesses
> > > surrounding the memory barriers has been far less error-prone.
> >
> > Sure, but we are already talking in terms of the access to idx right?
> > That's what we're saying is visible by memory barriers and we are
> > trying to reason here about the ordering (flip does the write to idx
> > and followed by smp_mb(), and there is corresponding read of idx on
> > the srcu_read_lock() side. So we are indeed talking in terms of
> > access, but let me know if I missed something.
> >
> > > > Also, since this is all hard to reason about I started making some
> > > > diagrams, LOL. For your amusement, here is why need to scan both idx
> > > > during grace period detection: https://i.imgur.com/jz4bNKd.png
> > >
> > > Nice!
> > >
> > > I suggest placing a gap between GP 2 and GP 3.  That way, you can make it
> > > very clear that Reader 1's critical section starts after the end of GP 2
> > > (thus clearly never blocking GP 2) and before GP 3 (thus possibly having
> > > a reference to some data that is going to be freed at the end of GP 3).
> > >
> > > I also suggest coloring Reader 1 red and Reader 2 green, given that the
> > > color red generally indicates danger.
> >
> > Thanks for these suggestions! I will make the update. I am planning to
> > make a number of diagrams for other scenarios as well, as it helps
> > visualize. Google drawing is nice for these. I am happy to share these
> > with you all if there is interest :).
> 
> I made these updates, please see: https://i.imgur.com/hoKLvtt.png
> 
> Feel free to use the image for any purpose and thanks ;-)

Very good, thank you!

Would it be possible to have an arrow marked "X" or "reference to X"
from the beginning of the 'Mark "x" for GC' box to the box labeled
'Enter RSCS (access "X")'?

							Thanx, Paul
  
Joel Fernandes Dec. 15, 2022, 9:42 p.m. UTC | #21
> On Dec 15, 2022, at 4:39 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Thu, Dec 15, 2022 at 03:33:39PM -0500, Joel Fernandes wrote:
>>> On Thu, Dec 15, 2022 at 3:03 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>> 
>>> Hi Paul,
>>> 
>>>> On Thu, Dec 15, 2022 at 2:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> [...]
>>>>> If the first read section's srcu_read_unlock() and its corresponding
>>>>> smp_mb()  happened before the flip, then the increment of old idx
>>>>> would happen only once. The next srcu_read_lock() will read the new
>>>>> index. If the srcu_read_unlock() and it's corresponding smp_mb()
>>>>> happened after the flip, the old_idx will be sampled again and can be
>>>>> incremented twice. So it depends on how the flip races with
>>>>> srcu_read_unlock().
>>>> 
>>>> I do understand that a number of people like reasoning about
>>>> memory-barrier ordering, courtesy of the sequentially consistent portions
>>>> of the C and C++ memory models, but thinking in terms of the accesses
>>>> surrounding the memory barriers has been far less error-prone.
>>> 
>>> Sure, but we are already talking in terms of the access to idx right?
>>> That's what we're saying is visible by memory barriers and we are
>>> trying to reason here about the ordering (flip does the write to idx
>>> and followed by smp_mb(), and there is corresponding read of idx on
>>> the srcu_read_lock() side. So we are indeed talking in terms of
>>> access, but let me know if I missed something.
>>> 
>>>>> Also, since this is all hard to reason about I started making some
>>>>> diagrams, LOL. For your amusement, here is why need to scan both idx
>>>>> during grace period detection: https://i.imgur.com/jz4bNKd.png
>>>> 
>>>> Nice!
>>>> 
>>>> I suggest placing a gap between GP 2 and GP 3.  That way, you can make it
>>>> very clear that Reader 1's critical section starts after the end of GP 2
>>>> (thus clearly never blocking GP 2) and before GP 3 (thus possibly having
>>>> a reference to some data that is going to be freed at the end of GP 3).
>>>> 
>>>> I also suggest coloring Reader 1 red and Reader 2 green, given that the
>>>> color red generally indicates danger.
>>> 
>>> Thanks for these suggestions! I will make the update. I am planning to
>>> make a number of diagrams for other scenarios as well, as it helps
>>> visualize. Google drawing is nice for these. I am happy to share these
>>> with you all if there is interest :).
>> 
>> I made these updates, please see: https://i.imgur.com/hoKLvtt.png
>> 
>> Feel free to use the image for any purpose and thanks ;-)
> 
> Very good, thank you!
> 
> Would it be possible to have an arrow marked "X" or "reference to X"
> from the beginning of the 'Mark "x" for GC' box to the box labeled
> 'Enter RSCS (access "X")'?

I am currently away from desk. I shared the google drawing with you. Could you check and make the change, if that’s ok with you?

Thank you so much,

 - Joel

> 
>                            Thanx, Paul
  
Paul E. McKenney Dec. 15, 2022, 10:10 p.m. UTC | #22
On Thu, Dec 15, 2022 at 04:42:15PM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 15, 2022, at 4:39 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Thu, Dec 15, 2022 at 03:33:39PM -0500, Joel Fernandes wrote:
> >>> On Thu, Dec 15, 2022 at 3:03 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>> 
> >>> Hi Paul,
> >>> 
> >>>> On Thu, Dec 15, 2022 at 2:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> [...]
> >>>>> If the first read section's srcu_read_unlock() and its corresponding
> >>>>> smp_mb()  happened before the flip, then the increment of old idx
> >>>>> would happen only once. The next srcu_read_lock() will read the new
> >>>>> index. If the srcu_read_unlock() and it's corresponding smp_mb()
> >>>>> happened after the flip, the old_idx will be sampled again and can be
> >>>>> incremented twice. So it depends on how the flip races with
> >>>>> srcu_read_unlock().
> >>>> 
> >>>> I do understand that a number of people like reasoning about
> >>>> memory-barrier ordering, courtesy of the sequentially consistent portions
> >>>> of the C and C++ memory models, but thinking in terms of the accesses
> >>>> surrounding the memory barriers has been far less error-prone.
> >>> 
> >>> Sure, but we are already talking in terms of the access to idx right?
> >>> That's what we're saying is visible by memory barriers and we are
> >>> trying to reason here about the ordering (flip does the write to idx
> >>> and followed by smp_mb(), and there is corresponding read of idx on
> >>> the srcu_read_lock() side. So we are indeed talking in terms of
> >>> access, but let me know if I missed something.
> >>> 
> >>>>> Also, since this is all hard to reason about I started making some
> >>>>> diagrams, LOL. For your amusement, here is why need to scan both idx
> >>>>> during grace period detection: https://i.imgur.com/jz4bNKd.png
> >>>> 
> >>>> Nice!
> >>>> 
> >>>> I suggest placing a gap between GP 2 and GP 3.  That way, you can make it
> >>>> very clear that Reader 1's critical section starts after the end of GP 2
> >>>> (thus clearly never blocking GP 2) and before GP 3 (thus possibly having
> >>>> a reference to some data that is going to be freed at the end of GP 3).
> >>>> 
> >>>> I also suggest coloring Reader 1 red and Reader 2 green, given that the
> >>>> color red generally indicates danger.
> >>> 
> >>> Thanks for these suggestions! I will make the update. I am planning to
> >>> make a number of diagrams for other scenarios as well, as it helps
> >>> visualize. Google drawing is nice for these. I am happy to share these
> >>> with you all if there is interest :).
> >> 
> >> I made these updates, please see: https://i.imgur.com/hoKLvtt.png
> >> 
> >> Feel free to use the image for any purpose and thanks ;-)
> > 
> > Very good, thank you!
> > 
> > Would it be possible to have an arrow marked "X" or "reference to X"
> > from the beginning of the 'Mark "x" for GC' box to the box labeled
> > 'Enter RSCS (access "X")'?
> 
> I am currently away from desk. I shared the google drawing with you. Could you check and make the change, if that’s ok with you?
> 
> Thank you so much,

I took a cut at it.  Thoughts?

							Thanx, Paul
  
Joel Fernandes Dec. 15, 2022, 10:13 p.m. UTC | #23
> On Dec 15, 2022, at 3:13 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Thu, Dec 15, 2022 at 05:58:14PM +0000, Joel Fernandes wrote:
>>> On Thu, Dec 15, 2022 at 5:48 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>> 
>>>> On Thu, Dec 15, 2022 at 5:08 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>>>> Scenario for the reader to increment the old idx once:
>>>>> 
>>>>> _ Assume ssp->srcu_idx is initially 0.
>>>>> _ The READER reads idx that is 0
>>>>> _ The updater runs and flips the idx that is now 1
>>>>> _ The reader resumes with 0 as an index but on the next srcu_read_lock()
>>>>>  it will see the new idx which is 1
>>>>> 
>>>>> What could be the scenario for it to increment the old idx twice?
>>>> 
>>>> Unless I am missing something, the reader must reference the
>>>> srcu_unlock_count[old_idx] and then do smp_mb() before it will be
>>>> absolutely guaranteed of seeing the new value of ->srcu_idx.
>>> 
>>> I think both of you are right depending on how the flip raced with the
>>> first reader's unlock in that specific task.
>>> 
>>> If the first read section's srcu_read_unlock() and its corresponding
>>> smp_mb()  happened before the flip, then the increment of old idx
>>> would happen only once. The next srcu_read_lock() will read the new
>>> index. If the srcu_read_unlock() and it's corresponding smp_mb()
>>> happened after the flip, the old_idx will be sampled again and can be
>>> incremented twice. So it depends on how the flip races with
>>> srcu_read_unlock().
>> 
>> I am sorry this is inverted, but my statement's gist stands I believe:
>> 
>> 1. Flip+smp_mb() happened before unlock's smp_mb() -- reader will not
>> increment old_idx the second time.
> 
> By "increment old_idx" you mean "increment ->srcu_lock_count[old_idx]",
> correct?

Yes sorry for confusing, i indeed meant lock count increment corresponding to the old index.
> 
> Again, the important ordering isn't the smp_mb(), but the accesses,
> in this case, the accesses to ->srcu_unlock_count[idx].

I was talking about ordering of the flip of index (write) with respect to both the reading of the old index  in the rcu_read_lock() and its subsequent lock count increment corresponding to that index. I believe we are talking her about how this race can effect the wrap around issues when scanning for readers in the pre flip index, and we concluded that there can be at most 2 of these on the SAME task. The third time, reader will always see the new flipped index because of the memory barriers on both sides. IOW, the same task cannot overflow the lock counter on the preflipped index and cause issues. However there can be Nt different tasks so perhaps you can have 2*Nt number of preempted readers that had sampled the old index and now will do a lock and unlock on that old index, potentially causing a lock==unlock match when there should not be a match.

> 
>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
>> has no new smp_mb() that happens AFTER the flip happened. So it can
>> totally sample the old idx again -- that particular reader will
>> increment twice, but the next time, it will see the flipped one.
> 
> I will let you transliterate both.  ;-)

I think I see what you mean now :)

I believe the access I am referring to is the read of idx on one side and the write to idx on the other. However that is incomplete and I need to pair that with some of other access on both sides.

So perhaps this:

Writer does flip + smp_mb + read unlock counts [1]

Reader does:
 read idx + smp_mb() + increment lock counts [2]

And subsequently reader does
Smp_mb() + increment unlock count. [3]

So [1] races with either [2] or [2]+[3].

Is that fair?

>> Did I get that right? Thanks.
> 
> So why am I unhappy with orderings of smp_mb()?
> 
> To see this, let's take the usual store-buffering litmus test:
> 
>    CPU 0            CPU 1
>    WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
>    smp_mb();        smp_mb();
>    r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> 
> Suppose CPU 0's smp_mb() happens before that of CPU 1:
> 
>    CPU 0            CPU 1
>    WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
>    smp_mb();
>                smp_mb();
>    r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> 
> We get r0 == r1 == 1.
> 
> Compare this to CPU 1's smp_mb() happening before that of CPU 0:
> 
>    CPU 0            CPU 1
>    WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
>                smp_mb();
>    smp_mb();
>    r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> 
> We still get r0 == r1 == 1.  Reversing the order of the two smp_mb()
> calls changed nothing.
> 
> But, if we order CPU 1's write to follow CPU 0's read, then we have
> this:
> 
>    CPU 0            CPU 1
>    WRITE_ONCE(x, 1);
>    smp_mb();
>    r0 = READ_ONCE(y);
>                WRITE_ONCE(y, 1);
>                smp_mb();
>                r1 = READ_ONCE(x);
> 
> Here, given that r0 had the final value of zero, we know that
> r1 must have a final value of 1.
> 
> And suppose we reverse this:
> 
>    CPU 0            CPU 1
>                WRITE_ONCE(y, 1);
>                smp_mb();
>                r1 = READ_ONCE(x);
>    WRITE_ONCE(x, 1);
>    smp_mb();
>    r0 = READ_ONCE(y);
> 
> Now there is a software-visible difference in behavior.  The value of
> r0 is now 1 instead of zero and the value of r1 is now 0 instead of 1.
> 
> Does this make sense?

Yes I see what you mean. In first case, smp_mb() ordering didn’t matter. But in the second case it does.

Thanks,

 - Joel


> 
>                            Thanx, Paul
  
Joel Fernandes Dec. 15, 2022, 10:16 p.m. UTC | #24
> On Dec 15, 2022, at 5:10 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Thu, Dec 15, 2022 at 04:42:15PM -0500, Joel Fernandes wrote:
>> 
>> 
>>>> On Dec 15, 2022, at 4:39 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Thu, Dec 15, 2022 at 03:33:39PM -0500, Joel Fernandes wrote:
>>>>> On Thu, Dec 15, 2022 at 3:03 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>> 
>>>>> Hi Paul,
>>>>> 
>>>>>> On Thu, Dec 15, 2022 at 2:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>>>>> [...]
>>>>>>> If the first read section's srcu_read_unlock() and its corresponding
>>>>>>> smp_mb()  happened before the flip, then the increment of old idx
>>>>>>> would happen only once. The next srcu_read_lock() will read the new
>>>>>>> index. If the srcu_read_unlock() and it's corresponding smp_mb()
>>>>>>> happened after the flip, the old_idx will be sampled again and can be
>>>>>>> incremented twice. So it depends on how the flip races with
>>>>>>> srcu_read_unlock().
>>>>>> 
>>>>>> I do understand that a number of people like reasoning about
>>>>>> memory-barrier ordering, courtesy of the sequentially consistent portions
>>>>>> of the C and C++ memory models, but thinking in terms of the accesses
>>>>>> surrounding the memory barriers has been far less error-prone.
>>>>> 
>>>>> Sure, but we are already talking in terms of the access to idx right?
>>>>> That's what we're saying is visible by memory barriers and we are
>>>>> trying to reason here about the ordering (flip does the write to idx
>>>>> and followed by smp_mb(), and there is corresponding read of idx on
>>>>> the srcu_read_lock() side. So we are indeed talking in terms of
>>>>> access, but let me know if I missed something.
>>>>> 
>>>>>>> Also, since this is all hard to reason about I started making some
>>>>>>> diagrams, LOL. For your amusement, here is why need to scan both idx
>>>>>>> during grace period detection: https://i.imgur.com/jz4bNKd.png
>>>>>> 
>>>>>> Nice!
>>>>>> 
>>>>>> I suggest placing a gap between GP 2 and GP 3.  That way, you can make it
>>>>>> very clear that Reader 1's critical section starts after the end of GP 2
>>>>>> (thus clearly never blocking GP 2) and before GP 3 (thus possibly having
>>>>>> a reference to some data that is going to be freed at the end of GP 3).
>>>>>> 
>>>>>> I also suggest coloring Reader 1 red and Reader 2 green, given that the
>>>>>> color red generally indicates danger.
>>>>> 
>>>>> Thanks for these suggestions! I will make the update. I am planning to
>>>>> make a number of diagrams for other scenarios as well, as it helps
>>>>> visualize. Google drawing is nice for these. I am happy to share these
>>>>> with you all if there is interest :).
>>>> 
>>>> I made these updates, please see: https://i.imgur.com/hoKLvtt.png
>>>> 
>>>> Feel free to use the image for any purpose and thanks ;-)
>>> 
>>> Very good, thank you!
>>> 
>>> Would it be possible to have an arrow marked "X" or "reference to X"
>>> from the beginning of the 'Mark "x" for GC' box to the box labeled
>>> 'Enter RSCS (access "X")'?
>> 
>> I am currently away from desk. I shared the google drawing with you. Could you check and make the change, if that’s ok with you?
>> 
>> Thank you so much,
> 
> I took a cut at it.  Thoughts?

Yes perfect now :) and handy future reference! Thanks!

- Joel

> 
>                            Thanx, Paul
  
Paul E. McKenney Dec. 16, 2022, 1:09 a.m. UTC | #25
On Thu, Dec 15, 2022 at 05:13:47PM -0500, Joel Fernandes wrote:
> > On Dec 15, 2022, at 3:13 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Thu, Dec 15, 2022 at 05:58:14PM +0000, Joel Fernandes wrote:
> >>> On Thu, Dec 15, 2022 at 5:48 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>> 
> >>>> On Thu, Dec 15, 2022 at 5:08 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> 
> >>>>> Scenario for the reader to increment the old idx once:
> >>>>> 
> >>>>> _ Assume ssp->srcu_idx is initially 0.
> >>>>> _ The READER reads idx that is 0
> >>>>> _ The updater runs and flips the idx that is now 1
> >>>>> _ The reader resumes with 0 as an index but on the next srcu_read_lock()
> >>>>>  it will see the new idx which is 1
> >>>>> 
> >>>>> What could be the scenario for it to increment the old idx twice?
> >>>> 
> >>>> Unless I am missing something, the reader must reference the
> >>>> srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> >>>> absolutely guaranteed of seeing the new value of ->srcu_idx.
> >>> 
> >>> I think both of you are right depending on how the flip raced with the
> >>> first reader's unlock in that specific task.
> >>> 
> >>> If the first read section's srcu_read_unlock() and its corresponding
> >>> smp_mb()  happened before the flip, then the increment of old idx
> >>> would happen only once. The next srcu_read_lock() will read the new
> >>> index. If the srcu_read_unlock() and it's corresponding smp_mb()
> >>> happened after the flip, the old_idx will be sampled again and can be
> >>> incremented twice. So it depends on how the flip races with
> >>> srcu_read_unlock().
> >> 
> >> I am sorry this is inverted, but my statement's gist stands I believe:
> >> 
> >> 1. Flip+smp_mb() happened before unlock's smp_mb() -- reader will not
> >> increment old_idx the second time.
> > 
> > By "increment old_idx" you mean "increment ->srcu_lock_count[old_idx]",
> > correct?
> 
> Yes sorry for confusing, i indeed meant lock count increment corresponding to the old index.

I guessed correctly!!!  Don't worry, it won't happen again.  ;-)

> > Again, the important ordering isn't the smp_mb(), but the accesses,
> > in this case, the accesses to ->srcu_unlock_count[idx].
> 
> I was talking about ordering of the flip of index (write) with respect
> to both the reading of the old index  in the rcu_read_lock() and its
> subsequent lock count increment corresponding to that index. I believe
> we are talking her about how this race can effect the wrap around issues
> when scanning for readers in the pre flip index, and we concluded that
> there can be at most 2 of these on the SAME task.

Agreed.

>                                                   The third time, reader
> will always see the new flipped index because of the memory barriers on
> both sides. IOW, the same task cannot overflow the lock counter on the
> preflipped index and cause issues. However there can be Nt different
> tasks so perhaps you can have 2*Nt number of preempted readers that had
> sampled the old index and now will do a lock and unlock on that old index,
> potentially causing a lock==unlock match when there should not be a match.

So each task can do one old-index ->srcu_lock_count[] increment, and Nc of
them can do a second one, where Nc is the number of CPUs.  This is because
a given task's smp_mb() applies to all later code executed by that task
and also to code executed by other tasks running later on that same CPU.

> >> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> >> has no new smp_mb() that happens AFTER the flip happened. So it can
> >> totally sample the old idx again -- that particular reader will
> >> increment twice, but the next time, it will see the flipped one.
> > 
> > I will let you transliterate both.  ;-)
> 
> I think I see what you mean now :)
> 
> I believe the access I am referring to is the read of idx on one side and the write to idx on the other. However that is incomplete and I need to pair that with some of other access on both sides.
> 
> So perhaps this:
> 
> Writer does flip + smp_mb + read unlock counts [1]
> 
> Reader does:
>  read idx + smp_mb() + increment lock counts [2]
> 
> And subsequently reader does
> Smp_mb() + increment unlock count. [3]
> 
> So [1] races with either [2] or [2]+[3].
> 
> Is that fair?

That does look much better, thank you!

> >> Did I get that right? Thanks.
> > 
> > So why am I unhappy with orderings of smp_mb()?
> > 
> > To see this, let's take the usual store-buffering litmus test:
> > 
> >    CPU 0            CPU 1
> >    WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
> >    smp_mb();        smp_mb();
> >    r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> > 
> > Suppose CPU 0's smp_mb() happens before that of CPU 1:
> > 
> >    CPU 0            CPU 1
> >    WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
> >    smp_mb();
> >                smp_mb();
> >    r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> > 
> > We get r0 == r1 == 1.
> > 
> > Compare this to CPU 1's smp_mb() happening before that of CPU 0:
> > 
> >    CPU 0            CPU 1
> >    WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
> >                smp_mb();
> >    smp_mb();
> >    r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> > 
> > We still get r0 == r1 == 1.  Reversing the order of the two smp_mb()
> > calls changed nothing.
> > 
> > But, if we order CPU 1's write to follow CPU 0's read, then we have
> > this:
> > 
> >    CPU 0            CPU 1
> >    WRITE_ONCE(x, 1);
> >    smp_mb();
> >    r0 = READ_ONCE(y);
> >                WRITE_ONCE(y, 1);
> >                smp_mb();
> >                r1 = READ_ONCE(x);
> > 
> > Here, given that r0 had the final value of zero, we know that
> > r1 must have a final value of 1.
> > 
> > And suppose we reverse this:
> > 
> >    CPU 0            CPU 1
> >                WRITE_ONCE(y, 1);
> >                smp_mb();
> >                r1 = READ_ONCE(x);
> >    WRITE_ONCE(x, 1);
> >    smp_mb();
> >    r0 = READ_ONCE(y);
> > 
> > Now there is a software-visible difference in behavior.  The value of
> > r0 is now 1 instead of zero and the value of r1 is now 0 instead of 1.
> > 
> > Does this make sense?
> 
> Yes I see what you mean. In first case, smp_mb() ordering didn’t matter. But in the second case it does.

Yes, there have to be accesses for the software to even see the effects
of a given smp_mb().

							Thanx, Paul
  
Paul E. McKenney Dec. 16, 2022, 1:13 a.m. UTC | #26
On Thu, Dec 15, 2022 at 05:22:21PM -0500, Joel Fernandes wrote:
> > On Dec 15, 2022, at 5:13 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> >> On Dec 15, 2022, at 3:13 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> >> 
> >> On Thu, Dec 15, 2022 at 05:58:14PM +0000, Joel Fernandes wrote:
> >>>> On Thu, Dec 15, 2022 at 5:48 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>> 
> >>>>>> On Thu, Dec 15, 2022 at 5:08 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>>>> 
> >>>>>>> Scenario for the reader to increment the old idx once:
> >>>>>>> 
> >>>>>>> _ Assume ssp->srcu_idx is initially 0.
> >>>>>>> _ The READER reads idx that is 0
> >>>>>>> _ The updater runs and flips the idx that is now 1
> >>>>>>> _ The reader resumes with 0 as an index but on the next srcu_read_lock()
> >>>>>>> it will see the new idx which is 1
> >>>>>>> 
> >>>>>>> What could be the scenario for it to increment the old idx twice?
> >>>>>> 
> >>>>>> Unless I am missing something, the reader must reference the
> >>>>>> srcu_unlock_count[old_idx] and then do smp_mb() before it will be
> >>>>>> absolutely guaranteed of seeing the new value of ->srcu_idx.
> >>>>> 
> >>>>> I think both of you are right depending on how the flip raced with the
> >>>>> first reader's unlock in that specific task.
> >>>>> 
> >>>>> If the first read section's srcu_read_unlock() and its corresponding
> >>>>> smp_mb()  happened before the flip, then the increment of old idx
> >>>>> would happen only once. The next srcu_read_lock() will read the new
> >>>>> index. If the srcu_read_unlock() and it's corresponding smp_mb()
> >>>>> happened after the flip, the old_idx will be sampled again and can be
> >>>>> incremented twice. So it depends on how the flip races with
> >>>>> srcu_read_unlock().
> >>> 
> >>> I am sorry this is inverted, but my statement's gist stands I believe:
> >>> 
> >>> 1. Flip+smp_mb() happened before unlock's smp_mb() -- reader will not
> >>> increment old_idx the second time.
> >> 
> >> By "increment old_idx" you mean "increment ->srcu_lock_count[old_idx]",
> >> correct?
> > 
> > Yes sorry for confusing, i indeed meant lock count increment corresponding to the old index.
> >> 
> >> Again, the important ordering isn't the smp_mb(), but the accesses,
> >> in this case, the accesses to ->srcu_unlock_count[idx].
> > 
> > I was talking about ordering of the flip of index (write) with respect to both the reading of the old index  in the rcu_read_lock() and its subsequent lock count increment corresponding to that index. I believe we are talking her about how this race can effect the wrap around issues when scanning for readers in the pre flip index, and we concluded that there can be at most 2 of these on the SAME task. The third time, reader will always see the new flipped index because of the memory barriers on both sides. IOW, the same task cannot overflow the lock counter on the preflipped index and cause issues. However there can be Nt different tasks so perhaps you can have 2*Nt number of preempted
> 
> Sorry, to be more precise, I mean you have Nt preempted readers, which owing to memory barriers, if you have at least Nt CPUs, and they each ran on those CPUs, then you can have 2*Nt increments on the lock count at the old index. 
> 
> Or something.

Agreed!

And if there are more tasks than CPUs, the maximum number of increments
is Nt+Nc.

								Thanx, Paul

> Thanks.
> 
> 
> 
> 
> > readers that had sampled the old index and now will do a lock and unlock on that old index, potentially causing a lock==unlock match when there should not be a match.
> > 
> >> 
> >>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> >>> has no new smp_mb() that happens AFTER the flip happened. So it can
> >>> totally sample the old idx again -- that particular reader will
> >>> increment twice, but the next time, it will see the flipped one.
> >> 
> >> I will let you transliterate both.  ;-)
> > 
> > I think I see what you mean now :)
> > 
> > I believe the access I am referring to is the read of idx on one side and the write to idx on the other. However that is incomplete and I need to pair that with some of other access on both sides.
> > 
> > So perhaps this:
> > 
> > Writer does flip + smp_mb + read unlock counts [1]
> > 
> > Reader does:
> > read idx + smp_mb() + increment lock counts [2]
> > 
> > And subsequently reader does
> > Smp_mb() + increment unlock count. [3]
> > 
> > So [1] races with either [2] or [2]+[3].
> > 
> > Is that fair?
> > 
> >>> Did I get that right? Thanks.
> >> 
> >> So why am I unhappy with orderings of smp_mb()?
> >> 
> >> To see this, let's take the usual store-buffering litmus test:
> >> 
> >>   CPU 0            CPU 1
> >>   WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
> >>   smp_mb();        smp_mb();
> >>   r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> >> 
> >> Suppose CPU 0's smp_mb() happens before that of CPU 1:
> >> 
> >>   CPU 0            CPU 1
> >>   WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
> >>   smp_mb();
> >>               smp_mb();
> >>   r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> >> 
> >> We get r0 == r1 == 1.
> >> 
> >> Compare this to CPU 1's smp_mb() happening before that of CPU 0:
> >> 
> >>   CPU 0            CPU 1
> >>   WRITE_ONCE(x, 1);    WRITE_ONCE(y, 1);
> >>               smp_mb();
> >>   smp_mb();
> >>   r0 = READ_ONCE(y);    r1 = READ_ONCE(x);
> >> 
> >> We still get r0 == r1 == 1.  Reversing the order of the two smp_mb()
> >> calls changed nothing.
> >> 
> >> But, if we order CPU 1's write to follow CPU 0's read, then we have
> >> this:
> >> 
> >>   CPU 0            CPU 1
> >>   WRITE_ONCE(x, 1);
> >>   smp_mb();
> >>   r0 = READ_ONCE(y);
> >>               WRITE_ONCE(y, 1);
> >>               smp_mb();
> >>               r1 = READ_ONCE(x);
> >> 
> >> Here, given that r0 had the final value of zero, we know that
> >> r1 must have a final value of 1.
> >> 
> >> And suppose we reverse this:
> >> 
> >>   CPU 0            CPU 1
> >>               WRITE_ONCE(y, 1);
> >>               smp_mb();
> >>               r1 = READ_ONCE(x);
> >>   WRITE_ONCE(x, 1);
> >>   smp_mb();
> >>   r0 = READ_ONCE(y);
> >> 
> >> Now there is a software-visible difference in behavior.  The value of
> >> r0 is now 1 instead of zero and the value of r1 is now 0 instead of 1.
> >> 
> >> Does this make sense?
> > 
> > Yes I see what you mean. In first case, smp_mb() ordering didn’t matter. But in the second case it does.
> > 
> > Thanks,
> > 
> > - Joel
> > 
> > 
> >> 
> >>                           Thanx, Paul
  
Joel Fernandes Dec. 16, 2022, 4:32 p.m. UTC | #27
On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
[...]
> > >> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> > >> has no new smp_mb() that happens AFTER the flip happened. So it can
> > >> totally sample the old idx again -- that particular reader will
> > >> increment twice, but the next time, it will see the flipped one.
> > > 
> > > I will let you transliterate both.  ;-)
> > 
> > I think I see what you mean now :)
> > 
> > I believe the access I am referring to is the read of idx on one side and
> > the write to idx on the other. However that is incomplete and I need to
> > pair that with some of other access on both sides.
> > 
> > So perhaps this:
> > 
> > Writer does flip + smp_mb + read unlock counts [1]
> > 
> > Reader does:
> >  read idx + smp_mb() + increment lock counts [2]
> > 
> > And subsequently reader does
> > Smp_mb() + increment unlock count. [3]
> > 
> > So [1] races with either [2] or [2]+[3].
> > 
> > Is that fair?
> 
> That does look much better, thank you!

Perhaps a comment with an ASCII diagram will help?


Case 2:
Both the reader and the updater see each other's writes too late, but because
of memory barriers on both sides, they will eventually see each other's write
with respect to their own. This is similar to the store-buffer problem. This
let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.

The following diagram shows the subtle worst case followed by a simplified
store-buffer explanation.

READER                  UPDATER
-------------           ----------
                           // idx is initially 0.
read_lock() {
  READ(idx) = 0;
  lock[0]++; --------------------------------------------,
                           flip() {                      |               
                              smp_mb();                  |
  smp_mb();                                              |
}                                                        |
                                                         |
// RSCS                                                  |
                                                         |
read_unlock() {                                          |
  smp_mb();                                              |
                              idx++;  // P               |
                              smp_mb();                  |
                           }                             |
                                                         |
                           scan_readers_idx(0) {         |
                               count all unlock[0];      |
                                   |                     |
                                   |                     |
  unlock[0]++; //X <--not-counted--`-----,               |
                                         |               |
}                                        V               `------,
                               // Will make sure next scan      |
                               // will not miss this unlock (X) |
                               // if other side saw flip (P) ,--`
                               // Call this MB [1]           |
                               // Order write(idx) with      |
                               // next scan's unlock.        |
                               smp_mb();                 ,---`
read_lock() {                                            |
  READ(idx)=0;                                           |
  lock[0]++; ----------------> count all lock[0];        |
  smp_mb();         |     }                              |
}     |             |                                    V
      |             `---> // Incorrect contribution to lock counting
      |                   // upto a maximum of 2 times.
      |
       `---> // Pairs with MB [1]. Makes sure that
             // the next read_lock()'s' idx read (Y) is ordered
             // with above write to unlock[0] (X).
                            |
rcu_read_unlock() {         |
  smp_mb(); <---------------`
  unlock[0]++; 
}

read_lock() {
  READ(idx) = 1; //Y
  lock[1]++;
  ...
}
                           scan_readers_idx(0) {
                               count all unlock[0]; //Q
                               ...
                          }

This makes it similar to the store buffer pattern. Using X, Y, P and Q
annotated above, we get:

READER                    UPDATER
X (write)                 P (write)

smp_mb();                 smp_mb();

Y (read)                  Q (read)


thanks,

 - Joel
  
Joel Fernandes Dec. 16, 2022, 4:39 p.m. UTC | #28
On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
> [...]
> > > >> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> > > >> has no new smp_mb() that happens AFTER the flip happened. So it can
> > > >> totally sample the old idx again -- that particular reader will
> > > >> increment twice, but the next time, it will see the flipped one.
> > > > 
> > > > I will let you transliterate both.  ;-)
> > > 
> > > I think I see what you mean now :)
> > > 
> > > I believe the access I am referring to is the read of idx on one side and
> > > the write to idx on the other. However that is incomplete and I need to
> > > pair that with some of other access on both sides.
> > > 
> > > So perhaps this:
> > > 
> > > Writer does flip + smp_mb + read unlock counts [1]
> > > 
> > > Reader does:
> > >  read idx + smp_mb() + increment lock counts [2]
> > > 
> > > And subsequently reader does
> > > Smp_mb() + increment unlock count. [3]
> > > 
> > > So [1] races with either [2] or [2]+[3].
> > > 
> > > Is that fair?
> > 
> > That does look much better, thank you!
> 
> Perhaps a comment with an ASCII diagram will help?
> 
> Case 2:

And sorry I did not mention that Case 1 for me is the more trivial one where
the reader is preempted after sampling idx, and then all of them do
lock+unlock in quick succession to induce the counter-delta wrap around.

thanks,

 - Joel


> Both the reader and the updater see each other's writes too late, but because
> of memory barriers on both sides, they will eventually see each other's write
> with respect to their own. This is similar to the store-buffer problem. This
> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
> 
> The following diagram shows the subtle worst case followed by a simplified
> store-buffer explanation.
> 
> READER                  UPDATER
> -------------           ----------
>                            // idx is initially 0.
> read_lock() {
>   READ(idx) = 0;
>   lock[0]++; --------------------------------------------,
>                            flip() {                      |               
>                               smp_mb();                  |
>   smp_mb();                                              |
> }                                                        |
>                                                          |
> // RSCS                                                  |
>                                                          |
> read_unlock() {                                          |
>   smp_mb();                                              |
>                               idx++;  // P               |
>                               smp_mb();                  |
>                            }                             |
>                                                          |
>                            scan_readers_idx(0) {         |
>                                count all unlock[0];      |
>                                    |                     |
>                                    |                     |
>   unlock[0]++; //X <--not-counted--`-----,               |
>                                          |               |
> }                                        V               `------,
>                                // Will make sure next scan      |
>                                // will not miss this unlock (X) |
>                                // if other side saw flip (P) ,--`
>                                // Call this MB [1]           |
>                                // Order write(idx) with      |
>                                // next scan's unlock.        |
>                                smp_mb();                 ,---`
> read_lock() {                                            |
>   READ(idx)=0;                                           |
>   lock[0]++; ----------------> count all lock[0];        |
>   smp_mb();         |     }                              |
> }     |             |                                    V
>       |             `---> // Incorrect contribution to lock counting
>       |                   // upto a maximum of 2 times.
>       |
>        `---> // Pairs with MB [1]. Makes sure that
>              // the next read_lock()'s' idx read (Y) is ordered
>              // with above write to unlock[0] (X).
>                             |
> rcu_read_unlock() {         |
>   smp_mb(); <---------------`
>   unlock[0]++; 
> }
> 
> read_lock() {
>   READ(idx) = 1; //Y
>   lock[1]++;
>   ...
> }
>                            scan_readers_idx(0) {
>                                count all unlock[0]; //Q
>                                ...
>                           }
> 
> This makes it similar to the store buffer pattern. Using X, Y, P and Q
> annotated above, we get:
> 
> READER                    UPDATER
> X (write)                 P (write)
> 
> smp_mb();                 smp_mb();
> 
> Y (read)                  Q (read)
> 
> 
> thanks,
> 
>  - Joel
>
  
Paul E. McKenney Dec. 16, 2022, 4:51 p.m. UTC | #29
On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
> [...]
> > > >> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> > > >> has no new smp_mb() that happens AFTER the flip happened. So it can
> > > >> totally sample the old idx again -- that particular reader will
> > > >> increment twice, but the next time, it will see the flipped one.
> > > > 
> > > > I will let you transliterate both.  ;-)
> > > 
> > > I think I see what you mean now :)
> > > 
> > > I believe the access I am referring to is the read of idx on one side and
> > > the write to idx on the other. However that is incomplete and I need to
> > > pair that with some of other access on both sides.
> > > 
> > > So perhaps this:
> > > 
> > > Writer does flip + smp_mb + read unlock counts [1]
> > > 
> > > Reader does:
> > >  read idx + smp_mb() + increment lock counts [2]
> > > 
> > > And subsequently reader does
> > > Smp_mb() + increment unlock count. [3]
> > > 
> > > So [1] races with either [2] or [2]+[3].
> > > 
> > > Is that fair?
> > 
> > That does look much better, thank you!
> 
> Perhaps a comment with an ASCII diagram will help?
> 
> 
> Case 2:
> Both the reader and the updater see each other's writes too late, but because
> of memory barriers on both sides, they will eventually see each other's write
> with respect to their own. This is similar to the store-buffer problem. This
> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
> 
> The following diagram shows the subtle worst case followed by a simplified
> store-buffer explanation.
> 
> READER                  UPDATER
> -------------           ----------
>                            // idx is initially 0.
> read_lock() {
>   READ(idx) = 0;
>   lock[0]++; --------------------------------------------,
>                            flip() {                      |               
>                               smp_mb();                  |
>   smp_mb();                                              |
> }                                                        |
>                                                          |
> // RSCS                                                  |
>                                                          |
> read_unlock() {                                          |
>   smp_mb();                                              |
>                               idx++;  // P               |
>                               smp_mb();                  |
>                            }                             |
>                                                          |
>                            scan_readers_idx(0) {         |
>                                count all unlock[0];      |
>                                    |                     |
>                                    |                     |
>   unlock[0]++; //X <--not-counted--`-----,               |
>                                          |               |
> }                                        V               `------,
>                                // Will make sure next scan      |
>                                // will not miss this unlock (X) |
>                                // if other side saw flip (P) ,--`
>                                // Call this MB [1]           |
>                                // Order write(idx) with      |
>                                // next scan's unlock.        |
>                                smp_mb();                 ,---`
> read_lock() {                                            |
>   READ(idx)=0;                                           |
>   lock[0]++; ----------------> count all lock[0];        |
>   smp_mb();         |     }                              |
> }     |             |                                    V
>       |             `---> // Incorrect contribution to lock counting
>       |                   // upto a maximum of 2 times.
>       |
>        `---> // Pairs with MB [1]. Makes sure that
>              // the next read_lock()'s' idx read (Y) is ordered
>              // with above write to unlock[0] (X).
>                             |
> rcu_read_unlock() {         |
>   smp_mb(); <---------------`
>   unlock[0]++; 
> }
> 
> read_lock() {
>   READ(idx) = 1; //Y
>   lock[1]++;
>   ...
> }
>                            scan_readers_idx(0) {
>                                count all unlock[0]; //Q
>                                ...
> 
> 
> thanks,
> 
>  - Joel
> 
>                           }
> 
> This makes it similar to the store buffer pattern. Using X, Y, P and Q
> annotated above, we get:
> 
> READER                    UPDATER
> X (write)                 P (write)
> 
> smp_mb();                 smp_mb();
> 
> Y (read)                  Q (read)

Given that this diagram is more than 50 lines long, it might go better in
a design document describing this part of RCU.  Perhaps less detail or
segmented, but the same general idea as this guy:

Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst

Thoughts?

						Thanx, Paul
  
Joel Fernandes Dec. 16, 2022, 4:54 p.m. UTC | #30
> On Dec 16, 2022, at 11:51 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
>> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
>> [...]
>>>>>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
>>>>>> has no new smp_mb() that happens AFTER the flip happened. So it can
>>>>>> totally sample the old idx again -- that particular reader will
>>>>>> increment twice, but the next time, it will see the flipped one.
>>>>> 
>>>>> I will let you transliterate both.  ;-)
>>>> 
>>>> I think I see what you mean now :)
>>>> 
>>>> I believe the access I am referring to is the read of idx on one side and
>>>> the write to idx on the other. However that is incomplete and I need to
>>>> pair that with some of other access on both sides.
>>>> 
>>>> So perhaps this:
>>>> 
>>>> Writer does flip + smp_mb + read unlock counts [1]
>>>> 
>>>> Reader does:
>>>> read idx + smp_mb() + increment lock counts [2]
>>>> 
>>>> And subsequently reader does
>>>> Smp_mb() + increment unlock count. [3]
>>>> 
>>>> So [1] races with either [2] or [2]+[3].
>>>> 
>>>> Is that fair?
>>> 
>>> That does look much better, thank you!
>> 
>> Perhaps a comment with an ASCII diagram will help?
>> 
>> 
>> Case 2:
>> Both the reader and the updater see each other's writes too late, but because
>> of memory barriers on both sides, they will eventually see each other's write
>> with respect to their own. This is similar to the store-buffer problem. This
>> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
>> 
>> The following diagram shows the subtle worst case followed by a simplified
>> store-buffer explanation.
>> 
>> READER                  UPDATER
>> -------------           ----------
>>                           // idx is initially 0.
>> read_lock() {
>>  READ(idx) = 0;
>>  lock[0]++; --------------------------------------------,
>>                           flip() {                      |               
>>                              smp_mb();                  |
>>  smp_mb();                                              |
>> }                                                        |
>>                                                         |
>> // RSCS                                                  |
>>                                                         |
>> read_unlock() {                                          |
>>  smp_mb();                                              |
>>                              idx++;  // P               |
>>                              smp_mb();                  |
>>                           }                             |
>>                                                         |
>>                           scan_readers_idx(0) {         |
>>                               count all unlock[0];      |
>>                                   |                     |
>>                                   |                     |
>>  unlock[0]++; //X <--not-counted--`-----,               |
>>                                         |               |
>> }                                        V               `------,
>>                               // Will make sure next scan      |
>>                               // will not miss this unlock (X) |
>>                               // if other side saw flip (P) ,--`
>>                               // Call this MB [1]           |
>>                               // Order write(idx) with      |
>>                               // next scan's unlock.        |
>>                               smp_mb();                 ,---`
>> read_lock() {                                            |
>>  READ(idx)=0;                                           |
>>  lock[0]++; ----------------> count all lock[0];        |
>>  smp_mb();         |     }                              |
>> }     |             |                                    V
>>      |             `---> // Incorrect contribution to lock counting
>>      |                   // upto a maximum of 2 times.
>>      |
>>       `---> // Pairs with MB [1]. Makes sure that
>>             // the next read_lock()'s' idx read (Y) is ordered
>>             // with above write to unlock[0] (X).
>>                            |
>> rcu_read_unlock() {         |
>>  smp_mb(); <---------------`
>>  unlock[0]++; 
>> }
>> 
>> read_lock() {
>>  READ(idx) = 1; //Y
>>  lock[1]++;
>>  ...
>> }
>>                           scan_readers_idx(0) {
>>                               count all unlock[0]; //Q
>>                               ...
>> 
>> 
>> thanks,
>> 
>> - Joel
>> 
>>                          }
>> 
>> This makes it similar to the store buffer pattern. Using X, Y, P and Q
>> annotated above, we get:
>> 
>> READER                    UPDATER
>> X (write)                 P (write)
>> 
>> smp_mb();                 smp_mb();
>> 
>> Y (read)                  Q (read)
> 
> Given that this diagram is more than 50 lines long, it might go better in
> a design document describing this part of RCU.  Perhaps less detail or
> segmented, but the same general idea as this guy:
> 
> Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst

Yes, this sounds like a good place to add it and perhaps we refer to it from the C source file? I can take this up to do over the holidays, if you prefer.

Thanks,

  - Joel


> 
> Thoughts?
> 
>                        Thanx, Paul
  
Paul E. McKenney Dec. 16, 2022, 5:13 p.m. UTC | #31
On Fri, Dec 16, 2022 at 11:54:19AM -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 16, 2022, at 11:51 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
> >> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
> >> [...]
> >>>>>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> >>>>>> has no new smp_mb() that happens AFTER the flip happened. So it can
> >>>>>> totally sample the old idx again -- that particular reader will
> >>>>>> increment twice, but the next time, it will see the flipped one.
> >>>>> 
> >>>>> I will let you transliterate both.  ;-)
> >>>> 
> >>>> I think I see what you mean now :)
> >>>> 
> >>>> I believe the access I am referring to is the read of idx on one side and
> >>>> the write to idx on the other. However that is incomplete and I need to
> >>>> pair that with some of other access on both sides.
> >>>> 
> >>>> So perhaps this:
> >>>> 
> >>>> Writer does flip + smp_mb + read unlock counts [1]
> >>>> 
> >>>> Reader does:
> >>>> read idx + smp_mb() + increment lock counts [2]
> >>>> 
> >>>> And subsequently reader does
> >>>> Smp_mb() + increment unlock count. [3]
> >>>> 
> >>>> So [1] races with either [2] or [2]+[3].
> >>>> 
> >>>> Is that fair?
> >>> 
> >>> That does look much better, thank you!
> >> 
> >> Perhaps a comment with an ASCII diagram will help?
> >> 
> >> 
> >> Case 2:
> >> Both the reader and the updater see each other's writes too late, but because
> >> of memory barriers on both sides, they will eventually see each other's write
> >> with respect to their own. This is similar to the store-buffer problem. This
> >> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
> >> 
> >> The following diagram shows the subtle worst case followed by a simplified
> >> store-buffer explanation.
> >> 
> >> READER                  UPDATER
> >> -------------           ----------
> >>                           // idx is initially 0.
> >> read_lock() {
> >>  READ(idx) = 0;
> >>  lock[0]++; --------------------------------------------,
> >>                           flip() {                      |               
> >>                              smp_mb();                  |
> >>  smp_mb();                                              |
> >> }                                                        |
> >>                                                         |
> >> // RSCS                                                  |
> >>                                                         |
> >> read_unlock() {                                          |
> >>  smp_mb();                                              |
> >>                              idx++;  // P               |
> >>                              smp_mb();                  |
> >>                           }                             |
> >>                                                         |
> >>                           scan_readers_idx(0) {         |
> >>                               count all unlock[0];      |
> >>                                   |                     |
> >>                                   |                     |
> >>  unlock[0]++; //X <--not-counted--`-----,               |
> >>                                         |               |
> >> }                                        V               `------,
> >>                               // Will make sure next scan      |
> >>                               // will not miss this unlock (X) |
> >>                               // if other side saw flip (P) ,--`
> >>                               // Call this MB [1]           |
> >>                               // Order write(idx) with      |
> >>                               // next scan's unlock.        |
> >>                               smp_mb();                 ,---`
> >> read_lock() {                                            |
> >>  READ(idx)=0;                                           |
> >>  lock[0]++; ----------------> count all lock[0];        |
> >>  smp_mb();         |     }                              |
> >> }     |             |                                    V
> >>      |             `---> // Incorrect contribution to lock counting
> >>      |                   // upto a maximum of 2 times.
> >>      |
> >>       `---> // Pairs with MB [1]. Makes sure that
> >>             // the next read_lock()'s' idx read (Y) is ordered
> >>             // with above write to unlock[0] (X).
> >>                            |
> >> rcu_read_unlock() {         |
> >>  smp_mb(); <---------------`
> >>  unlock[0]++; 
> >> }
> >> 
> >> read_lock() {
> >>  READ(idx) = 1; //Y
> >>  lock[1]++;
> >>  ...
> >> }
> >>                           scan_readers_idx(0) {
> >>                               count all unlock[0]; //Q
> >>                               ...
> >> 
> >> 
> >> thanks,
> >> 
> >> - Joel
> >> 
> >>                          }
> >> 
> >> This makes it similar to the store buffer pattern. Using X, Y, P and Q
> >> annotated above, we get:
> >> 
> >> READER                    UPDATER
> >> X (write)                 P (write)
> >> 
> >> smp_mb();                 smp_mb();
> >> 
> >> Y (read)                  Q (read)
> > 
> > Given that this diagram is more than 50 lines long, it might go better in
> > a design document describing this part of RCU.  Perhaps less detail or
> > segmented, but the same general idea as this guy:
> > 
> > Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
> 
> Yes, this sounds like a good place to add it and perhaps we refer to
> it from the C source file? I can take this up to do over the holidays,
> if you prefer.

Indeed, that comment is quite large already, arguably obscuring the code!
It would be good to offload some of it.

							Thanx, Paul
  
Joel Fernandes Dec. 17, 2022, 3:19 a.m. UTC | #32
Hi,
On the related subject of this function, I drew a diagram for one of
the reasons why per-CPU unlock counts have to be scanned first, for a
particular index, before the per-CPU lock counts, and not the other
way. Otherwise, a reader that got preempted after reading the index,
can suddenly get scheduled during the inactive index's scan, and cause
the total lock and unlock counts to falsely match:
https://i.imgur.com/79fDWdQ.png

Cheers,

 - Joel



On Fri, Dec 16, 2022 at 11:54 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> > On Dec 16, 2022, at 11:51 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
> >> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
> >> [...]
> >>>>>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> >>>>>> has no new smp_mb() that happens AFTER the flip happened. So it can
> >>>>>> totally sample the old idx again -- that particular reader will
> >>>>>> increment twice, but the next time, it will see the flipped one.
> >>>>>
> >>>>> I will let you transliterate both.  ;-)
> >>>>
> >>>> I think I see what you mean now :)
> >>>>
> >>>> I believe the access I am referring to is the read of idx on one side and
> >>>> the write to idx on the other. However that is incomplete and I need to
> >>>> pair that with some of other access on both sides.
> >>>>
> >>>> So perhaps this:
> >>>>
> >>>> Writer does flip + smp_mb + read unlock counts [1]
> >>>>
> >>>> Reader does:
> >>>> read idx + smp_mb() + increment lock counts [2]
> >>>>
> >>>> And subsequently reader does
> >>>> Smp_mb() + increment unlock count. [3]
> >>>>
> >>>> So [1] races with either [2] or [2]+[3].
> >>>>
> >>>> Is that fair?
> >>>
> >>> That does look much better, thank you!
> >>
> >> Perhaps a comment with an ASCII diagram will help?
> >>
> >>
> >> Case 2:
> >> Both the reader and the updater see each other's writes too late, but because
> >> of memory barriers on both sides, they will eventually see each other's write
> >> with respect to their own. This is similar to the store-buffer problem. This
> >> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
> >>
> >> The following diagram shows the subtle worst case followed by a simplified
> >> store-buffer explanation.
> >>
> >> READER                  UPDATER
> >> -------------           ----------
> >>                           // idx is initially 0.
> >> read_lock() {
> >>  READ(idx) = 0;
> >>  lock[0]++; --------------------------------------------,
> >>                           flip() {                      |
> >>                              smp_mb();                  |
> >>  smp_mb();                                              |
> >> }                                                        |
> >>                                                         |
> >> // RSCS                                                  |
> >>                                                         |
> >> read_unlock() {                                          |
> >>  smp_mb();                                              |
> >>                              idx++;  // P               |
> >>                              smp_mb();                  |
> >>                           }                             |
> >>                                                         |
> >>                           scan_readers_idx(0) {         |
> >>                               count all unlock[0];      |
> >>                                   |                     |
> >>                                   |                     |
> >>  unlock[0]++; //X <--not-counted--`-----,               |
> >>                                         |               |
> >> }                                        V               `------,
> >>                               // Will make sure next scan      |
> >>                               // will not miss this unlock (X) |
> >>                               // if other side saw flip (P) ,--`
> >>                               // Call this MB [1]           |
> >>                               // Order write(idx) with      |
> >>                               // next scan's unlock.        |
> >>                               smp_mb();                 ,---`
> >> read_lock() {                                            |
> >>  READ(idx)=0;                                           |
> >>  lock[0]++; ----------------> count all lock[0];        |
> >>  smp_mb();         |     }                              |
> >> }     |             |                                    V
> >>      |             `---> // Incorrect contribution to lock counting
> >>      |                   // upto a maximum of 2 times.
> >>      |
> >>       `---> // Pairs with MB [1]. Makes sure that
> >>             // the next read_lock()'s' idx read (Y) is ordered
> >>             // with above write to unlock[0] (X).
> >>                            |
> >> rcu_read_unlock() {         |
> >>  smp_mb(); <---------------`
> >>  unlock[0]++;
> >> }
> >>
> >> read_lock() {
> >>  READ(idx) = 1; //Y
> >>  lock[1]++;
> >>  ...
> >> }
> >>                           scan_readers_idx(0) {
> >>                               count all unlock[0]; //Q
> >>                               ...
> >>
> >>
> >> thanks,
> >>
> >> - Joel
> >>
> >>                          }
> >>
> >> This makes it similar to the store buffer pattern. Using X, Y, P and Q
> >> annotated above, we get:
> >>
> >> READER                    UPDATER
> >> X (write)                 P (write)
> >>
> >> smp_mb();                 smp_mb();
> >>
> >> Y (read)                  Q (read)
> >
> > Given that this diagram is more than 50 lines long, it might go better in
> > a design document describing this part of RCU.  Perhaps less detail or
> > segmented, but the same general idea as this guy:
> >
> > Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
>
> Yes, this sounds like a good place to add it and perhaps we refer to it from the C source file? I can take this up to do over the holidays, if you prefer.
>
> Thanks,
>
>   - Joel
>
>
> >
> > Thoughts?
> >
> >                        Thanx, Paul
  
Joel Fernandes Dec. 17, 2022, 3:21 a.m. UTC | #33
On Fri, Dec 16, 2022 at 10:19 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi,
> On the related subject of this function, I drew a diagram for one of
> the reasons why per-CPU unlock counts have to be scanned first, for a
> particular index, before the per-CPU lock counts, and not the other
> way. Otherwise, a reader that got preempted after reading the index,
> can suddenly get scheduled during the inactive index's scan, and cause
> the total lock and unlock counts to falsely match:
> https://i.imgur.com/79fDWdQ.png

Better diagram: https://i.imgur.com/PXKJnmW.png
(Added the preemption reasoning for Reader 0).

thanks,

 - Joel


> Cheers,
>
>  - Joel
>
>
>
> On Fri, Dec 16, 2022 at 11:54 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> > > On Dec 16, 2022, at 11:51 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
> > >> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
> > >> [...]
> > >>>>>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> > >>>>>> has no new smp_mb() that happens AFTER the flip happened. So it can
> > >>>>>> totally sample the old idx again -- that particular reader will
> > >>>>>> increment twice, but the next time, it will see the flipped one.
> > >>>>>
> > >>>>> I will let you transliterate both.  ;-)
> > >>>>
> > >>>> I think I see what you mean now :)
> > >>>>
> > >>>> I believe the access I am referring to is the read of idx on one side and
> > >>>> the write to idx on the other. However that is incomplete and I need to
> > >>>> pair that with some of other access on both sides.
> > >>>>
> > >>>> So perhaps this:
> > >>>>
> > >>>> Writer does flip + smp_mb + read unlock counts [1]
> > >>>>
> > >>>> Reader does:
> > >>>> read idx + smp_mb() + increment lock counts [2]
> > >>>>
> > >>>> And subsequently reader does
> > >>>> Smp_mb() + increment unlock count. [3]
> > >>>>
> > >>>> So [1] races with either [2] or [2]+[3].
> > >>>>
> > >>>> Is that fair?
> > >>>
> > >>> That does look much better, thank you!
> > >>
> > >> Perhaps a comment with an ASCII diagram will help?
> > >>
> > >>
> > >> Case 2:
> > >> Both the reader and the updater see each other's writes too late, but because
> > >> of memory barriers on both sides, they will eventually see each other's write
> > >> with respect to their own. This is similar to the store-buffer problem. This
> > >> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
> > >>
> > >> The following diagram shows the subtle worst case followed by a simplified
> > >> store-buffer explanation.
> > >>
> > >> READER                  UPDATER
> > >> -------------           ----------
> > >>                           // idx is initially 0.
> > >> read_lock() {
> > >>  READ(idx) = 0;
> > >>  lock[0]++; --------------------------------------------,
> > >>                           flip() {                      |
> > >>                              smp_mb();                  |
> > >>  smp_mb();                                              |
> > >> }                                                        |
> > >>                                                         |
> > >> // RSCS                                                  |
> > >>                                                         |
> > >> read_unlock() {                                          |
> > >>  smp_mb();                                              |
> > >>                              idx++;  // P               |
> > >>                              smp_mb();                  |
> > >>                           }                             |
> > >>                                                         |
> > >>                           scan_readers_idx(0) {         |
> > >>                               count all unlock[0];      |
> > >>                                   |                     |
> > >>                                   |                     |
> > >>  unlock[0]++; //X <--not-counted--`-----,               |
> > >>                                         |               |
> > >> }                                        V               `------,
> > >>                               // Will make sure next scan      |
> > >>                               // will not miss this unlock (X) |
> > >>                               // if other side saw flip (P) ,--`
> > >>                               // Call this MB [1]           |
> > >>                               // Order write(idx) with      |
> > >>                               // next scan's unlock.        |
> > >>                               smp_mb();                 ,---`
> > >> read_lock() {                                            |
> > >>  READ(idx)=0;                                           |
> > >>  lock[0]++; ----------------> count all lock[0];        |
> > >>  smp_mb();         |     }                              |
> > >> }     |             |                                    V
> > >>      |             `---> // Incorrect contribution to lock counting
> > >>      |                   // upto a maximum of 2 times.
> > >>      |
> > >>       `---> // Pairs with MB [1]. Makes sure that
> > >>             // the next read_lock()'s' idx read (Y) is ordered
> > >>             // with above write to unlock[0] (X).
> > >>                            |
> > >> rcu_read_unlock() {         |
> > >>  smp_mb(); <---------------`
> > >>  unlock[0]++;
> > >> }
> > >>
> > >> read_lock() {
> > >>  READ(idx) = 1; //Y
> > >>  lock[1]++;
> > >>  ...
> > >> }
> > >>                           scan_readers_idx(0) {
> > >>                               count all unlock[0]; //Q
> > >>                               ...
> > >>
> > >>
> > >> thanks,
> > >>
> > >> - Joel
> > >>
> > >>                          }
> > >>
> > >> This makes it similar to the store buffer pattern. Using X, Y, P and Q
> > >> annotated above, we get:
> > >>
> > >> READER                    UPDATER
> > >> X (write)                 P (write)
> > >>
> > >> smp_mb();                 smp_mb();
> > >>
> > >> Y (read)                  Q (read)
> > >
> > > Given that this diagram is more than 50 lines long, it might go better in
> > > a design document describing this part of RCU.  Perhaps less detail or
> > > segmented, but the same general idea as this guy:
> > >
> > > Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
> >
> > Yes, this sounds like a good place to add it and perhaps we refer to it from the C source file? I can take this up to do over the holidays, if you prefer.
> >
> > Thanks,
> >
> >   - Joel
> >
> >
> > >
> > > Thoughts?
> > >
> > >                        Thanx, Paul
  
Paul E. McKenney Dec. 17, 2022, 5:15 a.m. UTC | #34
On Fri, Dec 16, 2022 at 10:21:25PM -0500, Joel Fernandes wrote:
> On Fri, Dec 16, 2022 at 10:19 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi,
> > On the related subject of this function, I drew a diagram for one of
> > the reasons why per-CPU unlock counts have to be scanned first, for a
> > particular index, before the per-CPU lock counts, and not the other
> > way. Otherwise, a reader that got preempted after reading the index,
> > can suddenly get scheduled during the inactive index's scan, and cause
> > the total lock and unlock counts to falsely match:
> > https://i.imgur.com/79fDWdQ.png
> 
> Better diagram: https://i.imgur.com/PXKJnmW.png
> (Added the preemption reasoning for Reader 0).

Nice!!!

The other way to look at this is using a memory-ordering viewpoint.
This is a member of the message-passing litmus-test family, and the reader
must read the variables in the opposite order that the writer writes them.

(See the infamous test6.pdf file, "MP" pattern.)

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> > Cheers,
> >
> >  - Joel
> >
> >
> >
> > On Fri, Dec 16, 2022 at 11:54 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > >
> > >
> > > > On Dec 16, 2022, at 11:51 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
> > > >> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
> > > >> [...]
> > > >>>>>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
> > > >>>>>> has no new smp_mb() that happens AFTER the flip happened. So it can
> > > >>>>>> totally sample the old idx again -- that particular reader will
> > > >>>>>> increment twice, but the next time, it will see the flipped one.
> > > >>>>>
> > > >>>>> I will let you transliterate both.  ;-)
> > > >>>>
> > > >>>> I think I see what you mean now :)
> > > >>>>
> > > >>>> I believe the access I am referring to is the read of idx on one side and
> > > >>>> the write to idx on the other. However that is incomplete and I need to
> > > >>>> pair that with some of other access on both sides.
> > > >>>>
> > > >>>> So perhaps this:
> > > >>>>
> > > >>>> Writer does flip + smp_mb + read unlock counts [1]
> > > >>>>
> > > >>>> Reader does:
> > > >>>> read idx + smp_mb() + increment lock counts [2]
> > > >>>>
> > > >>>> And subsequently reader does
> > > >>>> Smp_mb() + increment unlock count. [3]
> > > >>>>
> > > >>>> So [1] races with either [2] or [2]+[3].
> > > >>>>
> > > >>>> Is that fair?
> > > >>>
> > > >>> That does look much better, thank you!
> > > >>
> > > >> Perhaps a comment with an ASCII diagram will help?
> > > >>
> > > >>
> > > >> Case 2:
> > > >> Both the reader and the updater see each other's writes too late, but because
> > > >> of memory barriers on both sides, they will eventually see each other's write
> > > >> with respect to their own. This is similar to the store-buffer problem. This
> > > >> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
> > > >>
> > > >> The following diagram shows the subtle worst case followed by a simplified
> > > >> store-buffer explanation.
> > > >>
> > > >> READER                  UPDATER
> > > >> -------------           ----------
> > > >>                           // idx is initially 0.
> > > >> read_lock() {
> > > >>  READ(idx) = 0;
> > > >>  lock[0]++; --------------------------------------------,
> > > >>                           flip() {                      |
> > > >>                              smp_mb();                  |
> > > >>  smp_mb();                                              |
> > > >> }                                                        |
> > > >>                                                         |
> > > >> // RSCS                                                  |
> > > >>                                                         |
> > > >> read_unlock() {                                          |
> > > >>  smp_mb();                                              |
> > > >>                              idx++;  // P               |
> > > >>                              smp_mb();                  |
> > > >>                           }                             |
> > > >>                                                         |
> > > >>                           scan_readers_idx(0) {         |
> > > >>                               count all unlock[0];      |
> > > >>                                   |                     |
> > > >>                                   |                     |
> > > >>  unlock[0]++; //X <--not-counted--`-----,               |
> > > >>                                         |               |
> > > >> }                                        V               `------,
> > > >>                               // Will make sure next scan      |
> > > >>                               // will not miss this unlock (X) |
> > > >>                               // if other side saw flip (P) ,--`
> > > >>                               // Call this MB [1]           |
> > > >>                               // Order write(idx) with      |
> > > >>                               // next scan's unlock.        |
> > > >>                               smp_mb();                 ,---`
> > > >> read_lock() {                                            |
> > > >>  READ(idx)=0;                                           |
> > > >>  lock[0]++; ----------------> count all lock[0];        |
> > > >>  smp_mb();         |     }                              |
> > > >> }     |             |                                    V
> > > >>      |             `---> // Incorrect contribution to lock counting
> > > >>      |                   // upto a maximum of 2 times.
> > > >>      |
> > > >>       `---> // Pairs with MB [1]. Makes sure that
> > > >>             // the next read_lock()'s' idx read (Y) is ordered
> > > >>             // with above write to unlock[0] (X).
> > > >>                            |
> > > >> rcu_read_unlock() {         |
> > > >>  smp_mb(); <---------------`
> > > >>  unlock[0]++;
> > > >> }
> > > >>
> > > >> read_lock() {
> > > >>  READ(idx) = 1; //Y
> > > >>  lock[1]++;
> > > >>  ...
> > > >> }
> > > >>                           scan_readers_idx(0) {
> > > >>                               count all unlock[0]; //Q
> > > >>                               ...
> > > >>
> > > >>
> > > >> thanks,
> > > >>
> > > >> - Joel
> > > >>
> > > >>                          }
> > > >>
> > > >> This makes it similar to the store buffer pattern. Using X, Y, P and Q
> > > >> annotated above, we get:
> > > >>
> > > >> READER                    UPDATER
> > > >> X (write)                 P (write)
> > > >>
> > > >> smp_mb();                 smp_mb();
> > > >>
> > > >> Y (read)                  Q (read)
> > > >
> > > > Given that this diagram is more than 50 lines long, it might go better in
> > > > a design document describing this part of RCU.  Perhaps less detail or
> > > > segmented, but the same general idea as this guy:
> > > >
> > > > Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
> > >
> > > Yes, this sounds like a good place to add it and perhaps we refer to it from the C source file? I can take this up to do over the holidays, if you prefer.
> > >
> > > Thanks,
> > >
> > >   - Joel
> > >
> > >
> > > >
> > > > Thoughts?
> > > >
> > > >                        Thanx, Paul
  
Joel Fernandes Dec. 17, 2022, 6:32 a.m. UTC | #35
> On Dec 17, 2022, at 12:15 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Fri, Dec 16, 2022 at 10:21:25PM -0500, Joel Fernandes wrote:
>>> On Fri, Dec 16, 2022 at 10:19 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>> 
>>> Hi,
>>> On the related subject of this function, I drew a diagram for one of
>>> the reasons why per-CPU unlock counts have to be scanned first, for a
>>> particular index, before the per-CPU lock counts, and not the other
>>> way. Otherwise, a reader that got preempted after reading the index,
>>> can suddenly get scheduled during the inactive index's scan, and cause
>>> the total lock and unlock counts to falsely match:
>>> https://i.imgur.com/79fDWdQ.png
>> 
>> Better diagram: https://i.imgur.com/PXKJnmW.png
>> (Added the preemption reasoning for Reader 0).
> 
> Nice!!!

Thanks!

> The other way to look at this is using a memory-ordering viewpoint.
> This is a member of the message-passing litmus-test family, and the reader
> must read the variables in the opposite order that the writer writes them.

Exactly, thanks. If we read the unlock counts first, which includes the unlock count of that spurious reader 0 in my figure, then would have already seen the reader 0’s lock count when we sum up the lock counts, thanks to MP pattern - so that spurious reader cannot do any damage or imbalance in the delta between the lock and unlock counts.   In MP pattern, if we see the flag, then we would have already seen the data (message). And on the other CPU the data is written first followed by the write to the flag. Which is exactly what we do in the SRCU as you mentioned with the reversal of the ordering of the variables we read. I will reason about it this way as well in my notes, thank you!

The other way I reason is, during scanning, if we count unlocks first before locks, it’s OK to miss some unlocks, as that just means we will falsely delay ending of GP. However if we count locks first and then unlocks, we are screwed if we miss counting a lock, because if we count the unlock count for that missed lock, then we might end the GP sooner than we should!!!

Thanks,

  - Joel 


> 
> (See the infamous test6.pdf file, "MP" pattern.)
> 
>                            Thanx, Paul
> 
>> thanks,
>> 
>> - Joel
>> 
>> 
>>> Cheers,
>>> 
>>> - Joel
>>> 
>>> 
>>> 
>>> On Fri, Dec 16, 2022 at 11:54 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Dec 16, 2022, at 11:51 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
>>>>> 
>>>>> On Fri, Dec 16, 2022 at 04:32:39PM +0000, Joel Fernandes wrote:
>>>>>> On Thu, Dec 15, 2022 at 05:09:14PM -0800, Paul E. McKenney wrote:
>>>>>> [...]
>>>>>>>>>> 2. unlock()'s smp_mb() happened before Flip+smp_mb() , now the reader
>>>>>>>>>> has no new smp_mb() that happens AFTER the flip happened. So it can
>>>>>>>>>> totally sample the old idx again -- that particular reader will
>>>>>>>>>> increment twice, but the next time, it will see the flipped one.
>>>>>>>>> 
>>>>>>>>> I will let you transliterate both.  ;-)
>>>>>>>> 
>>>>>>>> I think I see what you mean now :)
>>>>>>>> 
>>>>>>>> I believe the access I am referring to is the read of idx on one side and
>>>>>>>> the write to idx on the other. However that is incomplete and I need to
>>>>>>>> pair that with some of other access on both sides.
>>>>>>>> 
>>>>>>>> So perhaps this:
>>>>>>>> 
>>>>>>>> Writer does flip + smp_mb + read unlock counts [1]
>>>>>>>> 
>>>>>>>> Reader does:
>>>>>>>> read idx + smp_mb() + increment lock counts [2]
>>>>>>>> 
>>>>>>>> And subsequently reader does
>>>>>>>> Smp_mb() + increment unlock count. [3]
>>>>>>>> 
>>>>>>>> So [1] races with either [2] or [2]+[3].
>>>>>>>> 
>>>>>>>> Is that fair?
>>>>>>> 
>>>>>>> That does look much better, thank you!
>>>>>> 
>>>>>> Perhaps a comment with an ASCII diagram will help?
>>>>>> 
>>>>>> 
>>>>>> Case 2:
>>>>>> Both the reader and the updater see each other's writes too late, but because
>>>>>> of memory barriers on both sides, they will eventually see each other's write
>>>>>> with respect to their own. This is similar to the store-buffer problem. This
>>>>>> let's a single reader contribute a maximum (unlock minus lock) imbalance of 2.
>>>>>> 
>>>>>> The following diagram shows the subtle worst case followed by a simplified
>>>>>> store-buffer explanation.
>>>>>> 
>>>>>> READER                  UPDATER
>>>>>> -------------           ----------
>>>>>>                          // idx is initially 0.
>>>>>> read_lock() {
>>>>>> READ(idx) = 0;
>>>>>> lock[0]++; --------------------------------------------,
>>>>>>                          flip() {                      |
>>>>>>                             smp_mb();                  |
>>>>>> smp_mb();                                              |
>>>>>> }                                                        |
>>>>>>                                                        |
>>>>>> // RSCS                                                  |
>>>>>>                                                        |
>>>>>> read_unlock() {                                          |
>>>>>> smp_mb();                                              |
>>>>>>                             idx++;  // P               |
>>>>>>                             smp_mb();                  |
>>>>>>                          }                             |
>>>>>>                                                        |
>>>>>>                          scan_readers_idx(0) {         |
>>>>>>                              count all unlock[0];      |
>>>>>>                                  |                     |
>>>>>>                                  |                     |
>>>>>> unlock[0]++; //X <--not-counted--`-----,               |
>>>>>>                                        |               |
>>>>>> }                                        V               `------,
>>>>>>                              // Will make sure next scan      |
>>>>>>                              // will not miss this unlock (X) |
>>>>>>                              // if other side saw flip (P) ,--`
>>>>>>                              // Call this MB [1]           |
>>>>>>                              // Order write(idx) with      |
>>>>>>                              // next scan's unlock.        |
>>>>>>                              smp_mb();                 ,---`
>>>>>> read_lock() {                                            |
>>>>>> READ(idx)=0;                                           |
>>>>>> lock[0]++; ----------------> count all lock[0];        |
>>>>>> smp_mb();         |     }                              |
>>>>>> }     |             |                                    V
>>>>>>     |             `---> // Incorrect contribution to lock counting
>>>>>>     |                   // upto a maximum of 2 times.
>>>>>>     |
>>>>>>      `---> // Pairs with MB [1]. Makes sure that
>>>>>>            // the next read_lock()'s' idx read (Y) is ordered
>>>>>>            // with above write to unlock[0] (X).
>>>>>>                           |
>>>>>> rcu_read_unlock() {         |
>>>>>> smp_mb(); <---------------`
>>>>>> unlock[0]++;
>>>>>> }
>>>>>> 
>>>>>> read_lock() {
>>>>>> READ(idx) = 1; //Y
>>>>>> lock[1]++;
>>>>>> ...
>>>>>> }
>>>>>>                          scan_readers_idx(0) {
>>>>>>                              count all unlock[0]; //Q
>>>>>>                              ...
>>>>>> 
>>>>>> 
>>>>>> thanks,
>>>>>> 
>>>>>> - Joel
>>>>>> 
>>>>>>                         }
>>>>>> 
>>>>>> This makes it similar to the store buffer pattern. Using X, Y, P and Q
>>>>>> annotated above, we get:
>>>>>> 
>>>>>> READER                    UPDATER
>>>>>> X (write)                 P (write)
>>>>>> 
>>>>>> smp_mb();                 smp_mb();
>>>>>> 
>>>>>> Y (read)                  Q (read)
>>>>> 
>>>>> Given that this diagram is more than 50 lines long, it might go better in
>>>>> a design document describing this part of RCU.  Perhaps less detail or
>>>>> segmented, but the same general idea as this guy:
>>>>> 
>>>>> Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
>>>> 
>>>> Yes, this sounds like a good place to add it and perhaps we refer to it from the C source file? I can take this up to do over the holidays, if you prefer.
>>>> 
>>>> Thanks,
>>>> 
>>>>  - Joel
>>>> 
>>>> 
>>>>> 
>>>>> Thoughts?
>>>>> 
>>>>>                       Thanx, Paul
  

Patch

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 68b8d8b150db1..ba12c50ee3658 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -469,24 +469,53 @@  static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)
 
 	/*
 	 * If the locks are the same as the unlocks, then there must have
-	 * been no readers on this index at some time in between. This does
-	 * not mean that there are no more readers, as one could have read
-	 * the current index but not have incremented the lock counter yet.
+	 * been no readers on this index at some point in this function.
+	 * But there might be more readers, as a task might have read
+	 * the current ->srcu_idx but not yet have incremented its CPU's
+	 * ->srcu_lock_count[idx] counter.  In fact, it is possible
+	 * that most of the tasks have been preempted between fetching
+	 * ->srcu_idx and incrementing ->srcu_lock_count[idx].  And there
+	 * could be almost (ULONG_MAX / sizeof(struct task_struct)) tasks
+	 * in a system whose address space was fully populated with memory.
+	 * Call this quantity Nt.
 	 *
-	 * So suppose that the updater is preempted here for so long
-	 * that more than ULONG_MAX non-nested readers come and go in
-	 * the meantime.  It turns out that this cannot result in overflow
-	 * because if a reader modifies its unlock count after we read it
-	 * above, then that reader's next load of ->srcu_idx is guaranteed
-	 * to get the new value, which will cause it to operate on the
-	 * other bank of counters, where it cannot contribute to the
-	 * overflow of these counters.  This means that there is a maximum
-	 * of 2*NR_CPUS increments, which cannot overflow given current
-	 * systems, especially not on 64-bit systems.
+	 * So suppose that the updater is preempted at this point in the
+	 * code for a long time.  That now-preempted updater has already
+	 * flipped ->srcu_idx (possibly during the preceding grace period),
+	 * done an smp_mb() (again, possibly during the preceding grace
+	 * period), and summed up the ->srcu_unlock_count[idx] counters.
+	 * How many times can a given one of the aforementioned Nt tasks
+	 * increment the old ->srcu_idx value's ->srcu_lock_count[idx]
+	 * counter, in the absence of nesting?
 	 *
-	 * OK, how about nesting?  This does impose a limit on nesting
-	 * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
-	 * especially on 64-bit systems.
+	 * It can clearly do so once, given that it has already fetched
+	 * the old value of ->srcu_idx and is just about to use that value
+	 * to index its increment of ->srcu_lock_count[idx].  But as soon as
+	 * it leaves that SRCU read-side critical section, it will increment
+	 * ->srcu_unlock_count[idx], which must follow the updater's above
+	 * read from that same value.  Thus, as soon the reading task does
+	 * an smp_mb() and a later fetch from ->srcu_idx, that task will be
+	 * guaranteed to get the new index.  Except that the increment of
+	 * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
+	 * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
+	 * is before the smp_mb().  Thus, that task might not see the new
+	 * value of ->srcu_idx until the -second- __srcu_read_lock(),
+	 * which in turn means that this task might well increment
+	 * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
+	 * not just once.
+	 *
+	 * That is, there can be almost 2 * Nt further increments of
+	 * ->srcu_lock_count[idx] for the old index.  But this is OK because
+	 * the size of the task_struct structure limits the value of Nt.
+	 *
+	 * OK, but what about nesting?  This does impose a limit on
+	 * nesting of half of the size of the task_struct structure
+	 * (measured in bytes), which should be sufficient.  A late 2022
+	 * TREE01 rcutorture run reported this size to be no less than
+	 * 9408 bytes, allowing up to 4704 levels of nesting, which is
+	 * comfortably beyond excessive.  Especially on 64-bit systems,
+	 * which are unlikely to be configured with an address space fully
+	 * populated with memory, at least not anytime soon.
 	 */
 	return srcu_readers_lock_idx(ssp, idx) == unlocks;
 }
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ee8a6a711719a..399c818fe47ce 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4900,6 +4900,7 @@  void __init rcu_init(void)
 	// Kick-start any polled grace periods that started early.
 	if (!(per_cpu_ptr(&rcu_data, cpu)->mynode->exp_seq_poll_rq & 0x1))
 		(void)start_poll_synchronize_rcu_expedited();
+	pr_alert("sizeof(struct task_struct) = %lu\n", sizeof(struct task_struct));
 }
 
 #include "tree_stall.h"