Message ID | 20231027144050.110601-3-frederic@kernel.org |
---|---|
State | New |
Series | rcu: Fix PF_IDLE related issues v3 |
Headers |
From: Frederic Weisbecker <frederic@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, Boqun Feng <boqun.feng@gmail.com>, Joel Fernandes <joel@joelfernandes.org>, Josh Triplett <josh@joshtriplett.org>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Neeraj Upadhyay <neeraj.upadhyay@amd.com>, "Paul E. McKenney" <paulmck@kernel.org>, Steven Rostedt <rostedt@goodmis.org>, Uladzislau Rezki <urezki@gmail.com>, rcu <rcu@vger.kernel.org>, Zqiang <qiang.zhang1211@gmail.com>, "Liam R. Howlett" <Liam.Howlett@oracle.com>, Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH 2/4] rcu/tasks: Handle new PF_IDLE semantics
Date: Fri, 27 Oct 2023 16:40:48 +0200
Message-Id: <20231027144050.110601-3-frederic@kernel.org>
In-Reply-To: <20231027144050.110601-1-frederic@kernel.org>
References: <20231027144050.110601-1-frederic@kernel.org>
Commit Message
Frederic Weisbecker
Oct. 27, 2023, 2:40 p.m. UTC
The commit:
cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
changed the semantics of what is considered an idle task, such that CPU
boot code preceding the actual idle loop is no longer part of it.
However, this has introduced new potential RCU-tasks stalls when:

1) A grace period is started before init/0 has had a chance to set
   PF_IDLE, keeping it stuck in the holdout list until the boot idle
   task eventually schedules.

2) A grace period is started while some possible CPUs have never been
   online, keeping their idle tasks stuck in the holdout list until
   those CPUs eventually boot up.

3) Similar to 1) but with secondary CPUs: a grace period is started
   concurrently with a secondary CPU booting, putting its idle task in
   the holdout list because PF_IDLE has not yet been observed on it.
   It then stays stuck in the holdout list until that CPU eventually
   schedules. The effect is mitigated here by the hotplug AP thread,
   which must run to bring the CPU up.
Fix this by handling the new semantics of PF_IDLE, keeping in mind that
it may or may not be set on an idle task. Take advantage of this to
strengthen the coverage of an RCU-tasks quiescent state within an idle
task, excluding the CPU boot code from it: only the code running within
the idle loop is now a quiescent state, along with offline CPUs.
Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tasks.h | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
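In short, the patch makes the RCU-tasks holdout test aware of the idle loop and of CPU hotplug state. A condensed restatement of the check it introduces is shown below (the full hunk is quoted at the end of this page); rcu_cpu_online() is assumed to come from an earlier patch in this series.

	/*
	 * Condensed restatement of the check added by this patch;
	 * see the full kernel/rcu/tasks.h hunk at the end of this page.
	 */
	static bool rcu_tasks_is_holdout(struct task_struct *t)
	{
		int cpu = task_cpu(t);

		if (!READ_ONCE(t->on_rq))	/* voluntarily blocked */
			return false;
		if (is_idle_task(t))		/* within the idle loop */
			return false;
		if (t == idle_task(cpu) && !rcu_cpu_online(cpu))
			return false;		/* idle task of an offline CPU */
		return true;
	}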
Comments
On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> +	/* Has the task been seen voluntarily sleeping? */
> +	if (!READ_ONCE(t->on_rq))
> +		return false;
>
> -	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {

AFAICT this ->on_rq usage is outside of scheduler locks and that READ_ONCE isn't going to help much.

Obviously a pre-existing issue, and I suppose all it cares about is seeing a 0 or not, irrespective of the races, but urgh..
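For reference, the ->on_rq write side this remark is about is a plain store on the scheduler side, abridged here from the kernel/sched/core.c lines visible in the WRITE_ONCE() patch Paul posts later in this thread:

	/*
	 * Abridged from kernel/sched/core.c: ->on_rq is updated with plain
	 * stores while holding the runqueue lock, with nothing pairing the
	 * lockless READ_ONCE() readers in RCU Tasks.
	 */
	void activate_task(struct rq *rq, struct task_struct *p, int flags)
	{
		enqueue_task(rq, p, flags);
		p->on_rq = TASK_ON_RQ_QUEUED;
	}

	void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
	{
		p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
		dequeue_task(rq, p, flags);
	}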
On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> > +	/* Has the task been seen voluntarily sleeping? */
> > +	if (!READ_ONCE(t->on_rq))
> > +		return false;
> >
> > -	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
>
> AFAICT this ->on_rq usage is outside of scheduler locks and that READ_ONCE isn't going to help much.
>
> Obviously a pre-existing issue, and I suppose all it cares about is seeing a 0 or not, irrespective of the races, but urgh..

The trick is that RCU Tasks only needs to spot a task voluntarily blocked once at any point in the grace period. The beginning and end of the grace-period process have full barriers, so if this code sees t->on_rq equal to zero, we know that the task was voluntarily blocked at some point during the grace period, as required.

In theory, we could acquire a scheduler lock, but in practice this would cause CPU-latency problems at a certain set of large datacenters, and for once, not the datacenters operated by my employer.

In theory, we could make separate lists of tasks that we need to wait on, thus avoiding the need to scan the full task list, but in practice this would require a synchronized linked-list operation on every voluntary context switch, both in and out.

In theory, the task list could be sharded, so that it could be scanned incrementally, but in practice, this is a bit non-trivial. Though this particular use case doesn't care about new tasks, so it could live with something simpler than would be required for certain types of signal delivery.

In theory, we could place rcu_segcblist-like mid pointers into the task list, so that scans could restart from any mid pointer. Care is required because the mid pointers would likely need to be recycled as new tasks are added. Plus care is needed because it has been a good long time since I have looked at the code managing the tasks list, and I am probably woefully out of date on how it all works.

So, is there a better way?

							Thanx, Paul
On Fri, Oct 27, 2023 at 02:23:56PM -0700, Paul E. McKenney wrote: > On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote: > > On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote: > > > > > + /* Has the task been seen voluntarily sleeping? */ > > > + if (!READ_ONCE(t->on_rq)) > > > + return false; > > > > > - if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) { > > > > AFAICT this ->on_rq usage is outside of scheduler locks and that > > READ_ONCE isn't going to help much. > > > > Obviously a pre-existing issue, and I suppose all it cares about is > > seeing a 0 or not, irrespective of the races, but urgh.. > > The trick is that RCU Tasks only needs to spot a task voluntarily blocked > once at any point in the grace period. The beginning and end of the > grace-period process have full barriers, so if this code sees t->on_rq > equal to zero, we know that the task was voluntarily blocked at some > point during the grace period, as required. > > In theory, we could acquire a scheduler lock, but in practice this would > cause CPU-latency problems at a certain set of large datacenters, and > for once, not the datacenters operated by my employer. > > In theory, we could make separate lists of tasks that we need to wait on, > thus avoiding the need to scan the full task list, but in practice this > would require a synchronized linked-list operation on every voluntary > context switch, both in and out. > > In theory, the task list could sharded, so that it could be scanned > incrementally, but in practice, this is a bit non-trivial. Though this > particular use case doesn't care about new tasks, so it could live with > something simpler than would be required for certain types of signal > delivery. > > In theory, we could place rcu_segcblist-like mid pointers into the > task list, so that scans could restart from any mid pointer. Care is > required because the mid pointers would likely need to be recycled as > new tasks are added. Plus care is needed because it has been a good > long time since I have looked at the code managing the tasks list, > and I am probably woefully out of date on how it all works. > > So, is there a better way? Nah, this is more or less what I feared. I just worry people will come around and put WRITE_ONCE() on the other end. I don't think that'll buy us much. Nor do I think the current READ_ONCE()s actually matter. But perhaps put a comment there, that we don't care for the races and only need to observe a 0 once or something.
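For illustration, a comment of the kind Peter is suggesting could look roughly like this above the new test (hypothetical wording, not something posted in this thread):

	/*
	 * Racy lockless read: nothing on the scheduler side pairs with this
	 * READ_ONCE(), and that is fine.  RCU Tasks only needs to observe a
	 * zero here once at any time during the grace period; the
	 * synchronize_rcu() calls at grace-period start and end provide the
	 * needed ordering.
	 */
	if (!READ_ONCE(t->on_rq))
		return false;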
On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 02:23:56PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 27, 2023 at 09:20:26PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 27, 2023 at 04:40:48PM +0200, Frederic Weisbecker wrote:
> > > > +	/* Has the task been seen voluntarily sleeping? */
> > > > +	if (!READ_ONCE(t->on_rq))
> > > > +		return false;
> > > >
> > > > -	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
> > >
> > > AFAICT this ->on_rq usage is outside of scheduler locks and that READ_ONCE isn't going to help much.
> > >
> > > Obviously a pre-existing issue, and I suppose all it cares about is seeing a 0 or not, irrespective of the races, but urgh..
> >
> > The trick is that RCU Tasks only needs to spot a task voluntarily blocked once at any point in the grace period. The beginning and end of the grace-period process have full barriers, so if this code sees t->on_rq equal to zero, we know that the task was voluntarily blocked at some point during the grace period, as required.
> >
> > In theory, we could acquire a scheduler lock, but in practice this would cause CPU-latency problems at a certain set of large datacenters, and for once, not the datacenters operated by my employer.
> >
> > In theory, we could make separate lists of tasks that we need to wait on, thus avoiding the need to scan the full task list, but in practice this would require a synchronized linked-list operation on every voluntary context switch, both in and out.
> >
> > In theory, the task list could be sharded, so that it could be scanned incrementally, but in practice, this is a bit non-trivial. Though this particular use case doesn't care about new tasks, so it could live with something simpler than would be required for certain types of signal delivery.
> >
> > In theory, we could place rcu_segcblist-like mid pointers into the task list, so that scans could restart from any mid pointer. Care is required because the mid pointers would likely need to be recycled as new tasks are added. Plus care is needed because it has been a good long time since I have looked at the code managing the tasks list, and I am probably woefully out of date on how it all works.
> >
> > So, is there a better way?
>
> Nah, this is more or less what I feared. I just worry people will come around and put WRITE_ONCE() on the other end. I don't think that'll buy us much. Nor do I think the current READ_ONCE()s actually matter.

My friend, you trust compilers more than I ever will. ;-)

> But perhaps put a comment there, that we don't care for the races and only need to observe a 0 once or something.

There are these two passages in the big block comment preceding the RCU Tasks code:

	// rcu_tasks_pregp_step():
	//	Invokes synchronize_rcu() in order to wait for all in-flight
	//	t->on_rq and t->nvcsw transitions to complete.  This works because
	//	all such transitions are carried out with interrupts disabled.

and:

	// rcu_tasks_postgp():
	//	Invokes synchronize_rcu() in order to ensure that all prior
	//	t->on_rq and t->nvcsw transitions are seen by all CPUs and tasks
	//	to have happened before the end of this RCU Tasks grace period.
	//	Again, this works because all such transitions are carried out
	//	with interrupts disabled.

The rcu_tasks_pregp_step() function contains this comment:

	/*
	 * Wait for all pre-existing t->on_rq and t->nvcsw transitions
	 * to complete.  Invoking synchronize_rcu() suffices because all
	 * these transitions occur with interrupts disabled.  Without this
	 * synchronize_rcu(), a read-side critical section that started
	 * before the grace period might be incorrectly seen as having
	 * started after the grace period.
	 *
	 * This synchronize_rcu() also dispenses with the need for a
	 * memory barrier on the first store to t->rcu_tasks_holdout,
	 * as it forces the store to happen after the beginning of the
	 * grace period.
	 */

And the rcu_tasks_postgp() function contains this comment:

	/*
	 * Because ->on_rq and ->nvcsw are not guaranteed to have a full
	 * memory barriers prior to them in the schedule() path, memory
	 * reordering on other CPUs could cause their RCU-tasks read-side
	 * critical sections to extend past the end of the grace period.
	 * However, because these ->nvcsw updates are carried out with
	 * interrupts disabled, we can use synchronize_rcu() to force the
	 * needed ordering on all such CPUs.
	 *
	 * This synchronize_rcu() also confines all ->rcu_tasks_holdout
	 * accesses to be within the grace period, avoiding the need for
	 * memory barriers for ->rcu_tasks_holdout accesses.
	 *
	 * In addition, this synchronize_rcu() waits for exiting tasks
	 * to complete their final preempt_disable() region of execution,
	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
	 * enforcing the whole region before tasklist removal until
	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
	 * read side critical section.
	 */

Does that suffice, or should we add more?

							Thanx, Paul
On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > Nah, this is more or less what I feared. I just worry people will come > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > us much. Nor do I think the current READ_ONCE()s actually matter. > > My friend, you trust compilers more than I ever will. ;-) Well, we only use the values {0,1,2}, that's contained in the first byte. Are we saying compiler will not only byte-split but also bit-split the loads? But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't getting you anything, and if you really worried about it, shouldn't you have proposed a patch making it all WRITE_ONCE() back when you did this tasks-rcu stuff? > > But perhaps put a comment there, that we don't care for the races and > > only need to observe a 0 once or something. > > There are these two passagers in the big lock comment preceding the > RCU Tasks code: > // rcu_tasks_pregp_step(): > // Invokes synchronize_rcu() in order to wait for all in-flight > // t->on_rq and t->nvcsw transitions to complete. This works because > // all such transitions are carried out with interrupts disabled. > Does that suffice, or should we add more? Probably sufficient. If one were to have used the search option :-) Anyway, this brings me to nvcsw, exact same problem there, except possibly worse, because now we actually do care about the full word. No WRITE_ONCE() write side, so the READ_ONCE() don't help against store-tearing (however unlikely that actually is in this case). Also, I'm not entirely sure I see why you need on_rq and nvcsw. Would not nvcsw increasing be enough to know it passed through a quiescent state? Are you trying to say that if nvcsw hasn't advanced but on_rq is still 0, nothing has changed and you can proceed? Or rather, looking at the code it seems use the inverse, if on_rq, nvcsw must change. Makes sense I suppose, no point waiting for nvcsw to change if the task never did anything.
On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote: > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > > > Nah, this is more or less what I feared. I just worry people will come > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > > us much. Nor do I think the current READ_ONCE()s actually matter. > > > > My friend, you trust compilers more than I ever will. ;-) > > Well, we only use the values {0,1,2}, that's contained in the first > byte. Are we saying compiler will not only byte-split but also > bit-split the loads? > > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't > getting you anything, and if you really worried about it, shouldn't you > have proposed a patch making it all WRITE_ONCE() back when you did this > tasks-rcu stuff? There are not all that many of them. If such a WRITE_ONCE() patch would be welcome, I would be happy to put it together. > > > But perhaps put a comment there, that we don't care for the races and > > > only need to observe a 0 once or something. > > > > There are these two passagers in the big lock comment preceding the > > RCU Tasks code: > > > // rcu_tasks_pregp_step(): > > // Invokes synchronize_rcu() in order to wait for all in-flight > > // t->on_rq and t->nvcsw transitions to complete. This works because > > // all such transitions are carried out with interrupts disabled. > > > Does that suffice, or should we add more? > > Probably sufficient. If one were to have used the search option :-) > > Anyway, this brings me to nvcsw, exact same problem there, except > possibly worse, because now we actually do care about the full word. > > No WRITE_ONCE() write side, so the READ_ONCE() don't help against > store-tearing (however unlikely that actually is in this case). Again, if such a WRITE_ONCE() patch would be welcome, I would be happy to put it together. > Also, I'm not entirely sure I see why you need on_rq and nvcsw. Would > not nvcsw increasing be enough to know it passed through a quiescent > state? Are you trying to say that if nvcsw hasn't advanced but on_rq is > still 0, nothing has changed and you can proceed? > > Or rather, looking at the code it seems use the inverse, if on_rq, nvcsw > must change. > > Makes sense I suppose, no point waiting for nvcsw to change if the task > never did anything. Exactly, the on_rq check is needed to avoid excessively long grace periods for tasks that are blocked for long periods of time. Thanx, Paul
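The pairing Paul describes is what the holdout re-check ends up implementing. Condensed from check_holdout_task() as it appears in the hunk at the end of this page (the NO_HZ_FULL special case is omitted), a task drops out of the holdout list once any of these holds:

	if (!READ_ONCE(t->rcu_tasks_holdout) ||
	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||  /* voluntarily context-switched */
	    !rcu_tasks_is_holdout(t)) {                   /* blocked, in the idle loop, or on an offline CPU */
		WRITE_ONCE(t->rcu_tasks_holdout, false);
		/* ...and the task is removed from the holdout list. */
	}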
On Mon, Oct 30, 2023 at 01:11:41PM -0700, Paul E. McKenney wrote: > On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote: > > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > > > > > Nah, this is more or less what I feared. I just worry people will come > > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > > > us much. Nor do I think the current READ_ONCE()s actually matter. > > > > > > My friend, you trust compilers more than I ever will. ;-) > > > > Well, we only use the values {0,1,2}, that's contained in the first > > byte. Are we saying compiler will not only byte-split but also > > bit-split the loads? > > > > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't > > getting you anything, and if you really worried about it, shouldn't you > > have proposed a patch making it all WRITE_ONCE() back when you did this > > tasks-rcu stuff? > > There are not all that many of them. If such a WRITE_ONCE() patch would > be welcome, I would be happy to put it together. > > > > > But perhaps put a comment there, that we don't care for the races and > > > > only need to observe a 0 once or something. > > > > > > There are these two passagers in the big lock comment preceding the > > > RCU Tasks code: > > > > > // rcu_tasks_pregp_step(): > > > // Invokes synchronize_rcu() in order to wait for all in-flight > > > // t->on_rq and t->nvcsw transitions to complete. This works because > > > // all such transitions are carried out with interrupts disabled. > > > > > Does that suffice, or should we add more? > > > > Probably sufficient. If one were to have used the search option :-) > > > > Anyway, this brings me to nvcsw, exact same problem there, except > > possibly worse, because now we actually do care about the full word. > > > > No WRITE_ONCE() write side, so the READ_ONCE() don't help against > > store-tearing (however unlikely that actually is in this case). > > Again, if such a WRITE_ONCE() patch would be welcome, I would be happy > to put it together. Welcome is not the right word. What bugs me most is that this was never raised when this code was written :/ Mostly my problem is that GCC generates such utter shite when you mention volatile. See, the below patch changes the perfectly fine and non-broken: 0148 1d8: 49 83 06 01 addq $0x1,(%r14) into: 0148 1d8: 49 8b 06 mov (%r14),%rax 014b 1db: 48 83 c0 01 add $0x1,%rax 014f 1df: 49 89 06 mov %rax,(%r14) For absolutely no reason :-( At least clang doesn't do this, it stays: 0403 413: 49 ff 45 00 incq 0x0(%r13) irrespective of the volatile. --- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 802551e0009b..d616211b9151 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6575,8 +6575,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) */ static void __sched notrace __schedule(unsigned int sched_mode) { struct task_struct *prev, *next; - unsigned long *switch_count; + volatile unsigned long *switch_count; unsigned long prev_state; struct rq_flags rf; struct rq *rq;
Hello, On Tue, 31 Oct 2023, Peter Zijlstra wrote: (I can't say anything about the WRITE_ONCE/rcu code, just about the below codegen part) > Welcome is not the right word. What bugs me most is that this was never > raised when this code was written :/ > > Mostly my problem is that GCC generates such utter shite when you > mention volatile. See, the below patch changes the perfectly fine and > non-broken: > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) What is non-broken here that is ... > into: > > 0148 1d8: 49 8b 06 mov (%r14),%rax > 014b 1db: 48 83 c0 01 add $0x1,%rax > 014f 1df: 49 89 06 mov %rax,(%r14) ... broken here? (Sure code size and additional register use, but I don't think you mean this with broken). > For absolutely no reason :-( The reason is simple (and should be obvious): to adhere to the abstract machine regarding volatile. When x is volatile then x++ consists of a read and a write, in this order. The easiest way to ensure this is to actually generate a read and a write instruction. Anything else is an optimization, and for each such optimization you need to actively find an argument why this optimization is correct to start with (and then if it's an optimization at all). In this case the argument needs to somehow involve arguing that an rmw instruction on x86 is in fact completely equivalent to the separate instructions, from read cycle to write cycle over all pipeline stages, on all implementations of x86. I.e. that a rmw instruction is spec'ed to be equivalent. You most probably can make that argument in this specific case, I'll give you that. But why bother to start with, in a piece of software that is already fairly complex (the compiler)? It's much easier to just not do much anything with volatile accesses at all and be guaranteed correct. Even more so as the software author, when using volatile, most likely is much more interested in correct code (even from a abstract machine perspective) than micro optimizations. > At least clang doesn't do this, it stays: > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > irrespective of the volatile. And, are you 100% sure that this is correct? Even for x86 CPU pipeline implementations that you aren't intimately knowing about? ;-) But all that seems to be a side-track anyway, what's your real worry with the code sequence generated by GCC? Ciao, Michael.
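A minimal example of the difference being debated, assuming a small standalone C file rather than the scheduler code itself: the volatile object forces a load and then a store in the abstract machine, and per the listings earlier in the thread GCC emits them literally while clang folds them back into a single x86 read-modify-write.

	unsigned long plain_count;
	volatile unsigned long volatile_count;

	void tick(void)
	{
		plain_count++;     /* compilers are free to emit: addq $1, plain_count(%rip) */
		volatile_count++;  /* abstract machine: load then store; per the thread, GCC
		                    * emits mov/add/mov while clang still emits a single incq */
	}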
On Tue, Oct 31, 2023 at 10:52:02AM +0100, Peter Zijlstra wrote: > On Mon, Oct 30, 2023 at 01:11:41PM -0700, Paul E. McKenney wrote: > > On Mon, Oct 30, 2023 at 09:21:38AM +0100, Peter Zijlstra wrote: > > > On Fri, Oct 27, 2023 at 04:41:30PM -0700, Paul E. McKenney wrote: > > > > On Sat, Oct 28, 2023 at 12:46:28AM +0200, Peter Zijlstra wrote: > > > > > > > > Nah, this is more or less what I feared. I just worry people will come > > > > > around and put WRITE_ONCE() on the other end. I don't think that'll buy > > > > > us much. Nor do I think the current READ_ONCE()s actually matter. > > > > > > > > My friend, you trust compilers more than I ever will. ;-) > > > > > > Well, we only use the values {0,1,2}, that's contained in the first > > > byte. Are we saying compiler will not only byte-split but also > > > bit-split the loads? > > > > > > But again, lacking the WRITE_ONCE() counterpart, this READ_ONCE() isn't > > > getting you anything, and if you really worried about it, shouldn't you > > > have proposed a patch making it all WRITE_ONCE() back when you did this > > > tasks-rcu stuff? > > > > There are not all that many of them. If such a WRITE_ONCE() patch would > > be welcome, I would be happy to put it together. > > > > > > > But perhaps put a comment there, that we don't care for the races and > > > > > only need to observe a 0 once or something. > > > > > > > > There are these two passagers in the big lock comment preceding the > > > > RCU Tasks code: > > > > > > > // rcu_tasks_pregp_step(): > > > > // Invokes synchronize_rcu() in order to wait for all in-flight > > > > // t->on_rq and t->nvcsw transitions to complete. This works because > > > > // all such transitions are carried out with interrupts disabled. > > > > > > > Does that suffice, or should we add more? > > > > > > Probably sufficient. If one were to have used the search option :-) > > > > > > Anyway, this brings me to nvcsw, exact same problem there, except > > > possibly worse, because now we actually do care about the full word. > > > > > > No WRITE_ONCE() write side, so the READ_ONCE() don't help against > > > store-tearing (however unlikely that actually is in this case). > > > > Again, if such a WRITE_ONCE() patch would be welcome, I would be happy > > to put it together. > > Welcome is not the right word. What bugs me most is that this was never > raised when this code was written :/ Me, I consider those READ_ONCE() calls to be documentation as well as defense against overly enthusiastic optimizers. "This access is racy." > Mostly my problem is that GCC generates such utter shite when you > mention volatile. See, the below patch changes the perfectly fine and > non-broken: > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) > > into: > > 0148 1d8: 49 8b 06 mov (%r14),%rax > 014b 1db: 48 83 c0 01 add $0x1,%rax > 014f 1df: 49 89 06 mov %rax,(%r14) > > For absolutely no reason :-( > > At least clang doesn't do this, it stays: > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > irrespective of the volatile. Sounds like a bug in GCC, perhaps depending on the microarchitecture in question. And it was in fact reported in the past, but closed as not-a-bug. Perhaps clang's fix for this will help GCC along. And yes, I do see that ++*switch_count in __schedule(). So, at least until GCC catches up to clang's code generation, I take it that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on ->on_rq? 
Thanx, Paul > --- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 802551e0009b..d616211b9151 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -6575,8 +6575,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) > */ > static void __sched notrace __schedule(unsigned int sched_mode) > { > struct task_struct *prev, *next; > - unsigned long *switch_count; > + volatile unsigned long *switch_count; > unsigned long prev_state; > struct rq_flags rf; > struct rq *rq;
On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote: > Hello, > > On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > (I can't say anything about the WRITE_ONCE/rcu code, just about the below > codegen part) > > > Welcome is not the right word. What bugs me most is that this was never > > raised when this code was written :/ > > > > Mostly my problem is that GCC generates such utter shite when you > > mention volatile. See, the below patch changes the perfectly fine and > > non-broken: > > > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) > > What is non-broken here that is ... > > > into: > > > > 0148 1d8: 49 8b 06 mov (%r14),%rax > > 014b 1db: 48 83 c0 01 add $0x1,%rax > > 014f 1df: 49 89 06 mov %rax,(%r14) > > ... broken here? (Sure code size and additional register use, but I don't > think you mean this with broken). > > > For absolutely no reason :-( > > The reason is simple (and should be obvious): to adhere to the abstract > machine regarding volatile. When x is volatile then x++ consists of a > read and a write, in this order. The easiest way to ensure this is to > actually generate a read and a write instruction. Anything else is an > optimization, and for each such optimization you need to actively find an > argument why this optimization is correct to start with (and then if it's > an optimization at all). In this case the argument needs to somehow > involve arguing that an rmw instruction on x86 is in fact completely > equivalent to the separate instructions, from read cycle to write cycle > over all pipeline stages, on all implementations of x86. I.e. that a rmw > instruction is spec'ed to be equivalent. > > You most probably can make that argument in this specific case, I'll give > you that. But why bother to start with, in a piece of software that is > already fairly complex (the compiler)? It's much easier to just not do > much anything with volatile accesses at all and be guaranteed correct. > Even more so as the software author, when using volatile, most likely is > much more interested in correct code (even from a abstract machine > perspective) than micro optimizations. I suspect that Peter cares because the code in question is on a fast path within the scheduler. > > At least clang doesn't do this, it stays: > > > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > > > irrespective of the volatile. > > And, are you 100% sure that this is correct? Even for x86 CPU > pipeline implementations that you aren't intimately knowing about? ;-) Me, I am not even sure that the load-increment-store is always faithfully executed by the hardware. ;-) > But all that seems to be a side-track anyway, what's your real worry with > the code sequence generated by GCC? Again, my guess is that Peter cares deeply about the speed of the fast path on which this increment occurs. Thanx, Paul
On Tue, Oct 31, 2023 at 02:16:34PM +0000, Michael Matz wrote: > > Mostly my problem is that GCC generates such utter shite when you > > mention volatile. See, the below patch changes the perfectly fine and > > non-broken: > > > > 0148 1d8: 49 83 06 01 addq $0x1,(%r14) > > What is non-broken here that is ... > > > into: > > > > 0148 1d8: 49 8b 06 mov (%r14),%rax > > 014b 1db: 48 83 c0 01 add $0x1,%rax > > 014f 1df: 49 89 06 mov %rax,(%r14) > > ... broken here? (Sure code size and additional register use, but I don't > think you mean this with broken). The point was that the code was perfectly fine without adding the volatile, and adding volatile makes it worse. > > For absolutely no reason :-( > > The reason is simple (and should be obvious): to adhere to the abstract > machine regarding volatile. When x is volatile then x++ consists of a > read and a write, in this order. The easiest way to ensure this is to > actually generate a read and a write instruction. Anything else is an > optimization, and for each such optimization you need to actively find an > argument why this optimization is correct to start with (and then if it's > an optimization at all). In this case the argument needs to somehow > involve arguing that an rmw instruction on x86 is in fact completely > equivalent to the separate instructions, from read cycle to write cycle > over all pipeline stages, on all implementations of x86. I.e. that a rmw > instruction is spec'ed to be equivalent. > > You most probably can make that argument in this specific case, I'll give > you that. But why bother to start with, in a piece of software that is > already fairly complex (the compiler)? It's much easier to just not do > much anything with volatile accesses at all and be guaranteed correct. > Even more so as the software author, when using volatile, most likely is > much more interested in correct code (even from a abstract machine > perspective) than micro optimizations. There's a pile of situations where a RmW instruction is actively different vs a load-store split, esp for volatile variables that are explicitly expected to change asynchronously. The original RmW instruction is IRQ-safe, while the load-store version is not. If an interrupt lands in between the load and store and also modifies the variable then the store after interrupt-return will over-write said modification. These are not equivalent. In this case that's not relevant because the increment happens to happen with IRQs disabled. But the point is that these forms are very much not equivalent. > > At least clang doesn't do this, it stays: > > > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > > > irrespective of the volatile. > > And, are you 100% sure that this is correct? Even for x86 CPU > pipeline implementations that you aren't intimately knowing about? ;-) It so happens that the x86 architecture does guarantee RmW ops are IRQ-safe or locally atomic. SMP/concurrent loads will observe either pre or post but no intermediate state as well. > But all that seems to be a side-track anyway, what's your real worry with > the code sequence generated by GCC? In this case it's sub-optimal code, both larger and possibly slower for having two memops. The reason to have volatile is because that's what Linux uses to dis-allow store-tearing, something that doesn't happen in this case. 
A suitably insane but conforming compiler could compile a non-volatile memory increment into something insane like:

	load byte-0, r1
	increment r1
	store r1, byte-0
	jno done
	load byte-1, r1
	increment r1
	store r1, byte-1
	jno done
	...
done:

We want to explicitly dis-allow this. I know C has recently (2011) grown this _Atomic thing, but that has other problems.
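Background for the store-tearing remark: Linux's READ_ONCE()/WRITE_ONCE() boil down to volatile accesses, which is why the volatile codegen discussion applies to them at all. A simplified sketch of the idea follows; the real definitions live in include/asm-generic/rwonce.h and handle additional details.

	/* Simplified sketch only; not the actual kernel definitions. */
	#define READ_ONCE_SKETCH(x)						\
		(*(const volatile typeof(x) *)&(x))

	#define WRITE_ONCE_SKETCH(x, val)					\
		do { *(volatile typeof(x) *)&(x) = (val); } while (0)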
On Tue, Oct 31, 2023 at 07:24:13AM -0700, Paul E. McKenney wrote: > So, at least until GCC catches up to clang's code generation, I take it > that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on > ->on_rq? I've not done the patch yet, but I suspect those would be fine, those are straight up stores, hard to get wrong (famous last words).
Hello, On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > > For absolutely no reason :-( > > > > The reason is simple (and should be obvious): to adhere to the abstract > > machine regarding volatile. When x is volatile then x++ consists of a > > read and a write, in this order. The easiest way to ensure this is to > > actually generate a read and a write instruction. Anything else is an > > optimization, and for each such optimization you need to actively find an > > argument why this optimization is correct to start with (and then if it's > > an optimization at all). In this case the argument needs to somehow > > involve arguing that an rmw instruction on x86 is in fact completely > > equivalent to the separate instructions, from read cycle to write cycle > > over all pipeline stages, on all implementations of x86. I.e. that a rmw > > instruction is spec'ed to be equivalent. > > > > You most probably can make that argument in this specific case, I'll give > > you that. But why bother to start with, in a piece of software that is > > already fairly complex (the compiler)? It's much easier to just not do > > much anything with volatile accesses at all and be guaranteed correct. > > Even more so as the software author, when using volatile, most likely is > > much more interested in correct code (even from a abstract machine > > perspective) than micro optimizations. > > There's a pile of situations where a RmW instruction is actively > different vs a load-store split, esp for volatile variables that are > explicitly expected to change asynchronously. > > The original RmW instruction is IRQ-safe, while the load-store version > is not. If an interrupt lands in between the load and store and also > modifies the variable then the store after interrupt-return will > over-write said modification. > > These are not equivalent. Okay, then there you have it. Namely that LLVM has a bug (but see next paragraph). For volatile x, x++ _must_ expand to a separate read and write, because the abstract machine of C says so. If a RmW isn't equivalent to that, then it can't be used in this situation. If you _have_ to use a RmW for other reasons like interrupt safety, then a volatile variable is not the way to force this, as C simply doesn't have that concept and hence can't talk about it. (Of course it can't, as not all architectures could implement such, if it were required). (If an RmW merely gives you more guarantees than a split load-store then of course LLVM doesn't have a bug, but you said not-equivalent, so I'm assuming the worst, that RmW also has fewer (other) guarantees) > > > At least clang doesn't do this, it stays: > > > > > > 0403 413: 49 ff 45 00 incq 0x0(%r13) > > > > > > irrespective of the volatile. > > > > And, are you 100% sure that this is correct? Even for x86 CPU > > pipeline implementations that you aren't intimately knowing about? ;-) > > It so happens that the x86 architecture does guarantee RmW ops are > IRQ-safe or locally atomic. SMP/concurrent loads will observe either > pre or post but no intermediate state as well. So, are RMW ops a strict superset (vis the guarantees they give) of split load-store? If so we can at least say that using RMW is a valid optimization :) Still, an optmization only. > > But all that seems to be a side-track anyway, what's your real worry with > > the code sequence generated by GCC? > > In this case it's sub-optimal code, both larger and possibly slower for > having two memops. 
> > The reason to have volatile is because that's what Linux uses to > dis-allow store-tearing, something that doesn't happen in this case. A > suitably insane but conforming compiler could compile a non-volatile > memory increment into something insane like: > > load byte-0, r1 > increment r1 > store r1, byte-0 > jno done > load byte-1, r1 > increment ri > store r1, byte 1 > jno done > ... > done: > > We want to explicitly dis-allow this. Yeah, I see. Within C you don't have much choice than volatile for this :-/ Funny thing: on some architectures this is actually what is generated sometimes, even if it has multi-byte loads/stores. This came up recently on the gcc list and the byte-per-byte sequence was faster ;-) (it was rather: load-by-bytes, form whole value via shifts, increment, store-by-bytes) Insane codegen for insane micro-architectures! > I know C has recently (2011) grown this _Atomic thing, but that has > other problems. Yeah. So, hmm, I don't quite know what to say, you're between a rock and a hard place, I guess. You have to use volatile for its effects but then are unhappy about its effects :) If you can confirm the above about validity of the optimization, then at least there'd by a point for adding a peephole in GCC for this, even if current codegen isn't a bug, but I still wouldn't hold my breath. volatile is so ... ewww, it's best left alone. Ciao, Michael.
On Tue, Oct 31, 2023 at 03:55:20PM +0000, Michael Matz wrote: > > There's a pile of situations where a RmW instruction is actively > > different vs a load-store split, esp for volatile variables that are > > explicitly expected to change asynchronously. > > > > The original RmW instruction is IRQ-safe, while the load-store version > > is not. If an interrupt lands in between the load and store and also > > modifies the variable then the store after interrupt-return will > > over-write said modification. > > > > These are not equivalent. > > Okay, then there you have it. Namely that LLVM has a bug (but see next > paragraph). For volatile x, x++ _must_ expand to a separate read and > write, because the abstract machine of C says so. If a RmW isn't > equivalent to that, then it can't be used in this situation. If you > _have_ to use a RmW for other reasons like interrupt safety, then a > volatile variable is not the way to force this, as C simply doesn't have > that concept and hence can't talk about it. (Of course it can't, as not > all architectures could implement such, if it were required). Yeah, RISC archs typically lack the RmW ops. I can understand C not mandating their use. However, on architectures that do have them, using them makes a ton of sense. For us living in the real world, this C abstract machine is mostly a pain in the arse :-) > (If an RmW merely gives you more guarantees than a split load-store then > of course LLVM doesn't have a bug, but you said not-equivalent, so I'm > assuming the worst, that RmW also has fewer (other) guarantees) RmW is strict superset of load-store, and as such not equivalent :-) Specifically, using volatile degrades the guarantees -- which is counter intuitive. > So, are RMW ops a strict superset (vis the guarantees they give) of split > load-store? If so we can at least say that using RMW is a valid > optimization :) Still, an optmization only. This. > > > But all that seems to be a side-track anyway, what's your real worry with > > > the code sequence generated by GCC? > > > > In this case it's sub-optimal code, both larger and possibly slower for > > having two memops. > > > > The reason to have volatile is because that's what Linux uses to > > dis-allow store-tearing, something that doesn't happen in this case. A > > suitably insane but conforming compiler could compile a non-volatile > > memory increment into something insane like: > > > > load byte-0, r1 > > increment r1 > > store r1, byte-0 > > jno done > > load byte-1, r1 > > increment ri > > store r1, byte 1 > > jno done > > ... > > done: > > > > We want to explicitly dis-allow this. > > Yeah, I see. Within C you don't have much choice than volatile for this > :-/ Funny thing: on some architectures this is actually what is generated > sometimes, even if it has multi-byte loads/stores. This came up > recently on the gcc list and the byte-per-byte sequence was faster ;-) > (it was rather: load-by-bytes, form whole value via shifts, increment, > store-by-bytes) > Insane codegen for insane micro-architectures! *groan* > > I know C has recently (2011) grown this _Atomic thing, but that has > > other problems. > > Yeah. > > So, hmm, I don't quite know what to say, you're between a rock and a hard > place, I guess. You have to use volatile for its effects but then are > unhappy about its effects :) Notably, Linux uses a *ton* of volatile and there has historically been a lot of grumbling about the GCC stance of 'stupid' codegen the moment it sees volatile. 
It really would help us (the Linux community) if GCC were to be less offended by the whole volatile thing and would try to generate better code.

Paul has been on the C/C++ committee meetings and keeps telling me them folks hate volatile with a passion up to the point of proposing to remove it from the language or somesuch. But the reality is that Linux very heavily relies on it and _Atomic simply cannot replace it.

> If you can confirm the above about validity of the optimization, then at least there'd by a point for adding a peephole in GCC for this, even if current codegen isn't a bug, but I still wouldn't hold my breath. volatile is so ... ewww, it's best left alone.

Confirmed, and please, your SMP computer only works because of volatile, it *is* important.
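Peter does not spell out the "other problems" with _Atomic here; one commonly cited obstacle, offered purely as illustration, is that C11 atomics change the declared type of the object and route every access through the atomics API, so they cannot simply be retrofitted onto existing fields such as ->nvcsw or ->on_rq that the rest of the kernel accesses as plain integers.

	#include <stdatomic.h>

	/* Hypothetical: making the counter _Atomic changes its type... */
	_Atomic unsigned long nvcsw_atomic;

	void voluntary_switch(void)
	{
		/* ...and every reader and writer must then use the atomics API. */
		atomic_fetch_add_explicit(&nvcsw_atomic, 1, memory_order_relaxed);
	}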
Hey, On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > equivalent to that, then it can't be used in this situation. If you > > _have_ to use a RmW for other reasons like interrupt safety, then a > > volatile variable is not the way to force this, as C simply doesn't have > > that concept and hence can't talk about it. (Of course it can't, as not > > all architectures could implement such, if it were required). > > Yeah, RISC archs typically lack the RmW ops. I can understand C not > mandating their use. However, on architectures that do have them, using > them makes a ton of sense. > > For us living in the real world, this C abstract machine is mostly a > pain in the arse :-) Believe me, without it you would live in a world where only languages like ECMA script or Rust existed, without any reliable spec at all ("it's obvious, the language should behave like this-and-that compiler from 2010 implements it! Or was it 2012?"). Even if it sometimes would make life easier without (also for compilers!), at least you _have_ an arse to feel pain in! :-) Ahem. > > So, hmm, I don't quite know what to say, you're between a rock and a hard > > place, I guess. You have to use volatile for its effects but then are > > unhappy about its effects :) > > Notably, Linux uses a *ton* of volatile and there has historically been > a lot of grumbling about the GCC stance of 'stupid' codegen the moment > it sees volatile. > > It really would help us (the Linux community) if GCC were to be less > offended by the whole volatile thing and would try to generate better > code. > > Paul has been on the C/C++ committee meetings and keeps telling me them > folks hate volatile with a passion up to the point of proposing to > remove it from the language or somesuch. But the reality is that Linux > very heavily relies on it and _Atomic simply cannot replace it. Oh yeah, I agree. People trying to get rid of volatile are misguided. Of course one can try to capture all the individual aspects of it, and make individual language constructs for them (_Atomic is one). It makes arguing about and precisely specifying the aspects much easier. But it also makes the feature-interoperability matrix explode and the language more complicated in areas that were already arcane to start with. But it's precisely _because_ of the large-scale feature set of volatile and the compilers wish to cater for the real world, that it's mostly left alone, as is, as written by the author. Sure, one can wish for better codegen related to volatile. But it's a slippery slope: "here I have volatile, because I don't want optimizations to happen." - "but here I want a little optimization to happen" - "but not these" - ... It's ... not so easy :) In this specific case I think we have now sufficiently argued that "load-modify-store --> rmw" on x86 even for volatile accesses is a correct transformation (and one that has sufficiently local effects that our heads don't explode while thinking about all consequences). You'd have to do that for each and every transformation where volatile stuff is involved, just so to not throw out the baby with the water. > > If you can confirm the above about validity of the optimization, then at > > least there'd by a point for adding a peephole in GCC for this, even if > > current codegen isn't a bug, but I still wouldn't hold my breath. > > volatile is so ... ewww, it's best left alone. > > Confirmed, and please, your SMP computer only works becuase of volatile, > it *is* important. Agreed. Ciao, Michael.
On Tue, Oct 31, 2023 at 04:49:56PM +0000, Michael Matz wrote: > Hey, > > On Tue, 31 Oct 2023, Peter Zijlstra wrote: > > > > equivalent to that, then it can't be used in this situation. If you > > > _have_ to use a RmW for other reasons like interrupt safety, then a > > > volatile variable is not the way to force this, as C simply doesn't have > > > that concept and hence can't talk about it. (Of course it can't, as not > > > all architectures could implement such, if it were required). > > > > Yeah, RISC archs typically lack the RmW ops. I can understand C not > > mandating their use. However, on architectures that do have them, using > > them makes a ton of sense. > > > > For us living in the real world, this C abstract machine is mostly a > > pain in the arse :-) > > Believe me, without it you would live in a world where only languages like > ECMA script or Rust existed, without any reliable spec at all ("it's > obvious, the language should behave like this-and-that compiler from 2010 > implements it! Or was it 2012?"). Even if it sometimes would make life > easier without (also for compilers!), at least you _have_ an arse to feel > pain in! :-) Ahem. You mean like Rust volatiles considering conflicting accesses to be data races? That certainly leads me to wonder how a Rust-language device driver is supposed to interoperate with Rust-language device firmware. They currently propose atomics and things like the barrier() asm to make that work, and their definition of atomic might just allow it. > > > So, hmm, I don't quite know what to say, you're between a rock and a hard > > > place, I guess. You have to use volatile for its effects but then are > > > unhappy about its effects :) > > > > Notably, Linux uses a *ton* of volatile and there has historically been > > a lot of grumbling about the GCC stance of 'stupid' codegen the moment > > it sees volatile. > > > > It really would help us (the Linux community) if GCC were to be less > > offended by the whole volatile thing and would try to generate better > > code. > > > > Paul has been on the C/C++ committee meetings and keeps telling me them > > folks hate volatile with a passion up to the point of proposing to > > remove it from the language or somesuch. But the reality is that Linux > > very heavily relies on it and _Atomic simply cannot replace it. > > Oh yeah, I agree. People trying to get rid of volatile are misguided. > Of course one can try to capture all the individual aspects of it, and > make individual language constructs for them (_Atomic is one). It makes > arguing about and precisely specifying the aspects much easier. But it > also makes the feature-interoperability matrix explode and the language > more complicated in areas that were already arcane to start with. Agreed, and I have personally witnessed some primal-scream therapy undertaken in response to attempts to better define volatile. > But it's precisely _because_ of the large-scale feature set of volatile > and the compilers wish to cater for the real world, that it's mostly left > alone, as is, as written by the author. Sure, one can wish for better > codegen related to volatile. But it's a slippery slope: "here I have > volatile, because I don't want optimizations to happen." - "but here I > want a little optimization to happen" - "but not these" - ... It's ... not > so easy :) And to your point, there really have been optimization bugs that have broken volatile. So I do very much appreciate your careful attention to this matter. 
> In this specific case I think we have now sufficiently argued that > "load-modify-store --> rmw" on x86 even for volatile accesses is a correct > transformation (and one that has sufficiently local effects that our heads > don't explode while thinking about all consequences). You'd have to do > that for each and every transformation where volatile stuff is involved, > just so to not throw out the baby with the water. Understood! > > > If you can confirm the above about validity of the optimization, then at > > > least there'd by a point for adding a peephole in GCC for this, even if > > > current codegen isn't a bug, but I still wouldn't hold my breath. > > > volatile is so ... ewww, it's best left alone. > > > > Confirmed, and please, your SMP computer only works becuase of volatile, > > it *is* important. > > Agreed. Good to hear!!! Thanx, Paul
On Tue, Oct 31, 2023 at 04:20:33PM +0100, Peter Zijlstra wrote:
> On Tue, Oct 31, 2023 at 07:24:13AM -0700, Paul E. McKenney wrote:
> > So, at least until GCC catches up to clang's code generation, I take it that you don't want WRITE_ONCE() for that ->nvcsw increment. Thoughts on ->on_rq?
>
> I've not done the patch yet, but I suspect those would be fine, those are straight up stores, hard to get wrong (famous last words).

Assuming that the reads are already either marked with READ_ONCE() or are under appropriate locks, my immediate thought would be something like the all-too-lightly tested patch shown below. The ASSERT_EXCLUSIVE_WRITER() causes KCSAN to complain if there is a concurrent store of any kind to the location.

Of course, please feel free to ignore. Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81885748871d..aeace19ad7f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2124,12 +2124,14 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_task(rq, p, flags);
 
-	p->on_rq = TASK_ON_RQ_QUEUED;
+	WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
+	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
 }
 
 void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
+	WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
+	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
 
 	dequeue_task(rq, p, flags);
 }
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index bf5f178fe723..a604f59aee0b 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -895,10 +895,36 @@ static void rcu_tasks_pregp_step(struct list_head *hop)
 	synchronize_rcu();
 }
 
+/* Check for quiescent states since the pregp's synchronize_rcu() */
+static bool rcu_tasks_is_holdout(struct task_struct *t)
+{
+	int cpu;
+
+	/* Has the task been seen voluntarily sleeping? */
+	if (!READ_ONCE(t->on_rq))
+		return false;
+
+	/*
+	 * Idle tasks (or idle injection) within the idle loop are RCU-tasks
+	 * quiescent states. But CPU boot code performed by the idle task
+	 * isn't a quiescent state.
+	 */
+	if (is_idle_task(t))
+		return false;
+
+	cpu = task_cpu(t);
+
+	/* Idle tasks on offline CPUs are RCU-tasks quiescent states. */
+	if (t == idle_task(cpu) && !rcu_cpu_online(cpu))
+		return false;
+
+	return true;
+}
+
 /* Per-task initial processing. */
 static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
 {
-	if (t != current && READ_ONCE(t->on_rq) && !is_idle_task(t)) {
+	if (t != current && rcu_tasks_is_holdout(t)) {
 		get_task_struct(t);
 		t->rcu_tasks_nvcsw = READ_ONCE(t->nvcsw);
 		WRITE_ONCE(t->rcu_tasks_holdout, true);
@@ -947,7 +973,7 @@ static void check_holdout_task(struct task_struct *t,
 
 	if (!READ_ONCE(t->rcu_tasks_holdout) ||
 	    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw) ||
-	    !READ_ONCE(t->on_rq) ||
+	    !rcu_tasks_is_holdout(t) ||
 	    (IS_ENABLED(CONFIG_NO_HZ_FULL) && !is_idle_task(t) &&
 	     READ_ONCE(t->rcu_tasks_idle_cpu) >= 0)) {
 		WRITE_ONCE(t->rcu_tasks_holdout, false);