Message ID | 20231218231916.11719-3-frederic@kernel.org |
---|---|
State | New |
Headers |
From: Frederic Weisbecker <frederic@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, Boqun Feng <boqun.feng@gmail.com>, Joel Fernandes <joel@joelfernandes.org>, Neeraj Upadhyay <neeraj.upadhyay@amd.com>, Uladzislau Rezki <urezki@gmail.com>, Zqiang <qiang.zhang1211@gmail.com>, rcu <rcu@vger.kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH 2/3] rcu: Defer RCU kthreads wakeup when CPU is dying
Date: Tue, 19 Dec 2023 00:19:15 +0100
Message-ID: <20231218231916.11719-3-frederic@kernel.org>
In-Reply-To: <20231218231916.11719-1-frederic@kernel.org>
References: <20231218231916.11719-1-frederic@kernel.org>
Series | timers & RCU: Fix TREE03 stalls |
Commit Message
Frederic Weisbecker
Dec. 18, 2023, 11:19 p.m. UTC
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wakeup can indirectly
arm the RT bandwidth timer on the local offline CPU. Since this happens
after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore, if the RCU kthreads are waiting for RT
bandwidth to become available, they may never actually be scheduled.
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wakeup to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
kernel/rcu/tree_exp.h | 3 +--
2 files changed, 34 insertions(+), 3 deletions(-)
Comments
On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> [commit message and stall log trimmed]
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.

Ah, ideally we'd not run into this if sched_feat(TTWU_QUEUE) was enabled, but then in any case there is also ttwu_queue_cond() shutting down the remote queueing..

> +static void swake_up_one_online_ipi(void *arg)
> +{
> +	struct swait_queue_head *wqh = arg;
> +
> +	swake_up_one(wqh);
> +}

Speaking of, the scheduler refuses to do remote-IPI-style wakeups (TTWU_QUEUE) whenever the destination CPU is in a hotplug state:

	static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
	{
		/*
		 * Do not complicate things with the async wake_list while the CPU is
		 * in hotplug state.
		 */
		if (!cpu_active(cpu))
			return false;
		...
	}

Along these lines, I wonder if it is safe to do a wakeup in this fashion (as done by this patch) if the destination CPU was also going down.

Also, the same ttwu_queue_cond() checks CPU affinities before deciding not to do the IPI-style queueing:

	/* Ensure the task will still be allowed to run on the CPU. */
	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
		return false;

Not that anyone should be changing RCU thread priorities around while the IPI is in flight, but...

I wonder if the reason TTWU is excessively paranoid is that the IPI can be delayed, for example, leading to race conditions.

Anyway, just my 2 cents. Happy holidays!

thanks,

 - Joel
On Tue, 19 Dec 2023 00:19:15 +0100, Frederic Weisbecker <frederic@kernel.org> wrote:
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +	int cpu = get_cpu();
> +
> +	/*
> +	 * If called from rcutree_report_cpu_starting(), wake up
> +	 * is dangerous that late in the CPU-down hotplug process. The
> +	 * scheduler might queue an ignored hrtimer. Defer the wake up
> +	 * to an online CPU instead.
> +	 */

But why is the scheduler having any interest in selecting a dying CPU for adding an hrtimer on in the first place?

> +	if (unlikely(cpu_is_offline(cpu))) {
> +		int target;
> +
> +		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> +					 cpu_online_mask);
> +
> +		smp_call_function_single(target, swake_up_one_online_ipi,
> +					 wqh, 0);
> +		put_cpu();
> +	} else {
> +		put_cpu();
> +		swake_up_one(wqh);
> +	}
> +}
On Mon, Dec 18, 2023 at 10:38:52PM -0500, Joel Fernandes wrote:
> Speaking of, the scheduler refuses to do remote-IPI-style wakeups
> (TTWU_QUEUE) whenever the destination CPU is in a hotplug state.

Yes, because all irrelevant tasks must be migrated out upon CPUHP_AP_SCHED_WAIT_EMPTY, thanks to balance_push_set().

(Though right now I'm missing the flush_smp_call_function_queue() call that flushes the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())

> Along these lines, I wonder if, it is safe to do a wakeup in this fashion (as
> done by this patch) if the destination CPU was also going down.
>
> I wonder if the reason TTWU is excessively paranoid is that the IPI can be
> delayed for example, leading to race conditions.

It's because nothing irrelevant must be queued after sched_cpu_wait_empty(). But note this patch does something different: it doesn't defer the runqueue enqueue the way the ttwu queue does, it defers the whole actual wakeup. This means the decision as to where to queue the task is delegated to an online CPU, so the constraints are not the same. Waking up a task _from_ a CPU that is online, whether active or not, is supposed to be fine.

Thanks.
On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> [commit message and stall log trimmed]
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.

One question below...

> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +	int cpu = get_cpu();

This works because get_cpu() is currently preempt_disable(). If there are plans to make get_cpu() be some sort of read lock, we might deadlock when synchronize_rcu() is invoked from a CPU-hotplug notifier, correct?

Might this be worth a comment somewhere at some point?

							Thanx, Paul
On Tue, Dec 19, 2023 at 07:29:23AM -0800, Paul E. McKenney wrote:
> This works because get_cpu() is currently preempt_disable(). If there are
> plans to make get_cpu() be some sort of read lock, we might deadlock when
> synchronize_rcu() is invoked from a CPU-hotplug notifier, correct?
>
> Might this be worth a comment somewhere at some point?

Sure, I can add that. Thanks.
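For what it's worth, a comment along the lines Paul suggests might read like this inside swake_up_one_online() (hypothetical wording sketched against the hunk quoted above, not an actual follow-up patch):

```c
static void swake_up_one_online(struct swait_queue_head *wqh)
{
	/*
	 * Note: this relies on get_cpu() being preempt_disable() rather
	 * than any kind of read lock.  Were get_cpu() ever to become a
	 * blocking read-side primitive, invoking synchronize_rcu() from
	 * a CPU-hotplug notifier could deadlock here.
	 */
	int cpu = get_cpu();
	...
}
```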
On Tue, Dec 19, 2023 at 12:42:47PM +0800, Hillf Danton wrote:
> But why is scheduler having any interest selecting a dying CPU for
> adding a hrtimer on at the first place?

So indeed that timer could be unpinned. But we tried that and it's not enough. If we want to make the hrtimers and nohz infrastructure aware of the fact that the current CPU is offline when it queues an hrtimer, we must face the ugliness below. And still it's hacky, because we must also find an online target whose earliest deadline is below/equal to the scheduler hrtimer we are trying to enqueue. And that requires even more ugliness that isn't handled below.

So for now I assume that queuing a timer after hrtimers_cpu_dying() is unreasonable and that RCU is the only candidate trying that. If there are more to be reported, we shall see...

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index f2044d5a652b..9eac39fad31c 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -234,6 +234,7 @@ struct hrtimer_cpu_base {
 	struct hrtimer			*next_timer;
 	ktime_t				softirq_expires_next;
 	struct hrtimer			*softirq_next_timer;
+	int				online;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
 } ____cacheline_aligned;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a708d225c28e..83c75768f290 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1088,7 +1088,7 @@ int get_nohz_timer_target(void)
 	struct sched_domain *sd;
 	const struct cpumask *hk_mask;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TIMER)) {
+	if (housekeeping_cpu(cpu, HK_TYPE_TIMER) && cpu_online(cpu)) {
 		if (!idle_cpu(cpu))
 			return cpu;
 		default_cpu = cpu;
@@ -1109,7 +1109,8 @@ int get_nohz_timer_target(void)
 	}
 
 	if (default_cpu == -1)
-		default_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+		default_cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_TIMER),
+					      cpu_online_mask);
 
 	return default_cpu;
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6aaf0a3d6081..26cb9455272a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -118,7 +118,7 @@ static inline void do_start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		 */
 		hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
 		hrtimer_start_expires(&rt_b->rt_period_timer,
-				      HRTIMER_MODE_ABS_PINNED_HARD);
+				      HRTIMER_MODE_ABS_HARD);
 	}
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 760793998cdd..82f9ace2e4fd 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -36,6 +36,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
 #include <linux/sched/deadline.h>
+#include <linux/sched/isolation.h>
 #include <linux/sched/nohz.h>
 #include <linux/sched/debug.h>
 #include <linux/timer.h>
@@ -206,6 +207,12 @@ struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 	if (static_branch_likely(&timers_migration_enabled) && !pinned)
 		return &per_cpu(hrtimer_bases, get_nohz_timer_target());
+#else
+	if (!base->online) {
+		int cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_TIMER),
+					  cpu_online_mask);
+		base = &per_cpu(hrtimer_bases, cpu);
+	}
 #endif
 	return base;
 }
@@ -254,7 +261,13 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 		raw_spin_lock(&new_base->cpu_base->lock);
 
 		if (new_cpu_base != this_cpu_base &&
-		    hrtimer_check_target(timer, new_base)) {
+		    hrtimer_check_target(timer, new_base) &&
+		    /*
+		     * Crude hack and buggy: if this CPU is offline and
+		     * the timer is the earliest on the remote target, the timer
+		     * will expire late...
+		     */
+		    this_cpu_base->online) {
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
 			new_cpu_base = this_cpu_base;
@@ -2183,6 +2196,7 @@ int hrtimers_prepare_cpu(unsigned int cpu)
 	cpu_base->softirq_next_timer = NULL;
 	cpu_base->expires_next = KTIME_MAX;
 	cpu_base->softirq_expires_next = KTIME_MAX;
+	cpu_base->online = 1;
 	hrtimer_cpu_base_init_expiry_lock(cpu_base);
 	return 0;
 }
@@ -2248,7 +2262,7 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
 
 	/* Tell the other CPU to retrigger the next event */
 	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
-
+	old_base->online = 0;
 	raw_spin_unlock(&new_base->lock);
 	raw_spin_unlock(&old_base->lock);
On Tue, Dec 19, 2023 at 12:56:50PM +0100, Frederic Weisbecker wrote:
> Yes, because all irrelevant tasks must be migrated out upon
> CPUHP_AP_SCHED_WAIT_EMPTY, thanks to balance_push_set().

Ah, got it.

> (Though right now I'm missing the flush_smp_call_function_queue() call that flushes
> the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())

Possible. I saw your IRC message to Peter on that as well, thanks for following up. I need to find some time to look more into that, but that does sound concerning.

> It's because nothing irrelevant must be queued after sched_cpu_wait_empty().

Makes sense.

> But note this patch does something different, it doesn't defer the runqueue
> enqueue like ttwu queue does. It defers the whole actual wakeup. This means that the
> decision as to where to queue the task is delegated to an online CPU. So it's
> not the same constraints.

Agreed, thanks for the clarifications. But along similar lines (and at the risk of oversimplifying), is it not possible to send an IPI to an online CPU to queue the hrtimer locally there if you detect that the current CPU is going down? In the other thread to Hillf, you mentioned the hrtimer infra has to have an equal or earlier deadline, but you can just queue the hrtimer from the IPI handler and that should take care of it?

Let me know if I missed something, which should make for some good holiday reading material. ;-)

thanks,

 - Joel
>
> When the CPU goes idle for the last time during the CPU down hotplug
> process, RCU reports a final quiescent state for the current CPU. If
> this quiescent state propagates up to the top, some tasks may then be
> woken up to complete the grace period: the main grace period kthread
> and/or the expedited main workqueue (or kworker).
>
> If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> arm the RT bandwidth timer to the local offline CPU. Since this happens
> after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> bandwidth to be available, they may never be actually scheduled.
>

In rcutree_report_cpu_dead(), the rcuog kthreads may also be woken up in
do_nocb_deferred_wakeup(). If an rcuog kthread is SCHED_FIFO and such a
wakeup happens, rt_period_active is set to 1 and the hrtimer is enqueued
to the offline CPU in do_start_rt_bandwidth(). After that, we invoke
swake_up_one_online() to send an IPI to an online CPU, but because
rt_period_active is already 1, the RT bandwidth hrtimer will not be
enqueued to the online CPU. Any thoughts?

Thanks
Zqiang

> This triggers TREE03 rcutorture hangs:
>
> 	rcu: INFO: rcu_preempt self-detected stall on CPU
> 	rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
> 	rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
> 	rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> 	rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> 	rcu: RCU grace-period kthread stack dump:
> 	task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
> 	Call Trace:
> 	<TASK>
> 	 __schedule+0x2eb/0xa80
> 	 schedule+0x1f/0x90
> 	 schedule_timeout+0x163/0x270
> 	 ? __pfx_process_timeout+0x10/0x10
> 	 rcu_gp_fqs_loop+0x37c/0x5b0
> 	 ? __pfx_rcu_gp_kthread+0x10/0x10
> 	 rcu_gp_kthread+0x17c/0x200
> 	 kthread+0xde/0x110
> 	 ? __pfx_kthread+0x10/0x10
> 	 ret_from_fork+0x2b/0x40
> 	 ? __pfx_kthread+0x10/0x10
> 	 ret_from_fork_asm+0x1b/0x30
> 	</TASK>
>
> The situation can't be solved with just unpinning the timer. The hrtimer
> infrastructure and the nohz heuristics involved in finding the best
> remote target for an unpinned timer would then also need to handle
> enqueues from an offline CPU in the most horrendous way.
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.
>
> Reported-by: Paul E. McKenney <paulmck@kernel.org>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  kernel/rcu/tree.c     | 34 +++++++++++++++++++++++++++++++++-
>  kernel/rcu/tree_exp.h |  3 +--
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3ac3c846105f..157f3ca2a9b5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
>  	return needmore;
>  }
>
> +static void swake_up_one_online_ipi(void *arg)
> +{
> +	struct swait_queue_head *wqh = arg;
> +
> +	swake_up_one(wqh);
> +}
> +
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +	int cpu = get_cpu();
> +
> +	/*
> +	 * If called from rcutree_report_cpu_dead(), wake up
> +	 * is dangerous that late in the CPU-down hotplug process. The
> +	 * scheduler might queue an ignored hrtimer. Defer the wake up
> +	 * to an online CPU instead.
> +	 */
> +	if (unlikely(cpu_is_offline(cpu))) {
> +		int target;
> +
> +		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> +					 cpu_online_mask);
> +
> +		smp_call_function_single(target, swake_up_one_online_ipi,
> +					 wqh, 0);
> +		put_cpu();
> +	} else {
> +		put_cpu();
> +		swake_up_one(wqh);
> +	}
> +}
> +
>  /*
>   * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
>   * interrupt or softirq handler, in which case we just might immediately
> @@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
>  		return;
>  	WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
>  	WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
> -	swake_up_one(&rcu_state.gp_wq);
> +	swake_up_one_online(&rcu_state.gp_wq);
>  }
>
>  /*
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..2ac440bc7e10 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
>  	return ret;
>  }
>
> -
>  /*
>   * Report the exit from RCU read-side critical section for the last task
>   * that queued itself during or before the current expedited preemptible-RCU
> @@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  	if (wake) {
>  		smp_mb(); /* EGP done before wake_up(). */
> -		swake_up_one(&rcu_state.expedited_wq);
> +		swake_up_one_online(&rcu_state.expedited_wq);
>  	}
>  	break;
>  }
> --
> 2.42.1
>
Le Wed, Dec 20, 2023 at 04:24:35PM +0800, Z qiang a écrit :
> >
> > When the CPU goes idle for the last time during the CPU down hotplug
> > process, RCU reports a final quiescent state for the current CPU. If
> > this quiescent state propagates up to the top, some tasks may then be
> > woken up to complete the grace period: the main grace period kthread
> > and/or the expedited main workqueue (or kworker).
> >
> > If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> > arm the RT bandwidth timer to the local offline CPU. Since this happens
> > after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> > timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> > bandwidth to be available, they may never be actually scheduled.
> >
>
> In rcutree_report_cpu_dead(), the rcuog kthreads may also be woken up in
> do_nocb_deferred_wakeup(). If an rcuog kthread is SCHED_FIFO and such a
> wakeup happens, rt_period_active is set to 1 and the hrtimer is enqueued
> to the offline CPU in do_start_rt_bandwidth(). After that, we invoke
> swake_up_one_online() to send an IPI to an online CPU, but because
> rt_period_active is already 1, the RT bandwidth hrtimer will not be
> enqueued to the online CPU. Any thoughts?

Duh, you're right, that one too. How many more? This hrtimer situation is
scary...

Thanks.
Le Tue, Dec 19, 2023 at 10:01:55PM -0500, Joel Fernandes a écrit :
> > (Though right now I'm missing the flush_smp_call_function_queue() call
> > that flushes the ttwu queue between sched_cpu_deactivate() and
> > sched_cpu_wait_empty())
>
> Possible. I saw your IRC message to Peter on that as well, thanks for
> following up. I need to find some time to look more into that, but that
> does sound concerning.

Found it! It's smpcfd_dying_cpu().

> > But note this patch does something different, it doesn't defer the
> > runqueue enqueue like ttwu queue does. It defers the whole actual
> > wakeup. This means that the decision as to where to queue the task is
> > delegated to an online CPU. So it's not the same constraints. Waking up
> > a task _from_ a CPU that is active or not but at least online is
> > supposed to be fine.
>
> Agreed, thanks for the clarifications. But along similar lines (and at
> the risk of oversimplifying), is it not possible to send an IPI to an
> online CPU to queue the hrtimer locally there if you detect that the
> current CPU is going down? In the other thread to Hillf, you mentioned
> the hrtimer infra has to have an equal or earlier deadline, but you can
> just queue the hrtimer from the IPI handler and that should take care
> of it?

This is something that Thomas wanted to avoid IIRC, because the IPI can make
it miss the deadline. But I guess in the case of an offline CPU, it can be a
last resort.

> Let me know if I missed something, which should make for some good
> holiday reading material. ;-)

Let me summarize the possible fixes we can have:

1) It's RCU's fault! We must check and fix all the wake ups performed by RCU
   from rcutree_report_cpu_dead(). But beware other possible wake-ups/timer
   enqueues from the outgoing CPU after hrtimers are migrated.

2) It's the scheduler's fault! do_start_rt_bandwidth() should check if the
   current CPU is offline and manually place the timer on an online CPU
   (through an IPI? yuck)

3) It's hrtimer's fault! If the current CPU is offline, it must arrange for
   queueing to an online CPU. Not easy to do as we must find one whose next
   expiry is below/equal the scheduler timer. As a last resort, this could
   be force queued to any and then signalled through an IPI, even though
   it's something we've tried to avoid until now.

Also it's hard for me to think of another way to fix the deadlock fixed by
5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94. Hrtimers migration can't happen
after rcutree_report_cpu_dead(), because it may use RCU...

None of the above looks pretty anyway. Thoughts?
On Wed, Dec 20, 2023 at 04:50:41PM +0100, Frederic Weisbecker wrote:
> Le Tue, Dec 19, 2023 at 10:01:55PM -0500, Joel Fernandes a écrit :
> > > (Though right now I'm missing the flush_smp_call_function_queue() call
> > > that flushes the ttwu queue between sched_cpu_deactivate() and
> > > sched_cpu_wait_empty())
> >
> > Possible. I saw your IRC message to Peter on that as well, thanks for
> > following up. I need to find some time to look more into that, but that
> > does sound concerning.
>
> Found it! It's smpcfd_dying_cpu().
>
> > > But note this patch does something different, it doesn't defer the
> > > runqueue enqueue like ttwu queue does. It defers the whole actual
> > > wakeup. This means that the decision as to where to queue the task is
> > > delegated to an online CPU. So it's not the same constraints. Waking
> > > up a task _from_ a CPU that is active or not but at least online is
> > > supposed to be fine.
> >
> > Agreed, thanks for the clarifications. But along similar lines (and at
> > the risk of oversimplifying), is it not possible to send an IPI to an
> > online CPU to queue the hrtimer locally there if you detect that the
> > current CPU is going down? In the other thread to Hillf, you mentioned
> > the hrtimer infra has to have an equal or earlier deadline, but you can
> > just queue the hrtimer from the IPI handler and that should take care
> > of it?
>
> This is something that Thomas wanted to avoid IIRC, because the IPI can
> make it miss the deadline. But I guess in the case of an offline CPU, it
> can be a last resort.
>
> > Let me know if I missed something, which should make for some good
> > holiday reading material. ;-)
>
> Let me summarize the possible fixes we can have:
>
> 1) It's RCU's fault! We must check and fix all the wake ups performed by
>    RCU from rcutree_report_cpu_dead(). But beware other possible
>    wake-ups/timer enqueues from the outgoing CPU after hrtimers are
>    migrated.
>
> 2) It's the scheduler's fault! do_start_rt_bandwidth() should check if
>    the current CPU is offline and manually place the timer on an online
>    CPU (through an IPI? yuck)
>
> 3) It's hrtimer's fault! If the current CPU is offline, it must arrange
>    for queueing to an online CPU. Not easy to do as we must find one
>    whose next expiry is below/equal the scheduler timer. As a last
>    resort, this could be force queued to any and then signalled through
>    an IPI, even though it's something we've tried to avoid until now.
>
> Also it's hard for me to think of another way to fix the deadlock fixed
> by 5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94. Hrtimers migration can't
> happen after rcutree_report_cpu_dead(), because it may use RCU...
>
> None of the above looks pretty anyway. Thoughts?

Make one of the surviving CPUs grab any leftover timers from the outgoing
CPU, possibly checking periodically. Not pretty either, but three ugly
options deserve a fourth one!

							Thanx, Paul
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3ac3c846105f..157f3ca2a9b5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
 	return needmore;
 }
 
+static void swake_up_one_online_ipi(void *arg)
+{
+	struct swait_queue_head *wqh = arg;
+
+	swake_up_one(wqh);
+}
+
+static void swake_up_one_online(struct swait_queue_head *wqh)
+{
+	int cpu = get_cpu();
+
+	/*
+	 * If called from rcutree_report_cpu_dead(), wake up
+	 * is dangerous that late in the CPU-down hotplug process. The
+	 * scheduler might queue an ignored hrtimer. Defer the wake up
+	 * to an online CPU instead.
+	 */
+	if (unlikely(cpu_is_offline(cpu))) {
+		int target;
+
+		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
+					 cpu_online_mask);
+
+		smp_call_function_single(target, swake_up_one_online_ipi,
+					 wqh, 0);
+		put_cpu();
+	} else {
+		put_cpu();
+		swake_up_one(wqh);
+	}
+}
+
 /*
  * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
  * interrupt or softirq handler, in which case we just might immediately
@@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
 		return;
 	WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
 	WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
-	swake_up_one(&rcu_state.gp_wq);
+	swake_up_one_online(&rcu_state.gp_wq);
 }
 
 /*
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6d7cea5d591f..2ac440bc7e10 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
 	return ret;
 }
 
-
 /*
  * Report the exit from RCU read-side critical section for the last task
  * that queued itself during or before the current expedited preemptible-RCU
@@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	if (wake) {
 		smp_mb(); /* EGP done before wake_up(). */
-		swake_up_one(&rcu_state.expedited_wq);
+		swake_up_one_online(&rcu_state.expedited_wq);
 	}
 	break;
 }