Message ID | 20231218231916.11719-3-frederic@kernel.org |
---|---|
State | New |
Headers |
From: Frederic Weisbecker <frederic@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, Boqun Feng <boqun.feng@gmail.com>, Joel Fernandes <joel@joelfernandes.org>, Neeraj Upadhyay <neeraj.upadhyay@amd.com>, Uladzislau Rezki <urezki@gmail.com>, Zqiang <qiang.zhang1211@gmail.com>, rcu <rcu@vger.kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH 2/3] rcu: Defer RCU kthreads wakeup when CPU is dying
Date: Tue, 19 Dec 2023 00:19:15 +0100
Message-ID: <20231218231916.11719-3-frederic@kernel.org>
In-Reply-To: <20231218231916.11719-1-frederic@kernel.org>
References: <20231218231916.11719-1-frederic@kernel.org>
Series | timers & RCU: Fix TREE03 stalls |
Commit Message
Frederic Weisbecker
Dec. 18, 2023, 11:19 p.m. UTC
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wakeup can indirectly
arm the RT bandwidth timer on the local offline CPU. Since this happens
after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore, if the RCU kthreads are waiting for RT
bandwidth to become available, they may never actually be scheduled.
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wakeup to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tree.c | 34 +++++++++++++++++++++++++++++++++-
kernel/rcu/tree_exp.h | 3 +--
2 files changed, 34 insertions(+), 3 deletions(-)
Comments
On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> [commit message and stall log trimmed]
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.

Ah, ideally we'd not run into this if sched_feat(TTWU_QUEUE) was enabled, but then in any case there is also ttwu_queue_cond() shutting down the remote queueing..

> +static void swake_up_one_online_ipi(void *arg)
> +{
> +	struct swait_queue_head *wqh = arg;
> +
> +	swake_up_one(wqh);
> +}

Speaking of, the scheduler refuses to do remote-IPI-style wakeups (TTWU_QUEUE) whenever the destination CPU is in a hotplug state:

	static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
	{
		/*
		 * Do not complicate things with the async wake_list while the CPU is
		 * in hotplug state.
		 */
		if (!cpu_active(cpu))
			return false;
		...
	}

Along these lines, I wonder if it is safe to do a wakeup in this fashion (as done by this patch) if the destination CPU was also going down.

Also, the same ttwu_queue_cond() checks CPU affinities before deciding not to do the IPI-style queueing:

	/* Ensure the task will still be allowed to run on the CPU. */
	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
		return false;

Not that anyone should be changing RCU thread priorities around while the IPI is in flight, but...

I wonder if the reason TTWU is excessively paranoid is that the IPI can be delayed, for example, leading to race conditions.

Anyway, just my 2 cents. Happy holidays!

thanks,

 - Joel
On Tue, 19 Dec 2023 00:19:15 +0100, Frederic Weisbecker <frederic@kernel.org> wrote:
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +	int cpu = get_cpu();
> +
> +	/*
> +	 * If called from rcutree_report_cpu_starting(), wake up
> +	 * is dangerous that late in the CPU-down hotplug process. The
> +	 * scheduler might queue an ignored hrtimer. Defer the wake up
> +	 * to an online CPU instead.
> +	 */

But why is the scheduler having any interest in selecting a dying CPU for adding an hrtimer on in the first place?

> +	if (unlikely(cpu_is_offline(cpu))) {
> +		int target;
> +
> +		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> +					 cpu_online_mask);
> +
> +		smp_call_function_single(target, swake_up_one_online_ipi,
> +					 wqh, 0);
> +		put_cpu();
> +	} else {
> +		put_cpu();
> +		swake_up_one(wqh);
> +	}
> +}
On Mon, Dec 18, 2023 at 10:38:52PM -0500, Joel Fernandes wrote:
> Speaking of, the scheduler refuses to do remote-IPI-style wakeups
> (TTWU_QUEUE) whenever the destination CPU is in a hotplug state.

Yes, because all irrelevant tasks must be migrated out upon CPUHP_AP_SCHED_WAIT_EMPTY, thanks to balance_push_set().

(Though right now I'm missing the flush_smp_call_function_queue() call that flushes the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())

> Along these lines, I wonder if, it is safe to do a wakeup in this fashion (as
> done by this patch) if the destination CPU was also going down.
>
> I wonder if the reason TTWU is excessively paranoid is that the IPI can be
> delayed for example, leading to race conditions.

It's because nothing irrelevant must be queued after sched_cpu_wait_empty(). But note this patch does something different: it doesn't defer the runqueue enqueue the way the ttwu queue does, it defers the whole actual wakeup. This means the decision as to where to queue the task is delegated to an online CPU, so the constraints are not the same. Waking up a task _from_ a CPU that is online, whether active or not, is supposed to be fine.

Thanks.
On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> [commit message and stall log trimmed]
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.

One question below...

> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +	int cpu = get_cpu();

This works because get_cpu() is currently preempt_disable(). If there are plans to make get_cpu() be some sort of read lock, we might deadlock when synchronize_rcu() is invoked from a CPU-hotplug notifier, correct?

Might this be worth a comment somewhere at some point?

							Thanx, Paul
On Tue, Dec 19, 2023 at 07:29:23AM -0800, Paul E. McKenney wrote:
> This works because get_cpu() is currently preempt_disable(). If there are
> plans to make get_cpu() be some sort of read lock, we might deadlock when
> synchronize_rcu() is invoked from a CPU-hotplug notifier, correct?
>
> Might this be worth a comment somewhere at some point?

Sure, I can add that. Thanks.
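For what it's worth, a comment along the lines Paul suggests might read like this inside swake_up_one_online() (hypothetical wording sketched against the hunk quoted above, not an actual follow-up patch):

```c
static void swake_up_one_online(struct swait_queue_head *wqh)
{
	/*
	 * Note: this relies on get_cpu() being preempt_disable() rather
	 * than any kind of read lock.  Were get_cpu() ever to become a
	 * blocking read-side primitive, invoking synchronize_rcu() from
	 * a CPU-hotplug notifier could deadlock here.
	 */
	int cpu = get_cpu();
	...
}
```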
On Tue, Dec 19, 2023 at 12:42:47PM +0800, Hillf Danton wrote:
> But why is scheduler having any interest selecting a dying CPU for
> adding a hrtimer on at the first place?

So indeed that timer could be unpinned. But we tried that and it's not enough. If we want to make the hrtimers and nohz infrastructure aware of the fact that the current CPU is offline when it queues an hrtimer, we must face the ugliness below. And still it's hacky, because we must also find an online target whose earliest deadline is below/equal to the scheduler hrtimer we are trying to enqueue. And that requires even more ugliness that isn't handled below.

So for now I assume that queuing a timer after hrtimers_cpu_dying() is unreasonable and that RCU is the only candidate trying that. If there are more to be reported, we shall see...

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index f2044d5a652b..9eac39fad31c 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -234,6 +234,7 @@ struct hrtimer_cpu_base {
 	struct hrtimer			*next_timer;
 	ktime_t				softirq_expires_next;
 	struct hrtimer			*softirq_next_timer;
+	int				online;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
 } ____cacheline_aligned;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a708d225c28e..83c75768f290 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1088,7 +1088,7 @@ int get_nohz_timer_target(void)
 	struct sched_domain *sd;
 	const struct cpumask *hk_mask;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TIMER)) {
+	if (housekeeping_cpu(cpu, HK_TYPE_TIMER) && cpu_online(cpu)) {
 		if (!idle_cpu(cpu))
 			return cpu;
 		default_cpu = cpu;
@@ -1109,7 +1109,8 @@ int get_nohz_timer_target(void)
 	}
 
 	if (default_cpu == -1)
-		default_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+		default_cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_TIMER),
+					      cpu_online_mask);
 
 	return default_cpu;
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6aaf0a3d6081..26cb9455272a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -118,7 +118,7 @@ static inline void do_start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		 */
 		hrtimer_forward_now(&rt_b->rt_period_timer, ns_to_ktime(0));
 		hrtimer_start_expires(&rt_b->rt_period_timer,
-				      HRTIMER_MODE_ABS_PINNED_HARD);
+				      HRTIMER_MODE_ABS_HARD);
 	}
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 760793998cdd..82f9ace2e4fd 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -36,6 +36,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
 #include <linux/sched/deadline.h>
+#include <linux/sched/isolation.h>
 #include <linux/sched/nohz.h>
 #include <linux/sched/debug.h>
 #include <linux/timer.h>
@@ -206,6 +207,12 @@ struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 	if (static_branch_likely(&timers_migration_enabled) && !pinned)
 		return &per_cpu(hrtimer_bases, get_nohz_timer_target());
+#else
+	if (!base->online) {
+		int cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_TIMER),
+					  cpu_online_mask);
+		base = &per_cpu(hrtimer_bases, cpu);
+	}
 #endif
 	return base;
 }
@@ -254,7 +261,13 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 		raw_spin_lock(&new_base->cpu_base->lock);
 
 		if (new_cpu_base != this_cpu_base &&
-		    hrtimer_check_target(timer, new_base)) {
+		    hrtimer_check_target(timer, new_base) &&
+		    /*
+		     * Crude hack and buggy: if this CPU is offline and
+		     * the timer is the earliest on the remote target, the timer
+		     * will expire late...
+		     */
+		    this_cpu_base->online) {
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
 			new_cpu_base = this_cpu_base;
@@ -2183,6 +2196,7 @@ int hrtimers_prepare_cpu(unsigned int cpu)
 	cpu_base->softirq_next_timer = NULL;
 	cpu_base->expires_next = KTIME_MAX;
 	cpu_base->softirq_expires_next = KTIME_MAX;
+	cpu_base->online = 1;
 	hrtimer_cpu_base_init_expiry_lock(cpu_base);
 	return 0;
 }
@@ -2248,7 +2262,7 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
 
 	/* Tell the other CPU to retrigger the next event */
 	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
-
+	old_base->online = 0;
 	raw_spin_unlock(&new_base->lock);
 	raw_spin_unlock(&old_base->lock);
On Tue, Dec 19, 2023 at 12:56:50PM +0100, Frederic Weisbecker wrote:
> Yes, because all irrelevant tasks must be migrated out upon
> CPUHP_AP_SCHED_WAIT_EMPTY, thanks to balance_push_set().

Ah, got it.

> (Though right now I'm missing the flush_smp_call_function_queue() call that flushes
> the ttwu queue between sched_cpu_deactivate() and sched_cpu_wait_empty())

Possible. I saw your IRC message to Peter on that as well, thanks for following up. I need to find some time to look more into that, but that does sound concerning.

> It's because nothing irrelevant must be queued after sched_cpu_wait_empty().

Makes sense.

> But note this patch does something different, it doesn't defer the runqueue
> enqueue like ttwu queue does. It defers the whole actual wakeup. This means that the
> decision as to where to queue the task is delegated to an online CPU. So it's
> not the same constraints.

Agreed, thanks for the clarifications. But along similar lines (and at the risk of oversimplifying), is it not possible to send an IPI to an online CPU to queue the hrtimer locally there if you detect that the current CPU is going down? In the other thread to Hillf, you mentioned the hrtimer infra has to have an equal or earlier deadline, but you can just queue the hrtimer from the IPI handler and that should take care of it?

Let me know if I missed something, which should make for some good holiday reading material. ;-)

thanks,

 - Joel
>
> When the CPU goes idle for the last time during the CPU down hotplug
> process, RCU reports a final quiescent state for the current CPU. If
> this quiescent state propagates up to the top, some tasks may then be
> woken up to complete the grace period: the main grace period kthread
> and/or the expedited main workqueue (or kworker).
>
> If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> arm the RT bandwidth timer to the local offline CPU. Since this happens
> after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> bandwidth to be available, they may never be actually scheduled.
>

In rcutree_report_cpu_dead(), the rcuog kthreads may also be woken up in
do_nocb_deferred_wakeup(). If an rcuog kthread is SCHED_FIFO and such a
wakeup happens, rt_period_active is set to 1 and the hrtimer is enqueued
to the offline CPU in do_start_rt_bandwidth(). After that, we invoke
swake_up_one_online() to send an IPI to an online CPU, but because
rt_period_active is already 1, the RT bandwidth hrtimer will not be
enqueued to the online CPU. Any thoughts?

Thanks
Zqiang

> This triggers TREE03 rcutorture hangs:
>
> 	rcu: INFO: rcu_preempt self-detected stall on CPU
> 	rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
> 	rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
> 	rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> 	rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> 	rcu: RCU grace-period kthread stack dump:
> 	task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
> 	Call Trace:
> 	<TASK>
> 	 __schedule+0x2eb/0xa80
> 	 schedule+0x1f/0x90
> 	 schedule_timeout+0x163/0x270
> 	 ? __pfx_process_timeout+0x10/0x10
> 	 rcu_gp_fqs_loop+0x37c/0x5b0
> 	 ? __pfx_rcu_gp_kthread+0x10/0x10
> 	 rcu_gp_kthread+0x17c/0x200
> 	 kthread+0xde/0x110
> 	 ? __pfx_kthread+0x10/0x10
> 	 ret_from_fork+0x2b/0x40
> 	 ? __pfx_kthread+0x10/0x10
> 	 ret_from_fork_asm+0x1b/0x30
> 	</TASK>
>
> The situation can't be solved with just unpinning the timer. The hrtimer
> infrastructure and the nohz heuristics involved in finding the best
> remote target for an unpinned timer would then also need to handle
> enqueues from an offline CPU in the most horrendous way.
>
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.
>
> Reported-by: Paul E. McKenney <paulmck@kernel.org>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  kernel/rcu/tree.c     | 34 +++++++++++++++++++++++++++++++++-
>  kernel/rcu/tree_exp.h |  3 +--
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3ac3c846105f..157f3ca2a9b5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
>  	return needmore;
>  }
>
> +static void swake_up_one_online_ipi(void *arg)
> +{
> +	struct swait_queue_head *wqh = arg;
> +
> +	swake_up_one(wqh);
> +}
> +
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +	int cpu = get_cpu();
> +
> +	/*
> +	 * If called from rcutree_report_cpu_dead(), wake up
> +	 * is dangerous that late in the CPU-down hotplug process. The
> +	 * scheduler might queue an ignored hrtimer. Defer the wake up
> +	 * to an online CPU instead.
> +	 */
> +	if (unlikely(cpu_is_offline(cpu))) {
> +		int target;
> +
> +		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> +					 cpu_online_mask);
> +
> +		smp_call_function_single(target, swake_up_one_online_ipi,
> +					 wqh, 0);
> +		put_cpu();
> +	} else {
> +		put_cpu();
> +		swake_up_one(wqh);
> +	}
> +}
> +
>  /*
>   * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
>   * interrupt or softirq handler, in which case we just might immediately
> @@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
>  		return;
>  	WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
>  	WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
> -	swake_up_one(&rcu_state.gp_wq);
> +	swake_up_one_online(&rcu_state.gp_wq);
>  }
>
>  /*
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..2ac440bc7e10 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
>  	return ret;
>  }
>
> -
>  /*
>   * Report the exit from RCU read-side critical section for the last task
>   * that queued itself during or before the current expedited preemptible-RCU
> @@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  	if (wake) {
>  		smp_mb(); /* EGP done before wake_up(). */
> -		swake_up_one(&rcu_state.expedited_wq);
> +		swake_up_one_online(&rcu_state.expedited_wq);
>  	}
>  	break;
>  }
> --
> 2.42.1
>
Le Wed, Dec 20, 2023 at 04:24:35PM +0800, Z qiang a écrit :
> >
> > When the CPU goes idle for the last time during the CPU down hotplug
> > process, RCU reports a final quiescent state for the current CPU. If
> > this quiescent state propagates up to the top, some tasks may then be
> > woken up to complete the grace period: the main grace period kthread
> > and/or the expedited main workqueue (or kworker).
> >
> > If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> > arm the RT bandwidth timer to the local offline CPU. Since this happens
> > after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> > timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> > bandwidth to be available, they may never be actually scheduled.
> >
>
> In rcutree_report_cpu_dead(), the rcuog kthreads may also be woken up in
> do_nocb_deferred_wakeup(). If an rcuog kthread is SCHED_FIFO and such a
> wakeup happens, rt_period_active is set to 1 and the hrtimer is enqueued
> to the offline CPU in do_start_rt_bandwidth(). After that, we invoke
> swake_up_one_online() to send an IPI to an online CPU, but because
> rt_period_active is already 1, the RT bandwidth hrtimer will not be
> enqueued to the online CPU. Any thoughts?

Duh, you're right, that one too. How many more? This hrtimer situation is
scary...

Thanks.
Le Tue, Dec 19, 2023 at 10:01:55PM -0500, Joel Fernandes a écrit :
> > (Though right now I'm missing the flush_smp_call_function_queue() call
> > that flushes the ttwu queue between sched_cpu_deactivate() and
> > sched_cpu_wait_empty())
>
> Possible. I saw your IRC message to Peter on that as well, thanks for
> following up. I need to find some time to look more into that, but that
> does sound concerning.

Found it! It's smpcfd_dying_cpu().

> > But note this patch does something different, it doesn't defer the
> > runqueue enqueue like ttwu queue does. It defers the whole actual
> > wakeup. This means that the decision as to where to queue the task is
> > delegated to an online CPU. So it's not the same constraints. Waking up
> > a task _from_ a CPU that is active or not but at least online is
> > supposed to be fine.
>
> Agreed, thanks for the clarifications. But along similar lines (and at
> the risk of oversimplifying), is it not possible to send an IPI to an
> online CPU to queue the hrtimer locally there if you detect that the
> current CPU is going down? In the other thread to Hillf, you mentioned
> the hrtimer infra has to have an equal or earlier deadline, but you can
> just queue the hrtimer from the IPI handler and that should take care
> of it?

This is something that Thomas wanted to avoid IIRC, because the IPI can make
it miss the deadline. But I guess in the case of an offline CPU, it can be a
last resort.

> Let me know if I missed something, which should make for some good
> holiday reading material. ;-)

Let me summarize the possible fixes we can have:

1) It's RCU's fault! We must check and fix all the wake ups performed by RCU
   from rcutree_report_cpu_dead(). But beware other possible wake-ups/timer
   enqueues from the outgoing CPU after hrtimers are migrated.

2) It's the scheduler's fault! do_start_rt_bandwidth() should check if the
   current CPU is offline and manually place the timer on an online CPU
   (through an IPI? yuck)

3) It's hrtimer's fault! If the current CPU is offline, it must arrange for
   queueing to an online CPU. Not easy to do as we must find one whose next
   expiry is below/equal the scheduler timer. As a last resort, this could
   be force queued to any and then signalled through an IPI, even though
   it's something we've tried to avoid until now.

Also it's hard for me to think of another way to fix the deadlock fixed by
5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94. Hrtimers migration can't happen
after rcutree_report_cpu_dead(), because it may use RCU...

None of the above looks pretty anyway. Thoughts?
On Wed, Dec 20, 2023 at 04:50:41PM +0100, Frederic Weisbecker wrote:
> Le Tue, Dec 19, 2023 at 10:01:55PM -0500, Joel Fernandes a écrit :
> > > (Though right now I'm missing the flush_smp_call_function_queue() call
> > > that flushes the ttwu queue between sched_cpu_deactivate() and
> > > sched_cpu_wait_empty())
> >
> > Possible. I saw your IRC message to Peter on that as well, thanks for
> > following up. I need to find some time to look more into that, but that
> > does sound concerning.
>
> Found it! It's smpcfd_dying_cpu().
>
> > > But note this patch does something different, it doesn't defer the
> > > runqueue enqueue like ttwu queue does. It defers the whole actual
> > > wakeup. This means that the decision as to where to queue the task is
> > > delegated to an online CPU. So it's not the same constraints. Waking
> > > up a task _from_ a CPU that is active or not but at least online is
> > > supposed to be fine.
> >
> > Agreed, thanks for the clarifications. But along similar lines (and at
> > the risk of oversimplifying), is it not possible to send an IPI to an
> > online CPU to queue the hrtimer locally there if you detect that the
> > current CPU is going down? In the other thread to Hillf, you mentioned
> > the hrtimer infra has to have an equal or earlier deadline, but you can
> > just queue the hrtimer from the IPI handler and that should take care
> > of it?
>
> This is something that Thomas wanted to avoid IIRC, because the IPI can
> make it miss the deadline. But I guess in the case of an offline CPU, it
> can be a last resort.
>
> > Let me know if I missed something, which should make for some good
> > holiday reading material. ;-)
>
> Let me summarize the possible fixes we can have:
>
> 1) It's RCU's fault! We must check and fix all the wake ups performed by
>    RCU from rcutree_report_cpu_dead(). But beware other possible
>    wake-ups/timer enqueues from the outgoing CPU after hrtimers are
>    migrated.
>
> 2) It's the scheduler's fault! do_start_rt_bandwidth() should check if
>    the current CPU is offline and manually place the timer on an online
>    CPU (through an IPI? yuck)
>
> 3) It's hrtimer's fault! If the current CPU is offline, it must arrange
>    for queueing to an online CPU. Not easy to do as we must find one
>    whose next expiry is below/equal the scheduler timer. As a last
>    resort, this could be force queued to any and then signalled through
>    an IPI, even though it's something we've tried to avoid until now.
>
> Also it's hard for me to think of another way to fix the deadlock fixed
> by 5c0930ccaad5a74d74e8b18b648c5eb21ed2fe94. Hrtimers migration can't
> happen after rcutree_report_cpu_dead(), because it may use RCU...
>
> None of the above looks pretty anyway. Thoughts?

Make one of the surviving CPUs grab any leftover timers from the outgoing
CPU, possibly checking periodically. Not pretty either, but three ugly
options deserve a fourth one!

							Thanx, Paul
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3ac3c846105f..157f3ca2a9b5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
 	return needmore;
 }
 
+static void swake_up_one_online_ipi(void *arg)
+{
+	struct swait_queue_head *wqh = arg;
+
+	swake_up_one(wqh);
+}
+
+static void swake_up_one_online(struct swait_queue_head *wqh)
+{
+	int cpu = get_cpu();
+
+	/*
+	 * If called from rcutree_report_cpu_dead(), wake up
+	 * is dangerous that late in the CPU-down hotplug process. The
+	 * scheduler might queue an ignored hrtimer. Defer the wake up
+	 * to an online CPU instead.
+	 */
+	if (unlikely(cpu_is_offline(cpu))) {
+		int target;
+
+		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
+					 cpu_online_mask);
+
+		smp_call_function_single(target, swake_up_one_online_ipi,
+					 wqh, 0);
+		put_cpu();
+	} else {
+		put_cpu();
+		swake_up_one(wqh);
+	}
+}
+
 /*
  * Awaken the grace-period kthread. Don't do a self-awaken (unless in an
  * interrupt or softirq handler, in which case we just might immediately
@@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
 		return;
 	WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
 	WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
-	swake_up_one(&rcu_state.gp_wq);
+	swake_up_one_online(&rcu_state.gp_wq);
 }
 
 /*
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6d7cea5d591f..2ac440bc7e10 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
 	return ret;
 }
 
-
 /*
  * Report the exit from RCU read-side critical section for the last task
  * that queued itself during or before the current expedited preemptible-RCU
@@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	if (wake) {
 		smp_mb(); /* EGP done before wake_up(). */
-		swake_up_one(&rcu_state.expedited_wq);
+		swake_up_one_online(&rcu_state.expedited_wq);
 	}
 	break;
 }