Message ID | Y++UzubyNavLKFDP@linutronix.de |
---|---|
State | New |
Headers |
Date: Fri, 17 Feb 2023 15:53:02 +0100 From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> To: linux-kernel@vger.kernel.org Cc: Ben Segall <bsegall@google.com>, Daniel Bristot de Oliveira <bristot@redhat.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>, Peter Zijlstra <peterz@infradead.org>, Steven Rostedt <rostedt@goodmis.org>, Thomas Gleixner <tglx@linutronix.de>, Valentin Schneider <vschneid@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org> Subject: [PATCH] sched: Consider task_struct::saved_state in wait_task_inactive(). Message-ID: <Y++UzubyNavLKFDP@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
sched: Consider task_struct::saved_state in wait_task_inactive().
|
|
Commit Message
Sebastian Andrzej Siewior
Feb. 17, 2023, 2:53 p.m. UTC
wait_task_inactive() waits for a thread to unschedule in a certain task state.
On PREEMPT_RT that state may be stored in task_struct::saved_state while the
thread that is being waited for blocks on a sleeping lock and
task_struct::__state is set to TASK_RTLOCK_WAIT.
It is not possible to check only for TASK_RTLOCK_WAIT to be sure that the task
is blocked on a sleeping lock because during wake up (after the sleeping lock
has been acquired) the task state is set to TASK_RUNNING. After the task is on
CPU and has acquired the pi_lock it will reset the state accordingly, but until
then TASK_RUNNING will be observed (with the desired state saved in saved_state).

Check also for task_struct::saved_state if the desired match was not found in
task_struct::__state on PREEMPT_RT. If the state was found in saved_state, wait
until the task is idle and state is visible in task_struct::__state.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
---
Repost of https://lore.kernel.org/Yt%2FpQAFQ1xKNK0RY@linutronix.de

 kernel/sched/core.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 76 insertions(+), 5 deletions(-)
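The window described in the commit message can be illustrated with a small user-space model. This is only a sketch: `struct model_task`, the `MODEL_*` bit values and all helper names below are hypothetical stand-ins rather than kernel definitions, and no locking or memory ordering is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define MODEL_RUNNING     0x0u
#define MODEL_TRACED      0x1u
#define MODEL_RTLOCK_WAIT 0x2u

/* Hypothetical stand-in for the two state fields of task_struct. */
struct model_task {
	unsigned int state;       /* models task_struct::__state */
	unsigned int saved_state; /* models task_struct::saved_state */
};

/* Looks at the current state only. */
static bool naive_match(const struct model_task *t, unsigned int match_state)
{
	return (t->state & match_state) != 0;
}

/* Also accepts a match that is parked in saved_state. */
static bool rt_aware_match(const struct model_task *t, unsigned int match_state)
{
	return ((t->state | t->saved_state) & match_state) != 0;
}

/* Models blocking on a sleeping lock: park the state, then wait. */
static void block_on_sleeping_lock(struct model_task *t)
{
	t->saved_state = t->state;
	t->state = MODEL_RTLOCK_WAIT;
}

/* Models the wakeup after the lock was acquired: RUNNING is visible. */
static void wake_from_sleeping_lock(struct model_task *t)
{
	t->state = MODEL_RUNNING;
}

/* Models the task, back on its CPU, restoring the parked state. */
static void restore_saved_state(struct model_task *t)
{
	t->state = t->saved_state;
	t->saved_state = MODEL_RUNNING;
}
```

Stepping a task that starts in `MODEL_TRACED` through block, wake and restore shows that `naive_match()` misses the desired state in both intermediate steps, while `rt_aware_match()` finds it throughout.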
Comments
On Fri, Feb 17, 2023 at 03:53:02PM +0100, Sebastian Andrzej Siewior wrote:
> wait_task_inactive() waits for a thread to unschedule in a certain task state.
> On PREEMPT_RT that state may be stored in task_struct::saved_state while the
> thread that is being waited for blocks on a sleeping lock and
> task_struct::__state is set to TASK_RTLOCK_WAIT.
> It is not possible to check only for TASK_RTLOCK_WAIT to be sure that the task
> is blocked on a sleeping lock because during wake up (after the sleeping lock
> has been acquired) the task state is set to TASK_RUNNING. After the task is on
> CPU and has acquired the pi_lock it will reset the state accordingly, but until
> then TASK_RUNNING will be observed (with the desired state saved in saved_state).
>
> Check also for task_struct::saved_state if the desired match was not found in
> task_struct::__state on PREEMPT_RT. If the state was found in saved_state, wait
> until the task is idle and state is visible in task_struct::__state.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> ---

Which of the very few wait_task_inactive() users requires this?
On 2023-02-22 14:36:14 [+0100], Peter Zijlstra wrote:
> Which of the very few wait_task_inactive() users requires this?
ptrace is the remaining (known) one (just verified on v6.2-rt3).
ptrace_check_attach() waits for the child which blocks on tasklist_lock.
tglx argued that wait_task_inactive() should work regardless of whether the
task that is being waited for blocks on a sleeping lock.
Sebastian
On 2023-02-23 17:53:48 [+0100], To Peter Zijlstra wrote:
> On 2023-02-22 14:36:14 [+0100], Peter Zijlstra wrote:
> > Which of the very few wait_task_inactive() users requires this?
>
> ptrace is the remaining (known) one (just verified on v6.2-rt3).
> ptrace_check_attach() waits for the child which blocks on tasklist_lock.
>
> tglx argued that wait_task_inactive() should work regardless of whether the
> task that is being waited for blocks on a sleeping lock.

a polite ping.

Sebastian
On 2023-03-29 15:33:39 [+0200], To Peter Zijlstra wrote:
> On 2023-02-23 17:53:48 [+0100], To Peter Zijlstra wrote:
> > On 2023-02-22 14:36:14 [+0100], Peter Zijlstra wrote:
> > > Which of the very few wait_task_inactive() users requires this?
> >
> > ptrace is the remaining (known) one (just verified on v6.2-rt3).
> > ptrace_check_attach() waits for the child which blocks on tasklist_lock.
> >
> > tglx argued that wait_task_inactive() should work regardless of whether the
> > task that is being waited for blocks on a sleeping lock.
>
> a polite ping.

a very polite ping.

Sebastian
On Fri, Feb 17, 2023 at 03:53:02PM +0100, Sebastian Andrzej Siewior wrote:

> +static __always_inline bool state_mismatch(struct task_struct *p, unsigned int match_state)
> +{
> +	unsigned long flags;
> +	bool mismatch;
> +
> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> +	if (READ_ONCE(p->__state) & match_state)
> +		mismatch = false;
> +	else if (READ_ONCE(p->saved_state) & match_state)
> +		mismatch = false;
> +	else
> +		mismatch = true;
> +
> +	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +	return mismatch;
> +}
> +static __always_inline bool state_match(struct task_struct *p, unsigned int match_state,
> +					bool *wait)
> +{
> +	if (READ_ONCE(p->__state) & match_state)
> +		return true;
> +	if (READ_ONCE(p->saved_state) & match_state) {
> +		*wait = true;
> +		return true;
> +	}
> +	return false;
> +}
> +#else
> +static __always_inline bool state_mismatch(struct task_struct *p, unsigned int match_state)
> +{
> +	return !(READ_ONCE(p->__state) & match_state);
> +}
> +static __always_inline bool state_match(struct task_struct *p, unsigned int match_state,
> +					bool *wait)
> +{
> +	return (READ_ONCE(p->__state) & match_state);
> +}
> +#endif
> +
>  /*
>   * wait_task_inactive - wait for a thread to unschedule.
>   *

Urgh...

I've ended up with the below.. I've tried folding it with
ttwu_state_match() but every attempt so far makes it an unholy mess.

Now, if only we had proper lock guard then we could drop another few
lines, but alas.

---
 kernel/sched/core.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..5a106629a98d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3341,6 +3341,37 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static __always_inline
+bool __wti_state_match(struct task_struct *p, unsigned int state, int *queued)
+{
+	if (READ_ONCE(p->__state) & state)
+		return true;
+
+#ifdef CONFIG_PREEMPT_RT
+	if (READ_ONCE(p->saved_state) & state) {
+		if (queued)
+			*queued = 1;
+		return true;
+	}
+#endif
+	return false;
+}
+
+static __always_inline bool wti_state_match(struct task_struct *p, unsigned int state)
+{
+#ifdef CONFIG_PREEMPT_RT
+	bool match;
+
+	raw_spin_lock_irq(&p->pi_lock);
+	match = __wti_state_match(p, state, NULL);
+	raw_spin_unlock_irq(&p->pi_lock);
+
+	return match;
+#else
+	return __wti_state_match(p, state, NULL);
+#endif
+}
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3385,7 +3416,7 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		 * is actually now running somewhere else!
 		 */
 		while (task_on_cpu(rq, p)) {
-			if (!(READ_ONCE(p->__state) & match_state))
+			if (!wti_state_match(p, match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3400,7 +3431,7 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (READ_ONCE(p->__state) & match_state)
+		if (__wti_state_match(p, match_state, &queued))
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);
 
On Thu, May 25, 2023 at 06:52:44PM +0200, Peter Zijlstra wrote:
> On Fri, Feb 17, 2023 at 03:53:02PM +0100, Sebastian Andrzej Siewior wrote:
>
> > +static __always_inline bool state_mismatch(struct task_struct *p, unsigned int match_state)
> > +{
> > +	unsigned long flags;
> > +	bool mismatch;
> > +
> > +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> > +	if (READ_ONCE(p->__state) & match_state)
> > +		mismatch = false;
> > +	else if (READ_ONCE(p->saved_state) & match_state)
> > +		mismatch = false;
> > +	else
> > +		mismatch = true;
> > +
> > +	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> > +	return mismatch;
> > +}
> > +static __always_inline bool state_match(struct task_struct *p, unsigned int match_state,
> > +					bool *wait)
> > +{
> > +	if (READ_ONCE(p->__state) & match_state)
> > +		return true;
> > +	if (READ_ONCE(p->saved_state) & match_state) {
> > +		*wait = true;
> > +		return true;
> > +	}
> > +	return false;
> > +}
> > +#else
> > +static __always_inline bool state_mismatch(struct task_struct *p, unsigned int match_state)
> > +{
> > +	return !(READ_ONCE(p->__state) & match_state);
> > +}
> > +static __always_inline bool state_match(struct task_struct *p, unsigned int match_state,
> > +					bool *wait)
> > +{
> > +	return (READ_ONCE(p->__state) & match_state);
> > +}
> > +#endif
> > +
> >  /*
> >   * wait_task_inactive - wait for a thread to unschedule.
> >   *
>
> Urgh...
>
> I've ended up with the below.. I've tried folding it with
> ttwu_state_match() but every attempt so far makes it an unholy mess.
>
> Now, if only we had proper lock guard then we could drop another few
> lines, but alas.

New day, new chances... How's this? Code-gen doesn't look totally
insane, but then, making sense of an optimizing compiler's output is
always a wee challenge.

---
 kernel/sched/core.c | 55 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..d89610fffd23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3341,6 +3341,35 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static __always_inline
+int __task_state_match(struct task_struct *p, unsigned int state)
+{
+	if (READ_ONCE(p->__state) & state)
+		return 1;
+
+#ifdef CONFIG_PREEMPT_RT
+	if (READ_ONCE(p->saved_state) & state)
+		return -1;
+#endif
+	return 0;
+}
+
+static __always_inline
+int task_state_match(struct task_struct *p, unsigned int state)
+{
+#ifdef CONFIG_PREEMPT_RT
+	int match;
+
+	raw_spin_lock_irq(&p->pi_lock);
+	match = __task_state_match(p, state);
+	raw_spin_unlock_irq(&p->pi_lock);
+
+	return match;
+#else
+	return __task_state_match(p, state);
+#endif
+}
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3359,7 +3388,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p,
  */
 unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
 {
-	int running, queued;
+	int running, queued, match;
 	struct rq_flags rf;
 	unsigned long ncsw;
 	struct rq *rq;
@@ -3385,7 +3414,7 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		 * is actually now running somewhere else!
 		 */
 		while (task_on_cpu(rq, p)) {
-			if (!(READ_ONCE(p->__state) & match_state))
+			if (!task_state_match(p, match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3400,8 +3429,15 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (READ_ONCE(p->__state) & match_state)
+		if ((match = __task_state_match(p, match_state))) {
+			/*
+			 * When matching on p->saved_state, consider this task
+			 * still queued so it will wait.
+			 */
+			if (match < 0)
+				queued = 1;
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+		}
 		task_rq_unlock(rq, p, &rf);
 
 		/*
@@ -4003,15 +4039,14 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 static __always_inline
 bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
 {
+	int match;
+
 	if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
 		WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
 			     state != TASK_RTLOCK_WAIT);
 	}
 
-	if (READ_ONCE(p->__state) & state) {
-		*success = 1;
-		return true;
-	}
+	*success = !!(match = __task_state_match(p, state));
 
 #ifdef CONFIG_PREEMPT_RT
 	/*
@@ -4027,12 +4062,10 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
 	 * p::saved_state to TASK_RUNNING so any further tests will
 	 * not result in false positives vs. @success
 	 */
-	if (p->saved_state & state) {
+	if (match < 0)
 		p->saved_state = TASK_RUNNING;
-		*success = 1;
-	}
 #endif
-	return false;
+	return match > 0;
 }
 
 /*
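The tri-state return value of `__task_state_match()` and the way `wait_task_inactive()` reacts to it can be modelled in user space. What follows is a rough sketch under stated assumptions: `struct model_task`, the `MODEL_*` values, `enum model_verdict` and both function names are hypothetical. The real function also takes the runqueue lock and checks whether the task is running, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define MODEL_RUNNING 0x0u
#define MODEL_TRACED  0x1u

struct model_task {
	unsigned int state;       /* models p->__state */
	unsigned int saved_state; /* models p->saved_state */
	bool on_rq;               /* models task_on_rq_queued(p) */
};

/* Same convention as the patch: 1 = __state, -1 = saved_state, 0 = no match. */
static int model_state_match(const struct model_task *p, unsigned int state)
{
	if (p->state & state)
		return 1;
	if (p->saved_state & state)
		return -1;
	return 0;
}

enum model_verdict {
	MODEL_BAIL,     /* state changed: wait_task_inactive() would return 0 */
	MODEL_RETRY,    /* matched, but the caller has to sleep and look again */
	MODEL_INACTIVE, /* matched and fully off the runqueue */
};

/* One locked pass of the loop, reduced to the state/queued decision. */
static enum model_verdict model_wait_pass(const struct model_task *p,
					  unsigned int match_state)
{
	bool queued = p->on_rq;
	int match = model_state_match(p, match_state);

	if (!match)
		return MODEL_BAIL;
	if (match < 0)
		queued = true; /* saved_state match: treat as still queued */
	return queued ? MODEL_RETRY : MODEL_INACTIVE;
}
```

A task whose desired state is only found in `saved_state` yields `MODEL_RETRY` even though it is not on the runqueue, which corresponds to the `queued = 1` override in the diff.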
On 2023-05-25 18:52:44 [+0200], Peter Zijlstra wrote:
> Urgh...
>
> I've ended up with the below.. I've tried folding it with
> ttwu_state_match() but every attempt so far makes it an unholy mess.
>
> Now, if only we had proper lock guard then we could drop another few
> lines, but alas.

perfect, thank you. Tested the bits.

Sebastian
On 2023-05-26 10:05:43 [+0200], Peter Zijlstra wrote:
> New day, new chances... How's this? Code-gen doesn't look totally
> insane, but then, making sense of an optimizing compiler's output is
> always a wee challenge.

Noticed it too late but looks good. Tested, works.

Sebastian
On Fri, May 26, 2023 at 05:13:35PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-05-26 10:05:43 [+0200], Peter Zijlstra wrote:
> > New day, new chances... How's this? Code-gen doesn't look totally
> > insane, but then, making sense of an optimizing compiler's output is
> > always a wee challenge.
>
> Noticed it too late but looks good. Tested, works.

Excellent; full patch below. Will go stick in tip/sched/core soonish.

---
Subject: sched: Consider task_struct::saved_state in wait_task_inactive()
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 31 16:39:07 CEST 2023

With the introduction of task_struct::saved_state in commit
5f220be21418 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks")
matching the task state has gotten more complicated. That same commit
changed try_to_wake_up() to consider both states, but
wait_task_inactive() has been neglected.

Sebastian noted that the wait_task_inactive() usage in
ptrace_check_attach() can misbehave when ptrace_stop() is blocked on
the tasklist_lock after it sets TASK_TRACED.

Therefore extract a common helper from ttwu_state_match() and use that
to teach wait_task_inactive() about the PREEMPT_RT locks.

Originally-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/core.c | 59 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 11 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3341,6 +3341,39 @@ int migrate_swap(struct task_struct *cur
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static __always_inline
+int __task_state_match(struct task_struct *p, unsigned int state)
+{
+	if (READ_ONCE(p->__state) & state)
+		return 1;
+
+#ifdef CONFIG_PREEMPT_RT
+	if (READ_ONCE(p->saved_state) & state)
+		return -1;
+#endif
+	return 0;
+}
+
+static __always_inline
+int task_state_match(struct task_struct *p, unsigned int state)
+{
+#ifdef CONFIG_PREEMPT_RT
+	int match;
+
+	/*
+	 * Serialize against current_save_and_set_rtlock_wait_state() and
+	 * current_restore_rtlock_saved_state().
+	 */
+	raw_spin_lock_irq(&p->pi_lock);
+	match = __task_state_match(p, state);
+	raw_spin_unlock_irq(&p->pi_lock);
+
+	return match;
+#else
+	return __task_state_match(p, state);
+#endif
+}
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3359,7 +3392,7 @@ int migrate_swap(struct task_struct *cur
  */
 unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
 {
-	int running, queued;
+	int running, queued, match;
 	struct rq_flags rf;
 	unsigned long ncsw;
 	struct rq *rq;
@@ -3385,7 +3418,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_on_cpu(rq, p)) {
-			if (!(READ_ONCE(p->__state) & match_state))
+			if (!task_state_match(p, match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3400,8 +3433,15 @@ unsigned long wait_task_inactive(struct
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
-		if (READ_ONCE(p->__state) & match_state)
+		if ((match = __task_state_match(p, match_state))) {
+			/*
+			 * When matching on p->saved_state, consider this task
+			 * still queued so it will wait.
+			 */
+			if (match < 0)
+				queued = 1;
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+		}
 		task_rq_unlock(rq, p, &rf);
 
 		/*
@@ -4003,15 +4043,14 @@ static void ttwu_queue(struct task_struc
 static __always_inline
 bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
 {
+	int match;
+
 	if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
 		WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
 			     state != TASK_RTLOCK_WAIT);
 	}
 
-	if (READ_ONCE(p->__state) & state) {
-		*success = 1;
-		return true;
-	}
+	*success = !!(match = __task_state_match(p, state));
 
 #ifdef CONFIG_PREEMPT_RT
 	/*
@@ -4027,12 +4066,10 @@ bool ttwu_state_match(struct task_struct
 	 * p::saved_state to TASK_RUNNING so any further tests will
 	 * not result in false positives vs. @success
 	 */
-	if (p->saved_state & state) {
+	if (match < 0)
 		p->saved_state = TASK_RUNNING;
-		*success = 1;
-	}
 #endif
-	return false;
+	return match > 0;
 }
 
 /*
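On the wakeup side, the reworked `ttwu_state_match()` maps the same tri-state value onto two outputs: `*success` and the boolean that says whether a real wakeup is needed. The following user-space sketch mirrors only that control flow. `struct model_task`, the `MODEL_*` values and the `model_*` function names are hypothetical, and the `CONFIG_PREEMPT_RT` branch is assumed to be enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define MODEL_RUNNING     0x0u
#define MODEL_TRACED      0x1u
#define MODEL_RTLOCK_WAIT 0x2u

struct model_task {
	unsigned int state;       /* models p->__state */
	unsigned int saved_state; /* models p->saved_state */
};

/* 1 = matched in state, -1 = matched in saved_state, 0 = no match. */
static int model_state_match(const struct model_task *p, unsigned int state)
{
	if (p->state & state)
		return 1;
	if (p->saved_state & state)
		return -1;
	return 0;
}

/*
 * Mirrors the control flow of the reworked helper: returns true only when
 * the task itself has to be woken, and reports any match through *success.
 */
static bool model_ttwu_state_match(struct model_task *p, unsigned int state,
				   int *success)
{
	int match = model_state_match(p, state);

	*success = match != 0;
	if (match < 0)
		p->saved_state = MODEL_RUNNING; /* consume the parked state */
	return match > 0;
}
```

For a task parked on a sleeping lock with `MODEL_TRACED` saved, a wakeup for `MODEL_TRACED` reports success but returns false: the task stays in `MODEL_RTLOCK_WAIT` and only its saved state is reset, so a second identical wakeup no longer matches.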
On Thu, Jun 01, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> On Fri, May 26, 2023 at 05:13:35PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2023-05-26 10:05:43 [+0200], Peter Zijlstra wrote:
> > > New day, new chances... How's this? Code-gen doesn't look totally
> > > insane, but then, making sense of an optimizing compiler's output is
> > > always a wee challenge.
> >
> > Noticed it too late but looks good. Tested, works.
>
> Excellent; full patch below. Will go stick in tip/sched/core soonish.

Urgh, so robot kicked me for breaking !SMP. And that made me realize
that UP wait_task_inactive() is broken on PREEMPT_RT.

Let me figure out what best to do about that..
On Fri, Jun 02, 2023 at 10:25:03AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 11:12:34AM +0200, Peter Zijlstra wrote:
> > On Fri, May 26, 2023 at 05:13:35PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2023-05-26 10:05:43 [+0200], Peter Zijlstra wrote:
> > > > New day, new chances... How's this? Code-gen doesn't look totally
> > > > insane, but then, making sense of an optimizing compiler's output is
> > > > always a wee challenge.
> > >
> > > Noticed it too late but looks good. Tested, works.
> >
> > Excellent; full patch below. Will go stick in tip/sched/core soonish.
>
> Urgh, so robot kicked me for breaking !SMP. And that made me realize
> that UP wait_task_inactive() is broken on PREEMPT_RT.
>
> Let me figure out what best to do about that..

I'll stick this in front -- see what happens ;-)

---
Subject: sched: Unconditionally use full-fat wait_task_inactive()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Jun 2 10:42:53 CEST 2023

While modifying wait_task_inactive() for PREEMPT_RT; the build robot
noted that UP got broken. This led to audit and consideration of the
UP implementation of wait_task_inactive().

It looks like the UP implementation is also broken for PREEMPT;
consider task_current_syscall() getting preempted between the two
calls to wait_task_inactive().

Therefore move the wait_task_inactive() implementation out of
CONFIG_SMP and unconditionally use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    7 -
 kernel/sched/core.c   |  216 +++++++++++++++++++++++++-------------------------
 2 files changed, 110 insertions(+), 113 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2006,15 +2006,12 @@ static __always_inline void scheduler_ip
 	 */
 	preempt_fold_need_resched();
 }
-extern unsigned long wait_task_inactive(struct task_struct *, unsigned int match_state);
 #else
 static inline void scheduler_ipi(void) { }
-static inline unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
-{
-	return 1;
-}
 #endif
 
+extern unsigned long wait_task_inactive(struct task_struct *, unsigned int match_state);
+
 /*
  * Set thread flags in other task's structures.
  * See asm/thread_info.h for TIF_xxxx flags available:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2213,6 +2213,114 @@ void check_preempt_curr(struct rq *rq, s
 		rq_clock_skip_update(rq);
 }
 
+/*
+ * wait_task_inactive - wait for a thread to unschedule.
+ *
+ * Wait for the thread to block in any of the states set in @match_state.
+ * If it changes, i.e. @p might have woken up, then return zero.  When we
+ * succeed in waiting for @p to be off its CPU, we return a positive number
+ * (its total switch count).  If a second call a short while later returns the
+ * same number, the caller can be sure that @p has remained unscheduled the
+ * whole time.
+ *
+ * The caller must ensure that the task *will* unschedule sometime soon,
+ * else this function might spin for a *long* time. This function can't
+ * be called with interrupts off, or it may introduce deadlock with
+ * smp_call_function() if an IPI is sent by the same process we are
+ * waiting to become inactive.
+ */
+unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
+{
+	int running, queued;
+	struct rq_flags rf;
+	unsigned long ncsw;
+	struct rq *rq;
+
+	for (;;) {
+		/*
+		 * We do the initial early heuristics without holding
+		 * any task-queue locks at all. We'll only try to get
+		 * the runqueue lock when things look like they will
+		 * work out!
+		 */
+		rq = task_rq(p);
+
+		/*
+		 * If the task is actively running on another CPU
+		 * still, just relax and busy-wait without holding
+		 * any locks.
+		 *
+		 * NOTE! Since we don't hold any locks, it's not
+		 * even sure that "rq" stays as the right runqueue!
+		 * But we don't care, since "task_on_cpu()" will
+		 * return false if the runqueue has changed and p
+		 * is actually now running somewhere else!
+		 */
+		while (task_on_cpu(rq, p)) {
+			if (!(READ_ONCE(p->__state) & match_state))
+				return 0;
+			cpu_relax();
+		}
+
+		/*
+		 * Ok, time to look more closely! We need the rq
+		 * lock now, to be *sure*. If we're wrong, we'll
+		 * just go back and repeat.
+		 */
+		rq = task_rq_lock(p, &rf);
+		trace_sched_wait_task(p);
+		running = task_on_cpu(rq, p);
+		queued = task_on_rq_queued(p);
+		ncsw = 0;
+		if (READ_ONCE(p->__state) & match_state)
+			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+		task_rq_unlock(rq, p, &rf);
+
+		/*
+		 * If it changed from the expected state, bail out now.
+		 */
+		if (unlikely(!ncsw))
+			break;
+
+		/*
+		 * Was it really running after all now that we
+		 * checked with the proper locks actually held?
+		 *
+		 * Oops. Go back and try again..
+		 */
+		if (unlikely(running)) {
+			cpu_relax();
+			continue;
+		}
+
+		/*
+		 * It's not enough that it's not actively running,
+		 * it must be off the runqueue _entirely_, and not
+		 * preempted!
+		 *
+		 * So if it was still runnable (but just not actively
+		 * running right now), it's preempted, and we should
+		 * yield - it could be a while.
+		 */
+		if (unlikely(queued)) {
+			ktime_t to = NSEC_PER_SEC / HZ;
+
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
+			continue;
+		}
+
+		/*
+		 * Ahh, all good. It wasn't running, and it wasn't
+		 * runnable, which means that it will never become
+		 * running in the future either. We're all done!
+		 */
+		break;
+	}
+
+	return ncsw;
+}
+
 #ifdef CONFIG_SMP
 
 static void
@@ -3341,114 +3449,6 @@ int migrate_swap(struct task_struct *cur
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-/*
- * wait_task_inactive - wait for a thread to unschedule.
- *
- * Wait for the thread to block in any of the states set in @match_state.
- * If it changes, i.e. @p might have woken up, then return zero.  When we
- * succeed in waiting for @p to be off its CPU, we return a positive number
- * (its total switch count).  If a second call a short while later returns the
- * same number, the caller can be sure that @p has remained unscheduled the
- * whole time.
- *
- * The caller must ensure that the task *will* unschedule sometime soon,
- * else this function might spin for a *long* time. This function can't
- * be called with interrupts off, or it may introduce deadlock with
- * smp_call_function() if an IPI is sent by the same process we are
- * waiting to become inactive.
- */
-unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
-{
-	int running, queued;
-	struct rq_flags rf;
-	unsigned long ncsw;
-	struct rq *rq;
-
-	for (;;) {
-		/*
-		 * We do the initial early heuristics without holding
-		 * any task-queue locks at all. We'll only try to get
-		 * the runqueue lock when things look like they will
-		 * work out!
-		 */
-		rq = task_rq(p);
-
-		/*
-		 * If the task is actively running on another CPU
-		 * still, just relax and busy-wait without holding
-		 * any locks.
-		 *
-		 * NOTE! Since we don't hold any locks, it's not
-		 * even sure that "rq" stays as the right runqueue!
-		 * But we don't care, since "task_on_cpu()" will
-		 * return false if the runqueue has changed and p
-		 * is actually now running somewhere else!
-		 */
-		while (task_on_cpu(rq, p)) {
-			if (!(READ_ONCE(p->__state) & match_state))
-				return 0;
-			cpu_relax();
-		}
-
-		/*
-		 * Ok, time to look more closely! We need the rq
-		 * lock now, to be *sure*. If we're wrong, we'll
-		 * just go back and repeat.
-		 */
-		rq = task_rq_lock(p, &rf);
-		trace_sched_wait_task(p);
-		running = task_on_cpu(rq, p);
-		queued = task_on_rq_queued(p);
-		ncsw = 0;
-		if (READ_ONCE(p->__state) & match_state)
-			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
-		task_rq_unlock(rq, p, &rf);
-
-		/*
-		 * If it changed from the expected state, bail out now.
-		 */
-		if (unlikely(!ncsw))
-			break;
-
-		/*
-		 * Was it really running after all now that we
-		 * checked with the proper locks actually held?
-		 *
-		 * Oops. Go back and try again..
-		 */
-		if (unlikely(running)) {
-			cpu_relax();
-			continue;
-		}
-
-		/*
-		 * It's not enough that it's not actively running,
-		 * it must be off the runqueue _entirely_, and not
-		 * preempted!
-		 *
-		 * So if it was still runnable (but just not actively
-		 * running right now), it's preempted, and we should
-		 * yield - it could be a while.
-		 */
-		if (unlikely(queued)) {
-			ktime_t to = NSEC_PER_SEC / HZ;
-
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
-			continue;
-		}
-
-		/*
-		 * Ahh, all good. It wasn't running, and it wasn't
-		 * runnable, which means that it will never become
-		 * running in the future either. We're all done!
-		 */
-		break;
-	}
-
-	return ncsw;
-}
-
 /***
  * kick_process - kick a running thread to enter/exit the kernel
  * @p: the to-be-kicked thread
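The return-value convention documented in the comment (zero on a state change, otherwise the switch count with the top bit forced on, so that two equal results mean the task stayed unscheduled) can be shown in isolation. This is a minimal sketch: `model_encode_ncsw()` and `model_stayed_inactive()` are hypothetical helper names, and the second one only illustrates how a caller might compare two results.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Encodes a voluntary-switch count the way the quoted code does
 * (nvcsw | LONG_MIN): forcing the MSB on keeps the result non-zero
 * even when the count itself is zero.
 */
static unsigned long model_encode_ncsw(unsigned long nvcsw)
{
	return nvcsw | (unsigned long)LONG_MIN;
}

/*
 * Hypothetical caller-side check: two equal, non-zero results from
 * consecutive calls mean the task did not switch in between.
 */
static bool model_stayed_inactive(unsigned long first, unsigned long second)
{
	return first != 0 && first == second;
}
```

A count of zero therefore still encodes to a non-zero value, which keeps "matched, never switched" distinguishable from the plain `0` that signals a state mismatch.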
On 2023-06-02 12:37:31 [+0200], Peter Zijlstra wrote:
> ---
> Subject: sched: Unconditionally use full-fat wait_task_inactive()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Jun 2 10:42:53 CEST 2023
>
> While modifying wait_task_inactive() for PREEMPT_RT; the build robot
> noted that UP got broken. This led to audit and consideration of the
> UP implementation of wait_task_inactive().
>
> It looks like the UP implementation is also broken for PREEMPT;

If UP is broken for PREEMPT, shouldn't it get a fixes or stable tag?

Either way, I will try to stuff this in RT today and give feedback.
I actually never booted this on UP, will try to do so today…

Sebastian
On Fri, Jun 02, 2023 at 12:49:58PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-06-02 12:37:31 [+0200], Peter Zijlstra wrote:
> > ---
> > Subject: sched: Unconditionally use full-fat wait_task_inactive()
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Fri Jun 2 10:42:53 CEST 2023
> >
> > While modifying wait_task_inactive() for PREEMPT_RT; the build robot
> > noted that UP got broken. This led to audit and consideration of the
> > UP implementation of wait_task_inactive().
> >
> > It looks like the UP implementation is also broken for PREEMPT;
>
> If UP is broken for PREEMPT, shouldn't it get a fixes or stable tag?

It has been broken *forever*, I don't think we need to 'rush' a fix.
Also, I don't think anybody actually uses a UP+PREEMPT kernel much, but
what do I know.
On 2023-06-02 12:37:31 [+0200], Peter Zijlstra wrote:
> I'll stick this in front -- see what happens ;-)
Tested this with the previous one. All good.
Sebastian
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3266,6 +3266,76 @@ int migrate_swap(struct task_struct *cur
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_PREEMPT_RT
+
+/*
+ * Consider:
+ *
+ *  set_special_state(X);
+ *
+ *  do_things()
+ *    // Somewhere in there is an rtlock that can be contended:
+ *    current_save_and_set_rtlock_wait_state();
+ *    [...]
+ *    schedule_rtlock(); (A)
+ *    [...]
+ *    current_restore_rtlock_saved_state();
+ *
+ *  schedule(); (B)
+ *
+ * If p->saved_state is anything else than TASK_RUNNING, then p blocked on an
+ * rtlock (A) *before* voluntarily calling into schedule() (B) after setting its
+ * state to X. For things like ptrace (X=TASK_TRACED), the task could have more
+ * work to do upon acquiring the lock in do_things() before whoever called
+ * wait_task_inactive() should return. IOW, we have to wait for:
+ *
+ *   p.saved_state = TASK_RUNNING
+ *   p.__state     = X
+ *
+ * which implies the task isn't blocked on an RT lock and got to schedule() (B).
+ *
+ * Also see comments in ttwu_state_match().
+ */
+
+static __always_inline bool state_mismatch(struct task_struct *p, unsigned int match_state)
+{
+	unsigned long flags;
+	bool mismatch;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	if (READ_ONCE(p->__state) & match_state)
+		mismatch = false;
+	else if (READ_ONCE(p->saved_state) & match_state)
+		mismatch = false;
+	else
+		mismatch = true;
+
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	return mismatch;
+}
+static __always_inline bool state_match(struct task_struct *p, unsigned int match_state,
+					bool *wait)
+{
+	if (READ_ONCE(p->__state) & match_state)
+		return true;
+	if (READ_ONCE(p->saved_state) & match_state) {
+		*wait = true;
+		return true;
+	}
+	return false;
+}
+#else
+static __always_inline bool state_mismatch(struct task_struct *p, unsigned int match_state)
+{
+	return !(READ_ONCE(p->__state) & match_state);
+}
+static __always_inline bool state_match(struct task_struct *p, unsigned int match_state,
+					bool *wait)
+{
+	return (READ_ONCE(p->__state) & match_state);
+}
+#endif
+
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3284,7 +3354,7 @@ int migrate_swap(struct task_struct *cur
  */
 unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state)
 {
-	int running, queued;
+	bool running, wait;
 	struct rq_flags rf;
 	unsigned long ncsw;
 	struct rq *rq;
@@ -3310,7 +3380,7 @@ unsigned long wait_task_inactive(struct
 		 * is actually now running somewhere else!
 		 */
 		while (task_on_cpu(rq, p)) {
-			if (!(READ_ONCE(p->__state) & match_state))
+			if (state_mismatch(p, match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -3323,9 +3393,10 @@ unsigned long wait_task_inactive(struct
 		rq = task_rq_lock(p, &rf);
 		trace_sched_wait_task(p);
 		running = task_on_cpu(rq, p);
-		queued = task_on_rq_queued(p);
+		wait = task_on_rq_queued(p);
 		ncsw = 0;
-		if (READ_ONCE(p->__state) & match_state)
+
+		if (state_match(p, match_state, &wait))
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &rf);
 
@@ -3355,7 +3426,7 @@ unsigned long wait_task_inactive(struct
 		 * running right now), it's preempted, and we should
 		 * yield - it could be a while.
 		 */
-		if (unlikely(queued)) {
+		if (unlikely(wait)) {
 			ktime_t to = NSEC_PER_SEC / HZ;
 
 			set_current_state(TASK_UNINTERRUPTIBLE);
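
To make the PREEMPT_RT case in the comment concrete, here is a rough userspace model of the state_match() logic from the patch: a task blocked on an rtlock has the requested state parked in saved_state, so it counts as a match but the caller must keep waiting. This is a sketch under assumptions, not kernel code; struct rt_task_model, model_state_match() and the MODEL_* bit values are made up for illustration, and the locking and READ_ONCE() of the real code are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state bits only; the values are assumptions, not kernel ABI. */
#define MODEL_RUNNING		0x0000u
#define MODEL_TRACED		0x0008u
#define MODEL_RTLOCK_WAIT	0x1000u

/* Hypothetical stand-in for the two state words of a task on PREEMPT_RT. */
struct rt_task_model {
	unsigned int state;		/* models p->__state */
	unsigned int saved_state;	/* models p->saved_state */
};

/*
 * Mirrors the PREEMPT_RT state_match() from the patch:
 * - match in state:       done, *wait is left untouched
 * - match in saved_state: the task still sits on an rtlock (A),
 *                         so report a match but ask the caller to wait
 * - otherwise:            no match
 */
static bool model_state_match(const struct rt_task_model *p,
			      unsigned int match_state, bool *wait)
{
	if (p->state & match_state)
		return true;
	if (p->saved_state & match_state) {
		*wait = true;
		return true;
	}
	return false;
}
```

In the patch, *wait is pre-loaded with task_on_rq_queued(p), so the helper can only turn waiting on, never off. The model keeps that property by writing *wait only in the saved_state branch.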