[v2] sched/stat: correct the task blocking state
Commit Message
From: Alex Shi <alexs@kernel.org>
Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
stopped idle kthreads from contributing to the load average. However,
time spent in the idle state is still accounted as blocked time in
schedstats rather than as sleep time. As a result, we cannot tell
whether a task is blocked for some reason or is idle of its own accord.
Distinguishing between these two states makes the system state clearer
and gives us the opportunity to use a task's 'D' state as an indicator
of latency issues.
Originally-from: Curu Wong <curuwang@tencent.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
To: linux-kernel@vger.kernel.org
To: Valentin Schneider <vschneid@redhat.com>
To: Daniel Bristot de Oliveira <bristot@redhat.com>
To: Mel Gorman <mgorman@suse.de>
To: Ben Segall <bsegall@google.com>
To: Steven Rostedt <rostedt@goodmis.org>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
To: Juri Lelli <juri.lelli@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Ingo Molnar <mingo@redhat.com>
---
include/linux/sched.h | 6 ++++++
kernel/sched/deadline.c | 5 +++--
kernel/sched/fair.c | 5 +++--
kernel/sched/rt.c | 5 +++--
4 files changed, 15 insertions(+), 6 deletions(-)
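To see why idle time currently ends up in the blocked bucket, here is a
minimal stand-alone sketch (illustration only, not part of the patch); the
TASK_* values are copied from a recent include/linux/sched.h:

#include <stdio.h>

/* TASK_* values as defined in a recent include/linux/sched.h (assumption,
 * not part of this patch) */
#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_NOLOAD		0x00000400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

int main(void)
{
	unsigned int state = TASK_IDLE;

	/* TASK_IDLE contains the TASK_UNINTERRUPTIBLE bit, so the existing
	 * "state & TASK_UNINTERRUPTIBLE" schedstat check also matches idle
	 * tasks and charges their time to block_start. */
	printf("idle matches the old block check: %d\n",
	       !!(state & TASK_UNINTERRUPTIBLE));	/* prints 1 */
	return 0;
}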
Comments
On 03/01/24 16:10, alexs@kernel.org wrote:
> From: Alex Shi <alexs@kernel.org>
>
> Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
> stopped idle kthreads from contributing to the load average. However,
> time spent in the idle state is still accounted as blocked time in
> schedstats rather than as sleep time. As a result, we cannot tell
> whether a task is blocked for some reason or is idle of its own accord.
>
> Distinguishing between these two states makes the system state clearer
> and gives us the opportunity to use a task's 'D' state as an indicator
> of latency issues.
>
get_task_state() already reports TASK_IDLE as 'I', which should be what
userspace sees (e.g. via /proc/$pid/stat). This is also the case for the
sched_switch and sched_wakeup trace events.
I assume what you mean here is you first turn to schedstats to check
whether there is any abnormal amount of blocking, and then if there is any
you turn to tracing, in which case you'd like the schedstats to not make
things look worse than they really are?
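(For completeness, a small sketch of how userspace can check that letter;
this is an illustration only, not part of the patch. The state character is
the third field of /proc/<pid>/stat, and an idle kthread shows up there as
'I', an uninterruptible sleeper as 'D'. Pid 2 (kthreadd) is just an example
target.)

#include <stdio.h>

int main(void)
{
	char comm[64], state;
	int pid;
	FILE *f = fopen("/proc/2/stat", "r");

	if (!f)
		return 1;
	/* format: pid (comm) state ... */
	if (fscanf(f, "%d (%63[^)]) %c", &pid, comm, &state) == 3)
		printf("pid %d (%s) state=%c\n", pid, comm, state);
	fclose(f);
	return 0;
}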
On Thu, Jan 4, 2024 at 6:38 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 03/01/24 16:10, alexs@kernel.org wrote:
> > From: Alex Shi <alexs@kernel.org>
> >
> > Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
> > stopped idle kthreads from contributing to the load average. However,
> > time spent in the idle state is still accounted as blocked time in
> > schedstats rather than as sleep time. As a result, we cannot tell
> > whether a task is blocked for some reason or is idle of its own accord.
> >
> > Distinguishing between these two states makes the system state clearer
> > and gives us the opportunity to use a task's 'D' state as an indicator
> > of latency issues.
> >
>
> get_task_state() already reports TASK_IDLE as 'I', which should be what
> userspace sees (e.g. via /proc/$pid/stat). This is also the case for the
> sched_switch and sched_wakeup trace events.
>
> I assume what you mean here is you first turn to schedstats to check
> whether there is any abnormal amount of blocking, and then if there is any
> you turn to tracing, in which case you'd like the schedstats to not make
> things look worse than they really are?
Yes, the sched_switch/sched_wakeup events or other tools could help
figure out the real blocked time, but with this change schedstats can
give it in a neat and elegant way.
Thanks!
>
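For context, this is the sort of data a monitoring script would pull out of
schedstats; a rough sketch that dumps the sleep/block related lines of
/proc/<pid>/sched (the exact field names depend on the kernel version and on
schedstats being enabled, so this simply matches substrings):

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	/* default to the calling task if no pid is given */
	snprintf(path, sizeof(path), "/proc/%s/sched",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "sleep") || strstr(line, "block"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}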
On Thu, Jan 4, 2024 at 7:29 PM Alex Shi <seakeel@gmail.com> wrote:
>
> On Thu, Jan 4, 2024 at 6:38 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > On 03/01/24 16:10, alexs@kernel.org wrote:
> > > From: Alex Shi <alexs@kernel.org>
> > >
> > > Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
> > > stopped idle kthreads from contributing to the load average. However,
> > > time spent in the idle state is still accounted as blocked time in
> > > schedstats rather than as sleep time. As a result, we cannot tell
> > > whether a task is blocked for some reason or is idle of its own accord.
> > >
> > > Distinguishing between these two states makes the system state clearer
> > > and gives us the opportunity to use a task's 'D' state as an indicator
> > > of latency issues.
> > >
> >
> > get_task_state() already reports TASK_IDLE as 'I', which should be what
> > userspace sees (e.g. via /proc/$pid/stat). This is also the case for the
> > sched_switch and sched_wakeup trace events.
> >
> > I assume what you mean here is you first turn to schedstats to check
> > whether there is any abnormal amount of blocking, and then if there is any
> > you turn to tracing, in which case you'd like the schedstats to not make
> > things look worse than they really are?
>
> Yes, the sched_switch/sched_wakeup events or other tools could help
> figure out the real blocked time, but with this change schedstats can
> give it in a neat and elegant way.
As for the downsides I can imagine: blocked and sleep states can no
longer be interpreted as before, which may force some scripts to change.
But accounting the two states correctly is the better way, and it makes
the sleep/block statistics more meaningful and helpful in normal usage.
Are there any concerns about this?
Thanks
Alex
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -140,6 +140,12 @@ struct user_event_mm;
#define is_special_task_state(state) \
((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | TASK_DEAD))
+/* a blocked task is UNINTERRUPTIBLE but not NOLOAD */
+#define is_blocked_state(state) \
+	(((state) & TASK_UNINTERRUPTIBLE) && !((state) & TASK_NOLOAD))
+
+#define is_idle_state(state) (((state) & TASK_IDLE) == TASK_IDLE)
+
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_normal_state_change(state_value) \
do { \
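As a quick sanity check of the two new predicates, a stand-alone user-space
version with the TASK_* values copied from a recent include/linux/sched.h
(illustration only, not part of the patch):

#include <assert.h>

#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_NOLOAD		0x00000400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

#define is_blocked_state(state) \
	(((state) & TASK_UNINTERRUPTIBLE) && !((state) & TASK_NOLOAD))
#define is_idle_state(state)	(((state) & TASK_IDLE) == TASK_IDLE)

int main(void)
{
	assert(is_blocked_state(TASK_UNINTERRUPTIBLE));	/* 'D': blocked      */
	assert(!is_blocked_state(TASK_IDLE));		/* idle: not blocked */
	assert(is_idle_state(TASK_IDLE));		/* idle: idle        */
	assert(!is_idle_state(TASK_UNINTERRUPTIBLE));	/* 'D': not idle     */
	assert(!is_idle_state(TASK_INTERRUPTIBLE));	/* 'S': neither      */
	return 0;
}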
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1566,11 +1566,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
unsigned int state;
state = READ_ONCE(p->__state);
- if (state & TASK_INTERRUPTIBLE)
+	/* idle state time is accounted as sleep time */
+	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
__schedstat_set(p->stats.sleep_start,
rq_clock(rq_of_dl_rq(dl_rq)));
- if (state & TASK_UNINTERRUPTIBLE)
+ if (is_blocked_state(state))
__schedstat_set(p->stats.block_start,
rq_clock(rq_of_dl_rq(dl_rq)));
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1278,10 +1278,11 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
/* XXX racy against TTWU */
state = READ_ONCE(tsk->__state);
- if (state & TASK_INTERRUPTIBLE)
+	/* idle state time is accounted as sleep time */
+	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
__schedstat_set(tsk->stats.sleep_start,
rq_clock(rq_of(cfs_rq)));
- if (state & TASK_UNINTERRUPTIBLE)
+ if (is_blocked_state(state))
__schedstat_set(tsk->stats.block_start,
rq_clock(rq_of(cfs_rq)));
}
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1371,11 +1371,12 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
unsigned int state;
state = READ_ONCE(p->__state);
- if (state & TASK_INTERRUPTIBLE)
+	/* idle state time is accounted as sleep time */
+	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
__schedstat_set(p->stats.sleep_start,
rq_clock(rq_of_rt_rq(rt_rq)));
- if (state & TASK_UNINTERRUPTIBLE)
+ if (is_blocked_state(state))
__schedstat_set(p->stats.block_start,
rq_clock(rq_of_rt_rq(rt_rq)));
}
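To make the accounting change concrete, a stand-alone sketch of the
dequeue-time classification before and after this patch (TASK_* values again
copied from a recent include/linux/sched.h; the real code sets sleep_start
and block_start independently, this simplification only shows which bucket
each state lands in):

#include <stdio.h>

#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_NOLOAD		0x00000400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

#define is_blocked_state(state) \
	(((state) & TASK_UNINTERRUPTIBLE) && !((state) & TASK_NOLOAD))
#define is_idle_state(state)	(((state) & TASK_IDLE) == TASK_IDLE)

/* pre-patch logic: idle tasks hit the TASK_UNINTERRUPTIBLE check */
static const char *old_bucket(unsigned int state)
{
	if (state & TASK_INTERRUPTIBLE)
		return "sleep_start";
	if (state & TASK_UNINTERRUPTIBLE)
		return "block_start";
	return "none";
}

/* post-patch logic: idle time is accounted as sleep, not block */
static const char *new_bucket(unsigned int state)
{
	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
		return "sleep_start";
	if (is_blocked_state(state))
		return "block_start";
	return "none";
}

int main(void)
{
	printf("S: old=%s new=%s\n", old_bucket(TASK_INTERRUPTIBLE),
	       new_bucket(TASK_INTERRUPTIBLE));		/* sleep / sleep */
	printf("D: old=%s new=%s\n", old_bucket(TASK_UNINTERRUPTIBLE),
	       new_bucket(TASK_UNINTERRUPTIBLE));	/* block / block */
	printf("I: old=%s new=%s\n", old_bucket(TASK_IDLE),
	       new_bucket(TASK_IDLE));			/* block / sleep: the change */
	return 0;
}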