[v2] sched/stat: correct the task blocking state
Commit Message
From: Alex Shi <alexs@kernel.org>
Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
stopped idle kthreads from contributing to the load average. However,
time spent in the idle state is still accounted as blocked time in
schedstats rather than as sleep time. As a result, we cannot tell
whether a task is blocked for some reason or is idle of its own accord.
Distinguishing between these two states makes the system state clearer
and gives us the opportunity to use a task's 'D' state as an indicator
of latency issues.
Originally-from: Curu Wong <curuwang@tencent.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
To: linux-kernel@vger.kernel.org
To: Valentin Schneider <vschneid@redhat.com>
To: Daniel Bristot de Oliveira <bristot@redhat.com>
To: Mel Gorman <mgorman@suse.de>
To: Ben Segall <bsegall@google.com>
To: Steven Rostedt <rostedt@goodmis.org>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
To: Juri Lelli <juri.lelli@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
To: Ingo Molnar <mingo@redhat.com>
---
include/linux/sched.h | 6 ++++++
kernel/sched/deadline.c | 5 +++--
kernel/sched/fair.c | 5 +++--
kernel/sched/rt.c | 5 +++--
4 files changed, 15 insertions(+), 6 deletions(-)
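To see why idle time currently ends up in the blocked bucket, here is a
minimal stand-alone sketch (illustration only, not part of the patch); the
TASK_* values are copied from a recent include/linux/sched.h:

#include <stdio.h>

/* TASK_* values as defined in a recent include/linux/sched.h (assumption,
 * not part of this patch) */
#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_NOLOAD		0x00000400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

int main(void)
{
	unsigned int state = TASK_IDLE;

	/* TASK_IDLE contains the TASK_UNINTERRUPTIBLE bit, so the existing
	 * "state & TASK_UNINTERRUPTIBLE" schedstat check also matches idle
	 * tasks and charges their time to block_start. */
	printf("idle matches the old block check: %d\n",
	       !!(state & TASK_UNINTERRUPTIBLE));	/* prints 1 */
	return 0;
}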
Comments
On 03/01/24 16:10, alexs@kernel.org wrote:
> From: Alex Shi <alexs@kernel.org>
>
> Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
> stopped idle kthreads from contributing to the load average. However,
> time spent in the idle state is still accounted as blocked time in
> schedstats rather than as sleep time. As a result, we cannot tell
> whether a task is blocked for some reason or is idle of its own accord.
>
> Distinguishing between these two states makes the system state clearer
> and gives us the opportunity to use a task's 'D' state as an indicator
> of latency issues.
>
get_task_state() already reports TASK_IDLE as 'I', which should be what
userspace sees (e.g. via /proc/$pid/stat). This is also the case for the
sched_switch and sched_wakeup trace events.
I assume what you mean here is you first turn to schedstats to check
whether there is any abnormal amount of blocking, and then if there is any
you turn to tracing, in which case you'd like the schedstats to not make
things look worse than they really are?
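(For completeness, a small sketch of how userspace can check that letter;
this is an illustration only, not part of the patch. The state character is
the third field of /proc/<pid>/stat, and an idle kthread shows up there as
'I', an uninterruptible sleeper as 'D'. Pid 2 (kthreadd) is just an example
target.)

#include <stdio.h>

int main(void)
{
	char comm[64], state;
	int pid;
	FILE *f = fopen("/proc/2/stat", "r");

	if (!f)
		return 1;
	/* format: pid (comm) state ... */
	if (fscanf(f, "%d (%63[^)]) %c", &pid, comm, &state) == 3)
		printf("pid %d (%s) state=%c\n", pid, comm, state);
	fclose(f);
	return 0;
}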
On Thu, Jan 4, 2024 at 6:38 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 03/01/24 16:10, alexs@kernel.org wrote:
> > From: Alex Shi <alexs@kernel.org>
> >
> > Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
> > stopped idle kthreads from contributing to the load average. However,
> > time spent in the idle state is still accounted as blocked time in
> > schedstats rather than as sleep time. As a result, we cannot tell
> > whether a task is blocked for some reason or is idle of its own accord.
> >
> > Distinguishing between these two states makes the system state clearer
> > and gives us the opportunity to use a task's 'D' state as an indicator
> > of latency issues.
> >
>
> get_task_state() already reports TASK_IDLE as 'I', which should be what
> userspace sees (e.g. via /proc/$pid/stat). This is also the case for the
> sched_switch and sched_wakeup trace events.
>
> I assume what you mean here is you first turn to schedstats to check
> whether there is any abnormal amount of blocking, and then if there is any
> you turn to tracing, in which case you'd like the schedstats to not make
> things look worse than they really are?
Yes, the sched_switch/sched_wakeup events or other tools could help
figure out the real blocked time, but with this change schedstats can
give it in a neat and elegant way.
Thanks!
>
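For context, this is the sort of data a monitoring script would pull out of
schedstats; a rough sketch that dumps the sleep/block related lines of
/proc/<pid>/sched (the exact field names depend on the kernel version and on
schedstats being enabled, so this simply matches substrings):

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	/* default to the calling task if no pid is given */
	snprintf(path, sizeof(path), "/proc/%s/sched",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "sleep") || strstr(line, "block"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}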
On Thu, Jan 4, 2024 at 7:29 PM Alex Shi <seakeel@gmail.com> wrote:
>
> On Thu, Jan 4, 2024 at 6:38 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > On 03/01/24 16:10, alexs@kernel.org wrote:
> > > From: Alex Shi <alexs@kernel.org>
> > >
> > > Commit 80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")
> > > stopped idle kthreads from contributing to the load average. However,
> > > time spent in the idle state is still accounted as blocked time in
> > > schedstats rather than as sleep time. As a result, we cannot tell
> > > whether a task is blocked for some reason or is idle of its own accord.
> > >
> > > Distinguishing between these two states makes the system state clearer
> > > and gives us the opportunity to use a task's 'D' state as an indicator
> > > of latency issues.
> > >
> >
> > get_task_state() already reports TASK_IDLE as 'I', which should be what
> > userspace sees (e.g. via /proc/$pid/stat). This is also the case for the
> > sched_switch and sched_wakeup trace events.
> >
> > I assume what you mean here is you first turn to schedstats to check
> > whether there is any abnormal amount of blocking, and then if there is any
> > you turn to tracing, in which case you'd like the schedstats to not make
> > things look worse than they really are?
>
> Yes, the sched_switch/sched_wakeup events or other tools could help
> figure out the real blocked time, but with this change schedstats can
> give it in a neat and elegant way.
As for the downsides I can imagine: blocked and sleep states can no
longer be interpreted as before, which may force some scripts to change.
But accounting the two states correctly is the better way, and it makes
the sleep/block statistics more meaningful and helpful in normal usage.
Are there any concerns about this?
Thanks
Alex
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -140,6 +140,12 @@ struct user_event_mm;
#define is_special_task_state(state) \
((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | TASK_DEAD))
+/* a blocked task is UNINTERRUPTIBLE but not NOLOAD */
+#define is_blocked_state(state) \
+	(((state) & TASK_UNINTERRUPTIBLE) && !((state) & TASK_NOLOAD))
+
+#define is_idle_state(state) (((state) & TASK_IDLE) == TASK_IDLE)
+
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_normal_state_change(state_value) \
do { \
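As a quick sanity check of the two new predicates, a stand-alone user-space
version with the TASK_* values copied from a recent include/linux/sched.h
(illustration only, not part of the patch):

#include <assert.h>

#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_NOLOAD		0x00000400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

#define is_blocked_state(state) \
	(((state) & TASK_UNINTERRUPTIBLE) && !((state) & TASK_NOLOAD))
#define is_idle_state(state)	(((state) & TASK_IDLE) == TASK_IDLE)

int main(void)
{
	assert(is_blocked_state(TASK_UNINTERRUPTIBLE));	/* 'D': blocked      */
	assert(!is_blocked_state(TASK_IDLE));		/* idle: not blocked */
	assert(is_idle_state(TASK_IDLE));		/* idle: idle        */
	assert(!is_idle_state(TASK_UNINTERRUPTIBLE));	/* 'D': not idle     */
	assert(!is_idle_state(TASK_INTERRUPTIBLE));	/* 'S': neither      */
	return 0;
}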
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1566,11 +1566,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
unsigned int state;
state = READ_ONCE(p->__state);
- if (state & TASK_INTERRUPTIBLE)
+	/* idle state time is accounted as sleep time */
+	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
__schedstat_set(p->stats.sleep_start,
rq_clock(rq_of_dl_rq(dl_rq)));
- if (state & TASK_UNINTERRUPTIBLE)
+ if (is_blocked_state(state))
__schedstat_set(p->stats.block_start,
rq_clock(rq_of_dl_rq(dl_rq)));
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1278,10 +1278,11 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
/* XXX racy against TTWU */
state = READ_ONCE(tsk->__state);
- if (state & TASK_INTERRUPTIBLE)
+	/* idle state time is accounted as sleep time */
+	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
__schedstat_set(tsk->stats.sleep_start,
rq_clock(rq_of(cfs_rq)));
- if (state & TASK_UNINTERRUPTIBLE)
+ if (is_blocked_state(state))
__schedstat_set(tsk->stats.block_start,
rq_clock(rq_of(cfs_rq)));
}
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1371,11 +1371,12 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
unsigned int state;
state = READ_ONCE(p->__state);
- if (state & TASK_INTERRUPTIBLE)
+	/* idle state time is accounted as sleep time */
+	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
__schedstat_set(p->stats.sleep_start,
rq_clock(rq_of_rt_rq(rt_rq)));
- if (state & TASK_UNINTERRUPTIBLE)
+ if (is_blocked_state(state))
__schedstat_set(p->stats.block_start,
rq_clock(rq_of_rt_rq(rt_rq)));
}
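To make the accounting change concrete, a stand-alone sketch of the
dequeue-time classification before and after this patch (TASK_* values again
copied from a recent include/linux/sched.h; the real code sets sleep_start
and block_start independently, this simplification only shows which bucket
each state lands in):

#include <stdio.h>

#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002
#define TASK_NOLOAD		0x00000400
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

#define is_blocked_state(state) \
	(((state) & TASK_UNINTERRUPTIBLE) && !((state) & TASK_NOLOAD))
#define is_idle_state(state)	(((state) & TASK_IDLE) == TASK_IDLE)

/* pre-patch logic: idle tasks hit the TASK_UNINTERRUPTIBLE check */
static const char *old_bucket(unsigned int state)
{
	if (state & TASK_INTERRUPTIBLE)
		return "sleep_start";
	if (state & TASK_UNINTERRUPTIBLE)
		return "block_start";
	return "none";
}

/* post-patch logic: idle time is accounted as sleep, not block */
static const char *new_bucket(unsigned int state)
{
	if ((state & TASK_INTERRUPTIBLE) || is_idle_state(state))
		return "sleep_start";
	if (is_blocked_state(state))
		return "block_start";
	return "none";
}

int main(void)
{
	printf("S: old=%s new=%s\n", old_bucket(TASK_INTERRUPTIBLE),
	       new_bucket(TASK_INTERRUPTIBLE));		/* sleep / sleep */
	printf("D: old=%s new=%s\n", old_bucket(TASK_UNINTERRUPTIBLE),
	       new_bucket(TASK_UNINTERRUPTIBLE));	/* block / block */
	printf("I: old=%s new=%s\n", old_bucket(TASK_IDLE),
	       new_bucket(TASK_IDLE));			/* block / sleep: the change */
	return 0;
}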