sched/fair: remove util_est boosting

Message ID	20230706135144.324311-1-vincent.guittot@linaro.org
State	New
Headers
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
From: Vincent Guittot <vincent.guittot@linaro.org>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com, linux-kernel@vger.kernel.org
Cc: qyousef@layalina.io, Vincent Guittot <vincent.guittot@linaro.org>
Subject: [PATCH] sched/fair: remove util_est boosting
Date: Thu, 6 Jul 2023 15:51:44 +0200
Message-Id: <20230706135144.324311-1-vincent.guittot@linaro.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
Series	sched/fair: remove util_est boosting

Commit Message

Vincent Guittot July 6, 2023, 1:51 p.m. UTC

There is no need to use runnable_avg when estimating util_est, and doing so
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting the waking task p twice,
once with the blocked runnable_avg and once more when adding its
util_est.

The CPU's runnable_avg is already used when computing util_avg, which is then
compared with util_est.

In some situations, feec will not select prev_cpu but another CPU in the
same performance domain because of a higher max_util.

Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

Qais Yousef July 11, 2023, 3:47 p.m. UTC | #1

On 07/06/23 15:51, Vincent Guittot wrote:
> There is no need to use runnable_avg when estimating util_est and that
> even generates wrong behavior because one includes blocked tasks whereas
> the other one doesn't. This can lead to accounting twice the waking task p,
> once with the blocked runnable_avg and another one when adding its
> util_est.
> 
> cpu's runnable_avg is already used when computing util_avg which is then
> compared with util_est.
> 
> In some situation, feec will not select prev_cpu but another one on the
> same performance domain because of higher max_util
> 
> Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

Can we verify the numbers that introduced this magic boost are still valid
please?

Otherwise LGTM.


Thanks!

--

Qais Yousef


Vincent Guittot July 12, 2023, 3:30 p.m. UTC | #2

On Tue, 11 Jul 2023 at 17:47, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 07/06/23 15:51, Vincent Guittot wrote:
> > There is no need to use runnable_avg when estimating util_est and that
> > even generates wrong behavior because one includes blocked tasks whereas
> > the other one doesn't. This can lead to accounting twice the waking task p,
> > once with the blocked runnable_avg and another one when adding its
> > util_est.
> >
> > cpu's runnable_avg is already used when computing util_avg which is then
> > compared with util_est.
> >
> > In some situation, feec will not select prev_cpu but another one on the
> > same performance domain because of higher max_util
> >
> > Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
>
> Can we verify the numbers that introduced this magic boost are still valid
> please?

TBH I don't expect it but I agree it's worth checking. Dietmar, could
you rerun your tests with this change?

[...]

Dietmar Eggemann July 21, 2023, 4:09 p.m. UTC | #3

On 12/07/2023 17:30, Vincent Guittot wrote:
> On Tue, 11 Jul 2023 at 17:47, Qais Yousef <qyousef@layalina.io> wrote:
>>
>> On 07/06/23 15:51, Vincent Guittot wrote:
>>> There is no need to use runnable_avg when estimating util_est and that
>>> even generates wrong behavior because one includes blocked tasks whereas
>>> the other one doesn't. This can lead to accounting twice the waking task p,
>>> once with the blocked runnable_avg and another one when adding its
>>> util_est.

... and we don't have this issue for the util_avg case since we have:

7317         } else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
                             ^^^^^^^^^^^^^^^^^^
7318                 util += task_util(p);

>>> cpu's runnable_avg is already used when computing util_avg which is then
>>> compared with util_est.

We discussed why I have to use max(X, runnable) for X=util and
X=util_est in v2:

https://lkml.kernel.org/r/251b524a-2c44-3892-1bae-03f879d6a64b@arm.com

-->

I need the util_est = max(util_est, runnable) further down as well. Just
want to fetch runnable only once.

util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70

max(70 + 5, 60 + 10) != max (70 + 5, 70 + 10) when dst_cpu == cpu

<--
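
To make the arithmetic of that v2 example explicit, here is a minimal
sketch in C. It is not kernel code: simulated_util() and its parameters
are hypothetical names, and it only models the case where the task comes
from another CPU and dst_cpu == cpu, so both task contributions are added.

```c
/* Return the larger of two unsigned long values. */
static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical model of the v2 example: util is always boosted with
 * runnable, util_est only when boost_util_est is non-zero.
 */
static unsigned long simulated_util(unsigned long util,
				    unsigned long task_util,
				    unsigned long util_est,
				    unsigned long task_util_est,
				    unsigned long runnable,
				    int boost_util_est)
{
	unsigned long u = max_ul(util, runnable) + task_util;
	unsigned long e = util_est;

	if (boost_util_est)
		e = max_ul(e, runnable);
	e += task_util_est;

	return max_ul(u, e);
}
```

With util = 50, task_util = 5, util_est = 60, task_util_est = 10 and
runnable = 70, the unboosted util_est variant gives max(75, 70) = 75 and
the boosted one gives max(75, 80) = 80.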

But I assume your point is that:

7327       if (boost)
7328           util_est = max(util_est, runnable);

7356       if (dst_cpu == cpu)                                   <-- (1)
7357           util_est += _task_util_est(p);
7358       else if (p && unlikely(task_on_rq_queued(p) || current == p))
7359           lsub_positive(&util_est, _task_util_est(p));
7360
7361       util = max(util, util_est);

--> (1) doesn't work anymore in case `util_est == runnable`.

It will break the assumption behind the if condition described in
cpu_util()'s comment:

7331  * During wake-up (2) @p isn't enqueued yet and doesn't contribute
7332  * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
7333  * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
7334  * has been enqueued.

(2) eenv_pd_max_util() and find_energy_efficient_cpu() call-site.

<---

Rerunning the Jankbench tests on Pixel 6 will tell whether boosting only
util_avg, instead of both, still shows the anticipated results. The
likelihood is high that it will, since we do `util = max(util, util_est)`
at the end of cpu_util().

>>> In some situation, feec will not select prev_cpu but another one on the
>>> same performance domain because of higher max_util
>>>
>>> Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>
>> Can we verify the numbers that introduced this magic boost are still valid
>> please?
> 
> TBH I don't expect it but I agree it's worth checking. Dietmar could
> you rerun your tests with this change ?

Could do. But first let's understand the issue properly.

[...]

Vincent Guittot July 24, 2023, 1:06 p.m. UTC | #4

On Fri, 21 Jul 2023 at 18:09, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 12/07/2023 17:30, Vincent Guittot wrote:
> > On Tue, 11 Jul 2023 at 17:47, Qais Yousef <qyousef@layalina.io> wrote:
> >>
> >> On 07/06/23 15:51, Vincent Guittot wrote:
> >>> There is no need to use runnable_avg when estimating util_est and that
> >>> even generates wrong behavior because one includes blocked tasks whereas
> >>> the other one doesn't. This can lead to accounting twice the waking task p,
> >>> once with the blocked runnable_avg and another one when adding its
> >>> util_est.
>
> ... and we don't have this issue for the util_avg case since we have:
>
> 7317         } else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
>                              ^^^^^^^^^^^^^^^^^^
> 7318                 util += task_util(p);
>
> >>> cpu's runnable_avg is already used when computing util_avg which is then
> >>> compared with util_est.
>
> We discussed why I have to use max(X, runnable) for X=util and
> X=util_est in v2:
>
> https://lkml.kernel.org/r/251b524a-2c44-3892-1bae-03f879d6a64b@arm.com
>
> -->
>
> I need the util_est = max(util_est, runnable) further down as well. Just
> want to fetch runnable only once.
>
> util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70
>
> max(70 + 5, 60 + 10) != max (70 + 5, 70 + 10) when dst_cpu == cpu
>

Hmm, I don't get your point here. Why should they be equal?

Below is an example to describe my problem:

task A with util_avg=200 util_est=300 runnable=200
task A is attached to CPU0 so it contributes to CPU0's util_avg and
runnable_avg.

In eenv_pd_max_util() we call cpu_util(cpu, p, dst_cpu, 1) to get the
max utilization and the OPP to use to compute energy.

Let's say that there is nothing else running on CPU0 and CPU1 and that
they both belong to the same performance domain, so:
CPU0 util_avg=200 util_est=0 runnable_avg=200
CPU1 util_avg=0 util_est=0 runnable_avg=0

For CPU0, cpu_util(cpu, p, dst_cpu, 1) will return (200 + 300) = 500
For CPU1, cpu_util(cpu, p, dst_cpu, 1) will return (0 + 300) = 300

If there is an OPP with a capacity between these 2 values, CPU1 will
use a lower OPP than CPU0 and its computed energy will be lower.
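
The numbers above can be reproduced with a small toy model in C. This is
only a sketch under simplifying assumptions, not the kernel function:
struct cpu_sig, struct task_sig and model_cpu_util() are hypothetical
names, capacity clamping is left out, and only the wake-up case where the
task is not enqueued is modelled.

```c
/* Simplified per-CPU CFS signals; field names follow the discussion. */
struct cpu_sig {
	unsigned long util_avg;
	unsigned long util_est;		/* util_est.enqueued of the cfs_rq */
	unsigned long runnable_avg;
};

/* Simplified signals of the waking task. */
struct task_sig {
	int cpu;			/* CPU the task is attached to */
	unsigned long util_avg;
	unsigned long util_est;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Toy model of cpu_util(cpu, p, dst_cpu, 1) at wake-up.
 * boost_util_est = 1 models the behaviour before the patch,
 * boost_util_est = 0 the behaviour after it.
 */
static unsigned long model_cpu_util(const struct cpu_sig *c, int cpu,
				    const struct task_sig *p, int dst_cpu,
				    int boost_util_est)
{
	/* util_avg is boosted with runnable_avg in both variants. */
	unsigned long util = max_ul(c->util_avg, c->runnable_avg);
	unsigned long util_est = c->util_est;

	/* util_avg side: add or remove the task only if it changes CPU. */
	if (p->cpu == cpu && dst_cpu != cpu)
		util -= (util < p->util_avg) ? util : p->util_avg;
	else if (p->cpu != cpu && dst_cpu == cpu)
		util += p->util_avg;

	/* The boosting of util_est that the patch removes. */
	if (boost_util_est)
		util_est = max_ul(util_est, c->runnable_avg);

	/* Simulate the task being enqueued on dst_cpu. */
	if (dst_cpu == cpu)
		util_est += p->util_est;

	return max_ul(util, util_est);
}
```

With CPU0 = {200, 0, 200}, CPU1 = {0, 0, 0} and task A = {0, 200, 300},
the boosted variant returns 500 for CPU0 and 300 for CPU1, whereas the
unboosted variant returns 300 for both.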

The condition `if (max_spare_cap_cpu >= 0 && max_spare_cap >
prev_spare_cap)` filters out the cases where CPU0 and CPU1 have exactly
the same spare capacity. But we often see a smaller spare capacity on
CPU0 because of small side activities like cpufreq, timer, irq, rcu
... The difference is often only 1, but that is enough to bypass the
condition above. Task A will then migrate to CPU1 although there is no
need to. It will move back to CPU0 once CPU1 has the smaller spare capacity.
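
As a rough illustration of how little it takes to pass that filter, the
condition can be wrapped in a helper. other_cpu_is_candidate() is a
hypothetical name and the signed long types are an assumption, not
necessarily what fair.c uses.

```c
#include <stdbool.h>

/*
 * Hypothetical wrapper around the condition quoted above: another CPU of
 * the performance domain is only evaluated as an alternative to prev_cpu
 * when it has strictly more spare capacity.
 */
static bool other_cpu_is_candidate(int max_spare_cap_cpu,
				   long max_spare_cap, long prev_spare_cap)
{
	return max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap;
}
```

With made-up spare capacities of 824 on both CPUs the helper returns
false, but 824 on CPU1 against 823 on CPU0 is already enough for it to
return true.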

I ran a test on a Snapdragon RB5 with the latest tip/sched/core. I start
3 tasks: 1 large enough to be on the medium CPUs and 2 small enough to
stay on the little CPUs, for 30 seconds.
With tip/sched/core, the 3 tasks migrate around 3665 times.
With the patch, there are only 8 migrations, at the beginning of the test.

> <--
>
> But I assume your point is that:
>
> 7327       if (boost)
> 7328           util_est = max(util_est, runnable);
>
> 7356       if (dst_cpu == cpu)                                   <-- (1)
> 7357           util_est += _task_util_est(p);
> 7358       else if (p && unlikely(task_on_rq_queued(p) || current == p))
> 7359           lsub_positive(&util_est, _task_util_est(p));
> 7360
> 7361       util = max(util, util_est);
>
> --> (1) doesn't work anymore in case `util_est == runnable`.
>
> It will break the assumption for the if condition depicted in
> cpu_util()'s comment:

exactly
>
> 7331  * During wake-up (2) @p isn't enqueued yet and doesn't contribute
> 7332  * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
> 7333  * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
> 7334  * has been enqueued.
>
> (2) eenv_pd_max_util() and find_energy_efficient_cpu() call-site.
>
> <---
>
> Rerunning Jankbench tests on Pix6 will tell if boosting util_avg instead
> of both will still show the anticipated results. Likelihood is high that
> it will since we do `util = max(util, util_est)` at the end of cpu_util().

I think the same.

[...]

Dietmar Eggemann July 24, 2023, 9:11 p.m. UTC | #5

On 24/07/2023 15:06, Vincent Guittot wrote:
> On Fri, 21 Jul 2023 at 18:09, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 12/07/2023 17:30, Vincent Guittot wrote:
>>> On Tue, 11 Jul 2023 at 17:47, Qais Yousef <qyousef@layalina.io> wrote:
>>>>
>>>> On 07/06/23 15:51, Vincent Guittot wrote:

[...]

>> -->
>>
>> I need the util_est = max(util_est, runnable) further down as well. Just
>> want to fetch runnable only once.
>>
>> util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70
>>
>> max(70 + 5, 60 + 10) != max (70 + 5, 70 + 10) when dst_cpu == cpu
>>
> 
> Hmm, I don't get your point here. Why should they be equal ?
> 
> Below is a example to describe my problem:
> 
> task A with util_avg=200 util_est=300 runnable=200
> task A is attached to CPU0 so it contributes to CPU0's util_avg and
> runnable_avg.
> 
> In eenv_pd_max_util() we call cpu_util(cpu, p, dst_cpu, 1) to get the
> max utilization and the OPP to use to compute energy.
> 
> Let say that there is nothing else running on CPU0 and CPU1 and the
> both belong to the same performance domain so
> CPU0 util_avg= 200 util_est=0 runnable_avg=200
> CPU1 util_avg=0 util_est=0 runnable_avg=0
> 
> For CPU0, cpu_util(cpu, p, dst_cpu, 1) will return (200 + 300) = 500
> For CPU1, cpu_util(cpu, p, dst_cpu, 1) will return (0 + 300) = 300
> 
> If there is an OPP with a capacity between these 2 values, CPU1 will
> use a lower OPP than CPU0 and its computed energy will be lower.
> 
> The condition  if (max_spare_cap_cpu >= 0 && max_spare_cap >
> prev_spare_cap) filters some cases when CPU0 and CPU1 have the exact
> same spare capacity. But we often see a smaller spare capacity for
> CPU0 because of small side activities like cpufreq, timer, irq, rcu
> ... The difference is often only 1 but enough to bypass the condition
> above. task A will migrate to CPU1 whereas there is no need. Then it
> will move back to CPU0 once CPU1 will have a smaller spare capacity
> 
> I ran a test on snapdragon RB5 with the latest tip/sched/core. I start
> 3 tasks: 1 large enough to be on medium CPUs and 2 small enough to
> stay on little CPUs during 30 seconds
> With tip/sched/core, the 3 tasks are migrating around 3665
> With the patch, there is only 8 migration at the beginning of the test

I agree with this. The fact that cfs_rq->avg.runnable_avg contains
blocked contributions from task A makes it unsuitable for the util_est
(no blocked contributions) `if (dst_cpu == cpu)` condition, since in that
case we don't want to add A's util_est on top of it to simulate, during
wakeup, that A is enqueued.

>> <--
>>
>> But I assume your point is that:
>>
>> 7327       if (boost)
>> 7328           util_est = max(util_est, runnable);
>>
>> 7356       if (dst_cpu == cpu)                                   <-- (1)
>> 7357           util_est += _task_util_est(p);
>> 7358       else if (p && unlikely(task_on_rq_queued(p) || current == p))
>> 7359           lsub_positive(&util_est, _task_util_est(p));
>> 7360
>> 7361       util = max(util, util_est);
>>
>> --> (1) doesn't work anymore in case `util_est == runnable`.
>>
>> It will break the assumption for the if condition depicted in
>> cpu_util()'s comment:
> 
> exactly

OK.

>>
>> 7331  * During wake-up (2) @p isn't enqueued yet and doesn't contribute
>> 7332  * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
>> 7333  * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
>> 7334  * has been enqueued.
>>
>> (2) eenv_pd_max_util() and find_energy_efficient_cpu() call-site.
>>
>> <---
>>
>> Rerunning Jankbench tests on Pix6 will tell if boosting util_avg instead
>> of both will still show the anticipated results. Likelihood is high that
>> it will since we do `util = max(util, util_est)` at the end of cpu_util().
> 
>  I think the same

Reran the Jankbench test with the patch (fix) on exactly the same
platform (Pixel6, Android 12) I used for v3 (base, runnable):

https://lkml.kernel.org/r/20230515115735.296329-1-dietmar.eggemann@arm.com

Max_frame_duration:
+-----------------+------------+
|     kernel      | value [ms] |
+-----------------+------------+
|      base       |   163.1    |
|    runnable     |   162.0    |
|       fix       |   157.1    |
+-----------------+------------+

Mean_frame_duration:
+-----------------+------------+----------+
|     kernel      | value [ms] | diff [%] |
+-----------------+------------+----------+
|      base       |    18.0    |    0.0   |
|    runnable     |    12.7    |  -29.43  |
|       fix       |    13.0    |  -27.78  |
+-----------------+------------+----------+

Jank percentage (Jank deadline 16ms):
+-----------------+------------+----------+
|     kernel      | value [%]  | diff [%] |
+-----------------+------------+----------+
|      base       |     3.6    |    0.0   |
|    runnable     |     1.0    |  -68.86  |
|       fix       |     1.0    |  -68.86  |
+-----------------+------------+----------+

Power usage [mW] (total - all CPUs):
+-----------------+------------+----------+
|     kernel      | value [mW] | diff [%] |
+-----------------+------------+----------+
|      base       |    129.5   |    0.0   |
|    runnable     |    134.3   |   3.71   |
|       fix       |    129.9   |   0.31   |
+-----------------+------------+----------+

Test results look good to me.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]

Qais Yousef July 27, 2023, 11:56 a.m. UTC | #6

On 07/24/23 23:11, Dietmar Eggemann wrote:

> Reran the Jankbench test with the patch (fix) on exactly the same
> platform (Pixel6, Android 12) I used for v3 (base, runnable):
> 
> https://lkml.kernel.org/r/20230515115735.296329-1-dietmar.eggemann@arm.com
> 
> Max_frame_duration:
> +-----------------+------------+
> |     kernel      | value [ms] |
> +-----------------+------------+
> |      base       |   163.1    |
> |    runnable     |   162.0    |
> |       fix       |   157.1    |
> +-----------------+------------+
> 
> Mean_frame_duration:
> +-----------------+------------+----------+
> |     kernel      | value [ms] | diff [%] |
> +-----------------+------------+----------+
> |      base       |    18.0    |    0.0   |
> |    runnable     |    12.7    |  -29.43  |
> |       fix       |    13.0    |  -27.78  |
> +-----------------+------------+----------+
> 
> Jank percentage (Jank deadline 16ms):
> +-----------------+------------+----------+
> |     kernel      | value [%]  | diff [%] |
> +-----------------+------------+----------+
> |      base       |     3.6    |    0.0   |
> |    runnable     |     1.0    |  -68.86  |
> |       fix       |     1.0    |  -68.86  |
> +-----------------+------------+----------+
> 
> Power usage [mW] (total - all CPUs):
> +-----------------+------------+----------+
> |     kernel      | value [mW] | diff [%] |
> +-----------------+------------+----------+
> |      base       |    129.5   |    0.0   |
> |    runnable     |    134.3   |   3.71   |
> |       fix       |    129.9   |   0.31   |
> +-----------------+------------+----------+
> 
> Test results look good to me.
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks for re-running the test!


Cheers

--
Qais Yousef

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..77c9f5816c31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,9 +7289,6 @@  cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
 
 		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
 
-		if (boost)
-			util_est = max(util_est, runnable);
-
 		/*
 		 * During wake-up @p isn't enqueued yet and doesn't contribute
 		 * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.