[v2,1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

Message ID 20230205224318.2035646-2-qyousef@layalina.io
State New
Series Fix a couple of corner cases in feec() when using uclamp_max

Commit Message

Qais Yousef Feb. 5, 2023, 10:43 p.m. UTC
When uclamp_max is being used, the util of the task could be higher than
the spare capacity of the CPU, but due to the uclamp_max value we force
fit it there.

The way the condition for checking max_spare_cap in
find_energy_efficient_cpu() was constructed, it ignored any CPU whose
spare_cap is less than or _equal_ to max_spare_cap. Since we initialize
max_spare_cap to 0, this led to never setting max_spare_cap_cpu, hence
never performing compute_energy() for this cluster and missing an
opportunity for a more energy-efficient placement that honours the
uclamp_max setting.

	max_spare_cap = 0;
	cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high

	...

	util_fits_cpu(...);		// will return true if uclamp_max forces it to fit

	...

	// this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
	if (cpu_cap > max_spare_cap) {
		max_spare_cap = cpu_cap;
		max_spare_cap_cpu = cpu;
	}

prev_spare_cap suffers from a similar problem.

Fix the logic by converting the variables into long and treating a value
of -1 as 'not populated' instead of 0, which is a viable and correct
spare capacity value.
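
A minimal sketch of the intended logic (simplified; the cast keeps the
comparison signed so that the -1 sentinel works):

	long prev_spare_cap = -1, max_spare_cap = -1;	// -1 == 'not populated', 0 is valid

	...

	// a spare capacity of 0 is now a valid candidate
	if ((long)cpu_cap > max_spare_cap) {
		max_spare_cap = cpu_cap;
		max_spare_cap_cpu = cpu;
	}

	...

	// only skip the pd if truly nothing fit in it
	if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
		continue;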

Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/fair.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
  

Comments

Vincent Guittot Feb. 7, 2023, 9:45 a.m. UTC | #1
On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
>
> When uclamp_max is being used, the util of the task could be higher than
> the spare capacity of the CPU, but due to uclamp_max value we force fit
> it there.
>
> The way the condition for checking for max_spare_cap in
> find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> hence ending up never performing compute_energy() for this cluster and
> missing an opportunity for a better energy efficient placement to honour
> uclamp_max setting.
>
>         max_spare_cap = 0;
>         cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
>
>         ...
>
>         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
>
>         ...
>
>         // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
>         if (cpu_cap > max_spare_cap) {
>                 max_spare_cap = cpu_cap;
>                 max_spare_cap_cpu = cpu;
>         }
>
> prev_spare_cap suffers from a similar problem.
>
> Fix the logic by converting the variables into long and treating -1
> value as 'not populated' instead of 0 which is a viable and correct
> spare capacity value.
>
> Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/fair.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c6c8e7f52935..7a21ee74139f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>         for (; pd; pd = pd->next) {
>                 unsigned long util_min = p_util_min, util_max = p_util_max;
>                 unsigned long cpu_cap, cpu_thermal_cap, util;
> -               unsigned long cur_delta, max_spare_cap = 0;
> +               long prev_spare_cap = -1, max_spare_cap = -1;
>                 unsigned long rq_util_min, rq_util_max;
> -               unsigned long prev_spare_cap = 0;
> +               unsigned long cur_delta, base_energy;
>                 int max_spare_cap_cpu = -1;
> -               unsigned long base_energy;
>                 int fits, max_fits = -1;
>
>                 cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>                         }
>                 }
>
> -               if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
> +               if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
>                         continue;
>
>                 eenv_pd_busy_time(&eenv, cpus, p);
> @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>                 base_energy = compute_energy(&eenv, pd, cpus, p, -1);
>
>                 /* Evaluate the energy impact of using prev_cpu. */
> -               if (prev_spare_cap > 0) {
> +               if (prev_spare_cap > -1) {
>                         prev_delta = compute_energy(&eenv, pd, cpus, p,
>                                                     prev_cpu);
>                         /* CPU utilization has changed */

I think that you also need the change below to make sure that the
signed comparison will be used. I have quickly checked the assembly
code for aarch64 and your patch keeps using unsigned comparison (b.ls)
   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
ffff8000080e4c98: eb00003f cmp x1, x0
ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0 <select_task_rq_fair+0x570>  // b.plast

Whereas the change below makes it use the signed version (b.le)
   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
ffff8000080e4c98: eb00003f cmp x1, x0
ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570>

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
                                prev_spare_cap = cpu_cap;
                                prev_fits = fits;
                        } else if ((fits > max_fits) ||
-                                  ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+                                  ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
                                /*
                                 * Find the CPU with the maximum spare capacity
                                 * among the remaining CPUs in the performance
> --
> 2.25.1
>
  
Dietmar Eggemann Feb. 9, 2023, 6:02 p.m. UTC | #2
On 07/02/2023 10:45, Vincent Guittot wrote:
> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
>>
>> When uclamp_max is being used, the util of the task could be higher than
>> the spare capacity of the CPU, but due to uclamp_max value we force fit
>> it there.
>>
>> The way the condition for checking for max_spare_cap in
>> find_energy_efficient_cpu() was constructed; it ignored any CPU that has
>> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
>> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
>> hence ending up never performing compute_energy() for this cluster and
>> missing an opportunity for a better energy efficient placement to honour
>> uclamp_max setting.
>>
>>         max_spare_cap = 0;
>>         cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
>>
>>         ...
>>
>>         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit

s/true/1/ ?

>>
>>         ...
>>
>>         // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
>>         if (cpu_cap > max_spare_cap) {
>>                 max_spare_cap = cpu_cap;
>>                 max_spare_cap_cpu = cpu;
>>         }
>>
>> prev_spare_cap suffers from a similar problem.
>>
>> Fix the logic by converting the variables into long and treating -1
>> value as 'not populated' instead of 0 which is a viable and correct
>> spare capacity value.

The issue I see here is that in case we don't have any spare capacity left,
the energy calculation (in terms of CPU utilization) isn't correct anymore.

Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
you never know how big the `busy_time` for the PD really is at this moment.

eenv_pd_busy_time()

  for_each_cpu(cpu, pd_cpus)
    busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
    ^^^^^^^^^

with:

  sum_util = min(busy_time + task_busy_time, pd_cap)
                 ^^^^^^^^^

  freq = (1.25 * max_util / max) * max_freq

  energy = (perf_state(freq)->cost / max) * sum_util


energy is not related to CPU utilization anymore (since there is no idle
time/spare capacity left).

So EAS keeps packing on the cheaper PD/clamped OPP.

E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
tasks all running on little PD, cpumask=0x39. The big PD is idle but
never beats prev_cpu in terms of energy consumption for the wakee.
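
To make the effect concrete, a rough sketch of the sum_util clamping
(numbers assumed for illustration; this follows the model above, not the
exact kernel code):

	/* assumed numbers, illustration only */
	unsigned long pd_cap         = 2048;	/* e.g. 2 big CPUs x 1024 */
	unsigned long pd_busy_time   = 1998;	/* PD already ~saturated */
	unsigned long task_busy_time = 921;

	unsigned long base_sum = min(pd_busy_time, pd_cap);			/* 1998 */
	unsigned long dst_sum  = min(pd_busy_time + task_busy_time, pd_cap);	/* 2048 */

	/*
	 * energy scales with sum_util, so the energy diff for placing the
	 * task reflects only dst_sum - base_sum = 50 util, although the
	 * task really contributes ~921: the saturated PD looks too cheap.
	 */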

[...]
  
Qais Yousef Feb. 11, 2023, 5:28 p.m. UTC | #3
On 02/07/23 10:45, Vincent Guittot wrote:
> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > When uclamp_max is being used, the util of the task could be higher than
> > the spare capacity of the CPU, but due to uclamp_max value we force fit
> > it there.
> >
> > The way the condition for checking for max_spare_cap in
> > find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> > hence ending up never performing compute_energy() for this cluster and
> > missing an opportunity for a better energy efficient placement to honour
> > uclamp_max setting.
> >
> >         max_spare_cap = 0;
> >         cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
> >
> >         ...
> >
> >         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
> >
> >         ...
> >
> >         // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
> >         if (cpu_cap > max_spare_cap) {
> >                 max_spare_cap = cpu_cap;
> >                 max_spare_cap_cpu = cpu;
> >         }
> >
> > prev_spare_cap suffers from a similar problem.
> >
> > Fix the logic by converting the variables into long and treating -1
> > value as 'not populated' instead of 0 which is a viable and correct
> > spare capacity value.
> >
> > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >  kernel/sched/fair.c | 9 ++++-----
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c6c8e7f52935..7a21ee74139f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >         for (; pd; pd = pd->next) {
> >                 unsigned long util_min = p_util_min, util_max = p_util_max;
> >                 unsigned long cpu_cap, cpu_thermal_cap, util;
> > -               unsigned long cur_delta, max_spare_cap = 0;
> > +               long prev_spare_cap = -1, max_spare_cap = -1;
> >                 unsigned long rq_util_min, rq_util_max;
> > -               unsigned long prev_spare_cap = 0;
> > +               unsigned long cur_delta, base_energy;
> >                 int max_spare_cap_cpu = -1;
> > -               unsigned long base_energy;
> >                 int fits, max_fits = -1;
> >
> >                 cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                         }
> >                 }
> >
> > -               if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
> > +               if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> >                         continue;
> >
> >                 eenv_pd_busy_time(&eenv, cpus, p);
> > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                 base_energy = compute_energy(&eenv, pd, cpus, p, -1);
> >
> >                 /* Evaluate the energy impact of using prev_cpu. */
> > -               if (prev_spare_cap > 0) {
> > +               if (prev_spare_cap > -1) {
> >                         prev_delta = compute_energy(&eenv, pd, cpus, p,
> >                                                     prev_cpu);
> >                         /* CPU utilization has changed */
> 
> I think that you also need the change below to make sure that the
> signed comparison will be used. I have quickly checked the assembly
> code for aarch64 and your patch keeps using unsigned comparison (b.ls)
>    ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> ffff8000080e4c98: eb00003f cmp x1, x0
> ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0
> <select_task_rq_fair+0x570>  // b.plast
> 
> Whereas the change below make it to use the signed version (b.le)
>    ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> ffff8000080e4c98: eb00003f cmp x1, x0
> ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570>
> 
> -- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct
> task_struct *p, int prev_cpu)
>                                 prev_spare_cap = cpu_cap;
>                                 prev_fits = fits;
>                         } else if ((fits > max_fits) ||
> -                                  ((fits == max_fits) && (cpu_cap >
> max_spare_cap))) {
> +                                  ((fits == max_fits) &&
> ((long)cpu_cap > max_spare_cap))) {
>                                 /*
>                                  * Find the CPU with the maximum spare capacity
>                                  * among the remaining CPUs in the performance

Isn't it better to go back to the v1 form then? The inconsistent type paired with
the cast is getting too ugly for me :(

I don't think we can convert cpu_cap to long without having to do more work as
it is used with 'util'.


Cheers

--
Qais Yousef
  
Qais Yousef Feb. 11, 2023, 5:50 p.m. UTC | #4
On 02/09/23 19:02, Dietmar Eggemann wrote:
> On 07/02/2023 10:45, Vincent Guittot wrote:
> > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> >>
> >> When uclamp_max is being used, the util of the task could be higher than
> >> the spare capacity of the CPU, but due to uclamp_max value we force fit
> >> it there.
> >>
> >> The way the condition for checking for max_spare_cap in
> >> find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> >> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> >> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> >> hence ending up never performing compute_energy() for this cluster and
> >> missing an opportunity for a better energy efficient placement to honour
> >> uclamp_max setting.
> >>
> >>         max_spare_cap = 0;
> >>         cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
> >>
> >>         ...
> >>
> >>         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
> 
> s/true/1/ ?
> 
> >>
> >>         ...
> >>
> >>         // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
> >>         if (cpu_cap > max_spare_cap) {
> >>                 max_spare_cap = cpu_cap;
> >>                 max_spare_cap_cpu = cpu;
> >>         }
> >>
> >> prev_spare_cap suffers from a similar problem.
> >>
> >> Fix the logic by converting the variables into long and treating -1
> >> value as 'not populated' instead of 0 which is a viable and correct
> >> spare capacity value.
> 
> The issue I see here is in case we don't have any spare capacity left,
> the energy calculation (in terms CPU utilization) isn't correct anymore.
> 
> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
> you never know how big the `busy_time` for the PD really is in this moment.
> 
> eenv_pd_busy_time()
> 
>   for_each_cpu(cpu, pd_cpus)
>     busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
>     ^^^^^^^^^
> 
> with:
> 
>   sum_util = min(busy_time + task_busy_time, pd_cap)
>                  ^^^^^^^^^
> 
>   freq = (1.25 * max_util / max) * max_freq
> 
>   energy = (perf_state(freq)->cost / max) * sum_util
> 
> 
> energy is not related to CPU utilization anymore (since there is no idle
> time/spare capacity) left.

Am I right that what you're saying is that the energy calculation for the PD
will be capped to a certain value and this is why you think the energy is
incorrect?

What should be the correct energy (in theory at least)?

> 
> So EAS keeps packing on the cheaper PD/clamped OPP.

Which is the desired behavior for uclamp_max?

The only issue I see is that we want to distribute within a pd, which is
something I was going to work on and send later - but I can lump it into this
series if it helps.

> 
> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
> tasks all running on little PD, cpumask=0x39. The big PD is idle but
> never beats prev_cpu in terms of energy consumption for the wakee.

IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
is twofold:

	1. Prevent tasks from consuming too much energy.
	2. Keep them away from expensive CPUs.

2 is actually very important for 2 reasons:

	a. Because of max aggregation - any uncapped task that wakes up will
	   cause a frequency spike on this 'expensive' cpu. We don't have
	   a mechanism to downmigrate it - which is another thing I'm working
	   on.
	b. It is desired to keep these bigger cpus idle and ready for more
	   important work.

For 2, generally we don't want these tasks to steal bandwidth from these CPUs
that we'd like to preserve for other types of work.

Of course userspace has control by selecting the right uclamp_max value. It
can increase it to allow a spill to the next pd - or keep it low to steer the
tasks more strongly onto a specific pd.


Cheers

--
Qais Yousef
  
Dietmar Eggemann Feb. 14, 2023, 12:47 p.m. UTC | #5
On 11/02/2023 18:50, Qais Yousef wrote:
> On 02/09/23 19:02, Dietmar Eggemann wrote:
>> On 07/02/2023 10:45, Vincent Guittot wrote:
>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:

[...]

>>>> Fix the logic by converting the variables into long and treating -1
>>>> value as 'not populated' instead of 0 which is a viable and correct
>>>> spare capacity value.
>>
>> The issue I see here is in case we don't have any spare capacity left,
>> the energy calculation (in terms CPU utilization) isn't correct anymore.
>>
>> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
>> you never know how big the `busy_time` for the PD really is in this moment.
>>
>> eenv_pd_busy_time()
>>
>>   for_each_cpu(cpu, pd_cpus)
>>     busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
>>     ^^^^^^^^^
>>
>> with:
>>
>>   sum_util = min(busy_time + task_busy_time, pd_cap)
>>                  ^^^^^^^^^
>>
>>   freq = (1.25 * max_util / max) * max_freq
>>
>>   energy = (perf_state(freq)->cost / max) * sum_util
>>
>>
>> energy is not related to CPU utilization anymore (since there is no idle
>> time/spare capacity) left.
> 
> Am I right that what you're saying is that the energy calculation for the PD
> will be capped to a certain value and this is why you think the energy is
> incorrect?
> 
> What should be the correct energy (in theory at least)?

The energy value for the perf_state is correct but the decision based 
on `energy diff` in which PD to put the task might not be.

In case CPUx already has some tasks, its `pd_busy_time` contribution 
is still capped by its capacity_orig.

eenv_pd_busy_time() -> cpu_util_next()
                         return min(util, capacity_orig_of(cpu))

In this case we can't use `energy diff` anymore to make the decision 
where to put the task.

The base_energy of CPUx's PD is already so high that the 
`energy diff = cur_energy - base_energy` is small enough so that CPUx 
is selected as target again.

>> So EAS keeps packing on the cheaper PD/clamped OPP.

Sometimes yes, but there are occurrences in which a big CPU ends up
with the majority of the tasks. It depends on the start condition.

> Which is the desired behavior for uclamp_max?

Not sure. Essentially, EAS can't do its job properly if there is no idle 
time left. As soon as uclamp_max restricts the system (like in my
example) task placement decisions via EAS (min energy diff based) can be
wrong. 

> The only issue I see is that we want to distribute within a pd. Which is
> something I was going to work on and send after later - but can lump it in this
> series if it helps.

I assume you refer to

    } else if ((fits > max_fits) ||
        ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {

here?

>> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
>> tasks all running on little PD, cpumask=0x39. The big PD is idle but
>> never beats prev_cpu in terms of energy consumption for the wakee.
> 
> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> is two folds:
> 
> 	1. Prevent tasks from consuming energy.
> 	2. Keep them away from expensive CPUs.

Like I tried to reason about above, the energy diff based task placement
can't always assure this.

> 
> 2 is actually very important for 2 reasons:
> 
> 	a. Because of max aggregation - any uncapped tasks that wakes up will
> 	   cause a frequency spike on this 'expensive' cpu. We don't have
> 	   a mechanism to downmigrate it - which is another thing I'm working
> 	   on.

True. And it can also lead to CPU overutilization which then leads to
load-balancing kicking in.

> 	b. It is desired to keep these bigger cpu idle ready for more important
> 	   work.

But this is not happening all the time. Using the same workload
(6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446]
sometimes ends up with all 6 tasks on big CPU1.

A corresponding EAS task placement looks like this one:

<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994]  
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004]

<idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048

<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1  energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024
                                                            ^^^^^^^^^^^^^
                                                            diff = 20,568

<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[0]=0 cpu_util[f|e]=[440 446] 
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[3]=0 cpu_util[f|e]=[6 6]
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[4]=0 cpu_util[f|e]=[440 446] 
<idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[5]=0 cpu_util[f|e]=[440 446] 

<idle>-0 [005] select_task_rq_fair: CPU5 cpu=0 busy_time=446 util=446 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=3 busy_time=452 util=2 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=4 busy_time=898 util=446 pd_cap=1784
<idle>-0 [005] select_task_rq_fair: CPU5 cpu=5 busy_time=907 util=1 pd_cap=1784

<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=242002 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=907 cpu_cap=446
<idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=5  energy=476000 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=1784 cpu_cap=446
                                                            ^^^^^^^^^^^^^
                                                            diff = 233,998

<idle>-0 [005] select_task_rq_fair: p=[task0-5 2551] target=1

> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> that we'd like to preserve for other type of work.
> 
> Of course userspace has control by selecting the right uclamp_max value. They
> can increase it to allow a spill to next pd - or keep it low to steer them more
> strongly on a specific pd.

[...]
  
Qais Yousef Feb. 14, 2023, 6:09 p.m. UTC | #6
On 02/14/23 13:47, Dietmar Eggemann wrote:
> On 11/02/2023 18:50, Qais Yousef wrote:
> > On 02/09/23 19:02, Dietmar Eggemann wrote:
> >> On 07/02/2023 10:45, Vincent Guittot wrote:
> >>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> 
> [...]
> 
> >>>> Fix the logic by converting the variables into long and treating -1
> >>>> value as 'not populated' instead of 0 which is a viable and correct
> >>>> spare capacity value.
> >>
> >> The issue I see here is in case we don't have any spare capacity left,
> >> the energy calculation (in terms CPU utilization) isn't correct anymore.
> >>
> >> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
> >> you never know how big the `busy_time` for the PD really is in this moment.
> >>
> >> eenv_pd_busy_time()
> >>
> >>   for_each_cpu(cpu, pd_cpus)
> >>     busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
> >>     ^^^^^^^^^
> >>
> >> with:
> >>
> >>   sum_util = min(busy_time + task_busy_time, pd_cap)
> >>                  ^^^^^^^^^
> >>
> >>   freq = (1.25 * max_util / max) * max_freq
> >>
> >>   energy = (perf_state(freq)->cost / max) * sum_util
> >>
> >>
> >> energy is not related to CPU utilization anymore (since there is no idle
> >> time/spare capacity) left.
> > 
> > Am I right that what you're saying is that the energy calculation for the PD
> > will be capped to a certain value and this is why you think the energy is
> > incorrect?
> > 
> > What should be the correct energy (in theory at least)?
> 
> The energy value for the perf_state is correct but the decision based 
> on `energy diff` in which PD to put the task might not be.
> 
> In case CPUx already has some tasks, its `pd_busy_time` contribution 
> is still capped by its capacity_orig.
> 
> eenv_pd_busy_time() -> cpu_util_next()
>                          return min(util, capacity_orig_of(cpu))
> 
> In this case we can't use `energy diff` anymore to make the decision 
> where to put the task.
> 
> The base_energy of CPUx's PD is already so high that the 
> `energy diff = cur_energy - base_energy` is small enough so that CPUx 
> is selected as target again.

I'm sorry, bear with me as I'm still struggling to fully understand the case.

You're thinking that if all the CPUs in the pd are _already_ fully busy before
adding this newly woken up task there, then we'll end up with the wrong
energy_diff? IOW, what you're saying is that when the pd is fully loaded, the
cost of adding extra load will always look cheap?

> 
> >> So EAS keeps packing on the cheaper PD/clamped OPP.
> 
> Sometimes yes, but there are occurrences in which a big CPU ends up
> with the majority of the tasks. It depends on the start condition.

It should depend on the energy cost, yes. Which does depend on the current
state of the system.

> 
> > Which is the desired behavior for uclamp_max?
> 
> Not sure. Essentially, EAS can't do its job properly if there is no idle 

This "not sure" statement is making me worried. Are you not sure about how
uclamp_max should work in force fitting big tasks into small cores? Or just
about handling some of the corner cases? I understood the former, not the
latter.

> time left. As soon as uclamp_max restricts the system (like in my
> example) task placement decisions via EAS (min energy diff based) can be
> wrong. 

I'm assuming 'no idle time' refers to the pd being fully loaded already
_before_ placing the newly woken up task. If you refer to it not having idle
time _after_ placing it - then I'm confused, as one of the goals of uclamp_max
is to act as a task placement hint, and if it can't do that in these simple
scenarios, it won't be very useful? I can appreciate it failing in some corner
cases. But for example if a little core is idle and a 1024 task wakes up with
a uclamp_max that makes the little a candidate; then yeah it'll not leave any
idle time on that cpu - but placing it there (if it is the energy-efficient cpu)
is the desired effect, no?

> 
> > The only issue I see is that we want to distribute within a pd. Which is
> > something I was going to work on and send after later - but can lump it in this
> > series if it helps.
> 
> I assume you refer to
> 
>     } else if ((fits > max_fits) ||
>         ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> 
> here?

Yes. If there are multiple cpus with the same max_spare_cap maybe we can
distribute among them rather than always pick the first one.
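
One possible shape for that, as a purely illustrative sketch (a reservoir-style
random pick among the tied CPUs; nr_ties and the random helper are assumptions
for the example, not part of this series):

	/* illustration only: pick uniformly among CPUs tied on max_spare_cap */
	if ((long)cpu_cap > max_spare_cap) {
		max_spare_cap = cpu_cap;
		max_spare_cap_cpu = cpu;
		nr_ties = 1;
	} else if ((long)cpu_cap == max_spare_cap &&
		   get_random_u32_below(++nr_ties) == 0) {
		/* replace the current pick with probability 1/nr_ties */
		max_spare_cap_cpu = cpu;
	}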

> 
> >> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
> >> tasks all running on little PD, cpumask=0x39. The big PD is idle but
> >> never beats prev_cpu in terms of energy consumption for the wakee.
> > 
> > IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> > is two folds:
> > 
> > 	1. Prevent tasks from consuming energy.
> > 	2. Keep them away from expensive CPUs.
> 
> Like I tried to reason about above, the energy diff based task placement
> can't always assure this.

Re assure: it is not expected to be a 100% guarantee. But it shouldn't fail
in simple cases either.

> 
> > 
> > 2 is actually very important for 2 reasons:
> > 
> > 	a. Because of max aggregation - any uncapped tasks that wakes up will
> > 	   cause a frequency spike on this 'expensive' cpu. We don't have
> > 	   a mechanism to downmigrate it - which is another thing I'm working
> > 	   on.
> 
> True. And it can also lead to CPU overutilization which then leads to
> load-balancing kicking in.

Yep.

> 
> > 	b. It is desired to keep these bigger cpu idle ready for more important
> > 	   work.
> 
> But this is not happening all the time. Using the same workload
> (6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446]
> sometimes ends up with all 6 tasks on big CPU1.

This seems similar to a case I see on pinebook pro but with uclamp_min.

$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
/sys/devices/system/cpu/cpu0/cpu_capacity:381
/sys/devices/system/cpu/cpu1/cpu_capacity:381
/sys/devices/system/cpu/cpu2/cpu_capacity:381
/sys/devices/system/cpu/cpu3/cpu_capacity:381
/sys/devices/system/cpu/cpu4/cpu_capacity:1024
/sys/devices/system/cpu/cpu5/cpu_capacity:1024

ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
253049258065, 4, 0, 381, 1024, 10562
253049269732, 3, 0, 381, 1024, 18814
253065609673, 4, 0, 381, 1024, 10562
253065621340, 3, 0, 381, 1024, 17874
253082033696, 4, 0, 381, 1024, 14669
253082045071, 2, 0, 381, 1024, 17403
253098438637, 4, 0, 381, 1024, 14082
253098450011, 3, 0, 381, 1024, 17403
253114803452, 4, 0, 381, 1024, 17016
253114814827, 0, 0, 381, 1024, 16933
253131192435, 4, 0, 381, 1024, 15843
253131204101, 2, 0, 381, 1024, 16933
253147557125, 4, 0, 381, 1024, 18776
253147568500, 0, 0, 381, 1024, 15992
253163935608, 4, 0, 381, 1024, 17603
253163946108, 0, 0, 381, 1024, 15522
253180299382, 4, 0, 381, 1024, 17016
253180306965, 3, 0, 381, 1024, 26811
253196598074, 4, 0, 381, 1024, 16429
253196606532, 0, 0, 381, 1024, 25870
253212953723, 4, 0, 381, 1024, 15256
253212965681, 0, 0, 381, 1024, 25400
253229376288, 4, 0, 381, 1024, 19363

With uclamp_max the energy looks different, but p_util is different too, which
has an impact on the compute_energy() calculations:

ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
760154735422179, 4, 407, 0, 381, 237058
760154735426845, 0, 407, 0, 381, 192382
760154751547464, 4, 407, 0, 381, 237058
760154751552131, 0, 407, 0, 381, 191912
760154767690833, 4, 407, 0, 381, 237058
760154767696667, 0, 407, 0, 381, 202730
760154783818744, 4, 407, 0, 381, 237058
760154783823994, 0, 407, 0, 381, 198967
760154799944613, 4, 407, 0, 381, 237058
760154799949280, 0, 407, 0, 381, 198967
760154816074565, 4, 407, 0, 381, 237058
760154816079232, 0, 407, 0, 381, 195675
760154832201309, 4, 407, 0, 381, 237058
760154832205976, 0, 407, 0, 381, 195204
760154848328053, 4, 407, 0, 381, 237058
760154848333012, 0, 407, 0, 381, 193793
760154864453631, 4, 407, 0, 381, 237058
760154864458297, 0, 407, 0, 381, 193793
760154880578333, 4, 407, 0, 381, 237058
760154880583000, 0, 407, 0, 381, 192852
760154896705369, 4, 407, 0, 381, 237058
760154896710619, 0, 407, 0, 381, 192852

I'm not sure if there's an error in compute_energy() to be honest - but as you
said, depending on the conditions of the system, the most energy-efficient cpu
could be different.

Without this patch we don't even call compute_energy() for the little core; but
now it is a viable candidate.

> 
> A corresponding EAS task placement looks like this one:
> 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994]  
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004]
> 
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048
> 
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1  energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024
>                                                             ^^^^^^^^^^^^^
>                                                             diff = 20,568

This is not necessarily a problem, unless the energy calculations are wrong of
course.

Setting uclamp_max near the top capacity of the littles will hopefully make
the tasks run there more often - but we know that the top frequencies of the
littles are not necessarily efficient ones either.

Does lowering uclamp_max more towards the 'knee' of the little help keep the
tasks there?

Note what this patch does is make sure that the little is a 'candidate'. Energy
efficiency is the ultimate choice of which candidate cpu/pd to pick.

Being more strict about the uclamp_max choice might be necessary if there's
a stronger desire by userspace to keep the tasks on a specific cluster.

If there're errors in calculating energy, I'd appreciate your help on how to
resolve them.


Thanks!

--
Qais Yousef

> 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[0]=0 cpu_util[f|e]=[440 446] 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[3]=0 cpu_util[f|e]=[6 6]
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[4]=0 cpu_util[f|e]=[440 446] 
> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[5]=0 cpu_util[f|e]=[440 446] 
> 
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=0 busy_time=446 util=446 pd_cap=1784
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=3 busy_time=452 util=2 pd_cap=1784
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=4 busy_time=898 util=446 pd_cap=1784
> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=5 busy_time=907 util=1 pd_cap=1784
> 
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=242002 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=907 cpu_cap=446
> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=5  energy=476000 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=1784 cpu_cap=446
>                                                             ^^^^^^^^^^^^^
>                                                             diff = 233,998
> 
> <idle>-0 [005] select_task_rq_fair: p=[task0-5 2551] target=1
> 
> > For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> > that we'd like to preserve for other type of work.
> > 
> > Of course userspace has control by selecting the right uclamp_max value. They
> > can increase it to allow a spill to next pd - or keep it low to steer them more
> > strongly on a specific pd.
> 
> [...]
>
  
Vincent Guittot Feb. 20, 2023, 5:02 p.m. UTC | #7
On Sat, 11 Feb 2023 at 18:28, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 02/07/23 10:45, Vincent Guittot wrote:
> > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > When uclamp_max is being used, the util of the task could be higher than
> > > the spare capacity of the CPU, but due to uclamp_max value we force fit
> > > it there.
> > >
> > > The way the condition for checking for max_spare_cap in
> > > find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> > > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> > > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> > > hence ending up never performing compute_energy() for this cluster and
> > > missing an opportunity for a better energy efficient placement to honour
> > > uclamp_max setting.
> > >
> > >         max_spare_cap = 0;
> > >         cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
> > >
> > >         ...
> > >
> > >         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
> > >
> > >         ...
> > >
> > >         // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
> > >         if (cpu_cap > max_spare_cap) {
> > >                 max_spare_cap = cpu_cap;
> > >                 max_spare_cap_cpu = cpu;
> > >         }
> > >
> > > prev_spare_cap suffers from a similar problem.
> > >
> > > Fix the logic by converting the variables into long and treating -1
> > > value as 'not populated' instead of 0 which is a viable and correct
> > > spare capacity value.
> > >
> > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
> > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > > ---
> > >  kernel/sched/fair.c | 9 ++++-----
> > >  1 file changed, 4 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c6c8e7f52935..7a21ee74139f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >         for (; pd; pd = pd->next) {
> > >                 unsigned long util_min = p_util_min, util_max = p_util_max;
> > >                 unsigned long cpu_cap, cpu_thermal_cap, util;
> > > -               unsigned long cur_delta, max_spare_cap = 0;
> > > +               long prev_spare_cap = -1, max_spare_cap = -1;
> > >                 unsigned long rq_util_min, rq_util_max;
> > > -               unsigned long prev_spare_cap = 0;
> > > +               unsigned long cur_delta, base_energy;
> > >                 int max_spare_cap_cpu = -1;
> > > -               unsigned long base_energy;
> > >                 int fits, max_fits = -1;
> > >
> > >                 cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> > > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >                         }
> > >                 }
> > >
> > > -               if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
> > > +               if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> > >                         continue;
> > >
> > >                 eenv_pd_busy_time(&eenv, cpus, p);
> > > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >                 base_energy = compute_energy(&eenv, pd, cpus, p, -1);
> > >
> > >                 /* Evaluate the energy impact of using prev_cpu. */
> > > -               if (prev_spare_cap > 0) {
> > > +               if (prev_spare_cap > -1) {
> > >                         prev_delta = compute_energy(&eenv, pd, cpus, p,
> > >                                                     prev_cpu);
> > >                         /* CPU utilization has changed */
> >
> > I think that you also need the change below to make sure that the
> > signed comparison will be used. I have quickly checked the assembly
> > code for aarch64 and your patch keeps using unsigned comparison (b.ls)
> >    ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> > ffff8000080e4c98: eb00003f cmp x1, x0
> > ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0
> > <select_task_rq_fair+0x570>  // b.plast
> >
> > Whereas the change below make it to use the signed version (b.le)
> >    ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> > ffff8000080e4c98: eb00003f cmp x1, x0
> > ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570>
> >
> > -- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct
> > task_struct *p, int prev_cpu)
> >                                 prev_spare_cap = cpu_cap;
> >                                 prev_fits = fits;
> >                         } else if ((fits > max_fits) ||
> > -                                  ((fits == max_fits) && (cpu_cap >
> > max_spare_cap))) {
> > +                                  ((fits == max_fits) &&
> > ((long)cpu_cap > max_spare_cap))) {
> >                                 /*
> >                                  * Find the CPU with the maximum spare capacity
> >                                  * among the remaining CPUs in the performance
>
> Isn't it better to go back to v1 form then? The inconsistent type paired with
> the cast isn't getting too ugly for me :(

the cast of the cpu capacity into a long in the condition was a good
way to fix this unsigned/signed comparison and make it consistent with
the use of -1 as the default value IMO
    ((long)cpu_cap > max_spare_cap)
>
> I don't think we can convert cpu_cap to long without having to do more work as
> it is used with 'util'.
>
>
> Cheers
>
> --
> Qais Yousef
  
Qais Yousef Feb. 21, 2023, 12:05 p.m. UTC | #8
On 02/20/23 18:02, Vincent Guittot wrote:
> On Sat, 11 Feb 2023 at 18:28, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 02/07/23 10:45, Vincent Guittot wrote:
> > > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> > > >
> > > > When uclamp_max is being used, the util of the task could be higher than
> > > > the spare capacity of the CPU, but due to uclamp_max value we force fit
> > > > it there.
> > > >
> > > > The way the condition for checking for max_spare_cap in
> > > > find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> > > > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> > > > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> > > > hence ending up never performing compute_energy() for this cluster and
> > > > missing an opportunity for a better energy efficient placement to honour
> > > > uclamp_max setting.
> > > >
> > > >         max_spare_cap = 0;
> > > >         cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
> > > >
> > > >         ...
> > > >
> > > >         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
> > > >
> > > >         ...
> > > >
> > > >         // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
> > > >         if (cpu_cap > max_spare_cap) {
> > > >                 max_spare_cap = cpu_cap;
> > > >                 max_spare_cap_cpu = cpu;
> > > >         }
> > > >
> > > > prev_spare_cap suffers from a similar problem.
> > > >
> > > > Fix the logic by converting the variables into long and treating -1
> > > > value as 'not populated' instead of 0 which is a viable and correct
> > > > spare capacity value.
> > > >
> > > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
> > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > > > ---
> > > >  kernel/sched/fair.c | 9 ++++-----
> > > >  1 file changed, 4 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index c6c8e7f52935..7a21ee74139f 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >         for (; pd; pd = pd->next) {
> > > >                 unsigned long util_min = p_util_min, util_max = p_util_max;
> > > >                 unsigned long cpu_cap, cpu_thermal_cap, util;
> > > > -               unsigned long cur_delta, max_spare_cap = 0;
> > > > +               long prev_spare_cap = -1, max_spare_cap = -1;
> > > >                 unsigned long rq_util_min, rq_util_max;
> > > > -               unsigned long prev_spare_cap = 0;
> > > > +               unsigned long cur_delta, base_energy;
> > > >                 int max_spare_cap_cpu = -1;
> > > > -               unsigned long base_energy;
> > > >                 int fits, max_fits = -1;
> > > >
> > > >                 cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> > > > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >                         }
> > > >                 }
> > > >
> > > > -               if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
> > > > +               if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> > > >                         continue;
> > > >
> > > >                 eenv_pd_busy_time(&eenv, cpus, p);
> > > > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >                 base_energy = compute_energy(&eenv, pd, cpus, p, -1);
> > > >
> > > >                 /* Evaluate the energy impact of using prev_cpu. */
> > > > -               if (prev_spare_cap > 0) {
> > > > +               if (prev_spare_cap > -1) {
> > > >                         prev_delta = compute_energy(&eenv, pd, cpus, p,
> > > >                                                     prev_cpu);
> > > >                         /* CPU utilization has changed */
> > >
> > > I think that you also need the change below to make sure that the
> > > signed comparison will be used. I have quickly checked the assembly
> > > code for aarch64 and your patch keeps using unsigned comparison (b.ls)
> > >    ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> > > ffff8000080e4c98: eb00003f cmp x1, x0
> > > ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0
> > > <select_task_rq_fair+0x570>  // b.plast
> > >
> > > Whereas the change below make it to use the signed version (b.le)
> > >    ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> > > ffff8000080e4c98: eb00003f cmp x1, x0
> > > ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570>
> > >
> > > -- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct
> > > task_struct *p, int prev_cpu)
> > >                                 prev_spare_cap = cpu_cap;
> > >                                 prev_fits = fits;
> > >                         } else if ((fits > max_fits) ||
> > > -                                  ((fits == max_fits) && (cpu_cap >
> > > max_spare_cap))) {
> > > +                                  ((fits == max_fits) &&
> > > ((long)cpu_cap > max_spare_cap))) {
> > >                                 /*
> > >                                  * Find the CPU with the maximum spare capacity
> > >                                  * among the remaining CPUs in the performance
> >
> > Isn't it better to go back to v1 form then? The inconsistent type paired with
> > the cast isn't getting too ugly for me :(
> 
> the cast into a long of the cpu capacity in the condition was a good
> way to fix this unsigned/signed comparison and make is consistent with
> the use of -1 as default value IMO
>     ((long)cpu_cap > max_spare_cap)

Fair enough. Our boolean brains differ :-) I'll add the cast.

Do you see the energy calculation issue Dietmar was trying to highlight? As it
stands I only have boolean tweaks to do for v3.


Cheers

--
Qais Yousef

> >
> > I don't think we can convert cpu_cap to long without having to do more work as
> > it is used with 'util'.
> >
> >
> > Cheers
> >
> > --
> > Qais Yousef
  
Dietmar Eggemann Feb. 21, 2023, 12:20 p.m. UTC | #9
On 14/02/2023 19:09, Qais Yousef wrote:
> On 02/14/23 13:47, Dietmar Eggemann wrote:
>> On 11/02/2023 18:50, Qais Yousef wrote:
>>> On 02/09/23 19:02, Dietmar Eggemann wrote:
>>>> On 07/02/2023 10:45, Vincent Guittot wrote:
>>>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:

[...]

>>
>> The energy value for the perf_state is correct but the decision based 
>> on `energy diff` in which PD to put the task might not be.
>>
>> In case CPUx already has some tasks, its `pd_busy_time` contribution 
>> is still capped by its capacity_orig.
>>
>> eenv_pd_busy_time() -> cpu_util_next()
>>                          return min(util, capacity_orig_of(cpu))
>>
>> In this case we can't use `energy diff` anymore to make the decision 
>> where to put the task.
>>
>> The base_energy of CPUx's PD is already so high that the 
>> `energy diff = cur_energy - base_energy` is small enough so that CPUx 
>> is selected as target again.
> 
> I'm sorry bear with me as I'm still struggling to fully understand the case.
> 
> You're thinking that if all the CPUs in the pd are _already_ fully busy before
> adding this newly woken up task there, then we'll end up with the wrong
> energy_diff? IOW, when the pd is fully loaded, then the cost of adding extra
> load will always look cheap is what you're saying?

Sort of. The key to this is:

  compute_energy()

    if (dst_cpu >= 0)
     busy_time = min(pd_capacity, pd_busy_time + task_busy_time);
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                  pd + task contribution

If `pd + task contribution` > `pd_capacity` then we're saturated and the
energy diff is bogus. This includes the case in which
`pd contribution` > `pd_capacity`.

>>>> So EAS keeps packing on the cheaper PD/clamped OPP.
>>
>> Sometimes yes, but there are occurrences in which a big CPU ends up
>> with the majority of the tasks. It depends on the start condition.
> 
> It should depend on the energy cost, yes. Which does depend on the current
> state of the system.

I analyzed one of the last traces I got with my example:

During the initial wakeup the system is in CPU OU (overutilized). So feec()
returns `target = -1` and `sis()->sic()` (a) does the initial placement for
all 6 tasks.

CPU  (a)     (b) (c) (d)    (e)
      ^       ^   ^   ^      ^
 0   t1-------->| |
 1   t0 t3    |t5 |t1 |t4|   |   |   |   | ...
 2   t2--------------------->|
 3
 4   t4------------>| |
 5   t5---->| |

(b) feec() for t5:

    eenv_pd_busy_time()

      pd_busy_time = 1998 = 994 (CPU1) + 1004 (CPU2)

    compute_energy()

      busy_time = min(pd_capacity, pd_busy_time + task_busy_time)

                = min(2048, 1998 + 921)

                = 2048

    We go into saturation here and EAS assumes it can place t5 for the
    cost of 2048 - 1998 = 50 util where in reality t5 has a util of
    ~921. And that's why t5 ends up on CPU1 too.

(c) & (d) similar to (b)

(e) From here on no wakeups anymore. Only tick preemption on CPU1 every
    4ms (250Hz) between the 5 tasks and t2 finishing on CPU2 eventually.

>>> Which is the desired behavior for uclamp_max?
>>
>> Not sure. Essentially, EAS can't do its job properly if there is no idle 
> 
> This "not sure" statement is making me worried. Are you not sure about how
> uclamp_max should work in force fitting big tasks into small cores? Or just
> about handling some of the corner cases? I understood the former, not the
> latter.

I'm not sure we can still call the issue that EAS doesn't work in
saturation a corner case. In case uclamp_max has an effect, it can quite
easily create these saturation scenarios in which EAS is still enabled
but can't do its job anymore.
Making sure that we probe each PD does not get rid of this more
fundamental issue.

>> time left. As soon as uclamp_max restricts the system (like in my
>> example) task placement decisions via EAS (min energy diff based) can be
>> wrong. 
> 
> I'm assuming 'no idle time' refers to the pd being fully loaded already
> _before_ placing the newly woken up task. If you refer to it not having idle
> time _after_ placing it - then I'm confused as one of the goals of uclamp_max
> is to act as a task placement hint and if it can't do that in this simple

We can't consider an individual task here. Placing `t0` might happen before
placing `t1`, for which we might then run into this saturation.

> scenarios, it won't be much useful? I can appreciate it failing at some corner
> cases. But for example if a little core is idle and a 1024 task wakes up with
> uclamp_max that makes the little a candidate; then yeah it'll not leave any
> idle time on that cpu - but placing it there (if it the energy efficient cpu)
> is the desired effect, no?

Not having idle time means EAS can't do its job properly and uclamp_max
can create these scenarios.

>>> The only issue I see is that we want to distribute within a pd. Which is
>>> something I was going to work on and send after later - but can lump it in this
>>> series if it helps.
>>
>> I assume you refer to
>>
>>     } else if ((fits > max_fits) ||
>>         ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
>>
>> here?
> 
> Yes. If there are multiple cpus with the same max_spare_cap maybe we can
> distribute among them rather than always pick the first one.

This can mitigate the issue but not solve it.

[...]

>>> 	b. It is desired to keep these bigger cpu idle ready for more important
>>> 	   work.
>>
>> But this is not happening all the time. Using the same workload
>> (6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446]
>> sometimes ends up with all 6 tasks on big CPU1.
> 
> This seems similar to a case I see on pinebook pro but with uclamp_min.
> 
> $ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity
> /sys/devices/system/cpu/cpu0/cpu_capacity:381
> /sys/devices/system/cpu/cpu1/cpu_capacity:381
> /sys/devices/system/cpu/cpu2/cpu_capacity:381
> /sys/devices/system/cpu/cpu3/cpu_capacity:381
> /sys/devices/system/cpu/cpu4/cpu_capacity:1024
> /sys/devices/system/cpu/cpu5/cpu_capacity:1024
> 
> ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
> 253049258065, 4, 0, 381, 1024, 10562
> 253049269732, 3, 0, 381, 1024, 18814

So the energy after placing p on CPU4 (big) is lower than placing it on
CPU3 (LITTLE). But to see the energy-diff we would need to see the base
energy (dst_cpu = -1) as well.

Why is p_util = 0 here?

> 253065609673, 4, 0, 381, 1024, 10562
> 253065621340, 3, 0, 381, 1024, 17874

[..]

> 
> With uclamp_max energy looks different, but p_util is different too which has
> impact on compute_energy() calculations

p_util is p->se.avg.util_avg here?

> ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy
> 760154735422179, 4, 407, 0, 381, 237058
> 760154735426845, 0, 407, 0, 381, 192382
> 
> I'm not sure if there's an error in compute_energy to be honest - but as you
> said depends on the conditions of the system the most energy efficient cpu
> could be different.
> 
> Without this patch we don't even call compute_energy() for the little core; but
> now it is a viable candidate.

Understood. But this doesn't fix the `EAS not working under saturation`
problem.

>> A corresponding EAS task placement looks like this one:
>>
>> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994]  
>> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004]
>>
>> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048
>> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048
>>
>> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024
>> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1  energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024
>>                                                             ^^^^^^^^^^^^^
>>                                                             diff = 20,568
> 
> This is not necessarily a problem, unless the energy calculations are wrong of
> course.

The problem is that placing p=[task0-5 2551] 'costs' only the
energy-equivalence of 2048-1998 = 50 util instead of 921.
> 
> Setting uclamp_max near the top capacity of the littles will hopefully make it
> run there more often - but we know that the top frequencies of the little are
> not necessarily efficient ones either.
> 
> Does lowering uclamp_max more towards the 'knee' of the little help keeping the
> tasks there?
> 
> Note what this patch does is make sure that the little is a 'candidate'. Energy
> efficiency is the ultimate choice of which candidate cpu/pd to pick.
> 
> Being more strict about uclamp_max choice might be necessary if there's
> a stronger desire by userspace to keep the tasks on a specific cluster.
> 
> If there're errors in calculating energy, I'd appreciate your help on how to
> resolve them.

I don't think there is an error in the energy calculation.
But EAS doesn't work properly in these saturation cases which uclamp_max
can create. And considering each PD won't solve this. I'm afraid we do
need a solution for this saturation issue.
  
Vincent Guittot Feb. 22, 2023, 10:59 a.m. UTC | #10
On Tue, 21 Feb 2023 at 13:05, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 02/20/23 18:02, Vincent Guittot wrote:
> > On Sat, 11 Feb 2023 at 18:28, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 02/07/23 10:45, Vincent Guittot wrote:
> > > > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
> > > > >
> > > > > When uclamp_max is being used, the util of the task could be higher than
> > > > > the spare capacity of the CPU, but due to uclamp_max value we force fit
> > > > > it there.
> > > > >
> > > > > The way the condition for checking for max_spare_cap in
> > > > > find_energy_efficient_cpu() was constructed; it ignored any CPU that has
> > > > > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
> > > > > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
> > > > > hence ending up never performing compute_energy() for this cluster and
> > > > > missing an opportunity for a better energy efficient placement to honour
> > > > > uclamp_max setting.
> > > > >
> > > > >         max_spare_cap = 0;
> > > > >         cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
> > > > >
> > > > >         ...
> > > > >
> > > > >         util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
> > > > >
> > > > >         ...
> > > > >
> > > > >         // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
> > > > >         if (cpu_cap > max_spare_cap) {
> > > > >                 max_spare_cap = cpu_cap;
> > > > >                 max_spare_cap_cpu = cpu;
> > > > >         }
> > > > >
> > > > > prev_spare_cap suffers from a similar problem.
> > > > >
> > > > > Fix the logic by converting the variables into long and treating -1
> > > > > value as 'not populated' instead of 0 which is a viable and correct
> > > > > spare capacity value.
> > > > >
> > > > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
> > > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > > > > ---
> > > > >  kernel/sched/fair.c | 9 ++++-----
> > > > >  1 file changed, 4 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index c6c8e7f52935..7a21ee74139f 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > > >         for (; pd; pd = pd->next) {
> > > > >                 unsigned long util_min = p_util_min, util_max = p_util_max;
> > > > >                 unsigned long cpu_cap, cpu_thermal_cap, util;
> > > > > -               unsigned long cur_delta, max_spare_cap = 0;
> > > > > +               long prev_spare_cap = -1, max_spare_cap = -1;
> > > > >                 unsigned long rq_util_min, rq_util_max;
> > > > > -               unsigned long prev_spare_cap = 0;
> > > > > +               unsigned long cur_delta, base_energy;
> > > > >                 int max_spare_cap_cpu = -1;
> > > > > -               unsigned long base_energy;
> > > > >                 int fits, max_fits = -1;
> > > > >
> > > > >                 cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> > > > > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > > >                         }
> > > > >                 }
> > > > >
> > > > > -               if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
> > > > > +               if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> > > > >                         continue;
> > > > >
> > > > >                 eenv_pd_busy_time(&eenv, cpus, p);
> > > > > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > > >                 base_energy = compute_energy(&eenv, pd, cpus, p, -1);
> > > > >
> > > > >                 /* Evaluate the energy impact of using prev_cpu. */
> > > > > -               if (prev_spare_cap > 0) {
> > > > > +               if (prev_spare_cap > -1) {
> > > > >                         prev_delta = compute_energy(&eenv, pd, cpus, p,
> > > > >                                                     prev_cpu);
> > > > >                         /* CPU utilization has changed */
> > > >
> > > > I think that you also need the change below to make sure that the
> > > > signed comparison will be used. I have quickly checked the assembly
> > > > code for aarch64 and your patch keeps using unsigned comparison (b.ls)
> > > >    ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> > > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> > > > ffff8000080e4c98: eb00003f cmp x1, x0
> > > > ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0 <select_task_rq_fair+0x570>  // b.plast
> > > >
> > > > Whereas the change below makes it use the signed version (b.le)
> > > >    ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> > > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200]
> > > > ffff8000080e4c98: eb00003f cmp x1, x0
> > > > ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570>
> > > >
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >                                 prev_spare_cap = cpu_cap;
> > > >                                 prev_fits = fits;
> > > >                         } else if ((fits > max_fits) ||
> > > > -                                  ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> > > > +                                  ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
> > > >                                 /*
> > > >                                  * Find the CPU with the maximum spare capacity
> > > >                                  * among the remaining CPUs in the performance
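
(A small standalone C demonstration of the unsigned/signed comparison pitfall
being discussed - in a mixed unsigned long / long comparison the signed -1 is
converted to ULONG_MAX, so a spare capacity of 0 never beats the 'not
populated' marker; the cast keeps the comparison signed. Illustrative only,
not kernel code:)

#include <stdio.h>

int main(void)
{
	unsigned long cpu_cap = 0;	/* a spare capacity of 0 is legitimate */
	long max_spare_cap = -1;	/* -1 means 'not populated' */

	/* Usual arithmetic conversions turn max_spare_cap into ULONG_MAX,
	 * so this is false and the CPU would never be recorded. */
	if (cpu_cap > max_spare_cap)
		printf("unsigned compare: candidate accepted\n");
	else
		printf("unsigned compare: candidate rejected\n");

	/* Casting cpu_cap to long keeps the comparison signed: 0 > -1. */
	if ((long)cpu_cap > max_spare_cap)
		printf("signed compare: candidate accepted\n");

	return 0;
}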
> > >
> > > Isn't it better to go back to v1 form then? The inconsistent type paired with
> > > the cast isn't getting too ugly for me :(
> >
> > the cast into a long of the cpu capacity in the condition was a good
> > way to fix this unsigned/signed comparison and make it consistent with
> > the use of -1 as default value IMO
> >     ((long)cpu_cap > max_spare_cap)
>
> Fair enough. Our boolean brains differ :-) I'll add the cast.
>
> Do you see the energy calculation issue Dietmar was trying to highlight? As it
> stands I only have boolean tweaks to do for v3.

I haven't looked too much at uclamp_max's impact on the energy calculation.
Nevertheless, I wonder if one solution could be to not clamp the
utilization to cpu max capacity in this case. The fact that
utilization can go above cpu capacity when we clamp its frequency
reflects that it would need more compute capacity or it will run
longer. I will try to look more deeply into this use case.


>
>
> Cheers
>
> --
> Qais Yousef
>
> > >
> > > I don't think we can convert cpu_cap to long without having to do more work as
> > > it is used with 'util'.
> > >
> > >
> > > Cheers
> > >
> > > --
> > > Qais Yousef
  
Qais Yousef Feb. 23, 2023, 3:12 p.m. UTC | #11
On 02/21/23 13:20, Dietmar Eggemann wrote:

> I analyzed one of the last traces I got with my example:
> 
> During the initial wakeup the system is in CPU OU. So feec() returns
> `target = -1` and `sis()->sic()` (a) does the initial placement for all
> the 6 tasks.
> 
> CPU  (a)     (b) (c) (d)    (e)
>       ^       ^   ^   ^      ^
>  0   t1-------->| |
>  1   t0 t3    |t5 |t1 |t4|   |   |   |   | ...
>  2   t2--------------------->|
>  3
>  4   t4------------>| |
>  5   t5---->| |
> 
> (b) feec() for t5:
> 
>     eenv_pd_busy_time()
> 
>       pd_busy_time = 1998 = 994 (CPU1) + 1004 (CPU2)
> 
>     compute_energy()
> 
>       busy_time = min(pd_capacity, pd_busy_time + task_busy_time)
> 
>                 = min(2048, 1998 + 921)
> 
>                 = 2048
> 
>     We go into saturation here and EAS assumes it can place t5 for the
>     cost of 2048 - 1998 = 50 util where in reality t5 has a util of
>     ~921. And that's why t5 ends up on CPU1 too.
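
(To make the arithmetic above concrete - a standalone C sketch of how clamping
busy_time to the PD capacity hides most of t5's cost. The numbers come from
the trace above; the function is a much simplified stand-in for
compute_energy(), not the real implementation:)

#include <stdio.h>

/* Very rough model: the estimated cost scales with the busy time the PD is
 * expected to have, clamped to the PD capacity. */
static unsigned long estimate_cost(unsigned long pd_busy_time,
				   unsigned long task_busy_time,
				   unsigned long pd_cap)
{
	unsigned long busy_time = pd_busy_time + task_busy_time;

	if (busy_time > pd_cap)
		busy_time = pd_cap;	/* saturation: no idle time left */

	return busy_time;		/* pretend cost == busy_time */
}

int main(void)
{
	unsigned long pd_cap = 2048;		/* Juno big PD: CPU1 + CPU2 */
	unsigned long pd_busy_time = 1998;	/* 994 (CPU1) + 1004 (CPU2) */
	unsigned long task_busy_time = 921;	/* t5 */

	unsigned long base = estimate_cost(pd_busy_time, 0, pd_cap);
	unsigned long with_t5 = estimate_cost(pd_busy_time, task_busy_time, pd_cap);

	/* Prints 50, although t5 really brings ~921 worth of work. */
	printf("estimated cost of placing t5: %lu\n", with_t5 - base);
	return 0;
}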

So to rephrase, as it was hard to follow your line of thought:

The problem you're highlighting is that with uclamp_max we can end up with
a fully utilized pd, but the energy cost of this always-running pd is capped to
a constant value. You think this should be modeled better to reflect that it'll
run for a longer period of time and hence cost more energy.

There are multiple compound issues that make this difficult to reflect
accurately, and that make me think this best-effort is not really as bad as you think:

1. The capacity of little cores has become small, 158 on pixel 6. This makes
   the definition of a 'busy' task from the little's point of view rather
   relaxed/small. But there's a strong desire to enable uclamp_max to act as
   a soft affinity to prevent these tasks from interfering with more important
   work on bigger cores and potentially wasting power.

2. If the task is a truly long running 1024 task (even on a big core), then
   knowing its actual utilization and runtime would be impossible and we'd have
   to cap it to something. It's basically infinity.

3. Because of uclamp_max capping, the util_avg could be wrong - depending on
   how big the task actually is compared to uclamp_max. In the worst case
   scenario, if no idle time is seen, util_avg can easily grow into 1024 -
   although if the task is allowed to run uncapped it might actually be 300 or
   something in reality.

4. Because of max aggregation, the uclamp_max tasks are a frequency spike
   hazard. Again if the task is a true 1024 (and capping those tasks is one of
   the major use cases for uclamp_max, if not The Major use case) then as soon
   as uncapped task wakes up on the CPU, there'd be a frequency spike
   regardless of how small this task is.

   If at wake-up it thought a big core is okay, and it ends up running for the
   next few 100s of ms with random rt/kworkers waking up on that core - the big
   core will run often enough at the most energy-inefficient point of the
   system. Which is a bigger problem as the power cost will be very high and
   noticeable. And this is already why uclamp_max isn't being used today.

   And because this is an always running task for the foreseeable future from
   the scheduler PoV, and given the lack of a downmigration mechanism to move
   these tasks away (I have patches for that - but trying to send fixes one by
   one), these frequency spikes pose a bigger hazard of wasting power than
   making a one-off miscalculation at wakeup time. The decision at wake up is
   a sample at that exact instant of time. But generally uclamp_max is desired
   for those long running tasks whose potential energy cost over a long period
   of time (many ticks' worth of realtime) is the worrisome part. Immediately
   after that sampling time the conditions could change, even today, and render
   our decision completely wrong. And there's no fixing that. It's the nature
   of the beast.

5. One of the desired properties of uclamp_max in practice is to act as a soft
   affinity. In Android, background tasks are restricted by cpuset. But this
   could potentially lead to lost opportunities for better energy placement
   under normal lightly loaded circumstances where there are a lot of small
   tasks running but nothing particularly busy. EAS shines under those
   scenarios, but fails for long running tasks - and this is where uclamp_max
   is expected to help.

> 
> (c) & (d) similar to (b)
> 
> (e) From here on no wakeups anymore. Only tick preemption on CPU1 every
>     4ms (250Hz) between the 5 tasks and t2 finishing on CPU2 eventually.
> 
> >>> Which is the desired behavior for uclamp_max?
> >>
> >> Not sure. Essentially, EAS can't do its job properly if there is no idle 
> > 
> > This "not sure" statement is making me worried. Are you not sure about how
> > uclamp_max should work in force fitting big tasks into small cores? Or just
> > about handling some of the corner cases? I understood the former, not the
> > latter.
> 
> I'm not sure if we can call the issue that EAS doesn't work in
> saturation anymore a corner case. In case uclamp_max has an effect, it
> can quite easily create these saturation scenarios in which EAS is still
> enabled but it can't do its job anymore.
> Making sure that we probe each PD is not getting rid of this more
> fundamental issue.

I disagree that there is a fundamental issue with regard to the energy
calculation. I see a potential nice-to-have improvement in how we estimate the
energy of long running tasks.

What I see as a fundamental error that this series is fixing is that the hint
to keep tasks on little cores doesn't work. Being able to pick a more energy
efficient CPU is less of a problem, and I'm not sure it's really affecting
anyone in practice. Today's mainline will not consider any busy task as
a candidate for the little core even if uclamp_max says it's okay. The task has
to be small enough (< 158 on pixel 6) for EAS to consider it, but those tasks
are not the ones that need the uclamp_max hint.

We could try to remove the min(). But I worry this is premature before first
introducing my fixes to teach the load balancer to do down migration. And
I think we need to do more to address the frequency spike problem - for which
I have proposals ready too.

As it stands we can't use uclamp_max in products, and there is a series of
fixes required to achieve that. And what you see as a fundamental issue is not
one of the blocking issues I am aware of. It seems a nice enhancement to make
the wakeup energy estimation even better. But I do think it requires a crystal
ball and it'll remain a best effort even after that.

I think the current outcome is the desired behavior to be honest. Maybe we can
do better, but that's a separate problem/question and not a fundamental blocker
to enabling uclamp_max to do what it says on the tin.

I can add this as another limitation in the uclamp doc.

The only thing we should probably do now is add a limit to how much cramming
we can do before we say it's too much for this pd even with uclamp_max. Maybe
we can put a limit based on load_avg, as I think it'll continue to grow, unlike
util_avg which will settle at 1024.
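
(Purely speculative illustration of that load_avg idea - a standalone C sketch
of what such a cramming limit could look like. The threshold and the helper
name are invented; nothing like this exists in the kernel today:)

#include <stdbool.h>

/* Hypothetical guard: even when uclamp_max says a saturated CPU "fits",
 * refuse to pack more work once the accumulated load_avg exceeds some
 * multiple of the CPU's capacity. util_avg saturates at the capacity, but
 * load_avg keeps growing as runnable time piles up, so it can still tell a
 * heavily crammed CPU apart from a lightly saturated one. */
#define CRAM_LIMIT_PCT	400	/* allow up to 4x capacity worth of load */

static bool uclamp_max_can_cram(unsigned long cpu_load_avg,
				unsigned long cpu_capacity)
{
	return cpu_load_avg * 100 < cpu_capacity * CRAM_LIMIT_PCT;
}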


Thanks!

--
Qais Yousef
  
Qais Yousef Feb. 23, 2023, 3:13 p.m. UTC | #12
On 02/22/23 11:59, Vincent Guittot wrote:

> I haven't looked too much at uclamp_max's impact on the energy calculation.
> Nevertheless, I wonder if one solution could be to not clamp the
> utilization to cpu max capacity in this case. The fact that
> utilization can go above cpu capacity when we clamp its frequency
> reflects that it would need more compute capacity or it will run
> longer. I will try to look more deeply into this use case.

Okay thanks!


--
Qais Yousef
  
Lukasz Luba May 22, 2023, 8:30 a.m. UTC | #13
Hi Qais,

I have a question regarding the 'soft cpu affinity'.

On 2/11/23 17:50, Qais Yousef wrote:
> On 02/09/23 19:02, Dietmar Eggemann wrote:
>> On 07/02/2023 10:45, Vincent Guittot wrote:
>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote:
>>>>
>>>> When uclamp_max is being used, the util of the task could be higher than
>>>> the spare capacity of the CPU, but due to uclamp_max value we force fit
>>>> it there.
>>>>
>>>> The way the condition for checking for max_spare_cap in
>>>> find_energy_efficient_cpu() was constructed; it ignored any CPU that has
>>>> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
>>>> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
>>>> hence ending up never performing compute_energy() for this cluster and
>>>> missing an opportunity for a better energy efficient placement to honour
>>>> uclamp_max setting.
>>>>
>>>>          max_spare_cap = 0;
>>>>          cpu_cap = capacity_of(cpu) - task_util(p);  // 0 if task_util(p) is high
>>>>
>>>>          ...
>>>>
>>>>          util_fits_cpu(...);             // will return true if uclamp_max forces it to fit
>>
>> s/true/1/ ?
>>
>>>>
>>>>          ...
>>>>
>>>>          // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
>>>>          if (cpu_cap > max_spare_cap) {
>>>>                  max_spare_cap = cpu_cap;
>>>>                  max_spare_cap_cpu = cpu;
>>>>          }
>>>>
>>>> prev_spare_cap suffers from a similar problem.
>>>>
>>>> Fix the logic by converting the variables into long and treating -1
>>>> value as 'not populated' instead of 0 which is a viable and correct
>>>> spare capacity value.
>>
>> The issue I see here is that in case we don't have any spare capacity left,
>> the energy calculation (in terms of CPU utilization) isn't correct anymore.
>>
>> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()`
>> you never know how big the `busy_time` for the PD really is in this moment.
>>
>> eenv_pd_busy_time()
>>
>>    for_each_cpu(cpu, pd_cpus)
>>      busy_time += effective_cpu_util(..., ENERGY_UTIL, ...)
>>      ^^^^^^^^^
>>
>> with:
>>
>>    sum_util = min(busy_time + task_busy_time, pd_cap)
>>                   ^^^^^^^^^
>>
>>    freq = (1.25 * max_util / max) * max_freq
>>
>>    energy = (perf_state(freq)->cost / max) * sum_util
>>
>>
>> energy is not related to CPU utilization anymore (since there is no idle
>> time/spare capacity left).
> 
> Am I right that what you're saying is that the energy calculation for the PD
> will be capped to a certain value and this is why you think the energy is
> incorrect?
> 
> What should be the correct energy (in theory at least)?
> 
>>
>> So EAS keeps packing on the cheaper PD/clamped OPP.
> 
> Which is the desired behavior for uclamp_max?
> 
> The only issue I see is that we want to distribute within a pd. Which is
> something I was going to work on and send later - but can lump it in this
> series if it helps.
> 
>>
>> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440
>> tasks all running on little PD, cpumask=0x39. The big PD is idle but
>> never beats prev_cpu in terms of energy consumption for the wakee.
> 
> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> is two folds:
> 
> 	1. Prevent tasks from consuming energy.
> 	2. Keep them away from expensive CPUs.
> 
> 2 is actually very important for 2 reasons:
> 
> 	a. Because of max aggregation - any uncapped tasks that wakes up will
> 	   cause a frequency spike on this 'expensive' cpu. We don't have
> 	   a mechanism to downmigrate it - which is another thing I'm working
> 	   on.
> 	b. It is desired to keep these bigger cpu idle ready for more important
> 	   work.
> 
> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> that we'd like to preserve for other type of work.

I'm a bit afraid about such 'strong force'. That means the task would
not go via EAS if we set uclamp_max e.g. 90, while the little capacity
is 125. Or am I missing something?

This might effectively use more energy for those tasks which can run on
any CPU and EAS would figure a good energy placement. I'm worried
about this, since we have L3+littles in one DVFS domain and the L3
would be only bigger in future.

IMO to keep the big cpus more in idle, we should give them big energy
wake up cost. That's my 3rd feature to the EM presented in OSPM2023.

> 
> Of course userspace has control by selecting the right uclamp_max value. They
> can increase it to allow a spill to next pd - or keep it low to steer them more
> strongly on a specific pd.

This would be interesting to see in practice. I think we need such an
experiment for such changes.

Regards,
Lukasz
  
Qais Yousef May 31, 2023, 6:22 p.m. UTC | #14
Hi Lukasz!

Sorry for late response..

On 05/22/23 09:30, Lukasz Luba wrote:
> Hi Qais,
> 
> I have a question regarding the 'soft cpu affinity'.

[...]

> > IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> > is two folds:
> > 
> > 	1. Prevent tasks from consuming energy.
> > 	2. Keep them away from expensive CPUs.
> > 
> > 2 is actually very important for 2 reasons:
> > 
> > 	a. Because of max aggregation - any uncapped tasks that wakes up will
> > 	   cause a frequency spike on this 'expensive' cpu. We don't have
> > 	   a mechanism to downmigrate it - which is another thing I'm working
> > 	   on.
> > 	b. It is desired to keep these bigger cpu idle ready for more important
> > 	   work.
> > 
> > For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> > that we'd like to preserve for other type of work.
> 
> I'm a bit afraid about such 'strong force'. That means the task would
> not go via EAS if we set uclamp_max e.g. 90, while the little capacity
> is 125. Or am I missing something?

We should go via EAS, actually that's the whole point.

Why do you think we won't go via EAS? The logic should be is we give a hint to
prefer the little core, but we still can pick something else if it's more
energy efficient.

What uclamp_max enables us to do is still consider that little core even if its
utilization says it doesn't fit there. We need to merge these patches first
though as it's broken at the moment. If little capacity is 125 and utilization
of the task is 125, then even if uclamp_max is 0, EAS will skip the whole
little cluster as a potential candidate because there's no spare_capacity there.
Even if the whole little cluster is idle.

> 
> This might effectively use more energy for those tasks which can run on
> any CPU and EAS would figure a good energy placement. I'm worried
> about this, since we have L3+littles in one DVFS domain and the L3
> would be only bigger in future.

It's a bias that will enable the search algorithm in EAS to still consider the
little core for big tasks. This bias will depend on the uclamp_max value chosen
by userspace (so they have some control on how hard to cap the task), and what
else is happening in the system at the time it wakes up.

> 
> IMO to keep the big cpus more in idle, we should give them big energy
> wake up cost. That's my 3rd feature to the EM presented in OSPM2023.

Considering the wake up cost in EAS would be a great addition to have :)

> 
> > 
> > Of course userspace has control by selecting the right uclamp_max value. They
> > can increase it to allow a spill to next pd - or keep it low to steer them more
> > strongly on a specific pd.
> 
> This would be interesting to see in practice. I think we need such an
> experiment for such changes.

I'm not sure what you mean by 'such changes'. I hope you don't mean these
patches as they are not the key. They fix an obvious bug where task placement
hint won't work at all. They don't modify any behavior that shouldn't have
already been there. Nor introduce new limitation. I have to say I am
disappointed that these patches aren't considered an important fix for an
obvious breakage.


Thanks

--
Qais Yousef
  
Dietmar Eggemann June 5, 2023, 11:29 a.m. UTC | #15
On 31/05/2023 20:22, Qais Yousef wrote:
> Hi Lukasz!
> 
> Sorry for late response..
> 
> On 05/22/23 09:30, Lukasz Luba wrote:
>> Hi Qais,
>>
>> I have a question regarding the 'soft cpu affinity'.
> 
> [...]
> 
>>> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
>>> is two folds:
>>>
>>> 	1. Prevent tasks from consuming energy.
>>> 	2. Keep them away from expensive CPUs.
>>>
>>> 2 is actually very important for 2 reasons:
>>>
>>> 	a. Because of max aggregation - any uncapped tasks that wakes up will
>>> 	   cause a frequency spike on this 'expensive' cpu. We don't have
>>> 	   a mechanism to downmigrate it - which is another thing I'm working
>>> 	   on.
>>> 	b. It is desired to keep these bigger cpu idle ready for more important
>>> 	   work.
>>>
>>> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
>>> that we'd like to preserve for other type of work.
>>
>> I'm a bit afraid about such 'strong force'. That means the task would
>> not go via EAS if we set uclamp_max e.g. 90, while the little capacity
>> is 125. Or am I missing something?
> 
> We should go via EAS, actually that's the whole point.
> 
> Why do you think we won't go via EAS? The logic should be is we give a hint to
> prefer the little core, but we still can pick something else if it's more
> energy efficient.
> 
> What uclamp_max enables us to do is still consider that little core even if its
> utilization says it doesn't fit there. We need to merge these patches first
> though as it's broken at the moment. If little capacity is 125 and utilization
> of the task is 125, then even if uclamp_max is 0, EAS will skip the whole
> little cluster as a potential candidate because there's no spare_capacity there.
> Even if the whole little cluster is idle.

I'm not against letting uclamp_max force fit the placement of p. I'm
just worried that using the EM (especially in its current state) for
that is wrong and will only work in certain scenarios like the one you
picked.

I did show a counter-example (Juno-r0 [446 1024 1024 446 446 446] with 6
8ms/16ms uclamp_max=440). The issue is that once we have no idle time
left, the decisions of the EM are simply wrong and we shouldn't enforce
those decisions.

There is a related issue. E.g. on Pixel6 with its 4 little CPUs with
cpu_cap_orig = 124. If you admit a task with p->util_avg > cpu_cap_orig
it doesn't even fit onto a CPU. But we then do an energy calculation
taking the values of the whole little PD into consideration. This makes
no sense. The EM is not uclamp_max aware right now.

What about using `sis() -> sic()` for uclamp_max constrained tasks? We
would just have to iterate over the CPUs in cpu_cap_orig order. (1)

Or the EM has to be made uclamp_max aware. (2)

Your example is the idle system with 1 task p waking up. This task has a
util_avg which excludes the little CPUs from being the new task_cpu(p).
But p's uclamp_max value would allow this. But we can't just consider
the placement of 1 task here.

>> This might effectively use more energy for those tasks which can run on
>> any CPU and EAS would figure a good energy placement. I'm worried
>> about this, since we have L3+littles in one DVFS domain and the L3
>> would be only bigger in future.
> 
> It's a bias that will enable the search algorithm in EAS to still consider the
> little core for big tasks. This bias will depend on the uclamp_max value chosen
> by userspace (so they have some control on how hard to cap the task), and what
> else is happening in the system at the time it wakes up.

To teach the EM about such tricky dependencies is IMHO outside the scope
of `how to select a CPU for a uclamp_max constrained task`. (3)

>> IMO to keep the big cpus more in idle, we should give them big energy
>> wake up cost. That's my 3rd feature to the EM presented in OSPM2023.
> 
> Considering the wake up cost in EAS would be a great addition to have :)

I see this one as unrelated to (3) as well.

>>> Of course userspace has control by selecting the right uclamp_max value. They
>>> can increase it to allow a spill to next pd - or keep it low to steer them more
>>> strongly on a specific pd.
>>
>> This would be interesting to see in practice. I think we need such an
>> experiment for such changes.
> 
> I'm not sure what you mean by 'such changes'. I hope you don't mean these
> patches as they are not the key. They fix an obvious bug where task placement
> hint won't work at all. They don't modify any behavior that shouldn't have
> already been there. Nor introduce new limitation. I have to say I am
> disappointed that these patches aren't considered an important fix for an
> obvious breakage.

To me it's a dead-end to go this way. We need to see the full picture
including something like (1) or (2) or patches you have mentioned, like
the `down-migration in load-balance` thing.

Maybe we can at least list all the use cases for uclamp_max capping here:

It was mentioned:

(A) `soft affinity for tasks w/ util_avg > uclamp_max`.

Are there more?
  
Lukasz Luba June 5, 2023, 1:07 p.m. UTC | #16
On 5/31/23 19:22, Qais Yousef wrote:
> Hi Lukasz!
> 
> Sorry for late response..
> 
> On 05/22/23 09:30, Lukasz Luba wrote:
>> Hi Qais,
>>
>> I have a question regarding the 'soft cpu affinity'.
> 
> [...]
> 
>>> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
>>> is two folds:
>>>
>>> 	1. Prevent tasks from consuming energy.
>>> 	2. Keep them away from expensive CPUs.
>>>
>>> 2 is actually very important for 2 reasons:
>>>
>>> 	a. Because of max aggregation - any uncapped tasks that wakes up will
>>> 	   cause a frequency spike on this 'expensive' cpu. We don't have
>>> 	   a mechanism to downmigrate it - which is another thing I'm working
>>> 	   on.
>>> 	b. It is desired to keep these bigger cpu idle ready for more important
>>> 	   work.
>>>
>>> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
>>> that we'd like to preserve for other type of work.
>>
>> I'm a bit afraid about such 'strong force'. That means the task would
>> not go via EAS if we set uclamp_max e.g. 90, while the little capacity
>> is 125. Or am I missing something?
> 
> We should go via EAS, actually that's the whole point.
> 
> Why do you think we won't go via EAS? The logic should be is we give a hint to
> prefer the little core, but we still can pick something else if it's more
> energy efficient.
> 
> What uclamp_max enables us to do is still consider that little core even if its
> utilization says it doesn't fit there. We need to merge these patches first
> though as it's broken at the moment. If little capacity is 125 and utilization
> of the task is 125, then even if uclamp_max is 0, EAS will skip the whole
> little cluster as a potential candidate because there's no spare_capacity there.
> Even if the whole little cluster is idle.

OK, I see now - it's a bug then.

> 
>>
>> This might effectively use more energy for those tasks which can run on
>> any CPU and EAS would figure a good energy placement. I'm worried
>> about this, since we have L3+littles in one DVFS domain and the L3
>> would be only bigger in future.
> 
> It's a bias that will enable the search algorithm in EAS to still consider the
> little core for big tasks. This bias will depend on the uclamp_max value chosen
> by userspace (so they have some control on how hard to cap the task), and what
> else is happening in the system at the time it wakes up.

OK, so we would go via EAS and check the energy impact in 3 PDs - which
is desired.

> 
>>
>> IMO to keep the big cpus more in idle, we should give them big energy
>> wake up cost. That's my 3rd feature to the EM presented in OSPM2023.
> 
> Considering the wake up cost in EAS would be a great addition to have :)
> 
>>
>>>
>>> Of course userspace has control by selecting the right uclamp_max value. They
>>> can increase it to allow a spill to next pd - or keep it low to steer them more
>>> strongly on a specific pd.
>>
>> This would be interesting to see in practice. I think we need such an
>> experiment for such changes.
> 
> I'm not sure what you mean by 'such changes'. I hope you don't mean these
> patches as they are not the key. They fix an obvious bug where task placement
> hint won't work at all. They don't modify any behavior that shouldn't have
> already been there. Nor introduce new limitation. I have to say I am
> disappointed that these patches aren't considered an important fix for an
> obvious breakage.

I mean, in practice - in our pixel6 test 3-gear :)

Thanks for the explanation.
  
Hongyan Xia June 7, 2023, 11:50 a.m. UTC | #17
Hi Qais,

On 2023-02-11 17:28, Qais Yousef wrote:
> On 02/07/23 10:45, Vincent Guittot wrote:
>> [...]
> 
> Isn't it better to go back to v1 form then? The inconsistent type paired with
> the cast isn't getting too ugly for me :(
> 
> I don't think we can convert cpu_cap to long without having to do more work as
> it is used with 'util'.

Sorry if I'm missing something obvious, but why can't we convert util to 
long? The only place I see which mixes with util is

	lsub_positive(&cpu_cap, util);

but at this point both cpu_cap and util should have sane values (top bit 
not set) so doing clamped subtraction should just work fine?
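
(For reference, a standalone sketch of the clamped-subtraction semantics being
relied on here - behaviourally what lsub_positive() does, written as a plain
function for illustration rather than the kernel macro:)

/* Subtract b from a, clamping at zero instead of wrapping around. With both
 * inputs "sane" (top bit clear), the result stays sane too, so a later
 * signed use can't misinterpret it. */
static inline unsigned long sub_clamped(unsigned long a, unsigned long b)
{
	return a > b ? a - b : 0;
}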

Hongyan
  
Hongyan Xia June 7, 2023, 2:52 p.m. UTC | #18
Hi Qais,

On 2023-02-11 17:50, Qais Yousef wrote:
> [...]
>>
>> So EAS keeps packing on the cheaper PD/clamped OPP.
> 
> Which is the desired behavior for uclamp_max?
> 
> The only issue I see is that we want to distribute within a pd. Which is
> something I was going to work on and send later - but can lump it in this
> series if it helps.

I more or less share the same concern with Dietmar, which is packing 
things on the same small CPU when everyone has spare cpu_cap of 0.

I wonder if this could be useful: On the side of cfs_rq->avg.util_avg, 
we have a cfs_rq->avg.util_avg_uclamp_max. It is keeping track of 
util_avg, but each task on the rq is capped at its uclamp_max value, so 
even if there's two always-running tasks with uclamp_max values of 100 
with no idle time, the cfs_rq only sees cpu_util() of 200 and still has 
remaining capacity of 1024 - 200, not 0. This also helps balancing the 
load when rqs have no idle time. Even if two CPUs both have no idle 
time, but one is running a single task clamped at 100, the other running 
2 such tasks, the first sees a remaining capacity of 1024 - 100, while 
the 2nd is 1024 - 200, so we still prefer the first one.
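
(A minimal standalone C sketch of the idea above - keeping a second sum where
each task's contribution is capped at its uclamp_max, so a saturated rq can
still report spare capacity. The struct, field and function names are invented
for illustration and are not an existing kernel interface:)

struct toy_rq {
	unsigned long util_avg;			/* plain sum of task util */
	unsigned long util_avg_uclamp_max;	/* sum of min(util, uclamp_max) */
};

static void toy_enqueue(struct toy_rq *rq, unsigned long task_util,
			unsigned long uclamp_max)
{
	rq->util_avg += task_util;
	rq->util_avg_uclamp_max += task_util < uclamp_max ? task_util : uclamp_max;
}

/* Two always-running tasks with uclamp_max = 100: util_avg saturates, but
 * the clamped sum reports 200, leaving 1024 - 200 of remaining capacity to
 * compare CPUs with. */
static unsigned long toy_remaining_cap(const struct toy_rq *rq,
				       unsigned long cpu_cap)
{
	unsigned long clamped = rq->util_avg_uclamp_max;

	return clamped < cpu_cap ? cpu_cap - clamped : 0;
}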

And I wonder if this could also help calculating energy when there's no 
idle time under uclamp_max. Instead of seeing a util_avg at 1024, we 
actually see a lower value. This is also what cpu_util_next() does in 
Android's sum aggregation, but I'm thinking of maintaining it right 
beside util_avg so that we don't have to sum up everything every time.

Hongyan
  
Qais Yousef June 30, 2023, 11:30 a.m. UTC | #19
On 06/05/23 13:29, Dietmar Eggemann wrote:
> On 31/05/2023 20:22, Qais Yousef wrote:
> > Hi Lukasz!
> > 
> > Sorry for late response..
> > 
> > On 05/22/23 09:30, Lukasz Luba wrote:
> >> Hi Qais,
> >>
> >> I have a question regarding the 'soft cpu affinity'.
> > 
> > [...]
> > 
> >>> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max
> >>> is two folds:
> >>>
> >>> 	1. Prevent tasks from consuming energy.
> >>> 	2. Keep them away from expensive CPUs.
> >>>
> >>> 2 is actually very important for 2 reasons:
> >>>
> >>> 	a. Because of max aggregation - any uncapped tasks that wakes up will
> >>> 	   cause a frequency spike on this 'expensive' cpu. We don't have
> >>> 	   a mechanism to downmigrate it - which is another thing I'm working
> >>> 	   on.
> >>> 	b. It is desired to keep these bigger cpu idle ready for more important
> >>> 	   work.
> >>>
> >>> For 2, generally we don't want these tasks to steal bandwidth from these CPUs
> >>> that we'd like to preserve for other type of work.
> >>
> >> I'm a bit afraid about such 'strong force'. That means the task would
> >> not go via EAS if we set uclamp_max e.g. 90, while the little capacity
> >> is 125. Or am I missing something?
> > 
> > We should go via EAS, actually that's the whole point.
> > 
> > Why do you think we won't go via EAS? The logic should be is we give a hint to
> > prefer the little core, but we still can pick something else if it's more
> > energy efficient.
> > 
> > What uclamp_max enables us to do is still consider that little core even if its
> > utilization says it doesn't fit there. We need to merge these patches first
> > though as it's broken at the moment. If little capacity is 125 and utilization
> > of the task is 125, then even if uclamp_max is 0, EAS will skip the whole
> > little cluster as a potential candidate because there's no spare_capacity there.
> > Even if the whole little cluster is idle.
> 
> I'm not against letting uclamp_max force fit the placement of p. I'm
> just worried that using the EM (especially in its current state) for
> that is wrong and will only work in certain scenarios like the one you
> picked.
> 
> I did show a counter-example (Juno-r0 [446 1024 1024 446 446 446] with 6
> 8ms/16ms uclamp_max=440). The issue is that once we have no idle time
> left, the decisions of the EM are simply wrong and we shouldn't enforce
> those decisions.

But this is a different problem, no? If you consider the current level of
brokenness, the impact is worse IMO.

Given that Android force packs things on little cores using cpuset, I see this
wrong calculation bug as less impactful. But I do agree we should do better. But
I don't see this as a blocker to merging this fix.

The thing we need to keep in mind is that if the big cores are busy or running
at high frequencies, then yes we want to cram these tasks on the little cores.
The impact of them running accidentally there (especially as we still lack
support for corrective action in the load balancer) is worse.

> 
> There is a related issue. E.g. on Pixel6 with its 4 little CPUs with
> cpu_cap_orig = 124. If you admit a task with p->util_avg > cpu_cap_orig
> it doesn't even fit onto a CPU. But we then do an energy calculation
> taking the values of the whole little PD into consideration. This makes
> no sense. The EM is not uclamp_max aware right now.

Sorry, could you explain the difference between this problem and the one we
were already discussing?

> 
> What about using `sis() -> sic()` for uclamp_max constrained tasks? We
> would just have to iterate over the CPUs in cpu_cap_orig order. (1)

What's the advantage here? This is not useful because we only search idle
capacities. So in scenarios where the system is busy, they'll end up placed
randomly, but we do want to pack them on little cores IMHO.

> Or the EM has to be made uclamp_max aware. (2)

Do you have any suggestions here?

Note we lack evidence of how problematic this is in real world scenarios. All
the systems I've seen already pack everything on little cores via cpuset. So
nothing gets worse here. But we get a chance to do better in other cases while
we look for ways to improve this.

> 
> Your example is the idle system with 1 task p waking up. This task has a
> util_avg which excludes the little CPUs from being the new task_cpu(p).
> But p's uclamp_max value would allow this. But we can't just consider
> the placement of 1 task here.

No one is saying we need to consider the placement of 1 task only. I think you
exaggerate the impact of cramming things while not noticing that, whether it's
1 or 100 tasks, EAS will not even consider the little cores for those busy
tasks. uclamp_max was introduced to allow this, and it just doesn't work since
by definition these tasks will cause the little cores to have 0 spare
capacity. But we skip all cpus with 0 spare capacity.

> 
> >> This might effectively use more energy for those tasks which can run on
> >> any CPU and EAS would figure a good energy placement. I'm worried
> >> about this, since we have L3+littles in one DVFS domain and the L3
> >> would be only bigger in future.
> > 
> > It's a bias that will enable the search algorithm in EAS to still consider the
> > little core for big tasks. This bias will depend on the uclamp_max value chosen
> > by userspace (so they have some control on how hard to cap the task), and what
> > else is happening in the system at the time it wakes up.
> 
> To teach the EM about such tricky dependencies is IMHO outside the scope
> of `how to select a CPU for a uclamp_max constrained task`. (3)
> 
> >> IMO to keep the big cpus more in idle, we should give them big energy
> >> wake up cost. That's my 3rd feature to the EM presented in OSPM2023.
> > 
> > Considering the wake up cost in EAS would be a great addition to have :)
> 
> I see this one as unrelated to (3) as well.
> 
> >>> Of course userspace has control by selecting the right uclamp_max value. They
> >>> can increase it to allow a spill to next pd - or keep it low to steer them more
> >>> strongly on a specific pd.
> >>
> > >> This would be interesting to see in practice. I think we need such an
> > >> experiment for such changes.
> > 
> > I'm not sure what you mean by 'such changes'. I hope you don't mean these
> > patches as they are not the key. They fix an obvious bug where task placement
> > hint won't work at all. They don't modify any behavior that shouldn't have
> > already been there. Nor introduce new limitation. I have to say I am
> > disappointed that these patches aren't considered an important fix for an
> > obvious breakage.
> 
> To me it's a dead-end to go this way. We need to see the full picture
> including something like (1) or (2) or patches you have mentioned, like
> the `down-migration in load-balance` thing.

The down migration is to address another problem. It is a necessary fix to do
corrections when we end up in a bad situation. The wake up path still has to do
the right thing first.

> 
> Maybe we can at least list all the use cases for uclamp_max capping here:
> 
> It was mentioned:
> 
> (A) `soft affinity for tasks w/ util_avg > uclamp_max`.
> 
> Are there more?

We are trying to fix bugs here. Do you want to restart the discussion of why
uclamp_max was introduced?

Uclamp max is a hint to bias task placement and power consumption by helping
keep these tasks away from big cpus and higher frequencies, based on the max
performance level the task is allowed to reach.


Cheers

--
Qais Yousef
  
Qais Yousef June 30, 2023, 11:41 a.m. UTC | #20
On 06/07/23 15:52, Hongyan Xia wrote:
> Hi Qais,
> 
> On 2023-02-11 17:50, Qais Yousef wrote:
> > [...]
> > > 
> > > So EAS keeps packing on the cheaper PD/clamped OPP.
> > 
> > Which is the desired behavior for uclamp_max?
> > 
> > The only issue I see is that we want to distribute within a pd. Which is
> > something I was going to work on and send later - but can lump it in this
> > series if it helps.
> 
> I more or less share the same concern with Dietmar, which is packing things
> on the same small CPU when everyone has spare cpu_cap of 0.
> 
> I wonder if this could be useful: On the side of cfs_rq->avg.util_avg, we
> have a cfs_rq->avg.util_avg_uclamp_max. It is keeping track of util_avg, but
> each task on the rq is capped at its uclamp_max value, so even if there's
> two always-running tasks with uclamp_max values of 100 with no idle time,
> the cfs_rq only sees cpu_util() of 200 and still has remaining capacity of
> 1024 - 200, not 0. This also helps balancing the load when rqs have no idle
> time. Even if two CPUs both have no idle time, but one is running a single
> task clamped at 100, the other running 2 such tasks, the first sees a
> remaining capacity of 1024 - 100, while the 2nd is 1024 - 200, so we still
> prefer the first one.

If I understood correctly, you're suggesting doing accounting of the sum of
uclamp_max for all the enqueued tasks?

I think we discussed this in the past. Can't remember the details now, but
adding additional accounting seemed undesirable.

And I had an issue with treating uclamp_max as a bandwidth hint rather than
a performance requirement hint. Limiting a task to 200 means it can't run
faster than this, but it doesn't mean it is not allowed to consume more
bandwidth than 200. Nice values and the cfs bandwidth controller should be used
for that.

> And I wonder if this could also help calculating energy when there's no idle
> time under uclamp_max. Instead of seeing a util_avg at 1024, we actually see
> a lower value. This is also what cpu_util_next() does in Android's sum
> aggregation, but I'm thinking of maintaining it right beside util_avg so
> that we don't have to sum up everything every time.

I haven't thought about how to improve the EM calculations to be honest, I see
this as a secondary problem compared to the other issue we need to fix first.

It seems load_avg can grow unboundedly; can you look at using this signal to
distribute within a cluster, and as a hint that we might be better off spilling
to others if they're already running at a perf level <= uclamp_max?


Thanks

--
Qais Yousef
  
Qais Yousef June 30, 2023, 11:44 a.m. UTC | #21
On 06/07/23 12:50, Hongyan Xia wrote:
> Hi Qais,
> 
> On 2023-02-11 17:28, Qais Yousef wrote:
> > On 02/07/23 10:45, Vincent Guittot wrote:
> > > [...]
> > 
> > Isn't it better to go back to v1 form then? The inconsistent type paired with
> > the cast isn't getting too ugly for me :(
> > 
> > I don't think we can convert cpu_cap to long without having to do more work as
> > it is used with 'util'.
> 
> Sorry if I'm missing something obvious, but why can't we convert util to
> long? The only place I see which mixes with util is
> 
> 	lsub_positive(&cpu_cap, util);
> 
> but at this point both cpu_cap and util should have sane values (top bit not
> set) so doing clamped subtraction should just work fine?

I have this series completely paged out now. I'll consider this comment when
I prepare v3 - I hope I'll have some down time in a few weeks' time to send
some fixes I have been sitting on for a bit.


Cheers

--
Qais Yousef
  
Qais Yousef July 17, 2023, 9:49 p.m. UTC | #22
On 06/07/23 12:50, Hongyan Xia wrote:
> Hi Qais,
> 
> On 2023-02-11 17:28, Qais Yousef wrote:
> > On 02/07/23 10:45, Vincent Guittot wrote:
> > > [...]
> > 
> > Isn't it better to go back to v1 form then? The inconsistent type paired with
> > the cast isn't getting too ugly for me :(
> > 
> > I don't think we can convert cpu_cap to long without having to do more work as
> > it is used with 'util'.
> 
> Sorry if I'm missing something obvious, but why can't we convert util to
> long? The only place I see which mixes with util is
> 
> 	lsub_positive(&cpu_cap, util);
> 
> but at this point both cpu_cap and util should have sane values (top bit not
> set) so doing clamped subtraction should just work fine?

I think none of the options are very pretty. But the least evil is what Vincent
suggested: simply do the cast in this one place. Util is used in more places
and assumed to be unsigned long, so the chances of unhappy accidents are higher
and I don't feel like auditing more code for correctness given we have a simple
straightforward alternative now :)


Thanks

--
Qais Yousef
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6c8e7f52935..7a21ee74139f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7382,11 +7382,10 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	for (; pd; pd = pd->next) {
 		unsigned long util_min = p_util_min, util_max = p_util_max;
 		unsigned long cpu_cap, cpu_thermal_cap, util;
-		unsigned long cur_delta, max_spare_cap = 0;
+		long prev_spare_cap = -1, max_spare_cap = -1;
 		unsigned long rq_util_min, rq_util_max;
-		unsigned long prev_spare_cap = 0;
+		unsigned long cur_delta, base_energy;
 		int max_spare_cap_cpu = -1;
-		unsigned long base_energy;
 		int fits, max_fits = -1;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -7461,7 +7460,7 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			}
 		}
 
-		if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
+		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
 			continue;
 
 		eenv_pd_busy_time(&eenv, cpus, p);
@@ -7469,7 +7468,7 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
 
 		/* Evaluate the energy impact of using prev_cpu. */
-		if (prev_spare_cap > 0) {
+		if (prev_spare_cap > -1) {
 			prev_delta = compute_energy(&eenv, pd, cpus, p,
 						    prev_cpu);
 			/* CPU utilization has changed */