[v2] sched/fair: unlink misfit task from cpu overutilized

Message ID 20221228165415.3436-1-vincent.guittot@linaro.org
State New
Series [v2] sched/fair: unlink misfit task from cpu overutilized

Commit Message

Vincent Guittot Dec. 28, 2022, 4:54 p.m. UTC
By taking into account uclamp_min, the 1:1 relation between task misfit
and cpu overutilized is no longer true: a task with a small util_avg may
not fit a high capacity cpu because of its uclamp_min constraint.

Add a new state in util_fits_cpu() to reflect the case where a task would
fit a CPU except for the uclamp_min hint, which is a performance requirement.

Use -1 to reflect that a CPU doesn't fit only because of uclamp_min, so
callers can use this new value to take additional action and select the
best CPU that doesn't match the uclamp_min hint.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Change since v1:
- fix some wrong conditions
- take into account more cases

 kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------
 1 file changed, 74 insertions(+), 25 deletions(-)
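
For clarity, the return convention introduced by the patch can be summarised
as below (an illustrative summary written for this archive, not text from the
patch itself):

	/*
	 * Return convention of util_fits_cpu(), as seen from the callers:
	 *
	 *   1  : the CPU satisfies util, uclamp_min and uclamp_max
	 *   0  : the CPU cannot satisfy the utilization / uclamp_max request
	 *  -1  : only the uclamp_min performance hint is not satisfied (for
	 *        instance because of thermal pressure); callers such as
	 *        select_idle_capacity() and feec() may still keep such a CPU
	 *        as a best-effort fallback candidate
	 */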
  

Comments

Qais Yousef Jan. 12, 2023, 2:26 p.m. UTC | #1
Hi Vincent

On 12/28/22 17:54, Vincent Guittot wrote:
> By taking into account uclamp_min, the 1:1 relation between task misfit
> and cpu overutilized is no longer true: a task with a small util_avg may
> not fit a high capacity cpu because of its uclamp_min constraint.

Wouldn't it be better to split this into two patches

	* Unlink/Decouple misfit ...
	* Unlink/Decouple util_fits_cpu from HMP

?

> 
> Add a new state in util_fits_cpu() to reflect the case that task would fit
> a CPU except for the uclamp_min hint which is a performance requirement.
> 
> Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> can use this new value to take additional action to select the best CPU
> that doesn't match uclamp_min hint.

This part has nothing to do with the commit subject. I think it's better to
split the patches if it's not too much work for you.

> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> 
> Change since v1:
> - fix some wrong conditions
> - take into account more cases
> 
>  kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 74 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1649e7d71d24..57077f0a897e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4582,8 +4582,7 @@ static inline int util_fits_cpu(unsigned long util,
>  	 *     2. The system is being saturated when we're operating near
>  	 *        max capacity, it doesn't make sense to block overutilized.
>  	 */
> -	uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> -	uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> +	uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
>  	fits = fits || uclamp_max_fits;
>  
>  	/*
> @@ -4618,8 +4617,8 @@ static inline int util_fits_cpu(unsigned long util,
>  	 * handle the case uclamp_min > uclamp_max.
>  	 */
>  	uclamp_min = min(uclamp_min, uclamp_max);
> -	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> -		fits = fits && (uclamp_min <= capacity_orig_thermal);
> +	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> +		return -1;
>  
>  	return fits;
>  }
> @@ -4629,7 +4628,7 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
>  	unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
>  	unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
>  	unsigned long util = task_util_est(p);
> -	return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> +	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);

So the big difference between your approach and my approach is that
task_fits_cpu() and asym_fits_cpu() are now very strict with regard to thermal
pressure, since with your approach we delegate the smartness to the caller.

Should we add a comment for these 2 users to make it obvious we intentionally
ignore the '-1' value and why it is okay?

I'm not sure I can write a reasonable rationale myself. I'm actually worried
this might subtly break decisions made by select_idle_capacity() or feec() when
doing the LB.

Have you considered this?

>  }
>  
>  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> @@ -6864,6 +6863,7 @@ static int
>  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>  	unsigned long task_util, util_min, util_max, best_cap = 0;
> +	int fits, best_fits = 0;
>  	int cpu, best_cpu = -1;
>  	struct cpumask *cpus;
>  
> @@ -6879,12 +6879,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  
>  		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
>  			continue;
> -		if (util_fits_cpu(task_util, util_min, util_max, cpu))
> +
> +		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> +
> +		/* This CPU fits with all capacity and performance requirements */
> +		if (fits > 0)
>  			return cpu;
> +		/*
> +		 * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> +		 * for the CPU with highest performance capacity.
> +		 */
> +		else if (fits < 0)
> +			cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
>  
> -		if (cpu_cap > best_cap) {
> +		/*
> +		 * First, select cpu which fits better (-1 being better than 0).
> +		 * Then, select the one with largest capacity at same level.
> +		 */
> +		if ((fits < best_fits) ||
> +		    ((fits == best_fits) && (cpu_cap > best_cap))) {
>  			best_cap = cpu_cap;
>  			best_cpu = cpu;
> +			best_fits = fits;
>  		}
>  	}
>  
> @@ -6897,7 +6913,7 @@ static inline bool asym_fits_cpu(unsigned long util,
>  				 int cpu)
>  {
>  	if (sched_asym_cpucap_active())
> -		return util_fits_cpu(util, util_min, util_max, cpu);
> +		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
>  
>  	return true;
>  }
> @@ -7264,6 +7280,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
>  	struct root_domain *rd = this_rq()->rd;
>  	int cpu, best_energy_cpu, target = -1;
> +	int prev_fits = -1, best_fits = -1;
> +	unsigned long best_thermal_cap = 0;
> +	unsigned long prev_thermal_cap = 0;
>  	struct sched_domain *sd;
>  	struct perf_domain *pd;
>  	struct energy_env eenv;
> @@ -7299,6 +7318,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  		unsigned long prev_spare_cap = 0;
>  		int max_spare_cap_cpu = -1;
>  		unsigned long base_energy;
> +		int fits, max_fits = -1;
>  
>  		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>  
> @@ -7351,7 +7371,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  					util_max = max(rq_util_max, p_util_max);
>  				}
>  			}
> -			if (!util_fits_cpu(util, util_min, util_max, cpu))
> +
> +			fits = util_fits_cpu(util, util_min, util_max, cpu);
> +			if (!fits)
>  				continue;
>  
>  			lsub_positive(&cpu_cap, util);
> @@ -7359,7 +7381,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (cpu == prev_cpu) {
>  				/* Always use prev_cpu as a candidate. */
>  				prev_spare_cap = cpu_cap;
> -			} else if (cpu_cap > max_spare_cap) {
> +				prev_fits = fits;
> +			} else if ((fits > max_fits) ||
> +				   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
>  				/*
>  				 * Find the CPU with the maximum spare capacity
>  				 * among the remaining CPUs in the performance
> @@ -7367,6 +7391,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  				 */
>  				max_spare_cap = cpu_cap;
>  				max_spare_cap_cpu = cpu;
> +				max_fits = fits;
>  			}
>  		}
>  
> @@ -7385,26 +7410,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (prev_delta < base_energy)
>  				goto unlock;
>  			prev_delta -= base_energy;
> +			prev_thermal_cap = cpu_thermal_cap;
>  			best_delta = min(best_delta, prev_delta);
>  		}
>  
>  		/* Evaluate the energy impact of using max_spare_cap_cpu. */
>  		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> +			/* Current best energy cpu fits better */
> +			if (max_fits < best_fits)
> +				continue;
> +
> +			/*
> +			 * Both don't fit performance (i.e. uclamp_min) but
> +			 * best energy cpu has better performance.
> +			 */
> +			if ((max_fits < 0) &&
> +			    (cpu_thermal_cap <= best_thermal_cap))
> +				continue;
> +
>  			cur_delta = compute_energy(&eenv, pd, cpus, p,
>  						   max_spare_cap_cpu);
>  			/* CPU utilization has changed */
>  			if (cur_delta < base_energy)
>  				goto unlock;
>  			cur_delta -= base_energy;
> -			if (cur_delta < best_delta) {
> -				best_delta = cur_delta;
> -				best_energy_cpu = max_spare_cap_cpu;
> -			}
> +
> +			/*
> +			 * Both fit for the task but best energy cpu has lower
> +			 * energy impact.
> +			 */
> +			if ((max_fits > 0) &&

Shouldn't this be

			if ((max_fits > 0) && (max_fits == best_fits) &&
?

We should update best_delta unconditionally first time we hit max_fits = 1, no?

I think it's worth extending the comment with something along the lines of

			* ... except for the first time max_fits becomes 1
			* then we must update best_delta unconditionally

> +			    (cur_delta >= best_delta))
> +				continue;
> +
> +			best_delta = cur_delta;
> +			best_energy_cpu = max_spare_cap_cpu;
> +			best_fits = max_fits;
> +			best_thermal_cap = cpu_thermal_cap;
>  		}
>  	}
>  	rcu_read_unlock();
>  
> -	if (best_delta < prev_delta)
> +	if ((best_fits > prev_fits) ||
> +	    ((best_fits > 0) && (best_delta < prev_delta)) ||
> +	    ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
>  		target = best_energy_cpu;

Overall I think the approach is sound. I tested it on my pinebook pro and
couldn't catch obvious breakage at least.

I am still worried though that spilling the knowledge outside of
util_fits_cpu() creates extra complexity in the callers and potentially
more fragility when these callers evolve over time, e.g.
task_fits_cpu()/asym_fits_cpu() gain a new user that must actually care about
the -1 return value.

I think we can still optimize the capacity inversion logic to use no loops
without having to spill the knowledge to the users/callers of util_fits_cpu(),
no?

That said except for the few comments I had this LGTM anyway. Thanks for your
effort!


Cheers

--
Qais Yousef

>  
>  	return target;
> @@ -10228,24 +10277,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	 */
>  	update_sd_lb_stats(env, &sds);
>  
> -	if (sched_energy_enabled()) {
> -		struct root_domain *rd = env->dst_rq->rd;
> -
> -		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> -			goto out_balanced;
> -	}
> -
> -	local = &sds.local_stat;
> -	busiest = &sds.busiest_stat;
> -
>  	/* There is no busy sibling group to pull tasks from */
>  	if (!sds.busiest)
>  		goto out_balanced;
>  
> +	busiest = &sds.busiest_stat;
> +
>  	/* Misfit tasks should be dealt with regardless of the avg load */
>  	if (busiest->group_type == group_misfit_task)
>  		goto force_balance;
>  
> +	if (sched_energy_enabled()) {
> +		struct root_domain *rd = env->dst_rq->rd;
> +
> +		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> +			goto out_balanced;
> +	}
> +
>  	/* ASYM feature bypasses nice load balance check */
>  	if (busiest->group_type == group_asym_packing)
>  		goto force_balance;
> @@ -10258,6 +10306,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	if (busiest->group_type == group_imbalanced)
>  		goto force_balance;
>  
> +	local = &sds.local_stat;
>  	/*
>  	 * If the local group is busier than the selected busiest group
>  	 * don't try and pull any tasks.
> -- 
> 2.17.1
>
  
Vincent Guittot Jan. 13, 2023, 8:53 a.m. UTC | #2
On Thu, 12 Jan 2023 at 15:26, Qais Yousef <qyousef@layalina.io> wrote:
>
> Hi Vincent
>
> On 12/28/22 17:54, Vincent Guittot wrote:
> > By taking into account uclamp_min, the 1:1 relation between task misfit
> > and cpu overutilized is no longer true: a task with a small util_avg may
> > not fit a high capacity cpu because of its uclamp_min constraint.
>
> Wouldn't it be better to split this into two patches
>
>         * Unlink/Decouple misfit ...
>         * Unlink/Decouple util_fits_cpu from HMP
>
> ?

I'm afraid that git bisect could then raise a false positive between
the 2 commits

>
> >
> > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > a CPU except for the uclamp_min hint which is a performance requirement.
> >
> > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > can use this new value to take additional action to select the best CPU
> > that doesn't match uclamp_min hint.
>
> This part has nothing to do with the commit subject. I think it's better to
> split the patches if it's not too much work for you.
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
> > Change since v1:
> > - fix some wrong conditions
> > - take into account more cases
> >
> >  kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------
> >  1 file changed, 74 insertions(+), 25 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1649e7d71d24..57077f0a897e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4582,8 +4582,7 @@ static inline int util_fits_cpu(unsigned long util,
> >        *     2. The system is being saturated when we're operating near
> >        *        max capacity, it doesn't make sense to block overutilized.
> >        */
> > -     uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > -     uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> > +     uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
> >       fits = fits || uclamp_max_fits;
> >
> >       /*
> > @@ -4618,8 +4617,8 @@ static inline int util_fits_cpu(unsigned long util,
> >        * handle the case uclamp_min > uclamp_max.
> >        */
> >       uclamp_min = min(uclamp_min, uclamp_max);
> > -     if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> > -             fits = fits && (uclamp_min <= capacity_orig_thermal);
> > +     if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> > +             return -1;
> >
> >       return fits;
> >  }
> > @@ -4629,7 +4628,7 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> >       unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> >       unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> >       unsigned long util = task_util_est(p);
> > -     return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> > +     return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
>
> So the big difference between your approach and my approach is that
> task_fits_cpu() and asym_fits_cpu() now are very strict in regards to thermal
> pressure since with your approach we delegate the smartness to the caller.
>
> Should we add a comment for these 2 users to make it obvious we intentionally
> ignore the '-1' value and why it is okay?

I can probably add something saying that a positive value (1 in this
case) is the only one that says that a task fits on a cpu. Other
returned values imply that either the utilization or the uclamp
constraints are not met.
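
Something along these lines, for instance (an illustrative sketch of such a
comment, not the final wording):

	/*
	 * util_fits_cpu() returns a strictly positive value only when the
	 * task fits the CPU for all requirements (utilization, uclamp_min
	 * and uclamp_max). Any other return value (0 or -1) means that
	 * either the utilization or one of the uclamp constraints is not
	 * met, so this boolean helper deliberately treats it as "does not
	 * fit".
	 */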
>
> I'm not sure I can write a reasonable rationale myself. I'm actually worried
> this might subtly break decisions made by select_idle_capacity() or feec() when
> doing the LB.
>
> Have you considered this?

Yes, that's why I have kept the changes in 1 patch

>
> >  }
> >
> >  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > @@ -6864,6 +6863,7 @@ static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> >       unsigned long task_util, util_min, util_max, best_cap = 0;
> > +     int fits, best_fits = 0;
> >       int cpu, best_cpu = -1;
> >       struct cpumask *cpus;
> >
> > @@ -6879,12 +6879,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >
> >               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> >                       continue;
> > -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > +
> > +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > +
> > +             /* This CPU fits with all capacity and performance requirements */
> > +             if (fits > 0)
> >                       return cpu;
> > +             /*
> > +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > +              * for the CPU with highest performance capacity.
> > +              */
> > +             else if (fits < 0)
> > +                     cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> >
> > -             if (cpu_cap > best_cap) {
> > +             /*
> > +              * First, select cpu which fits better (-1 being better than 0).
> > +              * Then, select the one with largest capacity at same level.
> > +              */
> > +             if ((fits < best_fits) ||
> > +                 ((fits == best_fits) && (cpu_cap > best_cap))) {
> >                       best_cap = cpu_cap;
> >                       best_cpu = cpu;
> > +                     best_fits = fits;
> >               }
> >       }
> >
> > @@ -6897,7 +6913,7 @@ static inline bool asym_fits_cpu(unsigned long util,
> >                                int cpu)
> >  {
> >       if (sched_asym_cpucap_active())
> > -             return util_fits_cpu(util, util_min, util_max, cpu);
> > +             return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> >
> >       return true;
> >  }
> > @@ -7264,6 +7280,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >       unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> >       struct root_domain *rd = this_rq()->rd;
> >       int cpu, best_energy_cpu, target = -1;
> > +     int prev_fits = -1, best_fits = -1;
> > +     unsigned long best_thermal_cap = 0;
> > +     unsigned long prev_thermal_cap = 0;
> >       struct sched_domain *sd;
> >       struct perf_domain *pd;
> >       struct energy_env eenv;
> > @@ -7299,6 +7318,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >               unsigned long prev_spare_cap = 0;
> >               int max_spare_cap_cpu = -1;
> >               unsigned long base_energy;
> > +             int fits, max_fits = -1;
> >
> >               cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> >
> > @@ -7351,7 +7371,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                                       util_max = max(rq_util_max, p_util_max);
> >                               }
> >                       }
> > -                     if (!util_fits_cpu(util, util_min, util_max, cpu))
> > +
> > +                     fits = util_fits_cpu(util, util_min, util_max, cpu);
> > +                     if (!fits)
> >                               continue;
> >
> >                       lsub_positive(&cpu_cap, util);
> > @@ -7359,7 +7381,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                       if (cpu == prev_cpu) {
> >                               /* Always use prev_cpu as a candidate. */
> >                               prev_spare_cap = cpu_cap;
> > -                     } else if (cpu_cap > max_spare_cap) {
> > +                             prev_fits = fits;
> > +                     } else if ((fits > max_fits) ||
> > +                                ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> >                               /*
> >                                * Find the CPU with the maximum spare capacity
> >                                * among the remaining CPUs in the performance
> > @@ -7367,6 +7391,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                                */
> >                               max_spare_cap = cpu_cap;
> >                               max_spare_cap_cpu = cpu;
> > +                             max_fits = fits;
> >                       }
> >               }
> >
> > @@ -7385,26 +7410,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                       if (prev_delta < base_energy)
> >                               goto unlock;
> >                       prev_delta -= base_energy;
> > +                     prev_thermal_cap = cpu_thermal_cap;
> >                       best_delta = min(best_delta, prev_delta);
> >               }
> >
> >               /* Evaluate the energy impact of using max_spare_cap_cpu. */
> >               if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> > +                     /* Current best energy cpu fits better */
> > +                     if (max_fits < best_fits)
> > +                             continue;
> > +
> > +                     /*
> > +                      * Both don't fit performance (i.e. uclamp_min) but
> > +                      * best energy cpu has better performance.
> > +                      */
> > +                     if ((max_fits < 0) &&
> > +                         (cpu_thermal_cap <= best_thermal_cap))
> > +                             continue;
> > +
> >                       cur_delta = compute_energy(&eenv, pd, cpus, p,
> >                                                  max_spare_cap_cpu);
> >                       /* CPU utilization has changed */
> >                       if (cur_delta < base_energy)
> >                               goto unlock;
> >                       cur_delta -= base_energy;
> > -                     if (cur_delta < best_delta) {
> > -                             best_delta = cur_delta;
> > -                             best_energy_cpu = max_spare_cap_cpu;
> > -                     }
> > +
> > +                     /*
> > +                      * Both fit for the task but best energy cpu has lower
> > +                      * energy impact.
> > +                      */
> > +                     if ((max_fits > 0) &&
>
> Shouldn't this be
>
>                         if ((max_fits > 0) && (max_fits == best_fits) &&
> ?

I will use the below, which better matches the comment and the fact that
both fit for the task:

+                    if ((max_fits > 0) && (best_fits > 0) &&

>
> We should update best_delta unconditionally first time we hit max_fits = 1, no?
>
> I think it's worth extending the comment with something along the lines of
>
>                         * ... except for the first time max_fits becomes 1
>                         * then we must update best_delta unconditionally

With the new condition above this is not needed anymore

>
> > +                         (cur_delta >= best_delta))
> > +                             continue;
> > +
> > +                     best_delta = cur_delta;
> > +                     best_energy_cpu = max_spare_cap_cpu;
> > +                     best_fits = max_fits;
> > +                     best_thermal_cap = cpu_thermal_cap;
> >               }
> >       }
> >       rcu_read_unlock();
> >
> > -     if (best_delta < prev_delta)
> > +     if ((best_fits > prev_fits) ||
> > +         ((best_fits > 0) && (best_delta < prev_delta)) ||
> > +         ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
> >               target = best_energy_cpu;
>
> Overall I think the approach is sound. I tested it on my pinebook pro and
> couldn't catch obvious breakage at least.
>
> I am still worried though about spilling the knowledge outside of
> util_fits_cpu() is creating extra complexity in the callers and potentially
> more fragility when these callers evolve overtime e.g:
> task_fits_cpu()/asym_fits_cpu() gain a new user that must actually care about
> the -1 return value.

task_fits_cpu()/asym_fits_cpu() remain simple booleans that return true
if the task fits the cpu with regard to all requirements. If a new user
wants to make smarter decisions like select_idle_capacity() or feec(),
it will have to use util_fits_cpu(). Both handle the case where
uclamp_min is not met differently.

>
> I think we can still optimize the capacity inversion logic to use no loops
> without having to spill the knowledge to the users/callers of util_fits_cpu(),
> no?

TBH, I don't know how, because we are not always comparing the same
things depending on the reason it doesn't fit

>
> That said except for the few comments I had this LGTM anyway. Thanks for your
> effort!

Thanks

I'm going to prepare  v2

>
>
> Cheers
>
> --
> Qais Yousef
>
> >
> >       return target;
> > @@ -10228,24 +10277,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >        */
> >       update_sd_lb_stats(env, &sds);
> >
> > -     if (sched_energy_enabled()) {
> > -             struct root_domain *rd = env->dst_rq->rd;
> > -
> > -             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > -                     goto out_balanced;
> > -     }
> > -
> > -     local = &sds.local_stat;
> > -     busiest = &sds.busiest_stat;
> > -
> >       /* There is no busy sibling group to pull tasks from */
> >       if (!sds.busiest)
> >               goto out_balanced;
> >
> > +     busiest = &sds.busiest_stat;
> > +
> >       /* Misfit tasks should be dealt with regardless of the avg load */
> >       if (busiest->group_type == group_misfit_task)
> >               goto force_balance;
> >
> > +     if (sched_energy_enabled()) {
> > +             struct root_domain *rd = env->dst_rq->rd;
> > +
> > +             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > +                     goto out_balanced;
> > +     }
> > +
> >       /* ASYM feature bypasses nice load balance check */
> >       if (busiest->group_type == group_asym_packing)
> >               goto force_balance;
> > @@ -10258,6 +10306,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >       if (busiest->group_type == group_imbalanced)
> >               goto force_balance;
> >
> > +     local = &sds.local_stat;
> >       /*
> >        * If the local group is busier than the selected busiest group
> >        * don't try and pull any tasks.
> > --
> > 2.17.1
> >
  
Kajetan Puchalski Jan. 13, 2023, 1:49 p.m. UTC | #3
Hi,

> By taking into account uclamp_min, the 1:1 relation between task misfit
> and cpu overutilized is no longer true: a task with a small util_avg may
> not fit a high capacity cpu because of its uclamp_min constraint.
> 
> Add a new state in util_fits_cpu() to reflect the case that task would fit
> a CPU except for the uclamp_min hint which is a performance requirement.
> 
> Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> can use this new value to take additional action to select the best CPU
> that doesn't match uclamp_min hint.

I just wanted to flag some issues I noticed with this patch and the
entire topic.

I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with
all the relevant uclamp and CFS scheduling patches backported to it from
mainline. From what I can see, the 'uclamp fits capacity' patchset
introduced some alarming power usage & performance issues that this
patch makes even worse.

The patch stack for the following tables is as follows:

(ufc_patched) sched/fair: unlink misfit task from cpu overutilized
sched/uclamp: Fix a uninitialized variable warnings
(baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
sched/uclamp: Fix fits_capacity() check in feec()
sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
sched/uclamp: Fix relationship between uclamp and migration margin
(previous 'baseline' was here)

I omitted the 3 patches relating directly to capacity_inversion, but in
the other tests I did with those included there were similar issues. It's
probably easier to consider the uclamp parts and their effects in
isolation.

1. Geekbench 5 (performance regression)

+-----------------+----------------------------+--------+-----------+
|     metric      |           kernel           | value  | perc_diff |
+-----------------+----------------------------+--------+-----------+
| multicore_score |          baseline          | 2765.4 |   0.0%    |
| multicore_score |        baseline_ufc        | 2704.3 |  -2.21%   | <-- a noticeable score decrease already
| multicore_score |        ufc_patched         | 2443.2 |  -11.65%  | <-- a massive score decrease
+-----------------+----------------------------+--------+-----------+

+--------------+--------+----------------------------+--------+-----------+
|  chan_name   | metric |           kernel           | value  | perc_diff |
+--------------+--------+----------------------------+--------+-----------+
| total_power  | gmean  |          baseline          | 2664.0 |   0.0%    |
| total_power  | gmean  |        baseline_ufc        | 2621.5 |   -1.6%   | <-- worse performance per watt
| total_power  | gmean  |        ufc_patched         | 2601.2 |  -2.36%   | <-- much worse performance per watt
+--------------+--------+----------------------------+--------+-----------+

The most likely cause for the regression seen above is the decrease in the
amount of time spent while overutilized with these patches. Maximising
overutilization for GB5 is the desired outcome, as the benchmark for
almost its entire duration keeps either 1 core or all the cores
completely saturated, so EAS cannot be effective. These patches have the
opposite of the desired effect in this area.

+----------------------------+--------------------+--------------------+------------+
|          kernel            |        time        |     total_time     | percentage |
+----------------------------+--------------------+--------------------+------------+
|          baseline          |      121.979       |      181.065       |   67.46    |
|        baseline_ufc        |      120.355       |      184.255       |   65.32    |
|        ufc_patched         |       60.715       |      196.135       |   30.98    | <-- !!!
+----------------------------+--------------------+--------------------+------------+

2. Jankbench (power usage regression)

+--------+---------------+---------------------------------+-------+-----------+
| metric |   variable    |             kernel              | value | perc_diff |
+--------+---------------+---------------------------------+-------+-----------+
| gmean  | mean_duration |          baseline_60hz          | 14.6  |   0.0%    |
| gmean  | mean_duration |        baseline_ufc_60hz        | 15.2  |   3.83%   |
| gmean  | mean_duration |        ufc_patched_60hz         | 14.0  |  -4.12%   |
+--------+---------------+---------------------------------+-------+-----------+

+--------+-----------+---------------------------------+-------+-----------+
| metric | variable  |             kernel              | value | perc_diff |
+--------+-----------+---------------------------------+-------+-----------+
| gmean  | jank_perc |          baseline_60hz          |  1.9  |   0.0%    |
| gmean  | jank_perc |        baseline_ufc_60hz        |  2.2  |  15.39%   |
| gmean  | jank_perc |        ufc_patched_60hz         |  2.0  |   3.61%   |
+--------+-----------+---------------------------------+-------+-----------+

+--------------+--------+---------------------------------+-------+-----------+
|  chan_name   | metric |             kernel              | value | perc_diff |
+--------------+--------+---------------------------------+-------+-----------+
| total_power  | gmean  |          baseline_60hz          | 135.9 |   0.0%    |
| total_power  | gmean  |        baseline_ufc_60hz        | 155.7 |  14.61%   | <-- !!!
| total_power  | gmean  |        ufc_patched_60hz         | 157.1 |  15.63%   | <-- !!!
+--------------+--------+---------------------------------+-------+-----------+

With these patches while running Jankbench we use up ~15% more power
just to achieve roughly the same results. Here I'm not sure where this
issue is coming from exactly but all the results above are very consistent
across different runs.

> -- 
> 2.17.1
> 
>
  
Vincent Guittot Jan. 13, 2023, 2:28 p.m. UTC | #4
Hi Kajetan,

On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> Hi,
>
> > By taking into account uclamp_min, the 1:1 relation between task misfit
> > and cpu overutilized is no longer true: a task with a small util_avg may
> > not fit a high capacity cpu because of its uclamp_min constraint.
> >
> > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > a CPU except for the uclamp_min hint which is a performance requirement.
> >
> > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > can use this new value to take additional action to select the best CPU
> > that doesn't match uclamp_min hint.
>
> I just wanted to flag some issues I noticed with this patch and the
> entire topic.
>
> I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with

Do you have more details to share on your setup?
The Android kernel has some hacks on top of mainline. Do you use any?
Also, the perf and the power can be largely impacted by the cgroup
configuration. Have you got details on that?

I'm going to try to reproduce the behavior

> all the relevant uclamp and CFS scheduling patches backported to it from
> mainline. From what I can see, the 'uclamp fits capacity' patchset
> introduced some alarming power usage & performance issues that this
> patch makes even worse.
>
> The patch stack for the following tables is as follows:
>
> (ufc_patched) sched/fair: unlink misfit task from cpu overutilized

I just sent a v3 which fixes a condition. I wonder if this could have an
impact on the results, both perf and power

> sched/uclamp: Fix a uninitialized variable warnings
> (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
> sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
> sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
> sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
> sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
> sched/uclamp: Fix fits_capacity() check in feec()
> sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
> sched/uclamp: Fix relationship between uclamp and migration margin
> (previous 'baseline' was here)
>
> I omitted the 3 patches relating directly to capacity_inversion but in
> the other tests I did with those there were similar issues. It's
> probably easier to consider the uclamp parts and their effects in
> isolation.
>
> 1. Geekbench 5 (performance regression)
>
> +-----------------+----------------------------+--------+-----------+
> |     metric      |           kernel           | value  | perc_diff |
> +-----------------+----------------------------+--------+-----------+
> | multicore_score |          baseline          | 2765.4 |   0.0%    |
> | multicore_score |        baseline_ufc        | 2704.3 |  -2.21%   | <-- a noticeable score decrease already
> | multicore_score |        ufc_patched         | 2443.2 |  -11.65%  | <-- a massive score decrease
> +-----------------+----------------------------+--------+-----------+
>
> +--------------+--------+----------------------------+--------+-----------+
> |  chan_name   | metric |           kernel           | value  | perc_diff |
> +--------------+--------+----------------------------+--------+-----------+
> | total_power  | gmean  |          baseline          | 2664.0 |   0.0%    |
> | total_power  | gmean  |        baseline_ufc        | 2621.5 |   -1.6%   | <-- worse performance per watt
> | total_power  | gmean  |        ufc_patched         | 2601.2 |  -2.36%   | <-- much worse performance per watt
> +--------------+--------+----------------------------+--------+-----------+
>
> The most likely cause for the regression seen above is the decrease in the amount of
> time spent while overutilized with these patches. Maximising
> overutilization for GB5 is the desired outcome as the benchmark for
> almost its entire duration keeps either 1 core or all the cores
> completely saturated so EAS cannot be effective. These patches have the
> opposite from the desired effect in this area.
>
> +----------------------------+--------------------+--------------------+------------+
> |          kernel            |        time        |     total_time     | percentage |
> +----------------------------+--------------------+--------------------+------------+
> |          baseline          |      121.979       |      181.065       |   67.46    |
> |        baseline_ufc        |      120.355       |      184.255       |   65.32    |
> |        ufc_patched         |       60.715       |      196.135       |   30.98    | <-- !!!
> +----------------------------+--------------------+--------------------+------------+

I'm not surprised, because some use cases which were not overutilized
were wrongly triggered as overutilized, switching back to
performance mode. You might have to tune the uclamp value

>
> 2. Jankbench (power usage regression)
>
> +--------+---------------+---------------------------------+-------+-----------+
> | metric |   variable    |             kernel              | value | perc_diff |
> +--------+---------------+---------------------------------+-------+-----------+
> | gmean  | mean_duration |          baseline_60hz          | 14.6  |   0.0%    |
> | gmean  | mean_duration |        baseline_ufc_60hz        | 15.2  |   3.83%   |
> | gmean  | mean_duration |        ufc_patched_60hz         | 14.0  |  -4.12%   |
> +--------+---------------+---------------------------------+-------+-----------+
>
> +--------+-----------+---------------------------------+-------+-----------+
> | metric | variable  |             kernel              | value | perc_diff |
> +--------+-----------+---------------------------------+-------+-----------+
> | gmean  | jank_perc |          baseline_60hz          |  1.9  |   0.0%    |
> | gmean  | jank_perc |        baseline_ufc_60hz        |  2.2  |  15.39%   |
> | gmean  | jank_perc |        ufc_patched_60hz         |  2.0  |   3.61%   |
> +--------+-----------+---------------------------------+-------+-----------+
>
> +--------------+--------+---------------------------------+-------+-----------+
> |  chan_name   | metric |             kernel              | value | perc_diff |
> +--------------+--------+---------------------------------+-------+-----------+
> | total_power  | gmean  |          baseline_60hz          | 135.9 |   0.0%    |
> | total_power  | gmean  |        baseline_ufc_60hz        | 155.7 |  14.61%   | <-- !!!
> | total_power  | gmean  |        ufc_patched_60hz         | 157.1 |  15.63%   | <-- !!!
> +--------------+--------+---------------------------------+-------+-----------+
>
> With these patches while running Jankbench we use up ~15% more power
> just to achieve roughly the same results. Here I'm not sure where this
> issue is coming from exactly but all the results above are very consistent
> across different runs.
>
> > --
> > 2.17.1
> >
> >
  
Kajetan Puchalski Jan. 13, 2023, 2:50 p.m. UTC | #5
> > I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with

> Do you have more details to share on your setup ?
> Android kernel has some hack on top of the mainline. Do you use some ?
> Then, the perf and the power can be largely impacted by the cgroup
> configuration. Have you got details on your setup ?

The kernel I use has all the vendor hooks and hacks switched off to
keep it as close to mainline as possible. Unfortunately 5.18 was the
last mainline that worked on this device due to some driver issues so we
just backport mainline scheduling patches as they come out to at least
keep the scheduler itself up to date.

> I just sent a v3 which fixes a condition. Wonder if this could have an
> impact on the results both perf and power

I don't think it'll fix the GB5 score side of it as that's clearly
related to overutilization while the condition changed in v3 is inside
the non-OU section of feec(). I'll still test the v3 on the weekend
if I have some free time.

The power usage issue was already introduced in the uclamp fits capacity
patchset that's been merged so I doubt this change will be enough to
account for it but I'll give it a try regardless.

> > The most likely cause for the regression seen above is the decrease in the amount of
> > time spent while overutilized with these patches. Maximising
> > overutilization for GB5 is the desired outcome as the benchmark for
> > almost its entire duration keeps either 1 core or all the cores
> > completely saturated so EAS cannot be effective. These patches have the
> > opposite from the desired effect in this area.
> >
> > +----------------------------+--------------------+--------------------+------------+
> > |          kernel            |        time        |     total_time     | percentage |
> > +----------------------------+--------------------+--------------------+------------+
> > |          baseline          |      121.979       |      181.065       |   67.46    |
> > |        baseline_ufc        |      120.355       |      184.255       |   65.32    |
> > |        ufc_patched         |       60.715       |      196.135       |   30.98    | <-- !!!
> > +----------------------------+--------------------+--------------------+------------+
>
> I'm not surprised because some use cases which were not overutilized
> were wrongly triggered as overutilized so switching back to
> performance mode. You might have to tune the uclamp value

But they'd be wrongly triggered with the 'baseline_ufc' variant while
not with the 'baseline' variant. The baseline here predates taking uclamp
into account for cpu_overutilized(); all cpu_overutilized() did on that
kernel was compare util against capacity.
Meaning that the 'real' overutilised time would be in the ~67% ballpark while
the patch makes it incorrectly not trigger more than half the time. I'm
not sure we can tweak uclamp enough to fix that.
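
For reference, the pre-uclamp check was roughly of this shape (a sketch from
memory of the older code, not an exact quote):

	static inline bool cpu_overutilized(int cpu)
	{
		return !fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu));
	}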

> >
> > 2. Jankbench (power usage regression)
> >
> > +--------+---------------+---------------------------------+-------+-----------+
> > | metric |   variable    |             kernel              | value | perc_diff |
> > +--------+---------------+---------------------------------+-------+-----------+
> > | gmean  | mean_duration |          baseline_60hz          | 14.6  |   0.0%    |
> > | gmean  | mean_duration |        baseline_ufc_60hz        | 15.2  |   3.83%   |
> > | gmean  | mean_duration |        ufc_patched_60hz         | 14.0  |  -4.12%   |
> > +--------+---------------+---------------------------------+-------+-----------+
> >
> > +--------+-----------+---------------------------------+-------+-----------+
> > | metric | variable  |             kernel              | value | perc_diff |
> > +--------+-----------+---------------------------------+-------+-----------+
> > | gmean  | jank_perc |          baseline_60hz          |  1.9  |   0.0%    |
> > | gmean  | jank_perc |        baseline_ufc_60hz        |  2.2  |  15.39%   |
> > | gmean  | jank_perc |        ufc_patched_60hz         |  2.0  |   3.61%   |
> > +--------+-----------+---------------------------------+-------+-----------+
> >
> > +--------------+--------+---------------------------------+-------+-----------+
> > |  chan_name   | metric |             kernel              | value | perc_diff |
> > +--------------+--------+---------------------------------+-------+-----------+
> > | total_power  | gmean  |          baseline_60hz          | 135.9 |   0.0%    |
> > | total_power  | gmean  |        baseline_ufc_60hz        | 155.7 |  14.61%   | <-- !!!
> > | total_power  | gmean  |        ufc_patched_60hz         | 157.1 |  15.63%   | <-- !!!
> > +--------------+--------+---------------------------------+-------+-----------+
> >
> > With these patches while running Jankbench we use up ~15% more power
> > just to achieve roughly the same results. Here I'm not sure where this
> > issue is coming from exactly but all the results above are very consistent
> > across different runs.
> >
> > > --
> > > 2.17.1
> > >
> > >
  
Qais Yousef Jan. 14, 2023, 6:21 p.m. UTC | #6
On 01/13/23 09:53, Vincent Guittot wrote:
> On Thu, 12 Jan 2023 at 15:26, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > Hi Vincent
> >
> > On 12/28/22 17:54, Vincent Guittot wrote:
> > > By taking into account uclamp_min, the 1:1 relation between task misfit
> > > and cpu overutilized is no longer true: a task with a small util_avg may
> > > not fit a high capacity cpu because of its uclamp_min constraint.
> >
> > Wouldn't it be better to split this into two patches
> >
> >         * Unlink/Decouple misfit ...
> >         * Unlink/Decouple util_fits_cpu from HMP
> >
> > ?
> 
> I'm afraid that git bisect could then raise a false positive between
> the 2 commits

They should be independent, no? Anyway, I don't feel that strongly about it but
as a minor nit the commit subject could be better.

> 
> >
> > >
> > > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > > a CPU except for the uclamp_min hint which is a performance requirement.
> > >
> > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > > can use this new value to take additional action to select the best CPU
> > > that doesn't match uclamp_min hint.
> >
> > This part has nothing to do with the commit subject. I think it's better to
> > split the patches if it's not too much work for you.
> >
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >
> > > Change since v1:
> > > - fix some wrong conditions
> > > - take into account more cases
> > >
> > >  kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++------------
> > >  1 file changed, 74 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 1649e7d71d24..57077f0a897e 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4582,8 +4582,7 @@ static inline int util_fits_cpu(unsigned long util,
> > >        *     2. The system is being saturated when we're operating near
> > >        *        max capacity, it doesn't make sense to block overutilized.
> > >        */
> > > -     uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > > -     uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> > > +     uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
> > >       fits = fits || uclamp_max_fits;
> > >
> > >       /*
> > > @@ -4618,8 +4617,8 @@ static inline int util_fits_cpu(unsigned long util,
> > >        * handle the case uclamp_min > uclamp_max.
> > >        */
> > >       uclamp_min = min(uclamp_min, uclamp_max);
> > > -     if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> > > -             fits = fits && (uclamp_min <= capacity_orig_thermal);
> > > +     if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> > > +             return -1;
> > >
> > >       return fits;
> > >  }
> > > @@ -4629,7 +4628,7 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> > >       unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > >       unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > >       unsigned long util = task_util_est(p);
> > > -     return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> > > +     return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
> >
> > So the big difference between your approach and my approach is that
> > task_fits_cpu() and asym_fits_cpu() now are very strict in regards to thermal
> > pressure since with your approach we delegate the smartness to the caller.
> >
> > Should we add a comment for these 2 users to make it obvious we intentionally
> > ignore the '-1' value and why it is okay?
> 
> I can probably add something saying that a positive value (1 in this
> case) is the only one that says that a task fits on a cpu. Other
> returned values imply that either the utilization or the uclamp
> constraints are not met.

Sounds good to me. So at least whoever looks at this later and doesn't know
the history will try to dig more before using it as-is and assuming it
just works.

> >
> > I'm not sure I can write a reasonable rationale myself. I'm actually worried
> > this might subtly break decisions made by select_idle_capacity() or feec() when
> > doing the LB.
> >
> > Have you considered this?
> 
> Yes, that why i have keep the changes in 1 patch

Okay I think I see what you're on about now.

[...]

> > > +                     /*
> > > +                      * Both fit for the task but best energy cpu has lower
> > > +                      * energy impact.
> > > +                      */
> > > +                     if ((max_fits > 0) &&
> >
> > Shouldn't this be
> >
> >                         if ((max_fits > 0) && (max_fits == best_fits) &&
> > ?
> 
> I will use the below which match better the comment and the fact that
> both fit for the task:
> 
> +                    if ((max_fits > 0) && (best_fits > 0) &&
> 

Fine by me.

> >
> > We should update best_delta unconditionally first time we hit max_fits = 1, no?
> >
> > I think it's worth extending the comment with something along the lines of
> >
> >                         * ... except for the first time max_fits becomes 1
> >                         * then we must update best_delta unconditionally
> 
> With the new condition above this is not needed anymore

+1

> 
> >
> > > +                         (cur_delta >= best_delta))
> > > +                             continue;
> > > +
> > > +                     best_delta = cur_delta;
> > > +                     best_energy_cpu = max_spare_cap_cpu;
> > > +                     best_fits = max_fits;
> > > +                     best_thermal_cap = cpu_thermal_cap;
> > >               }
> > >       }
> > >       rcu_read_unlock();
> > >
> > > -     if (best_delta < prev_delta)
> > > +     if ((best_fits > prev_fits) ||
> > > +         ((best_fits > 0) && (best_delta < prev_delta)) ||
> > > +         ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
> > >               target = best_energy_cpu;
> >
> > Overall I think the approach is sound. I tested it on my pinebook pro and
> > couldn't catch obvious breakage at least.
> >
> > I am still worried though about spilling the knowledge outside of
> > util_fits_cpu() is creating extra complexity in the callers and potentially
> > more fragility when these callers evolve overtime e.g:
> > task_fits_cpu()/asym_fits_cpu() gain a new user that must actually care about
> > the -1 return value.
> 
> task_fits_cpu()/asym_fits_cpu() remain simple booleans that return true
> if the task fits the cpu with regard to all requirements. If a new user
> wants to make smarter decisions like select_idle_capacity() or feec(),
> it will have to use util_fits_cpu(). Both handle the case where
> uclamp_min is not met differently.

I think with that comment added we can hope this will promote new users to
think more.

> 
> >
> > I think we can still optimize the capacity inversion logic to use no loops
> > without having to spill the knowledge to the users/callers of util_fits_cpu(),
> > no?
> 
> TBH, I don't know how, because we are not always comparing the same
> things depending on the reason it doesn't fit

Your patch will do a more comprehensive search, true. It will try harder to
find a best fit. I do have a bit of a bias that severe thermal conditions
are an indication of a bigger problem, so trying harder to honour uclamp_min
may not warrant the additional complexity. But I thought more about it and
I think it actually might be worthwhile. Xuewen had a system where the
mediums' capacity was 900 and the bigs could easily be inverted, with some
thermal pressure on the mediums too. So this harder search can help these
systems keep such tasks on the mediums. E.g. if uclamp_min was 900, the big
is inverted and the medium is under pressure, my approach would have
returned false for all CPUs, while yours will keep the task on the medium
cpus.
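
As a rough, self-contained illustration of that scenario (all the numbers and
the simplified fit check below are made up for the example; this is not the
kernel implementation):

	#include <stdio.h>

	#define SCHED_CAPACITY_SCALE 1024

	/*
	 * Very simplified tri-state fit check mirroring the idea of the patch:
	 * 1 = fits, 0 = doesn't fit, -1 = fits except for uclamp_min.
	 */
	static int fits_cpu(unsigned long util, unsigned long uclamp_min,
			    unsigned long uclamp_max, unsigned long cap_orig,
			    unsigned long cap_thermal)
	{
		int fits = util <= cap_orig;

		fits = fits || (uclamp_max <= cap_orig) ||
		       (cap_orig == SCHED_CAPACITY_SCALE);

		if (uclamp_min > uclamp_max)
			uclamp_min = uclamp_max;
		if (fits && util < uclamp_min && uclamp_min > cap_thermal)
			return -1;

		return fits;
	}

	int main(void)
	{
		/* task: small util, uclamp_min = 900 */
		unsigned long util = 200, umin = 900, umax = 1024;
		/* medium: capacity 900 with mild thermal pressure */
		int medium = fits_cpu(util, umin, umax, 900, 870);
		/* big: capacity 1024 but thermally capped below the medium */
		int big = fits_cpu(util, umin, umax, 1024, 850);

		/*
		 * Both report -1, so a caller like select_idle_capacity()
		 * keeps looking and picks the CPU with the largest
		 * thermally-adjusted capacity (the medium, 870 > 850)
		 * instead of deciding that nothing fits.
		 */
		printf("medium=%d big=%d\n", medium, big);
		return 0;
	}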

> 
> >
> > That said except for the few comments I had this LGTM anyway. Thanks for your
> > effort!
> 
> Thanks
> 
> I'm going to prepare  v2

Thanks!


--
Qais Yousef
  
Qais Yousef Jan. 14, 2023, 9:18 p.m. UTC | #7
On 01/13/23 15:28, Vincent Guittot wrote:
> Hi Kajetan,
> 
> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
> >
> > Hi,
> >
> > > By taking into account uclamp_min, the 1:1 relation between task misfit
> > > and cpu overutilized is no longer true: a task with a small util_avg may
> > > not fit a high capacity cpu because of its uclamp_min constraint.
> > >
> > > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > > a CPU except for the uclamp_min hint which is a performance requirement.
> > >
> > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > > can use this new value to take additional action to select the best CPU
> > > that doesn't match uclamp_min hint.
> >
> > I just wanted to flag some issues I noticed with this patch and the
> > entire topic.
> >
> > I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with
> 
> Do you have more details to share on your setup ?
> Android kernel has some hack on top of the mainline. Do you use some ?
> Then, the perf and the power can be largely impacted by the cgroup
> configuration. Have you got details on your setup ?
> 
> I'm going to try to reproduce the behavior
> 
> > all the relevant uclamp and CFS scheduling patches backported to it from
> > mainline. From what I can see, the 'uclamp fits capacity' patchset
> > introduced some alarming power usage & performance issues that this
> > patch makes even worse.
> >
> > The patch stack for the following tables is as follows:
> >
> > (ufc_patched) sched/fair: unlink misfit task from cpu overutilized
> 
> I just sent a v3 which fixes a condition. Wonder if this could have an
> impact on the results both perf and power
> 
> > sched/uclamp: Fix a uninitialized variable warnings
> > (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
> > sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
> > sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
> > sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
> > sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
> > sched/uclamp: Fix fits_capacity() check in feec()
> > sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
> > sched/uclamp: Fix relationship between uclamp and migration margin
> > (previous 'baseline' was here)
> >
> > I omitted the 3 patches relating directly to capacity_inversion but in

This could lead to confusion. Was there a specific reason for this omission?
Did you hit some problem?

> > the other tests I did with those there were similar issues. It's
> > probably easier to consider the uclamp parts and their effects in
> > isolation.
> >
> > 1. Geekbench 5 (performance regression)
> >
> > +-----------------+----------------------------+--------+-----------+
> > |     metric      |           kernel           | value  | perc_diff |
> > +-----------------+----------------------------+--------+-----------+
> > | multicore_score |          baseline          | 2765.4 |   0.0%    |
> > | multicore_score |        baseline_ufc        | 2704.3 |  -2.21%   | <-- a noticeable score decrease already
> > | multicore_score |        ufc_patched         | 2443.2 |  -11.65%  | <-- a massive score decrease
> > +-----------------+----------------------------+--------+-----------+
> >
> > +--------------+--------+----------------------------+--------+-----------+
> > |  chan_name   | metric |           kernel           | value  | perc_diff |
> > +--------------+--------+----------------------------+--------+-----------+
> > | total_power  | gmean  |          baseline          | 2664.0 |   0.0%    |
> > | total_power  | gmean  |        baseline_ufc        | 2621.5 |   -1.6%   | <-- worse performance per watt
> > | total_power  | gmean  |        ufc_patched         | 2601.2 |  -2.36%   | <-- much worse performance per watt
> > +--------------+--------+----------------------------+--------+-----------+

Hmm I think I've seen a better score but at the cost of more energy in my
testing in the past.

> >
> > The most likely cause for the regression seen above is the decrease in the amount of
> > time spent while overutilized with these patches. Maximising
> > overutilization for GB5 is the desired outcome as the benchmark for
> > almost its entire duration keeps either 1 core or all the cores
> > completely saturated so EAS cannot be effective. These patches have the
> > opposite from the desired effect in this area.
> >
> > +----------------------------+--------------------+--------------------+------------+
> > |          kernel            |        time        |     total_time     | percentage |
> > +----------------------------+--------------------+--------------------+------------+
> > |          baseline          |      121.979       |      181.065       |   67.46    |
> > |        baseline_ufc        |      120.355       |      184.255       |   65.32    |
> > |        ufc_patched         |       60.715       |      196.135       |   30.98    | <-- !!!
> > +----------------------------+--------------------+--------------------+------------+
> 
> I'm not surprised because some use cases which were not overutilized
> were wrongly triggered as overutilized so switching back to
> performance mode. You might have to tune the uclamp value

I remember there were tasks with uclamp_min=512 or something like that in the
system. I wonder if they're making the difference here - they will steal time
from bigger cores and increase energy too.
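
As a rough illustration of what I mean (made-up numbers, nothing measured on
the Pixel 6, and only mirroring the shape of the fits checks rather than the
real helpers):

static bool boosted_task_fits_little(void)
{
	/* hypothetical task: tiny util_avg but boosted via uclamp_min */
	unsigned long util = 80, uclamp_min = 512;
	/* hypothetical capacity_orig of a little CPU */
	unsigned long little_cap = 160;

	/*
	 * The raw utilization check passes, but the uclamp_min performance
	 * requirement doesn't, so placement keeps steering such tasks
	 * towards mid/big CPUs even though util_avg alone would fit here.
	 */
	return (util <= little_cap) && (uclamp_min <= little_cap); /* false */
}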

The big jump with Vincent's patches is strange though. How many iterations do you
run? How long do you wait between each iteration?

The original behavior of overutilized in regards to util_avg shouldn't have
changed. It's worth digging a bit more into it.

I looked at my previous results and I was seeing ~57% on android12-5.10 kernel
for both with and without the patch.

> 
> >
> > 2. Jankbench (power usage regression)
> >
> > +--------+---------------+---------------------------------+-------+-----------+
> > | metric |   variable    |             kernel              | value | perc_diff |
> > +--------+---------------+---------------------------------+-------+-----------+
> > | gmean  | mean_duration |          baseline_60hz          | 14.6  |   0.0%    |
> > | gmean  | mean_duration |        baseline_ufc_60hz        | 15.2  |   3.83%   |
> > | gmean  | mean_duration |        ufc_patched_60hz         | 14.0  |  -4.12%   |
> > +--------+---------------+---------------------------------+-------+-----------+
> >
> > +--------+-----------+---------------------------------+-------+-----------+
> > | metric | variable  |             kernel              | value | perc_diff |
> > +--------+-----------+---------------------------------+-------+-----------+
> > | gmean  | jank_perc |          baseline_60hz          |  1.9  |   0.0%    |
> > | gmean  | jank_perc |        baseline_ufc_60hz        |  2.2  |  15.39%   |
> > | gmean  | jank_perc |        ufc_patched_60hz         |  2.0  |   3.61%   |
> > +--------+-----------+---------------------------------+-------+-----------+

How many iterations did you run? Do you think they could be within the noise
region?

> >
> > +--------------+--------+---------------------------------+-------+-----------+
> > |  chan_name   | metric |             kernel              | value | perc_diff |
> > +--------------+--------+---------------------------------+-------+-----------+
> > | total_power  | gmean  |          baseline_60hz          | 135.9 |   0.0%    |
> > | total_power  | gmean  |        baseline_ufc_60hz        | 155.7 |  14.61%   | <-- !!!
> > | total_power  | gmean  |        ufc_patched_60hz         | 157.1 |  15.63%   | <-- !!!
> > +--------------+--------+---------------------------------+-------+-----------+
> >
> > With these patches while running Jankbench we use up ~15% more power
> > just to achieve roughly the same results. Here I'm not sure where this
> > issue is coming from exactly but all the results above are very consistent
> > across different runs.

Have you tried to look at uclamp_min/max values of the tasks/cpus?

Do you know which cluster ended up using more energy? Have you looked at freq
residency between the runs?

I think the system could dynamically boost some UI tasks. I'd expect their
residency to increase on the medium/big cores compared to before the patch.
Which could explain what you see.

Did you check your schedutil/rate_limit_us is not using the default 10ms? It
should be 500us.

I had another patch to set transition latency of the cpufreq driver to 500us
instead of 5ms - but I doubt this actually was making any difference.
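
Something along these lines is what I mean - a sketch only, not the actual
patch, and it assumes the driver fills in cpuinfo.transition_latency itself:

/* Sketch only -- not the actual patch that was tested. */
static int example_cpufreq_driver_init(struct cpufreq_policy *policy)
{
	/* ... usual driver setup ... */

	/*
	 * Advertise a 500us transition latency instead of 5ms. Whether this
	 * changes anything depends on how the governor derives its rate
	 * limit from it, hence my doubt that it made a difference.
	 */
	policy->cpuinfo.transition_latency = 500 * NSEC_PER_USEC;

	return 0;
}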

FWIW, I did compare my series against vanilla Pixel 6 kernel where I used
geekbench, speedometer, pcmark all with and without heavy background activities
to measure the impact of uclamp_max and nothing stood out then.

I sadly lost my original setup now. I doubt I'll be able to recreate it to
re-run these tests again anytime soon :/

Could you try removing all thermal handling from util_fits_cpu() so that my
series behaves like v1 again and see if this makes a difference? It'd highlight
whether subtle issues with thermal pressure handling are the culprit. The most
obvious suspect is the use of instantaneous thermal pressure in ufc().
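
To make the distinction concrete, a minimal sketch of the two options, keeping
the mainline helper names (the second one is what ufc() does today):

/* Sketch of the two ways capacity_orig_thermal could be derived. */
static unsigned long capacity_orig_thermal_sketch(int cpu, bool use_avg)
{
	unsigned long cap = capacity_orig_of(cpu);

	if (use_avg)
		/* PELT-decayed thermal pressure: smoother, but lags reality */
		return cap - thermal_load_avg(cpu_rq(cpu));

	/* instantaneous thermal pressure: what util_fits_cpu() uses today */
	return cap - arch_scale_thermal_pressure(cpu);
}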


Thanks!

--
Qais Yousef
  
Dietmar Eggemann Jan. 16, 2023, 9:21 a.m. UTC | #8
On 14/01/2023 22:18, Qais Yousef wrote:
> On 01/13/23 15:28, Vincent Guittot wrote:
>> Hi Kajetan,
>>
>> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski
>> <kajetan.puchalski@arm.com> wrote:

[...]

>>> sched/uclamp: Fix a uninitialized variable warnings
>>> (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
>>> sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
>>> sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
>>> sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
>>> sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
>>> sched/uclamp: Fix fits_capacity() check in feec()
>>> sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
>>> sched/uclamp: Fix relationship between uclamp and migration margin
>>> (previous 'baseline' was here)
>>>
>>> I omitted the 3 patches relating directly to capacity_inversion but in
> 
> This could lead to confusion. Was there a specific reason for this omission?
> Did you hit some problem?

I thought Vincent's 'split MF from CPU OU' should replace 'CapInv'?

https://lkml.kernel.org/r/20221209164739.GA24368@vingu-book

... I have reverted patches:
Revert "sched/fair: Detect capacity inversion"
Revert "sched/fair: Consider capacity inversion in util_fits_cpu()"
Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()" ...

You also mentioned this in
https://lkml.kernel.org/r/<20230115001906.v7uq4ddodrbvye7d@airbuntu

[...]
  
Vincent Guittot Jan. 16, 2023, 11:25 a.m. UTC | #9
On Mon, 16 Jan 2023 at 10:21, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 14/01/2023 22:18, Qais Yousef wrote:
> > On 01/13/23 15:28, Vincent Guittot wrote:
> >> Hi Kajetan,
> >>
> >> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski
> >> <kajetan.puchalski@arm.com> wrote:
>
> [...]
>
> >>> sched/uclamp: Fix a uninitialized variable warnings
> >>> (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
> >>> sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
> >>> sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
> >>> sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
> >>> sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
> >>> sched/uclamp: Fix fits_capacity() check in feec()
> >>> sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
> >>> sched/uclamp: Fix relationship between uclamp and migration margin
> >>> (previous 'baseline' was here)
> >>>
> >>> I omitted the 3 patches relating directly to capacity_inversion but in
> >
> > This could lead to confusion. Was there a specific reason for this omission?
> > Did you hit some problem?
>
> I thought Vincent's 'split MF from CPU OU' should replace 'CapInv'?

The current patch v3 applies on top of tip/sched/core, which includes
'CapInv'. The end goal is that CapInv becomes useless and can be
removed.

>
> https://lkml.kernel.org/r/20221209164739.GA24368@vingu-book
>
> ... I have reverted patches:
> Revert "sched/fair: Detect capacity inversion"
> Revert "sched/fair: Consider capacity inversion in util_fits_cpu()"
> Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()" ...
>
> You also mentioned this in
> https://lkml.kernel.org/r/<20230115001906.v7uq4ddodrbvye7d@airbuntu
>
> [...]
  
Qais Yousef Jan. 17, 2023, 4:40 p.m. UTC | #10
On 01/16/23 09:21, Dietmar Eggemann wrote:
> On 14/01/2023 22:18, Qais Yousef wrote:
> > On 01/13/23 15:28, Vincent Guittot wrote:
> >> Hi Kajetan,
> >>
> >> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski
> >> <kajetan.puchalski@arm.com> wrote:
> 
> [...]
> 
> >>> sched/uclamp: Fix a uninitialized variable warnings
> >>> (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
> >>> sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
> >>> sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
> >>> sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
> >>> sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
> >>> sched/uclamp: Fix fits_capacity() check in feec()
> >>> sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
> >>> sched/uclamp: Fix relationship between uclamp and migration margin
> >>> (previous 'baseline' was here)
> >>>
> >>> I omitted the 3 patches relating directly to capacity_inversion but in
> > 
> > This could lead to confusion. Was there a specific reason for this omission?
> > Did you hit some problem?
> 
> I thought Vincent's 'split MF from CPU OU' should replace 'CapInv'?

Ah I thought my version was tested with these removed. Got ya now.


Cheers

--
Qais Yousef

> 
> https://lkml.kernel.org/r/20221209164739.GA24368@vingu-book
> 
> ... I have reverted patches:
> Revert "sched/fair: Detect capacity inversion"
> Revert "sched/fair: Consider capacity inversion in util_fits_cpu()"
> Revert "sched/uclamp: Make cpu_overutilized() use util_fits_cpu()" ...
> 
> You also mentioned this in
> https://lkml.kernel.org/r/<20230115001906.v7uq4ddodrbvye7d@airbuntu
> 
> [...]
  
Kajetan Puchalski Jan. 20, 2023, 4:08 p.m. UTC | #11
On Sat, Jan 14, 2023 at 09:18:54PM +0000, Qais Yousef wrote:

[...]

> I remember there were tasks with uclamp_min=512 or something like that in the
> system. I wonder if they're making the difference here - they will steal time
> from bigger cores and increase energy too.
> 
> The big jump with Vincent's patches is strange though. How many iterations do you
> run? How long do you wait between each iteration?

It's 3 iterations for GB5 and 10 for Jankbench, all back to back after
keeping the phone in a fridge for an hour. This methodology has proven
to give pretty consistent results so far.

> The original behavior of overutilized in regards to util_avg shouldn't have
> changed. It's worth digging a bit more into it.
> 
> I looked at my previous results and I was seeing ~57% on android12-5.10 kernel
> for both with and without the patch.

As you said in the v3 thread, this is caused by the "no OU on big cores"
issue. I'll reply in the v4 thread once I have a more complete set of
numbers, but it's safe to say removing that change fixed the drop in score.

+-----------------+-------------------------+--------+-----------+
|     metric      |         kernel          | value  | perc_diff |
+-----------------+-------------------------+--------+-----------+
| multicore_score |        baseline         | 2765.4 |   0.0%    |
| multicore_score |      baseline_ufc       | 2704.3 |  -2.21%   |
| multicore_score |     ufc_patched_v4      | 2839.8 |   2.69%   |
+-----------------+-------------------------+--------+-----------+
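
For anyone following along, my reading of the "no OU on big cores" issue is the
uclamp_max_fits hunk quoted in the patch at the bottom of this page; annotated
sketch below (my interpretation of the v3 discussion, not a verified analysis):

	/*
	 * On CPUs whose capacity_orig == SCHED_CAPACITY_SCALE the condition
	 * below is always true, so fits can never drop to 0 there and the
	 * big cores stop contributing to the overutilized state no matter
	 * how saturated util_avg gets.
	 */
	uclamp_max_fits = (uclamp_max <= capacity_orig) ||
			  (capacity_orig == SCHED_CAPACITY_SCALE);
	fits = fits || uclamp_max_fits;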

> > 
> > >
> > > 2. Jankbench (power usage regression)
> > >
> > > +--------+---------------+---------------------------------+-------+-----------+
> > > | metric |   variable    |             kernel              | value | perc_diff |
> > > +--------+---------------+---------------------------------+-------+-----------+
> > > | gmean  | mean_duration |          baseline_60hz          | 14.6  |   0.0%    |
> > > | gmean  | mean_duration |        baseline_ufc_60hz        | 15.2  |   3.83%   |
> > > | gmean  | mean_duration |        ufc_patched_60hz         | 14.0  |  -4.12%   |
> > > +--------+---------------+---------------------------------+-------+-----------+
> > >
> > > +--------+-----------+---------------------------------+-------+-----------+
> > > | metric | variable  |             kernel              | value | perc_diff |
> > > +--------+-----------+---------------------------------+-------+-----------+
> > > | gmean  | jank_perc |          baseline_60hz          |  1.9  |   0.0%    |
> > > | gmean  | jank_perc |        baseline_ufc_60hz        |  2.2  |  15.39%   |
> > > | gmean  | jank_perc |        ufc_patched_60hz         |  2.0  |   3.61%   |
> > > +--------+-----------+---------------------------------+-------+-----------+
> 
> How many iterations did you run? Do you think they could be within the noise
> region?

As mentioned, 10 iterations on a cooled down phone. The frame
durations/jank above could be vaguely within noise levels, yes. I'd read
it as "no major change". This is mainly included for contrast with the
power usage spike.

> > >
> > > +--------------+--------+---------------------------------+-------+-----------+
> > > |  chan_name   | metric |             kernel              | value | perc_diff |
> > > +--------------+--------+---------------------------------+-------+-----------+
> > > | total_power  | gmean  |          baseline_60hz          | 135.9 |   0.0%    |
> > > | total_power  | gmean  |        baseline_ufc_60hz        | 155.7 |  14.61%   | <-- !!!
> > > | total_power  | gmean  |        ufc_patched_60hz         | 157.1 |  15.63%   | <-- !!!
> > > +--------------+--------+---------------------------------+-------+-----------+
> > >
> > > With these patches while running Jankbench we use up ~15% more power
> > > just to achieve roughly the same results. Here I'm not sure where this
> > > issue is coming from exactly but all the results above are very consistent
> > > across different runs.
> 
> Have you tried to look at uclamp_min/max values of the tasks/cpus?
> 
> Do you know which cluster ended up using more energy? Have you looked at freq
> residency between the runs?

+--------------+--------+------------------------------+-------+-----------+
|  chan_name   | metric |            kernel            | value | perc_diff |
+--------------+--------+------------------------------+-------+-----------+
| little_power | gmean  |        baseline_60hz         | 69.8  |   0.0%    |
| little_power | gmean  |      baseline_ufc_60hz       | 74.2  |   6.29%   |
|  mid_power   | gmean  |        baseline_60hz         | 20.1  |   0.0%    |
|  mid_power   | gmean  |      baseline_ufc_60hz       | 22.6  |  12.73%   |
|  big_power   | gmean  |        baseline_60hz         | 45.9  |   0.0%    |
|  big_power   | gmean  |      baseline_ufc_60hz       | 58.7  |  27.68%   |
| total_power  | gmean  |        baseline_60hz         | 135.9 |   0.0%    |
| total_power  | gmean  |      baseline_ufc_60hz       | 155.7 |  14.61%   |
+--------------+--------+------------------------------+-------+-----------+

All 3 clusters end up using more power; both numerically and proportionally the
biggest increase is on the big cores. If we're looking for anything specific I
can dig into freq residency, it's just that 3 clusters * 10 runs * 40
different OPPs is quite a lot of data when it comes to trying to find a clear
trend.

> I think the system could dynamically boost some UI tasks. I'd expect their
> residency to increase on the medium/big cores compared to before the patch.
> Which could explain what you see.
> 
> Did you check your schedutil/rate_limit_us is not using the default 10ms? It
> should be 500us.

I do most tests with the mainline default of 10ms. I did a set of runs
with 500us just to compare what impact it has, and I got worse Jankbench
results (mean duration 15.2 -> 16.9 and jank% 2.2 -> 3.7) with roughly
the same power usage (155.7 -> 156.4).
It did help with the GB5 score decrease, though the score was still below
the original baseline.

+-----------------+-------------------------+--------+-----------+
|     metric      |         kernel          | value  | perc_diff |
+-----------------+-------------------------+--------+-----------+
| multicore_score |        baseline         | 2765.4 |   0.0%    |
| multicore_score |      baseline_ufc       | 2704.3 |  -2.21%   |
| multicore_score |   baseline_ufc_su500    | 2759.0 |  -0.23%   |
+-----------------+-------------------------+--------+-----------+

> I had another patch to set transition latency of the cpufreq driver to 500us
> instead of 5ms - but I doubt this actually was making any difference.
> 
> FWIW, I did compare my series against vanilla Pixel 6 kernel where I used
> geekbench, speedometer, pcmark all with and without heavy background activities
> to measure the impact of uclamp_max and nothing stood out then.

Maybe comparing it against device kernels could be causing some issues?
Wouldn't they be pretty far from mainline to begin with and then get
even further with vendor hooks etc? There could be plenty of side
effects and interactions that come out differently.

> I sadly lost my original setup now. I doubt I'll be able to recreate it to
> re-run these tests again anytime soon :/
> 
> Could you try removing all thermal handling from util_fits_cpu() so that my
> series behaves like v1 again and see if this makes a difference? It'd highlight
> whether subtle issues with thermal pressure handling are the culprit. The most
> obvious suspect is the use of instantaneous thermal pressure in ufc().

This is where the interesting stuff comes in: it looks like the thermal
handling is indeed causing problems, yes.

+-----------------+-------------------------+--------+-----------+
|     metric      |         kernel          | value  | perc_diff |
+-----------------+-------------------------+--------+-----------+
| multicore_score |        baseline         | 2765.4 |   0.0%    |
| multicore_score |      baseline_ufc       | 2704.3 |  -2.21%   |
| multicore_score | baseline_ufc_no_thermal | 2870.8 |   3.81%   | <-- 170 pt difference
+-----------------+-------------------------+--------+-----------+

+--------------+--------+-------------------------+--------+-----------+
|  chan_name   | metric |         kernel          | value  | perc_diff |
+--------------+--------+-------------------------+--------+-----------+
| total_power  | gmean  |        baseline         | 2664.0 |   0.0%    |
| total_power  | gmean  |      baseline_ufc       | 2621.5 |   -1.6%   |
| total_power  | gmean  | baseline_ufc_no_thermal | 2710.0 |   1.73%   | <-- better PPW
+--------------+--------+-------------------------+--------+-----------+

The Jankbench frame numbers are just below the original baseline but
very much within the noise levels so I'd say "no change" for brevity.

+--------------+--------+------------------------------+-------+-----------+
|  chan_name   | metric |            kernel            | value | perc_diff |
+--------------+--------+------------------------------+-------+-----------+
| total_power  | gmean  |        baseline_60hz         | 135.9 |   0.0%    |
| total_power  | gmean  |      baseline_ufc_60hz       | 155.7 |  14.61%   |
| total_power  | gmean  | baseline_ufc_no_thermal_60hz | 134.5 |  -1.01%   | <-- power saving!
+--------------+--------+------------------------------+-------+-----------+

This is the diff I got the previous results from:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5f7c48950f8..2b846a7e2ed3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4262,7 +4262,7 @@ static inline int util_fits_cpu(unsigned long util,
                                unsigned long uclamp_max,
                                int cpu)
 {
-       unsigned long capacity_orig, capacity_orig_thermal;
+       unsigned long capacity_orig;
        unsigned long capacity = capacity_of(cpu);
        bool fits, uclamp_max_fits;

@@ -4299,7 +4299,6 @@ static inline int util_fits_cpu(unsigned long util,
         * the time.
         */
        capacity_orig = capacity_orig_of(cpu);
-       capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);

        /*
         * We want to force a task to fit a cpu as implied by uclamp_max.
@@ -4375,7 +4374,7 @@ static inline int util_fits_cpu(unsigned long util,
         */
        uclamp_min = min(uclamp_min, uclamp_max);
        if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
-               fits = fits && (uclamp_min <= capacity_orig_thermal);
+               fits = fits && (uclamp_min <= capacity_orig);

        return fits;
 }

All in all, taking the thermal handling out gives us better scores, better
PPW and, on the whole, better power usage in Jankbench, so there's clearly
some issue with it.

Since you wrote in the v3 thread that we can't use thermal_load_avg(),
maybe we should just take this part out completely for the time being
until there's a better solution? I could send a patch with the above
diff.

---
Kajetan
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1649e7d71d24..57077f0a897e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4582,8 +4582,7 @@  static inline int util_fits_cpu(unsigned long util,
 	 *     2. The system is being saturated when we're operating near
 	 *        max capacity, it doesn't make sense to block overutilized.
 	 */
-	uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
-	uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
+	uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
 	fits = fits || uclamp_max_fits;
 
 	/*
@@ -4618,8 +4617,8 @@  static inline int util_fits_cpu(unsigned long util,
 	 * handle the case uclamp_min > uclamp_max.
 	 */
 	uclamp_min = min(uclamp_min, uclamp_max);
-	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
-		fits = fits && (uclamp_min <= capacity_orig_thermal);
+	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
+		return -1;
 
 	return fits;
 }
@@ -4629,7 +4628,7 @@  static inline int task_fits_cpu(struct task_struct *p, int cpu)
 	unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
 	unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
 	unsigned long util = task_util_est(p);
-	return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
+	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
 }
 
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6864,6 +6863,7 @@  static int
 select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	unsigned long task_util, util_min, util_max, best_cap = 0;
+	int fits, best_fits = 0;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
@@ -6879,12 +6879,28 @@  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 
 		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
 			continue;
-		if (util_fits_cpu(task_util, util_min, util_max, cpu))
+
+		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+
+		/* This CPU fits with all capacity and performance requirements */
+		if (fits > 0)
 			return cpu;
+		/*
+		 * Only the min performance (i.e. uclamp_min) doesn't fit. Look
+		 * for the CPU with highest performance capacity.
+		 */
+		else if (fits < 0)
+			cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
 
-		if (cpu_cap > best_cap) {
+		/*
+		 * First, select cpu which fits better (-1 being better than 0).
+		 * Then, select the one with largest capacity at same level.
+		 */
+		if ((fits < best_fits) ||
+		    ((fits == best_fits) && (cpu_cap > best_cap))) {
 			best_cap = cpu_cap;
 			best_cpu = cpu;
+			best_fits = fits;
 		}
 	}
 
@@ -6897,7 +6913,7 @@  static inline bool asym_fits_cpu(unsigned long util,
 				 int cpu)
 {
 	if (sched_asym_cpucap_active())
-		return util_fits_cpu(util, util_min, util_max, cpu);
+		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
 
 	return true;
 }
@@ -7264,6 +7280,9 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
 	struct root_domain *rd = this_rq()->rd;
 	int cpu, best_energy_cpu, target = -1;
+	int prev_fits = -1, best_fits = -1;
+	unsigned long best_thermal_cap = 0;
+	unsigned long prev_thermal_cap = 0;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
 	struct energy_env eenv;
@@ -7299,6 +7318,7 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		unsigned long prev_spare_cap = 0;
 		int max_spare_cap_cpu = -1;
 		unsigned long base_energy;
+		int fits, max_fits = -1;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -7351,7 +7371,9 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 					util_max = max(rq_util_max, p_util_max);
 				}
 			}
-			if (!util_fits_cpu(util, util_min, util_max, cpu))
+
+			fits = util_fits_cpu(util, util_min, util_max, cpu);
+			if (!fits)
 				continue;
 
 			lsub_positive(&cpu_cap, util);
@@ -7359,7 +7381,9 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (cpu == prev_cpu) {
 				/* Always use prev_cpu as a candidate. */
 				prev_spare_cap = cpu_cap;
-			} else if (cpu_cap > max_spare_cap) {
+				prev_fits = fits;
+			} else if ((fits > max_fits) ||
+				   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
 				/*
 				 * Find the CPU with the maximum spare capacity
 				 * among the remaining CPUs in the performance
@@ -7367,6 +7391,7 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 				 */
 				max_spare_cap = cpu_cap;
 				max_spare_cap_cpu = cpu;
+				max_fits = fits;
 			}
 		}
 
@@ -7385,26 +7410,50 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (prev_delta < base_energy)
 				goto unlock;
 			prev_delta -= base_energy;
+			prev_thermal_cap = cpu_thermal_cap;
 			best_delta = min(best_delta, prev_delta);
 		}
 
 		/* Evaluate the energy impact of using max_spare_cap_cpu. */
 		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
+			/* Current best energy cpu fits better */
+			if (max_fits < best_fits)
+				continue;
+
+			/*
+			 * Both don't fit performance (i.e. uclamp_min) but
+			 * best energy cpu has better performance.
+			 */
+			if ((max_fits < 0) &&
+			    (cpu_thermal_cap <= best_thermal_cap))
+				continue;
+
 			cur_delta = compute_energy(&eenv, pd, cpus, p,
 						   max_spare_cap_cpu);
 			/* CPU utilization has changed */
 			if (cur_delta < base_energy)
 				goto unlock;
 			cur_delta -= base_energy;
-			if (cur_delta < best_delta) {
-				best_delta = cur_delta;
-				best_energy_cpu = max_spare_cap_cpu;
-			}
+
+			/*
+			 * Both fit for the task but best energy cpu has lower
+			 * energy impact.
+			 */
+			if ((max_fits > 0) &&
+			    (cur_delta >= best_delta))
+				continue;
+
+			best_delta = cur_delta;
+			best_energy_cpu = max_spare_cap_cpu;
+			best_fits = max_fits;
+			best_thermal_cap = cpu_thermal_cap;
 		}
 	}
 	rcu_read_unlock();
 
-	if (best_delta < prev_delta)
+	if ((best_fits > prev_fits) ||
+	    ((best_fits > 0) && (best_delta < prev_delta)) ||
+	    ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
 		target = best_energy_cpu;
 
 	return target;
@@ -10228,24 +10277,23 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);
 
-	if (sched_energy_enabled()) {
-		struct root_domain *rd = env->dst_rq->rd;
-
-		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
-			goto out_balanced;
-	}
-
-	local = &sds.local_stat;
-	busiest = &sds.busiest_stat;
-
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest)
 		goto out_balanced;
 
+	busiest = &sds.busiest_stat;
+
 	/* Misfit tasks should be dealt with regardless of the avg load */
 	if (busiest->group_type == group_misfit_task)
 		goto force_balance;
 
+	if (sched_energy_enabled()) {
+		struct root_domain *rd = env->dst_rq->rd;
+
+		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+			goto out_balanced;
+	}
+
 	/* ASYM feature bypasses nice load balance check */
 	if (busiest->group_type == group_asym_packing)
 		goto force_balance;
@@ -10258,6 +10306,7 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (busiest->group_type == group_imbalanced)
 		goto force_balance;
 
+	local = &sds.local_stat;
 	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.