[RFC,1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature

Message ID 20231018204511.1563390-2-mathieu.desnoyers@efficios.com
State New
Series sched/fair migration reduction features

Commit Message

Mathieu Desnoyers Oct. 18, 2023, 8:45 p.m. UTC
  Introduce the UTIL_FITS_CAPACITY scheduler feature. Runqueue selection
picks the previous, target, or recent runqueue if it has enough remaining
capacity to enqueue the task, before scanning for an idle cpu.

This feature is introduced in preparation for the SELECT_BIAS_PREV
scheduler feature. Its performance benefits are noticeable when combined
with the SELECT_BIAS_PREV feature.

The following benchmarks only cover the UTIL_FITS_CAPACITY feature.
Those are performed on a v6.5.5 kernel with mitigations=off.

The following hackbench workload on a 2-socket AMD EPYC 9654 96-Core
Processor (192 cores total) keeps roughly the same wall time (49s).

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

We can observe that the number of migrations is reduced significantly
with this patch (improvement):

Baseline:      117M cpu-migrations  (9.355 K/sec)
With patch:     67M cpu-migrations  (5.470 K/sec)

The task-clock utilization is reduced (degradation):

Baseline:      253.275 CPUs utilized
With patch:    223.130 CPUs utilized

The number of context-switches is increased (degradation):

Baseline:      445M context-switches (35.516 K/sec)
With patch:    581M context-switches (47.548 K/sec)

So the improvement due to reduction of migrations is countered by the
degradation in CPU utilization and context-switches. The following
SELECT_BIAS_PREV feature will address this.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Gautham R . Shenoy <gautham.shenoy@amd.com>
Cc: x86@kernel.org
---
 kernel/sched/fair.c     | 49 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched/features.h |  6 +++++
 kernel/sched/sched.h    |  5 +++++
 3 files changed, 54 insertions(+), 6 deletions(-)
  

Comments

Chen Yu Oct. 19, 2023, 11:35 a.m. UTC | #1
On 2023-10-18 at 16:45:10 -0400, Mathieu Desnoyers wrote:
> Introduce the UTIL_FITS_CAPACITY scheduler feature. Runqueue selection
> picks the previous, target, or recent runqueue if it has enough remaining
> capacity to enqueue the task, before scanning for an idle cpu.
> 
> This feature is introduced in preparation for the SELECT_BIAS_PREV
> scheduler feature. Its performance benefits are noticeable when combined
> with the SELECT_BIAS_PREV feature.
> 
> The following benchmarks only cover the UTIL_FITS_CAPACITY feature.
> Those are performed on a v6.5.5 kernel with mitigations=off.
> 
> The following hackbench workload on a 2-socket AMD EPYC 9654 96-Core
> Processor (192 cores total) keeps roughly the same wall time (49s).
> 
> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
> 
> We can observe that the number of migrations is reduced significantly
> with this patch (improvement):
> 
> Baseline:      117M cpu-migrations  (9.355 K/sec)
> With patch:     67M cpu-migrations  (5.470 K/sec)
> 
> The task-clock utilization is reduced (degradation):
> 
> Baseline:      253.275 CPUs utilized
> With patch:    223.130 CPUs utilized
> 
> The number of context-switches is increased (degradation):
> 
> Baseline:      445M context-switches (35.516 K/sec)
> With patch:    581M context-switches (47.548 K/sec)
> 
> So the improvement due to reduction of migrations is countered by the
> degradation in CPU utilization and context-switches. The following
> SELECT_BIAS_PREV feature will address this.
> 
> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
> Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
> Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
> Cc: Aaron Lu <aaron.lu@intel.com>
> Cc: Chen Yu <yu.c.chen@intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Gautham R . Shenoy <gautham.shenoy@amd.com>
> Cc: x86@kernel.org
> ---
>  kernel/sched/fair.c     | 49 ++++++++++++++++++++++++++++++++++++-----
>  kernel/sched/features.h |  6 +++++
>  kernel/sched/sched.h    |  5 +++++
>  3 files changed, 54 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d9c2482c5a3..8058058afb11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4497,6 +4497,37 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>  	trace_sched_util_est_se_tp(&p->se);
>  }
>  
> +/*
> + * Returns true if adding the task utilization to the estimated
> + * utilization of the runnable tasks on @cpu does not exceed the
> + * capacity of @cpu.
> + *
> + * This considers only the utilization of _runnable_ tasks on the @cpu
> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
> + * using the runqueue util_est.enqueued, and by estimating the capacity
> + * of @cpu based on arch_scale_cpu_capacity and arch_scale_thermal_pressure
> + * rather than capacity_of() because capacity_of() considers
> + * blocked/sleeping tasks in other scheduler classes.
> + *
> + * The utilization vs capacity comparison is done without the margin
> + * provided by fits_capacity(), because fits_capacity() is used to
> + * validate whether the utilization of a task fits within the overall
> + * capacity of a cpu, whereas this function validates whether the task
> + * utilization fits within the _remaining_ capacity of the cpu, which is
> + * more precise.
> + */
> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
> +						    int cpu)
> +{
> +	unsigned long total_util, capacity;
> +
> +	if (!sched_util_fits_capacity_active())
> +		return false;
> +	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
> +	capacity = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);

scale_rt_capacity(cpu) could provide the remaining cpu capacity after subtracting
the side activity (rt tasks/thermal pressure/irq time); maybe it would be more accurate?
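
For reference, scale_rt_capacity() in v6.5 looks roughly like this (paraphrased
from kernel/sched/fair.c; only the overall structure matters here):

static unsigned long scale_rt_capacity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long max = arch_scale_cpu_capacity(cpu);
	unsigned long used, free, irq;

	irq = cpu_util_irq(rq);
	if (unlikely(irq >= max))
		return 1;

	/* Subtract rt, dl and thermal pressure from the cpu capacity. */
	used = READ_ONCE(rq->avg_rt.util_avg);
	used += READ_ONCE(rq->avg_dl.util_avg);
	used += thermal_load_avg(rq);
	if (unlikely(used >= max))
		return 1;

	free = max - used;

	/* Scale the remainder down by the irq time. */
	return scale_irq_capacity(free, irq, max);
}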

> +	return total_util <= capacity;
> +}
> +
>  static inline int util_fits_cpu(unsigned long util,
>  				unsigned long uclamp_min,
>  				unsigned long uclamp_max,
> @@ -7124,12 +7155,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	int i, recent_used_cpu;
>  
>  	/*
> -	 * On asymmetric system, update task utilization because we will check
> -	 * that the task fits with cpu's capacity.
> +	 * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
> +	 * update task utilization because we will check that the task
> +	 * fits with cpu's capacity.
>  	 */
> -	if (sched_asym_cpucap_active()) {
> +	if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
>  		sync_entity_load_avg(&p->se);
>  		task_util = task_util_est(p);
> +	}
> +	if (sched_asym_cpucap_active()) {
>  		util_min = uclamp_eff_value(p, UCLAMP_MIN);
>  		util_max = uclamp_eff_value(p, UCLAMP_MAX);
>  	}
> @@ -7139,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	 */
>  	lockdep_assert_irqs_disabled();
>  
> -	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> +	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
> +	    task_fits_remaining_cpu_capacity(task_util, target)) &&

Compared to the previous version posted here[1], which chose the previous CPU when the cpu's
util_est was lower than 25% of the CPU capacity, the current version seems to be more aggressive:
it is possible that a short-running task is queued on a nearly 100% busy cpu while there
is still an idle cpu in the system.

https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/

thanks,
Chenyu
  
Mathieu Desnoyers Oct. 19, 2023, 1:28 p.m. UTC | #2
On 2023-10-19 07:35, Chen Yu wrote:
> On 2023-10-18 at 16:45:10 -0400, Mathieu Desnoyers wrote:
>> Introduce the UTIL_FITS_CAPACITY scheduler feature. Runqueue selection
>> picks the previous, target, or recent runqueue if it has enough remaining
>> capacity to enqueue the task, before scanning for an idle cpu.
>>
>> This feature is introduced in preparation for the SELECT_BIAS_PREV
>> scheduler feature. Its performance benefits are noticeable when combined
>> with the SELECT_BIAS_PREV feature.
>>
>> The following benchmarks only cover the UTIL_FITS_CAPACITY feature.
>> Those are performed on a v6.5.5 kernel with mitigations=off.
>>
>> The following hackbench workload on a 2-socket AMD EPYC 9654 96-Core
>> Processor (192 cores total) keeps roughly the same wall time (49s).
>>
>> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>>
>> We can observe that the number of migrations is reduced significantly
>> with this patch (improvement):
>>
>> Baseline:      117M cpu-migrations  (9.355 K/sec)
>> With patch:     67M cpu-migrations  (5.470 K/sec)
>>
>> The task-clock utilization is reduced (degradation):
>>
>> Baseline:      253.275 CPUs utilized
>> With patch:    223.130 CPUs utilized
>>
>> The number of context-switches is increased (degradation):
>>
>> Baseline:      445M context-switches (35.516 K/sec)
>> With patch:    581M context-switches (47.548 K/sec)
>>
>> So the improvement due to reduction of migrations is countered by the
>> degradation in CPU utilization and context-switches. The following
>> SELECT_BIAS_PREV feature will address this.
>>
>> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
>> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
>> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
>> Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
>> Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
>> Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
>> Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
>> Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
>> Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
>> Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
>> Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
>> Cc: Aaron Lu <aaron.lu@intel.com>
>> Cc: Chen Yu <yu.c.chen@intel.com>
>> Cc: Tim Chen <tim.c.chen@intel.com>
>> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
>> Cc: Gautham R . Shenoy <gautham.shenoy@amd.com>
>> Cc: x86@kernel.org
>> ---
>>   kernel/sched/fair.c     | 49 ++++++++++++++++++++++++++++++++++++-----
>>   kernel/sched/features.h |  6 +++++
>>   kernel/sched/sched.h    |  5 +++++
>>   3 files changed, 54 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1d9c2482c5a3..8058058afb11 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4497,6 +4497,37 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>   	trace_sched_util_est_se_tp(&p->se);
>>   }
>>   
>> +/*
>> + * Returns true if adding the task utilization to the estimated
>> + * utilization of the runnable tasks on @cpu does not exceed the
>> + * capacity of @cpu.
>> + *
>> + * This considers only the utilization of _runnable_ tasks on the @cpu
>> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
>> + * using the runqueue util_est.enqueued, and by estimating the capacity
>> + * of @cpu based on arch_scale_cpu_capacity and arch_scale_thermal_pressure
>> + * rather than capacity_of() because capacity_of() considers
>> + * blocked/sleeping tasks in other scheduler classes.
>> + *
>> + * The utilization vs capacity comparison is done without the margin
>> + * provided by fits_capacity(), because fits_capacity() is used to
>> + * validate whether the utilization of a task fits within the overall
>> + * capacity of a cpu, whereas this function validates whether the task
>> + * utilization fits within the _remaining_ capacity of the cpu, which is
>> + * more precise.
>> + */
>> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
>> +						    int cpu)
>> +{
>> +	unsigned long total_util, capacity;
>> +
>> +	if (!sched_util_fits_capacity_active())
>> +		return false;
>> +	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
>> +	capacity = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);
> 
> scale_rt_capacity(cpu) could provide the remaining cpu capacity after subtracting
> the side activity (rt tasks/thermal pressure/irq time); maybe it would be more accurate?

AFAIU, scale_rt_capacity(cpu) works similarly to capacity_of(cpu) and 
considers blocked and sleeping tasks in the rq->avg_rt.util_avg and 
rq->avg_dl.util_avg. I'm not sure about rq->avg_irq.util_avg and 
thermal_load_avg().

This goes against what is needed here: we need a utilization that only 
considers enqueued runnable tasks (excluding blocked and sleeping tasks). 
Or am I missing something ?

> 
>> +	return total_util <= capacity;
>> +}
>> +
>>   static inline int util_fits_cpu(unsigned long util,
>>   				unsigned long uclamp_min,
>>   				unsigned long uclamp_max,
>> @@ -7124,12 +7155,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>   	int i, recent_used_cpu;
>>   
>>   	/*
>> -	 * On asymmetric system, update task utilization because we will check
>> -	 * that the task fits with cpu's capacity.
>> +	 * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
>> +	 * update task utilization because we will check that the task
>> +	 * fits with cpu's capacity.
>>   	 */
>> -	if (sched_asym_cpucap_active()) {
>> +	if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
>>   		sync_entity_load_avg(&p->se);
>>   		task_util = task_util_est(p);
>> +	}
>> +	if (sched_asym_cpucap_active()) {
>>   		util_min = uclamp_eff_value(p, UCLAMP_MIN);
>>   		util_max = uclamp_eff_value(p, UCLAMP_MAX);
>>   	}
>> @@ -7139,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>   	 */
>>   	lockdep_assert_irqs_disabled();
>>   
>> -	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
>> +	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
>> +	    task_fits_remaining_cpu_capacity(task_util, target)) &&
> 
> Compared to the previous version posted here[1], which chose the previous CPU when the cpu's
> util_est was lower than 25% of the CPU capacity, the current version seems to be more aggressive:
> it is possible that a short-running task is queued on a nearly 100% busy cpu while there
> is still an idle cpu in the system.
> 
> https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/

This previous version had a somewhat arbitrary cutoff at 75% of util_est 
(25% spare capacity remaining). Yes, this new version is more 
aggressive, and indeed it does not keep room for inaccuracy of the 
util_est metric compared to real-life behavior of the task when it gets 
scheduled.

One option would be to change the comparison in 
task_fits_remaining_cpu_capacity() as follows:

-       return total_util <= capacity;
+       return fits_capacity(total_util, capacity);

"fits_capacity()" includes a 20% unused margin. Using this, the 
benchmark goes from 26s to 29s, which is not the end of the world, and 
would keep room for metric inaccuracy.
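
For reference, fits_capacity() is currently defined in kernel/sched/fair.c as:

#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)

so with that change total_util would have to stay below ~80% of the cpu
capacity instead of merely not exceeding it.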

Thoughts ?

Thanks,

Mathieu

> 
> thanks,
> Chenyu
  
Mathieu Desnoyers Oct. 19, 2023, 2:49 p.m. UTC | #3
On 2023-10-19 09:28, Mathieu Desnoyers wrote:
> On 2023-10-19 07:35, Chen Yu wrote:
[...]
>>> +/*
>>> + * Returns true if adding the task utilization to the estimated
>>> + * utilization of the runnable tasks on @cpu does not exceed the
>>> + * capacity of @cpu.
>>> + *
>>> + * This considers only the utilization of _runnable_ tasks on the @cpu
>>> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
>>> + * using the runqueue util_est.enqueued, and by estimating the capacity
>>> + * of @cpu based on arch_scale_cpu_capacity and arch_scale_thermal_pressure
>>> + * rather than capacity_of() because capacity_of() considers
>>> + * blocked/sleeping tasks in other scheduler classes.
>>> + *
>>> + * The utilization vs capacity comparison is done without the margin
>>> + * provided by fits_capacity(), because fits_capacity() is used to
>>> + * validate whether the utilization of a task fits within the overall
>>> + * capacity of a cpu, whereas this function validates whether the task
>>> + * utilization fits within the _remaining_ capacity of the cpu, which is
>>> + * more precise.
>>> + */
>>> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
>>> +						    int cpu)
>>> +{
>>> +	unsigned long total_util, capacity;
>>> +
>>> +	if (!sched_util_fits_capacity_active())
>>> +		return false;
>>> +	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
>>> +	capacity = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);
>>
>> scale_rt_capacity(cpu) could provide the remaining cpu capacity after subtracting
>> the side activity (rt tasks/thermal pressure/irq time); maybe it would be
>> more accurate?
> 
> AFAIU, scale_rt_capacity(cpu) works similarly to capacity_of(cpu) and 
> considers blocked and sleeping tasks in the rq->avg_rt.util_avg and 
> rq->avg_dl.util_avg. I'm not sure about rq->avg_irq.util_avg and 
> thermal_load_avg().
> 
> This goes against what is needed here: we need a utilization that only 
> considers enqueued runnable tasks (excluding blocked and sleeping tasks). 
> Or am I missing something ?
> 

I was wrong. Looking more closely at dl and rt sched classes, unlike the 
fair sched class, they don't appear to take into account 
sleeping/blocked tasks in their util_avg. They just accumulate the rq 
util_sum and derive a rq util_avg from it. Likewise for thermal and irq.

So both capacity_of(cpu) and scale_rt_capacity(cpu) would appear to do 
what we need here, but AFAIU capacity_of(cpu) is based on a metric which 
is only updated once per jiffy or so.

Let me try using scale_rt_capacity(cpu) then.
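
Roughly along these lines (untested; it also assumes scale_rt_capacity() is
forward-declared or moved up, since it is currently defined further down in
fair.c than util_est_update()):

static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
						    int cpu)
{
	unsigned long total_util;

	if (!sched_util_fits_capacity_active())
		return false;
	/* util_est of the runnable cfs tasks + the waking task's utilization. */
	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
	/* Capacity left after rt/dl/irq/thermal pressure is subtracted. */
	return total_util <= scale_rt_capacity(cpu);
}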

Thanks!

Mathieu
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..8058058afb11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4497,6 +4497,37 @@  static inline void util_est_update(struct cfs_rq *cfs_rq,
 	trace_sched_util_est_se_tp(&p->se);
 }
 
+/*
+ * Returns true if adding the task utilization to the estimated
+ * utilization of the runnable tasks on @cpu does not exceed the
+ * capacity of @cpu.
+ *
+ * This considers only the utilization of _runnable_ tasks on the @cpu
+ * runqueue, excluding blocked and sleeping tasks. This is achieved by
+ * using the runqueue util_est.enqueued, and by estimating the capacity
+ * of @cpu based on arch_scale_cpu_capacity and arch_scale_thermal_pressure
+ * rather than capacity_of() because capacity_of() considers
+ * blocked/sleeping tasks in other scheduler classes.
+ *
+ * The utilization vs capacity comparison is done without the margin
+ * provided by fits_capacity(), because fits_capacity() is used to
+ * validate whether the utilization of a task fits within the overall
+ * capacity of a cpu, whereas this function validates whether the task
+ * utilization fits within the _remaining_ capacity of the cpu, which is
+ * more precise.
+ */
+static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
+						    int cpu)
+{
+	unsigned long total_util, capacity;
+
+	if (!sched_util_fits_capacity_active())
+		return false;
+	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
+	capacity = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);
+	return total_util <= capacity;
+}
+
 static inline int util_fits_cpu(unsigned long util,
 				unsigned long uclamp_min,
 				unsigned long uclamp_max,
@@ -7124,12 +7155,15 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	int i, recent_used_cpu;
 
 	/*
-	 * On asymmetric system, update task utilization because we will check
-	 * that the task fits with cpu's capacity.
+	 * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
+	 * update task utilization because we will check that the task
+	 * fits with cpu's capacity.
 	 */
-	if (sched_asym_cpucap_active()) {
+	if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
 		sync_entity_load_avg(&p->se);
 		task_util = task_util_est(p);
+	}
+	if (sched_asym_cpucap_active()) {
 		util_min = uclamp_eff_value(p, UCLAMP_MIN);
 		util_max = uclamp_eff_value(p, UCLAMP_MAX);
 	}
@@ -7139,7 +7173,8 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
+	    task_fits_remaining_cpu_capacity(task_util, target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
@@ -7147,7 +7182,8 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) ||
+	    task_fits_remaining_cpu_capacity(task_util, prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
@@ -7173,7 +7209,8 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) ||
+	    task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) &&
 	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
 		return recent_used_cpu;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..9a84a1401123 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -97,6 +97,12 @@  SCHED_FEAT(WA_BIAS, true)
 SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
 
+/*
+ * Select the previous, target, or recent runqueue if they have enough
+ * remaining capacity to enqueue the task. Requires UTIL_EST.
+ */
+SCHED_FEAT(UTIL_FITS_CAPACITY, true)
+
 SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..463e75084aed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2090,6 +2090,11 @@  static const_debug __maybe_unused unsigned int sysctl_sched_features =
 
 #endif /* SCHED_DEBUG */
 
+static __always_inline bool sched_util_fits_capacity_active(void)
+{
+	return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY);
+}
+
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;