sched/fair: Use all little CPUs for CPU-bound workload

Message ID 20231110125902.2152380-1-pierre.gondois@arm.com
State New
Series sched/fair: Use all little CPUs for CPU-bound workload

Commit Message

Pierre Gondois Nov. 10, 2023, 12:59 p.m. UTC
  Running n CPU-bound tasks on an n-CPU platform with asymmetric CPU
capacity might result in a task placement where two tasks run on a
big CPU and none on a little CPU. The placement could be improved by
using all CPUs.

Testing platform:
Juno-r2:
- 2 big CPUs (1-2), maximum capacity of 1024
- 4 little CPUs (0,3-5), maximum capacity of 383

Testing workload ([1]):
Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each task
is affined to a CPU, except for:
- one little CPU, which is left idle.
- one big CPU, which has 2 tasks affined.
After the 100ms (step 2), remove the cpumask affinity.
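
The actual test is an rt-app workload ([1]). Purely as an
illustration, a minimal pthread equivalent of the two steps could
look as follows (hypothetical code, not the test itself; the CPU
numbering assumes the Juno-r2 layout above):

	/* gcc -O2 -pthread workload.c -o workload */
	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>
	#include <unistd.h>

	/* Step 1 placement: big CPU 1 gets two tasks, little CPU 0
	 * stays idle, every other CPU gets one task. */
	static const int step1_cpu[6] = { 1, 1, 2, 3, 4, 5 };

	static void *busy(void *arg)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(step1_cpu[(long)arg], &set);
		pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

		for (;;)	/* CPU-bound loop, runs through both steps */
			;
	}

	int main(void)
	{
		pthread_t th[6];
		cpu_set_t all;
		long i;

		for (i = 0; i < 6; i++)
			pthread_create(&th[i], NULL, busy, (void *)i);

		usleep(100 * 1000);	/* step 1 lasts 100ms */

		/* Step 2: clear the affinity and let the load balancer
		 * spread the 6 tasks over the 6 CPUs. */
		CPU_ZERO(&all);
		for (i = 0; i < 6; i++)
			CPU_SET(i, &all);
		for (i = 0; i < 6; i++)
			pthread_setaffinity_np(th[i], sizeof(all), &all);

		pause();	/* observe the resulting placement */
		return 0;
	}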

Before patch:
During step 2, the load balancer running from the idle CPU tags the
sched groups as:
- little CPUs: 'group_has_spare'. Indeed, 3 CPU-bound tasks run on a
  4-CPU group, and the idle CPU provides enough spare capacity.
- big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2-CPU group,
  so group_is_overloaded() skips its early exit and flags the group
  overloaded:
  group_is_overloaded()
  \-if (sgs->sum_nr_running <= sgs->group_weight) return false;

  The following path, which would change the migration type to
  'migrate_task', is not taken:
  calculate_imbalance()
  \-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
  Since the local group has some spare capacity, the imbalance
  is not 0.
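
For reference, the region of calculate_imbalance() that this patch
modifies reads roughly as follows at the series' baseline (a
condensed excerpt; see kernel/sched/fair.c for the exact code):

	if (local->group_type == group_has_spare) {
		if ((busiest->group_type > group_fully_busy) &&
		    !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
			/* Busiest is overloaded: try to fill the local
			 * group's spare capacity with utilization. */
			env->migration_type = migrate_util;
			env->imbalance = max(local->group_capacity,
					     local->group_util) -
					 local->group_util;

			/* Only a zero utilization-based imbalance falls
			 * back to pulling one task; not taken here. */
			if (env->idle != CPU_NOT_IDLE &&
			    env->imbalance == 0) {
				env->migration_type = migrate_task;
				env->imbalance = 1;
			}
			return;
		}
		...
	}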

The migration type requested is 'migrate_util' and the busiest
runqueue is the big CPU's runqueue, which has 2 tasks (each with a
utilization of 512). The idle little CPU cannot pull one of these
tasks as its capacity is too small for the task: env->imbalance here
amounts to the local group's spare capacity, roughly the capacity of
the one idle little CPU (~383), and 512 > 383. The following path is
used:
detach_tasks()
\-case migrate_util:
  \-if (util > env->imbalance) goto next;

After patch:
When the local group has spare capacity and the busiest group is at
least tagged as 'group_fully_busy', if the local group has more CPUs
than CFS tasks and the busiest group has more CFS tasks than CPUs,
request a 'migrate_task' type migration. Here, the little group has 3
tasks for 4 CPUs and the big group 3 tasks for 2 CPUs, so the idle
little CPU can pull one task.

Improvement:
Running the testing workload [1] with step 2 representing
a ~10s load for a big CPU:
Before patch: ~19.3s
After patch: ~18s (-6.7%)

The issue only happens at the DIE level on platforms able to use the
'migrate_util' migration type, i.e. not on DynamIQ systems, where
SD_SHARE_PKG_RESOURCES is set.

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
---
 kernel/sched/fair.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
  

Comments

Dietmar Eggemann Nov. 16, 2023, 4:34 p.m. UTC | #1
On 10/11/2023 13:59, Pierre Gondois wrote:
> Running n CPU-bound tasks on an n-CPU platform with asymmetric CPU
> capacity might result in a task placement where two tasks run on a
> big CPU and none on a little CPU. The placement could be improved by
> using all CPUs.
> 
> Testing platform:
> Juno-r2:
> - 2 big CPUs (1-2), maximum capacity of 1024
> - 4 little CPUs (0,3-5), maximum capacity of 383
> 
> Testing workload ([1]):
> Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each task
> is affined to a CPU, except for:
> - one little CPU, which is left idle.
> - one big CPU, which has 2 tasks affined.
> After the 100ms (step 2), remove the cpumask affinity.

I used your workload on my Juno-r0 with LISA (rt-app overwrites the
mainline CPU capacity values [446 1024 1024 446 446 446] with [675 1024
1024 675 675 675] to adapt to the rt-app busy loop's instruction mix).

Here I can't see the issue you bring up. The two tasks sharing a
CPU have util_avg = ~512, and the load balance of one task to the
idle little CPU does happen.

I assume it's down to the difference in CPU capacity of the little
CPUs: 383 < 512 < 675?

> 
> Before patch:
> During step 2, the load balancer running from the idle CPU tags the
> sched groups as:
> - little CPUs: 'group_has_spare'. Indeed, 3 CPU-bound tasks run on a
>   4-CPU group, and the idle CPU provides enough spare capacity.
> - big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2-CPU group,
>   so group_is_overloaded() skips its early exit and flags the group
>   overloaded:
>   group_is_overloaded()
>   \-if (sgs->sum_nr_running <= sgs->group_weight) return false;
> 
>   The following path which would change the migration type to
>   'migrate_task' is not taken:
>   calculate_imbalance()
>   \-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
> Since the local group has some spare capacity, the imbalance
> is not 0.
> 
> The migration type requested is 'migrate_util' and the busiest
> runqueue is the big CPU's runqueue having 2 tasks (each having a
> utilization of 512). The idle little CPU cannot pull one of these
> tasks as its capacity is too small for the task. The following path
> is used:

Ah, here you're describing the issue I mentioned above.

> detach_tasks()
> \-case migrate_util:
>   \-if (util > env->imbalance) goto next;
> 
> After patch:
> When the local group has spare capacity and the busiest group is at
> least tagged as 'group_fully_busy', if the local group has more CPUs
> than CFS tasks and the busiest group has more CFS tasks than CPUs,
> request a 'migrate_task' type migration.
> 
> Improvement:
> Running the testing workload [1] with step 2 representing
> a ~10s load for a big CPU:
> Before patch: ~19.3s
> After patch: ~18s (-6.7%)
> 
> The issue only happens at the DIE level on platforms able to have
> 'migrate_util' migration types, i.e. not on DynamIQ systems, where
> SD_SHARE_PKG_RESOURCES is set.

Right, mainline Arm DynamIQ should only have 1 SD level (MC). Android
might still be affected since they run MC and DIE.

> 
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
>  kernel/sched/fair.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df348aa55d3c..5a215c96d420 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10495,6 +10495,23 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->imbalance = max(local->group_capacity, local->group_util) -
>  					 local->group_util;
>  
> +			/*
> +			 * On an asymmetric system with CPU-bound tasks, a
> +			 * migrate_util balance might not be able to migrate a
> +			 * task from a big to a little CPU, leaving a little
> +			 * CPU unused.
> +			 * If local has an empty CPU and busiest is overloaded,
> +			 * balance one task with a migrate_task migration type
> +			 * instead.
> +			 */
> +			if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +			    local->sum_nr_running < local->group_weight &&
> +			    busiest->sum_nr_running > busiest->group_weight) {
> +				env->migration_type = migrate_task;
> +				env->imbalance = 1;
> +				return;
> +			}
> +
>  			/*
>  			 * In some cases, the group's utilization is max or even
>  			 * higher than capacity because of migrations but the
  
Vincent Guittot Nov. 17, 2023, 2:17 p.m. UTC | #2
Hi Pierre,

On Fri, 10 Nov 2023 at 13:59, Pierre Gondois <pierre.gondois@arm.com> wrote:
>
> Running n CPU-bound tasks on an n-CPU platform with asymmetric CPU
> capacity might result in a task placement where two tasks run on a
> big CPU and none on a little CPU. The placement could be improved by
> using all CPUs.
>
> Testing platform:
> Juno-r2:
> - 2 big CPUs (1-2), maximum capacity of 1024
> - 4 little CPUs (0,3-5), maximum capacity of 383
>
> Testing workload ([1]):
> Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each task
> is affined to a CPU, except for:
> - one little CPU, which is left idle.
> - one big CPU, which has 2 tasks affined.
> After the 100ms (step 2), remove the cpumask affinity.
>
> Before patch:
> During step 2, the load balancer running from the idle CPU tags the
> sched groups as:
> - little CPUs: 'group_has_spare'. Indeed, 3 CPU-bound tasks run on a
>   4-CPU group, and the idle CPU provides enough spare capacity.
> - big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2-CPU group,
>   so group_is_overloaded() skips its early exit and flags the group
>   overloaded:
>   group_is_overloaded()
>   \-if (sgs->sum_nr_running <= sgs->group_weight) return false;

This reminds me of a similar discussion with Qais:
https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/

>
>   The following path which would change the migration type to
>   'migrate_task' is not taken:
>   calculate_imbalance()
>   \-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
> Since the local group has some spare capacity, the imbalance
> is not 0.
>
> The migration type requested is 'migrate_util' and the busiest
> runqueue is the big CPU's runqueue having 2 tasks (each having a
> utilization of 512). The idle little CPU cannot pull one of these
> tasks as its capacity is too small for the task. The following path
> is used:
> detach_tasks()
> \-case migrate_util:
>   \-if (util > env->imbalance) goto next;
>
> After patch:
> When the local group has spare capacity and the busiest group is at
> least tagged as 'group_fully_busy', if the local group has more CPUs

the busiest group is more than 'group_fully_busy'

> than CFS tasks and the busiest group has more CFS tasks than CPUs,
> request a 'migrate_task' type migration.
>
> Improvement:
> Running the testing workload [1] with step 2 representing
> a ~10s load for a big CPU:
> Before patch: ~19.3s
> After patch: ~18s (-6.7%)
>
> The issue only happens at the DIE level on platforms able to have
> 'migrate_util' migration types, i.e. not on DynamIQ systems, where
> SD_SHARE_PKG_RESOURCES is set.
>
> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
> ---
>  kernel/sched/fair.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df348aa55d3c..5a215c96d420 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10495,6 +10495,23 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                         env->imbalance = max(local->group_capacity, local->group_util) -
>                                          local->group_util;
>
> +                       /*
> +                        * On an asymmetric system with CPU-bound tasks, a
> +                        * migrate_util balance might not be able to migrate a
> +                        * task from a big to a little CPU, leaving a little
> +                        * CPU unused.
> +                        * If local has an empty CPU and busiest is overloaded,

group_has_spare doesn't mean that the local group has an empty CPU,
but that one or more CPUs might be idle some of the time, which may
not be the case when the load balance happens.

> +                        * balance one task with a migrate_task migration type
> +                        * instead.
> +                        */
> +                       if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +                           local->sum_nr_running < local->group_weight &&
> +                           busiest->sum_nr_running > busiest->group_weight) {
> +                               env->migration_type = migrate_task;
> +                               env->imbalance = 1;

I wonder if this is too aggressive. We can have cases where
(local->sum_nr_running < local->group_weight) at the time of the load
balance because one CPU can be briefly idle, and you would migrate a
task that will then compete with another one on a little core. So
maybe you should do something similar to the migrate_load case in
detach_tasks, like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc8e9ced6aa8..3a04fa0f1eae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8977,7 +8977,7 @@ static int detach_tasks(struct lb_env *env)
                case migrate_util:
                        util = task_util_est(p);

-                       if (util > env->imbalance)
+                       if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
                                goto next;

                        env->imbalance -= util;
--

This should cover more intermediate cases and would benefit more
topologies and cases.
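
For context, shr_bound() is an existing helper in kernel/sched/sched.h
(a right shift clamped to the width of the type), and detach_tasks()
already applies it this way in the migrate_load case. The effect is
that each failed balance attempt halves the utilization a task must
fit under. A standalone sketch of the effect with the numbers from the
commit message (task util 512, imbalance of ~383 on the little group);
the helper below is a simplified model, not the kernel's definition:

	#include <stdio.h>

	/* Simplified stand-in for the kernel's shr_bound() */
	static unsigned long shr_bound(unsigned long val, unsigned int shift)
	{
		unsigned int max_shift = 8 * sizeof(val) - 1;

		return val >> (shift < max_shift ? shift : max_shift);
	}

	int main(void)
	{
		unsigned long util = 512, imbalance = 383;
		unsigned int failed;

		for (failed = 0; failed < 3; failed++)
			printf("nr_balance_failed=%u: effective util %lu -> %s\n",
			       failed, shr_bound(util, failed),
			       shr_bound(util, failed) > imbalance ?
			       "goto next" : "detach task");
		return 0;
	}

With these numbers the first attempt still skips the 512-util task,
but as soon as nr_balance_failed is non-zero the task fits under the
~383 imbalance and can be pulled to the idle little CPU.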

> +                               return;
> +                       }
> +
>                         /*
>                          * In some cases, the group's utilization is max or even
>                          * higher than capacity because of migrations but the
> --
> 2.25.1
>
  
Pierre Gondois Nov. 21, 2023, 8:58 a.m. UTC | #3
Hello Vincent,

On 11/17/23 15:17, Vincent Guittot wrote:
> Hi Pierre,
> 
> On Fri, 10 Nov 2023 at 13:59, Pierre Gondois <pierre.gondois@arm.com> wrote:
>>
>> Running n CPU-bound tasks on an n-CPU platform with asymmetric CPU
>> capacity might result in a task placement where two tasks run on a
>> big CPU and none on a little CPU. The placement could be improved by
>> using all CPUs.
>>
>> Testing platform:
>> Juno-r2:
>> - 2 big CPUs (1-2), maximum capacity of 1024
>> - 4 little CPUs (0,3-5), maximum capacity of 383
>>
>> Testing workload ([1]):
>> Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each task
>> is affined to a CPU, except for:
>> - one little CPU, which is left idle.
>> - one big CPU, which has 2 tasks affined.
>> After the 100ms (step 2), remove the cpumask affinity.
>>
>> Before patch:
>> During step 2, the load balancer running from the idle CPU tags the
>> sched groups as:
>> - little CPUs: 'group_has_spare'. Indeed, 3 CPU-bound tasks run on a
>>    4-CPU group, and the idle CPU provides enough spare capacity.
>> - big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2-CPU group,
>>    so group_is_overloaded() skips its early exit and flags the group
>>    overloaded:
>>    group_is_overloaded()
>>    \-if (sgs->sum_nr_running <= sgs->group_weight) return false;
> 
> This remembers me a similar discussion with Qais:
> https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/

Yes indeed, this is exactly the same case, and the same backstory
actually.

> 
>>
>>    The following path which would change the migration type to
>>    'migrate_task' is not taken:
>>    calculate_imbalance()
>>    \-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
>>    Since the local group has some spare capacity, the imbalance
>>    is not 0.
>>
>> The migration type requested is 'migrate_util' and the busiest
>> runqueue is the big CPU's runqueue having 2 tasks (each having a
>> utilization of 512). The idle little CPU cannot pull one of these
>> tasks as its capacity is too small for the task. The following path
>> is used:
>> detach_tasks()
>> \-case migrate_util:
>>    \-if (util > env->imbalance) goto next;
>>
>> After patch:
>> When the local group has spare capacity and the busiest group is at
>> least tagged as 'group_fully_busy', if the local group has more CPUs
> 
> the busiest group is more than 'group_fully_busy'
> 
>> than CFS tasks and the busiest group has more CFS tasks than CPUs,
>> request a 'migrate_task' type migration.
>>
>> Improvement:
>> Running the testing workload [1] with step 2 representing
>> a ~10s load for a big CPU:
>> Before patch: ~19.3s
>> After patch: ~18s (-6.7%)
>>
>> The issue only happens at the DIE level on platforms able to have
>> 'migrate_util' migration types, i.e. not on DynamIQ systems, where
>> SD_SHARE_PKG_RESOURCES is set.
>>
>> Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
>> ---
>>   kernel/sched/fair.c | 17 +++++++++++++++++
>>   1 file changed, 17 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index df348aa55d3c..5a215c96d420 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10495,6 +10495,23 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>                          env->imbalance = max(local->group_capacity, local->group_util) -
>>                                           local->group_util;
>>
>> +                       /*
>> +                        * On an asymmetric system with CPU-bound tasks, a
>> +                        * migrate_util balance might not be able to migrate a
>> +                        * task from a big to a little CPU, leaving a little
>> +                        * CPU unused.
>> +                        * If local has an empty CPU and busiest is overloaded,
> 
> group_has_spare doesn't mean that the local group has an empty CPU,
> but that one or more CPUs might be idle some of the time, which may
> not be the case when the load balance happens.
> 
>> +                        * balance one task with a migrate_task migration type
>> +                        * instead.
>> +                        */
>> +                       if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
>> +                           local->sum_nr_running < local->group_weight &&
>> +                           busiest->sum_nr_running > busiest->group_weight) {
>> +                               env->migration_type = migrate_task;
>> +                               env->imbalance = 1;
> 
> I wonder if this is too aggressive. We can have cases where
> (local->sum_nr_running < local->group_weight) at the time of the load
> balance because one CPU can be briefly idle, and you would migrate a
> task that will then compete with another one on a little core. So

Ok right.

> maybe you should do something similar to the migrate_load case in
> detach_tasks, like:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fc8e9ced6aa8..3a04fa0f1eae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8977,7 +8977,7 @@ static int detach_tasks(struct lb_env *env)
>                  case migrate_util:
>                          util = task_util_est(p);
> 
> -                       if (util > env->imbalance)
> +                       if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
>                                  goto next;
> 
>                          env->imbalance -= util;
> --
> 
> This should cover more intermediate cases and would benefit more
> topologies and cases.

Your change also solves the issue. I'll try to see if it might
raise other issues.

Thanks for the suggestion,
Regards,
Pierre

> 
>> +                               return;
>> +                       }
>> +
>>                          /*
>>                           * In some cases, the group's utilization is max or even
>>                           * higher than capacity because of migrations but the
>> --
>> 2.25.1
>>
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..5a215c96d420 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10495,6 +10495,23 @@  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->imbalance = max(local->group_capacity, local->group_util) -
 					 local->group_util;
 
+			/*
+			 * On an asymmetric system with CPU-bound tasks, a
+			 * migrate_util balance might not be able to migrate a
+			 * task from a big to a little CPU, leaving a little
+			 * CPU unused.
+			 * If local has an empty CPU and busiest is overloaded,
+			 * balance one task with a migrate_task migration type
+			 * instead.
+			 */
+			if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+			    local->sum_nr_running < local->group_weight &&
+			    busiest->sum_nr_running > busiest->group_weight) {
+				env->migration_type = migrate_task;
+				env->imbalance = 1;
+				return;
+			}
+
 			/*
 			 * In some cases, the group's utilization is max or even
 			 * higher than capacity because of migrations but the