[RFC,1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

Message ID 20230515114601.12737-2-huschle@linux.ibm.com
State New
Series sched/fair: Consider asymmetric scheduler groups in load balancer

Commit Message

Tobias Huschle May 15, 2023, 11:46 a.m. UTC
  The current load balancer implementation implies that scheduler groups
within the same domain all host the same number of CPUs. This is
reflected in the condition that a scheduler group which is load
balancing and classified as having spare capacity should pull work
from the busiest group if the local group runs fewer processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. s390, however, can have scheduler groups of unequal
size.

This introduces a performance degradation in the following scenario:

Consider a system with 8 CPUs: 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8
         x x x          x x

Adding a 6th task yields the following distribution:

Socket   -----1-----    -2-
CPU      1 2 3 4 5 6    7 8
SMT1     x x x          x x
SMT2                    x

The task is added to the 2nd scheduler group, as the scheduler assumes
that scheduler groups are of the same size and should therefore also
host the same number of tasks. This makes CPU 7 run into an SMT
thread, which comes with a performance penalty. This means that, in
the window of 6-8 tasks, load balancing is done suboptimally, because
SMT is used although there is no reason to do so, as fully idle CPUs
are still available.

Taking the weight of the scheduler groups into account ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
  

Comments

Vincent Guittot May 16, 2023, 1:36 p.m. UTC | #1
On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com> wrote:
>
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degredation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
>          x x x          x x
>
> Adding a 6th task yields the following distribution:
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
> SMT1     x x x          x x
> SMT2                    x

Your description is a bit confusing for me. What you name CPU above
should be named Core, doesn't it?

Could you share with us your scheduler topology ?

>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account, ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>          * group's child domain.
>          */
>         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -           busiest->sum_nr_running > local->sum_nr_running + 1)
> +           busiest->sum_nr_running * local->group_weight >
> +                       local->sum_nr_running * busiest->group_weight + 1)

This is the prefer_sibling path. Could it be that you should disable
prefer_sibling between your sockets for such a topology? The default
path compares the number of idle CPUs when groups have spare capacity.


>                 goto force_balance;
>
>         if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
>
  
Tobias Huschle June 5, 2023, 8:07 a.m. UTC | #2
On 2023-05-16 15:36, Vincent Guittot wrote:
> On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com> 
> wrote:
>> 
>> The current load balancer implementation implies that scheduler 
>> groups,
>> within the same domain, all host the same number of CPUs. This is
>> reflected in the condition, that a scheduler group, which is load
>> balancing and classified as having spare capacity, should pull work
>> from the busiest group, if the local group runs less processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>> 
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures. Nevertheless, s390 can have scheduler groups of unequal
>> size.
>> 
>> This introduces a performance degredation in the following scenario:
>> 
>> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> the remaining 2 are located on another socket:
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> 
>> Placing some workload ( x = one task ) yields the following
>> scenarios:
>> 
>> The first 5 tasks are distributed evenly across the two groups.
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>>          x x x          x x
>> 
>> Adding a 6th task yields the following distribution:
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> SMT1     x x x          x x
>> SMT2                    x
> 
> Your description is a bit confusing for me. What you name CPU above
> should be named Core, doesn' it ?
> 
> Could you share with us your scheduler topology ?
> 

You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
  0    0      0    0      0    0 0:0:0         yes yes        horizontal      0
  1    0      0    0      0    0 1:1:0         yes yes        horizontal      1
  2    0      0    0      0    1 2:2:0         yes yes        horizontal      2
  3    0      0    0      0    1 3:3:0         yes yes        horizontal      3
  4    0      0    0      0    2 4:4:0         yes yes        horizontal      4
  5    0      0    0      0    2 5:5:0         yes yes        horizontal      5
  6    0      0    0      0    3 6:6:0         yes yes        horizontal      6
  7    0      0    0      0    3 7:7:0         yes yes        horizontal      7
  8    0      0    0      0    4 8:8:0         yes yes        horizontal      8
  9    0      0    0      0    4 9:9:0         yes yes        horizontal      9
 10    0      0    0      0    5 10:10:0       yes yes        horizontal     10
 11    0      0    0      0    5 11:11:0       yes yes        horizontal     11
 12    0      0    0      1    6 12:12:0       yes yes        horizontal     12
 13    0      0    0      1    6 13:13:0       yes yes        horizontal     13
 14    0      0    0      1    7 14:14:0       yes yes        horizontal     14
 15    0      0    0      1    7 15:15:0       yes yes        horizontal     15

So, 6 cores / 12 CPUs in one group and 2 cores / 4 CPUs in the other.

If I run stress-ng with 8 cpu stressors on the original code I get a
distribution like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
                 x     x     x     x      x  x  x  x

Which means that the two cores in the smaller group are running into SMT
while two cores in the larger group are still idle. This is caused by the
prefer_sibling path, which really wants to see both groups run the same
number of tasks.

>> 
>> The task is added to the 2nd scheduler group, as the scheduler has the
>> assumption that scheduler groups are of the same size, so they should
>> also host the same number of tasks. This makes CPU 7 run into SMT
>> thread, which comes with a performance penalty. This means, that in
>> the window of 6-8 tasks, load balancing is done suboptimally, because
>> SMT is used although there is no reason to do so as fully idle CPUs
>> are still available.
>> 
>> Taking the weight of the scheduler groups into account, ensures that
>> a load balancing CPU within a smaller group will not try to pull tasks
>> from a bigger group while the bigger group still has idle CPUs
>> available.
>> 
>> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> ---
>>  kernel/sched/fair.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group 
>> *find_busiest_group(struct lb_env *env)
>>          * group's child domain.
>>          */
>>         if (sds.prefer_sibling && local->group_type == group_has_spare 
>> &&
>> -           busiest->sum_nr_running > local->sum_nr_running + 1)
>> +           busiest->sum_nr_running * local->group_weight >
>> +                       local->sum_nr_running * busiest->group_weight 
>> + 1)
> 
> This is the prefer_sibling path. Could it be that you should disable
> prefer_siling between your sockets for such topology ? the default
> path compares the number of idle CPUs when groups has spare capacity
> 
> 

If I disable prefer_sibling (I played around with it a bit), I run into
the problem that the tasks are distributed such that each group has the
same number of idle CPUs, which yields distributions similar to this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
     x  x  x     x  x     x     x  x

Now both groups have 4 idle CPUs, which fulfills the criteria imposed by
the load balancer, but the larger group is now running SMT while the
smaller one is just idle.

So, in this asymmetric setup, both criteria seem to not work optimally.
Going for the same number of idle CPUs or, alternatively, for the same
number of running processes both cause a sub-optimal distribution of
tasks, leading to unnecessary SMT.

It also seems possible to address the regular load balancing path by
aiming to have the same unused capacity between groups instead of the
same number of idle CPUs. This seems to have been considered in the
past, but the choice went in favor of the number of idle CPUs. Since
this decision was actively taken already, I focused on the
prefer_sibling path.

The question now would be how to address this properly (or if I'm
missing something here). As mentioned in the cover letter, this was the
most simplistic and least invasive approach I could find; others might
be more sophisticated but also have some side-effects.

I have a bit of a hard time leaving this one as-is, as it just
introduces two additional multiplications with no effect for most
architectures. Maybe an architecture-specific inline function that the
compiler can optimize away if not needed?

>>                 goto force_balance;
>> 
>>         if (busiest->group_type != group_overloaded) {
>> --
>> 2.34.1
>>
  
Peter Zijlstra July 4, 2023, 1:40 p.m. UTC | #3
On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not 
> of the same size.
> 
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390 
> architectures.

Mostly; there's historically the cpuset case where we can create
lopsided groups like that. And today we're growing all these hybrid
things that will also tickle this, except they're looking towards
different balancer extensions to deal with the IPC difference so might
not be immediately caring about this here issue.


> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	 * group's child domain.
>  	 */
>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
> +	    busiest->sum_nr_running * local->group_weight >
> +			local->sum_nr_running * busiest->group_weight + 1)

Should that not be: busiest->group_weight * (local->sum_nr_running + 1) ?

I'm not opposed to this -- it seems fairly straightforward.

>  		goto force_balance;
>  
>  	if (busiest->group_type != group_overloaded) {
> -- 
> 2.34.1
>
  
Vincent Guittot July 5, 2023, 7:52 a.m. UTC | #4
On Monday, 05 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
> On 2023-05-16 15:36, Vincent Guittot wrote:
> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
> > wrote:
> > > 
> > > The current load balancer implementation implies that scheduler
> > > groups,
> > > within the same domain, all host the same number of CPUs. This is
> > > reflected in the condition, that a scheduler group, which is load
> > > balancing and classified as having spare capacity, should pull work
> > > from the busiest group, if the local group runs less processes than
> > > the busiest one. This implies that these two groups should run the
> > > same number of processes, which is problematic if the groups are not
> > > of the same size.
> > > 
> > > The assumption that scheduler groups within the same scheduler domain
> > > host the same number of CPUs appears to be true for non-s390
> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
> > > size.
> > > 
> > > This introduces a performance degredation in the following scenario:
> > > 
> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > > the remaining 2 are located on another socket:
> > > 
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > > 
> > > Placing some workload ( x = one task ) yields the following
> > > scenarios:
> > > 
> > > The first 5 tasks are distributed evenly across the two groups.
> > > 
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > >          x x x          x x
> > > 
> > > Adding a 6th task yields the following distribution:
> > > 
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > > SMT1     x x x          x x
> > > SMT2                    x
> > 
> > Your description is a bit confusing for me. What you name CPU above
> > should be named Core, doesn' it ?
> > 
> > Could you share with us your scheduler topology ?
> > 
> 
> You are correct, it should say core instead of CPU.
> 
> One actual configuration from one of my test machines (lscpu -e):
> 

[...]

> 
> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.

Thanks for the details

> 
> If I run stress-ng with 8 cpu stressors on the original code I get a
> distribution
> like this:
> 
> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>                 x     x     x     x      x  x  x  x
> 
> Which means that the two cores in the smaller group are running into SMT
> while two
> cores in the larger group are still idle. This is caused by the
> prefer_sibling path
> which really wants to see both groups run the same number of tasks.

Yes, and it considers that there are the same number of CPUs per group.

> 
> > > 
> > > The task is added to the 2nd scheduler group, as the scheduler has the
> > > assumption that scheduler groups are of the same size, so they should
> > > also host the same number of tasks. This makes CPU 7 run into SMT
> > > thread, which comes with a performance penalty. This means, that in
> > > the window of 6-8 tasks, load balancing is done suboptimally, because
> > > SMT is used although there is no reason to do so as fully idle CPUs
> > > are still available.
> > > 
> > > Taking the weight of the scheduler groups into account, ensures that
> > > a load balancing CPU within a smaller group will not try to pull tasks
> > > from a bigger group while the bigger group still has idle CPUs
> > > available.
> > > 
> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> > > ---
> > >  kernel/sched/fair.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0ca13ac..b1307d7e4065 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10426,7 +10426,8 @@ static struct sched_group
> > > *find_busiest_group(struct lb_env *env)
> > >          * group's child domain.
> > >          */
> > >         if (sds.prefer_sibling && local->group_type ==
> > > group_has_spare &&
> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
> > > +           busiest->sum_nr_running * local->group_weight >
> > > +                       local->sum_nr_running *
> > > busiest->group_weight + 1)


What you want to test here is that moving 1 task from the busiest to the local
group would help and balance the ratio of tasks per CPU:

(busiest->sum_nr_running - 1) / busiest->group_weight > (local->sum_nr_running + 1) / local->group_weight

which can be developed into

busiest->sum_nr_running * local->group_weight > local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight

And you also have to change how we calculate the imbalance, which currently just
provides half of the diff of nr_running, to something like

(busiest->sum_nr_running * local->group_weight - local->sum_nr_running * busiest->group_weight) / (busiest->group_weight + local->group_weight)

> > 
> > This is the prefer_sibling path. Could it be that you should disable
> > prefer_siling between your sockets for such topology ? the default
> > path compares the number of idle CPUs when groups has spare capacity
> > 
> > 
> 
> If I disable prefer_sibling (I played around with it a bit), I run into the
> problem,
> that the tasks are distributed s.t. each group has the same amount of idle
> CPUs, which
> yields distributions similar to this:
> 
> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>     x  x  x     x  x     x     x  x
> 
> Now both groups have 4 idle CPUs which fulfills the criteria imposed by the
> load balancer,
> but the larger group is now running SMT while the smaller one is just idle.
> 
> So, in this asymmetric setup, both criteria seem to not work in an optimal
> way. Going for
> the same number of idle CPUs or alternatively for the same number of running
> processes
> both cause a sub-optimal distribution of tasks, leading to unnecessary SMT.

there is the same behavior and assumption here too


> 
> It seems also to be possible to address the regular load balancing path by
> aiming to have the
> same unused capacity between groups instead of the same number of idle CPUs.
> This seems to
> have been considered in the past, but the choice went in favor of the number
> of idle CPUs.

Unused capacity doesn't give the instantaneous state, so a group can be idle but
without unused capacity.

> Since this decision was actively taken already, I focused on the
> prefer_sibling path.
> 
> The question now would be how to address this properly (or if I'm missing
> something here).
> As mentioned in the cover letter, this was the most simplistic and least
> invasive approach
> I could find, others might be more sophisticated but also have some
> side-effects.
> 
> I have a bit of a hard time leaving this one as-is, as it just introduces
> two additional
> multiplications with no effect for most architectures. Maybe an
> architectures specific
> inline function that the compiler can optimize away if not needed?
> 
> > >                 goto force_balance;
> > > 
> > >         if (busiest->group_type != group_overloaded) {
> > > --
> > > 2.34.1
> > > 
>
  
Shrikanth Hegde July 6, 2023, 5:19 p.m. UTC | #5
On 5/15/23 5:16 PM, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not 
> of the same size.
> 
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390 
> architectures. Nevertheless, s390 can have scheduler groups of unequal 
> size.
> 
> This introduces a performance degredation in the following scenario:
> 
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
> 
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
> 
> Placing some workload ( x = one task ) yields the following
> scenarios:
> 
> The first 5 tasks are distributed evenly across the two groups.
> 
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
>          x x x          x x
> 
> Adding a 6th task yields the following distribution:
> 
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
> SMT1     x x x          x x
> SMT2                    x
> 
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT 
> thread, which comes with a performance penalty. This means, that in 
> the window of 6-8 tasks, load balancing is done suboptimally, because 
> SMT is used although there is no reason to do so as fully idle CPUs 
> are still available.
> 
> Taking the weight of the scheduler groups into account, ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
> 
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	 * group's child domain.
>  	 */
>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
> +	    busiest->sum_nr_running * local->group_weight >
> +			local->sum_nr_running * busiest->group_weight + 1)
>  		goto force_balance;
> 


I assume it's SMT2 here. A sched group is denoted as [busy_cpus/idle_cpus/weight].

Let's take an example; we will begin with the said imbalance:
[3/9/12] -- local group
[3/1/4]  -- busiest group
We will evaluate 3*12 > (4*(3+1)), which is true, and proceed further to balance.
But calculate_imbalance will lead to zero as the imbalance, no? In the
prefer_sibling case it finds the difference of sum_nr_running of the two groups,
which will be 3-3 = 0.

this would need modifications to calculate_imbalance. 
Maybe something like this will help? NOT TESTED AT ALL. 

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e027d4086edc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
                        return;
                }

-               if (busiest->group_weight == 1 || sds->prefer_sibling) {
+               if (busiest->group_weight == 1 ||
+                       (sds->prefer_sibling &&
+                        busiest->group_weight == local->group_weight)) {
                        unsigned int nr_diff = busiest->sum_nr_running;
                        /*
                         * When prefer sibling, evenly spread running tasks on
@@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
                        env->migration_type = migrate_task;
                        lsub_positive(&nr_diff, local->sum_nr_running);
                        env->imbalance = nr_diff;
+               }
+               if (sds->prefer_sibling &&
+                   busiest->group_weight != local->group_weight) {
+                       env->migration_type = migrate_task;
+                       env->imbalance = 1;
                } else {


---------------------------------------------------------------------------------------------------
On a tangential dimension.


Since prefer_sibling makes the inherent assumption that all groups have equal
weight, it will cause complications when group weights are different. I think
that becomes very tricky when there are more than two groups. Depending on
which CPU is doing the load balance, there can be unnecessary migrations.

Currently, even in NUMA, this can be a similar case, right? There will be an
unequal number of CPUs. If we fix that case and remove prefer_sibling in your
arch, will that work?

Maybe something like this? NOT TESTED AT ALL.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..fc6377f48ced 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
                        goto out_balanced;
 
                if (busiest->group_weight > 1 &&
+                   busiest->group_weight == local->group_weight &&
                    local->idle_cpus <= (busiest->idle_cpus + 1))
                        /*
                         * If the busiest group is not overloaded
@@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
                         */
                        goto out_balanced;
 
+               if ((busiest->group_weight != local->group_weight) &&
+                     (local->idle_cpus * busiest->group_weight <=
+                              local->group_weight * (busiest->idle_cpus + 1)))
+                       goto out_balanced;
+
                if (busiest->sum_h_nr_running == 1)
                        /*
                         * busiest doesn't have any tasks waiting to run





>  	if (busiest->group_type != group_overloaded) {
  
Tobias Huschle July 7, 2023, 7:44 a.m. UTC | #6
On 2023-07-05 09:52, Vincent Guittot wrote:
> On Monday, 05 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
>> On 2023-05-16 15:36, Vincent Guittot wrote:
>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
>> > wrote:
>> > >
>> > > The current load balancer implementation implies that scheduler
>> > > groups,
>> > > within the same domain, all host the same number of CPUs. This is
>> > > reflected in the condition, that a scheduler group, which is load
>> > > balancing and classified as having spare capacity, should pull work
>> > > from the busiest group, if the local group runs less processes than
>> > > the busiest one. This implies that these two groups should run the
>> > > same number of processes, which is problematic if the groups are not
>> > > of the same size.
>> > >
>> > > The assumption that scheduler groups within the same scheduler domain
>> > > host the same number of CPUs appears to be true for non-s390
>> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
>> > > size.
>> > >
>> > > This introduces a performance degredation in the following scenario:
>> > >
>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> > > the remaining 2 are located on another socket:
>> > >
>> > > Socket   -----1-----    -2-
>> > > CPU      1 2 3 4 5 6    7 8
>> > >
>> > > Placing some workload ( x = one task ) yields the following
>> > > scenarios:
>> > >
>> > > The first 5 tasks are distributed evenly across the two groups.
>> > >
>> > > Socket   -----1-----    -2-
>> > > CPU      1 2 3 4 5 6    7 8
>> > >          x x x          x x
>> > >
>> > > Adding a 6th task yields the following distribution:
>> > >
>> > > Socket   -----1-----    -2-
>> > > CPU      1 2 3 4 5 6    7 8
>> > > SMT1     x x x          x x
>> > > SMT2                    x
>> >
>> > Your description is a bit confusing for me. What you name CPU above
>> > should be named Core, doesn' it ?
>> >
>> > Could you share with us your scheduler topology ?
>> >
>> 
>> You are correct, it should say core instead of CPU.
>> 
>> One actual configuration from one of my test machines (lscpu -e):
>> 
> 
> [...]
> 
>> 
>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
> 
> Thaks for the details
> 
>> 
>> If I run stress-ng with 8 cpu stressors on the original code I get a
>> distribution
>> like this:
>> 
>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>                 x     x     x     x      x  x  x  x
>> 
>> Which means that the two cores in the smaller group are running into 
>> SMT
>> while two
>> cores in the larger group are still idle. This is caused by the
>> prefer_sibling path
>> which really wants to see both groups run the same number of tasks.
> 
> yes and it considers that there are the same number of CPUs per group
> 
>> 
>> > >
>> > > The task is added to the 2nd scheduler group, as the scheduler has the
>> > > assumption that scheduler groups are of the same size, so they should
>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>> > > thread, which comes with a performance penalty. This means, that in
>> > > the window of 6-8 tasks, load balancing is done suboptimally, because
>> > > SMT is used although there is no reason to do so as fully idle CPUs
>> > > are still available.
>> > >
>> > > Taking the weight of the scheduler groups into account, ensures that
>> > > a load balancing CPU within a smaller group will not try to pull tasks
>> > > from a bigger group while the bigger group still has idle CPUs
>> > > available.
>> > >
>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> > > ---
>> > >  kernel/sched/fair.c | 3 ++-
>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>> > > --- a/kernel/sched/fair.c
>> > > +++ b/kernel/sched/fair.c
>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>> > > *find_busiest_group(struct lb_env *env)
>> > >          * group's child domain.
>> > >          */
>> > >         if (sds.prefer_sibling && local->group_type ==
>> > > group_has_spare &&
>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>> > > +           busiest->sum_nr_running * local->group_weight >
>> > > +                       local->sum_nr_running *
>> > > busiest->group_weight + 1)
> 
> 
> what you want to test here is that moving 1 task from busiest to local
> group would help and balance the ratio of tasks per CPU
> 
> (busiest->sum_nr_running - 1) / busiest->group_weight >
> (local->sum_nr_running + 1) / local->group_weight
> 
> which can be developed into
> 
> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
> * busiest->group_weight + busiest->group_weight + local->group_weight
> 
> and you also have to change how we calculate the imbalance, which just
> provides half of the diff of nr_running,
> 
> by something like
> 
> ((busiest->sum_nr_running * local->group_weight) -
>  (local->sum_nr_running * busiest->group_weight)) /
> (busiest->group_weight + local->group_weight)
> 
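The cross-multiplied condition above can be sketched as a small standalone model (Python purely for illustration; the real code is kernel C, and the function name here is made up):

```python
def should_force_balance(busiest_nr, busiest_weight, local_nr, local_weight):
    """Model of the proposed prefer_sibling check.

    Integer form of (busiest_nr - 1)/busiest_weight >
    (local_nr + 1)/local_weight, cross-multiplied so no division is
    needed: moving one task from busiest to local must still leave
    busiest with the higher tasks-per-CPU ratio.
    """
    return (busiest_nr * local_weight >=
            local_nr * busiest_weight + busiest_weight + local_weight)
```

For equal weights w, the condition reduces to busiest_nr >= local_nr + 2, i.e. busiest_nr > local_nr + 1, which matches the pre-patch check.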

Ahh right, I had a look at the imbalance part now and your suggestion
works pretty well; I just had to make some minor adjustments so far.
A nice side effect is that this allows the load balancer to behave
exactly the same as before in the cases where local->group_weight ==
busiest->group_weight. The behavior only changes for the case where the
groups are not of equal size.

I will figure out a solution and send a patch soon which incorporates
these adjustments plus a more detailed description.

Thanks for the feedback.

>> >
>> > This is the prefer_sibling path. Could it be that you should disable
>> > prefer_sibling between your sockets for such a topology? The default
>> > path compares the number of idle CPUs when groups have spare capacity.
>> >
>> >
>> 
>> If I disable prefer_sibling (I played around with it a bit), I run
>> into the problem that the tasks are distributed s.t. each group has
>> the same amount of idle CPUs, which yields distributions similar to
>> this:
>> 
>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>     x  x  x     x  x     x     x  x
>> 
>> Now both groups have 4 idle CPUs, which fulfills the criteria imposed
>> by the load balancer, but the larger group is now running SMT while
>> the smaller one is just idle.
>> 
>> So, in this asymmetric setup, both criteria seem to not work in an
>> optimal way. Going for the same number of idle CPUs or alternatively
>> for the same number of running processes both cause a sub-optimal
>> distribution of tasks, leading to unnecessary SMT.
> 
> there is the same behavior and assumption here too
> 
> 
>> 
>> It seems also to be possible to address the regular load balancing
>> path by aiming to have the same unused capacity between groups instead
>> of the same number of idle CPUs. This seems to have been considered in
>> the past, but the choice went in favor of the number of idle CPUs.
> 
> unused capacity doesn't give the instantaneous state, so a group can
> be idle but without unused capacity
> 
>> Since this decision was actively taken already, I focused on the
>> prefer_sibling path.
>> 
>> The question now would be how to address this properly (or if I'm
>> missing something here). As mentioned in the cover letter, this was
>> the most simplistic and least invasive approach I could find; others
>> might be more sophisticated but also have some side-effects.
>> 
>> I have a bit of a hard time leaving this one as-is, as it just
>> introduces two additional multiplications with no effect for most
>> architectures. Maybe an architecture-specific inline function that the
>> compiler can optimize away if not needed?
>> 
>> > >                 goto force_balance;
>> > >
>> > >         if (busiest->group_type != group_overloaded) {
>> > > --
>> > > 2.34.1
>> > >
>>
  
Tobias Huschle July 7, 2023, 7:44 a.m. UTC | #7
On 2023-07-04 15:40, Peter Zijlstra wrote:
> On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler 
>> groups,
>> within the same domain, all host the same number of CPUs. This is
>> reflected in the condition, that a scheduler group, which is load
>> balancing and classified as having spare capacity, should pull work
>> from the busiest group, if the local group runs less processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>> 
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures.
> 
> Mostly; there's historically the cpuset case where we can create
> lopsided groups like that. And today we're growing all these hybrid
> things that will also tickle this, except they're looking towards
> different balancer extensions to deal with the IPC difference so might
> not be immediately caring about this here issue.
> 
> 
>> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> ---
>>  kernel/sched/fair.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group 
>> *find_busiest_group(struct lb_env *env)
>>  	 * group's child domain.
>>  	 */
>>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
>> +	    busiest->sum_nr_running * local->group_weight >
>> +			local->sum_nr_running * busiest->group_weight + 1)
> 
> Should that not be: busiest->group_weight * (local->sum_nr_running + 1)?

I agree, adding the brackets makes more sense and is clearer about what
is intended by this check, while also preserving the original behavior
for local->group_weight == busiest->group_weight.
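As a sanity check, for equal group weights the bracketed form is equivalent to the original check; a quick model (Python, illustrative only; the function names are made up):

```python
def original_check(busiest_nr, local_nr):
    # pre-patch: busiest->sum_nr_running > local->sum_nr_running + 1
    return busiest_nr > local_nr + 1

def bracketed_check(busiest_nr, busiest_weight, local_nr, local_weight):
    # Peter's suggested form of the patched condition
    return busiest_nr * local_weight > busiest_weight * (local_nr + 1)
```

For any positive weight w, busiest_nr * w > w * (local_nr + 1) divides through to busiest_nr > local_nr + 1, so behavior only changes when the two weights differ.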

> 
> I'm not opposed to this -- it seems fairly straight forward.

Appreciated, I will go ahead and send a patch once I have incorporated
the other feedback I got.
Thanks.

> 
>>  		goto force_balance;
>> 
>>  	if (busiest->group_type != group_overloaded) {
>> --
>> 2.34.1
>>
  
Tobias Huschle July 7, 2023, 7:45 a.m. UTC | #8
On 2023-07-06 19:19, Shrikanth Hegde wrote:
> On 5/15/23 5:16 PM, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler 
>> groups,
>> within the same domain, all host the same number of CPUs. This is
>> reflected in the condition, that a scheduler group, which is load
>> balancing and classified as having spare capacity, should pull work
>> from the busiest group, if the local group runs less processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>> 
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures. Nevertheless, s390 can have scheduler groups of unequal
>> size.
>> 
>> This introduces a performance degradation in the following scenario:
>> 
>> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> the remaining 2 are located on another socket:
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> 
>> Placing some workload ( x = one task ) yields the following
>> scenarios:
>> 
>> The first 5 tasks are distributed evenly across the two groups.
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>>          x x x          x x
>> 
>> Adding a 6th task yields the following distribution:
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> SMT1     x x x          x x
>> SMT2                    x
>> 
>> The task is added to the 2nd scheduler group, as the scheduler has the
>> assumption that scheduler groups are of the same size, so they should
>> also host the same number of tasks. This makes CPU 7 run into SMT
>> thread, which comes with a performance penalty. This means, that in
>> the window of 6-8 tasks, load balancing is done suboptimally, because
>> SMT is used although there is no reason to do so as fully idle CPUs
>> are still available.
>> 
>> Taking the weight of the scheduler groups into account, ensures that
>> a load balancing CPU within a smaller group will not try to pull tasks
>> from a bigger group while the bigger group still has idle CPUs
>> available.
>> 
>> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> ---
>>  kernel/sched/fair.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group 
>> *find_busiest_group(struct lb_env *env)
>>  	 * group's child domain.
>>  	 */
>>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
>> +	    busiest->sum_nr_running * local->group_weight >
>> +			local->sum_nr_running * busiest->group_weight + 1)
>>  		goto force_balance;
>> 
> 
> 
> I assume it's SMT2 here. Each sched group is denoted as
> [busy_cpus/idle_cpus/group_weight].
> 
> Let's take an example: we will begin with the said imbalance.
> [3/9/12] -- local group
> [3/1/4] -- busy group.
> We will evaluate 3*12 > (4*(3+1)), which is true, and proceed further
> to balance. But calculate_imbalance will lead to zero as the imbalance,
> no? In the prefer_sibling case it finds the difference of the
> sum_nr_running of the two groups. It will be 3-3 = 0.
> 
> this would need modifications to calculate_imbalance.
> Maybe something like this will help? NOT TESTED AT ALL.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e027d4086edc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
>                         return;
>                 }
> 
> -               if (busiest->group_weight == 1 || sds->prefer_sibling) 
> {
> +               if (busiest->group_weight == 1 ||
> +                       (sds->prefer_sibling &&
> +                        busiest->group_weight == local->group_weight)) 
> {
>                         unsigned int nr_diff = busiest->sum_nr_running;
>                         /*
>                          * When prefer sibling, evenly spread running 
> tasks on
> @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
>                         env->migration_type = migrate_task;
>                         lsub_positive(&nr_diff, local->sum_nr_running);
>                         env->imbalance = nr_diff;
> +               }
> +               if (sds->prefer_sibling &&
> +                   busiest->group_weight != local->group_weight) {
> +                       env->migration_type = migrate_task;
> +                       env->imbalance = 1;
>                 } else {
> 

I also had a look at this when Vincent pointed out that this part is
missing. The formula proposed by Vincent works pretty well; it is not
even necessary to add additional if-statements here.
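A side-by-side model of the two imbalance calculations, using Shrikanth's [3/9/12] vs [3/1/4] example (a tiny Python sketch, purely illustrative; the function names are made up):

```python
def nr_diff_imbalance(busiest_nr, local_nr):
    # current prefer_sibling path: plain difference in task counts
    return max(busiest_nr - local_nr, 0)

def weighted_imbalance(busiest_nr, busiest_weight, local_nr, local_weight):
    # Vincent's weight-aware variant: tasks to move so that both
    # groups end up with a similar tasks-per-CPU ratio
    return max((busiest_nr * local_weight - local_nr * busiest_weight)
               // (busiest_weight + local_weight), 0)
```

With equal group weights w, the weighted form becomes w*(B - L) / (2w) = (B - L)/2, i.e. the "half of the diff" behavior Vincent mentions.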
> 
> ---------------------------------------------------------------------------------------------------
> On a tangential dimension.
> 
> 
> Since prefer_sibling makes the inherent assumption that all groups
> have equal weight, it will cause complications when group weights are
> different. I think that becomes very tricky when there are more than
> two groups. Depending on which CPU is doing load_balance there can be
> unnecessary migrations.
> 
> Currently, even in NUMA, this can be a similar case, right? There will
> be an unequal number of CPUs, right? If we fix that case and remove
> prefer_sibling in your arch, will that work?
> 
> Maybe something like this? NOT TESTED AT ALL.
> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..fc6377f48ced 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10514,6 +10514,7 @@ static struct sched_group
> *find_busiest_group(struct lb_env *env)
>                         goto out_balanced;
> 
>                 if (busiest->group_weight > 1 &&
> +                   busiest->group_weight == local->group_weight &&
>                     local->idle_cpus <= (busiest->idle_cpus + 1))
>                         /*
>                          * If the busiest group is not overloaded
> @@ -10526,6 +10527,11 @@ static struct sched_group
> *find_busiest_group(struct lb_env *env)
>                          */
>                         goto out_balanced;
> 
> +               if ((busiest->group_weight != local->group_weight) &&
> +                     (local->idle_cpus * busiest->group_weight <=
> +                              local->group_weight * 
> (busiest->idle_cpus + 1)))
> +                       goto out_balanced;
> +
>                 if (busiest->sum_h_nr_running == 1)
>                         /*
>                          * busiest doesn't have any tasks waiting to 
> run
> 
> 
> 
> 
> 
>>  	if (busiest->group_type != group_overloaded) {

I played around with alternate solutions as well; yours could be
interesting in order to essentially declare the problematic state as
balanced. I wouldn't be opposed to removing prefer_sibling. The question
would be if it is necessary to address both scenarios anyway.
  
Shrikanth Hegde July 7, 2023, 2:33 p.m. UTC | #9
On 7/7/23 1:14 PM, Tobias Huschle wrote:
> On 2023-07-05 09:52, Vincent Guittot wrote:
>> On Mon, Jun 5, 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
>>> > wrote:
>>> > >
>>> > > The current load balancer implementation implies that scheduler
>>> > > groups,
>>> > > within the same domain, all host the same number of CPUs. This is
>>> > > reflected in the condition, that a scheduler group, which is load
>>> > > balancing and classified as having spare capacity, should pull work
>>> > > from the busiest group, if the local group runs less processes than
>>> > > the busiest one. This implies that these two groups should run the
>>> > > same number of processes, which is problematic if the groups are not
>>> > > of the same size.
>>> > >
>>> > > The assumption that scheduler groups within the same scheduler
>>> domain
>>> > > host the same number of CPUs appears to be true for non-s390
>>> > > architectures. Nevertheless, s390 can have scheduler groups of
>>> unequal
>>> > > size.
>>> > >
>>> > > This introduces a performance degradation in the following scenario:
>>> > >
>>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>>> > > the remaining 2 are located on another socket:
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > >
>>> > > Placing some workload ( x = one task ) yields the following
>>> > > scenarios:
>>> > >
>>> > > The first 5 tasks are distributed evenly across the two groups.
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > >          x x x          x x
>>> > >
>>> > > Adding a 6th task yields the following distribution:
>>> > >
>>> > > Socket   -----1-----    -2-
>>> > > CPU      1 2 3 4 5 6    7 8
>>> > > SMT1     x x x          x x
>>> > > SMT2                    x
>>> >
>>> > Your description is a bit confusing for me. What you name CPU above
>>> > should be named Core, doesn't it?
>>> >
>>> > Could you share with us your scheduler topology ?
>>> >
>>>
>>> You are correct, it should say core instead of CPU.
>>>
>>> One actual configuration from one of my test machines (lscpu -e):
>>>
>>
>> [...]
>>
>>>
>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>>
>> Thanks for the details
>>
>>>
>>> If I run stress-ng with 8 cpu stressors on the original code I get a
>>> distribution
>>> like this:
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>                 x     x     x     x      x  x  x  x
>>>
>>> Which means that the two cores in the smaller group are running into SMT
>>> while two
>>> cores in the larger group are still idle. This is caused by the
>>> prefer_sibling path
>>> which really wants to see both groups run the same number of tasks.
>>
>> yes and it considers that there are the same number of CPUs per group
>>
>>>
>>> > >
>>> > > The task is added to the 2nd scheduler group, as the scheduler
>>> has the
>>> > > assumption that scheduler groups are of the same size, so they
>>> should
>>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>>> > > thread, which comes with a performance penalty. This means, that in
>>> > > the window of 6-8 tasks, load balancing is done suboptimally,
>>> because
>>> > > SMT is used although there is no reason to do so as fully idle CPUs
>>> > > are still available.
>>> > >
>>> > > Taking the weight of the scheduler groups into account, ensures that
>>> > > a load balancing CPU within a smaller group will not try to pull
>>> tasks
>>> > > from a bigger group while the bigger group still has idle CPUs
>>> > > available.
>>> > >
>>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>>> > > ---
>>> > >  kernel/sched/fair.c | 3 ++-
>>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>>> > >
>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>>> > > --- a/kernel/sched/fair.c
>>> > > +++ b/kernel/sched/fair.c
>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>>> > > *find_busiest_group(struct lb_env *env)
>>> > >          * group's child domain.
>>> > >          */
>>> > >         if (sds.prefer_sibling && local->group_type ==
>>> > > group_has_spare &&
>>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>>> > > +           busiest->sum_nr_running * local->group_weight >
>>> > > +                       local->sum_nr_running *
>>> > > busiest->group_weight + 1)
>>
>>
>> what you want to test here is that moving 1 task from busiest to local
>> group
>> would help and balance the ratio of tasks per cpu
>>
>> (busiest->sum_nr_running - 1) / busiest->group_weight >
>> (local->sum_nr_running + 1) / local->group_weight
>>
>> which can be developed into
>>
>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
>> * busiest->group_weight + busiest->group_weight + local->group_weight
>>
>> and you also have to change how we calculate the imbalance, which just
>> provides half of the diff of nr_running,
>>
>> by something like
>>
>> ((busiest->sum_nr_running * local->group_weight) -
>>  (local->sum_nr_running * busiest->group_weight)) /
>> (busiest->group_weight + local->group_weight)
>>
> 
> Ahh right, I had a look at the imbalance part now and your suggestion works
> pretty well. Just had to make some minor adjustments so far.
> Nice side effect is, that this allows the load balancer to behave
> exactly the
> same as before in the cases where local->group_weight ==
> busiest->group_weight.
> The behavior only changes for the case where the groups are not of equal
> size.


Not sure if this has been figured out/discussed already; pointing out one possible scenario.

Taking the formulas:
busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight
and calculate_imbalance:
((busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight)) / (busiest->group_weight + local->group_weight)


First, let's say the imbalance was like this, same example as before. A sched group is denoted as [busy_cpus/idle_cpus/group_weight]:
[3/9/12] - local group.
[3/1/4] - busy group.

3*12 >= 3*4+12+4 --> true, and the imbalance would be (3*12-3*4)/(12+4) = 24/16, i.e. 1 after rounding down.
we will balance, good.

[4/8/12] 
[2/2/4] 
There will not be further load balances. good.

a new process comes; it would be scheduled on the [4/8/12] sched group as that would be the idlest.
[5/7/12] 
[2/2/4] 

A process running on [2/2/4] exits. In this scenario, do you expect the balance to happen again, since balancing would result in optimal performance?
[5/7/12] - busy_group
[1/3/4] - local group

5*4 >= 1*12+12+4 --> will not balance. 

[5/7/12] - local group
[1/3/4] - busy group 
1*12 >= 5*4 + 12 + 4 --> will not balance. 

Does this scenario need to be handled as well?

> 
> I will figure out a solution and send a patch soon which incorporates these
> adjustments plus a more detailed description.
> 
> Thanks for the feedback.
> 
>>> >
>>> > This is the prefer_sibling path. Could it be that you should disable
>>> > prefer_sibling between your sockets for such a topology ? the default
>>> > path compares the number of idle CPUs when groups has spare capacity
>>> >
>>> >
>>>
>>> If I disable prefer_sibling (I played around with it a bit), I run
>>> into the
>>> problem,
>>> that the tasks are distributed s.t. each group has the same amount of
>>> idle
>>> CPUs, which
>>> yields distributions similar to this:
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>     x  x  x     x  x     x     x  x
>>>
>>> Now both groups have 4 idle CPUs which fulfills the criteria imposed
>>> by the
>>> load balancer,
>>> but the larger group is now running SMT while the smaller one is just
>>> idle.
>>>
>>> So, in this asymmetric setup, both criteria seem to not work in an
>>> optimal
>>> way. Going for
>>> the same number of idle CPUs or alternatively for the same number of
>>> running
>>> processes
>>> both cause a sub-optimal distribution of tasks, leading to
>>> unnecessary SMT.
>>
>> there is the same behavior and assumption here too
>>
>>
>>>
>>> It seems also to be possible to address the regular load balancing
>>> path by
>>> aiming to have the
>>> same unused capacity between groups instead of the same number of
>>> idle CPUs.
>>> This seems to
>>> have been considered in the past, but the choice went in favor of the
>>> number
>>> of idle CPUs.
>>
>> unused capacity doesn't give the instantaneous state so a group can be
>> idle but without
>> unused capacity
>>
>>> Since this decision was actively taken already, I focused on the
>>> prefer_sibling path.
>>>
>>> The question now would be how to address this properly (or if I'm
>>> missing
>>> something here).
>>> As mentioned in the cover letter, this was the most simplistic and least
>>> invasive approach
>>> I could find, others might be more sophisticated but also have some
>>> side-effects.
>>>
>>> I have a bit of a hard time leaving this one as-is, as it just
>>> introduces
>>> two additional
>>> multiplications with no effect for most architectures. Maybe an
>>> architectures specific
>>> inline function that the compiler can optimize away if not needed?
>>>
>>> > >                 goto force_balance;
>>> > >
>>> > >         if (busiest->group_type != group_overloaded) {
>>> > > --
>>> > > 2.34.1
>>> > >
>>>
  
Tobias Huschle July 7, 2023, 3:59 p.m. UTC | #10
On 2023-07-07 16:33, Shrikanth Hegde wrote:
> On 7/7/23 1:14 PM, Tobias Huschle wrote:
>> On 2023-07-05 09:52, Vincent Guittot wrote:
>>> On Mon, Jun 5, 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
>>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
>>>> > wrote:
>>>> > >
>>>> > > The current load balancer implementation implies that scheduler
>>>> > > groups,
>>>> > > within the same domain, all host the same number of CPUs. This is
>>>> > > reflected in the condition, that a scheduler group, which is load
>>>> > > balancing and classified as having spare capacity, should pull work
>>>> > > from the busiest group, if the local group runs less processes than
>>>> > > the busiest one. This implies that these two groups should run the
>>>> > > same number of processes, which is problematic if the groups are not
>>>> > > of the same size.
>>>> > >
>>>> > > The assumption that scheduler groups within the same scheduler
>>>> domain
>>>> > > host the same number of CPUs appears to be true for non-s390
>>>> > > architectures. Nevertheless, s390 can have scheduler groups of
>>>> unequal
>>>> > > size.
>>>> > >
>>>> > > This introduces a performance degradation in the following scenario:
>>>> > >
>>>> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>>>> > > the remaining 2 are located on another socket:
>>>> > >
>>>> > > Socket   -----1-----    -2-
>>>> > > CPU      1 2 3 4 5 6    7 8
>>>> > >
>>>> > > Placing some workload ( x = one task ) yields the following
>>>> > > scenarios:
>>>> > >
>>>> > > The first 5 tasks are distributed evenly across the two groups.
>>>> > >
>>>> > > Socket   -----1-----    -2-
>>>> > > CPU      1 2 3 4 5 6    7 8
>>>> > >          x x x          x x
>>>> > >
>>>> > > Adding a 6th task yields the following distribution:
>>>> > >
>>>> > > Socket   -----1-----    -2-
>>>> > > CPU      1 2 3 4 5 6    7 8
>>>> > > SMT1     x x x          x x
>>>> > > SMT2                    x
>>>> >
>>>> > Your description is a bit confusing for me. What you name CPU above
>>>> > should be named Core, doesn't it?
>>>> >
>>>> > Could you share with us your scheduler topology ?
>>>> >
>>>> 
>>>> You are correct, it should say core instead of CPU.
>>>> 
>>>> One actual configuration from one of my test machines (lscpu -e):
>>>> 
>>> 
>>> [...]
>>> 
>>>> 
>>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>>> 
>>> Thanks for the details
>>> 
>>>> 
>>>> If I run stress-ng with 8 cpu stressors on the original code I get a
>>>> distribution
>>>> like this:
>>>> 
>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>                 x     x     x     x      x  x  x  x
>>>> 
>>>> Which means that the two cores in the smaller group are running into 
>>>> SMT
>>>> while two
>>>> cores in the larger group are still idle. This is caused by the
>>>> prefer_sibling path
>>>> which really wants to see both groups run the same number of tasks.
>>> 
>>> yes and it considers that there are the same number of CPUs per group
>>> 
>>>> 
>>>> > >
>>>> > > The task is added to the 2nd scheduler group, as the scheduler
>>>> has the
>>>> > > assumption that scheduler groups are of the same size, so they
>>>> should
>>>> > > also host the same number of tasks. This makes CPU 7 run into SMT
>>>> > > thread, which comes with a performance penalty. This means, that in
>>>> > > the window of 6-8 tasks, load balancing is done suboptimally,
>>>> because
>>>> > > SMT is used although there is no reason to do so as fully idle CPUs
>>>> > > are still available.
>>>> > >
>>>> > > Taking the weight of the scheduler groups into account, ensures that
>>>> > > a load balancing CPU within a smaller group will not try to pull
>>>> tasks
>>>> > > from a bigger group while the bigger group still has idle CPUs
>>>> > > available.
>>>> > >
>>>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>>>> > > ---
>>>> > >  kernel/sched/fair.c | 3 ++-
>>>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>>>> > >
>>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>>>> > > --- a/kernel/sched/fair.c
>>>> > > +++ b/kernel/sched/fair.c
>>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>>>> > > *find_busiest_group(struct lb_env *env)
>>>> > >          * group's child domain.
>>>> > >          */
>>>> > >         if (sds.prefer_sibling && local->group_type ==
>>>> > > group_has_spare &&
>>>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>>>> > > +           busiest->sum_nr_running * local->group_weight >
>>>> > > +                       local->sum_nr_running *
>>>> > > busiest->group_weight + 1)
>>> 
>>> 
>>> what you want to test here is that moving 1 task from busiest to 
>>> local
>>> group
>>> would help and balance the ratio of tasks per cpu
>>> 
>>> (busiest->sum_nr_running - 1) / busiest->group_weight >
>>> (local->sum_nr_running + 1) / local->group_weight
>>> 
>>> which can be developed into
>>> 
>>> busiest->sum_nr_running * local->group_weight >= 
>>> local->sum_nr_running
>>> * busiest->group_weight + busiest->group_weight + local->group_weight
>>> 
>>> and you also have to change how we calculate the imbalance, which
>>> just provides half of the diff of nr_running,
>>> 
>>> by something like
>>> 
>>> ((busiest->sum_nr_running * local->group_weight) -
>>>  (local->sum_nr_running * busiest->group_weight)) /
>>> (busiest->group_weight + local->group_weight)
>>> 
>> 
>> Ahh right, I had a look at the imbalance part now and your suggestion 
>> works
>> pretty well. Just had to make some minor adjustments so far.
>> Nice side effect is, that this allows the load balancer to behave
>> exactly the
>> same as before in the cases where local->group_weight ==
>> busiest->group_weight.
>> The behavior only changes for the case where the groups are not of 
>> equal
>> size.
> 
> 
> Not sure if this has been figured out/discussed already; pointing out
> one possible scenario.
> 
> Taking the formulas:
> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
> * busiest->group_weight + busiest->group_weight + local->group_weight
> and calculate_imbalance:
> ((busiest->sum_nr_running * local->group_weight) -
>  (local->sum_nr_running * busiest->group_weight)) /
> (busiest->group_weight + local->group_weight)
> 

I was considering just using the imbalance as an indicator of whether we
should balance or not, i.e. checking if the second formula yields a
value greater than 0. I will have to play around with that a bit though.

> 
> First, let's say the imbalance was like this, same example as before.
> sched_group in [busy_cpus/idle_cpus/group_weight]
> [3/9/12] - local group.
> [3/1/4] - busy group.
> 
> 3*12 >= 3*4+12+4 --> true, and the imbalance would be
> (3*12-3*4)/(12+4) = 24/16, i.e. 1 after rounding down.
> we will balance, good.
> 
> [4/8/12]
> [2/2/4]
> There will not be further load balances. good.
> 
> a new process comes; it would be scheduled on the [4/8/12] sched group
> as that would be the idlest.
> [5/7/12]
> [2/2/4]
> 
> Process running on [2/2/4] exits. then in this scenario do you expect
> the balance to happen again? Since balancing would result into optimal
> performance.
> [5/7/12] - busy_group
> [1/3/4] - local group
> 
> 5*4 >= 1*12+12+4 --> will not balance.
> 
> [5/7/12] - local group
> [1/3/4] - busy group
> 1*12 >= 5*4 + 12 + 4 --> will not balance.
> 
> Does this scenario need to be handled as well?

So, from an SMT standpoint, we would not need to balance here, both 
groups
should not run into SMT. Now, would it be beneficial to balance anyway?
Now we have:
[5/7/12] -> 42% busy
[1/3/4]  -> 25% busy

If we would now balance and move one task around we would get either
[6/6/12] -> 50% busy
[0/4/4]  ->  0% busy
or
[4/8/12] -> 33% busy
[2/2/4]  -> 50% busy

The first case probably does not make that much sense (unless we have
workload which would benefit from cache locality, maybe) and we want
everything to run in one group.
The second case brings the smaller group right onto the edge of using SMT,
while also creating the possibility (depending on the algorithm we would
use) that the larger group will now attempt to pull work from the smaller
group again, ending up in a back and forth between the two. This is
obviously also true for the first variant.

Could you maybe elaborate on what you meant by optimal performance?

> 
>> 
>> I will figure out a solution and send a patch soon which incorporates
>> these adjustments plus a more detailed description.
>> 
>> Thanks for the feedback.
>> 
>>>> >
>>>> > This is the prefer_sibling path. Could it be that you should disable
>>>> > prefer_sibling between your sockets for such a topology? The default
>>>> > path compares the number of idle CPUs when groups have spare capacity.
>>>> >
>>>> >
>>>> 
>>>> If I disable prefer_sibling (I played around with it a bit), I run
>>>> into the problem that the tasks are distributed s.t. each group has
>>>> the same amount of idle CPUs, which yields distributions similar to
>>>> this:
>>>> 
>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>     x  x  x     x  x     x     x  x
>>>> 
>>>> Now both groups have 4 idle CPUs, which fulfills the criteria imposed
>>>> by the load balancer, but the larger group is now running SMT while
>>>> the smaller one is just idle.
>>>> 
>>>> So, in this asymmetric setup, both criteria seem to not work in an
>>>> optimal way. Going for the same number of idle CPUs or alternatively
>>>> for the same number of running processes both cause a sub-optimal
>>>> distribution of tasks, leading to unnecessary SMT.
>>> 
>>> there is the same behavior and assumption here too
>>> 
>>> 
>>>> 
>>>> It also seems to be possible to address the regular load balancing
>>>> path by aiming to have the same unused capacity between groups
>>>> instead of the same number of idle CPUs. This seems to have been
>>>> considered in the past, but the choice went in favor of the number
>>>> of idle CPUs.
>>> 
>>> unused capacity doesn't give the instantaneous state, so a group can
>>> be idle but without unused capacity
>>> 
>>>> Since this decision was actively taken already, I focused on the
>>>> prefer_sibling path.
>>>> 
>>>> The question now would be how to address this properly (or if I'm
>>>> missing something here). As mentioned in the cover letter, this was
>>>> the most simplistic and least invasive approach I could find; others
>>>> might be more sophisticated but also have some side-effects.
>>>> 
>>>> I have a bit of a hard time leaving this one as-is, as it just
>>>> introduces two additional multiplications with no effect for most
>>>> architectures. Maybe an architecture-specific inline function that
>>>> the compiler can optimize away if not needed?
>>>> 
>>>> > >                 goto force_balance;
>>>> > >
>>>> > >         if (busiest->group_type != group_overloaded) {
>>>> > > --
>>>> > > 2.34.1
>>>> > >
>>>>
  
Shrikanth Hegde July 7, 2023, 4:26 p.m. UTC | #11
On 7/7/23 9:29 PM, Tobias Huschle wrote:
> On 2023-07-07 16:33, Shrikanth Hegde wrote:
>> On 7/7/23 1:14 PM, Tobias Huschle wrote:
>>> On 2023-07-05 09:52, Vincent Guittot wrote:
>>>> On Monday, June 5, 2023 at 10:07:16 (+0200), Tobias Huschle wrote:
>>>>> On 2023-05-16 15:36, Vincent Guittot wrote:
>>>>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
>>>>> > wrote:
>>>>> > >
>>>>> > > The current load balancer implementation implies that scheduler
>>>>> > > groups,
>>>>> > > within the same domain, all host the same number of CPUs. This is
>>>>> > > reflected in the condition, that a scheduler group, which is load
>>>>> > > balancing and classified as having spare capacity, should pull
>>>>> > > work from the busiest group if the local group runs fewer
>>>>> > > processes than the busiest one. This implies that these two
>>>>> > > groups should run the same number of processes, which is
>>>>> > > problematic if the groups are not of the same size.
>>>>> > >
>>>>> > > The assumption that scheduler groups within the same scheduler
>>>>> > > domain host the same number of CPUs appears to be true for
>>>>> > > non-s390 architectures. Nevertheless, s390 can have scheduler
>>>>> > > groups of unequal size.
>>>>> > >
>>>>> > > This introduces a performance degradation in the following
>>>>> > > scenario:
>>>>> > >
>>>>> > > Consider a system with 8 CPUs: 6 CPUs are located on one CPU
>>>>> > > socket, the remaining 2 are located on another socket:
>>>>> > >
>>>>> > > Socket   -----1-----    -2-
>>>>> > > CPU      1 2 3 4 5 6    7 8
>>>>> > >
>>>>> > > Placing some workload ( x = one task ) yields the following
>>>>> > > scenarios:
>>>>> > >
>>>>> > > The first 5 tasks are distributed evenly across the two groups.
>>>>> > >
>>>>> > > Socket   -----1-----    -2-
>>>>> > > CPU      1 2 3 4 5 6    7 8
>>>>> > >          x x x          x x
>>>>> > >
>>>>> > > Adding a 6th task yields the following distribution:
>>>>> > >
>>>>> > > Socket   -----1-----    -2-
>>>>> > > CPU      1 2 3 4 5 6    7 8
>>>>> > > SMT1     x x x          x x
>>>>> > > SMT2                    x
>>>>> >
>>>>> > Your description is a bit confusing for me. What you name CPU above
>>>>> > should be named Core, doesn't it?
>>>>> >
>>>>> > Could you share with us your scheduler topology ?
>>>>> >
>>>>>
>>>>> You are correct, it should say core instead of CPU.
>>>>>
>>>>> One actual configuration from one of my test machines (lscpu -e):
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.
>>>>
>>>> Thanks for the details.
>>>>
>>>>>
>>>>> If I run stress-ng with 8 cpu stressors on the original code I get a
>>>>> distribution
>>>>> like this:
>>>>>
>>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>>                 x     x     x     x      x  x  x  x
>>>>>
>>>>> Which means that the two cores in the smaller group are running into
>>>>> SMT while two cores in the larger group are still idle. This is
>>>>> caused by the prefer_sibling path, which really wants to see both
>>>>> groups run the same number of tasks.
>>>>
>>>> yes and it considers that there are the same number of CPUs per group
>>>>
>>>>>
>>>>> > >
>>>>> > > The task is added to the 2nd scheduler group, as the scheduler
>>>>> > > has the assumption that scheduler groups are of the same size,
>>>>> > > so they should also host the same number of tasks. This makes
>>>>> > > CPU 7 run into its SMT thread, which comes with a performance
>>>>> > > penalty. This means that in the window of 6-8 tasks, load
>>>>> > > balancing is done suboptimally, because SMT is used although
>>>>> > > there is no reason to do so as fully idle CPUs are still
>>>>> > > available.
>>>>> > >
>>>>> > > Taking the weight of the scheduler groups into account ensures
>>>>> > > that a load balancing CPU within a smaller group will not try
>>>>> > > to pull tasks from a bigger group while the bigger group still
>>>>> > > has idle CPUs available.
>>>>> > >
>>>>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>>>>> > > ---
>>>>> > >  kernel/sched/fair.c | 3 ++-
>>>>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>> > >
>>>>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>>>>> > > --- a/kernel/sched/fair.c
>>>>> > > +++ b/kernel/sched/fair.c
>>>>> > > @@ -10426,7 +10426,8 @@ static struct sched_group
>>>>> > > *find_busiest_group(struct lb_env *env)
>>>>> > >          * group's child domain.
>>>>> > >          */
>>>>> > >         if (sds.prefer_sibling && local->group_type ==
>>>>> > > group_has_spare &&
>>>>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>>>>> > > +           busiest->sum_nr_running * local->group_weight >
>>>>> > > +                       local->sum_nr_running *
>>>>> > > busiest->group_weight + 1)
>>>>
>>>>
>>>> what you want to test here is that moving 1 task from busiest to the
>>>> local group would help and balance the ratio of tasks per cpu
>>>>
>>>> (busiest->sum_nr_running - 1) / busiest->group_weight >
>>>> (local->sum_nr_running + 1) / local->group_weight
>>>>
>>>> which can be develop into
>>>>
>>>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
>>>> * busiest->group_weight + busiest->group_weight + local->group_weight
>>>>
>>>> and you also have to change how we calculate the imbalance, which just
>>>> provides half of the diff of nr_running,
>>>> 
>>>> by something like
>>>> 
>>>> ((busiest->sum_nr_running * local->group_weight) -
>>>>  (local->sum_nr_running * busiest->group_weight)) /
>>>> (busiest->group_weight + local->group_weight)
>>>>
>>>
>>> Ahh right, I had a look at the imbalance part now and your suggestion
>>> works pretty well. Just had to make some minor adjustments so far.
>>> A nice side effect is that this allows the load balancer to behave
>>> exactly the same as before in the cases where local->group_weight ==
>>> busiest->group_weight. The behavior only changes for the case where
>>> the groups are not of equal size.
>>
>>
>> Not sure if this has been figured out/discussed already; pointing out one
>> possible scenario.
>>
>> Taking the formulas:
>> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
>> * busiest->group_weight + busiest->group_weight + local->group_weight
>> and calculate_imbalance:
>> ((busiest->sum_nr_running * local->group_weight) -
>>  (local->sum_nr_running * busiest->group_weight)) /
>> (busiest->group_weight + local->group_weight)
>>
> 
> I was considering just using the imbalance as an indicator of whether we
> should balance or not, i.e. checking if the second formula yields a value
> greater than 0. Will have to play around with that a bit though.
> 
>>
>> First lets say imbalance was like this. same example as before.
>> sched_group in [busy_cpus/idle_cpus/group_weight]
>> [3/9/12] - local group.
>> [3/1/4] - busy group.
>>
>> 3*12 >= 3*4+12+4 --> true, and the imbalance would be (3*12-3*4)/(12+4) =
>> 24/16 = 1.5 -- let's say 1.
>> we will balance, good.
>>
>> [4/8/12]
>> [2/2/4]
>> There will not be further load balances. good.
>>
>> A new process comes; it would be scheduled on the [4/8/12] sched group as
>> that would be the idlest.
>> [5/7/12]
>> [2/2/4]
>>
>> A process running on [2/2/4] exits. In this scenario, do you expect the
>> balance to happen again, since balancing would result in optimal
>> performance?
>> [5/7/12] - busy_group
>> [1/3/4] - local group
>>
>> 5*4 >= 1*12+12+4 --> will not balance.
>>
>> [5/7/12] - local group
>> [1/3/4] - busy group
>> 1*12 >= 5*4 + 12 + 4 --> will not balance.
>>
>> Does this scenario need to be handled as well?
> 
> So, from an SMT standpoint, we would not need to balance here; both groups
> should not run into SMT. Now, would it be beneficial to balance anyway?
> Currently we have:
> [5/7/12] -> 42% busy
> [1/3/4]  -> 25% busy
> 
> If we would now balance and move one task around we would get either
> [6/6/12] -> 50% busy
> [0/4/4]  ->  0% busy
> or
> [4/8/12] -> 33% busy
> [2/2/4]  -> 50% busy
> 
> The first case probably does not make that much sense (unless we have
> workload which would benefit from cache locality, maybe) and we want
> everything to run in one group.
> The second case brings the smaller group right onto the edge of using SMT,
> while also creating the possibility (depending on the algorithm we would
> use) that the larger group will now attempt to pull work from the smaller
> group again, ending up in a back and forth between the two. This is
> obviously also true for the first variant.
> 
> Could you maybe elaborate on what you meant by optimal performance?
> 

I assumed it might be optimal to have both groups run at more or less
similar utilization points.

Now that I read your description, it makes sense; load balancing may not be
needed in this case. I did a few combinations for the check-to-balance
condition, and I think it holds good (haven't done all the cases). Sorry
for the noise.


>>
>>>
>>> I will figure out a solution and send a patch soon which incorporates
>>> these adjustments plus a more detailed description.
>>>
>>> Thanks for the feedback.
>>>
>>>>> >
>>>>> > This is the prefer_sibling path. Could it be that you should disable
>>>>> > prefer_sibling between your sockets for such a topology? The default
>>>>> > path compares the number of idle CPUs when groups have spare capacity.
>>>>> >
>>>>> >
>>>>>
>>>>> If I disable prefer_sibling (I played around with it a bit), I run
>>>>> into the problem that the tasks are distributed s.t. each group has
>>>>> the same amount of idle CPUs, which yields distributions similar to
>>>>> this:
>>>>>
>>>>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>>>>     x  x  x     x  x     x     x  x
>>>>>
>>>>> Now both groups have 4 idle CPUs, which fulfills the criteria imposed
>>>>> by the load balancer, but the larger group is now running SMT while
>>>>> the smaller one is just idle.
>>>>>
>>>>> So, in this asymmetric setup, both criteria seem to not work in an
>>>>> optimal way. Going for the same number of idle CPUs or alternatively
>>>>> for the same number of running processes both cause a sub-optimal
>>>>> distribution of tasks, leading to unnecessary SMT.
>>>>
>>>> there is the same behavior and assumption here too
>>>>
>>>>
>>>>>
>>>>> It also seems to be possible to address the regular load balancing
>>>>> path by aiming to have the same unused capacity between groups
>>>>> instead of the same number of idle CPUs. This seems to have been
>>>>> considered in the past, but the choice went in favor of the number
>>>>> of idle CPUs.
>>>>
>>>> unused capacity doesn't give the instantaneous state, so a group can
>>>> be idle but without unused capacity
>>>>
>>>>> Since this decision was actively taken already, I focused on the
>>>>> prefer_sibling path.
>>>>>
>>>>> The question now would be how to address this properly (or if I'm
>>>>> missing something here). As mentioned in the cover letter, this was
>>>>> the most simplistic and least invasive approach I could find; others
>>>>> might be more sophisticated but also have some side-effects.
>>>>>
>>>>> I have a bit of a hard time leaving this one as-is, as it just
>>>>> introduces two additional multiplications with no effect for most
>>>>> architectures. Maybe an architecture-specific inline function that
>>>>> the compiler can optimize away if not needed?
>>>>>
>>>>> > >                 goto force_balance;
>>>>> > >
>>>>> > >         if (busiest->group_type != group_overloaded) {
>>>>> > > --
>>>>> > > 2.34.1
>>>>> > >
>>>>>
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    busiest->sum_nr_running * local->group_weight >
+			local->sum_nr_running * busiest->group_weight + 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {