[4/6] sched/fair: Skip prefer sibling move between SMT group and non-SMT group

Message ID b20517e3986bfdde8a605afa19d144ec411c7a42.1683156492.git.tim.c.chen@linux.intel.com
State: New
Series: Enable Cluster Scheduling for x86 Hybrid CPUs

Commit Message

Tim Chen May 4, 2023, 4:09 p.m. UTC
  From: Tim C Chen <tim.c.chen@linux.intel.com>

Do not try to move tasks between a non-SMT sched group and an SMT
sched group for "prefer sibling" load balance.
Let asym_active_balance_busiest() handle that case properly.
Otherwise we could get tasks bouncing back and forth between
the SMT sched group and the non-SMT sched group.

Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)
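
For context, the asymmetric_groups() helper this patch gates on is
introduced earlier in the series; a minimal sketch consistent with how the
thread below describes it (true only when exactly one of the two groups
spans SMT siblings; the actual helper may differ) would be:

static inline bool asymmetric_groups(struct sched_group *sg1,
				     struct sched_group *sg2)
{
	if (!sg1 || !sg2)
		return false;

	/* Exactly one of the two groups covers SMT siblings. */
	return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
	       (sg2->flags & SD_SHARE_CPUCAPACITY);
}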
  

Comments

Peter Zijlstra May 5, 2023, 1:22 p.m. UTC | #1
On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> From: Tim C Chen <tim.c.chen@linux.intel.com>
> 
> Do not try to move tasks between a non-SMT sched group and an SMT
> sched group for "prefer sibling" load balance.
> Let asym_active_balance_busiest() handle that case properly.
> Otherwise we could get tasks bouncing back and forth between
> the SMT sched group and the non-SMT sched group.
> 
> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a325db34b02..58ef7d529731 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	/*
>  	 * Try to move all excess tasks to a sibling domain of the busiest
>  	 * group's child domain.
> +	 *
> +	 * Do not try to move between non smt sched group and smt sched
> +	 * group. Let asym active balance properly handle that case.
>  	 */
>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> +	    !asymmetric_groups(sds.busiest, sds.local) &&
>  	    busiest->sum_nr_running > local->sum_nr_running + 1)
>  		goto force_balance;

This seems to have the hidden assumption that a !SMT core is somehow
'less' than an SMT core. Should this not also look at
sched_asym_prefer() to establish this is so?

I mean, imagine I have a regular system and just offline one smt sibling
for giggles.
  
Tim Chen May 5, 2023, 11:07 p.m. UTC | #2
On Fri, 2023-05-05 at 15:22 +0200, Peter Zijlstra wrote:
> On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > 
> > Do not try to move tasks between a non-SMT sched group and an SMT
> > sched group for "prefer sibling" load balance.
> > Let asym_active_balance_busiest() handle that case properly.
> > Otherwise we could get tasks bouncing back and forth between
> > the SMT sched group and the non-SMT sched group.
> > 
> > Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/fair.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a325db34b02..58ef7d529731 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >  	/*
> >  	 * Try to move all excess tasks to a sibling domain of the busiest
> >  	 * group's child domain.
> > +	 *
> > +	 * Do not try to move between non smt sched group and smt sched
> > +	 * group. Let asym active balance properly handle that case.
> >  	 */
> >  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > +	    !asymmetric_groups(sds.busiest, sds.local) &&
> >  	    busiest->sum_nr_running > local->sum_nr_running + 1)
> >  		goto force_balance;
> 
> This seems to have the hidden assumption that a !SMT core is somehow
> 'less' than an SMT core. Should this not also look at
> sched_asym_prefer() to establish this is so?
> 
> I mean, imagine I have a regular system and just offline one smt sibling
> for giggles.

I don't quite follow your point, as asymmetric_groups() returns false even
when one SMT sibling is offlined.

Even if, say, sds.busiest has 1 SMT sibling and sds.local has 2, both sched
groups still have the SD_SHARE_CPUCAPACITY flag turned on.  So
asymmetric_groups() returns false and the load balancing logic is not
changed for a regular non-hybrid system.

I may be missing something.

Tim
  
Peter Zijlstra May 5, 2023, 11:38 p.m. UTC | #3
On Fri, May 05, 2023 at 04:07:39PM -0700, Tim Chen wrote:
> On Fri, 2023-05-05 at 15:22 +0200, Peter Zijlstra wrote:
> > On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> > > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > > 
> > > Do not try to move tasks between a non-SMT sched group and an SMT
> > > sched group for "prefer sibling" load balance.
> > > Let asym_active_balance_busiest() handle that case properly.
> > > Otherwise we could get tasks bouncing back and forth between
> > > the SMT sched group and the non-SMT sched group.
> > > 
> > > Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > ---
> > >  kernel/sched/fair.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 8a325db34b02..58ef7d529731 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > >  	/*
> > >  	 * Try to move all excess tasks to a sibling domain of the busiest
> > >  	 * group's child domain.
> > > +	 *
> > > +	 * Do not try to move between non smt sched group and smt sched
> > > +	 * group. Let asym active balance properly handle that case.
> > >  	 */
> > >  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > +	    !asymmetric_groups(sds.busiest, sds.local) &&
> > >  	    busiest->sum_nr_running > local->sum_nr_running + 1)
> > >  		goto force_balance;
> > 
> > This seems to have the hidden assumption that a !SMT core is somehow
> > 'less' than an SMT core. Should this not also look at
> > sched_asym_prefer() to establish this is so?
> > 
> > I mean, imagine I have a regular system and just offline one smt sibling
> > for giggles.
> 
> I don't quite follow your point, as asymmetric_groups() returns false even
> when one SMT sibling is offlined.
> 
> Even if, say, sds.busiest has 1 SMT sibling and sds.local has 2, both sched
> groups still have the SD_SHARE_CPUCAPACITY flag turned on.  So
> asymmetric_groups() returns false and the load balancing logic is not
> changed for a regular non-hybrid system.
> 
> I may be missing something.

What's the difference between the two cases? That is, if the remaining
sibling will have SD_SHARE_CPUCAPACITY from the degenerate SMT domain
that's been reaped, then why doesn't the same thing apply to the atoms
in the hybrid muck?

Those two cases *should* be identical; in both cases you have cores with
and cores without SMT.
  
Peter Zijlstra May 6, 2023, 12:08 a.m. UTC | #4
On Sat, May 06, 2023 at 01:38:36AM +0200, Peter Zijlstra wrote:
> On Fri, May 05, 2023 at 04:07:39PM -0700, Tim Chen wrote:
> > On Fri, 2023-05-05 at 15:22 +0200, Peter Zijlstra wrote:
> > > On Thu, May 04, 2023 at 09:09:54AM -0700, Tim Chen wrote:
> > > > From: Tim C Chen <tim.c.chen@linux.intel.com>
> > > > 
> > > > Do not try to move tasks between a non-SMT sched group and an SMT
> > > > sched group for "prefer sibling" load balance.
> > > > Let asym_active_balance_busiest() handle that case properly.
> > > > Otherwise we could get tasks bouncing back and forth between
> > > > the SMT sched group and the non-SMT sched group.
> > > > 
> > > > Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > ---
> > > >  kernel/sched/fair.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 8a325db34b02..58ef7d529731 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > > >  	/*
> > > >  	 * Try to move all excess tasks to a sibling domain of the busiest
> > > >  	 * group's child domain.
> > > > +	 *
> > > > +	 * Do not try to move between non smt sched group and smt sched
> > > > +	 * group. Let asym active balance properly handle that case.
> > > >  	 */
> > > >  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > > +	    !asymmetric_groups(sds.busiest, sds.local) &&
> > > >  	    busiest->sum_nr_running > local->sum_nr_running + 1)
> > > >  		goto force_balance;
> > > 
> > > This seems to have the hidden assumption that a !SMT core is somehow
> > > 'less' than an SMT core. Should this not also look at
> > > sched_asym_prefer() to establish this is so?
> > > 
> > > I mean, imagine I have a regular system and just offline one smt sibling
> > > for giggles.
> > 
> > I don't quite follow your point, as asymmetric_groups() returns false even
> > when one SMT sibling is offlined.
> > 
> > Even if, say, sds.busiest has 1 SMT sibling and sds.local has 2, both sched
> > groups still have the SD_SHARE_CPUCAPACITY flag turned on.  So
> > asymmetric_groups() returns false and the load balancing logic is not
> > changed for a regular non-hybrid system.
> > 
> > I may be missing something.
> 
> What's the difference between the two cases? That is, if the remaining
> sibling will have SD_SHARE_CPUCAPACITY from the degenerate SMT domain
> that's been reaped, then why doesn't the same thing apply to the atoms
> in the hybrid muck?
> 
> Those two cases *should* be identical; in both cases you have cores with
> and cores without SMT.

On my alderlake:

[  202.222019] CPU0 attaching sched-domain(s):
[  202.222509]  domain-0: span=0-1 level=SMT
[  202.222707]   groups: 0:{ span=0 }, 1:{ span=1 }
[  202.222945]   domain-1: span=0-23 level=MC
[  202.223148]    groups: 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 }, 12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
...
[  202.249979] CPU23 attaching sched-domain(s):
[  202.250127]  domain-0: span=0-23 level=MC
[  202.250198]   groups: 23:{ span=23 }, 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 }, 12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }

$ echo 0 > /sys/devices/system/cpu/cpu1/online
[  251.213848] CPU0 attaching sched-domain(s):
[  251.214376]  domain-0: span=0,2-23 level=MC
[  251.214580]   groups: 0:{ span=0 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 }, 12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }, 23:{ span=23 }
...
[  251.239511] CPU23 attaching sched-domain(s):
[  251.239656]  domain-0: span=0,2-23 level=MC
[  251.239727]   groups: 23:{ span=23 }, 0:{ span=0 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }, 10:{ span=10-11 cap=2048 }, 12:{ span=12-13 cap=2048 }, 14:{ span=14-15 cap=2048 }, 16:{ span=16 }, 17:{ span=17 }, 18:{ span=18 }, 19:{ span=19 }, 20:{ span=20 }, 21:{ span=21 }, 22:{ span=22 }

$ cat /debug/sched/domains/cpu0/domain0/groups_flags

$ cat /debug/sched/domains/cpu23/domain0/groups_flags


IOW, neither the big SMT core with one sibling offline, nor the little
core with no SMT at all, has the relevant flags set on its domain0
groups.
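
The group flags dumped above are inherited from the group's child domain
when the groups are built; a sketch of the relevant logic (loosely based
on get_group() in kernel/sched/topology.c of this era; exact context may
differ):

	if (child) {
		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
		/* Groups inherit the flags of the domain they span. */
		sg->flags = child->flags;
	} else {
		/* Single-CPU group: no child domain, hence no flags. */
		cpumask_set_cpu(cpu, sched_group_span(sg));
	}

Once a sibling is offlined, the single-CPU SMT domain degenerates and is
reaped, so the MC-level group has no child left to inherit
SD_SHARE_CPUCAPACITY from, matching the empty groups_flags output for
both core types.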



---
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 98bfc0f4ec94..e408b2889186 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -427,6 +427,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
 #undef SDM
 
 	debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
+	debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
 }
 
 void update_sched_domain_debugfs(void)
  
Vincent Guittot May 9, 2023, 1:36 p.m. UTC | #5
On Thu, 4 May 2023 at 18:11, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> From: Tim C Chen <tim.c.chen@linux.intel.com>
>
> Do not try to move tasks between a non-SMT sched group and an SMT
> sched group for "prefer sibling" load balance.
> Let asym_active_balance_busiest() handle that case properly.
> Otherwise we could get tasks bouncing back and forth between
> the SMT sched group and the non-SMT sched group.
>
> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a325db34b02..58ef7d529731 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>         /*
>          * Try to move all excess tasks to a sibling domain of the busiest
>          * group's child domain.
> +        *
> +        * Do not try to move between non smt sched group and smt sched
> +        * group. Let asym active balance properly handle that case.
>          */
>         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> +           !asymmetric_groups(sds.busiest, sds.local) &&

Can't you delete the SD_PREFER_SIBLING flag when building the topology,
like SD_ASYM_CPUCAPACITY does?

Generally speaking, SD_ASYM_CPUCAPACITY and SD_ASYM_PACKING are doing
quite similar things; it would be good to get one common solution
instead of 2 parallel paths.

>             busiest->sum_nr_running > local->sum_nr_running + 1)
>                 goto force_balance;
>
> --
> 2.32.0
>
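
For reference, the SD_ASYM_CPUCAPACITY handling Vincent points at sits in
sd_init() in kernel/sched/topology.c (quoted from memory; exact placement
may differ):

	/* Don't attempt to spread across CPUs of different capacities. */
	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
		sd->child->flags &= ~SD_PREFER_SIBLING;

An analogous approach here would clear SD_PREFER_SIBLING at topology
build time instead of special-casing find_busiest_group(), though, as Tim
notes in the reply below, a single domain on hybrid parts can mix SMT and
non-SMT groups, so a per-domain flag cannot distinguish individual group
pairs.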
  
Tim Chen May 9, 2023, 11:35 p.m. UTC | #6
On Tue, 2023-05-09 at 15:36 +0200, Vincent Guittot wrote:
> On Thu, 4 May 2023 at 18:11, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > 
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a325db34b02..58ef7d529731 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10411,8 +10411,12 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >         /*
> >          * Try to move all excess tasks to a sibling domain of the busiest
> >          * group's child domain.
> > +        *
> > +        * Do not try to move between non smt sched group and smt sched
> > +        * group. Let asym active balance properly handle that case.
> >          */
> >         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > +           !asymmetric_groups(sds.busiest, sds.local) &&
> 
> Can't you delete the SD_PREFER_SIBLING flag when building the topology,
> like SD_ASYM_CPUCAPACITY does?

The sched domain actually can have a mixture of sched groups with Atom
modules and sched groups with SMT cores.  When comparing a sched group of
one Atom core cluster against another Atom core cluster, or an SMT core
against another SMT core, I think we do want the prefer-sibling logic.
It is only when comparing an SMT core against an Atom core cluster that
we want to skip this.  Ricardo, please correct me if I am wrong.

> 
> Generally speaking, SD_ASYM_CPUCAPACITY and SD_ASYM_PACKING are doing
> quite similar things; it would be good to get one common solution
> instead of 2 parallel paths.

Okay. I'll see what I can do to merge the handling.

Tim


> 
> >             busiest->sum_nr_running > local->sum_nr_running + 1)
> >                 goto force_balance;
> > 
> > --
> > 2.32.0
> >
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a325db34b02..58ef7d529731 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10411,8 +10411,12 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	/*
 	 * Try to move all excess tasks to a sibling domain of the busiest
 	 * group's child domain.
+	 *
+	 * Do not try to move between non smt sched group and smt sched
+	 * group. Let asym active balance properly handle that case.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
+	    !asymmetric_groups(sds.busiest, sds.local) &&
 	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;