[v3,06/10] sched/fair: Use the prefer_sibling flag of the current sched domain

Message ID 20230207045838.11243-7-ricardo.neri-calderon@linux.intel.com
State New
Series sched/fair: Avoid unnecessary migrations within SMT domains

Commit Message

Ricardo Neri Feb. 7, 2023, 4:58 a.m. UTC
  SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).

Above the SMT sched domain, all domains have a child. SD_PREFER_SIBLING is
always honored, regardless of the scheduling domain at which the load
balance takes place.

There are cases, however, in which the busiest CPU's sched domain has a
child but the destination CPU's does not. Consider, for instance, a non-SMT
core (or an SMT core with only one online sibling) doing load balance with
an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
left with a fully busy SMT core and an idle non-SMT core.
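
As an illustration, consider a hypothetical hybrid topology (simplified,
not taken from a real machine):

    MC  { CPU0, CPU1, CPU2 }      <- load balance takes place here
     |- SMT { CPU0, CPU1 }        <- SMT core: MC's child for CPU0/CPU1
     `- (no child)                <- CPU2 is a non-SMT core

When CPU2 is the destination CPU, env->sd->child is NULL, prefer_sibling is
never set, and the excess task stays on the fully busy SMT core.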

Avoid this inconsistent behavior. Use the prefer_sibling flag of the current
scheduling domain, not that of its child.

The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
will not spread load among NUMA sched groups, which is the desired behavior.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 kernel/sched/fair.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
  

Comments

Vincent Guittot Feb. 8, 2023, 7:48 a.m. UTC | #1
On Tue, 7 Feb 2023 at 05:50, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
>
> Above the SMT sched domain, all domains have a child. The SD_PREFER_
> SIBLING is honored always regardless of the scheduling domain at which the
> load balance takes place.
>
> There are cases, however, in which the busiest CPU's sched domain has
> child but the destination CPU's does not. Consider, for instance a non-SMT
> core (or an SMT core with only one online sibling) doing load balance with
> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> left with a fully busy SMT core and an idle non-SMT core.
>
> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> scheduling domain, not its child.
>
> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> will not spread load among NUMA sched groups, as desired.

This is a significant change in the behavior of the numa system. It
would be good to get figures or confirmation that demonstrate that
it's ok to remove prefer_sibling behavior at the 1st numa level.

>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> Suggested-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v2:
>  * Introduced this patch.
>
> Changes since v1:
>  * N/A
> ---
>  kernel/sched/fair.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df7bcbf634a8..a37ad59f20ea 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
>
>  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
>  {
> -       struct sched_domain *child = env->sd->child;
>         struct sched_group *sg = env->sd->groups;
>         struct sg_lb_stats *local = &sds->local_stat;
>         struct sg_lb_stats tmp_sgs;
> @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>                 sg = sg->next;
>         } while (sg != env->sd->groups);
>
> -       /* Tag domain that child domain prefers tasks go to siblings first */
> -       sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> -
> +       /*
> +        * Tag domain that @env::sd prefers to spread excess tasks among
> +        * sibling sched groups.
> +        */
> +       sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
>
>         if (env->sd->flags & SD_NUMA)
>                 env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> @@ -10346,7 +10347,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>                         goto out_balanced;
>         }
>
> -       /* Try to move all excess tasks to child's sibling domain */
>         if (sds.prefer_sibling && local->group_type == group_has_spare &&
>             busiest->sum_nr_running > local->sum_nr_running + 1)
>                 goto force_balance;
> --
> 2.25.1
>
  
Chen Yu Feb. 9, 2023, 1:17 p.m. UTC | #2
On 2023-02-06 at 20:58:34 -0800, Ricardo Neri wrote:
> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
> 
> Above the SMT sched domain, all domains have a child. The SD_PREFER_
> SIBLING is honored always regardless of the scheduling domain at which the
> load balance takes place.
> 
> There are cases, however, in which the busiest CPU's sched domain has
> child but the destination CPU's does not. Consider, for instance a non-SMT
> core (or an SMT core with only one online sibling) doing load balance with
> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> left with a fully busy SMT core and an idle non-SMT core.
> 
> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> scheduling domain, not its child.
> 
> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> will not spread load among NUMA sched groups, as desired.
> 
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim C. Chen <tim.c.chen@intel.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> Suggested-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v2:
>  * Introduced this patch.
> 
> Changes since v1:
>  * N/A
> ---
>  kernel/sched/fair.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df7bcbf634a8..a37ad59f20ea 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10004,7 +10004,6 @@ static void update_idle_cpu_scan(struct lb_env *env,
>  
>  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
>  {
> -	struct sched_domain *child = env->sd->child;
>  	struct sched_group *sg = env->sd->groups;
>  	struct sg_lb_stats *local = &sds->local_stat;
>  	struct sg_lb_stats tmp_sgs;
> @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
>  
> -	/* Tag domain that child domain prefers tasks go to siblings first */
> -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> -
> +	/*
> +	 * Tag domain that @env::sd prefers to spread excess tasks among
> +	 * sibling sched groups.
> +	 */
> +	sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
>
This does help fix the issue that a non-SMT core fails to pull tasks from
busy SMT cores. It also semantically changes the definition of prefer_sibling.
Do we also need to change this:
        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
                sd->child->flags &= ~SD_PREFER_SIBLING;
might be:
        if ((sd->flags & SD_ASYM_CPUCAPACITY))
                sd->flags &= ~SD_PREFER_SIBLING;

thanks,
Chenyu

>
  
Chen, Tim C Feb. 9, 2023, 8 p.m. UTC | #3
>>  static inline void update_sd_lb_stats(struct lb_env *env, struct
>> sd_lb_stats *sds)  {
>> -	struct sched_domain *child = env->sd->child;
>>  	struct sched_group *sg = env->sd->groups;
>>  	struct sg_lb_stats *local = &sds->local_stat;
>>  	struct sg_lb_stats tmp_sgs;
>> @@ -10045,9 +10044,11 @@ static inline void update_sd_lb_stats(struct
>lb_env *env, struct sd_lb_stats *sd
>>  		sg = sg->next;
>>  	} while (sg != env->sd->groups);
>>
>> -	/* Tag domain that child domain prefers tasks go to siblings first */
>> -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>> -
>> +	/*
>> +	 * Tag domain that @env::sd prefers to spread excess tasks among
>> +	 * sibling sched groups.
>> +	 */
>> +	sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
>>
>This does help fix the issue that non-SMT core fails to pull task from busy SMT-
>cores.
>And it also semantically changes the definination of prefer sibling. Do we also
>need to change this:
>        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
>                sd->child->flags &= ~SD_PREFER_SIBLING; might be:
>        if ((sd->flags & SD_ASYM_CPUCAPACITY))
>                sd->flags &= ~SD_PREFER_SIBLING;
>

Yu,

I think you are talking about the code in sd_init() where SD_PREFER_SIBLING
is first set to "ON" and then updated depending on SD_ASYM_CPUCAPACITY. The
intention of the code is that if there are CPUs with differing CPU capacities
in the scheduler domain, we do not want to spread tasks among the child
groups of that sched domain. So the flag is turned off at the child level,
not at the parent level. But with your proposed change, the parent's flag is
turned off, leaving the child-level flag on. This moves the level where
spreading happens (SD_PREFER_SIBLING on) up one level, which is undesired
(see table below).

SD Level        SD_ASYM_CPUCAPACITY    SD_PREFER_SIBLING after init
                                       original code    proposed
root            ON                     ON               OFF   (note: SD_PREFER_SIBLING unused at this level)
first level     ON                     OFF              OFF
second level    OFF                    OFF              ON
third level     OFF                    ON               ON
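
For reference, the sd_init() code being discussed is roughly the following
(a sketch; the exact context in kernel/sched/topology.c may differ between
kernel versions):

        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child) {
                /* Don't spread tasks among children of differing capacity. */
                sd->child->flags &= ~SD_PREFER_SIBLING;
        }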

Tim
  
Tim Chen Feb. 9, 2023, 11:05 p.m. UTC | #4
On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote:
> > >  static inline void update_sd_lb_stats(struct lb_env *env, struct
> > > sd_lb_stats *sds)  {
> > > -       struct sched_domain *child = env->sd->child;
> > >         struct sched_group *sg = env->sd->groups;
> > >         struct sg_lb_stats *local = &sds->local_stat;
> > >         struct sg_lb_stats tmp_sgs;
> > > @@ -10045,9 +10044,11 @@ static inline void
> > > update_sd_lb_stats(struct
> > lb_env *env, struct sd_lb_stats *sd
> > >                 sg = sg->next;
> > >         } while (sg != env->sd->groups);
> > > 
> > > -       /* Tag domain that child domain prefers tasks go to
> > > siblings first */
> > > -       sds->prefer_sibling = child && child->flags &
> > > SD_PREFER_SIBLING;
> > > -
> > > +       /*
> > > +        * Tag domain that @env::sd prefers to spread excess
> > > tasks among
> > > +        * sibling sched groups.
> > > +        */
> > > +       sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
> > > 
> > This does help fix the issue that non-SMT core fails to pull task
> > from busy SMT-
> > cores.
> > And it also semantically changes the definination of prefer
> > sibling. Do we also
> > need to change this:
> >        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> >                sd->child->flags &= ~SD_PREFER_SIBLING; might be:
> >        if ((sd->flags & SD_ASYM_CPUCAPACITY))
> >                sd->flags &= ~SD_PREFER_SIBLING;
> > 
> 
> Yu,
> 
> I think you are talking about the code in sd_init() 
> where SD_PREFER_SIBLING is first set
> to "ON" and updated depending on SD_ASYM_CPUCAPACITY.  The intention
> of the code
> is if there are cpus in the scheduler domain that have differing cpu
> capacities,
> we do not want to do spreading among the child groups in the sched
> domain.
> So the flag is turned off in the child group level and not the parent
> level. But with your above
> change, the parent's flag is turned off, leaving the child level flag
> on. 
> This moves the level where spreading happens (SD_PREFER_SIBLING on) 
> up one level which is undesired (see table below).
> 
> 
Sorry got a bad mail client messing up the table format.  Updated below

SD Level        SD_ASYM_CPUCAPACITY    SD_PREFER_SIBLING after init
                                       original code    proposed
root            ON                     ON               OFF   (note: SD_PREFER_SIBLING unused at this level)
first level     ON                     OFF              OFF
second level    OFF                    OFF              ON
third level     OFF                    ON               ON

Tim
  
Ricardo Neri Feb. 10, 2023, 3:16 a.m. UTC | #5
On Thu, Feb 09, 2023 at 03:05:03PM -0800, Tim Chen wrote:
> On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote:
> > > >  static inline void update_sd_lb_stats(struct lb_env *env, struct
> > > > sd_lb_stats *sds)  {
> > > > -       struct sched_domain *child = env->sd->child;
> > > >         struct sched_group *sg = env->sd->groups;
> > > >         struct sg_lb_stats *local = &sds->local_stat;
> > > >         struct sg_lb_stats tmp_sgs;
> > > > @@ -10045,9 +10044,11 @@ static inline void
> > > > update_sd_lb_stats(struct
> > > lb_env *env, struct sd_lb_stats *sd
> > > >                 sg = sg->next;
> > > >         } while (sg != env->sd->groups);
> > > > 
> > > > -       /* Tag domain that child domain prefers tasks go to
> > > > siblings first */
> > > > -       sds->prefer_sibling = child && child->flags &
> > > > SD_PREFER_SIBLING;
> > > > -
> > > > +       /*
> > > > +        * Tag domain that @env::sd prefers to spread excess
> > > > tasks among
> > > > +        * sibling sched groups.
> > > > +        */
> > > > +       sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
> > > > 
> > > This does help fix the issue that non-SMT core fails to pull task
> > > from busy SMT-
> > > cores.
> > > And it also semantically changes the definination of prefer
> > > sibling. Do we also
> > > need to change this:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > >                sd->child->flags &= ~SD_PREFER_SIBLING; might be:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY))
> > >                sd->flags &= ~SD_PREFER_SIBLING;
> > > 
> > 
> > Yu,
> > 
> > I think you are talking about the code in sd_init() 
> > where SD_PREFER_SIBLING is first set
> > to "ON" and updated depending on SD_ASYM_CPUCAPACITY.  The intention
> > of the code
> > is if there are cpus in the scheduler domain that have differing cpu
> > capacities,
> > we do not want to do spreading among the child groups in the sched
> > domain.
> > So the flag is turned off in the child group level and not the parent
> > level. But with your above
> > change, the parent's flag is turned off, leaving the child level flag
> > on. 
> > This moves the level where spreading happens (SD_PREFER_SIBLING on) 
> > up one level which is undesired (see table below).

But my patch moves the level at which we act on prefer_sibling: it now
checks the SD_PREFER_SIBLING flag at the current level, not its child.
Thus, removing SD_PREFER_SIBLING from a level with SD_ASYM_CPUCAPACITY
prevents spreading among CPUs of different CPU capacity, no?

Thanks and BR,
Ricardo
  
Chen Yu Feb. 10, 2023, 6:55 a.m. UTC | #6
On 2023-02-09 at 15:05:03 -0800, Tim Chen wrote:
> On Thu, 2023-02-09 at 20:00 +0000, Chen, Tim C wrote:
> > > >  static inline void update_sd_lb_stats(struct lb_env *env, struct
> > > > sd_lb_stats *sds)  {
> > > > -       struct sched_domain *child = env->sd->child;
> > > >         struct sched_group *sg = env->sd->groups;
> > > >         struct sg_lb_stats *local = &sds->local_stat;
> > > >         struct sg_lb_stats tmp_sgs;
> > > > @@ -10045,9 +10044,11 @@ static inline void
> > > > update_sd_lb_stats(struct
> > > lb_env *env, struct sd_lb_stats *sd
> > > >                 sg = sg->next;
> > > >         } while (sg != env->sd->groups);
> > > > 
> > > > -       /* Tag domain that child domain prefers tasks go to
> > > > siblings first */
> > > > -       sds->prefer_sibling = child && child->flags &
> > > > SD_PREFER_SIBLING;
> > > > -
> > > > +       /*
> > > > +        * Tag domain that @env::sd prefers to spread excess
> > > > tasks among
> > > > +        * sibling sched groups.
> > > > +        */
> > > > +       sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
> > > > 
> > > This does help fix the issue that non-SMT core fails to pull task
> > > from busy SMT-
> > > cores.
> > > And it also semantically changes the definination of prefer
> > > sibling. Do we also
> > > need to change this:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > >                sd->child->flags &= ~SD_PREFER_SIBLING; might be:
> > >        if ((sd->flags & SD_ASYM_CPUCAPACITY))
> > >                sd->flags &= ~SD_PREFER_SIBLING;
> > > 
> > 
> > Yu,
> > 
> > I think you are talking about the code in sd_init() 
> > where SD_PREFER_SIBLING is first set
> > to "ON" and updated depending on SD_ASYM_CPUCAPACITY.  The intention
> > of the code
> > is if there are cpus in the scheduler domain that have differing cpu
> > capacities,
> > we do not want to do spreading among the child groups in the sched
> > domain.
> > So the flag is turned off in the child group level and not the parent
> > level. But with your above
> > change, the parent's flag is turned off, leaving the child level flag
> > on. 
> > This moves the level where spreading happens (SD_PREFER_SIBLING on) 
> > up one level which is undesired (see table below).
> >
Yes, it moves the flag 1 level up. And if I understand correctly, with Ricardo's patch
applied, we have changed the original meaning of SD_PREFER_SIBLING:
Original: Tasks in this sched domain want to be migrated to another sched domain.
After init change: Tasks in the sched group under this sched domain want to
                   be migrated to a sibling group.
> > 
> Sorry got a bad mail client messing up the table format.  Updated below
> 
> 			SD_ASYM_CPUCAPACITY	SD_PREFER_SIBLING after init             
> 						original code 	proposed
> SD Level		 	 
> root			ON			ON		OFF      (note: SD_PREFER_SIBLING unused at this level)
SD_PREFER_SIBLING is honored at the root level after the proposed init change.
> first level             ON			OFF		OFF
Before the proposed init change, tasks in a first-level sd do not want
to be spread to a sibling sd. After the proposed change, tasks
in all sched groups under the root sd do not want to be spread
to a sibling sched group (AKA a first-level sd).

thanks,
Chenyu
> second level		OFF			OFF		ON
> third level             OFF			ON		ON
> 
> Tim
  
Peter Zijlstra Feb. 10, 2023, 10:08 a.m. UTC | #7
On Mon, Feb 06, 2023 at 08:58:34PM -0800, Ricardo Neri wrote:
> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
> 
> Above the SMT sched domain, all domains have a child. The SD_PREFER_
> SIBLING is honored always regardless of the scheduling domain at which the
> load balance takes place.
> 
> There are cases, however, in which the busiest CPU's sched domain has
> child but the destination CPU's does not. Consider, for instance a non-SMT
> core (or an SMT core with only one online sibling) doing load balance with
> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> left with a fully busy SMT core and an idle non-SMT core.
> 
> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> scheduling domain, not its child.
> 
> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> will not spread load among NUMA sched groups, as desired.
> 

Like many of the others; I don't much like this.

Why not simply detect this asymmetric having of SMT and kill the
PREFER_SIBLING flag on the SMT leafs in that case?

Specifically, I'm thinking something in the degenerate area where it
looks if a given domain has equal depth children or so.

Note that this should not be tied to having special hardware, you can
create the very same weirdness by just offlining a few SMT siblings and
leaving a few on.
  
Ricardo Neri Feb. 10, 2023, 1:24 p.m. UTC | #8
On Wed, Feb 08, 2023 at 08:48:05AM +0100, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 05:50, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
> > non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
> >
> > Above the SMT sched domain, all domains have a child. The SD_PREFER_
> > SIBLING is honored always regardless of the scheduling domain at which the
> > load balance takes place.
> >
> > There are cases, however, in which the busiest CPU's sched domain has
> > child but the destination CPU's does not. Consider, for instance a non-SMT
> > core (or an SMT core with only one online sibling) doing load balance with
> > an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
> > left with a fully busy SMT core and an idle non-SMT core.
> >
> > Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
> > scheduling domain, not its child.
> >
> > The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
> > will not spread load among NUMA sched groups, as desired.
> 
> This is a significant change in the behavior of the numa system. It
> would be good to get figures or confirmation that demonstrate that
> it's ok to remove prefer_sibling behavior at the 1st numa level.
> 
Thank you very much for your feedback Vincent!

You are right. It does change the behavior at the first numa level. Peter
did not like this change either. It also made things confusing for
SD_ASYM_CPUCAPACITY vs SD_PREFER_SIBLING.

I'll work on a different approach.
  
Valentin Schneider Feb. 10, 2023, 2:54 p.m. UTC | #9
On 10/02/23 11:08, Peter Zijlstra wrote:
> On Mon, Feb 06, 2023 at 08:58:34PM -0800, Ricardo Neri wrote:
>> SD_PREFER_SIBLING is set from the SMT scheduling domain up to the first
>> non-NUMA domain (the exception is systems with SD_ASYM_CPUCAPACITY).
>>
>> Above the SMT sched domain, all domains have a child. The SD_PREFER_
>> SIBLING is honored always regardless of the scheduling domain at which the
>> load balance takes place.
>>
>> There are cases, however, in which the busiest CPU's sched domain has
>> child but the destination CPU's does not. Consider, for instance a non-SMT
>> core (or an SMT core with only one online sibling) doing load balance with
>> an SMT core at the MC level. SD_PREFER_SIBLING will not be honored. We are
>> left with a fully busy SMT core and an idle non-SMT core.
>>
>> Avoid inconsistent behavior. Use the prefer_sibling behavior at the current
>> scheduling domain, not its child.
>>
>> The NUMA sched domain does not have the SD_PREFER_SIBLING flag. Thus, we
>> will not spread load among NUMA sched groups, as desired.
>>
>
> Like many of the others; I don't much like this.
>
> Why not simply detect this asymmetric having of SMT and kill the
> PREFER_SIBLING flag on the SMT leafs in that case?
>
> Specifically, I'm thinking something in the degenerate area where it
> looks if a given domain has equal depth children or so.
>
> Note that this should not be tied to having special hardware, you can
> create the very same weirdness by just offlining a few SMT siblings and
> leaving a few on.

So something like have SD_PREFER_SIBLING affect the SD it's on (and not
its parent), but remove it from the lowest non-degenerated topology level?
(+ add it to the first NUMA level to keep things as they are, even if TBF I
find relying on it for NUMA balancing a bit odd).
  
Peter Zijlstra Feb. 10, 2023, 4:53 p.m. UTC | #10
On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:

> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> its parent), but remove it from the lowest non-degenerated topology level?

So I was rather confused about the whole moving it between levels things
this morning -- conceptually, prefer siblings says you want to try
sibling domains before filling up your current domain. Now, balancing
between siblings happens one level up, hence looking at child->flags
makes perfect sense.

But looking at the current domain and still calling it prefer sibling
makes absolutely no sense what so ever.

In that confusion I think I also got the polarity wrong, I thought you
wanted to kill prefer_sibling for the asymmetric SMT cases, instead you
want to force enable it as long as there is one SMT child around.

Whichever way around it we do it, I'm thinking perhaps some renaming
might be in order to clarify things.

How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
of the behaviour, but have it be set by children with SD_PREFER_SIBLING
or something.

OTOH, there's also

  if (busiest->group_weight == 1 || sds->prefer_sibling) {

which explicitly also takes the group-of-one (the !child case) into
account, but that's not consistently done.

	sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;

seems an interesting option, except perhaps ASYM_CPUCAPACITY -- I
forget, can CPUs of different capacity be in the same leaf domain? With
big.LITTLE I think not, they had their own cache domains and so you get
at least MC domains per capacity, but DynamiQ might have totally wrecked
that party.

> (+ add it to the first NUMA level to keep things as they are, even if TBF I
> find relying on it for NUMA balancing a bit odd).

Arguably it ought to perhaps be one of those node_reclaim_distance
things. The thing is that NUMA-1 is often fairly quick, esp. these days
where it's basically on die numa.
  
Valentin Schneider Feb. 10, 2023, 5:12 p.m. UTC | #11
On 10/02/23 17:53, Peter Zijlstra wrote:
> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
>
>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
>> its parent), but remove it from the lowest non-degenerated topology level?
>
> So I was rather confused about the whole moving it between levels things
> this morning -- conceptually, prefer siblings says you want to try
> sibling domains before filling up your current domain. Now, balancing
> between siblings happens one level up, hence looking at child->flags
> makes perfect sense.
>
> But looking at the current domain and still calling it prefer sibling
> makes absolutely no sense what so ever.
>

True :-)

> In that confusion I think I also got the polarity wrong, I thought you
> wanted to kill prefer_sibling for the assymetric SMT cases, instead you
> want to force enable it as long as there is one SMT child around.
>
> Whichever way around it we do it, I'm thinking perhaps some renaming
> might be in order to clarify things.
>
> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> or something.
>

Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
something along those lines.

> OTOH, there's also
>
>   if (busiest->group_weight == 1 || sds->prefer_sibling) {
>
> which explicitly also takes the group-of-one (the !child case) into
> account, but that's not consistently done.
>
>       sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;
>
> seems an interesting option,

> except perhaps ASYM_CPUCAPACITY -- I
> forget, can CPUs of different capacity be in the same leaf domain? With
> big.LITTLE I think not, they had their own cache domains and so you get
> at least MC domains per capacity, but DynamiQ might have totally wrecked
> that party.

Yeah, newer systems can have different capacities in one MC domain, cf:

  b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")

>
>> (+ add it to the first NUMA level to keep things as they are, even if TBF I
>> find relying on it for NUMA balancing a bit odd).
>
> Arguably it ought to perhaps be one of those node_reclaim_distance
> things. The thing is that NUMA-1 is often fairly quick, esp. these days
> where it's basically on die numa.

Right, makes sense, thanks.
  
Ricardo Neri Feb. 10, 2023, 6:31 p.m. UTC | #12
On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> On 10/02/23 17:53, Peter Zijlstra wrote:
> > On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> >
> >> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> >> its parent), but remove it from the lowest non-degenerated topology level?
> >
> > So I was rather confused about the whole moving it between levels things
> > this morning -- conceptually, prefer siblings says you want to try
> > sibling domains before filling up your current domain. Now, balancing
> > between siblings happens one level up, hence looking at child->flags
> > makes perfect sense.
> >
> > But looking at the current domain and still calling it prefer sibling
> > makes absolutely no sense what so ever.
> >
> 
> True :-)
> 
> > In that confusion I think I also got the polarity wrong, I thought you
> > wanted to kill prefer_sibling for the assymetric SMT cases, instead you
> > want to force enable it as long as there is one SMT child around.

Exactly.

> >
> > Whichever way around it we do it, I'm thinking perhaps some renaming
> > might be in order to clarify things.
> >
> > How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> > of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> > or something.
> >
> 
> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
> something along those lines.

I sense a consensus towards SD_SPREAD_TASKS.

> 
> > OTOH, there's also
> >
> >   if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >
> > which explicitly also takes the group-of-one (the !child case) into
> > account, but that's not consistently done.
> >
> >       sds->prefer_sibling = !child || child->flags & SD_PREFER_SIBLING;

This would need a special provision for SD_ASYM_CPUCAPACITY.

> >
> > seems an interesting option,
> 
> > except perhaps ASYM_CPUCAPACITY -- I
> > forget, can CPUs of different capacity be in the same leaf domain? With
> > big.LITTLE I think not, they had their own cache domains and so you get
> > at least MC domains per capacity, but DynamiQ might have totally wrecked
> > that party.
> 
> Yeah, newer systems can have different capacities in one MC domain, cf:
> 
>   b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
> 
> >
> >> (+ add it to the first NUMA level to keep things as they are, even if TBF I
> >> find relying on it for NUMA balancing a bit odd).
> >
> > Arguably it ought to perhaps be one of those node_reclaim_distance
> > things. The thing is that NUMA-1 is often fairly quick, esp. these days
> > where it's basically on die numa.

To conserve the current behavior the NUMA level would need to have
SD_SPREAD_TASKS. It will be cleared along with SD_BALANCE_{EXEC, FORK} and
SD_WAKE_AFFINE if the numa distance is larger than node_reclaim_distance,
yes?

Thanks and BR,
Ricardo
  
Dietmar Eggemann Feb. 13, 2023, 12:17 p.m. UTC | #13
On 10/02/2023 19:31, Ricardo Neri wrote:
> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
>> On 10/02/23 17:53, Peter Zijlstra wrote:
>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
>>>
>>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
>>>> its parent), but remove it from the lowest non-degenerated topology level?
>>>
>>> So I was rather confused about the whole moving it between levels things
>>> this morning -- conceptually, prefer siblings says you want to try
>>> sibling domains before filling up your current domain. Now, balancing
>>> between siblings happens one level up, hence looking at child->flags
>>> makes perfect sense.
>>>
>>> But looking at the current domain and still calling it prefer sibling
>>> makes absolutely no sense what so ever.
>>>
>>
>> True :-)
>>
>>> In that confusion I think I also got the polarity wrong, I thought you
>>> wanted to kill prefer_sibling for the assymetric SMT cases, instead you
>>> want to force enable it as long as there is one SMT child around.
> 
> Exactly.
> 
>>>
>>> Whichever way around it we do it, I'm thinking perhaps some renaming
>>> might be in order to clarify things.
>>>
>>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
>>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
>>> or something.
>>>
>>
>> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
>> something along those lines.
> 
> I sense a consesus towards SD_SPREAD_TASKS.

Can you not detect the E-core dst_cpu case on MC with:

+       if (child)
+               sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
+       else if (sds->busiest)
+               sds->prefer_sibling = sds->busiest->group_weight > 1;
+

[...]
  
Ricardo Neri Feb. 14, 2023, 6:43 a.m. UTC | #14
On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
> On 10/02/2023 19:31, Ricardo Neri wrote:
> > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> >> On 10/02/23 17:53, Peter Zijlstra wrote:
> >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> >>>
> >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> >>>> its parent), but remove it from the lowest non-degenerated topology level?
> >>>
> >>> So I was rather confused about the whole moving it between levels things
> >>> this morning -- conceptually, prefer siblings says you want to try
> >>> sibling domains before filling up your current domain. Now, balancing
> >>> between siblings happens one level up, hence looking at child->flags
> >>> makes perfect sense.
> >>>
> >>> But looking at the current domain and still calling it prefer sibling
> >>> makes absolutely no sense what so ever.
> >>>
> >>
> >> True :-)
> >>
> >>> In that confusion I think I also got the polarity wrong, I thought you
> >>> wanted to kill prefer_sibling for the assymetric SMT cases, instead you
> >>> want to force enable it as long as there is one SMT child around.
> > 
> > Exactly.
> > 
> >>>
> >>> Whichever way around it we do it, I'm thinking perhaps some renaming
> >>> might be in order to clarify things.
> >>>
> >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> >>> or something.
> >>>
> >>
> >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
> >> something along those lines.
> > 
> > I sense a consesus towards SD_SPREAD_TASKS.
> 
> Can you not detect the E-core dst_cpu case on MC with:
> 
> +       if (child)
> +               sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
> +       else if (sds->busiest)
> +               sds->prefer_sibling = sds->busiest->group_weight > 1;

Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
is set based on the flags of the destination CPU's sched domain. But when
used in find_busiest_group(), tasks are spread from the busiest group's
child domain.
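
For context, the consumer side in find_busiest_group() is the check touched
by this patch (copied from the hunk at the bottom of this thread):

        if (sds.prefer_sibling && local->group_type == group_has_spare &&
            busiest->sum_nr_running > local->sum_nr_running + 1)
                goto force_balance;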

Your proposed code also needs a check for SD_PREFER_SIBLING, no?

Thanks and BR,
Ricardo
  
Ricardo Neri Feb. 16, 2023, 5:21 a.m. UTC | #15
On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote:
> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
> > On 10/02/2023 19:31, Ricardo Neri wrote:
> > > On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> > >> On 10/02/23 17:53, Peter Zijlstra wrote:
> > >>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> > >>>
> > >>>> So something like have SD_PREFER_SIBLING affect the SD it's on (and not
> > >>>> its parent), but remove it from the lowest non-degenerated topology level?
> > >>>
> > >>> So I was rather confused about the whole moving it between levels things
> > >>> this morning -- conceptually, prefer siblings says you want to try
> > >>> sibling domains before filling up your current domain. Now, balancing
> > >>> between siblings happens one level up, hence looking at child->flags
> > >>> makes perfect sense.
> > >>>
> > >>> But looking at the current domain and still calling it prefer sibling
> > >>> makes absolutely no sense what so ever.
> > >>>
> > >>
> > >> True :-)
> > >>
> > >>> In that confusion I think I also got the polarity wrong, I thought you
> > >>> wanted to kill prefer_sibling for the assymetric SMT cases, instead you
> > >>> want to force enable it as long as there is one SMT child around.
> > > 
> > > Exactly.
> > > 
> > >>>
> > >>> Whichever way around it we do it, I'm thinking perhaps some renaming
> > >>> might be in order to clarify things.
> > >>>
> > >>> How about adding a flag SD_SPREAD_TASKS, which is the effective toggle
> > >>> of the behaviour, but have it be set by children with SD_PREFER_SIBLING
> > >>> or something.
> > >>>
> > >>
> > >> Or entirely bin SD_PREFER_SIBLING and stick with SD_SPREAD_TASKS, but yeah
> > >> something along those lines.
> > > 
> > > I sense a consesus towards SD_SPREAD_TASKS.
> > 
> > Can you not detect the E-core dst_cpu case on MC with:
> > 
> > +       if (child)
> > +               sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
> > +       else if (sds->busiest)
> > +               sds->prefer_sibling = sds->busiest->group_weight > 1;
> 
> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
> is set based on the flags of the destination CPU's sched domain. But when
> used in find_busiest_group() tasks are spread from the busiest group's
> child domain.
> 
> Your proposed code, also needs a check for SD_PREFER_SIBLING, no?

I tweaked the solution that Dietmar proposed:

-	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+	if (sds->busiest)
+		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;

This comes from the observation that the prefer_sibling setting acts on the
busiest group. It then depends on whether the busiest group, not the local
group, has child sched domains. Today it works because in most cases
both the local and the busiest groups have child domains with the
SD_PREFER_SIBLING flag.

This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
prefer_sibling would not be set in that case.

It would also conserve the current behavior at the NUMA level. We would
not need to implement SD_SPREAD_TASKS.

This would both fix the SMT vs non-SMT bug and be less invasive.

Thoughts?

Thanks and BR,
Ricardo
  
Peter Zijlstra Feb. 16, 2023, 12:16 p.m. UTC | #16
On Wed, Feb 15, 2023 at 09:21:05PM -0800, Ricardo Neri wrote:

> I tweaked the solution that Dietmar proposed:
> 
> -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> +	if (sds->busiest)
> +		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
> 
> This comes from the observation that the prefer_sibling setting acts on
> busiest group. It then depends on whether the busiest group, not the local
> group, has child sched sched domains. Today it works because in most cases
> both the local and the busiest groups have child domains with the SD_
> PREFER_SIBLING flag.
> 
> This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> prefer_sibling would not be set in that case.
> 
> It would also conserve the current behavior at the NUMA level. We would
> not need to implement SD_SPREAD_TASKS.
> 
> This would both fix the SMT vs non-SMT bug and be less invasive.
> 
> Thoughts?

That does look nice. Be sure to put in a nice comment too.
  
Ricardo Neri Feb. 17, 2023, 1:41 a.m. UTC | #17
On Thu, Feb 16, 2023 at 01:16:42PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 15, 2023 at 09:21:05PM -0800, Ricardo Neri wrote:
> 
> > I tweaked the solution that Dietmar proposed:
> > 
> > -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > +	if (sds->busiest)
> > +		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
> > 
> > This comes from the observation that the prefer_sibling setting acts on
> > busiest group. It then depends on whether the busiest group, not the local
> > group, has child sched sched domains. Today it works because in most cases
> > both the local and the busiest groups have child domains with the SD_
> > PREFER_SIBLING flag.
> > 
> > This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> > prefer_sibling would not be set in that case.
> > 
> > It would also conserve the current behavior at the NUMA level. We would
> > not need to implement SD_SPREAD_TASKS.
> > 
> > This would both fix the SMT vs non-SMT bug and be less invasive.
> > 
> > Thoughts?
> 
> That does look nice. Be sure to put in a nice comment too.

Will do!
  
Valentin Schneider Feb. 20, 2023, 9:45 a.m. UTC | #18
On 20/02/23 15:22, hupu wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>> Specifically, I'm thinking something in the degenerate area where it
>> looks if a given domain has equal depth children or so.
>
> By the way, in the last email you mentioned "the degenerate area". I can't understand what it means because of cultural differences. Does it refer to this scenario: sched_domain at the SMT layer is destroyed after an SMT thread goes offline, so mc_domain->child is NULL?
>

This references sched domain degeneration, cf. sd_parent_degenerate() and
sd_degenerate() in sched/topology.c

Sched domains are built following a layout described in
sched_domain_topology[], and depending on the actual CPU layout some
domains end up not being worth keeping (e.g. they only contain one CPU - we
need at least 2 to balance things), so we destroy (degenerate) them.
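
A minimal sketch of the degeneration check (simplified; the real
sd_degenerate() in kernel/sched/topology.c also checks balancing-related
flags and the number of groups):

        static int sd_degenerate(struct sched_domain *sd)
        {
                /* A domain spanning a single CPU has nothing to balance. */
                if (cpumask_weight(sched_domain_span(sd)) == 1)
                        return 1;

                /* (Further flag/group checks omitted in this sketch.) */
                return 0;
        }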

> hupu
  
Valentin Schneider Feb. 21, 2023, 6:15 p.m. UTC | #19
On 21/02/23 17:49, tmmdh wrote:
> From: Valentin Schneider <vschneid@redhat.com>
> But I am still confused by Peter's description about prefer_sibling in the last email, linked at
> https://lore.kernel.org/all/Y+Z2b%2FOtZDk9cT53@hirez.programming.kicks-ass.net/
>> this morning -- conceptually, prefer siblings says you want to try
>> sibling domains before filling up your current domain.
>
> Why should we try sibling domains before filling up your current domain? Why does Peter think that sibling domains is better than current domain.
> My understanding about this problem is described as follows, but I am not
> sure if it is correct. I think the sibling domain is a domain lower than
> the current level. Just like SMT is the sibling of MC, while DIE is the
> sibling of NUMA.

That's the wrong way around; going up (or down) the sched_domain hierarchy is
done via parent (or child) pointers. Sibling means going sideways (i.e. the
same topology level but viewed from a different CPU)

> Is it because the cpus covered by sibling domains share more resources (such as cache), which can improve the performance of task running?
>

If sibling domains are in the same cache hierarchy then spreading tasks
between them can improve overall performance. That doesn't work with NUMA or
big.LITTLE domains, so we don't have the flag on those.
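
As a rough illustration (a hypothetical package with two SMT cores):

    MC { CPU0, CPU1, CPU2, CPU3 }
     |- SMT { CPU0, CPU1 }    <- these two SMT domains are siblings:
     `- SMT { CPU2, CPU3 }       same level, seen from different CPUs

Spreading between them keeps one task per core, and the cores typically
share a cache, so migrations are cheap; doing the same across NUMA nodes
would be much more costly, hence no SD_PREFER_SIBLING there.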
  
Dietmar Eggemann Feb. 23, 2023, 10:09 a.m. UTC | #20
On 16/02/2023 06:21, Ricardo Neri wrote:
> On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote:
>> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
>>> On 10/02/2023 19:31, Ricardo Neri wrote:
>>>> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
>>>>> On 10/02/23 17:53, Peter Zijlstra wrote:
>>>>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:

[...]

>>> Can you not detect the E-core dst_cpu case on MC with:
>>>
>>> +       if (child)
>>> +               sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
>>> +       else if (sds->busiest)
>>> +               sds->prefer_sibling = sds->busiest->group_weight > 1;
>>
>> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
>> is set based on the flags of the destination CPU's sched domain. But when
>> used in find_busiest_group() tasks are spread from the busiest group's
>> child domain.
>>
>> Your proposed code, also needs a check for SD_PREFER_SIBLING, no?
> 
> I tweaked the solution that Dietmar proposed:
> 
> -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> +	if (sds->busiest)
> +		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;

Maybe:

sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);

1 vs 2048 ?

> This comes from the observation that the prefer_sibling setting acts on
> busiest group. It then depends on whether the busiest group, not the local
> group, has child sched sched domains. Today it works because in most cases
> both the local and the busiest groups have child domains with the SD_
> PREFER_SIBLING flag.
> 
> This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> prefer_sibling would not be set in that case.
> 
> It would also conserve the current behavior at the NUMA level. We would
> not need to implement SD_SPREAD_TASKS.
> 
> This would both fix the SMT vs non-SMT bug and be less invasive.

Yeah, much better! I always forget that we have those flags on SGs now
as well. Luckily, we just need to check busiest sg to cover all cases.
  
Ricardo Neri Feb. 24, 2023, 12:29 p.m. UTC | #21
On Thu, Feb 23, 2023 at 11:09:55AM +0100, Dietmar Eggemann wrote:
> On 16/02/2023 06:21, Ricardo Neri wrote:
> > On Mon, Feb 13, 2023 at 10:43:28PM -0800, Ricardo Neri wrote:
> >> On Mon, Feb 13, 2023 at 01:17:09PM +0100, Dietmar Eggemann wrote:
> >>> On 10/02/2023 19:31, Ricardo Neri wrote:
> >>>> On Fri, Feb 10, 2023 at 05:12:30PM +0000, Valentin Schneider wrote:
> >>>>> On 10/02/23 17:53, Peter Zijlstra wrote:
> >>>>>> On Fri, Feb 10, 2023 at 02:54:56PM +0000, Valentin Schneider wrote:
> 
> [...]
> 
> >>> Can you not detect the E-core dst_cpu case on MC with:
> >>>
> >>> +       if (child)
> >>> +               sds->prefer_sibling = child->flags & SD_PREFER_SIBLING;
> >>> +       else if (sds->busiest)
> >>> +               sds->prefer_sibling = sds->busiest->group_weight > 1;
> >>
> >> Whose child wants the prefer_sibling setting? In update_sd_lb_stats(), it
> >> is set based on the flags of the destination CPU's sched domain. But when
> >> used in find_busiest_group() tasks are spread from the busiest group's
> >> child domain.
> >>
> >> Your proposed code, also needs a check for SD_PREFER_SIBLING, no?
> > 
> > I tweaked the solution that Dietmar proposed:
> > 
> > -	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> > +	if (sds->busiest)
> > +		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
> 
> Maybe:
> 
> sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);
> 
> 1 vs 2048 ?

Sure, I can do this.
> 
> > This comes from the observation that the prefer_sibling setting acts on
> > busiest group. It then depends on whether the busiest group, not the local
> > group, has child sched sched domains. Today it works because in most cases
> > both the local and the busiest groups have child domains with the SD_
> > PREFER_SIBLING flag.
> > 
> > This would also satisfy sched domains with the SD_ASYM_CPUCAPACITY flag as
> > prefer_sibling would not be set in that case.
> > 
> > It would also conserve the current behavior at the NUMA level. We would
> > not need to implement SD_SPREAD_TASKS.
> > 
> > This would both fix the SMT vs non-SMT bug and be less invasive.
> 
> Yeah, much better! I always forget that we have those flags on SGs now
> as well. Luckily, we just need to check busiest sg to cover all cases.

Right. I can add a comment to clarify from where the flags come.
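
Putting the pieces from this thread together (the busiest-group check,
Dietmar's !! suggestion and the comment Peter asked for), the reworked
assignment could look roughly like the sketch below; the actual respin may
differ:

        /*
         * Indicate that the busiest group prefers its excess tasks to be
         * spread among sibling sched groups. NB: a sched group carries the
         * flags of its child sched domain.
         */
        if (sds->busiest)
                sds->prefer_sibling = !!(sds->busiest->flags & SD_PREFER_SIBLING);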
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df7bcbf634a8..a37ad59f20ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10004,7 +10004,6 @@  static void update_idle_cpu_scan(struct lb_env *env,
 
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
@@ -10045,9 +10044,11 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
-	/* Tag domain that child domain prefers tasks go to siblings first */
-	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
-
+	/*
+	 * Tag domain that @env::sd prefers to spread excess tasks among
+	 * sibling sched groups.
+	 */
+	sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;
 
 	if (env->sd->flags & SD_NUMA)
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
@@ -10346,7 +10347,6 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 			goto out_balanced;
 	}
 
-	/* Try to move all excess tasks to child's sibling domain */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
 	    busiest->sum_nr_running > local->sum_nr_running + 1)
 		goto force_balance;