[v2,1/7] sched/fair: Generalize asym_packing logic for SMT local sched group

Message ID 20221122203532.15013-2-ricardo.neri-calderon@linux.intel.com
State New
Series x86/sched: Avoid unnecessary migrations within SMT domains

Commit Message

Ricardo Neri Nov. 22, 2022, 8:35 p.m. UTC
  When balancing load between two physical cores, an idle destination CPU can
help another core only if all of its SMT siblings are also idle. Otherwise,
there is no increase in throughput. It does not matter whether the other
core is composed of SMT siblings.

Simply check that no tasks are running on the local group and that the
other core has exactly one busy CPU before proceeding. Let
find_busiest_group() handle the case of more than one busy CPU. This logic
holds for SMT2, SMT4, SMT8, etc.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
 * Reworded commit message and inline comments for clarity.
 * Stated that this changeset does not impact SMT4 or SMT8.
---
 kernel/sched/fair.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)
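
Reduced to its core, the new check looks like this (a sketch distilled from
the hunk in the Patch section at the bottom of this page, not the full
function):

	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;

	/*
	 * Pull towards @dst_cpu only if @sg has exactly one busy CPU and
	 * every CPU in the local group is idle; larger imbalances are
	 * left to find_busiest_group().
	 */
	if (sg_busy_cpus == 1 && !sds->local_stat.sum_nr_running)
		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);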
  

Comments

Dietmar Eggemann Dec. 6, 2022, 5:22 p.m. UTC | #1
On 22/11/2022 21:35, Ricardo Neri wrote:
> When balancing load between two physical cores, an idle destination CPU can
> help another core only if all of its SMT siblings are also idle. Otherwise,
> there is no increase in throughput. It does not matter whether the other
> core is composed of SMT siblings.
> 
> Simply check that no tasks are running on the local group and that the
> other core has exactly one busy CPU before proceeding. Let
> find_busiest_group() handle the case of more than one busy CPU. This logic
> holds for SMT2, SMT4, SMT8, etc.

[...]

> Changes since v1:
>  * Reworded commit message and inline comments for clarity.
>  * Stated that this changeset does not impact SMT4 or SMT8.
> ---
>  kernel/sched/fair.c | 29 +++++++++--------------------
>  1 file changed, 9 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4a0b8bd941c..18c672ff39ef 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8900,12 +8900,10 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
>  				    struct sched_group *sg)

I'm not sure why you change asym_smt_can_pull_tasks() together with
removing SD_ASYM_PACKING from SMT level (patch 5/7)?

update_sg_lb_stats()

  ... && env->sd->flags & SD_ASYM_PACKING && .. && sched_asym()
                                                   ^^^^^^^^^^^^
    sched_asym()

      if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
          (group->flags & SD_SHARE_CPUCAPACITY))
        return asym_smt_can_pull_tasks()
               ^^^^^^^^^^^^^^^^^^^^^^^^^

So x86 won't have a sched domain with SD_SHARE_CPUCAPACITY and
SD_ASYM_PACKING anymore. So sched_asym() would call sched_asym_prefer()
directly on MC. What do I miss here?

[...]
  
Ricardo Neri Dec. 12, 2022, 5:53 p.m. UTC | #2
On Tue, Dec 06, 2022 at 06:22:41PM +0100, Dietmar Eggemann wrote:
> On 22/11/2022 21:35, Ricardo Neri wrote:
> > When balancing load between two physical cores, an idle destination CPU can
> > help another core only if all of its SMT siblings are also idle. Otherwise,
> > there is no increase in throughput. It does not matter whether the other
> > core is composed of SMT siblings.
> > 
> > Simply check that no tasks are running on the local group and that the
> > other core has exactly one busy CPU before proceeding. Let
> > find_busiest_group() handle the case of more than one busy CPU. This logic
> > holds for SMT2, SMT4, SMT8, etc.
> 
> [...]

Thank you very much for your feedback, Dietmar!

> 
> > Changes since v1:
> >  * Reworded commit message and inline comments for clarity.
> >  * Stated that this changeset does not impact SMT4 or SMT8.
> > ---
> >  kernel/sched/fair.c | 29 +++++++++--------------------
> >  1 file changed, 9 insertions(+), 20 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e4a0b8bd941c..18c672ff39ef 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8900,12 +8900,10 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> >  				    struct sched_group *sg)
> 
> I'm not sure why you change asym_smt_can_pull_tasks() together with
> removing SD_ASYM_PACKING from SMT level (patch 5/7)?

In x86 we have SD_ASYM_PACKING at the MC, CLS* and, before my patches, SMT
sched domains.

> 
> update_sg_lb_stats()
> 
>   ... && env->sd->flags & SD_ASYM_PACKING && .. && sched_asym()
>                                                    ^^^^^^^^^^^^
>     sched_asym()
> 
>       if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
>           (group->flags & SD_SHARE_CPUCAPACITY))
>         return asym_smt_can_pull_tasks()
>                ^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> So x86 won't have a sched domain with SD_SHARE_CPUCAPACITY and
> SD_ASYM_PACKING anymore. So sched_asym() would call sched_asym_prefer()
> directly on MC. What do I miss here?

asym_smt_can_pull_tasks() is used above the SMT level *and* when either the
local or sg sched groups are composed of CPUs that are SMT siblings.

In fact, asym_smt_can_pull_tasks() can only be called above the SMT level.
This is because the flags of a sched_group in a sched_domain are equal to
the flags of the child sched_domain. Since SMT is the lowest sched_domain,
its groups' flags are 0.
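
A minimal sketch of that inheritance (illustrative only; the real assignment
lives in kernel/sched/topology.c, added by 16d364ba6ef2):

	/*
	 * Sketch, not the literal kernel code: a sched_group takes the
	 * flags of its child sched_domain. SMT is the lowest domain and
	 * has no child, so its groups carry flags == 0 and can never
	 * have SD_SHARE_CPUCAPACITY set.
	 */
	sg->flags = sd->child ? sd->child->flags : 0;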

sched_asym() calls sched_asym_prefer() directly if balancing at the
SMT level and, at higher domains, if the child domain is not SMT.

This meets the requirement of Power7, where SMT siblings have different
priorities; and of x86, where physical cores have different priorities.

Thanks and BR,
Ricardo

* The target of these patches is Intel hybrid processors, on which cluster
  scheduling is disabled - cabdc3a8475b ("sched,x86: Don't use cluster
  topology for x86 hybrid CPUs"). Also, I have not observed topologies in
  which CPUs of the same cluster have different priorities.

  We are also looking into re-enabling cluster scheduling on hybrid
  topologies.
  
Dietmar Eggemann Dec. 21, 2022, 1:03 p.m. UTC | #3
On 12/12/2022 18:53, Ricardo Neri wrote:
> On Tue, Dec 06, 2022 at 06:22:41PM +0100, Dietmar Eggemann wrote:
>> On 22/11/2022 21:35, Ricardo Neri wrote:

[...]

>> I'm not sure why you change asym_smt_can_pull_tasks() together with
>> removing SD_ASYM_PACKING from SMT level (patch 5/7)?
> 
> In x86 we have SD_ASYM_PACKING at the MC, CLS* and, before my patches, SMT
> sched domains.
> 
>>
>> update_sg_lb_stats()
>>
>>   ... && env->sd->flags & SD_ASYM_PACKING && .. && sched_asym()
>>                                                    ^^^^^^^^^^^^
>>     sched_asym()
>>
>>       if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
>>           (group->flags & SD_SHARE_CPUCAPACITY))
>>         return asym_smt_can_pull_tasks()
>>                ^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> So x86 won't have a sched domain with SD_SHARE_CPUCAPACITY and
>> SD_ASYM_PACKING anymore. So sched_asym() would call sched_asym_prefer()
>> directly on MC. What do I miss here?
> 
> asym_smt_can_pull_tasks() is used above the SMT level *and* when either the
> local or sg sched groups are composed of CPUs that are SMT siblings.

OK.

> In fact, asym_smt_can_pull_tasks() can only be called above the SMT level.
> This is because the flags of a sched_group in a sched_domain are equal to
> the flags of the child sched_domain. Since SMT is the lowest sched_domain,
> its groups' flags are 0.

I see. I forgot about `[PATCH v5 0/6] sched/fair: Fix load balancing of
SMT siblings with ASYM_PACKING` from Sept 21 (specifically [PATCH v5
2/6] sched/topology: Introduce sched_group::flags).

> sched_asym() calls sched_asym_prefer() directly if balancing at the
> SMT level and, at higher domains, if the child domain is not SMT.

OK.

> This meets the requirement of Power7, where SMT siblings have different
> priorities; and of x86, where physical cores have different priorities.
> 
> Thanks and BR,
> Ricardo
> 
> * The target of these patches is Intel hybrid processors, on which cluster
>   scheduling is disabled - cabdc3a8475b ("sched,x86: Don't use cluster
>   topology for x86 hybrid CPUs"). Also, I have not observed topologies in
>   which CPUs of the same cluster have different priorities.

OK.

IMHO, the 3rd and 4th paragraphs of the asym_smt_can_pull_tasks() function
header (`If both @dst_cpu and @sg have SMT siblings` and `If @sg does not
have SMT siblings`) still reflect the old code layout.
  
Ricardo Neri Dec. 22, 2022, 4:32 a.m. UTC | #4
On Wed, Dec 21, 2022 at 02:03:15PM +0100, Dietmar Eggemann wrote:
> On 12/12/2022 18:53, Ricardo Neri wrote:
> > On Tue, Dec 06, 2022 at 06:22:41PM +0100, Dietmar Eggemann wrote:
> >> On 22/11/2022 21:35, Ricardo Neri wrote:
> 
> [...]
> 
> >> I'm not sure why you change asym_smt_can_pull_tasks() together with
> >> removing SD_ASYM_PACKING from SMT level (patch 5/7)?
> > 
> > In x86 we have SD_ASYM_PACKING at the MC, CLS* and, before my patches, SMT
> > sched domains.
> > 
> >>
> >> update_sg_lb_stats()
> >>
> >>   ... && env->sd->flags & SD_ASYM_PACKING && .. && sched_asym()
> >>                                                    ^^^^^^^^^^^^
> >>     sched_asym()
> >>
> >>       if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> >>           (group->flags & SD_SHARE_CPUCAPACITY))
> >>         return asym_smt_can_pull_tasks()
> >>                ^^^^^^^^^^^^^^^^^^^^^^^^^
> >>
> >> So x86 won't have a sched domain with SD_SHARE_CPUCAPACITY and
> >> SD_ASYM_PACKING anymore. So sched_asym() would call sched_asym_prefer()
> >> directly on MC. What do I miss here?
> > 
> > asym_smt_can_pull_tasks() is used above the SMT level *and* when either the
> > local or sg sched groups are composed of CPUs that are SMT siblings.
> 
> OK.
> 
> > In fact, asym_smt_can_pull_tasks() can only be called above the SMT level.
> > This is because the flags of a sched_group in a sched_domain are equal to
> > the flags of the child sched_domain. Since SMT is the lowest sched_domain,
> > its groups' flags are 0.
> 
> I see. I forgot about `[PATCH v5 0/6] sched/fair: Fix load balancing of
> SMT siblings with ASYM_PACKING` from Sept 21 (specifically [PATCH v5
> 2/6] sched/topology: Introduce sched_group::flags).
> 
> > sched_asym() calls sched_asym_prefer() directly if balancing at the
> > SMT level and, at higher domains, if the child domain is not SMT.
> 
> OK.
> 
> > This meets the requirement of Power7, where SMT siblings have different
> > priorities; and of x86, where physical cores have different priorities.
> > 
> > Thanks and BR,
> > Ricardo
> > 
> > * The target of these patches is Intel hybrid processors, on which cluster
> >   scheduling is disabled - cabdc3a8475b ("sched,x86: Don't use cluster
> >   topology for x86 hybrid CPUs"). Also, I have not observed topologies in
> >   which CPUs of the same cluster have different priorities.
> 
> OK.
> 
> IMHO, the function header of asym_smt_can_pull_tasks() (3rd and 4th
> paragraph ...  `If both @dst_cpu and @sg have SMT siblings` and

Agreed. I changed the behavior of the function. I will update the
description.

>`If @sg does not have SMT siblings` still reflect the old code layout.

But this behavior did not change. The check covers both SMT and non-SMT
cases:

	 /*
	  * A non-SMT @sg can only have 1 busy CPU. We only care that an
	  * SMT @sg has exactly one busy sibling.
	  */
	if (sg_busy_cpus == 1 &&
	    /* local group is fully idle, SMT and non-SMT. */
	    !sds->local_stat.sum_nr_running)

Perhaps I can collapse the two paragraphs into one.

Thanks and BR,
Ricardo
  
Dietmar Eggemann Dec. 22, 2022, 11:12 a.m. UTC | #5
On 22/12/2022 05:32, Ricardo Neri wrote:
> On Wed, Dec 21, 2022 at 02:03:15PM +0100, Dietmar Eggemann wrote:
>> On 12/12/2022 18:53, Ricardo Neri wrote:
>>> On Tue, Dec 06, 2022 at 06:22:41PM +0100, Dietmar Eggemann wrote:
>>>> On 22/11/2022 21:35, Ricardo Neri wrote:
>>
>> [...]
>>
>>>> I'm not sure why you change asym_smt_can_pull_tasks() together with
>>>> removing SD_ASYM_PACKING from SMT level (patch 5/7)?
>>>
>>> In x86 we have SD_ASYM_PACKING at the MC, CLS* and, before my patches, SMT
>>> sched domains.
>>>
>>>>
>>>> update_sg_lb_stats()
>>>>
>>>>   ... && env->sd->flags & SD_ASYM_PACKING && .. && sched_asym()
>>>>                                                    ^^^^^^^^^^^^
>>>>     sched_asym()
>>>>
>>>>       if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
>>>>           (group->flags & SD_SHARE_CPUCAPACITY))
>>>>         return asym_smt_can_pull_tasks()
>>>>                ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>
>>>> So x86 won't have a sched domain with SD_SHARE_CPUCAPACITY and
>>>> SD_ASYM_PACKING anymore. So sched_asym() would call sched_asym_prefer()
>>>> directly on MC. What do I miss here?
>>>
>>> asym_smt_can_pull_tasks() is used above the SMT level *and* when either the
>>> local or sg sched groups are composed of CPUs that are SMT siblings.
>>
>> OK.
>>
>>> In fact, asym_smt_can_pull_tasks() can only be called above the SMT level.
>>> This is because the flags of a sched_group in a sched_domain are equal to
>>> the flags of the child sched_domain. Since SMT is the lowest sched_domain,
>>> its groups' flags are 0.
>>
>> I see. I forgot about `[PATCH v5 0/6] sched/fair: Fix load balancing of
>> SMT siblings with ASYM_PACKING` from Sept 21 (specifically [PATCH v5
>> 2/6] sched/topology: Introduce sched_group::flags).
>>
>>> sched_asym() calls sched_asym_prefer() directly if balancing at the
>>> SMT level and, at higher domains, if the child domain is not SMT.
>>
>> OK.
>>
>>> This meets the requirement of Power7, where SMT siblings have different
>>> priorities; and of x86, where physical cores have different priorities.
>>>
>>> Thanks and BR,
>>> Ricardo
>>>
>>> * The target of these patches is Intel hybrid processors, on which cluster
>>>   scheduling is disabled - cabdc3a8475b ("sched,x86: Don't use cluster
>>>   topology for x86 hybrid CPUs"). Also, I have not observed topologies in
>>>   which CPUs of the same cluster have different priorities.
>>
>> OK.
>>
>> IMHO, the function header of asym_smt_can_pull_tasks() (3rd and 4th
>> paragraph ...  `If both @dst_cpu and @sg have SMT siblings` and
> 
> Agreed. I changed the behavior of the function. I will update the
> description.
> 
>> `If @sg does not have SMT siblings` still reflect the old code layout.
> 
> But this behavior did not change. The check covers both SMT and non-SMT
> cases:

The condition to call sched_asym_prefer() seems to have changed slightly
though (including the explanation that busy_cpus_delta >= 2 handling
should be done by fbg()):

sds->local_stat.sum_nr_running (A)
busy_cpus_delta = sg_busy_cpus - local_busy_cpus (B)
sg_busy_cpus = sgs->group_weight - sgs->idle_cpus (C)

From ((smt && B == 1) || (!smt && !A)) to (C == 1 && !A)

> 
> 	 /*
> 	  * A non-SMT @sg can only have 1 busy CPU. We only care that an
> 	  * SMT @sg has exactly one busy sibling.
> 	  */
> 	if (sg_busy_cpus == 1 &&
> 	    /* local group is fully idle, SMT and non-SMT. */
> 	    !sds->local_stat.sum_nr_running)
> 
> Perhaps I can collapse the two paragraphs into one.

Sounds good to me.

[...]
  
Valentin Schneider Dec. 22, 2022, 4:55 p.m. UTC | #6
On 22/11/22 12:35, Ricardo Neri wrote:
> @@ -8926,25 +8924,16 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
>               return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
>       }
>
> -	/* @dst_cpu has SMT siblings. */
> -
> -	if (sg_is_smt) {
> -		int local_busy_cpus = sds->local->group_weight -
> -				      sds->local_stat.idle_cpus;
> -		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> -
> -		if (busy_cpus_delta == 1)
> -			return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> -
> -		return false;
> -	}
> -
>       /*
> -	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
> -	 * up with more than one busy SMT sibling and only pull tasks if there
> -	 * are not busy CPUs (i.e., no CPU has running tasks).
> +	 * @dst_cpu has SMT siblings. Do asym_packing load balancing only if
> +	 * all its siblings are idle (moving tasks between physical cores in
> +	 * which some SMT siblings are busy results in the same throughput).
> +	 *
> +	 * If the difference in the number of busy CPUs is two or more, let
> +	 * find_busiest_group() take care of it. We only care if @sg has
> +	 * exactly one busy CPU. This covers SMT and non-SMT sched groups.
>        */
> -	if (!sds->local_stat.sum_nr_running)
> +	if (sg_busy_cpus == 1 && !sds->local_stat.sum_nr_running)
>               return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
>

Some of this is new to me - I had missed the original series introducing
this. However ISTM that this is conflating two concepts: SMT occupancy
balancing, and asym packing.

Take the !local_is_smt :: sg_busy_cpus >= 2 :: return true; path. It does
not involve asym packing priorities at all. This can end up in an
ASYM_PACKING load balance, which per calculate_imbalance() tries to move
*all* tasks to the higher priority CPU - in the case of SMT balancing,
we don't want to totally empty the core, just its siblings.
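
For reference, the calculate_imbalance() path referred to here looks roughly
like this (paraphrased from kernel/sched/fair.c of that era, so treat it as
a sketch):

	if (busiest->group_type == group_asym_packing) {
		/*
		 * In case of asym packing, migrate all the running tasks
		 * of the busiest group towards the preferred CPU.
		 */
		env->migration_type = migrate_task;
		env->imbalance = busiest->sum_h_nr_running;
		return;
	}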

Is there an ITMT/big.LITTLE (or however x86 calls it) case that invalidates
the above?

Say, what's not sufficient with the below? AFAICT the only thing that isn't
covered is the sg_busy_cpus >= 2 thing, but IMO that's SMT balancing, not
asym packing - if the current calculate_imbalance() doesn't do it, it
should be fixed to do it. Looking at the

  local->group_type == group_has_spare

case, it looks like it should DTRT.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 224107278471f..15eb2d3cff186 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9176,12 +9176,15 @@ static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
 	   struct sched_group *group)
 {
-	/* Only do SMT checks if either local or candidate have SMT siblings */
-	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
-	    (group->flags & SD_SHARE_CPUCAPACITY))
-		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
+	/*
+	 * For SMT, env->idle != CPU_NOT_IDLE isn't sufficient, we need to make
+	 * sure the whole core is idle.
+	 */
+	if (((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
+	     (group->flags & SD_SHARE_CPUCAPACITY)) &&
+	    !test_idle_cores(env->dst_cpu))
+		return false;
 
-	/* Neither env::dst_cpu nor group::asym_prefer_cpu have SMT siblings. */
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu, false);
 }
  
Ricardo Neri Dec. 23, 2022, 1:11 p.m. UTC | #7
On Thu, Dec 22, 2022 at 12:12:00PM +0100, Dietmar Eggemann wrote:
> On 22/12/2022 05:32, Ricardo Neri wrote:
> > On Wed, Dec 21, 2022 at 02:03:15PM +0100, Dietmar Eggemann wrote:
> >> On 12/12/2022 18:53, Ricardo Neri wrote:
> >>> On Tue, Dec 06, 2022 at 06:22:41PM +0100, Dietmar Eggemann wrote:
> >>>> On 22/11/2022 21:35, Ricardo Neri wrote:
> >>
> >> [...]
> >>
> >>>> I'm not sure why you change asym_smt_can_pull_tasks() together with
> >>>> removing SD_ASYM_PACKING from SMT level (patch 5/7)?
> >>>
> >>> In x86 we have SD_ASYM_PACKING at the MC, CLS* and, before my patches, SMT
> >>> sched domains.
> >>>
> >>>>
> >>>> update_sg_lb_stats()
> >>>>
> >>>>   ... && env->sd->flags & SD_ASYM_PACKING && .. && sched_asym()
> >>>>                                                    ^^^^^^^^^^^^
> >>>>     sched_asym()
> >>>>
> >>>>       if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> >>>>           (group->flags & SD_SHARE_CPUCAPACITY))
> >>>>         return asym_smt_can_pull_tasks()
> >>>>                ^^^^^^^^^^^^^^^^^^^^^^^^^
> >>>>
> >>>> So x86 won't have a sched domain with SD_SHARE_CPUCAPACITY and
> >>>> SD_ASYM_PACKING anymore. So sched_asym() would call sched_asym_prefer()
> >>>> directly on MC. What do I miss here?
> >>>
> >>> asym_smt_can_pull_tasks() is used above the SMT level *and* when either the
> >>> local or sg sched groups are composed of CPUs that are SMT siblings.
> >>
> >> OK.
> >>
> >>> In fact, asym_smt_can_pull_tasks() can only be called above the SMT level.
> >>> This is because the flags of a sched_group in a sched_domain are equal to
> >>> the flags of the child sched_domain. Since SMT is the lowest sched_domain,
> >>> its groups' flags are 0.
> >>
> >> I see. I forgot about `[PATCH v5 0/6] sched/fair: Fix load balancing of
> >> SMT siblings with ASYM_PACKING` from Sept 21 (specifically [PATCH v5
> >> 2/6] sched/topology: Introduce sched_group::flags).
> >>
> >>> sched_asym() calls sched_asym_prefer() directly if balancing at the
> >>> SMT level and, at higher domains, if the child domain is not SMT.
> >>
> >> OK.
> >>
> >>> This meets the requirement of Power7, where SMT siblings have different
> >>> priorities; and of x86, where physical cores have different priorities.
> >>>
> >>> Thanks and BR,
> >>> Ricardo
> >>>
> >>> * The target of these patches is Intel hybrid processors, on which cluster
> >>>   scheduling is disabled - cabdc3a8475b ("sched,x86: Don't use cluster
> >>>   topology for x86 hybrid CPUs"). Also, I have not observed topologies in
> >>>   which CPUs of the same cluster have different priorities.
> >>
> >> OK.
> >>
> >> IMHO, the function header of asym_smt_can_pull_tasks() (3rd and 4th
> >> paragraph ...  `If both @dst_cpu and @sg have SMT siblings` and
> > 
> > Agreed. I changed the behavior of the function. I will update the
> > description.
> > 
> >> `If @sg does not have SMT siblings` still reflect the old code layout.
> > 
> > But this behavior did not change. The check covers both SMT and non-SMT
> > cases:
> 
> The condition to call sched_asym_prefer() seems to have changed slightly
> though (including the explanation that busy_cpus_delta >= 2 handling
> should be done by fbg()):
> 
> sds->local_stat.sum_nr_running (A)
> busy_cpus_delta = sg_busy_cpus - local_busy_cpus (B)
> sg_busy_cpus = sgs->group_weight - sgs->idle_cpus (C)
> 
> From ((smt && B == 1) || (!smt && !A)) to (C == 1 && !A)

I agree that (smt && B == 1) did change and I will update the comment.

My point is that (!smt && !A) is equivalent to (C == 1 && !A) if @sg has
only one CPU and is busy. The fourth paragraph still stands.
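
Spelled out with Dietmar's shorthand (a sketch of the reasoning, not code
from the patch):

	/*
	 * For the case in question, a non-SMT @sg has group_weight == 1,
	 * so C = sgs->group_weight - sgs->idle_cpus is 1 exactly when its
	 * lone CPU is busy. update_sg_lb_stats() only calls sched_asym()
	 * for busy groups (it checks sgs->sum_h_nr_running), hence C == 1
	 * holds wherever the old (!smt && !A) branch used to fire.
	 */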

> 
> > 
> > 	 /*
> > 	  * A non-SMT @sg can only have 1 busy CPU. We only care that an
> > 	  * SMT @sg has exactly one busy sibling.
> > 	  */
> > 	if (sg_busy_cpus == 1 &&
> > 	    /* local group is fully idle, SMT and non-SMT. */
> > 	    !sds->local_stat.sum_nr_running)
> > 
> > Perhaps I can collapse the two paragraphs into one.
> 
> Sounds good to me.

Will do.

Thanks and BR,
Ricardo
  
Ricardo Neri Dec. 29, 2022, 4 a.m. UTC | #8
On Thu, Dec 22, 2022 at 04:55:58PM +0000, Valentin Schneider wrote:
> On 22/11/22 12:35, Ricardo Neri wrote:
> > @@ -8926,25 +8924,16 @@ static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> >               return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> >       }
> >
> > -	/* @dst_cpu has SMT siblings. */
> > -
> > -	if (sg_is_smt) {
> > -		int local_busy_cpus = sds->local->group_weight -
> > -				      sds->local_stat.idle_cpus;
> > -		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > -
> > -		if (busy_cpus_delta == 1)
> > -			return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > -
> > -		return false;
> > -	}
> > -
> >       /*
> > -	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
> > -	 * up with more than one busy SMT sibling and only pull tasks if there
> > -	 * are not busy CPUs (i.e., no CPU has running tasks).
> > +	 * @dst_cpu has SMT siblings. Do asym_packing load balancing only if
> > +	 * all its siblings are idle (moving tasks between physical cores in
> > +	 * which some SMT siblings are busy results in the same throughput).
> > +	 *
> > +	 * If the difference in the number of busy CPUs is two or more, let
> > +	 * find_busiest_group() take care of it. We only care if @sg has
> > +	 * exactly one busy CPU. This covers SMT and non-SMT sched groups.
> >        */
> > -	if (!sds->local_stat.sum_nr_running)
> > +	if (sg_busy_cpus == 1 && !sds->local_stat.sum_nr_running)
> >               return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> >
> 
> Some of this is new to me - I had missed the original series introducing
> this. However ISTM that this is conflating two concepts: SMT occupancy
> balancing, and asym packing.
> 
> Take the !local_is_smt :: sg_busy_cpus >= 2 :: return true; path. It does
> not involve asym packing priorities at all. This can end up in an
> ASYM_PACKING load balance,

Yes, this is the desired result: an idle low-priority CPU should help a high-
priority core with more than one busy SMT sibling. But yes, it does not
relate to priorities and can be implemented differently.

> which per calculate_imbalance() tries to move
> *all* tasks to the higher priority CPU - in the case of SMT balancing,
> we don't want to totally empty the core, just its siblings.

But it will not empty the core, only one of its SMT siblings. A single
sibling will be selected in find_busiest_queue(). The other siblings will
be unaffected.

> 
> Is there an ITMT/big.LITTLE (or however x86 calls it) case that invalidates
> the above?

Please see below.

> 
> Say, what's not sufficient with the below? AFAICT the only thing that isn't
> covered is the sg_busy_cpus >= 2 thing, but IMO that's SMT balancing, not
> asym packing - if the current calculate_imbalance() doesn't do it, it
> should be fixed to do it.

Agreed.

>Looking at the
> 
>   local->group_type == group_has_spare
> 
> case, it looks like it should DTRT.

I had tried (and failed) to have find_busiest_group() handle the
!local_is_smt :: sg_busy_cpus >= 2 case. Now I think I made it work.

When the busiest group is not overloaded and local->group_type ==
group_has_spare, find_busiest_group() during an idle load balance evens out
the number of *idle* CPUs. However, this does not work if the local and busiest
groups have different weights. In SMT2, for instance, if busiest has 2 busy
CPUs (i.e., 0 idle CPUs) and the destination CPU is idle, the difference in
the number of idle CPUs is 1. find_busiest_group() will incorrectly goto
out_balanced.

This issue is very visible in Intel hybrid processors because the big cores
have SMT but the small cores do not. It can, however, be reproduced in non-
hybrid processors by offlining the SMT siblings of some cores.

The problem can be fixed by instead balancing the number of *busy* CPUs,
which is what in general we want, IMO. (When two groups have the same
weight, it is equivalent to balancing the number of idle CPUs).
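
The equal-weight equivalence is simple arithmetic (sketch):

	/*
	 * With busy = group_weight - idle_cpus and equal weights W:
	 *   busiest_busy - local_busy
	 *     = (W - busiest->idle_cpus) - (W - local->idle_cpus)
	 *     = local->idle_cpus - busiest->idle_cpus
	 * i.e. the same imbalance as before, while the busy-CPU form
	 * stays meaningful when the two weights differ.
	 */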

This patch worked for me:

@@ -9787,14 +9787,18 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff;
 		} else {
+			unsigned int busiest_busy_cpus, local_busy_cpus;
+
+			busiest_busy_cpus = busiest->group_weight - busiest->idle_cpus;
+			local_busy_cpus = local->group_weight - local->idle_cpus;
 
 			/*
 			 * If there is no overload, we just want to even the number of
-			 * idle cpus.
+			 * busy cpus.
 			 */
 			env->migration_type = migrate_task;
-			env->imbalance = max_t(long, 0,
-					       (local->idle_cpus - busiest->idle_cpus));
+			env->imbalance = max_t(long, 0 ,
+					       (busiest_busy_cpus -  local_busy_cpus));
 		}
 
 #ifdef CONFIG_NUMA
@@ -9981,18 +9985,24 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 			 */
 			goto out_balanced;
 
-		if (busiest->group_weight > 1 &&
-		    local->idle_cpus <= (busiest->idle_cpus + 1))
-			/*
-			 * If the busiest group is not overloaded
-			 * and there is no imbalance between this and busiest
-			 * group wrt idle CPUs, it is balanced. The imbalance
-			 * becomes significant if the diff is greater than 1
-			 * otherwise we might end up to just move the imbalance
-			 * on another group. Of course this applies only if
-			 * there is more than 1 CPU per group.
-			 */
-			goto out_balanced;
+		if (busiest->group_weight > 1) {
+			unsigned int local_busy_cpus, busiest_busy_cpus;
+
+			local_busy_cpus = local->group_weight - local->idle_cpus;
+			busiest_busy_cpus = busiest->group_weight - busiest->idle_cpus;
+
+			if (busiest_busy_cpus <= local_busy_cpus + 1)
+				/*
+				 * If the busiest group is not overloaded
+				 * and there is no imbalance between this and busiest
+				 * group wrt busy CPUs, it is balanced. The imbalance
+				 * becomes significant if the diff is greater than 1
+				 * otherwise we might end up to just move the imbalance
+				 * on another group. Of course this applies only if
+				 * there is more than 1 CPU per group.
+				 */
+				goto out_balanced;
+		}
 
 		if (busiest->sum_h_nr_running == 1)
 			/*

With this I can remove the sg_busy_cpus >=2 thing from asym_smt_can_pull_tasks().

> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 224107278471f..15eb2d3cff186 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9176,12 +9176,15 @@ static inline bool
>  sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
>  	   struct sched_group *group)
>  {
> -	/* Only do SMT checks if either local or candidate have SMT siblings */
> -	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> -	    (group->flags & SD_SHARE_CPUCAPACITY))
> -		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +	/*
> +	 * For SMT, env->idle != CPU_NOT_IDLE isn't sufficient, we need to make
> +	 * sure the whole core is idle.
> +	 */
> +	if (((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> +	     (group->flags & SD_SHARE_CPUCAPACITY)) &&
> +	    !test_idle_cores(env->dst_cpu))

But test_idle_cores() tests for *any* idle core in the highest domain with the
SD_SHARE_PKG_RESOURCES flag. Here we are only interested in the SMT siblings
of env->dst_cpu. If is_core_idle(env->dst_cpu) is used, then I agree with the
change.
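
For context, is_core_idle() (as defined in kernel/sched/fair.c at the time,
under CONFIG_NUMA_BALANCING) only looks at the SMT siblings of the given
CPU:

	static inline bool is_core_idle(int cpu)
	{
	#ifdef CONFIG_SCHED_SMT
		int sibling;

		for_each_cpu(sibling, cpu_smt_mask(cpu)) {
			if (cpu == sibling)
				continue;

			if (!idle_cpu(sibling))
				return false;
		}
	#endif

		return true;
	}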

Thanks and BR,
Ricardo
  
Valentin Schneider Jan. 11, 2023, 4:04 p.m. UTC | #9
On 28/12/22 20:00, Ricardo Neri wrote:
> On Thu, Dec 22, 2022 at 04:55:58PM +0000, Valentin Schneider wrote:
>> Some of this is new to me - I had missed the original series introducing
>> this. However ISTM that this is conflating two concepts: SMT occupancy
>> balancing, and asym packing.
>> 
>> Take the !local_is_smt :: sg_busy_cpus >= 2 :: return true; path. It does
>> not involve asym packing priorities at all. This can end up in an
>> ASYM_PACKING load balance,
>
> Yes, this the desired result: an idle low-priority CPU should help a high-
> priority core with more than one busy SMT sibling. But yes, it does not
> relate to priorities and can be implemented differently.
>
>> which per calculate_imbalance() tries to move
>> *all* tasks to the higher priority CPU - in the case of SMT balancing,
>> we don't want to totally empty the core, just its siblings.
>
> But it will not empty the core, only one of its SMT siblings. A single
> sibling will be selected in find_busiest_queue(). The other siblings will
> be unaffected.
>

Right

>> 
>> Is there an ITMT/big.LITTLE (or however x86 calls it) case that invalidates
>> the above?
>
> Please see below.
>
>> 
>> Say, what's not sufficient with the below? AFAICT the only thing that isn't
>> covered is the sg_busy_cpus >= 2 thing, but IMO that's SMT balancing, not
>> asym packing - if the current calculate_imbalance() doesn't do it, it
>> should be fixed to do it.
>
> Agreed.
>
>>Looking at the
>> 
>>   local->group_type == group_has_spare
>> 
>> case, it looks like it should DTRT.
>
> I had tried (and failed) to have find_busiest_group() handle the
> !local_is_smt :: sg_busy_cpus >= 2 case. Now I think I made it work.
>
> When the busiest group is not overloaded and local->group_type ==
> group_has_spare, find_busiest_group() during an idle load balance evens out
> the number of *idle* CPUs. However, this does not work if the local and busiest
> groups have different weights. In SMT2, for instance, if busiest has 2 busy
> CPUs (i.e., 0 idle CPUs) and the destination CPU is idle, the difference in
> the number of idle CPUs is 1. find_busiest_group() will incorrectly goto
> out_balanced.
>
> This issue is very visible in Intel hybrid processors because the big cores
> have SMT but the small cores do not. It can, however, be reproduced in non-
> hybrid processors by offlining the SMT siblings of some cores.
>

I think I follow. If we're comparing two groups each spanning an SMT2 core,
then

  busiest->group_weight > 1 && local->idle_cpus <= (busiest->idle_cpus + 1)

is false if local is fully idle and busiest fully busy, but if local
becomes a non-SMT core, then that's true and we goto out_balanced.
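
Plugging numbers into that check (illustrative):

	/*
	 * local->idle_cpus <= busiest->idle_cpus + 1, with busiest a
	 * fully busy SMT2 core (idle_cpus == 0):
	 *   SMT2 local, fully idle:  2 <= 1 -> false, balancing proceeds
	 *   non-SMT local, idle:     1 <= 1 -> true, goto out_balanced
	 */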


With that said, shouldn't SD_PREFER_SIBLING help here? cf.

	if (sds.prefer_sibling && local->group_type == group_has_spare &&
	    busiest->sum_nr_running > local->sum_nr_running + 1)

It should be set on any topology level below the NUMA ones; we do remove it
on SD_ASYM_CPUCAPACITY levels because it used to interfere with misfit
balancing (it would override the group_type). Things are a bit different
since Vincent's rewrite of load_balance(), but I think we still want it off
there. I would expect it to be set in your system, though whether this is
playing nice with the asymmetry is another matter :-)

> The problem can be fixed by instead balancing the number of *busy* CPUs,
> which is what in general we want, IMO. (When two groups have the same
> weight, it is equivalent to balancing the number of idle CPUs).
>
> This patch worked for me:
>
> @@ -9787,14 +9787,18 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff;
>  		} else {
> +			unsigned int busiest_busy_cpus, local_busy_cpus;
> +
> +			busiest_busy_cpus = busiest->group_weight - busiest->idle_cpus;
> +			local_busy_cpus = local->group_weight - local->idle_cpus;
>  
>  			/*
>  			 * If there is no overload, we just want to even the number of
> -			 * idle cpus.
> +			 * busy cpus.
>  			 */
>  			env->migration_type = migrate_task;
> -			env->imbalance = max_t(long, 0,
> -					       (local->idle_cpus - busiest->idle_cpus));
> +			env->imbalance = max_t(long, 0 ,
> +					       (busiest_busy_cpus -  local_busy_cpus));
>  		}
>  
>  #ifdef CONFIG_NUMA
> @@ -9981,18 +9985,24 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  			 */
>  			goto out_balanced;
>  
> -		if (busiest->group_weight > 1 &&
> -		    local->idle_cpus <= (busiest->idle_cpus + 1))
> -			/*
> -			 * If the busiest group is not overloaded
> -			 * and there is no imbalance between this and busiest
> -			 * group wrt idle CPUs, it is balanced. The imbalance
> -			 * becomes significant if the diff is greater than 1
> -			 * otherwise we might end up to just move the imbalance
> -			 * on another group. Of course this applies only if
> -			 * there is more than 1 CPU per group.
> -			 */
> -			goto out_balanced;
> +		if (busiest->group_weight > 1) {
> +			unsigned int local_busy_cpus, busiest_busy_cpus;
> +
> +			local_busy_cpus = local->group_weight - local->idle_cpus;
> +			busiest_busy_cpus = busiest->group_weight - busiest->idle_cpus;
> +
> +			if (busiest_busy_cpus <= local_busy_cpus + 1)
> +				/*
> +				 * If the busiest group is not overloaded
> +				 * and there is no imbalance between this and busiest
> +				 * group wrt busy CPUs, it is balanced. The imbalance
> +				 * becomes significant if the diff is greater than 1
> +				 * otherwise we might end up to just move the imbalance
> +				 * on another group. Of course this applies only if
> +				 * there is more than 1 CPU per group.
> +				 */
> +				goto out_balanced;
> +		}
>  
>  		if (busiest->sum_h_nr_running == 1)
>  			/*
>
> With this I can remove the sg_busy_cpus >=2 thing from asym_smt_can_pull_tasks().
>
>> 
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 224107278471f..15eb2d3cff186 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9176,12 +9176,15 @@ static inline bool
>>  sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
>>  	   struct sched_group *group)
>>  {
>> -	/* Only do SMT checks if either local or candidate have SMT siblings */
>> -	if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
>> -	    (group->flags & SD_SHARE_CPUCAPACITY))
>> -		return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
>> +	/*
>> +	 * For SMT, env->idle != CPU_NOT_IDLE isn't sufficient, we need to make
>> +	 * sure the whole core is idle.
>> +	 */
>> +	if (((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
>> +	     (group->flags & SD_SHARE_CPUCAPACITY)) &&
>> +	    !test_idle_cores(env->dst_cpu))
>
> But test_idle_cores() tests for *any* idle core in the highest domain with the
> SD_SHARE_PKG_RESOURCES flag. Here we are only interested in the SMT siblings
> of env->dst_cpu. If is_core_idle(env->dst_cpu) is used, then I agree with the
> change.
>

Right, I had gotten confused with test_idle_cores()

> Thanks and BR,
> Ricardo
  
Ricardo Neri Jan. 13, 2023, 7:02 p.m. UTC | #10
On Wed, Jan 11, 2023 at 04:04:23PM +0000, Valentin Schneider wrote:
> On 28/12/22 20:00, Ricardo Neri wrote:
> > On Thu, Dec 22, 2022 at 04:55:58PM +0000, Valentin Schneider wrote:
> >> Some of this is new to me - I had missed the original series introducing
> >> this. However ISTM that this is conflating two concepts: SMT occupancy
> >> balancing, and asym packing.
> >> 
> >> Take the !local_is_smt :: sg_busy_cpus >= 2 :: return true; path. It does
> >> not involve asym packing priorities at all. This can end up in an
> >> ASYM_PACKING load balance,
> >
> > Yes, this is the desired result: an idle low-priority CPU should help a high-
> > priority core with more than one busy SMT sibling. But yes, it does not
> > relate to priorities and can be implemented differently.
> >
> >> which per calculate_imbalance() tries to move
> >> *all* tasks to the higher priority CPU - in the case of SMT balancing,
> >> we don't want to totally empty the core, just its siblings.
> >
> > But it will not empty the core, only one of its SMT siblings. A single
> > sibling will be selected in find_busiest_queue(). The other siblings will
> > be unaffected.
> >
> 
> Right
> 
> >> 
> >> Is there an ITMT/big.LITTLE (or however x86 calls it) case that invalidates
> >> the above?
> >
> > Please see below.
> >
> >> 
> >> Say, what's not sufficient with the below? AFAICT the only thing that isn't
> >> covered is the sg_busy_cpus >= 2 thing, but IMO that's SMT balancing, not
> >> asym packing - if the current calculate_imbalance() doesn't do it, it
> >> should be fixed to do it.
> >
> > Agreed.
> >
> >>Looking at the
> >> 
> >>   local->group_type == group_has_spare
> >> 
> >> case, it looks like it should DTRT.
> >
> > I had tried (and failed) to have find_busiest_group() handle the
> > !local_is_smt :: sg_busy_cpus >= 2 case. Now I think I made it work.
> >
> > When the busiest group is not overloaded and local->group_type ==
> > group_has_spare, find_busiest_group() during an idle load balance evens out
> > the number of *idle* CPUs. However, this does not work if the local and busiest
> > groups have different weights. In SMT2, for instance, if busiest has 2 busy
> > CPUs (i.e., 0 idle CPUs) and the destination CPU is idle, the difference in
> > the number of idle CPUs is 1. find_busiest_group() will incorrectly goto
> > out_balanced.
> >
> > This issue is very visible in Intel hybrid processors because the big cores
> > have SMT but the small cores do not. It can, however, be reproduced in non-
> > hybrid processors by offlining the SMT siblings of some cores.
> >
> 
> I think I follow. If we're comparing two groups each spanning an SMT2 core,
> then
> 
>   busiest->group_weight > 1 && local->idle_cpus <= (busiest->idle_cpus + 1)
> 
> is false if local is fully idle and busiest fully busy, but if local
> becomes a non-SMT core, then that's true and we goto out_balanced.

Exactly right.

> 
> 
> With that said, shouldn't SD_PREFER_SIBLING help here? cf.
> 
> 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> 	    busiest->sum_nr_running > local->sum_nr_running + 1)

It does not help because sds.prefer_sibling is false: a non-SMT core is
looking into offloading a fully_busy SMT core at the "MC" domain.
sds.prefer_sibling is set in update_sd_lb_stats() if the sched domain's child
has the SD_PREFER_SIBLING flag. Since the destination CPU is the non-SMT
core, there is no child.
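
The assignment in question, as it stood before the change (also visible in
the hunk later in this thread):

	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;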

> 
> It should be set on any topology level below the NUMA ones, we do remove it
> on SD_ASYM_CPUCAPACITY levels because this used to interfere with misfit
> balancing (it would override the group_type), things are a bit different
> since Vincent's rewrite of load_balance() but I think we still want it off
> there.

I see in find_busiest_group() that group_misfit_task is evaluated before
sds.prefer_sibling.

> I would expect it to be set in your system, though whether this is
> playing nice with the asymmetry is another matter :-)

I recall a few instances of SD_PREFER_SIBLING causing trouble for me, but I
need to investigate more.

Thanks and BR,
Ricardo
  
Ricardo Neri Jan. 16, 2023, 4:05 a.m. UTC | #11
On Fri, Jan 13, 2023 at 11:02:26AM -0800, Ricardo Neri wrote:
> On Wed, Jan 11, 2023 at 04:04:23PM +0000, Valentin Schneider wrote:
> > On 28/12/22 20:00, Ricardo Neri wrote:
> > > On Thu, Dec 22, 2022 at 04:55:58PM +0000, Valentin Schneider wrote:
> > >> Some of this is new to me - I had missed the original series introducing
> > >> this. However ISTM that this is conflating two concepts: SMT occupancy
> > >> balancing, and asym packing.
> > >> 
> > >> Take the !local_is_smt :: sg_busy_cpus >= 2 :: return true; path. It does
> > >> not involve asym packing priorities at all. This can end up in an
> > >> ASYM_PACKING load balance,
> > >
> > > Yes, this is the desired result: an idle low-priority CPU should help a high-
> > > priority core with more than one busy SMT sibling. But yes, it does not
> > > relate to priorities and can be implemented differently.
> > >
> > >> which per calculate_imbalance() tries to move
> > >> *all* tasks to the higher priority CPU - in the case of SMT balancing,
> > >> we don't want to totally empty the core, just its siblings.
> > >
> > > But it will not empty the core, only one of its SMT siblings. A single
> > > sibling will be selected in find_busiest_queue(). The other siblings will
> > > be unaffected.
> > >
> > 
> > Right
> > 
> > >> 
> > >> Is there an ITMT/big.LITTLE (or however x86 calls it) case that invalidates
> > >> the above?
> > >
> > > Please see below.
> > >
> > >> 
> > >> Say, what's not sufficient with the below? AFAICT the only thing that isn't
> > >> covered is the sg_busy_cpus >= 2 thing, but IMO that's SMT balancing, not
> > >> asym packing - if the current calculate_imbalance() doesn't do it, it
> > >> should be fixed to do it.
> > >
> > > Agreed.
> > >
> > >>Looking at the
> > >> 
> > >>   local->group_type == group_has_spare
> > >> 
> > >> case, it looks like it should DTRT.
> > >
> > > I had tried (and failed) to have find_busiest_group() handle the
> > > !local_is_smt :: sg_busy_cpus >= 2 case. Now I think I made it work.
> > >
> > > When the busiest group is not overloaded and local->group_type ==
> > > group_has_spare, find_busiest_group() during an idle load balance evens out
> > > the number of *idle* CPUs. However, this does not work if the local and busiest
> > > groups have different weights. In SMT2, for instance, if busiest has 2 busy
> > > CPUs (i.e., 0 idle CPUs) and the destination CPU is idle, the difference in
> > > the number of idle CPUs is 1. find_busiest_group() will incorrectly goto
> > > out_balanced.
> > >
> > > This issue is very visible in Intel hybrid processors because the big cores
> > > have SMT but the small cores do not. It can, however, be reproduced in non-
> > > hybrid processors by offlining the SMT siblings of some cores.
> > >
> > 
> > I think I follow. If we're comparing two groups each spanning an SMT2 core,
> > then
> > 
> >   busiest->group_weight > 1 && local->idle_cpus <= (busiest->idle_cpus + 1)
> > 
> > is false if local is fully idle and busiest fully busy, but if local
> > becomes a non-SMT core, then that's true and we goto out_balanced.
> 
> Exactly right.
> 
> > 
> > 
> > With that said, shouldn't SD_PREFER_SIBLING help here? cf.
> > 
> > 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > 	    busiest->sum_nr_running > local->sum_nr_running + 1)
> 
> It does not help because sds.prefer_sibling is false: a non-SMT core is
> looking into offloading a fully_busy SMT core at the "MC" domain.
> sds.prefer_sibling is set in update_sd_lb_stats() if the sched domain's child
> has the SD_PREFER_SIBLING flag. Since the destination CPU is the non-SMT
> core, there is no child.
> 
> > 
> > It should be set on any topology level below the NUMA ones, we do remove it
> > on SD_ASYM_CPUCAPACITY levels because this used to interfere with misfit
> > balancing (it would override the group_type), things are a bit different
> > since Vincent's rewrite of load_balance() but I think we still want it off
> > there.

Your comment got me thinking. Whose child sched domain wants prefer_sibling?
It sounds to me that it is busiest's. I could not think of any reason for *having*
to use the flags of the local group.

We can use the flags of the sched group (as per 16d364ba6ef2 ("sched/topology:
Introduce sched_group::flags"), these are the flags of the child domain).

The patch below works for me and I don't have to even the number of busy CPUs.
It should not interfere with misfit balancing either:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2c70e1087d0..737bb3c8bfae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9752,8 +9752,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
-	/* Tag domain that child domain prefers tasks go to siblings first */
-	sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+	/*
+	 * Tag domain that child domain prefers tasks go to siblings first.
+	 * A sched group has the flags of the child domain, if any.
+	 */
+	if (sds->busiest)
+		sds->prefer_sibling = sds->busiest->flags & SD_PREFER_SIBLING;
 
 
 	if (env->sd->flags & SD_NUMA)

Thanks and BR,
Ricardo
  
Valentin Schneider Jan. 16, 2023, 7:07 p.m. UTC | #12
On 15/01/23 20:05, Ricardo Neri wrote:
>> >
>> > It should be set on any topology level below the NUMA ones, we do remove it
>> > on SD_ASYM_CPUCAPACITY levels because this used to interfere with misfit
>> > balancing (it would override the group_type), things are a bit different
>> > since Vincent's rewrite of load_balance() but I think we still want it off
>> > there.
>
> Your comment got me thinking. Whose child sched domain wants prefer_sibling?
> It sounds to me that it is busiest's. I could not think of any reason for *having*
> to use the flags of the local group.
>

Hm, given that on systems without SD_ASYM_CPUCAPACITY, SD_PREFER_SIBLING is
set all the way from SMT up to the last !NUMA domain, should we just get
rid of the child/parent weirdness of SD_PREFER_SIBLING and stick with the
flags we are given at the level we're balancing at?

i.e.

        sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;

Unless I'm reading this wrong, this also eliminates the effect of
SD_PREFER_SIBLING on the first NUMA level - DIE level has SD_PREFER_SIBLING
set, but we don't necessarily want to evenly spread things out across
NUMA nodes.


> We can use the flags of the sched group (as per 16d364ba6ef2 ("sched/topology:
> Introduce sched_group::flags"), these are the flags of the child domain).
>
> The patch below works for me and I don't have to even the number of busy CPUs.
> It should not interfere with misfit balancing either:
>

We remove that flag on systems where misfit balancing happens anyway, so
that's safe vs. SD_PREFER_SIBLING.
  
Ricardo Neri Jan. 17, 2023, 12:49 p.m. UTC | #13
On Mon, Jan 16, 2023 at 07:07:54PM +0000, Valentin Schneider wrote:
> On 15/01/23 20:05, Ricardo Neri wrote:
> >> >
> >> > It should be set on any topology level below the NUMA ones, we do remove it
> >> > on SD_ASYM_CPUCAPACITY levels because this used to interfere with misfit
> >> > balancing (it would override the group_type), things are a bit different
> >> > since Vincent's rewrite of load_balance() but I think we still want it off
> >> > there.
> >
> > Your comment got me thinking. Whose child sched domain wants prefer_sibling?
> > It sounds to me that is busiest's. I could not think of any reason of *having*
> > to use the flags of the local group.
> >
> 
> Hm, given that on systems without SD_ASYM_CPUCAPACITY, SD_PREFER_SIBLING is
> set all the way from SMT up to the last !NUMA domain, should we just get
> rid of the child/parent weirdness of SD_PREFER_SIBLING and stick with the
> flags we are given at the level we're balancing at?
> 
> i.e.
> 
>         sds->prefer_sibling = env->sd->flags & SD_PREFER_SIBLING;

Agreed. This would also make the code easier to understand. It should not change
the current behavior either, except for (i.e., it would fix) the
busiest->group_weight = 2 vs local->group_weight = 1 case I raised.

> 
> Unless I'm reading this wrong, this also eliminates the effect of
> SD_PREFER_SIBLING on the first NUMA level - DIE level has SD_PREFER_SIBLING
> set, but we don't necessarily want to evenly spread things out across
> NUMA nodes.

Agreed.

> 
> 
> > We can use the flags of the sched group (as per 16d364ba6ef2 ("sched/topology:
> > Introduce sched_group::flags"), these are the flags of the child domain).
> >
> > The patch below works for me and I don't have to even the number of busy CPUs.
> > It should not interfere with misfit balancing either:
> >
> 
> We remove that flag on systems where misfit balancing happens anyway, so
> that's safe vs. SD_PREFER_SIBLING.

Then all looks good with your suggestion. I'll include a patch in my series.

Thanks and BR,
Ricardo
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..18c672ff39ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8900,12 +8900,10 @@  static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 				    struct sched_group *sg)
 {
 #ifdef CONFIG_SCHED_SMT
-	bool local_is_smt, sg_is_smt;
+	bool local_is_smt;
 	int sg_busy_cpus;
 
 	local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
-	sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
-
 	sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
 
 	if (!local_is_smt) {
@@ -8926,25 +8924,16 @@  static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
 		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
 	}
 
-	/* @dst_cpu has SMT siblings. */
-
-	if (sg_is_smt) {
-		int local_busy_cpus = sds->local->group_weight -
-				      sds->local_stat.idle_cpus;
-		int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
-
-		if (busy_cpus_delta == 1)
-			return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
-
-		return false;
-	}
-
 	/*
-	 * @sg does not have SMT siblings. Ensure that @sds::local does not end
-	 * up with more than one busy SMT sibling and only pull tasks if there
-	 * are not busy CPUs (i.e., no CPU has running tasks).
+	 * @dst_cpu has SMT siblings. Do asym_packing load balancing only if
+	 * all its siblings are idle (moving tasks between physical cores in
+	 * which some SMT siblings are busy results in the same throughput).
+	 *
+	 * If the difference in the number of busy CPUs is two or more, let
+	 * find_busiest_group() take care of it. We only care if @sg has
+	 * exactly one busy CPU. This covers SMT and non-SMT sched groups.
 	 */
-	if (!sds->local_stat.sum_nr_running)
+	if (sg_busy_cpus == 1 && !sds->local_stat.sum_nr_running)
 		return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
 
 	return false;