[v2,3/7] sched: Teach arch_asym_cpu_priority() the idle state of SMT siblings

Message ID 20221122203532.15013-4-ricardo.neri-calderon@linux.intel.com
State New
Series x86/sched: Avoid unnecessary migrations within SMT domains

Commit Message

Ricardo Neri Nov. 22, 2022, 8:35 p.m. UTC
  Some processors (e.g., Intel processors with ITMT) use asym_packing to
balance load between physical cores with SMT. In such a case, a core with all
its SMT siblings idle is more desirable than another with one or more busy
siblings.

Other processors (e.g., Power7 with SMT8) use asym_packing to balance load
among SMT siblings of different priority, regardless of their idle state.

Add a new parameter, check_smt, that architectures can use as needed.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim C. Chen <tim.c.chen@intel.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
 * Introduced this patch.
---
 arch/x86/kernel/itmt.c         | 2 +-
 include/linux/sched/topology.h | 2 +-
 kernel/sched/fair.c            | 5 ++++-
 kernel/sched/sched.h           | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)
  

Comments

Dietmar Eggemann Dec. 6, 2022, 5:54 p.m. UTC | #1
On 22/11/2022 21:35, Ricardo Neri wrote:
> Some processors (e.g., Intel processors with ITMT) use asym_packing to
> balance load between physical cores with SMT. In such a case, a core with all
> its SMT siblings idle is more desirable than another with one or more busy
> siblings.
> 
> Other processors (e.g., Power7 with SMT8) use asym_packing to balance load
> among SMT siblings of different priority, regardless of their idle state.
> 
> Add a new parameter, check_smt, that architectures can use as needed.

[...]

> Changes since v1:
>  * Introduced this patch.

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d18947a9c03e..0e4251f83807 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -142,8 +142,11 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>  #ifdef CONFIG_SMP
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.
> + *
> + * When doing ASYM_PACKING at the "MC" or higher domains, architectures may

There is this new CLUSTER level between SMT and MC.

> + * want to check the idle state of the SMT siblngs of @cpu.

s/siblngs/siblings

The scheduler calls sched_asym_prefer(..., true) from
find_busiest_queue(), asym_active_balance() and nohz_balancer_kick()
even from SMT layer on !x86. So I guess a `bool check_smt` wouldn't be
sufficient to distinguish whether sched_smt_siblings_idle() should be
called or not. To me this comment is a little bit misleading. Not an
issue currently since there is only the x86 overwrite right now.

>   */
> -int __weak arch_asym_cpu_priority(int cpu)
> +int __weak arch_asym_cpu_priority(int cpu, bool check_smt)
>  {
>  	return -cpu;
>  }

[...]
  
Ricardo Neri Dec. 12, 2022, 5:54 p.m. UTC | #2
On Tue, Dec 06, 2022 at 06:54:39PM +0100, Dietmar Eggemann wrote:
> On 22/11/2022 21:35, Ricardo Neri wrote:
> > Some processors (e.g., Intel processors with ITMT) use asym_packing to
> > balance load between physical cores with SMT. In such a case, a core with all
> > its SMT siblings idle is more desirable than another with one or more busy
> > siblings.
> > 
> > Other processors (e.g., Power7 with SMT8) use asym_packing to balance load
> > among SMT siblings of different priority, regardless of their idle state.
> > 
> > Add a new parameter, check_smt, that architectures can use as needed.
> 
> [...]
> 
> > Changes since v1:
> >  * Introduced this patch.
> 
> [...]
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d18947a9c03e..0e4251f83807 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -142,8 +142,11 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
> >  #ifdef CONFIG_SMP
> >  /*
> >   * For asym packing, by default the lower numbered CPU has higher priority.
> > + *
> > + * When doing ASYM_PACKING at the "MC" or higher domains, architectures may
> 
> There is this new CLUSTER level between SMT and MC.

This is a good catch! I will update the comment to say "domains above SMT".

> 
> > + * want to check the idle state of the SMT siblngs of @cpu.
> 
> s/siblngs/siblings
> 
> The scheduler calls sched_asym_prefer(..., true) from
> find_busiest_queue(), asym_active_balance() and nohz_balancer_kick()

In these places we are comparing two specific CPUs, and the idle state of
their SMT siblings impacts their throughput and, consequently, the decision
to attempt load balancing.

In the places where sched_asym_prefer(...., false) is called, we compare the
destination CPU with a CPU that bears the priority of a sched group,
regardless of the idle state of its siblings.

> even from SMT layer on !x86.

This is true, but the default arch_asym_cpu_priority ignores check_smt.

>  So I guess a `bool check_smt` wouldn't be
> sufficient to distinguish whether sched_smt_siblings_idle() should be
> called or not.

But it is the caller who determines whether the idle state of the SMT
siblings of @cpu may be relevant.

> To me this comment is a little bit misleading. Not an
> issue currently since there is only the x86 overwrite right now.

If my justification makes sense to you, I can expand the comment to state
that the caller decides whether check_smt is needed, but arch-specific
implementations are free to ignore it.

Thanks and BR,
Ricardo
  
Dietmar Eggemann Dec. 21, 2022, 5:12 p.m. UTC | #3
On 12/12/2022 18:54, Ricardo Neri wrote:
> On Tue, Dec 06, 2022 at 06:54:39PM +0100, Dietmar Eggemann wrote:
>> On 22/11/2022 21:35, Ricardo Neri wrote:

[...]

>>> + * want to check the idle state of the SMT siblngs of @cpu.
>>
>> s/siblngs/siblings
>>
>> The scheduler calls sched_asym_prefer(..., true) from
>> find_busiest_queue(), asym_active_balance() and nohz_balancer_kick()
> 
> In these places we are comparing two specific CPUs, and the idle state of
> their SMT siblings impacts their throughput and, consequently, the decision
> to attempt load balancing.
> 
> In the places where sched_asym_prefer(...., false) is called, we compare the
> destination CPU with a CPU that bears the priority of a sched group,
> regardless of the idle state of its siblings.

OK.

>> even from SMT layer on !x86.
> 
> This is true, but the default arch_asym_cpu_priority ignores check_smt.

True.

>>  So I guess a `bool check_smt` wouldn't be
>> sufficient to distinguish whether sched_smt_siblings_idle() should be
>> called or not.
> 
> But it is the caller who determines whether the idle state of the SMT
> siblings of @cpu may be relevant.

I assume the caller is the task scheduler here. Callers with
`check_smt=true` can be called from any SD level with SD_ASYM_PACKING.

Imagine an arch w/ SD_ASYM_PACKING on SMT & MC overwriting
arch_asym_cpu_priority(). `bool check_smt` wouldn't be sufficient to
know whether a call to something like sched_smt_siblings_idle()
(is_core_idle()) which iterates over cpu_smt_mask(cpu) would make sense.

>> To me this comment is a little bit misleading. Not an
>> issue currently since there is only the x86 overwrite right now.
> 
> If my justification makes sense to you, I can expand the comment to state
> that the caller decides whether check_smt is needed, but arch-specific
> implementations are free to ignore it.

Not a big issue, but to me, if the task scheduler asks for `bool
check_smt`, then archs would have to honor it to guarantee common behaviour.
And the meaning of `bool check_smt` at the SMT level is unclear to me.
Since only x86 would use this so far, it can be adapted later for others
if needed.

[...]
  
Ricardo Neri Dec. 22, 2022, 4:55 a.m. UTC | #4
On Wed, Dec 21, 2022 at 06:12:52PM +0100, Dietmar Eggemann wrote:
> On 12/12/2022 18:54, Ricardo Neri wrote:
> > On Tue, Dec 06, 2022 at 06:54:39PM +0100, Dietmar Eggemann wrote:
> >> On 22/11/2022 21:35, Ricardo Neri wrote:
> 
> [...]
> 
> >>> + * want to check the idle state of the SMT siblngs of @cpu.
> >>
> >> s/siblngs/siblings
> >>
> >> The scheduler calls sched_asym_prefer(..., true) from
> >> find_busiest_queue(), asym_active_balance() and nohz_balancer_kick()
> > 
> > In these places we are comparing two specific CPUs, and the idle state of
> > their SMT siblings impacts their throughput and, consequently, the decision
> > to attempt load balancing.
> > 
> > In the places where sched_asym_prefer(...., false) is called, we compare the
> > destination CPU with a CPU that bears the priority of a sched group,
> > regardless of the idle state of its siblings.
> 
> OK.
> 
> >> even from SMT layer on !x86.
> > 
> > This is true, but the default arch_asym_cpu_priority ignores check_smt.
> 
> True.
> 
> >>  So I guess a `bool check_smt` wouldn't be
> >> sufficient to distinguish whether sched_smt_siblings_idle() should be
> >> called or not.
> > 
> > But it is the caller who determines whether the idle state of the SMT
> > siblings of @cpu may be relevant.
> 
> I assume the caller is the task scheduler here.

Yes.

> Callers with `check_smt=true` can be called from any SD level with
> SD_ASYM_PACKING.

This is true.

> 
> Imagine an arch w/ SD_ASYM_PACKING on SMT & MC overwriting
> arch_asym_cpu_priority(). `bool check_smt` wouldn't be sufficient to
> know whether a call to something like sched_smt_siblings_idle()
> (is_core_idle()) which iterates over cpu_smt_mask(cpu) would make sense.

Agreed. I was hoping I could get away with this. x86 would not have
the SD_ASYM_PACKING flag at the SMT level and Power7 would ignore
`check_smt`. :)

Callers of sched_asym_prefer() could use the flags of the sched domain to
check if we are at the SMT level.

I rescanned the code again and it looks like the needed sched domain flags
are available in all the places sched_asym_prefer() is called. The only
exception is asym_smt_can_pull_tasks(), but we already know that we don't
need such a check. (We are looking for the sched group priority, regardless
of the idle state of the SMT siblings).

> 
> >> To me this comment is a little bit misleading. Not an
> >> issue currently since there is only the x86 overwrite right now.
> > 
> > If my justification makes sense to you, I can expand the comment to state
> > that the caller decides whether check_smt is needed, but arch-specific
> > implementations are free to ignore it.
> 
> Not a big issue, but to me, if the task scheduler asks for `bool
> check_smt`, then archs would have to honor it to guarantee common behaviour.
> And the meaning of `bool check_smt` at the SMT level is unclear to me.
> Since only x86 would use this so far, it can be adapted later for others
> if needed.

What is proposed in my previous paragraph should solve this, IMO.

Thanks and BR,
Ricardo
  
Valentin Schneider Dec. 22, 2022, 4:56 p.m. UTC | #5
On 21/12/22 20:55, Ricardo Neri wrote:
> On Wed, Dec 21, 2022 at 06:12:52PM +0100, Dietmar Eggemann wrote:
>> Imagine an arch w/ SD_ASYM_PACKING on SMT & MC overwriting
>> arch_asym_cpu_priority(). `bool check_smt` wouldn't be sufficient to
>> know whether a call to something like sched_smt_siblings_idle()
>> (is_core_idle()) which iterates over cpu_smt_mask(cpu) would make sense.
>
> Agreed. I was hoping I could get away with this. x86 would not have
> the SD_ASYM_PACKING flag at the SMT level and Power7 would ignore
> `check_smt`. :)
>
> Callers of sched_asym_prefer() could use the flags of the sched domain to
> check if we are at the SMT level.
>
> I rescanned the code again and it looks like the needed sched domain flags
> are available in all the places sched_asym_prefer() is called. The only
> exception is asym_smt_can_pull_tasks(), but we already know that we don't
> need such a check. (We are looking for the sched group priority, regardless
> of the idle state of the SMT siblings).
>

Given this is fed back to arch code, another option here would be to feed
the topology level this is happening at. You get it via sd->level and it
maps back into the arch's sched_domain_topology_level array.

Though the SD flags are probably the more generic solution.
  

Patch

diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 9ff480e94511..4cb5a5e4fa47 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -167,7 +167,7 @@  void sched_clear_itmt_support(void)
 	mutex_unlock(&itmt_update_mutex);
 }
 
-int arch_asym_cpu_priority(int cpu)
+int arch_asym_cpu_priority(int cpu, bool check_smt)
 {
 	return per_cpu(sched_core_priority, cpu);
 }
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..87b64b9776f6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -63,7 +63,7 @@  static inline int cpu_numa_flags(void)
 }
 #endif
 
-extern int arch_asym_cpu_priority(int cpu);
+extern int arch_asym_cpu_priority(int cpu, bool check_smt);
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d18947a9c03e..0e4251f83807 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -142,8 +142,11 @@  __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 #ifdef CONFIG_SMP
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
+ *
+ * When doing ASYM_PACKING at the "MC" or higher domains, architectures may
+ * want to check the idle state of the SMT siblngs of @cpu.
  */
-int __weak arch_asym_cpu_priority(int cpu)
+int __weak arch_asym_cpu_priority(int cpu, bool check_smt)
 {
 	return -cpu;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fc7c0130755..e5e52c2e82de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -802,7 +802,8 @@  static inline long se_weight(struct sched_entity *se)
  */
 static inline bool sched_asym_prefer(int a, int b, bool check_smt)
 {
-	return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
+	return arch_asym_cpu_priority(a, check_smt) >
+	       arch_asym_cpu_priority(b, check_smt);
 }
 
 struct perf_domain {