[v6,3/4] sched/fair: Introduce SIS_CORE

Message ID 20221019122859.18399-4-wuyun.abel@bytedance.com
State New
Series sched/fair: Improve scan efficiency of SIS

Commit Message

Abel Wu Oct. 19, 2022, 12:28 p.m. UTC
The wakeup fastpath for fair tasks, select_idle_sibling() aka SIS,
plays an important role in maximizing cpu utilization and can greatly
affect the overall performance of the system.

SIS tries to find an idle cpu inside the target LLC to place the
woken-up task on. The cache-hot cpus are preferred, but if none of
them is idle, a domain scan is fired to check the other cpus of that
LLC. The scan is currently linear, which works well under light
pressure because plenty of idle cpus are available.

But things change. LLCs are getting bigger in modern and future
machines, and players like cloud service providers keep trying to use
all kinds of resources more efficiently to reduce TCO. In either case,
locating an idle cpu is no longer as easy as it used to be, so a
linear scan no longer fits well.

Features like SIS_{UTIL,PROP} already exist to deal with the
scalability issue by limiting the scan depth, but it would be better
if we could also improve the way the scan works. That is exactly what
SIS_CORE is for.

When SIS_CORE is enabled, a cpumask containing idle cpus is
maintained for each LLC and stored in its shared domain. The idle cpus
are recorded at CORE granularity, so in theory at most one idle cpu
per core can be set in the mask. The ideas behind this are:

 - Recording idle cpus narrows down the SIS scan, so we can avoid
   touching runqueues that must be busy; as we all know, runqueues
   are among the hottest data in the system. And because all the
   possibly idle cpus are in the mask, the has_idle_core hint still
   works.

 - The CORE-granularity update rule helps spread the load out to
   different cores, making better use of cpu capacity (see the
   sketch below).
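
To illustrate the rule, here is a minimal user-space sketch (a toy model
with hypothetical names such as llc_idle_mask and core_mask(), not the
kernel code added by this patch): a cpu is recorded only when no SMT
sibling of its core is already present, so the recorded idle cpus
naturally spread across cores.

/* Toy model of the CORE-granularity update rule (not kernel code). */
#include <stdio.h>

#define NR_CPUS		8	/* assume 4 cores x 2 SMT siblings */
#define SMT_WIDTH	2

static unsigned long llc_idle_mask;	/* one bit per cpu */

/* bitmask covering all SMT siblings of @cpu, i.e. its core */
static unsigned long core_mask(int cpu)
{
	return ((1UL << SMT_WIDTH) - 1) << (cpu - cpu % SMT_WIDTH);
}

/* called when @cpu enters idle */
static void update_idle_cpu(int cpu)
{
	/* record this cpu only if its core is not represented yet */
	if (!(llc_idle_mask & core_mask(cpu)))
		llc_idle_mask |= 1UL << cpu;
}

int main(void)
{
	update_idle_cpu(0);	/* core 0: recorded */
	update_idle_cpu(1);	/* core 0: skipped, sibling 0 already there */
	update_idle_cpu(4);	/* core 2: recorded */
	printf("llc_idle_mask = 0x%lx\n", llc_idle_mask);	/* prints 0x11 */
	return 0;
}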

A major concern is the accuracy of the idle cpumask. A cpu present in
the mask might not be idle any more; such a cpu is called a false
positive. False positives negate much of the benefit this feature
brings. The strategy against them is introduced in the next patch.

Another concern is the overhead of accessing the LLC-shared cpumask,
which can be more severe on large LLCs. But a perf stat on the cache
miss rate during hackbench doesn't show an obvious difference.

This patch records cpus, at CORE granularity, in a per-LLC cpumask
when they go idle; the cpumask lives in the LLC shared domain. False
positive cpus are cleared when an SIS domain scan fails.
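
As a rough illustration of the clearing rule, here is another user-space
toy (hypothetical names, not the patch code): the domain scan walks only
the cpus recorded in the LLC idle mask, and if a full scan finds none of
them actually idle, the whole mask is stale and gets cleared.

/* Toy model of clearing false positives after a failed full scan. */
#include <stdio.h>

#define NR_CPUS	8

static unsigned long llc_idle_mask = 0x13;	/* cpus 0, 1 and 4 recorded */
static int nr_running[NR_CPUS] = { 1, 1, 0, 0, 2, 0, 0, 0 };	/* toy rq state */

static int scan_idle_mask(unsigned long allowed)
{
	unsigned long cand = llc_idle_mask & allowed;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if ((cand & (1UL << cpu)) && !nr_running[cpu])
			return cpu;	/* a recorded cpu really is idle */
	}

	/* full scan failed: every recorded cpu was a false positive */
	llc_idle_mask = 0;
	return -1;
}

int main(void)
{
	printf("picked cpu %d\n", scan_idle_mask(~0UL));	/* -1: cpus 0, 1, 4 busy */
	printf("llc_idle_mask = 0x%lx\n", llc_idle_mask);	/* 0x0 */
	return 0;
}

In the patch itself (see select_idle_cpu() below) the mask is only cleared
when the scan was not cut short by SIS_UTIL/SIS_PROP (the "nr > 0" check),
so a depth-limited scan does not throw away cpus it never looked at.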

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 15 ++++++++++
 kernel/sched/fair.c            | 51 +++++++++++++++++++++++++++++++---
 kernel/sched/features.h        |  7 +++++
 kernel/sched/topology.c        |  8 +++++-
 4 files changed, 76 insertions(+), 5 deletions(-)
  

Comments

Chen Yu Oct. 21, 2022, 4:03 a.m. UTC | #1
On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
[cut]
> A major concern is the accuracy of the idle cpumask. A cpu present
> in the mask might not be idle any more, which is called the false
> positive cpu. Such cpus will negate lots of benefit this feature
> brings. The strategy against the false positives will be introduced
> in next patch.
> 
I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
maybe SIS_CORE could be disabled by default in 3/4 but enabled
in 4/4? So this might facilitate git bisect in case of any regression
check?
[cut]
> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> + * cpumask only if there is no cpu of this core showed up in the cpumask.
> + */
> +static void update_idle_cpu(int cpu)
> +{
> +	struct sched_domain_shared *sds;
> +
> +	if (!sched_feat(SIS_CORE))
> +		return;
> +
> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +	if (sds) {
> +		struct cpumask *icpus = to_cpumask(sds->icpus);
> +
> +		/*
> +		 * This is racy against clearing in select_idle_cpu(),
> +		 * and can lead to idle cpus miss the chance to be set to
> +		 * the idle cpumask, thus the idle cpus are temporarily
> +		 * out of reach in SIS domain scan. But it should be rare
> +		 * and we still have ILB to kick them working.
> +		 */
> +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> +			cpumask_set_cpu(cpu, icpus);
Maybe I miss something, here we only set one CPU in the icpus, but
when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
is this intended for 'CORE granule update'?


thanks,
Chenyu
  
Abel Wu Oct. 21, 2022, 4:30 a.m. UTC | #2
Hi Chen, thanks for your reviewing!

On 10/21/22 12:03 PM, Chen Yu wrote:
> On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> [cut]
>> A major concern is the accuracy of the idle cpumask. A cpu present
>> in the mask might not be idle any more, which is called the false
>> positive cpu. Such cpus will negate lots of benefit this feature
>> brings. The strategy against the false positives will be introduced
>> in next patch.
>>
> I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> maybe SIS_CORE could be disabled by default in 3/4 but enabled
> in 4/4? So this might facilitate git bisect in case of any regression
> check?

Agreed. Will fix in next version.

> [cut]
>> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
>> + * cpumask only if there is no cpu of this core showed up in the cpumask.
>> + */
>> +static void update_idle_cpu(int cpu)
>> +{
>> +	struct sched_domain_shared *sds;
>> +
>> +	if (!sched_feat(SIS_CORE))
>> +		return;
>> +
>> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> +	if (sds) {
>> +		struct cpumask *icpus = to_cpumask(sds->icpus);
>> +
>> +		/*
>> +		 * This is racy against clearing in select_idle_cpu(),
>> +		 * and can lead to idle cpus miss the chance to be set to
>> +		 * the idle cpumask, thus the idle cpus are temporarily
>> +		 * out of reach in SIS domain scan. But it should be rare
>> +		 * and we still have ILB to kick them working.
>> +		 */
>> +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
>> +			cpumask_set_cpu(cpu, icpus);
> Maybe I miss something, here we only set one CPU in the icpus, but
> when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> is this intended for 'CORE granule update'?

The __update_idle_core() is called by all the cpus that need to go idle
to update has_idle_core if necessary, and update_idle_cpu() is called
before that check.
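
In other words, after this patch the flow is roughly the following
(abridged from the hunk below, comments added for clarity):

void __update_idle_core(struct rq *rq)
{
	int core = cpu_of(rq);
	...
	rcu_read_lock();
	update_idle_cpu(core);		/* runs for every cpu entering idle */
	if (test_idle_cores(core))	/* only the has_idle_core update is skipped */
		goto unlock;
	...
}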

Thanks,
Abel
  
Chen Yu Oct. 21, 2022, 4:34 a.m. UTC | #3
On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
> Hi Chen, thanks for your reviewing!
> 
> On 10/21/22 12:03 PM, Chen Yu wrote:
> > On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> > [cut]
> > > A major concern is the accuracy of the idle cpumask. A cpu present
> > > in the mask might not be idle any more, which is called the false
> > > positive cpu. Such cpus will negate lots of benefit this feature
> > > brings. The strategy against the false positives will be introduced
> > > in next patch.
> > > 
> > I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> > maybe SIS_CORE could be disabled by default in 3/4 but enabled
> > in 4/4? So this might facilitate git bisect in case of any regression
> > check?
> 
> Agreed. Will fix in next version.
> 
> > [cut]
> > > + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> > > + * cpumask only if there is no cpu of this core showed up in the cpumask.
> > > + */
> > > +static void update_idle_cpu(int cpu)
> > > +{
> > > +	struct sched_domain_shared *sds;
> > > +
> > > +	if (!sched_feat(SIS_CORE))
> > > +		return;
> > > +
> > > +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > +	if (sds) {
> > > +		struct cpumask *icpus = to_cpumask(sds->icpus);
> > > +
> > > +		/*
> > > +		 * This is racy against clearing in select_idle_cpu(),
> > > +		 * and can lead to idle cpus miss the chance to be set to
> > > +		 * the idle cpumask, thus the idle cpus are temporarily
> > > +		 * out of reach in SIS domain scan. But it should be rare
> > > +		 * and we still have ILB to kick them working.
> > > +		 */
> > > +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> > > +			cpumask_set_cpu(cpu, icpus);
> > Maybe I miss something, here we only set one CPU in the icpus, but
> > when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> > is this intended for 'CORE granule update'?
> 
> The __update_idle_core() is called by all the cpus that need to go idle
> to update has_idle_core if necessary, and update_idle_cpu() is called
> before that check.
>
I see.

Since __update_idle_core() has checked all SMT siblings of 'cpu' if
they are idle, can that information also be updated to icpus?

thanks,
Chenyu 
> Thanks,
> Abel
>
  
Abel Wu Oct. 21, 2022, 9:35 a.m. UTC | #4
On 10/21/22 12:34 PM, Chen Yu wrote:
> On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
>> Hi Chen, thanks for your reviewing!
>>
>> On 10/21/22 12:03 PM, Chen Yu wrote:
>>> On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
>>> [cut]
>>>> A major concern is the accuracy of the idle cpumask. A cpu present
>>>> in the mask might not be idle any more, which is called the false
>>>> positive cpu. Such cpus will negate lots of benefit this feature
>>>> brings. The strategy against the false positives will be introduced
>>>> in next patch.
>>>>
>>> I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
>>> maybe SIS_CORE could be disabled by default in 3/4 but enabled
>>> in 4/4? So this might facilitate git bisect in case of any regression
>>> check?
>>
>> Agreed. Will fix in next version.
>>
>>> [cut]
>>>> + * To honor the rule of CORE granule update, set this cpu to the LLC idle
>>>> + * cpumask only if there is no cpu of this core showed up in the cpumask.
>>>> + */
>>>> +static void update_idle_cpu(int cpu)
>>>> +{
>>>> +	struct sched_domain_shared *sds;
>>>> +
>>>> +	if (!sched_feat(SIS_CORE))
>>>> +		return;
>>>> +
>>>> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>>> +	if (sds) {
>>>> +		struct cpumask *icpus = to_cpumask(sds->icpus);
>>>> +
>>>> +		/*
>>>> +		 * This is racy against clearing in select_idle_cpu(),
>>>> +		 * and can lead to idle cpus miss the chance to be set to
>>>> +		 * the idle cpumask, thus the idle cpus are temporarily
>>>> +		 * out of reach in SIS domain scan. But it should be rare
>>>> +		 * and we still have ILB to kick them working.
>>>> +		 */
>>>> +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
>>>> +			cpumask_set_cpu(cpu, icpus);
>>> Maybe I miss something, here we only set one CPU in the icpus, but
>>> when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
>>> is this intended for 'CORE granule update'?
>>
>> The __update_idle_core() is called by all the cpus that need to go idle
>> to update has_idle_core if necessary, and update_idle_cpu() is called
>> before that check.
>>
> I see.
> 
> Since __update_idle_core() has checked all SMT siblings of 'cpu' if
> they are idle, can that information also be updated to icpus?

I think this will simply fallback to the original per-cpu proposal and
lose the opportunity to spread tasks to different cores.
  
Chen Yu Oct. 21, 2022, 11:14 a.m. UTC | #5
On 2022-10-21 at 17:35:06 +0800, Abel Wu wrote:
> On 10/21/22 12:34 PM, Chen Yu wrote:
> > On 2022-10-21 at 12:30:56 +0800, Abel Wu wrote:
> > > Hi Chen, thanks for your reviewing!
> > > 
> > > On 10/21/22 12:03 PM, Chen Yu wrote:
> > > > On 2022-10-19 at 20:28:58 +0800, Abel Wu wrote:
> > > > [cut]
> > > > > A major concern is the accuracy of the idle cpumask. A cpu present
> > > > > in the mask might not be idle any more, which is called the false
> > > > > positive cpu. Such cpus will negate lots of benefit this feature
> > > > > brings. The strategy against the false positives will be introduced
> > > > > in next patch.
> > > > > 
> > > > I was thinking that, if patch[3/4] needs [4/4] to fix the false positives,
> > > > maybe SIS_CORE could be disabled by default in 3/4 but enabled
> > > > in 4/4? So this might facilitate git bisect in case of any regression
> > > > check?
> > > 
> > > Agreed. Will fix in next version.
> > > 
> > > > [cut]
> > > > > + * To honor the rule of CORE granule update, set this cpu to the LLC idle
> > > > > + * cpumask only if there is no cpu of this core showed up in the cpumask.
> > > > > + */
> > > > > +static void update_idle_cpu(int cpu)
> > > > > +{
> > > > > +	struct sched_domain_shared *sds;
> > > > > +
> > > > > +	if (!sched_feat(SIS_CORE))
> > > > > +		return;
> > > > > +
> > > > > +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > > > +	if (sds) {
> > > > > +		struct cpumask *icpus = to_cpumask(sds->icpus);
> > > > > +
> > > > > +		/*
> > > > > +		 * This is racy against clearing in select_idle_cpu(),
> > > > > +		 * and can lead to idle cpus miss the chance to be set to
> > > > > +		 * the idle cpumask, thus the idle cpus are temporarily
> > > > > +		 * out of reach in SIS domain scan. But it should be rare
> > > > > +		 * and we still have ILB to kick them working.
> > > > > +		 */
> > > > > +		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
> > > > > +			cpumask_set_cpu(cpu, icpus);
> > > > Maybe I miss something, here we only set one CPU in the icpus, but
> > > > when we reach update_idle_cpu(), all SMT siblings of 'cpu' are idle,
> > > > is this intended for 'CORE granule update'?
> > > 
> > > The __update_idle_core() is called by all the cpus that need to go idle
> > > to update has_idle_core if necessary, and update_idle_cpu() is called
> > > before that check.
> > > 
> > I see.
> > 
> > Since __update_idle_core() has checked all SMT siblings of 'cpu' if
> > they are idle, can that information also be updated to icpus?
> 
> I think this will simply fallback to the original per-cpu proposal and
> lose the opportunity to spread tasks to different cores.
OK, make sense.

thanks,
Chenyu
  

Patch

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..ac2162f33ada 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,16 @@  struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+
+	/*
+	 * Used by sched feature SIS_CORE to record idle cpus at core
+	 * granule to improve efficiency of SIS domain scan.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with)
+	 */
+	unsigned long	icpus[];
 };
 
 struct sched_domain {
@@ -167,6 +177,11 @@  static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+static inline struct cpumask *sched_domain_icpus(struct sched_domain *sd)
+{
+	return to_cpumask(sd->shared->icpus);
+}
+
 extern void partition_sched_domains_locked(int ndoms_new,
 					   cpumask_var_t doms_new[],
 					   struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b668e16812e..3aa699e9d4af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6282,6 +6282,33 @@  static inline bool test_idle_cores(int cpu)
 	return false;
 }
 
+/*
+ * To honor the rule of CORE granule update, set this cpu to the LLC idle
+ * cpumask only if there is no cpu of this core showed up in the cpumask.
+ */
+static void update_idle_cpu(int cpu)
+{
+	struct sched_domain_shared *sds;
+
+	if (!sched_feat(SIS_CORE))
+		return;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		struct cpumask *icpus = to_cpumask(sds->icpus);
+
+		/*
+		 * This is racy against clearing in select_idle_cpu(),
+		 * and can lead to idle cpus miss the chance to be set to
+		 * the idle cpumask, thus the idle cpus are temporarily
+		 * out of reach in SIS domain scan. But it should be rare
+		 * and we still have ILB to kick them working.
+		 */
+		if (!cpumask_intersects(cpu_smt_mask(cpu), icpus))
+			cpumask_set_cpu(cpu, icpus);
+	}
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->has_idle_cores.
@@ -6298,6 +6325,7 @@  void __update_idle_core(struct rq *rq)
 		return;
 
 	rcu_read_lock();
+	update_idle_cpu(core);
 	if (test_idle_cores(core))
 		goto unlock;
 
@@ -6343,7 +6371,13 @@  static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	if (idle)
 		return core;
 
-	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+	/*
+	 * It is unlikely that more than one cpu of a core show up
+	 * in the @cpus if SIS_CORE enabled.
+	 */
+	if (!sched_feat(SIS_CORE))
+		cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
+
 	return -1;
 }
 
@@ -6394,7 +6428,7 @@  static inline int select_idle_smt(struct task_struct *p, int target)
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask), *icpus = NULL;
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
@@ -6402,8 +6436,6 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct sched_domain *this_sd = NULL;
 	u64 time = 0;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
@@ -6447,6 +6479,11 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	if (sched_feat(SIS_CORE) && sched_smt_active())
+		icpus = sched_domain_icpus(sd);
+
+	cpumask_and(cpus, icpus ? icpus : sched_domain_span(sd), p->cpus_ptr);
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6465,6 +6502,12 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (has_idle_core)
 		set_idle_cores(target, false);
 
+	if (icpus && idle_cpu == -1) {
+		/* Reset the idle cpu mask if a full scan fails */
+		if (nr > 0)
+			cpumask_clear(icpus);
+	}
+
 	if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
 		time = cpu_clock(this) - time;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..bf3cae94caa6 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,6 +63,13 @@  SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
 
+/*
+ * Record idle cpus at core granule for each LLC to improve efficiency of
+ * SIS domain scan. Combine with the above features of limiting scan depth
+ * to better deal with the scalability issue.
+ */
+SCHED_FEAT(SIS_CORE, true)
+
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..a2bb0091c10d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,12 @@  sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+
+		/*
+		 * This will temporarily break the rule of CORE granule,
+		 * but will be fixed after SIS scan failures.
+		 */
+		cpumask_copy(sched_domain_icpus(sd), sd_span);
 	}
 
 	sd->private = sdd;
@@ -2106,7 +2112,7 @@  static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;