[v6,4/4] sched/fair: Deal with SIS scan failures

Message ID 20221019122859.18399-5-wuyun.abel@bytedance.com
State New
Headers
Series sched/fair: Improve scan efficiency of SIS |

Commit Message

Abel Wu Oct. 19, 2022, 12:28 p.m. UTC
  When SIS_CORE is active, idle cpus are recorded into a cpumask at
CORE granule, and the SIS domain scan tries to locate an idle cpu
in this cpumask. So the quality of the cpumask matters. There are
generally two types of cpus that need to be dealt with:

 - False positive: the cpus that are present in the idle cpumask
   but is actually busy. This can be caused by task enqueuing to
   these cpus without removing from the cpumask immediately. To
   solve this, the cpumask needs to be shrinked if poor quality
   of the mask is observed.

 - True negative: the cpus that are idle but not show up in the
   idle cpumask. This is due to the rule of CORE granule update
   that these idle cpus will be ignored when their SMT siblings
   are already present in the idle cpumask.

According to the nature of the two type cpus and some heuristics,
the strategies against SIS scan failures are as follows:

 - It can be predicted that if a scan fails, the next scan from the
   same cpu will probably fail too. So mark these cpus fail-prone.
   A scan from fail-prone cpu is the time to shrink the cpumask.

 - All true negative cpus are the SMT siblings of the false positive
   cpus, and are taken for granted to be treated as the fallbacks of
   the false positive cpus. So a fail-prone scan should also check
   the SMT siblings to see if any true negative cpu exists.

The number of false positive cpus being removed during one scan is
not constrained, but it will be implicitly constrained by the load
of the LLC. If LLC is under heavy pressure, both the weight of idle
cpumask and the scan depth are reduced, so does the number of cpus
removed.

To sum up, this patch marks the cpus fail-prone if scans from them
failed last time, so next time scanning from them will also check
their SMT siblings to see if any true negative cpus available. The
false positive cpus will be removed during fail-prone scans if its
belonging core is fully busy.

Benchmark
=========

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, and test machines are:

A) A dual socket machine modeled Intel Xeon(R) Platinum 8260 with SNC
disabled, so there are 2 NUMA nodes each of which has 24C/48T. Each
NUMA shares an LLC.

B) A dual socket machine modeled AMD EPYC 7Y83 64-Core Processor with
NPS1 enabled, so there are 2 NUMA nodes each of which has 64C/128T.
Each NUMA node contains several LLCs sized of 16 cpus.

The 'baseline' is based on tip sched/core fb04563d1cae (v5.19.0) with
the first 2 patches of this series applied. While 'sis_core' includes
the whole patchset.

Results
=======

hackbench-process-pipes
 (A)			baseline		sis_core
Amean     1        0.2540 (   0.00%)      0.2533 (   0.26%)
Amean     4        0.6220 (   0.00%)      0.5993 (   3.64%)
Amean     7        0.8020 (   0.00%)      0.7663 *   4.45%*
Amean     12       1.1823 (   0.00%)      1.1037 *   6.65%*
Amean     21       2.7717 (   0.00%)      2.2203 *  19.89%*
Amean     30       5.1200 (   0.00%)      3.7133 *  27.47%*
Amean     48       8.5890 (   0.00%)      7.0863 *  17.50%*
Amean     79      11.2580 (   0.00%)     10.3717 *   7.87%*
Amean     110     13.1283 (   0.00%)     12.4133 *   5.45%*
Amean     141     15.5967 (   0.00%)     14.4883 *   7.11%*
Amean     172     18.1153 (   0.00%)     17.2557 *   4.75%*
Amean     203     21.1340 (   0.00%)     20.2807 (   4.04%)
Amean     234     23.8227 (   0.00%)     22.8510 (   4.08%)
Amean     265     26.8293 (   0.00%)     25.7367 *   4.07%*
Amean     296     29.5800 (   0.00%)     28.7847 (   2.69%)
 (B)
Amean     1        0.3650 (   0.00%)      0.3510 (   3.84%)
Amean     4        0.4837 (   0.00%)      0.4753 (   1.72%)
Amean     7        0.4997 (   0.00%)      0.5073 (  -1.53%)
Amean     12       0.5863 (   0.00%)      0.5807 (   0.97%)
Amean     21       0.8930 (   0.00%)      0.8953 (  -0.26%)
Amean     30       1.2530 (   0.00%)      1.2633 (  -0.82%)
Amean     48       1.9743 (   0.00%)      1.9023 (   3.65%)
Amean     79       3.4933 (   0.00%)      3.2820 (   6.05%)
Amean     110      5.5963 (   0.00%)      5.3923 (   3.65%)
Amean     141      7.6550 (   0.00%)      6.8633 (  10.34%)
Amean     172      8.8323 (   0.00%)      8.2973 *   6.06%*
Amean     203     10.8683 (   0.00%)      9.5170 *  12.43%*
Amean     234     11.8683 (   0.00%)     10.6217 (  10.50%)
Amean     265     13.4717 (   0.00%)     11.9357 *  11.40%*
Amean     296     13.8130 (   0.00%)     12.7430 *   7.75%*

The results on machine A can approximately divided into 3 sections:
 - busy, e.g. <21 groups
 - overloaded, e.g. 21~48 groups
 - saturated, the rest part

The two cases of 296 groups in B and 110 groups in A have same amount
of tasks per cpu. So does 30 groups in B and 12 groups in A. So the
sections on A also apply to B, except that B only has the first two.
This implies that this feature seems to have consistant behaviour on
LLCs with different sizes.

For the busy part the result is neutral with slight wins or loss.
It's not hard to find an idle cpu in such case, so SIS_CORE doesn't
outperform the linear scanner, considering the cpumask is maintained
at a cost which will negate the slight benefit.

Once load increases, SIS_CORE helps improving the throughput quite a
lot by squeezing out the hidden capacity of the cpus. And even under
extreme load pressure when the cpu capacity is almost fully utilized,
there is still some can be exploited.

Although it is unlikely to be such loaded in the real world, the long
running workloads like training jobs can also keep cpus busy and can
benefit from this feature a lot.

netperf
 (A-udp)		    baseline		   sis_core
Hmean     send-64         214.34 (   0.00%)      210.79 *  -1.65%*
Hmean     send-128        427.90 (   0.00%)      417.96 *  -2.32%*
Hmean     send-256        839.65 (   0.00%)      823.78 *  -1.89%*
Hmean     send-1024      3207.45 (   0.00%)     3167.96 *  -1.23%*
Hmean     send-2048      6097.24 (   0.00%)     6089.01 (  -0.13%)
Hmean     send-3312      9350.83 (   0.00%)     9299.09 (  -0.55%)
Hmean     send-4096     11368.25 (   0.00%)    11186.44 *  -1.60%*
Hmean     send-8192     18273.21 (   0.00%)    18103.81 (  -0.93%)
Hmean     send-16384    28207.81 (   0.00%)    28259.82 (   0.18%)
 (B-udp)
Hmean     send-64         249.97 (   0.00%)      256.99 *   2.81%*
Hmean     send-128        500.68 (   0.00%)      514.44 *   2.75%*
Hmean     send-256        991.59 (   0.00%)     1017.38 *   2.60%*
Hmean     send-1024      3913.02 (   0.00%)     3982.68 *   1.78%*
Hmean     send-2048      7627.99 (   0.00%)     7590.30 (  -0.49%)
Hmean     send-3312     11907.07 (   0.00%)    12114.03 *   1.74%*
Hmean     send-4096     14300.09 (   0.00%)    14753.34 *   3.17%*
Hmean     send-8192     24576.21 (   0.00%)    25431.42 *   3.48%*
Hmean     send-16384    42105.89 (   0.00%)    41813.30 (  -0.69%)
 (A-tcp)
Hmean     64        1191.91 (   0.00%)     1220.47 *   2.40%*
Hmean     128       2318.60 (   0.00%)     2354.56 (   1.55%)
Hmean     256       4267.41 (   0.00%)     4226.72 (  -0.95%)
Hmean     1024     13190.66 (   0.00%)    13065.91 (  -0.95%)
Hmean     2048     20466.22 (   0.00%)    20704.66 (   1.17%)
Hmean     3312     24363.57 (   0.00%)    24613.99 *   1.03%*
Hmean     4096     26144.44 (   0.00%)    26204.24 (   0.23%)
Hmean     8192     30387.77 (   0.00%)    30703.65 *   1.04%*
Hmean     16384    34942.71 (   0.00%)    34205.44 *  -2.11%*
 (B-tcp)
Hmean     64        1971.18 (   0.00%)     2120.61 *   7.58%*
Hmean     128       3752.96 (   0.00%)     3995.68 *   6.47%*
Hmean     256       6861.58 (   0.00%)     7342.93 *   7.02%*
Hmean     1024     21966.06 (   0.00%)    23725.30 *   8.01%*
Hmean     2048     33949.66 (   0.00%)    35620.67 *   4.92%*
Hmean     3312     40681.75 (   0.00%)    41543.26 *   2.12%*
Hmean     4096     44309.70 (   0.00%)    45390.03 *   2.44%*
Hmean     8192     50909.35 (   0.00%)    52157.16 *   2.45%*
Hmean     16384    57198.37 (   0.00%)    57686.96 (   0.85%)

The netperf has an almost 100% success rate on used target, prev or
recent cpus, so the domain scan is generally not performed. This test
is to see how much overhead of maintaining the idle cpumask and the
result is neutral, which sounds like the overhead is acceptable.

tbench4 Throughput
 (A)			baseline		sis_core
Hmean     1        280.53 (   0.00%)      289.44 *   3.17%*
Hmean     2        561.94 (   0.00%)      571.46 *   1.69%*
Hmean     4       1109.14 (   0.00%)     1129.88 *   1.87%*
Hmean     8       2229.39 (   0.00%)     2266.52 *   1.67%*
Hmean     16      4383.06 (   0.00%)     4473.48 *   2.06%*
Hmean     32      7124.14 (   0.00%)     7223.83 *   1.40%*
Hmean     64      8815.41 (   0.00%)     8770.21 *  -0.51%*
Hmean     128    19519.35 (   0.00%)    20337.24 *   4.19%*
Hmean     256    19392.24 (   0.00%)    20052.98 *   3.41%*
Hmean     384    19201.07 (   0.00%)    19563.63 *   1.89%*
 (B)
Hmean     1         515.98 (   0.00%)      499.91 *  -3.12%*
Hmean     2        1031.54 (   0.00%)     1044.38 *   1.24%*
Hmean     4        1953.44 (   0.00%)     1959.30 *   0.30%*
Hmean     8        3622.52 (   0.00%)     3773.08 *   4.16%*
Hmean     16       6545.82 (   0.00%)     6814.46 *   4.10%*
Hmean     32      13697.73 (   0.00%)    13078.74 *  -4.52%*
Hmean     64      25589.59 (   0.00%)    24576.52 *  -3.96%*
Hmean     128     35667.64 (   0.00%)    37590.20 *   5.39%*
Hmean     256     65215.85 (   0.00%)    64921.74 *  -0.45%*
Hmean     512     66035.57 (   0.00%)    63812.48 *  -3.37%*
Hmean     1024    53391.64 (   0.00%)    62356.50 *  16.79%*

Like netperf, tbench4 benchmark also has a high success rate ~39% on
the cache hot cpus, and the SIS overall success rate is ~45%. This
benchmark runs a fast idling workload and makes the SIS idle cpumask
quite unstable, which is a disaster to this feature. Even so, not
much but still some improvement can be seen from the result.

Conclusion
==========

The overhead of maintaining the idle cpumasks seems acceptable, and
this cost can trade for considerable throughput improvement once LLC
becomes busier in which case less idle cpus are available.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c  | 74 +++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  3 ++
 2 files changed, 65 insertions(+), 12 deletions(-)
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3aa699e9d4af..d06d59ac2f05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6309,6 +6309,16 @@  static void update_idle_cpu(int cpu)
 	}
 }
 
+static inline bool should_scan_sibling(int cpu)
+{
+	return cmpxchg(&cpu_rq(cpu)->sis_scan_sibling, 1, 0);
+}
+
+static inline void set_scan_sibling(int cpu)
+{
+	WRITE_ONCE(cpu_rq(cpu)->sis_scan_sibling, 1);
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->has_idle_cores.
@@ -6384,17 +6394,20 @@  static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 /*
  * Scan the local SMT mask for idle CPUs.
  */
-static int select_idle_smt(struct task_struct *p, int target)
+static int select_idle_smt(struct task_struct *p, int core, struct cpumask *cpus, int exclude)
 {
 	int cpu;
 
-	for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
-		if (cpu == target)
+	for_each_cpu_and(cpu, cpu_smt_mask(core), p->cpus_ptr) {
+		if (exclude && cpu == core)
 			continue;
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
 			return cpu;
 	}
 
+	if (cpus)
+		cpumask_clear_cpu(core, cpus);
+
 	return -1;
 }
 
@@ -6409,12 +6422,21 @@  static inline bool test_idle_cores(int cpu)
 	return false;
 }
 
+static inline bool should_scan_sibling(int cpu)
+{
+	return false;
+}
+
+static inline void set_scan_sibling(int cpu)
+{
+}
+
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	return __select_idle_cpu(core, p);
 }
 
-static inline int select_idle_smt(struct task_struct *p, int target)
+static inline int select_idle_smt(struct task_struct *p, int core, struct cpumask *cpus, int exclude)
 {
 	return -1;
 }
@@ -6434,6 +6456,7 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd = NULL;
+	bool scan_sibling = false;
 	u64 time = 0;
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
@@ -6479,20 +6502,31 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
-	if (sched_feat(SIS_CORE) && sched_smt_active())
+	if (sched_feat(SIS_CORE) && sched_smt_active()) {
+		/*
+		 * Due to the nature of idle core scanning, has_idle_core
+		 * hint should also consume the scan_sibling flag even
+		 * though it doesn't use the flag when scanning.
+		 */
+		scan_sibling = should_scan_sibling(target);
 		icpus = sched_domain_icpus(sd);
+	}
 
 	cpumask_and(cpus, icpus ? icpus : sched_domain_span(sd), p->cpus_ptr);
 
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
+		if (!--nr)
+			break;
+
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
-
+		} else if (scan_sibling) {
+			idle_cpu = select_idle_smt(p, cpu, icpus, 0);
+			if ((unsigned int)idle_cpu < nr_cpumask_bits)
+				break;
 		} else {
-			if (!--nr)
-				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
@@ -6503,9 +6537,25 @@  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		set_idle_cores(target, false);
 
 	if (icpus && idle_cpu == -1) {
-		/* Reset the idle cpu mask if a full scan fails */
-		if (nr > 0)
-			cpumask_clear(icpus);
+		if (nr > 0 && (has_idle_core || scan_sibling)) {
+			/*
+			 * Reset the idle cpu mask if a full scan fails,
+			 * but ignore the !has_idle_core case which should
+			 * have already been fixed during scan.
+			 */
+			if (has_idle_core)
+				cpumask_clear(icpus);
+		} else {
+			/*
+			 * As for partial scan failures, it will probably
+			 * fail again next time scanning from the same cpu.
+			 * Due to the SIS_CORE rule of CORE granule update,
+			 * some idle cpus can be missed in the mask. So it
+			 * would be reasonable to scan SMT siblings as well
+			 * if the scan is fail-prone.
+			 */
+			set_scan_sibling(target);
+		}
 	}
 
 	if (sched_feat(SIS_PROP) && this_sd && !has_idle_core) {
@@ -6657,7 +6707,7 @@  static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		has_idle_core = test_idle_cores(target);
 
 		if (!has_idle_core && cpus_share_cache(prev, target)) {
-			i = select_idle_smt(p, prev);
+			i = select_idle_smt(p, prev, NULL, 1);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 		}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1fc198be1ffd..c7f8ed5021e6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -971,6 +971,9 @@  struct rq {
 
 #ifdef CONFIG_SMP
 	unsigned int		ttwu_pending;
+#ifdef CONFIG_SCHED_SMT
+	int			sis_scan_sibling;
+#endif
 #endif
 	u64			nr_switches;