[24/30] sched: support preempt=none under PREEMPT_AUTO

Message ID 20240213055554.1802415-25-ankur.a.arora@oracle.com
State New
Headers
Series PREEMPT_AUTO: support lazy rescheduling |

Commit Message

Ankur Arora Feb. 13, 2024, 5:55 a.m. UTC
  The default preemption policy for the no forced preemption model under
PREEMPT_AUTO is to always schedule lazily for well-behaved, non-idle
tasks, preempting at exit-to-user.

We already have that, so enable it.

Comparing a scheduling/IPC workload:

 # perf stat -a -e cs --repeat 10 --  perf bench sched messaging -g 20 -t -l 5000

 PREEMPT_AUTO, preempt=none

	 3,074,466            context-switches      ( +-  0.34% )
	   3.66437 +- 0.00494 seconds time elapsed  ( +-  0.13% )

 PREEMPT_DYNAMIC, preempt=none

	 2,954,976            context-switches      ( +-  0.70% )
	   3.62855 +- 0.00708 seconds time elapsed  ( +-  0.20% )

Both perform similarly, but we incur a slightly higher number of
context-switches with PREEMPT_AUTO.

Drilling down we see that both voluntary and involuntary
context-switches are higher for this test:

 PREEMPT_AUTO, preempt=none

	  2115660.30 +- 20442.34 voluntary context-switches   ( +- 0.960% )
	   784690.40 +- 19629.42 involuntary context-switches ( +- 2.500% )

 PREEMPT_DYNAMIC, preempt=none

          2049027.10 +- 35237.10 voluntary context-switches   ( +- 1.710% )
	   740676.90 +- 20346.45 involuntary context-switches ( +- 2.740% )

Assuming voluntary context-switches due to explicit blocking are
similar, we expect that PREEMPT_AUTO will incur larger context
switches at exit-to-user (counted as voluntary) since that is its
default rescheduling point.

Involuntary context-switches, under PREEMPT_AUTO are seen when a
task has exceeded its time quanta by a tick. Under PREEMPT_DYNAMIC,
these are incurred when a task needs to be rescheduled and then
encounters a cond_resched().
So, these two numbers aren't directly comparable.

Comparing a kernbench workload:

  # Half load (-j 32)

                         PREEMPT_AUTO                           PREEMPT_DYNAMIC

  wall            74.41 +-     0.45 ( +-  0.60% )         74.20 +-    0.33 sec ( +- 0.45% )
  utime         1419.78 +-     2.04 ( +-  0.14% )       1416.40 +-    6.07 sec ( +- 0.42% )
  stime          247.70 +-     0.88 ( +-  0.35% )        246.23 +-    1.20 sec ( +- 0.49% )
  %cpu          2240.20 +-    16.03 ( +-  0.71% )       2240.20 +-   19.34     ( +- 0.86% )
  inv-csw      13056.00 +-   427.58 ( +-  3.27% )      18750.60 +-  771.21     ( +- 4.11% )
  vol-csw     191000.00 +-  1623.25 ( +-  0.84% )     182857.00 +- 2373.12     ( +- 1.29% )

The runtimes are basically identical for both of these. Voluntary
context switches, as above (and in the optimal, maximal runs below),
are higher. Which as mentioned above, does add up.

However, unlike the sched-messaging workload, the involuntary
context-switches are generally lower (also true for the optimal,
maximal runs below.) One reason for that might be that kbuild spends
~20% time executing in the kernel, while sched-messaging spends ~95%
time in the kernel. Which means a greater likelihood of being
preempted due to exceeding its time quanta.

 # Optimal load (-j 256)

                         PREEMPT_AUTO                           PREEMPT_DYNAMIC

  wall           65.15 +-      0.08 ( +-  0.12% )           65.10 +-      0.19 ( +-  0.29% )
  utime        1876.56 +-    477.03 ( +- 25.42% )         1873.63 +-    481.98 ( +- 25.72% )
  stime         295.77 +-     49.17 ( +- 16.62% )          294.41 +-     50.79 ( +- 17.25% )
  %cpu         3179.30 +-    970.30 ( +- 30.51% )         3172.90 +-    983.26 ( +- 30.98% )
  inv-csw    369670.00 +- 375980.00 ( +- 101.70% )      390848.00 +- 392231.00 ( +- 100.35% )
  vol-csw    216544.00 +-  28604.60 ( +- 13.20% )       205117.00 +-  23949.50 ( +- 11.67% )

 # Maximal load (-j 0)

                         PREEMPT_AUTO                           PREEMPT_DYNAMIC

  wall           66.02 +-      0.53 ( +-  0.80% )           65.67 +-      0.55 ( +-  0.83% )
  utime        2024.79 +-    439.74 ( +- 21.71% )         2026.12 +-    446.28 ( +- 22.02% )
  stime         312.13 +-     46.14 ( +- 14.78% )          311.53 +-     47.84 ( +- 15.35% )
  %cpu         3465.40 +-    883.75 ( +- 25.50% )         3473.80 +-    903.27 ( +- 26.00% )
  inv-csw    471639.00 +- 336424.00 ( +- 71.33% )       500981.00 +- 353471.00 ( +- 70.55% )
  vol-csw    190138.00 +-  44947.20 ( +- 23.63% )       177813.00 +-  44345.50 ( +- 24.93% )

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Ziljstra <peterz@infradead.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
  

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5df59a4548dc..2d33f3ff51a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8968,7 +8968,9 @@  static void __sched_dynamic_update(int mode)
 {
 	switch (mode) {
 	case preempt_dynamic_none:
-		preempt_dynamic_mode = preempt_dynamic_undefined;
+		if (mode != preempt_dynamic_mode)
+			pr_info("%s: none\n", PREEMPT_MODE);
+		preempt_dynamic_mode = mode;
 		break;
 
 	case preempt_dynamic_voluntary: