[00/10] sched: EEVDF using latency-nice

Message ID 20230306132521.968182689@infradead.org
Series sched: EEVDF using latency-nice

Message

Peter Zijlstra March 6, 2023, 1:25 p.m. UTC
  Hi!

Ever since looking at the latency-nice patches, I've wondered if EEVDF would
not make more sense, and I did point Vincent at some older patches I had for
that (which is where his augmented rbtree thing comes from).

Also, since I really dislike the dual tree, I figured we could dynamically
switch between an augmented tree and not (and while I have code for that,
it's not included in this posting because with the current results I don't
think we actually need it).

Anyway, since I'm somewhat under the weather, I spent last week desperately
trying to connect a small cluster of neurons in defiance of the snot overlord
and bring back the EEVDF patches from the dark crypts where they'd been
gathering cobwebs for the past 13 odd years.

By Friday they worked well enough, and this morning (because obviously I forgot
the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf,
tbench and sysbench -- there's a bunch of wins and losses, but nothing that
indicates a total fail.

( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
  more consistently than CFS and has a bunch of latency wins )

( hackbench also doesn't show the augmented tree and generally more expensive
  pick to be a loss, in fact it shows a slight win here )


  hackbench load + cyclictest --policy other results:


			EEVDF			 CFS

		# Min Latencies: 00053
  LNICE(19)	# Avg Latencies: 04350
		# Max Latencies: 76019

		# Min Latencies: 00052		00053
  LNICE(0)	# Avg Latencies: 00690		00687
		# Max Latencies: 14145		13913

		# Min Latencies: 00019
  LNICE(-19)	# Avg Latencies: 00261
		# Max Latencies: 05642
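
( LNICE(x) is the latency-nice value; for reference, the per-task interface
  from the latency-nice patches looks roughly like the sketch below. The
  field name and flag value are taken from those patches and should be
  treated as assumptions, not settled uapi: )

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* sched_attr as extended by the latency-nice series (assumption) */
struct sched_attr_ln {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* proposed: -20 (low latency) .. 19 */
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* value used by the series (assumption) */

static int set_latency_nice(pid_t pid, int latency_nice)
{
	struct sched_attr_ln attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = latency_nice;

	return syscall(SYS_sched_setattr, pid, &attr, 0);
}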


The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
cross-eyed from staring at tree prints and I just couldn't figure out where it
was going sideways.

There's definitely more benchmarking/tweaking to be done (0-day already
reported a stress-ng loss), but if we can pull this off we can delete a whole
bunch of icky heuristics code. EEVDF is a much better defined policy than what
we currently have.
  

Comments

Vincent Guittot March 7, 2023, 10:27 a.m. UTC | #1
On Mon, 6 Mar 2023 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).
>
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
>
> Anyway, since I'm somewhat under the weather, I spend last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.

I haven't studied your patchset in detail yet, but at first glance this
seems to be a major rework of the CFS task placement, and the latency part
is just an add-on on top of moving to EEVDF scheduling.

>
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
>
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>   more consistent than CFS and has a bunch of latency wins )
>
> ( hackbench also doesn't show the augmented tree and generally more expensive
>   pick to be a loss, in fact it shows a slight win here )
>
>
>   hackbech load + cyclictest --policy other results:
>
>
>                         EEVDF                    CFS
>
>                 # Min Latencies: 00053
>   LNICE(19)     # Avg Latencies: 04350
>                 # Max Latencies: 76019
>
>                 # Min Latencies: 00052          00053
>   LNICE(0)      # Avg Latencies: 00690          00687
>                 # Max Latencies: 14145          13913
>
>                 # Min Latencies: 00019
>   LNICE(-19)    # Avg Latencies: 00261
>                 # Max Latencies: 05642
>
>
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
>
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> much of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
>
>
  
Peter Zijlstra March 7, 2023, 1:08 p.m. UTC | #2
On Tue, Mar 07, 2023 at 11:27:37AM +0100, Vincent Guittot wrote:
> On Mon, 6 Mar 2023 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Hi!
> >
> > Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> > not make more sense, and I did point Vincent at some older patches I had for
> > that (which is here his augmented rbtree thing comes from).
> >
> > Also, since I really dislike the dual tree, I also figured we could dynamically
> > switch between an augmented tree and not (and while I have code for that,
> > that's not included in this posting because with the current results I don't
> > think we actually need this).
> >
> > Anyway, since I'm somewhat under the weather, I spend last week desperately
> > trying to connect a small cluster of neurons in defiance of the snot overlord
> > and bring back the EEVDF patches from the dark crypts where they'd been
> > gathering cobwebs for the past 13 odd years.
> 
> I haven't studied your patchset in detail yet but at a 1st glance this
> seems to be a major rework on the cfs task placement and the latency
> is just an add-on on top of moving to the EEVDF scheduling.

It completely reworks the base scheduler, placement, preemption, picking
-- everything. The only thing they have in common is that they're both
virtual time based schedulers.

The big advantage I see is that EEVDF is fairly well known and studied,
and a much better defined scheduler than WFQ. Specifically, WFQ is only
well defined in how much time is given to any task (bandwidth), but it
says nothing about how that time is distributed. That is, there is no
native preemption condition/constraint etc. -- all the code we have for
that is mostly random heuristics.

The WF2Q/EEVDF class of schedulers otoh *do* define all that. There is a
lot less wiggle room as a result. The avg_vruntime / placement stuff I
did is fundamental to how it controls bandwidth distribution and
guarantees the WFQ subset. Specifically, by limiting the pick to that
subset of tasks that has positive lag (owed time), it guarantees this
fairness -- but that means we need a working measure of lag.
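
To make that concrete, the pick boils down to something like the below -- a
boiled-down sketch only, not the kernel code: the real thing walks an
augmented rbtree keyed on deadline instead of doing a linear scan, and all
the wrap-around handling is elided:

#include <stddef.h>

struct entity {
	unsigned long long vruntime;	/* virtual service time received */
	unsigned long long deadline;	/* vruntime + request_size / weight */
};

/* eligible: lag >= 0, i.e. not ahead of the weighted avg_vruntime */
static int entity_is_eligible(const struct entity *e,
			      unsigned long long avg_vruntime)
{
	return e->vruntime <= avg_vruntime;
}

/* Earliest Eligible Virtual Deadline First */
static struct entity *pick_eevdf_sketch(struct entity *e, int nr,
					unsigned long long avg_vruntime)
{
	struct entity *best = NULL;
	int i;

	for (i = 0; i < nr; i++) {
		if (!entity_is_eligible(&e[i], avg_vruntime))
			continue;
		if (!best || e[i].deadline < best->deadline)
			best = &e[i];
	}
	return best;
}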

Similarly, since the whole 'when' thing is well defined in order to
provide the additional latency goals of these schedulers, placement is
crucial. Things like the sleeper bonus are fundamentally incompatible with
latency guarantees -- both affect the 'when'.


Initial EEVDF paper is here:

https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564

It contains a few 'mistakes' and oversights, but those should not
matter.
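
For anyone who doesn't want to wade through it, the core definitions are
roughly the following (my paraphrase, so double-check against the PDF):

\[
  \mathrm{lag}_i(t) = S_i(t) - s_i(t), \qquad
  \frac{dV}{dt} = \frac{1}{\sum_{j \in A(t)} w_j}
\]
\[
  vd_i = ve_i + \frac{r_i}{w_i}, \qquad
  \text{request } i \text{ is eligible} \iff ve_i \le V(t)
\]

where S_i is the service task i would have received in the ideal fluid
system, s_i the service it actually received, w_j the weights of the
active tasks A(t), and ve_i / vd_i the virtual eligible time / deadline of
the current request of length r_i -- the scheduler always runs the eligible
request with the earliest virtual deadline.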

Anyway, I'm still struggling to make complete sense of what you did --
will continue to stare at that.
  
Shrikanth Hegde March 8, 2023, 3:13 p.m. UTC | #3
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).
>
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
>
> Anyway, since I'm somewhat under the weather, I spend last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.
>
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
>
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>   more consistent than CFS and has a bunch of latency wins )
>
> ( hackbench also doesn't show the augmented tree and generally more expensive
>   pick to be a loss, in fact it shows a slight win here )
>
>
>   hackbech load + cyclictest --policy other results:
>
>
> 			EEVDF			 CFS
>
> 		# Min Latencies: 00053
>   LNICE(19)	# Avg Latencies: 04350
> 		# Max Latencies: 76019
>
> 		# Min Latencies: 00052		00053
>   LNICE(0)	# Avg Latencies: 00690		00687
> 		# Max Latencies: 14145		13913
>
> 		# Min Latencies: 00019
>   LNICE(-19)	# Avg Latencies: 00261
> 		# Max Latencies: 05642
>
>
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
>
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> much of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
>
Tested the patch series on powerpc systems. The test was done in the same way
as for Vincent's V12 series.

Two cgroups were created. stress-ng -l 50 --cpu=<total_cpu> runs in cgroup2, and
the micro benchmarks run in cgroup1. Different latency-nice values are assigned to
cgroup1.
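
For reference, the latency value is assigned through the cgroup interface added
by the "sched group latency support" patch; conceptually something like the
sketch below (the cgroup path and the cpu.latency.nice file name are assumptions
based on that patch, please correct me if the interface differs):

#include <stdio.h>

/* write a latency-nice value into an existing cgroup-v2 group */
static int set_cgroup_latency_nice(const char *cgroup_path, int value)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cpu.latency.nice", cgroup_path);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", value);
	return fclose(f);
}

/* e.g. set_cgroup_latency_nice("/sys/fs/cgroup/cgroup1", -20); */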

Tested on two different systems. One system has 480 CPUs and the other has 96
CPUs.

++++++++
Summary:
++++++++
For hackbench, the 480 CPU system shows good improvement.
The 96 CPU system shows the same numbers as 6.2. The smaller system was showing
regressions as discussed in Vincent's V12 series. With this patch, there is no regression.

Schbench shows good improvement compared to v6.2 at LN=0 and LN=-20, whereas
at LN=19 it shows a regression.

Please suggest if any variation of these benchmarks or a different benchmark should be run.


++++++++++++++++++
480 CPU system
++++++++++++++++++

==========
schbench
==========

		 v6.2		|  v6.2+LN=0    |  v6.2+LN=-20 |  v6.2+LN=19
1 Thread
  50.0th:	 14.00          |   12.00      |   14.50       |   15.00
  75.0th:	 16.50          |   14.50      |   17.00       |   18.00
  90.0th:	 18.50          |   17.00      |   19.50       |   20.00
  95.0th:	 20.50          |   18.50      |   22.00       |   23.50
  99.0th:	 27.50          |   24.50      |   31.50       |   155.00
  99.5th:	 36.00          |   30.00      |   44.50       |   2991.00
  99.9th:	 81.50          |   171.50     |   153.00      |   4621.00
2 Threads
  50.0th:	 14.00          |   15.50      |   17.00       |   16.00
  75.0th:	 17.00          |   18.00      |   19.00       |   19.00
  90.0th:	 20.00          |   21.00      |   22.00       |   22.50
  95.0th:	 23.00          |   23.00      |   25.00       |   25.50
  99.0th:	 71.00          |   30.50      |   35.50       |   990.50
  99.5th:	 1170.00        |   53.00      |   71.00       |   3719.00
  99.9th:	 5088.00        |   245.50     |   138.00      |   6644.00
4 Threads
  50.0th:	 20.50          |   20.00      |   20.00       |   19.50
  75.0th:	 24.50          |   23.00      |   23.00       |   23.50
  90.0th:	 31.00          |   27.00      |   26.50       |   27.50
  95.0th:	 260.50         |   29.50      |   29.00       |   35.00
  99.0th:	 3644.00        |   106.00     |   37.50       |   2884.00
  99.5th:	 5152.00        |   227.00     |   92.00       |   5496.00
  99.9th:	 8076.00        |   3662.50    |   517.00      |   8640.00
8 Threads
  50.0th:	 26.00          |   23.50      |   22.50       |   25.00
  75.0th:	 32.50          |   29.50      |   27.50       |   31.00
  90.0th:	 41.50          |   34.50      |   31.50       |   39.00
  95.0th:	 794.00         |   37.00      |   34.50       |   579.50
  99.0th:	 5992.00        |   48.50      |   52.00       |   5872.00
  99.5th:	 7208.00        |   100.50     |   97.50       |   7280.00
  99.9th:	 9392.00        |   4098.00    |   1226.00     |   9328.00
16 Threads
  50.0th:	 37.50          |   33.00      |   34.00       |   37.00
  75.0th:	 49.50          |   43.50      |   44.00       |   49.00
  90.0th:	 70.00          |   52.00      |   53.00       |   66.00
  95.0th:	 1284.00        |   57.50      |   59.00       |   1162.50
  99.0th:	 5600.00        |   79.50      |   111.50      |   5912.00
  99.5th:	 7216.00        |   282.00     |   194.50      |   7392.00
  99.9th:	 9328.00        |   4026.00    |   2009.00     |   9440.00
32 Threads
  50.0th:	 59.00          |   56.00      |   57.00       |   59.00
  75.0th:	 83.00          |   77.50      |   79.00       |   83.00
  90.0th:	 118.50         |   94.00      |   95.00       |   120.50
  95.0th:	 1921.00        |   104.50     |   104.00      |   1800.00
  99.0th:	 6672.00        |   425.00     |   255.00      |   6384.00
  99.5th:	 8252.00        |   2800.00    |   1252.00     |   7696.00
  99.9th:	 10448.00       |   7264.00    |   5888.00     |   9504.00

=========
hackbench
=========

Type	      groups     v6.2   | v6.2+LN=0   | v6.2+LN=-20 | v6.2+LN=19
Process		 10	 0.19   |   0.18      |   0.17      |   0.18
Process		 20 	 0.34   |   0.32      |   0.33      |   0.31
Process		 30 	 0.45   |   0.42      |   0.43      |   0.43
Process		 40 	 0.58   |   0.53      |   0.53      |   0.53
Process		 50 	 0.70   |   0.64      |   0.64      |   0.65
Process		 60 	 0.82   |   0.74      |   0.75      |   0.76
thread 		 10 	 0.20   |   0.19      |   0.19      |   0.19
thread 		 20 	 0.36   |   0.34      |   0.34      |   0.34
Process(Pipe)	 10 	 0.24   |   0.15      |   0.15      |   0.15
Process(Pipe)	 20 	 0.46   |   0.22      |   0.22      |   0.21
Process(Pipe)	 30 	 0.65   |   0.30      |   0.29      |   0.29
Process(Pipe)	 40 	 0.90   |   0.35      |   0.36      |   0.34
Process(Pipe)	 50 	 1.04   |   0.38      |   0.39      |   0.38
Process(Pipe)	 60 	 1.16   |   0.42      |   0.42      |   0.43
thread(Pipe)	 10 	 0.19   |   0.13      |   0.13      |   0.13
thread(Pipe)	 20 	 0.46   |   0.21      |   0.21      |   0.21


++++++++++++++++++
96 CPU system
++++++++++++++++++

===========
schbench
===========
		 v6.2        |  v6.2+LN=0    |  v6.2+LN=-20 |  v6.2+LN=19
1 Thread
  50.0th:	 10.50       |   10.00       |   10.00      |   11.00
  75.0th:	 12.50       |   11.50       |   11.50      |   12.50
  90.0th:	 15.00       |   13.00       |   13.50      |   16.50
  95.0th:	 47.50       |   15.00       |   15.00      |   274.50
  99.0th:	 4744.00     |   17.50       |   18.00      |   5032.00
  99.5th:	 7640.00     |   18.50       |   525.00     |   6636.00
  99.9th:	 8916.00     |   538.00      |   6704.00    |   9264.00
2 Threads
  50.0th:	 11.00       |   10.00       |   11.00      |   11.00
  75.0th:	 13.50       |   12.00       |   12.50      |   13.50
  90.0th:	 17.00       |   14.00       |   14.00      |   17.00
  95.0th:	 451.50      |   16.00       |   15.50      |   839.00
  99.0th:	 5488.00     |   20.50       |   18.00      |   6312.00
  99.5th:	 6712.00     |   986.00      |   19.00      |   7664.00
  99.9th:	 9856.00     |   4913.00     |   1154.00    |   8736.00
4 Threads
  50.0th:	 13.00       |   12.00       |   12.00      |   13.00
  75.0th:	 15.00       |   14.00       |   14.00      |   15.00
  90.0th:	 23.50       |   16.00       |   16.00      |   20.00
  95.0th:	 2508.00     |   17.50       |   17.50      |   1818.00
  99.0th:	 7232.00     |   777.00      |   38.50      |   5952.00
  99.5th:	 8720.00     |   3548.00     |   1926.00    |   7788.00
  99.9th:	 10352.00    |   6320.00     |   7160.00    |   10000.00
8 Threads
  50.0th:	 16.00       |   15.00       |   15.00      |   16.00
  75.0th:	 20.00       |   18.00       |   18.00      |   19.50
  90.0th:	 371.50      |   20.00       |   21.00      |   245.50
  95.0th:	 2992.00     |   22.00       |   23.00      |   2608.00
  99.0th:	 7784.00     |   1084.50     |   563.50     |   7136.00
  99.5th:	 9488.00     |   2612.00     |   2696.00    |   8720.00
  99.9th:	 15568.00    |   6656.00     |   7496.00    |   10000.00
16 Threads
  50.0th:	 23.00       |   21.00       |   20.00      |   22.50
  75.0th:	 31.00       |   27.50       |   26.00      |   29.50
  90.0th:	 1981.00     |   32.50       |   30.50      |   1500.50
  95.0th:	 4856.00     |   304.50      |   34.00      |   4046.00
  99.0th:	 10112.00    |   5720.00     |   4590.00    |   8220.00
  99.5th:	 13104.00    |   7828.00     |   7008.00    |   9312.00
  99.9th:	 18624.00    |   9856.00     |   9504.00    |   11984.00
32 Threads
  50.0th:	 36.50       |   34.50       |   33.50      |   35.50
  75.0th:	 56.50       |   48.00       |   46.00      |   52.50
  90.0th:	 4728.00     |   1470.50     |   376.00     |   3624.00
  95.0th:	 7808.00     |   4130.00     |   3850.00    |   6488.00
  99.0th:	 15776.00    |   8972.00     |   9060.00    |   9872.00
  99.5th:	 19072.00    |   11328.00    |   12224.00   |   11520.00
  99.9th:	 28864.00    |   18016.00    |   18368.00   |   18848.00


=========
hackbench
=========

Type	      groups    v6.2	  | v6.2+LN=0  | v6.2+LN=-20 | v6.2+LN=19
Process		10	0.33      |   0.33     |   0.33      |   0.33
Process		20	0.61      |   0.56     |   0.58      |   0.57
Process		30	0.87      |   0.82     |   0.81      |   0.81
Process		40	1.10      |   1.05     |   1.06      |   1.05
Process		50	1.34      |   1.28     |   1.29      |   1.29
Process		60	1.58      |   1.53     |   1.52      |   1.51
thread		10	0.36      |   0.35     |   0.35      |   0.35
thread		20	0.64	  |   0.63     |   0.62      |   0.62
Process(Pipe)	10	0.18      |   0.18     |   0.18      |   0.17
Process(Pipe)   20	0.32      |   0.31     |   0.31      |   0.31
Process(Pipe)   30	0.42      |   0.41     |   0.41      |   0.42
Process(Pipe)   40	0.56      |   0.53     |   0.55      |   0.53
Process(Pipe)   50	0.68      |   0.66     |   0.66      |   0.66
Process(Pipe)   60	0.80      |   0.78     |   0.78      |   0.78
thread(Pipe)	10	0.20      |   0.18     |   0.19      |   0.18
thread(Pipe)	20	0.34      |   0.34     |   0.33      |   0.33

Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
  
K Prateek Nayak March 22, 2023, 6:49 a.m. UTC | #4
Hello Peter,

Leaving some results from my testing on a dual socket Zen3 machine
(2 x 64C/128T) below.

tl;dr

o I've not tested workloads with nice and latency-nice yet, focusing more
  on the out-of-the-box performance. No changes to sched_feat were made
  for the same reason.

o Except for hackbench (m:n communication relationship), I do not see any
  regression for the other standard benchmarks (mostly 1:1 or 1:n relationships)
  when the system is below fully loaded.

o In the fully loaded scenario, schbench seems to be unhappy. Looking at the
  data from /proc/<pid>/sched for the tasks with schedstats enabled,
  there is an increase in the number of context switches and in the total
  wait sum. When the system is overloaded, things flip and the schbench tail
  latency improves drastically. I suspect the involuntary
  context-switches help workers make progress much sooner after wakeup
  compared to tip, thus leading to lower tail latency.

o For the same reason as above, tbench throughput takes a hit, with the
  number of involuntary context-switches increasing drastically for the
  tbench server. There is also an increase in the wait sum.

o A couple of real world workloads were also tested. DeathStarBench
  throughput tanks much more with the updated version in your tree
  compared to this series as is.
  SpecJBB Max-jOPS sees large improvements, but at the cost of a
  drop in Critical-jOPS, signifying an increase in either wait time
  or involuntary context-switches, which can lead to
  transactions taking longer to complete.

o Apart from DeathStarBench, all the trends reported remain the same when
  comparing the version in your tree and this series, as is, applied
  on the same base kernel.

I'll leave the detailed results and some limited analysis below.

On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
> Hi!
> 
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).
> 
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
> 
> Anyway, since I'm somewhat under the weather, I spend last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.
> 
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
> 
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>   more consistent than CFS and has a bunch of latency wins )
> 
> ( hackbench also doesn't show the augmented tree and generally more expensive
>   pick to be a loss, in fact it shows a slight win here )
> 
> 
>   hackbech load + cyclictest --policy other results:
> 
> 
> 			EEVDF			 CFS
> 
> 		# Min Latencies: 00053
>   LNICE(19)	# Avg Latencies: 04350
> 		# Max Latencies: 76019
> 
> 		# Min Latencies: 00052		00053
>   LNICE(0)	# Avg Latencies: 00690		00687
> 		# Max Latencies: 14145		13913
> 
> 		# Min Latencies: 00019
>   LNICE(-19)	# Avg Latencies: 00261
> 		# Max Latencies: 05642
> 

Following are the results from testing the series on a dual socket
Zen3 machine (2 x 64C/128T):

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  223-231
    Node 7: 112-127, 232-255

Kernel versions:
- tip:          6.2.0-rc6 tip sched/core
- eevdf: 	6.2.0-rc6 tip sched/core
		+ eevdf commits from your tree
		(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)

- eevdf prev:	6.2.0-rc6 tip sched/core + this series as is

When the testing started, the tip was at:
commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"

Benchmark Results:

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip			eevdf
 1-groups:	   4.63 (0.00 pct)	   4.52 (2.37 pct)
 2-groups:	   4.42 (0.00 pct)	   5.41 (-22.39 pct)	*
 4-groups:	   4.21 (0.00 pct)	   5.26 (-24.94 pct)	*
 8-groups:	   4.95 (0.00 pct)	   5.01 (-1.21 pct)
16-groups:	   5.43 (0.00 pct)	   6.24 (-14.91 pct)	*

o NPS2

Test:			tip			eevdf
 1-groups:	   4.68 (0.00 pct)	   4.56 (2.56 pct)
 2-groups:	   4.45 (0.00 pct)	   5.19 (-16.62 pct)	*
 4-groups:	   4.19 (0.00 pct)	   4.53 (-8.11 pct)	*
 8-groups:	   4.80 (0.00 pct)	   4.81 (-0.20 pct)
16-groups:	   5.60 (0.00 pct)	   6.22 (-11.07 pct)	*

o NPS4

Test:			tip			eevdf
 1-groups:	   4.68 (0.00 pct)	   4.57 (2.35 pct)
 2-groups:	   4.56 (0.00 pct)	   5.19 (-13.81 pct)	*
 4-groups:	   4.50 (0.00 pct)	   4.96 (-10.22 pct)	*
 8-groups:	   5.76 (0.00 pct)	   5.49 (4.68 pct)
16-groups:	   5.60 (0.00 pct)	   6.53 (-16.60 pct)	*

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	tip			eevdf
  1:	  36.00 (0.00 pct)	  36.00 (0.00 pct)
  2:	  37.00 (0.00 pct)	  37.00 (0.00 pct)
  4:	  38.00 (0.00 pct)	  39.00 (-2.63 pct)
  8:	  52.00 (0.00 pct)	  50.00 (3.84 pct)
 16:	  66.00 (0.00 pct)	  68.00 (-3.03 pct)
 32:	 111.00 (0.00 pct)	 109.00 (1.80 pct)
 64:	 213.00 (0.00 pct)	 212.00 (0.46 pct)
128:	 502.00 (0.00 pct)	 637.00 (-26.89 pct)	*
256:	 45632.00 (0.00 pct)	 24992.00 (45.23 pct)	^
512:	 78720.00 (0.00 pct)	 44096.00 (43.98 pct)	^

o NPS2

#workers:	tip			eevdf
  1:	  31.00 (0.00 pct)	  23.00 (25.80 pct)
  2:	  32.00 (0.00 pct)	  33.00 (-3.12 pct)
  4:	  39.00 (0.00 pct)	  37.00 (5.12 pct)
  8:	  52.00 (0.00 pct)	  49.00 (5.76 pct)
 16:	  67.00 (0.00 pct)	  68.00 (-1.49 pct)
 32:	 113.00 (0.00 pct)	 112.00 (0.88 pct)
 64:	 213.00 (0.00 pct)	 214.00 (-0.46 pct)
128:	 508.00 (0.00 pct)	 491.00 (3.34 pct)
256:	 46912.00 (0.00 pct)	 22304.00 (52.45 pct)	^
512:	 76672.00 (0.00 pct)	 42944.00 (43.98 pct)	^

o NPS4

#workers:	tip			eevdf
  1:	  33.00 (0.00 pct)	  30.00 (9.09 pct)
  2:	  40.00 (0.00 pct)	  36.00 (10.00 pct)
  4:	  44.00 (0.00 pct)	  41.00 (6.81 pct)
  8:	  73.00 (0.00 pct)	  73.00 (0.00 pct)
 16:	  71.00 (0.00 pct)	  71.00 (0.00 pct)
 32:	 111.00 (0.00 pct)	 115.00 (-3.60 pct)
 64:	 217.00 (0.00 pct)	 211.00 (2.76 pct)
128:	 509.00 (0.00 pct)	 553.00 (-8.64 pct)	*
256:	 44352.00 (0.00 pct)	 26848.00 (39.46 pct)	^
512:	 75392.00 (0.00 pct)	 44352.00 (41.17 pct)	^


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip			eevdf
    1	 483.10 (0.00 pct)	 476.46 (-1.37 pct)
    2	 956.03 (0.00 pct)	 943.12 (-1.35 pct)
    4	 1786.36 (0.00 pct)	 1760.64 (-1.43 pct)
    8	 3304.47 (0.00 pct)	 3105.19 (-6.03 pct)
   16	 5440.44 (0.00 pct)	 5609.24 (3.10 pct)
   32	 10462.02 (0.00 pct)	 10416.02 (-0.43 pct)
   64	 18995.99 (0.00 pct)	 19317.34 (1.69 pct)
  128	 27896.44 (0.00 pct)	 28459.38 (2.01 pct)
  256	 49742.89 (0.00 pct)	 46371.44 (-6.77 pct)	*
  512	 49583.01 (0.00 pct)	 45717.22 (-7.79 pct)	*
 1024	 48467.75 (0.00 pct)	 43475.31 (-10.30 pct)	*

o NPS2

Clients:	tip			eevdf
    1	 472.57 (0.00 pct)	 475.35 (0.58 pct)
    2	 938.27 (0.00 pct)	 942.19 (0.41 pct)
    4	 1764.34 (0.00 pct)	 1783.50 (1.08 pct)
    8	 3043.57 (0.00 pct)	 3205.85 (5.33 pct)
   16	 5103.53 (0.00 pct)	 5154.94 (1.00 pct)
   32	 9767.22 (0.00 pct)	 9793.81 (0.27 pct)
   64	 18712.65 (0.00 pct)	 18601.10 (-0.59 pct)
  128	 27691.95 (0.00 pct)	 27542.57 (-0.53 pct)
  256	 47939.24 (0.00 pct)	 43401.62 (-9.46 pct)	*
  512	 47843.70 (0.00 pct)	 43971.16 (-8.09 pct)	*
 1024	 48412.05 (0.00 pct)	 42808.58 (-11.57 pct)	*

o NPS4

Clients:	tip			eevdf
    1	 486.74 (0.00 pct)	 484.88 (-0.38 pct)
    2	 950.50 (0.00 pct)	 950.04 (-0.04 pct)
    4	 1778.58 (0.00 pct)	 1796.03 (0.98 pct)
    8	 3106.36 (0.00 pct)	 3180.09 (2.37 pct)
   16	 5139.81 (0.00 pct)	 5139.50 (0.00 pct)
   32	 9911.04 (0.00 pct)	 10086.37 (1.76 pct)
   64	 18201.46 (0.00 pct)	 18289.40 (0.48 pct)
  128	 27284.67 (0.00 pct)	 26947.19 (-1.23 pct)
  256	 46793.72 (0.00 pct)	 43971.87 (-6.03 pct)	*
  512	 48841.96 (0.00 pct)	 44255.01 (-9.39 pct)	*
 1024	 48811.99 (0.00 pct)	 43118.99 (-11.66 pct)	*

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:		tip			eevdf
 Copy:	 321229.54 (0.00 pct)	 332975.45 (3.65 pct)
Scale:	 207471.32 (0.00 pct)	 212534.83 (2.44 pct)
  Add:	 234962.15 (0.00 pct)	 243011.39 (3.42 pct)
Triad:	 246256.00 (0.00 pct)	 256453.73 (4.14 pct)

- 100 Runs:

Test:		tip			eevdf
 Copy:	 332714.94 (0.00 pct)	 333183.42 (0.14 pct)
Scale:	 216140.84 (0.00 pct)	 212160.53 (-1.84 pct)
  Add:	 239605.00 (0.00 pct)	 233168.69 (-2.68 pct)
Triad:	 258580.84 (0.00 pct)	 256972.33 (-0.62 pct)

o NPS2

- 10 Runs:

Test:		tip			eevdf
 Copy:	 324423.92 (0.00 pct)	 340685.20 (5.01 pct)
Scale:	 215993.56 (0.00 pct)	 217895.31 (0.88 pct)
  Add:	 250590.28 (0.00 pct)	 257495.12 (2.75 pct)
Triad:	 261284.44 (0.00 pct)	 261373.49 (0.03 pct)

- 100 Runs:

Test:		tip			eevdf
 Copy:	 325993.72 (0.00 pct)	 341244.18 (4.67 pct)
Scale:	 227201.27 (0.00 pct)	 227255.98 (0.02 pct)
  Add:	 256601.84 (0.00 pct)	 258026.75 (0.55 pct)
Triad:	 260222.19 (0.00 pct)	 269878.75 (3.71 pct)

o NPS4

- 10 Runs:

Test:		tip			eevdf
 Copy:	 356850.80 (0.00 pct)	 371230.27 (4.02 pct)
Scale:	 247219.39 (0.00 pct)	 237846.20 (-3.79 pct)
  Add:	 268588.78 (0.00 pct)	 261088.54 (-2.79 pct)
Triad:	 272932.59 (0.00 pct)	 284068.07 (4.07 pct)

- 100 Runs:

Test:		tip			eevdf
 Copy:	 365965.18 (0.00 pct)	 371186.97 (1.42 pct)
Scale:	 246068.58 (0.00 pct)	 245991.10 (-0.03 pct)
  Add:	 263677.73 (0.00 pct)	 269021.14 (2.02 pct)
Triad:	 273701.36 (0.00 pct)	 280566.44 (2.50 pct)

~~~~~~~~~~~~~
~ Unixbench ~
~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		          eevdf
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49077561.21 (   0.00%)    49144835.64 (   0.14%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6285373890.61 (   0.00%)  6270537933.92 (  -0.24%)
unixbench-syscall       Amean     unixbench-syscall-1        2664815.40 (   0.00%)     2679289.17 *  -0.54%*
unixbench-syscall       Amean     unixbench-syscall-512      7848462.70 (   0.00%)     7456802.37 *   4.99%*
unixbench-pipe          Hmean     unixbench-pipe-1           2531131.89 (   0.00%)     2475863.05 *  -2.18%*
unixbench-pipe          Hmean     unixbench-pipe-512       305171024.40 (   0.00%)   301182156.60 (  -1.31%)
unixbench-spawn         Hmean     unixbench-spawn-1             4058.05 (   0.00%)        4284.38 *   5.58%*
unixbench-spawn         Hmean     unixbench-spawn-512          79893.24 (   0.00%)       78234.45 *  -2.08%*
unixbench-execl         Hmean     unixbench-execl-1             4148.64 (   0.00%)        4086.73 *  -1.49%*
unixbench-execl         Hmean     unixbench-execl-512          11077.20 (   0.00%)       11137.79 (   0.55%)

o NPS2

Test			Metric	  Parallelism			tip		          eevdf
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49394822.56 (   0.00%)    49175574.26 (  -0.44%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6267817215.36 (   0.00%)  6282838979.08 *   0.24%*
unixbench-syscall       Amean     unixbench-syscall-1        2663675.03 (   0.00%)     2677018.53 *  -0.50%*
unixbench-syscall       Amean     unixbench-syscall-512      7342392.90 (   0.00%)     7443264.00 *  -1.37%*
unixbench-pipe          Hmean     unixbench-pipe-1           2533194.04 (   0.00%)     2475969.01 *  -2.26%*
unixbench-pipe          Hmean     unixbench-pipe-512       303588239.03 (   0.00%)   302217597.98 *  -0.45%*
unixbench-spawn         Hmean     unixbench-spawn-1             5141.40 (   0.00%)        4862.78 (  -5.42%)    *
unixbench-spawn         Hmean     unixbench-spawn-512          82993.79 (   0.00%)       79139.42 *  -4.64%*    *
unixbench-execl         Hmean     unixbench-execl-1             4140.15 (   0.00%)        4084.20 *  -1.35%*
unixbench-execl         Hmean     unixbench-execl-512          12229.25 (   0.00%)       11445.22 (  -6.41%)    *

o NPS4

Test			Metric	  Parallelism			tip		          eevdf
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48970677.27 (   0.00%)    49070289.56 (   0.20%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6297506696.81 (   0.00%)  6311038905.07 (   0.21%)
unixbench-syscall       Amean     unixbench-syscall-1        2664715.13 (   0.00%)     2677752.20 *  -0.49%*
unixbench-syscall       Amean     unixbench-syscall-512      7938670.70 (   0.00%)     7972291.60 (  -0.42%)
unixbench-pipe          Hmean     unixbench-pipe-1           2527605.54 (   0.00%)     2476140.77 *  -2.04%*
unixbench-pipe          Hmean     unixbench-pipe-512       305068507.23 (   0.00%)   304114548.50 (  -0.31%)
unixbench-spawn         Hmean     unixbench-spawn-1             5207.34 (   0.00%)        4964.39 (  -4.67%)    *
unixbench-spawn         Hmean     unixbench-spawn-512          81352.38 (   0.00%)       74467.00 *  -8.46%*    *
unixbench-execl         Hmean     unixbench-execl-1             4131.37 (   0.00%)        4044.09 *  -2.11%*
unixbench-execl         Hmean     unixbench-execl-512          13025.56 (   0.00%)       11124.77 * -14.59%*    *

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

                        tip                     eevdf
 1-clients:      107932.22 (0.00 pct)    106167.39 (-1.63 pct)
 2-clients:      106887.99 (0.00 pct)    105304.25 (-1.48 pct)
 4-clients:      106676.11 (0.00 pct)    104328.10 (-2.20 pct)
 8-clients:      98645.45 (0.00 pct)     94076.26 (-4.63 pct)
16-clients:      88881.23 (0.00 pct)     86831.85 (-2.30 pct)
32-clients:      86654.28 (0.00 pct)     86313.80 (-0.39 pct)
64-clients:      81431.90 (0.00 pct)     74885.75 (-8.03 pct)
128-clients:     55993.77 (0.00 pct)     55378.10 (-1.09 pct)
256-clients:     43865.59 (0.00 pct)     44326.30 (1.05 pct)

o NPS2

                        tip                     eevdf
 1-clients:      106711.81 (0.00 pct)    108576.27 (1.74 pct)
 2-clients:      106987.79 (0.00 pct)    108348.24 (1.27 pct)
 4-clients:      105275.37 (0.00 pct)    105702.12 (0.40 pct)
 8-clients:      103028.31 (0.00 pct)    96250.20 (-6.57 pct)
16-clients:      87382.43 (0.00 pct)     87683.29 (0.34 pct)
32-clients:      86578.14 (0.00 pct)     86968.29 (0.45 pct)
64-clients:      81470.63 (0.00 pct)     75906.15 (-6.83 pct)
128-clients:     54803.35 (0.00 pct)     55051.90 (0.45 pct)
256-clients:     42910.29 (0.00 pct)     44062.33 (2.68 pct)

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1

			tip		    eevdf
Max-jOPS	       100%		  115.71%	(+15.71%)  ^
Critical-jOPS	       100%		   93.59%	 (-6.41%)  *

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1

  #CCX                      	 1 CCX    2 CCX   3 CCX   4 CCX
o eevdf compared to tip		-10.93	 -14.35	  -9.74	  -6.07
o eevdf prev (this series as is)
  compared to tip                -1.99    -6.64   -4.99   -3.87

Note: #CCX is the number of LLCs the services are pinned to.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Some Preliminary Analysis ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tl;dr

- There seems to be an increase in the number of involuntary context switches
  when the system is overloaded. This probably allows newly waking tasks to
  make progress, benefiting latency sensitive workloads like schbench in the
  overloaded scenario compared to tip, but hurts tbench performance.
  When the system is fully loaded, the larger average wait time seems to hurt
  schbench performance.
  More analysis is needed to get to the bottom of the problem.

- For the hackbench 2-groups scenario, the wait time seems to go up
  drastically.

Scheduler statistics of interest are listed in detail below.

Note: The unit of all metrics denoting time is ms. They are processed from
      the per-task schedstats in /proc/<pid>/sched.
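
(The aggregation itself is nothing fancy -- conceptually something like the
sketch below, which sums a single field across a set of PIDs. The exact field
names in /proc/<pid>/sched differ between kernel versions, so treat the key
used as an assumption.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum one "key : value" field (e.g. "nr_involuntary_switches") across PIDs */
static double sum_sched_field(const int *pids, int npids, const char *key)
{
	char path[64], line[256];
	double total = 0.0;
	int i;

	for (i = 0; i < npids; i++) {
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/sched", pids[i]);
		f = fopen(path, "r");
		if (!f)
			continue;
		while (fgets(line, sizeof(line), f)) {
			char *colon = strchr(line, ':');

			if (!colon)
				continue;
			if (strncmp(line, key, strlen(key)) == 0)
				total += strtod(colon + 1, NULL);
		}
		fclose(f);
	}
	return total;
}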

o Hackbench (2 Groups) (NPS1)

				tip		eevdf			%diff
Comm				sched-messaging	sched-messaging		N/A
Sum of avg_atom			282.0024818	19.04355233		-93.24702669
Average of avg_atom		3.481512121	0.235105584		-93.24702669
Sum of avg_per_cpu		1761.949461	61.52537145		-96.50810805
Average of avg_per_cpu		21.75246248	0.759572487		-96.50810805
Average of avg_wait_time	0.007239228	0.012899105		78.18343632
Sum of nr_switches		4897740		4728784			-3.449672706
Sum of nr_voluntary_switches	4742512		4621606			-2.549408415
Sum of nr_involuntary_switches	155228		107178			-30.95446698
Sum of nr_wakeups		4742648		4623175			-2.51912012
Sum of nr_migrations		1263925		930600			-26.37221354
Sum of sum_exec_runtime		288481.15	262255.2574		-9.091024712
Sum of sum_idle_runtime		2576164.568	2851759.68		10.69788457
Sum of sum_sleep_runtime	76890.14753	78632.31679		2.265789982
Sum of wait_count		4897894		4728939			-3.449543824
Sum of wait_sum			3041.78227	24167.4694		694.5167422

o schbench (2 messengers, 128 workers - fully loaded) (NPS1)

				tip		eevdf		%diff
Comm				schbench	schbench	N/A
Sum of avg_atom			7538.162897	7289.565705	-3.297848503
Average of avg_atom		29.10487605	28.14504133	-3.297848503
Sum of avg_per_cpu		630248.6079	471215.3671	-25.23341406
Average of avg_per_cpu		2433.392309	1819.364352	-25.23341406
Average of avg_wait_time	0.054147456	25.34304285	46703.75524
Sum of nr_switches		85210		88176		3.480812111
Sum of nr_voluntary_switches	83165		83457		0.351109241
Sum of nr_involuntary_switches	2045		4719		130.7579462
Sum of nr_wakeups		83168		83459		0.34989419
Sum of nr_migrations		3265		3025		-7.350689127
Sum of sum_exec_runtime		2476504.52	2469058.164	-0.300680129
Sum of sum_idle_runtime		110294825.8	132520924.2	20.15153321
Sum of sum_sleep_runtime	5293337.741	5297778.714	0.083897408
Sum of sum_block_runtime	56.043253	15.12936	-73.00413664
Sum of wait_count		85615		88606		3.493546692
Sum of wait_sum			4653.340163	9605.221964	106.4156418

o schbench (2 messengers, 256 workers - overloaded) (NPS1)

				tip		eevdf		%diff
Comm				schbench	schbench	N/A
Sum of avg_atom			11676.77306	4803.485728	-58.8629007
Average of avg_atom		22.67334574	9.327156753	-58.8629007
Sum of avg_per_cpu		55235.68013	38286.47722	-30.68524343
Average of avg_per_cpu		107.2537478	74.34267421	-30.68524343
Average of avg_wait_time	2.23189096	2.58191945	15.68304621
Sum of nr_switches		202862		425258		109.6292061
Sum of nr_voluntary_switches	163079		165058		1.213522281
Sum of nr_involuntary_switches	39783		260200		554.0482115
Sum of nr_wakeups		163082		165058		1.211660392
Sum of nr_migrations		44199		54894		24.19738003
Sum of sum_exec_runtime		4586675.667	3963846.024	-13.57910801
Sum of sum_idle_runtime		201050644.2	195126863.7	-2.946412087
Sum of sum_sleep_runtime	10418117.66	10402686.4	-0.148119407
Sum of sum_block_runtime	1548.979156	516.115078	-66.68030838
Sum of wait_count		203377		425792		109.3609405
Sum of wait_sum			455609.3122	1100885.201	141.6292142

o tbench (256 clients - overloaded) (NPS1)

- tbench client
				tip		eevdf		% diff
comm				tbench		tbench		N/A
Sum of avg_atom			3.594587941	5.112101854	42.21663064
Average of avg_atom		0.013986724	0.019891447	42.21663064
Sum of avg_per_cpu		392838.0975	142065.4206	-63.83613975
Average of avg_per_cpu		1528.552909	552.7837377	-63.83613975
Average of avg_wait_time	0.010512441	0.006861579	-34.72895916
Sum of nr_switches		692845080	511780111	-26.1335433
Sum of nr_voluntary_switches	178151085	371234907	108.3820635
Sum of nr_involuntary_switches	514693995	140545204	-72.69344399
Sum of nr_wakeups		178151085	371234909	108.3820646
Sum of nr_migrations		45279		71177		57.19649286
Sum of sum_exec_runtime		9192343.465	9624025.792	4.69610746
Sum of sum_idle_runtime		7125370.721	16145736.39	126.5950365
Sum of sum_sleep_runtime	2222469.726	5792868.629	160.650058
Sum of sum_block_runtime	68.60879	446.080476	550.1797743
Sum of wait_count		692845479	511780543	-26.13352349
Sum of wait_sum			7287852.246	3297894.139	-54.7480653

- tbench server

				tip		eevdf		% diff
Comm				tbench_srv	tbench_srv	N/A
Sum of avg_atom			5.077837807	5.447267364	7.275331971
Average of avg_atom2		0.019758124	0.021195593	7.275331971
Sum of avg_per_cpu		538586.1634	87925.51225	-83.67475471
Average of avg_per_cpu2		2095.666006	342.1226158	-83.67475471
Average of avg_wait_time	0.000827346	0.006505748	686.3392261
Sum of nr_switches		692980666	511838912	-26.13951051
Sum of nr_voluntary_switches	690367607	390304935	-43.46418762
Sum of nr_involuntary_switches	2613059		121533977	4551.023073
Sum of nr_wakeups		690367607	390304935	-43.46418762
Sum of nr_migrations		39486		84474		113.9340526
Sum of sum_exec_runtime		9176708.278	8734423.401	-4.819646259
Sum of sum_idle_runtime		413900.3645	447180.3879	8.040588086
Sum of sum_sleep_runtime	8966201.976	6690818.107	-25.37734345
Sum of sum_block_runtime	1.776413	1.617435	-8.949382829
Sum of wait_count		692980942	511839229	-26.13949418
Sum of wait_sum			565739.6984	3295519.077	482.5150836

> 
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
> 
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> much of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
> 

DeathStarBench and SpecJBB are slightly more complex to analyze. I'll
get the schedstat data for both soon. I'll rerun some of the above
workloads with NO_PRESERVE_LAG to see if that makes any difference.
In the meantime, if you need more data from the test system for any
particular workload, please do let me know. I will collect the per-task
and system-wide schedstat data for the workload, as it is rather
inexpensive to collect and gives good insights, but if you need any
other data, I'll be more than happy to get that too for analysis.

--
Thanks and Regards,
Prateek
  
K Prateek Nayak March 22, 2023, 9:38 a.m. UTC | #5
Hello Peter,

One important detail I forgot to mention: When I picked eevdf commits
from your tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=sched/core),
they were based on v6.3-rc1 with the sched/eevdf HEAD at:

commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

On 3/22/2023 12:19 PM, K Prateek Nayak wrote:
> Hello Peter,
> 
> Leaving some results from my testing on a dual socket Zen3 machine
> (2 x 64C/128T) below.
> 
> tl;dr
> 
> o I've not tested workloads with nice and latency nice yet focusing more
>   on the out of the box performance. No changes to sched_feat were made
>   for the same reason. 
> 
> o Except for hackbench (m:n communication relationship), I do not see any
>   regression for other standard benchmarks (mostly 1:1 or 1:n) relation
>   when system is below fully loaded.
> 
> o At fully loaded scenario, schbench seems to be unhappy. Looking at the
>   data from /proc/<pid>/sched for the tasks with schedstats enabled,
>   there is an increase in number of context switches and the total wait
>   sum. When system is overloaded, things flip and the schbench tail
>   latency improves drastically. I suspect the involuntary
>   context-switches help workers make progress much sooner after wakeup
>   compared to tip thus leading to lower tail latency.
> 
> o For the same reason as above, tbench throughput takes a hit with
>   number of involuntary context-switches increasing drastically for the
>   tbench server. There is also an increase in wait sum noticed.
> 
> o Couple of real world workloads were also tested. DeathStarBench
>   throughput tanks much more with the updated version in your tree
>   compared to this series as is.
>   SpecJBB Max-jOPS sees large improvements but comes at a cost of
>   drop in Critical-jOPS signifying an increase in either wait time
>   or an increase in involuntary context-switches which can lead to
>   transactions taking longer to complete.
> 
> o Apart from DeathStarBench, the all the trends reported remain same
>   comparing the version in your tree and this series, as is, applied
>   on the same base kernel.
> 
> I'll leave the detailed results below and some limited analysis. 
> 
> On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
>> Hi!
>>
>> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
>> not make more sense, and I did point Vincent at some older patches I had for
>> that (which is here his augmented rbtree thing comes from).
>>
>> Also, since I really dislike the dual tree, I also figured we could dynamically
>> switch between an augmented tree and not (and while I have code for that,
>> that's not included in this posting because with the current results I don't
>> think we actually need this).
>>
>> Anyway, since I'm somewhat under the weather, I spend last week desperately
>> trying to connect a small cluster of neurons in defiance of the snot overlord
>> and bring back the EEVDF patches from the dark crypts where they'd been
>> gathering cobwebs for the past 13 odd years.
>>
>> By friday they worked well enough, and this morning (because obviously I forgot
>> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
>> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
>> indicates a total fail.
>>
>> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>>   more consistent than CFS and has a bunch of latency wins )
>>
>> ( hackbench also doesn't show the augmented tree and generally more expensive
>>   pick to be a loss, in fact it shows a slight win here )
>>
>>
>>   hackbech load + cyclictest --policy other results:
>>
>>
>> 			EEVDF			 CFS
>>
>> 		# Min Latencies: 00053
>>   LNICE(19)	# Avg Latencies: 04350
>> 		# Max Latencies: 76019
>>
>> 		# Min Latencies: 00052		00053
>>   LNICE(0)	# Avg Latencies: 00690		00687
>> 		# Max Latencies: 14145		13913
>>
>> 		# Min Latencies: 00019
>>   LNICE(-19)	# Avg Latencies: 00261
>> 		# Max Latencies: 05642
>>
> 
> Following are the results from testing the series on a dual socket
> Zen3 machine (2 x 64C/128T):
> 
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 223-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  223-231
>     Node 7: 112-127, 232-255
> 
> Kernel versions:
> - tip:          6.2.0-rc6 tip sched/core
> - eevdf: 	6.2.0-rc6 tip sched/core
> 		+ eevdf commits from your tree
> 		(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)

I had cherry picked the following commits for eevdf:

commit: b84a8f6b6fa3 ("sched: Introduce latency-nice as a per-task attribute")
commit: eea7fc6f13b4 ("sched/core: Propagate parent task's latency requirements to the child task")
commit: a143d2bcef65 ("sched: Allow sched_{get,set}attr to change latency_nice of the task")
commit: d9790468df14 ("sched/fair: Add latency_offset")
commit: 3d4d37acaba4 ("sched/fair: Add sched group latency support")
commit: 707840ffc8fa ("sched/fair: Add avg_vruntime")
commit: 394af9db316b ("sched/fair: Remove START_DEBIT")
commit: 89b2a2ee0e9d ("sched/fair: Add lag based placement")
commit: e3db9631d8ca ("rbtree: Add rb_add_augmented_cached() helper")
commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

from the sched/eevdf branch in your tree onto the tip branch back when
I started testing. I notice some more changes have been added since then.
I'm queuing testing of the latest changes on the updated tip:sched/core based
on v6.3-rc3. I was able to cherry-pick the latest commits from
sched/eevdf cleanly.

> 
> - eevdf prev:	6.2.0-rc6 tip sched/core + this series as is
> 
> When the testing started, the tip was at:
> commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"
> [..snip..]
> 
--
Thanks and Regards,
Prateek
  
Pavel Machek March 23, 2023, 11:53 a.m. UTC | #6
Hi!

> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).

Link for context: https://lwn.net/Articles/925371/ . "EEVDF" is not a
commonly known acronym :-).

BR,								Pavel