[10/10] sched/fair: Implement an EEVDF like policy

Message ID 20230306141502.810909205@infradead.org
State New
Series sched: EEVDF using latency-nice

Commit Message

Peter Zijlstra March 6, 2023, 1:25 p.m. UTC
  CFS is currently a WFQ based scheduler with only a single knob: the
weight. The addition of a second, latency oriented parameter makes
something like WF2Q or EEVDF a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.
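
Roughly, the eligibility test that defines that left half looks like
this (a toy userspace model, not the kernel code; the patch below
implements it in entity_eligible()):

#include <stdint.h>

/*
 * An entity is eligible when it is owed service, i.e.
 * lag_i = w_i*(V - v_i) >= 0, which (w_i > 0) reduces to V >= v_i.
 *
 * Keeping V as a weighted average relative to a zero point v0 lets
 * the test avoid a division:
 *
 *   V >= v_i  <=>  \Sum (v_j - v0)*w_j >= (v_i - v0) * \Sum w_j
 */
struct toy_rq {
	int64_t  avg_vruntime;	/* \Sum (v_j - v0) * w_j */
	int64_t  avg_load;	/* \Sum w_j */
	uint64_t min_vruntime;	/* v0 */
};

static int toy_eligible(const struct toy_rq *rq, uint64_t v_i)
{
	int64_t key = (int64_t)(v_i - rq->min_vruntime);

	return rq->avg_vruntime >= key * rq->avg_load;
}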

EEVDF has two parameters:

 - weight; which is mapped to nice just as before
 - relative deadline; which is related to slice length and mapped
   to the new latency nice.

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and run earlier.
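
As a back-of-the-envelope illustration of that slice-to-deadline
mapping (purely a sketch; the patch computes this in set_slice() via
calc_delta_fair(), and NICE_0_WEIGHT here merely stands in for the
kernel's NICE_0_LOAD scale):

#include <stdio.h>
#include <stdint.h>

#define NICE_0_WEIGHT	1024	/* stand-in for NICE_0_LOAD */

/* vd_i = ve_i + r_i / w_i: the request r_i scaled by inverse weight */
static uint64_t virtual_deadline(uint64_t vruntime, uint64_t slice_ns,
				 unsigned long weight)
{
	return vruntime + slice_ns * NICE_0_WEIGHT / weight;
}

int main(void)
{
	/* equal weight, different request sizes: smaller slice, earlier vd */
	printf("slice 3ms -> vd = %llu\n",
	       (unsigned long long)virtual_deadline(0, 3000000ULL, 1024));
	printf("slice 1ms -> vd = %llu\n",
	       (unsigned long long)virtual_deadline(0, 1000000ULL, 1024));
	return 0;
}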

Preemption (both tick and wakeup) is driven by testing against a fresh
pick. Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
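
Condensed, the preemption rule amounts to this (a sketch, not the
literal code; see check_preempt_tick() and check_preempt_wakeup() in
the patch):

/*
 * Redo the pick; if current is no longer what EEVDF would choose,
 * reschedule it. Tick and wakeup preemption both reduce to this.
 */
static void check_preempt(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	if (pick_eevdf(cfs_rq) != curr)
		resched_curr(rq_of(cfs_rq));
}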

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    4 
 kernel/sched/debug.c    |    6 -
 kernel/sched/fair.c     |  265 ++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |    2 
 kernel/sched/sched.h    |    1 
 5 files changed, 247 insertions(+), 31 deletions(-)
  

Comments

Mike Galbraith March 8, 2023, 8:39 a.m. UTC | #1
On Mon, 2023-03-06 at 14:25 +0100, Peter Zijlstra wrote:
> CFS is currently a WFQ based scheduler with only a single knob: the
> weight. The addition of a second, latency oriented parameter makes
> something like WF2Q or EEVDF a much better fit.
>
> Specifically, EEVDF does EDF like scheduling in the left half of the
> tree -- those entities that are owed service. Except because this is a
> virtual time scheduler, the deadlines are in virtual time as well,
> which is what allows over-subscription.

Curiosity got the best of me, and I stuffed this series into master and
did a little light testing.  Unsurprisingly, the numbers move about as
they are wont to do when you diddle anything in sched-land; most
notable were the UDP_STREAM numbers, which moved a LOT.

Another thing I found interesting was the comparison of massive_intr
vs youtube clip load.  I was expecting less scheduling from the +eevdf
tree, but got more in total and more total wait time.

Season to taste or just toss: I didn't start with a virgin tree, though
local hacks *should* be meaningless to tbench and netperf, and while
they reduce stacking, they make zero perceptible difference to the
desktop, and are common to both trees.

Some numbers:
box is old i4790 desktop box

30 sec tbench
6.3.0.g8ca09d5
Throughput 3655.33 MB/sec  8 clients  8 procs  max_latency=16.322 ms
Throughput 3651.73 MB/sec  8 clients  8 procs  max_latency=16.260 ms
Throughput 3627.60 MB/sec  8 clients  8 procs  max_latency=16.262 ms
6.3.0.g8ca09d5+eevdf
Throughput 3497.39 MB/sec  8 clients  8 procs  max_latency=12.380 ms
Throughput 3403.28 MB/sec  8 clients  8 procs  max_latency=12.296 ms
Throughput 3466.86 MB/sec  8 clients  8 procs  max_latency=12.349 ms
avg vs avg     .948


netperf                   cfs +eevdf  vs cfs
TCP_SENDFILE-1    Avg:  89258  92080   1.031
TCP_SENDFILE-2    Avg:  83271  83371   1.001
TCP_SENDFILE-4    Avg:  56395  53011    .939
TCP_SENDFILE-8    Avg:  26389  39470   1.495
TCP_SENDFILE-16   Avg:  10251  19590   1.911
TCP_STREAM-1      Avg:  72583  71276    .981
TCP_STREAM-2      Avg:  59627  54424    .912
TCP_STREAM-4      Avg:  33310  22717    .681
TCP_STREAM-8      Avg:   8062   7718    .957
TCP_STREAM-16     Avg:   5143   3726    .724
TCP_MAERTS-1      Avg:  72897  71052    .974
TCP_MAERTS-2      Avg:  55065  60022   1.090
TCP_MAERTS-4      Avg:  45525  26531    .582
TCP_MAERTS-8      Avg:  11435   7937    .694
TCP_MAERTS-16     Avg:    5437  3351    .616
TCP_RR-1          Avg: 192766 180505    .936
TCP_RR-2          Avg: 169043 164731    .974
TCP_RR-4          Avg: 115702 110938    .958
TCP_RR-8          Avg: 113085 111775    .988
TCP_RR-16         Avg:  55226  53439    .967
UDP_RR-1          Avg: 261076 242027    .927
UDP_RR-2          Avg: 224808 221913    .987
UDP_RR-4          Avg: 158232 155162    .980
UDP_RR-8          Avg: 152255 149527    .982
UDP_RR-16         Avg:  75148  72739    .967
UDP_STREAM-1      Avg: 102612 102728   1.001 hmm: UDP_STREAM deltas verified repeatable
UDP_STREAM-2      Avg:  93774  93563    .997
UDP_STREAM-4      Avg:  54728  55702   1.017
UDP_STREAM-8      Avg:  30153  63329   2.100
UDP_STREAM-16     Avg:  12997  31644   2.434

(if your mailer don't do wide, this is gonna make a mess)

youtube BigBuckBunny vs 8 hogs (ancient 8ms run 1ms sleep cpu distribution checker)
massive_intr 8 9999& perf sched record -- firefox https://www.youtube.com/watch?v=aqz-KE-bpKQ& sleep 300;killall massive_intr firefox
note: perf obviously twiddled to add 'Sum delay'.

6.3.0.g8ca09d5
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Renderer:5410         | 201867.875 ms |   106745 | avg:   0.468 ms | max:  22.465 ms | sum:49998.624 ms | max start:   100.880096 s | max end:   100.902561 s
  SwComposite:5453      | 189304.441 ms |    73405 | avg:   0.595 ms | max:  23.934 ms | sum:43657.252 ms | max start:   155.074533 s | max end:   155.098467 s
  massive_intr:5571     | 179445.480 ms |    79984 | avg:   0.705 ms | max:  22.822 ms | sum:56395.232 ms | max start:   289.707867 s | max end:   289.730689 s
  massive_intr:5569     | 179337.834 ms |    80101 | avg:   0.711 ms | max:  25.517 ms | sum:56965.103 ms | max start:   103.652981 s | max end:   103.678498 s
  massive_intr:5575     | 179190.634 ms |    79433 | avg:   0.720 ms | max:  18.873 ms | sum:57223.508 ms | max start:   223.520540 s | max end:   223.539413 s
  massive_intr:5568     | 179081.476 ms |    77118 | avg:   0.742 ms | max:  21.715 ms | sum:57188.088 ms | max start:    90.071642 s | max end:    90.093358 s
  massive_intr:5574     | 179079.481 ms |    78924 | avg:   0.723 ms | max:  23.821 ms | sum:57024.111 ms | max start:   345.637059 s | max end:   345.660881 s
  massive_intr:5570     | 178945.128 ms |    79466 | avg:   0.724 ms | max:  22.105 ms | sum:57554.610 ms | max start:    90.511319 s | max end:    90.533423 s
  massive_intr:5572     | 178867.227 ms |    80016 | avg:   0.717 ms | max:  25.140 ms | sum:57393.314 ms | max start:   193.801127 s | max end:   193.826267 s
  massive_intr:5573     | 178560.808 ms |    79764 | avg:   0.719 ms | max:  40.232 ms | sum:57338.388 ms | max start:    87.446558 s | max end:    87.486790 s
  X:2492                |  86585.854 ms |    17924 | avg:   0.294 ms | max:  20.053 ms | sum: 5262.780 ms | max start:   106.798324 s | max end:   106.818377 s
  llvmpipe-7:2870       |  36803.778 ms |     7268 | avg:   1.727 ms | max:  30.168 ms | sum:12548.310 ms | max start:    87.406812 s | max end:    87.436981 s
  llvmpipe-3:2866       |  35004.654 ms |     6161 | avg:   1.385 ms | max:  19.992 ms | sum: 8531.873 ms | max start:    87.410811 s | max end:    87.430803 s
  llvmpipe-1:2864       |  34615.309 ms |     6423 | avg:   1.238 ms | max:  21.740 ms | sum: 7954.871 ms | max start:    93.834245 s | max end:    93.855985 s
  llvmpipe-2:2865       |  34375.917 ms |     6205 | avg:   1.273 ms | max:  22.031 ms | sum: 7897.655 ms | max start:    87.414812 s | max end:    87.436843 s
  llvmpipe-0:2863       |  32479.993 ms |     8472 | avg:   0.906 ms | max:  18.145 ms | sum: 7674.587 ms | max start:   156.041479 s | max end:   156.059624 s
  llvmpipe-5:2868       |  32284.589 ms |     5668 | avg:   1.271 ms | max:  21.562 ms | sum: 7203.223 ms | max start:    98.798222 s | max end:    98.819784 s
  llvmpipe-6:2869       |  31752.624 ms |     5689 | avg:   1.241 ms | max:  18.067 ms | sum: 7057.222 ms | max start:    87.422817 s | max end:    87.440885 s
  llvmpipe-4:2867       |  31621.552 ms |     5298 | avg:   1.327 ms | max:  21.903 ms | sum: 7029.350 ms | max start:    87.418812 s | max end:    87.440715 s
  MediaPD~oder #1:5910  |  24623.459 ms |     7900 | avg:   0.455 ms | max:  18.267 ms | sum: 3596.813 ms | max start:   143.181740 s | max end:   143.200008 s
  MediaPD~oder #1:5908  |  24498.697 ms |     7698 | avg:   0.470 ms | max:  24.616 ms | sum: 3614.831 ms | max start:   222.568707 s | max end:   222.593322 s
  MediaPD~oder #1:5909  |  24447.400 ms |     7744 | avg:   0.476 ms | max:  22.047 ms | sum: 3683.826 ms | max start:   234.216798 s | max end:   234.238845 s
  MediaPD~oder #1:5907  |  24349.888 ms |     7819 | avg:   0.437 ms | max:  19.288 ms | sum: 3413.911 ms | max start:   131.920097 s | max end:   131.939385 s
  Isolated Web Co:5457  |  10982.274 ms |     5768 | avg:   0.295 ms | max:  25.414 ms | sum: 1701.308 ms | max start:    91.801774 s | max end:    91.827188 s
...
 ------------------------------------------------------------------------------------------------------------
  TOTAL:                |2370376.486 ms |  1189527 |                                     |    654188.153 ms |
 ------------------------------------------------------------------------------------------------------------

6.3.0.g8ca09d5 +eevdf
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Renderer:5675         | 211250.021 ms |    84700 | avg:   0.546 ms | max:  20.503 ms | sum:46225.011 ms | max start:   317.446170 s | max end:   317.466673 s
  SwComposite:5719      | 205804.660 ms |    66043 | avg:   0.685 ms | max:  19.871 ms | sum:45214.452 ms | max start:   119.470385 s | max end:   119.490256 s
  massive_intr:5838     | 195285.673 ms |    75458 | avg:   0.885 ms | max:  18.171 ms | sum:66793.033 ms | max start:   177.270747 s | max end:   177.288919 s
  massive_intr:5835     | 195217.246 ms |    75005 | avg:   0.884 ms | max:  18.211 ms | sum:66340.670 ms | max start:   340.966607 s | max end:   340.984818 s
  massive_intr:5836     | 195148.073 ms |    74723 | avg:   0.891 ms | max:  22.868 ms | sum:66544.981 ms | max start:    92.619771 s | max end:    92.642639 s
  massive_intr:5840     | 195093.638 ms |    75229 | avg:   0.886 ms | max:  21.502 ms | sum:66630.906 ms | max start:    96.715761 s | max end:    96.737263 s
  massive_intr:5837     | 195081.767 ms |    74906 | avg:   0.890 ms | max:  18.384 ms | sum:66672.064 ms | max start:   157.736916 s | max end:   157.755300 s
  massive_intr:5839     | 195067.653 ms |    74433 | avg:   0.892 ms | max:  17.731 ms | sum:66391.236 ms | max start:   327.225658 s | max end:   327.243389 s
  massive_intr:5841     | 194947.303 ms |    75468 | avg:   0.890 ms | max:  19.250 ms | sum:67155.572 ms | max start:    94.936517 s | max end:    94.955767 s
  massive_intr:5834     | 194820.690 ms |    74531 | avg:   0.901 ms | max:  16.593 ms | sum:67172.443 ms | max start:    87.268034 s | max end:    87.284627 s
  X:2673                |  55537.286 ms |    24293 | avg:   0.624 ms | max:  19.071 ms | sum:15149.642 ms | max start:   130.940240 s | max end:   130.959311 s
  MediaPD~oder #1:6190  |  24885.689 ms |    19835 | avg:   0.576 ms | max:  16.064 ms | sum:11424.368 ms | max start:   274.211226 s | max end:   274.227290 s
  MediaPD~oder #1:6188  |  24802.432 ms |    19475 | avg:   0.589 ms | max:  24.496 ms | sum:11465.965 ms | max start:   255.251754 s | max end:   255.276250 s
  MediaPD~oder #1:6189  |  24784.917 ms |    19684 | avg:   0.573 ms | max:  18.250 ms | sum:11277.622 ms | max start:   263.644372 s | max end:   263.662622 s
  MediaPD~oder #1:6187  |  24660.751 ms |    19613 | avg:   0.572 ms | max:  19.865 ms | sum:11221.213 ms | max start:   172.852720 s | max end:   172.872585 s
  llvmpipe-6:3047       |  18633.251 ms |     7545 | avg:   3.428 ms | max:  27.450 ms | sum:25864.293 ms | max start:   150.773081 s | max end:   150.800531 s
  llvmpipe-7:3048       |  18414.068 ms |     8024 | avg:   3.880 ms | max:  23.249 ms | sum:31135.265 ms | max start:   150.776388 s | max end:   150.799637 s
  llvmpipe-5:3046       |  17914.117 ms |     7336 | avg:   3.429 ms | max:  19.998 ms | sum:25155.781 ms | max start:   137.463848 s | max end:   137.483845 s
  llvmpipe-3:3044       |  17669.019 ms |     7913 | avg:   3.329 ms | max:  21.280 ms | sum:26340.572 ms | max start:   232.804014 s | max end:   232.825294 s
  llvmpipe-4:3045       |  17539.666 ms |     7438 | avg:   3.353 ms | max:  22.802 ms | sum:24936.398 ms | max start:    94.800014 s | max end:    94.822817 s
  llvmpipe-0:3041       |  17428.494 ms |     9445 | avg:   2.663 ms | max:  29.456 ms | sum:25153.007 ms | max start:    96.231519 s | max end:    96.260975 s
  llvmpipe-2:3043       |  17239.204 ms |     7962 | avg:   3.282 ms | max:  24.674 ms | sum:26133.925 ms | max start:   161.019506 s | max end:   161.044179 s
  llvmpipe-1:3042       |  17143.242 ms |     8261 | avg:   3.118 ms | max:  22.086 ms | sum:25756.911 ms | max start:   379.521740 s | max end:   379.543826 s
  Isolated Web Co:5723  |  11075.262 ms |    15573 | avg:   0.280 ms | max:  23.996 ms | sum: 4360.925 ms | max start:   230.161012 s | max end:   230.185008 s
...
 ------------------------------------------------------------------------------------------------------------
  TOTAL:                |2368428.034 ms |  2114759 |                                     |   1048388.278 ms |
 ------------------------------------------------------------------------------------------------------------
vs 6.3.0.g8ca09d5               .999         1.777                                                 1.602
  
Mike Galbraith March 8, 2023, 9:26 a.m. UTC | #2
On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> 
> netperf                   cfs +eevdf  vs cfs
> TCP_SENDFILE-1    Avg:  89258  92080   1.031
> TCP_SENDFILE-2    Avg:  83271  83371   1.001
> TCP_SENDFILE-4    Avg:  56395  53011    .939
> TCP_SENDFILE-8    Avg:  26389  39470   1.495
> TCP_SENDFILE-16   Avg:  10251  19590   1.911

Forgot to recheck this not-so-modest clients >= CPUS win; it's also
repeatable.

	-Mike
  
Mike Galbraith March 8, 2023, 1:36 p.m. UTC | #3
On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
>
> Curiosity got the best of me...

Remember this little bugger, allegedly distilled from a real
application control thread starvation issue?

6.3.0.g8ca09d5-master
homer:/root # time taskset -c 3 starve
expecting to receive 10000000 signals

real    0m24.424s
user    0m4.468s
sys     0m18.957s

6.3.0.g8ca09d5-eevdf
homer:/root # time taskset -c 3 starve
expecting to receive 10000000 signals
zzzzzz
^C

real    15m24.115s
user    0m3.078s
sys     0m0.000s


#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/wait.h>

/* decremented by the parent's signal handler; 0 means we are done */
volatile unsigned long loop;

void handler(int n)
{
	if (loop > 0)
		--loop;
}

/* the child does nothing but flood its parent with SIGUSR1 */
static int child(void)
{
	pid_t ppid = getppid();

	sleep(1);
	while (1)
		kill(ppid, SIGUSR1);
	return 0;
}

int main(int argc, char **argv)
{
	pid_t child_pid;
	int r;

	loop = argc > 1 ? strtoul(argv[1], NULL, 10) : 10000000;
	printf("expecting to receive %lu signals\n", loop);

	if ((child_pid = fork()) == 0)
		exit(child());

	signal(SIGUSR1, handler);
	/* if the handler never gets to run, this never terminates */
	while (loop)
		sleep(1);
	r = kill(child_pid, SIGTERM);
	waitpid(child_pid, NULL, 0);
	return 0;
}
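
(For reference, nothing special is needed to build or run it, e.g.

  gcc -O2 -o starve starve.c
  time taskset -c 3 ./starve 10000000

as in the transcripts above.)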
  
Mike Galbraith March 9, 2023, 4:23 a.m. UTC | #4
On Wed, 2023-03-08 at 14:36 +0100, Mike Galbraith wrote:
>
> Remember this little bugger, allegedly distilled from a real
> application control thread starvation issue?
>
> 6.3.0.g8ca09d5-master
> homer:/root # time taskset -c 3 starve      
> expecting to receive 10000000 signals
>
> real    0m24.424s
> user    0m4.468s
> sys     0m18.957s
>
> 6.3.0.g8ca09d5-eevdf
> homer:/root # time taskset -c 3 starve
> expecting to receive 10000000 signals
> zzzzzz
> ^C

Ok, seems there must be a math booboo lurking.

virgin source, 100% hog vs tbench buddy pair, all pinned.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 5060 root      20   0    4420    680    680 R 96.01 0.004   0:41.40 3 cpuhog
 5058 root      20   0   25500   1920   1792 S 2.326 0.012   0:01.05 3 tbench
 5059 root      20   0    8796    896    768 R 1.661 0.006   0:00.78 3 tbench_srv

echo NO_PRESERVE_LAG > features

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 5060 root      20   0    4420    680    680 R 99.33 0.004   1:28.24 3 cpuhog
 5058 root      20   0   25500   1920   1792 R 0.333 0.012   0:01.75 3 tbench
 5059 root      20   0    8796    896    768 S 0.333 0.006   0:01.30 3 tbench_srv
  
Peter Zijlstra March 9, 2023, 9:06 a.m. UTC | #5
Hi Mike!

On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> >
> > Curiosity got the best of me...
> 
> Remember this little bugger, allegedly distilled from a real
> application control thread starvation issue?

Oooh, yeah, I should still have that somewhere. I'll try and remember
what exactly was needed to make it behave properly.

Thanks for your feedback. I'll go prod at things a bit more. Like I
wrote, the code is more or less at the point where it stopped crashing
and doing obviously bad things. It definitely needs more attention.
  
Peter Zijlstra March 9, 2023, 12:44 p.m. UTC | #6
On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> Hi Mike!
> 
> On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > >
> > > Curiosity got the best of me...
> > 
> > Remember this little bugger, allegedly distilled from a real
> > application control thread starvation issue?
> 
> Oooh, yeah, I should still have that somewhere. I'll try and remember
> what exactly was needed to make it behave properly.

That thing wants both wakeup preemption and sleeper bonus. Specifically,
it needs the signal to insta-preempt the 'pointless' kill loop.

What happens is that while lag is positive we get this; when lag goes
negative, wakeup-preemption is not achieved and we get delayed by a
full tick.

This gets us very little actual runtime.

Let me see what to do about that...
  
Peter Zijlstra March 9, 2023, 3:29 p.m. UTC | #7
On Thu, Mar 09, 2023 at 01:44:13PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> > Hi Mike!
> > 
> > On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > > >
> > > > Curiosity got the best of me...
> > > 
> > > Remember this little bugger, allegedly distilled from a real
> > > application control thread starvation issue?
> > 
> > Oooh, yeah, I should still have that somewhere. I'll try and remember
> > what exactly was needed to make it behave properly.
> 
> That thing wants both wakeup preemption and sleeper bonus. Specifically,
> it needs the signal to insta-preempt the 'pointless' kill loop.
> 
> What happens is that while lag is positive we get this; when lag goes
> negative, wakeup-preemption is not achieved and we get delayed by a
> full tick.
> 
> This gets us very little actual runtime.
> 
> Let me see what to do about that...

So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
works -- this is the absolutely minimal amount required. It sucks a bit
it's HZ dependent, but alas.

Also, the whole sleeper bonus gets us back into needing to track the old
vruntime and the overflow crap for super long sleeps and all that fugly
:/ I was so hoping we could delete that code.

Oh well.

(also, did you know that removing the debug cruft helps with running
numbers? ;-)

I think it adds a bit of variance to the numbers -- but I've not run
long enough to tell with certainty.
  
Peter Zijlstra March 9, 2023, 3:39 p.m. UTC | #8
On Thu, Mar 09, 2023 at 04:29:04PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 01:44:13PM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> > > Hi Mike!
> > > 
> > > On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > > > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > > > >
> > > > > Curiosity got the best of me...
> > > > 
> > > > Remember this little bugger, allegedly distilled from a real
> > > > application control thread starvation issue?
> > > 
> > > Oooh, yeah, I should still have that somewhere. I'll try and remember
> > > what exactly was needed to make it behave properly.
> > 
> > That thing wants both wakeup preemption and sleeper bonus. Specifically,
> > it needs the signal to insta-preempt the 'pointless' kill loop.
> > 
> > What happens is that while lag is positive we get this; when lag goes
> > negative, wakeup-preemption is not achieved and we get delayed by a
> > full tick.
> > 
> > This gets us very little actual runtime.
> > 
> > Let me see what to do about that...
> 
> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.
> 
> Also, the whole sleeper bonus gets us back into needing to track the old
> vruntime and the overflow crap for super long sleeps and all that fugly
> :/ I was so hoping we could delete that code.
> 
> Oh well.
> 
> (also, did you know that removing the debug cruft helps with running
> numbers? ;-)

Also, it helps to turn the sched_feat on... clearly I should be calling
it a day.
  
Mike Galbraith March 9, 2023, 4:24 p.m. UTC | #9
On Thu, 2023-03-09 at 16:29 +0100, Peter Zijlstra wrote:
>
> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.
>
> Also, the whole sleeper bonus gets us back into needing to track the old
> vruntime and the overflow crap for super long sleeps and all that fugly
> :/ I was so hoping we could delete that code.
>
> Oh well.

Yeah, it's a worthy target.

	-Mike
  
Peter Zijlstra March 9, 2023, 4:42 p.m. UTC | #10
On Thu, Mar 09, 2023 at 04:29:04PM +0100, Peter Zijlstra wrote:

> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.

Fixes starve, sucks for schbench and hackbench :/

Clearly more thinking is required...

root@ivb-ep:~/bench# echo NO_FAIR_SLEEPERS > /debug/sched/features
root@ivb-ep:~/bench# ./doit-schbench.sh ; ./doit-hackbench-series.sh
Latency percentiles (usec)
50.0000th: 83
75.0000th: 102
90.0000th: 109
95.0000th: 114
*99.0000th: 450
99.5000th: 723
99.9000th: 985
min=0, max=1067
1:            0.55355 +- 0.00290 seconds time elapsed  ( +-  0.52% )
2:            0.79591 +- 0.00545 seconds time elapsed  ( +-  0.68% )
5:             1.5804 +- 0.0102 seconds time elapsed  ( +-  0.65% )
10:             2.5674 +- 0.0110 seconds time elapsed  ( +-  0.43% )
20:             4.6116 +- 0.0160 seconds time elapsed  ( +-  0.35% )
40:             9.5965 +- 0.0167 seconds time elapsed  ( +-  0.17% )
root@ivb-ep:~/bench# time taskset -c 3 ./starve/starve 1000000
expecting to receive 1000000 signals
^C

real    0m32.999s
user    0m0.000s
sys     0m0.719s
root@ivb-ep:~/bench# echo FAIR_SLEEPERS > /debug/sched/features
root@ivb-ep:~/bench# ./doit-schbench.sh ; ./doit-hackbench-series.sh
Latency percentiles (usec)
50.0000th: 87
75.0000th: 103
90.0000th: 111
95.0000th: 116
*99.0000th: 163
99.5000th: 697
99.9000th: 1110
min=0, max=1522
1:            0.59076 +- 0.00577 seconds time elapsed  ( +-  0.98% )
2:            0.86093 +- 0.00407 seconds time elapsed  ( +-  0.47% )
5:             2.1018 +- 0.0129 seconds time elapsed  ( +-  0.61% )
10:             3.6378 +- 0.0395 seconds time elapsed  ( +-  1.09% )
20:            5.56884 +- 0.00979 seconds time elapsed  ( +-  0.18% )
40:            10.8570 +- 0.0207 seconds time elapsed  ( +-  0.19% )
root@ivb-ep:~/bench# time taskset -c 3 ./starve/starve 1000000
expecting to receive 1000000 signals

real    0m5.651s
user    0m0.604s
sys     0m4.047s


---

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4938,17 +4938,22 @@ place_entity(struct cfs_rq *cfs_rq, stru
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
 
+	if (sched_feat(PRESERVE_LAG))
+		vruntime -= se->lag;
+
 	if (sched_feat(FAIR_SLEEPERS)) {
-		u64 sleep_time;
+//		u64 sleep_time;
 
 		/* sleeps up to a single latency don't count. */
 		if (!initial) {
-			unsigned long thresh;
+			unsigned long thresh = TICK_NSEC;
 
-			if (se_is_idle(se))
-				thresh = sysctl_sched_min_granularity;
-			else
-				thresh = sysctl_sched_latency;
+			if (!sched_feat(EEVDF)) {
+				if (se_is_idle(se))
+					thresh = sysctl_sched_min_granularity;
+				else
+					thresh = sysctl_sched_latency;
+			}
 
 			/*
 			 * Halve their sleep time's effect, to allow
@@ -4957,7 +4962,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 			if (sched_feat(GENTLE_FAIR_SLEEPERS))
 				thresh >>= 1;
 
-			vruntime -= thresh;
+			vruntime -= calc_delta_fair(thresh, se);
 		}
 
 		/*
@@ -4966,15 +4971,12 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		 * slept for a long time, don't even try to compare its vruntime with
 		 * the base as it may be too far off and the comparison may get
 		 * inversed due to s64 overflow.
-		 */
 		sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
 		if ((s64)sleep_time < 60LL * NSEC_PER_SEC)
+		 */
 			vruntime = max_vruntime(se->vruntime, vruntime);
 	}
 
-	if (sched_feat(PRESERVE_LAG))
-		vruntime -= se->lag;
-
 	se->vruntime = vruntime;
 	set_slice(cfs_rq, se);
 }
  
Peter Zijlstra March 10, 2023, 8:38 p.m. UTC | #11
On Thu, Mar 09, 2023 at 05:23:33AM +0100, Mike Galbraith wrote:
> Ok, seems there must be a math booboo lurking.

Yep that.. please try the version I pushed out here:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf

(It's based on 6.3-rc1, but it should be trivial to rebase onto v6.2 if
you so wish.)
  
Mike Galbraith March 11, 2023, 5:53 a.m. UTC | #12
On Fri, 2023-03-10 at 21:38 +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 05:23:33AM +0100, Mike Galbraith wrote:
> > Ok, seems there must be a math booboo lurking.
>
> Yep that.. please try the version I pushed out here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf
>
> (It's based on 6.3-rc1, but it should be trivial to rebase onto v6.2 if
> you so wish.)

I stuffed it into .today.  Yup, extreme sleeper stinginess is gone.

tbench/netperf numbers look roughly as they did previously.
massive_intr vs desktop CPU distribution improved as expected, but (to
me) oddly, the amount of the desktop beating itself up did not.

These are all .today

perf.data.eevdf
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
  massive_intr:(9)      |1499451.694 ms |   647385 | avg:   0.940 ms | max:  27.969 ms | sum:608348.424 ms | max start:   467.219496 s | max end:   467.247465 s
  TOTAL:                |2369880.457 ms |  2797272 |                                     |   1243484.857 ms |
perf.data.eevdf_FAIR_SLEEPERS
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
  massive_intr:(9)      |1415743.272 ms |   714999 | avg:   1.013 ms | max:  73.722 ms | sum:724624.600 ms | max start:  2678.846216 s | max end:  2678.919938 s
  TOTAL:                |2362333.392 ms |  2871586 |                                     |   1238748.802 ms |
perf.data.master
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
  massive_intr:(9)      |1463798.673 ms |   598383 | avg:   0.730 ms | max:  40.029 ms | sum:437069.674 ms | max start:   945.022965 s | max end:   945.062994 s
  TOTAL:                |2370011.937 ms |  1109828 |                                     |    616408.789 ms |
  
Mike Galbraith March 11, 2023, 7:56 a.m. UTC | #13
On Sat, 2023-03-11 at 06:53 +0100, Mike Galbraith wrote:
>
> massive_intr vs desktop CPU distribution improved as expected, but
> (to me) oddly, the amount of the desktop beating itself up did not.

Hmm, maybe expected when fancy deadline math tries to wedge a very wide
GUI into a not so wide and pretty well saturated CPU.

	-Mike
  

Patch

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -548,6 +548,9 @@  struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
 	struct rb_node			run_node;
+	u64				deadline;
+	u64				min_deadline;
+
 	struct list_head		group_node;
 	unsigned int			on_rq;
 
@@ -556,6 +559,7 @@  struct sched_entity {
 	u64				vruntime;
 	u64				prev_sum_exec_runtime;
 	s64				lag;
+	u64				slice;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -535,9 +535,13 @@  print_task(struct seq_file *m, struct rq
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
+	SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
 		p->comm, task_pid_nr(p),
 		SPLIT_NS(p->se.vruntime),
+		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+		SPLIT_NS(p->se.deadline),
+		SPLIT_NS(p->se.slice),
+		SPLIT_NS(p->se.sum_exec_runtime),
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@ 
 #include <linux/psi.h>
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
+#include <linux/rbtree_augmented.h>
 
 #include <asm/switch_to.h>
 
@@ -683,6 +684,34 @@  u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + lag;
 }
 
+/*
+ * Entity is eligible once it received less service than it ought to have,
+ * eg. lag >= 0.
+ *
+ * lag_i = S - s_i = w_i*(V - v_i)
+ *
+ * lag_i >= 0 -> V >= v_i
+ *
+ *     \Sum (v_i - v)*w_i
+ * V = ------------------ + v
+ *          \Sum w_i
+ *
+ * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ */
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg_vruntime = cfs_rq->avg_vruntime;
+	long avg_load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		avg_vruntime += entity_key(cfs_rq, curr) * curr->load.weight;
+		avg_load += curr->load.weight;
+	}
+
+	return avg_vruntime >= entity_key(cfs_rq, se) * avg_load;
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
@@ -699,8 +728,8 @@  static u64 __update_min_vruntime(struct
 
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
+	struct sched_entity *se = __pick_first_entity(cfs_rq);
 	struct sched_entity *curr = cfs_rq->curr;
-	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
 	u64 vruntime = cfs_rq->min_vruntime;
 
@@ -711,9 +740,7 @@  static void update_min_vruntime(struct c
 			curr = NULL;
 	}
 
-	if (leftmost) { /* non-empty tree */
-		struct sched_entity *se = __node_2_se(leftmost);
-
+	if (se) {
 		if (!curr)
 			vruntime = se->vruntime;
 		else
@@ -730,18 +757,50 @@  static inline bool __entity_less(struct
 	return entity_before(__node_2_se(a), __node_2_se(b));
 }
 
+#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+
+static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (deadline_gt(min_deadline, se, rse))
+			se->min_deadline = rse->min_deadline;
+	}
+}
+
+/*
+ * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ */
+static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+{
+	u64 old_min_deadline = se->min_deadline;
+	struct rb_node *node = &se->run_node;
+
+	se->min_deadline = se->deadline;
+	__update_min_deadline(se, node->rb_right);
+	__update_min_deadline(se, node->rb_left);
+
+	return se->min_deadline == old_min_deadline;
+}
+
+RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
+		     run_node, min_deadline, min_deadline_update);
+
 /*
  * Enqueue an entity into the rb-tree:
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
-	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
+	se->min_deadline = se->deadline;
+	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				__entity_less, &min_deadline_cb);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				  &min_deadline_cb);
 	avg_vruntime_sub(cfs_rq, se);
 }
 
@@ -765,6 +824,101 @@  static struct sched_entity *__pick_next_
 	return __node_2_se(next);
 }
 
+static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+
+	/*
+	 * If curr is set we have to see if its left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	return left;
+}
+
+/*
+ * Earliest Eligible Virtual Deadline First
+ *
+ * In order to provide latency guarantees for different request sizes
+ * EEVDF selects the best runnable task from two criteria:
+ *
+ *  1) the task must be eligible (must be owed service)
+ *
+ *  2) from those tasks that meet 1), we select the one
+ *     with the earliest virtual deadline.
+ *
+ * We can do this in O(log n) time due to an augmented RB-tree. The
+ * tree keeps the entries sorted on service, but also functions as a
+ * heap based on the deadline by keeping:
+ *
+ *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ *
+ * Which allows an EDF like search on (sub)trees.
+ */
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *best = NULL;
+
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+		curr = NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 *
+		 * XXX: would it be worth it to do the single division for
+		 *      avg_vruntime() once, instead of the multiplication
+		 *      in entity_eligible() O(log n) times?
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/*
+		 * If this entity has an earlier deadline than the previous
+		 * best, take this one. If it also has the earliest deadline
+		 * of its subtree, we're done.
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			best = se;
+			if (best->deadline == best->min_deadline)
+				break;
+		}
+
+		/*
+		 * If the earliest deadline in this subtree is in the fully
+		 * eligible left half of our space, go there.
+		 */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		node = node->rb_right;
+	}
+
+	if (!best || (curr && deadline_gt(deadline, best, curr)))
+		best = curr;
+
+	if (unlikely(!best)) {
+		struct sched_entity *left = __pick_first_entity(cfs_rq);
+		if (left) {
+			pr_err("EEVDF scheduling fail, picking leftmost\n");
+			return left;
+		}
+	}
+
+	return best;
+}
+
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
@@ -882,6 +1036,32 @@  static u64 sched_slice(struct cfs_rq *cf
 	return slice;
 }
 
+static void set_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (sched_feat(EEVDF)) {
+		/*
+		 * For EEVDF the virtual time slope is determined by w_i (iow.
+		 * nice) while the request time r_i is determined by
+		 * latency-nice.
+		 */
+		se->slice = se->latency_offset;
+	} else {
+		/*
+		 * When many tasks blow up the sched_period; it is possible
+		 * that sched_slice() reports unusually large results (when
+		 * many tasks are very light for example). Therefore impose a
+		 * maximum.
+		 */
+		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
+	}
+
+	/*
+	 * vd_i = ve_i + r_i / w_i
+	 */
+	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	se->min_deadline = se->deadline;
+}
+
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -1014,6 +1194,13 @@  static void update_curr(struct cfs_rq *c
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	/*
+	 * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
+	 * this is probably good enough.
+	 */
+	if ((s64)(curr->vruntime - curr->deadline) > 0)
+		set_slice(cfs_rq, curr);
+
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -4788,6 +4975,7 @@  place_entity(struct cfs_rq *cfs_rq, stru
 		vruntime -= se->lag;
 
 	se->vruntime = vruntime;
+	set_slice(cfs_rq, se);
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4996,19 +5184,20 @@  dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long ideal_runtime, delta_exec;
+	unsigned long delta_exec;
 	struct sched_entity *se;
 	s64 delta;
 
-	/*
-	 * When many tasks blow up the sched_period; it is possible that
-	 * sched_slice() reports unusually large results (when many tasks are
-	 * very light for example). Therefore impose a maximum.
-	 */
-	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
+	if (sched_feat(EEVDF)) {
+		if (pick_eevdf(cfs_rq) != curr)
+			goto preempt;
+
+		return;
+	}
 
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > ideal_runtime) {
+	if (delta_exec > curr->slice) {
+preempt:
 		resched_curr(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
@@ -5032,7 +5221,7 @@  check_preempt_tick(struct cfs_rq *cfs_rq
 	if (delta < 0)
 		return;
 
-	if (delta > ideal_runtime)
+	if (delta > curr->slice)
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -5087,17 +5276,20 @@  wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-	struct sched_entity *se;
+	struct sched_entity *left, *se;
 
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
+	if (sched_feat(EEVDF)) {
+		/*
+		 * Enabling NEXT_BUDDY will affect latency but not fairness.
+		 */
+		if (sched_feat(NEXT_BUDDY) &&
+		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+			return cfs_rq->next;
+
+		return pick_eevdf(cfs_rq);
+	}
 
-	se = left; /* ideally we run the leftmost entity */
+	se = left = pick_cfs(cfs_rq, curr);
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
@@ -6192,13 +6384,12 @@  static inline void unthrottle_offline_cf
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	SCHED_WARN_ON(task_rq(p) != rq);
 
 	if (rq->cfs.h_nr_running > 1) {
-		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+		u64 slice = se->slice;
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
@@ -7921,7 +8112,19 @@  static void check_preempt_wakeup(struct
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * XXX pick_eevdf(cfs_rq) != se ?
+		 */
+		if (pick_eevdf(cfs_rq) == pse)
+			goto preempt;
+
+		return;
+	}
+
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -8167,7 +8370,7 @@  static void yield_task_fair(struct rq *r
 
 	clear_buddies(cfs_rq, se);
 
-	if (curr->policy != SCHED_BATCH) {
+	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
 		update_rq_clock(rq);
 		/*
 		 * Update run-time statistics of the 'current'.
@@ -8180,6 +8383,8 @@  static void yield_task_fair(struct rq *r
 		 */
 		rq_clock_skip_update(rq);
 	}
+	if (sched_feat(EEVDF))
+		se->deadline += calc_delta_fair(se->slice, se);
 
 	set_skip_buddy(se);
 }
@@ -11923,8 +12128,8 @@  static void rq_offline_fair(struct rq *r
 static inline bool
 __entity_slice_used(struct sched_entity *se, int min_nr_tasks)
 {
-	u64 slice = sched_slice(cfs_rq_of(se), se);
 	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+	u64 slice = se->slice;
 
 	return (rtime * min_nr_tasks > slice);
 }
@@ -12639,7 +12844,7 @@  static unsigned int get_rr_interval_fair
 	 * idle runqueue:
 	 */
 	if (rq->cfs.load.weight)
-		rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
+		rr_interval = NS_TO_JIFFIES(se->slice);
 
 	return rr_interval;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,5 @@  SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(EEVDF, true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3316,5 +3316,6 @@  static inline void switch_mm_cid(struct
 #endif
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 #endif /* _KERNEL_SCHED_SCHED_H */