[06/17] sched/fair: Add lag based placement

Message ID 20230328110353.988000317@infradead.org
State New
Series sched: EEVDF using latency-nice

Commit Message

Peter Zijlstra March 28, 2023, 9:26 a.m. UTC
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it, in fact). Use this to do
lag-based placement over sleep+wake.

Specifically, the FAIR_SLEEPERS placement puts entities too far to the
left and messes up the deadline aspect of EEVDF.
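
For illustration only (not part of the patch; the struct and helper names
below are made up), a minimal plain-C sketch of the placement idea: record
the virtual lag at dequeue and restore the entity's position relative to
the average vruntime V at wakeup:

#include <stdint.h>

struct entity {
	int64_t vruntime;	/* v_i */
	int64_t vlag;		/* l_i = V - v_i, recorded at dequeue */
};

/* dequeue (sleep): remember how far from the average we were */
static void record_lag(struct entity *se, int64_t V)
{
	se->vlag = V - se->vruntime;
}

/* wakeup placement: v_i = V - l_i restores the old relative position */
static void place_by_lag(struct entity *se, int64_t V)
{
	se->vruntime = V - se->vlag;
}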

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |    1 
 kernel/sched/fair.c     |  129 ++++++++++++++++++++++++++++++++++--------------
 kernel/sched/features.h |    8 ++
 4 files changed, 104 insertions(+), 35 deletions(-)
  

Comments

Chen Yu April 3, 2023, 9:18 a.m. UTC | #1
On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
[...]
>  		/*
> -		 * Halve their sleep time's effect, to allow
> -		 * for a gentler effect of sleepers:
> +		 * If we want to place a task and preserve lag, we have to
> +		 * consider the effect of the new entity on the weighted
> +		 * average and compensate for this, otherwise lag can quickly
> +		 * evaporate:
> +		 *
> +		 * l_i = V - v_i <=> v_i = V - l_i
> +		 *
> +		 * V = v_avg = W*v_avg / W
> +		 *
> +		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
If I understand correctly, V' means the avg_vruntime if se_i is enqueued?
Then,

V  = (\Sum w_j*v_j) / W

V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)

Not sure how W*v_avg equals \Sum w_j*v_j?

> +		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
> +		 *    = v_avg + w_i*l_i/(W + w_i)
v_avg - w_i*l_i/(W + w_i) ?
> +		 *
> +		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
> +		 *      = l_i - w_i*l_i/(W + w_i)
> +		 *
> +		 * l_i = (W + w_i) * l_i' / W
>  		 */
[...]
> -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> -			thresh >>= 1;
> +		load = cfs_rq->avg_load;
> +		if (curr && curr->on_rq)
> +			load += curr->load.weight;
> +
> +		lag *= load + se->load.weight;
> +		if (WARN_ON_ONCE(!load))
> +			load = 1;
> +		lag = div_s64(lag, load);
>  
Should we calculate
l_i' = l_i * w / (W + w_i) instead of calculating l_i above? I thought we want to adjust
the lag (before enqueue) based on the new weight (after enqueue)


[I will start to run some benchmarks today.]

thanks,
Chenyu
  
Peter Zijlstra April 5, 2023, 9:47 a.m. UTC | #2
On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
> >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> [...]
> >  		/*
> > -		 * Halve their sleep time's effect, to allow
> > -		 * for a gentler effect of sleepers:
> > +		 * If we want to place a task and preserve lag, we have to
> > +		 * consider the effect of the new entity on the weighted
> > +		 * average and compensate for this, otherwise lag can quickly
> > +		 * evaporate:
> > +		 *
> > +		 * l_i = V - v_i <=> v_i = V - l_i
> > +		 *
> > +		 * V = v_avg = W*v_avg / W
> > +		 *
> > +		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
> If I understand correctly, V' means the avg_vruntime if se_i is enqueued?
> Then,
> 
> V  = (\Sum w_j*v_j) / W

multiply by W on both sides to get:

  V*W = \Sum w_j*v_j

> V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> 
> Not sure how W*v_avg equals \Sum w_j*v_j?

 V := v_avg

(yeah, I should clean up this stuff, already said to Josh I would)

> > +		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
> > +		 *    = v_avg + w_i*l_i/(W + w_i)
> v_avg - w_i*l_i/(W + w_i) ?

Yup -- seems typing is hard :-)

> > +		 *
> > +		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
> > +		 *      = l_i - w_i*l_i/(W + w_i)
> > +		 *
> > +		 * l_i = (W + w_i) * l_i' / W
> >  		 */
> [...]
> > -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > -			thresh >>= 1;
> > +		load = cfs_rq->avg_load;
> > +		if (curr && curr->on_rq)
> > +			load += curr->load.weight;
> > +
> > +		lag *= load + se->load.weight;
> > +		if (WARN_ON_ONCE(!load))
> > +			load = 1;
> > +		lag = div_s64(lag, load);
> >  
> Should we calculate
> l_i' = l_i * w / (W + w_i) instead of calculating l_i above? I thought we want to adjust
> the lag (before enqueue) based on the new weight (after enqueue)

We want to ensure the lag after placement is the lag we got before
dequeue.

I've updated the comment to read like so:

		/*
		 * If we want to place a task and preserve lag, we have to
		 * consider the effect of the new entity on the weighted
		 * average and compensate for this, otherwise lag can quickly
		 * evaporate.
		 *
		 * Lag is defined as:
		 *
		 *   l_i = V - v_i <=> v_i = V - l_i
		 *
		 * And we take V to be the weighted average of all v:
		 *
		 *   V = (\Sum w_j*v_j) / W
		 *
		 * Where W is: \Sum w_j
		 *
		 * Then, the weighted average after adding an entity with lag
		 * l_i is given by:
		 *
		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
		 *      = (W*V + w_i*(V - l_i)) / (W + w_i)
		 *      = (W*V + w_i*V - w_i*l_i) / (W + w_i)
		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
		 *      = V - w_i*l_i / (W + w_i)
		 *
		 * And the actual lag after adding an entity with l_i is:
		 *
		 *   l'_i = V' - v_i
		 *        = V - w_i*l_i / (W + w_i) - (V - l_i)
		 *        = l_i - w_i*l_i / (W + w_i)
		 *
		 * Which is strictly less than l_i. So in order to preserve lag
		 * we should inflate the lag before placement such that the
		 * effective lag after placement comes out right.
		 *
		 * As such, invert the above relation for l'_i to get the l_i
		 * we need to use such that the lag after placement is the lag
		 * we computed before dequeue.
		 *
		 *   l'_i = l_i - w_i*l_i / (W + w_i)
		 *        = ((W + w_i)*l_i - w_i*l_i) / (W + w_i)
		 *
		 *   (W + w_i)*l'_i = (W + w_i)*l_i - w_i*l_i
		 *                  = W*l_i
		 *
		 *   l_i = (W + w_i)*l'_i / W
		 */
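
As a quick standalone sanity check of that inversion (illustrative plain C,
not kernel code; the weights are arbitrary and stand in for cfs_rq->avg_load
plus the entity's weight):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	int64_t W = 3072, w_i = 1024;	/* existing queue load and the waking entity's weight */
	int64_t lag = 600;		/* lag recorded at dequeue, i.e. the l'_i we want after placement */

	/* inflate as place_entity() does: l_i = (W + w_i) * l'_i / W */
	int64_t l_i = lag * (W + w_i) / W;			/* 600 * 4096 / 3072 = 800 */

	/* lag actually seen once the entity joins the average:
	 * l'_i = l_i - w_i*l_i / (W + w_i) */
	int64_t effective = l_i - w_i * l_i / (W + w_i);	/* 800 - 1024*800/4096 = 600 */

	assert(effective == lag);
	return 0;
}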
  
Chen Yu April 6, 2023, 3:03 a.m. UTC | #3
On 2023-04-05 at 11:47:20 +0200, Peter Zijlstra wrote:
> On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> > On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
> > >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > [...]
> > >  		/*
> > > -		 * Halve their sleep time's effect, to allow
> > > -		 * for a gentler effect of sleepers:
> > > +		 * If we want to place a task and preserve lag, we have to
> > > +		 * consider the effect of the new entity on the weighted
> > > +		 * average and compensate for this, otherwise lag can quickly
> > > +		 * evaporate:
> > > +		 *
> > > +		 * l_i = V - v_i <=> v_i = V - l_i
> > > +		 *
> > > +		 * V = v_avg = W*v_avg / W
> > > +		 *
> > > +		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
> > If I understand correctly, V' means the avg_vruntime if se_i is enqueued?
> > Then,
> > 
> > V  = (\Sum w_j*v_j) / W
> 
> multiply by W on both sides to get:
> 
>   V*W = \Sum w_j*v_j
> 
> > V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> > 
> > Not sure how W*v_avg equals \Sum w_j*v_j?
> 
>  V := v_avg
>
I see, thanks for the explanation. 
> (yeah, I should clean up this stuff, already said to Josh I would)
> 
> > > +		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
> > > +		 *    = v_avg + w_i*l_i/(W + w_i)
> > v_avg - w_i*l_i/(W + w_i) ?
> 
> Yup -- seems typing is hard :-)
> 
> > > +		 *
> > > +		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
> > > +		 *      = l_i - w_i*l_i/(W + w_i)
> > > +		 *
> > > +		 * l_i = (W + w_i) * l_i' / W
> > >  		 */
> > [...]
> > > -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > > -			thresh >>= 1;
> > > +		load = cfs_rq->avg_load;
> > > +		if (curr && curr->on_rq)
> > > +			load += curr->load.weight;
> > > +
> > > +		lag *= load + se->load.weight;
> > > +		if (WARN_ON_ONCE(!load))
> > > +			load = 1;
> > > +		lag = div_s64(lag, load);
> > >  
> > Should we calculate
> > l_i' = l_i * w / (W + w_i) instead of calculating l_i above? I thought we want to adjust
> > the lag (before enqueue) based on the new weight (after enqueue)
> 
> We want to ensure the lag after placement is the lag we got before
> dequeue.
> 
> I've updated the comment to read like so:
> 
> 		/*
> 		 * If we want to place a task and preserve lag, we have to
> 		 * consider the effect of the new entity on the weighted
> 		 * average and compensate for this, otherwise lag can quickly
> 		 * evaporate.
> 		 *
> 		 * Lag is defined as:
> 		 *
> 		 *   l_i = V - v_i <=> v_i = V - l_i
> 		 *
> 		 * And we take V to be the weighted average of all v:
> 		 *
> 		 *   V = (\Sum w_j*v_j) / W
> 		 *
> 		 * Where W is: \Sum w_j
> 		 *
> 		 * Then, the weighted average after adding an entity with lag
> 		 * l_i is given by:
> 		 *
> 		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> 		 *      = (W*V + w_i*(V - l_i)) / (W + w_i)
> 		 *      = (W*V + w_i*V - w_i*l_i) / (W + w_i)
> 		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
small typo  w_i*l -> w_i*l_i
> 		 *      = V - w_i*l_i / (W + w_i)
> 		 *
> 		 * And the actual lag after adding an entity with l_i is:
> 		 *
> 		 *   l'_i = V' - v_i
> 		 *        = V - w_i*l_i / (W + w_i) - (V - l_i)
> 		 *        = l_i - w_i*l_i / (W + w_i)
> 		 *
> 		 * Which is strictly less than l_i. So in order to preserve lag
> 		 * we should inflate the lag before placement such that the
> 		 * effective lag after placement comes out right.
> 		 *
> 		 * As such, invert the above relation for l'_i to get the l_i
> 		 * we need to use such that the lag after placement is the lag
> 		 * we computed before dequeue.
> 		 *
> 		 *   l'_i = l_i - w_i*l_i / (W + w_i)
> 		 *        = ((W + w_i)*l_i - w_i*l_i) / (W + w_i)
> 		 *
> 		 *   (W + w_i)*l'_i = (W + w_i)*l_i - w_i*l_i
> 		 *                  = W*l_i
> 		 *
> 		 *   l_i = (W + w_i)*l'_i / W
> 		 */
Got it, thanks! This is very clear.

thanks,
Chenyu
  
Chen Yu April 13, 2023, 3:42 p.m. UTC | #4
On 2023-04-05 at 11:47:20 +0200, Peter Zijlstra wrote:
> On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> > On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
So I launched the test on another platform with more CPUs,

baseline: 6.3-rc6

compare:  sched/eevdf branch on top of commit 8c59a975d5ee ("sched/eevdf: Debug / validation crud")


--------------------------------------------------------------------------------------
schbench:mthreads = 2
                   baseline                    eevdf+NO_PLACE_BONUS
worker_threads
25%                80.00           +19.2%      95.40            schbench.latency_90%_us
                   (0.00%)                     (0.51%)          stddev
50%                183.70          +2.2%       187.80           schbench.latency_90%_us
                   (0.35%)                     (0.46%)          stddev
75%                4065            -21.4%      3193             schbench.latency_90%_us
                   (69.65%)                    (3.42%)          stddev
100%               13696           -92.4%      1040             schbench.latency_90%_us
                   (5.25%)                     (69.03%)         stddev
125%               16457           -78.6%      3514             schbench.latency_90%_us
                   (10.50%)                    (6.25%)          stddev
150%               31177           -77.5%      7008             schbench.latency_90%_us
                   (6.84%)                     (5.19%)          stddev
175%               40729           -75.1%      10160            schbench.latency_90%_us
                   (6.11%)                     (2.53%)          stddev
200%               52224           -74.4%      13385            schbench.latency_90%_us
                   (10.42%)                    (1.72%)          stddev


                  eevdf+NO_PLACE_BONUS       eevdf+PLACE_BONUS
worker_threads
25%               96.30             +0.2%      96.50            schbench.latency_90%_us
                  (0.66%)                      (0.52%)          stddev
50%               187.20            -3.0%      181.60           schbench.latency_90%_us
                  (0.21%)                      (0.71%)          stddev
75%                3034             -84.1%     482.50           schbench.latency_90%_us
                  (5.56%)                      (27.40%)         stddev
100%              648.20            +114.7%    1391             schbench.latency_90%_us
                  (64.70%)                     (10.05%)         stddev
125%              3506              -3.0%      3400             schbench.latency_90%_us
                  (2.79%)                      (9.89%)          stddev
150%              6793              +29.6%     8803             schbench.latency_90%_us
                  (1.39%)                      (7.30%)          stddev
175%               9961             +9.2%      10876            schbench.latency_90%_us
                  (1.51%)                      (6.54%)          stddev
200%              13660             +3.3%      14118            schbench.latency_90%_us
                  (1.38%)                      (6.02%)          stddev



Summary for schbench: in most cases eevdf+NO_PLACE_BONUS gives the best performance.
This is aligned with the previous test on another platform with a smaller number of
CPUs: eevdf benefits schbench overall.

---------------------------------------------------------------------------------------



hackbench: ipc=pipe mode=process default fd:20

                   baseline                     eevdf+NO_PLACE_BONUS
worker_threads
1                  103103            -0.3%     102794        hackbench.throughput_avg
25%                115562          +825.7%    1069725        hackbench.throughput_avg
50%                296514          +352.1%    1340414        hackbench.throughput_avg
75%                498059          +190.8%    1448156        hackbench.throughput_avg
100%               804560           +74.8%    1406413        hackbench.throughput_avg


                   eevdf+NO_PLACE_BONUS        eevdf+PLACE_BONUS
worker_threads
1                  102172            +1.5%     103661         hackbench.throughput_avg
25%                1076503           -52.8%     508612        hackbench.throughput_avg
50%                1394311           -68.2%     443251        hackbench.throughput_avg
75%                1476502           -70.2%     440391        hackbench.throughput_avg
100%               1512706           -76.2%     359741        hackbench.throughput_avg


Summary for hackbench pipe process test: in most cases eevdf+NO_PLACE_BONUS gives the best performance.

-------------------------------------------------------------------------------------
unixbench: test=pipe

                   baseline                     eevdf+NO_PLACE_BONUS
nr_task
1                  1405              -0.5%       1398        unixbench.score
25%                77942             +0.9%      78680        unixbench.score
50%                155384            +1.1%     157100        unixbench.score
75%                179756            +0.3%     180295        unixbench.score
100%               204030            -0.2%     203540        unixbench.score
125%               204972            -0.4%     204062        unixbench.score
150%               205891            -0.5%     204792        unixbench.score
175%               207051            -0.5%     206047        unixbench.score
200%               209387            -0.9%     207559        unixbench.score


                   eevdf+NO_PLACE_BONUS        eevdf+PLACE_BONUS
nr_task
1                  1405              -0.3%       1401        unixbench.score
25%                78640             +0.0%      78647        unixbench.score
50%                157153            -0.0%     157093        unixbench.score
75%                180152            +0.0%     180205        unixbench.score
100%               203479            -0.0%     203464        unixbench.score
125%               203866            +0.1%     204013        unixbench.score
150%               204872            -0.0%     204838        unixbench.score
175%               205799            +0.0%     205824        unixbench.score
200%               207152            +0.2%     207546        unixbench.score

Seems to have no impact on unixbench in pipe mode.
--------------------------------------------------------------------------------

netperf: TCP_RR, ipv4, loopback

                   baseline                    eevdf+NO_PLACE_BONUS
nr_threads
25%                56232            -1.7%      55265        netperf.Throughput_tps
50%                49876            -3.1%      48338        netperf.Throughput_tps
75%                24281            +1.9%      24741        netperf.Throughput_tps
100%               73598            +3.8%      76375        netperf.Throughput_tps
125%               59119            +1.4%      59968        netperf.Throughput_tps
150%               49124            +1.2%      49727        netperf.Throughput_tps
175%               41929            +0.2%      42004        netperf.Throughput_tps
200%               36543            +0.4%      36677        netperf.Throughput_tps

                   eevdf+NO_PLACE_BONUS        eevdf+PLACE_BONUS
nr_threads
25%                55296            +4.7%      57877        netperf.Throughput_tps
50%                48659            +1.9%      49585        netperf.Throughput_tps
75%                24741            +0.3%      24807        netperf.Throughput_tps
100%               76455            +6.7%      81548        netperf.Throughput_tps
125%               60082            +7.6%      64622        netperf.Throughput_tps
150%               49618            +7.7%      53429        netperf.Throughput_tps
175%               41974            +7.6%      45160        netperf.Throughput_tps
200%               36677            +6.5%      39067        netperf.Throughput_tps

Seems to have no impact on netperf.
-----------------------------------------------------------------------------------

stress-ng: futex

                   baseline                     eevdf+NO_PLACE_BONUS
nr_threads
25%                207926           -21.0%     164356       stress-ng.futex.ops_per_sec
50%                46611           -16.1%      39130        stress-ng.futex.ops_per_sec
75%                71381           -11.3%      63283        stress-ng.futex.ops_per_sec
100%               58766            -0.8%      58269        stress-ng.futex.ops_per_sec
125%               59859           +11.3%      66645        stress-ng.futex.ops_per_sec
150%               52869            +7.6%      56863        stress-ng.futex.ops_per_sec
175%               49607           +22.9%      60969        stress-ng.futex.ops_per_sec
200%               56011           +11.8%      62631        stress-ng.futex.ops_per_sec


When the system is not busy, there is a regression. When the system gets busier,
there is some improvement. Even with PLACE_BONUS enabled, there is still a regression.
Per the perf profile of the 50% case, the ratio of wakeups is nearly the same with vs without
the eevdf patches applied:
50.82            -0.7       50.15        perf-profile.children.cycles-pp.futex_wake
but there is more preemption with eevdf enabled:
135095           +15.4%     155943        stress-ng.time.involuntary_context_switches
which is close to the -16.1% performance loss.
That is to say, eevdf helps the futex wakee grab the CPU more easily (benefiting latency), while it might
have some impact on throughput?

thanks,
Chenyu
  
Chen Yu April 13, 2023, 3:55 p.m. UTC | #5
On 2023-04-13 at 23:42:34 +0800, Chen Yu wrote:
> On 2023-04-05 at 11:47:20 +0200, Peter Zijlstra wrote:
> > On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> > > On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
> So I launched the test on another platform with more CPUs,
> 
> baseline: 6.3-rc6
> 
> compare:  sched/eevdf branch on top of commit 8c59a975d5ee ("sched/eevdf: Debug / validation crud")
> Chenyu
I realized that you pushed some changes to the eevdf branch yesterday, so the tests were
actually run on top of this commit, which I pulled a week ago:
commit 4f58ee3ba245ff97a075b17b454256f9c4d769c4 ("sched/eevdf: Debug / validation crud")

thanks,
Chenyu
  

Patch

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -555,6 +555,7 @@  struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				vruntime;
 	u64				prev_sum_exec_runtime;
+	s64				vlag;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4439,6 +4439,7 @@  static void __sched_fork(unsigned long c
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.vlag			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 	set_latency_offset(p);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -689,6 +689,15 @@  u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + avg;
 }
 
+/*
+ * lag_i = S - s_i = w_i * (V - v_i)
+ */
+void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	SCHED_WARN_ON(!se->on_rq);
+	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
@@ -3417,6 +3426,8 @@  dequeue_load_avg(struct cfs_rq *cfs_rq,
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
+	unsigned long old_weight = se->load.weight;
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
@@ -3429,6 +3440,14 @@  static void reweight_entity(struct cfs_r
 
 	update_load_set(&se->load, weight);
 
+	if (!se->on_rq) {
+		/*
+		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v),
+		 * we need to scale se->vlag when w_i changes.
+		 */
+		se->vlag = div_s64(se->vlag * old_weight, weight);
+	}
+
 #ifdef CONFIG_SMP
 	do {
 		u32 divider = get_pelt_divider(&se->avg);
@@ -4778,49 +4797,86 @@  static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 lag = 0;
 
-	/* sleeps up to a single latency don't count. */
-	if (!initial) {
-		unsigned long thresh;
+	/*
+	 * Due to how V is constructed as the weighted average of entities,
+	 * adding tasks with positive lag, or removing tasks with negative lag
+	 * will move 'time' backwards, this can screw around with the lag of
+	 * other tasks.
+	 *
+	 * EEVDF: placement strategy #1 / #2
+	 */
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+		struct sched_entity *curr = cfs_rq->curr;
+		unsigned long load;
 
-		if (se_is_idle(se))
-			thresh = sysctl_sched_min_granularity;
-		else
-			thresh = sysctl_sched_latency;
+		lag = se->vlag;
 
 		/*
-		 * Halve their sleep time's effect, to allow
-		 * for a gentler effect of sleepers:
+		 * If we want to place a task and preserve lag, we have to
+		 * consider the effect of the new entity on the weighted
+		 * average and compensate for this, otherwise lag can quickly
+		 * evaporate:
+		 *
+		 * l_i = V - v_i <=> v_i = V - l_i
+		 *
+		 * V = v_avg = W*v_avg / W
+		 *
+		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
+		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
+		 *    = v_avg + w_i*l_i/(W + w_i)
+		 *
+		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
+		 *      = l_i - w_i*l_i/(W + w_i)
+		 *
+		 * l_i = (W + w_i) * l_i' / W
 		 */
-		if (sched_feat(GENTLE_FAIR_SLEEPERS))
-			thresh >>= 1;
+		load = cfs_rq->avg_load;
+		if (curr && curr->on_rq)
+			load += curr->load.weight;
+
+		lag *= load + se->load.weight;
+		if (WARN_ON_ONCE(!load))
+			load = 1;
+		lag = div_s64(lag, load);
 
-		vruntime -= thresh;
+		vruntime -= lag;
 	}
 
-	/*
-	 * Pull vruntime of the entity being placed to the base level of
-	 * cfs_rq, to prevent boosting it if placed backwards.
-	 * However, min_vruntime can advance much faster than real time, with
-	 * the extreme being when an entity with the minimal weight always runs
-	 * on the cfs_rq. If the waking entity slept for a long time, its
-	 * vruntime difference from min_vruntime may overflow s64 and their
-	 * comparison may get inversed, so ignore the entity's original
-	 * vruntime in that case.
-	 * The maximal vruntime speedup is given by the ratio of normal to
-	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
-	 * When placing a migrated waking entity, its exec_start has been set
-	 * from a different rq. In order to take into account a possible
-	 * divergence between new and prev rq's clocks task because of irq and
-	 * stolen time, we take an additional margin.
-	 * So, cutting off on the sleep time of
-	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
-	 * should be safe.
-	 */
-	if (entity_is_long_sleeper(se))
-		se->vruntime = vruntime;
-	else
-		se->vruntime = max_vruntime(se->vruntime, vruntime);
+	if (sched_feat(FAIR_SLEEPERS)) {
+
+		/* sleeps up to a single latency don't count. */
+		if (!initial) {
+			unsigned long thresh;
+
+			if (se_is_idle(se))
+				thresh = sysctl_sched_min_granularity;
+			else
+				thresh = sysctl_sched_latency;
+
+			/*
+			 * Halve their sleep time's effect, to allow
+			 * for a gentler effect of sleepers:
+			 */
+			if (sched_feat(GENTLE_FAIR_SLEEPERS))
+				thresh >>= 1;
+
+			vruntime -= thresh;
+		}
+
+		/*
+		 * Pull vruntime of the entity being placed to the base level of
+		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
+		 * slept for a long time, don't even try to compare its vruntime with
+		 * the base as it may be too far off and the comparison may get
+		 * inversed due to s64 overflow.
+		 */
+		if (!entity_is_long_sleeper(se))
+			vruntime = max_vruntime(se->vruntime, vruntime);
+	}
+
+	se->vruntime = vruntime;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4991,6 +5047,9 @@  dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	clear_buddies(cfs_rq, se);
 
+	if (flags & DEQUEUE_SLEEP)
+		update_entity_lag(cfs_rq, se);
+
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,12 +1,20 @@ 
 /* SPDX-License-Identifier: GPL-2.0 */
+
 /*
  * Only give sleepers 50% of their service deficit. This allows
  * them to run sooner, but does not allow tons of sleepers to
  * rip the spread apart.
  */
+SCHED_FEAT(FAIR_SLEEPERS, false)
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
+ * Using the avg_vruntime, do the right thing and preserve lag across
+ * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
+ */
+SCHED_FEAT(PLACE_LAG, true)
+
+/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.