[14/17] sched/eevdf: Better handle mixed slice length

Message ID 20230328110354.562078801@infradead.org
State New
Series sched: EEVDF using latency-nice

Commit Message

Peter Zijlstra March 28, 2023, 9:26 a.m. UTC
  In the case where (due to latency-nice) there are different request
sizes in the tree, the smaller requests tend to be dominated by the
larger. Also note how the EEVDF lag limits are based on r_max.

Therefore; add a heuristic that for the mixed request size case, moves
smaller requests to placement strategy #2 which ensures they're
immediately eligible and due to their smaller (virtual) deadline
will cause preemption.

NOTE: this relies on update_entity_lag() to impose lag limits above
a single slice.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   14 ++++++++++++++
 kernel/sched/features.h |    1 +
 kernel/sched/sched.h    |    1 +
 3 files changed, 16 insertions(+)
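
As a rough standalone illustration (not part of the patch): avg_slice is the
weight-scaled sum of slices and avg_load the sum of weights, so the new
PLACE_FUDGE test "cfs_rq->avg_slice > se->slice * cfs_rq->avg_load" is a
division-free way of asking whether an entity's slice is below the
load-weighted average slice. The slice and weight values below are made up:

/*
 * Standalone sketch (not kernel code): shows how the PLACE_FUDGE condition
 * picks out entities whose slice is below the load-weighted average slice
 * without needing a division.
 */
#include <stdint.h>
#include <stdio.h>

struct ent { uint64_t slice; uint64_t weight; };

int main(void)
{
	/* hypothetical queue: one default-slice entity, one short-slice one */
	struct ent q[] = {
		{ .slice = 3000000ULL, .weight = 1024 },	/* 3ms    */
		{ .slice =  750000ULL, .weight = 1024 },	/* 0.75ms */
	};
	uint64_t avg_slice = 0, avg_load = 0;
	int i;

	for (i = 0; i < 2; i++) {	/* mirrors avg_vruntime_add() */
		avg_slice += q[i].slice * q[i].weight;
		avg_load  += q[i].weight;
	}

	for (i = 0; i < 2; i++)
		printf("entity %d: shorter than average slice? %s\n", i,
		       avg_slice > q[i].slice * avg_load ? "yes" : "no");

	return 0;
}

Here entity 1, the short-slice task, trips the condition and would be placed
with strategy #2; entity 0 would not.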
  

Comments

Vincent Guittot March 31, 2023, 3:26 p.m. UTC | #1
On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
>
> In the case where (due to latency-nice) there are different request
> sizes in the tree, the smaller requests tend to be dominated by the
> larger. Also note how the EEVDF lag limits are based on r_max.
>
> Therefore; add a heuristic that for the mixed request size case, moves
> smaller requests to placement strategy #2 which ensures they're
> immediately eligible and due to their smaller (virtual) deadline
> will cause preemption.
>
> NOTE: this relies on update_entity_lag() to impose lag limits above
> a single slice.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |   14 ++++++++++++++
>  kernel/sched/features.h |    1 +
>  kernel/sched/sched.h    |    1 +
>  3 files changed, 16 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
>         s64 key = entity_key(cfs_rq, se);
>
>         cfs_rq->avg_vruntime += key * weight;
> +       cfs_rq->avg_slice += se->slice * weight;
>         cfs_rq->avg_load += weight;
>  }
>
> @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
>         s64 key = entity_key(cfs_rq, se);
>
>         cfs_rq->avg_vruntime -= key * weight;
> +       cfs_rq->avg_slice -= se->slice * weight;
>         cfs_rq->avg_load -= weight;
>  }
>
> @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
>                 lag = se->vlag;
>
>                 /*
> +                * For latency sensitive tasks; those that have a shorter than
> +                * average slice and do not fully consume the slice, transition
> +                * to EEVDF placement strategy #2.
> +                */
> +               if (sched_feat(PLACE_FUDGE) &&
> +                   cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> +                       lag += vslice;
> +                       if (lag > 0)
> +                               lag = 0;

By using different lag policies for tasks, doesn't this create
unfairness between tasks ?

I wanted to stress this situation with a simple use case but it seems
that even without changing the slice, there is a fairness problem:

Task A always runs
Task B loops on: running 1ms then sleeping 1ms
default nice and latency nice prio for both
each task should get around 50% of the time.

The fairness is ok with tip/sched/core
but with eevdf, Task B only gets around 30%

I haven't identified the problem so far


> +               }
> +
> +               /*
>                  * If we want to place a task and preserve lag, we have to
>                  * consider the effect of the new entity on the weighted
>                  * average and compensate for this, otherwise lag can quickly
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -5,6 +5,7 @@
>   * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
>   */
>  SCHED_FEAT(PLACE_LAG, true)
> +SCHED_FEAT(PLACE_FUDGE, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>
>  /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -559,6 +559,7 @@ struct cfs_rq {
>         unsigned int            idle_h_nr_running; /* SCHED_IDLE */
>
>         s64                     avg_vruntime;
> +       u64                     avg_slice;
>         u64                     avg_load;
>
>         u64                     exec_clock;
>
>
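
For reference, a minimal userspace sketch (not from the thread) of the test
case Vincent describes, assuming both tasks are pinned to one CPU (e.g. with
taskset) and CPU shares are read from top or pidstat; names and values here
are illustrative:

/*
 * Rough reproduction sketch: task A spins forever, task B alternates 1ms of
 * spinning with 1ms of sleep.  Pin both to one CPU, e.g.
 * "taskset -c 0 ./fairness-test", and compare the CPU shares of the two
 * processes with top or pidstat.
 */
#include <time.h>
#include <unistd.h>

static void spin_ns(long ns)
{
	struct timespec t0, t;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do {
		clock_gettime(CLOCK_MONOTONIC, &t);
	} while ((t.tv_sec - t0.tv_sec) * 1000000000L +
		 (t.tv_nsec - t0.tv_nsec) < ns);
}

int main(void)
{
	if (fork() == 0) {		/* task B: run 1ms, sleep 1ms */
		for (;;) {
			spin_ns(1000000L);
			usleep(1000);
		}
	}
	for (;;)			/* task A: always runs */
		spin_ns(1000000L);
}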
  
Mike Galbraith April 2, 2023, 2:40 a.m. UTC | #2
On Sun, 2023-04-02 at 07:23 +0800, Hillf Danton wrote:
> On 31 Mar 2023 17:26:51 +0200 Vincent Guittot <vincent.guittot@linaro.org>
> >
> > I wanted to stress this situation with a simple use case but it seems
> > that even without changing the slice, there is a fairness problem:
> >
> > Task A always runs
> > Task B loops on: running 1ms then sleeping 1ms
> > default nice and latency nice prio for both
> > each task should get around 50% of the time.
> >
> > The fairness is ok with tip/sched/core
> > but with eevdf, Task B only gets around 30%
>
> Convincing evidence for glitch in wakeup preempt.

If you turn on PLACE_BONUS, it'll mimic FAIR_SLEEPERS.. but if you then
do some testing, you'll probably turn it right back off.

The 50/50 split in current code isn't really any more fair, as soon as
you leave the tiny bubble of fairness, it's not the least bit fair.
Nor is that tiny bubble all rainbows and unicorns, it brought with it
benchmark wins and losses, like everything that changes more than
comments, its price being service latency variance.

The short term split doesn't really mean all that much; some things
will like the current fair-bubble better, some will prefer the eevdf
virtual deadline math and its less spiky service.  We'll see.

I'm kinda hoping eevdf works out, FAIR_SLEEPERS is quite annoying to
squabble with.

	-Mike
  
Peter Zijlstra April 4, 2023, 9:29 a.m. UTC | #3
On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > In the case where (due to latency-nice) there are different request
> > sizes in the tree, the smaller requests tend to be dominated by the
> > larger. Also note how the EEVDF lag limits are based on r_max.
> >
> > Therefore; add a heuristic that for the mixed request size case, moves
> > smaller requests to placement strategy #2 which ensures they're
> > immediately eligible and due to their smaller (virtual) deadline
> > will cause preemption.
> >
> > NOTE: this relies on update_entity_lag() to impose lag limits above
> > a single slice.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/fair.c     |   14 ++++++++++++++
> >  kernel/sched/features.h |    1 +
> >  kernel/sched/sched.h    |    1 +
> >  3 files changed, 16 insertions(+)
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
> >         s64 key = entity_key(cfs_rq, se);
> >
> >         cfs_rq->avg_vruntime += key * weight;
> > +       cfs_rq->avg_slice += se->slice * weight;
> >         cfs_rq->avg_load += weight;
> >  }
> >
> > @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
> >         s64 key = entity_key(cfs_rq, se);
> >
> >         cfs_rq->avg_vruntime -= key * weight;
> > +       cfs_rq->avg_slice -= se->slice * weight;
> >         cfs_rq->avg_load -= weight;
> >  }
> >
> > @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> >                 lag = se->vlag;
> >
> >                 /*
> > +                * For latency sensitive tasks; those that have a shorter than
> > +                * average slice and do not fully consume the slice, transition
> > +                * to EEVDF placement strategy #2.
> > +                */
> > +               if (sched_feat(PLACE_FUDGE) &&
> > +                   cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> > +                       lag += vslice;
> > +                       if (lag > 0)
> > +                               lag = 0;
> 
> By using different lag policies for tasks, doesn't this create
> unfairness between tasks ?

Possibly, I've just not managed to trigger it yet -- if it is an issue
it can be fixed by ensuring we don't place the entity before its
previous vruntime just like the sleeper hack later on.

> I wanted to stress this situation with a simple use case but it seems
> that even without changing the slice, there is a fairness problem:
> 
> Task A always runs
> Task B loops on: running 1ms then sleeping 1ms
> default nice and latency nice prio for both
> each task should get around 50% of the time.
> 
> The fairness is ok with tip/sched/core
> but with eevdf, Task B only gets around 30%
> 
> I haven't identified the problem so far

Heh, this is actually the correct behaviour. If you have a u=1 and a
u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.

CFS has this sleeper bonus hack that makes it 50% vs 50% but that is
strictly not correct -- although it does help a number of weird
benchmarks.
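
(Working out the arithmetic: with equal nice values the demands are
u=1 : u=0.5, a 2:1 ratio, so the contended CPU splits as 1/1.5 ~ 67% and
0.5/1.5 ~ 33%, which is roughly the ~30% observed for Task B above.)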
  
Joel Fernandes April 4, 2023, 1:50 p.m. UTC | #4
On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> > On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > In the case where (due to latency-nice) there are different request
> > > sizes in the tree, the smaller requests tend to be dominated by the
> > > larger. Also note how the EEVDF lag limits are based on r_max.
> > >
> > > Therefore; add a heuristic that for the mixed request size case, moves
> > > smaller requests to placement strategy #2 which ensures they're
> > > immediately eligible and due to their smaller (virtual) deadline
> > > will cause preemption.
> > >
> > > NOTE: this relies on update_entity_lag() to impose lag limits above
> > > a single slice.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  kernel/sched/fair.c     |   14 ++++++++++++++
> > >  kernel/sched/features.h |    1 +
> > >  kernel/sched/sched.h    |    1 +
> > >  3 files changed, 16 insertions(+)
> > >
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
> > >         s64 key = entity_key(cfs_rq, se);
> > >
> > >         cfs_rq->avg_vruntime += key * weight;
> > > +       cfs_rq->avg_slice += se->slice * weight;
> > >         cfs_rq->avg_load += weight;
> > >  }
> > >
> > > @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
> > >         s64 key = entity_key(cfs_rq, se);
> > >
> > >         cfs_rq->avg_vruntime -= key * weight;
> > > +       cfs_rq->avg_slice -= se->slice * weight;
> > >         cfs_rq->avg_load -= weight;
> > >  }
> > >
> > > @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> > >                 lag = se->vlag;
> > >
> > >                 /*
> > > +                * For latency sensitive tasks; those that have a shorter than
> > > +                * average slice and do not fully consume the slice, transition
> > > +                * to EEVDF placement strategy #2.
> > > +                */
> > > +               if (sched_feat(PLACE_FUDGE) &&
> > > +                   cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> > > +                       lag += vslice;
> > > +                       if (lag > 0)
> > > +                               lag = 0;
> > 
> > By using different lag policies for tasks, doesn't this create
> > unfairness between tasks ?
> 
> Possibly, I've just not managed to trigger it yet -- if it is an issue
> it can be fixed by ensuring we don't place the entity before its
> previous vruntime just like the sleeper hack later on.
> 
> > I wanted to stress this situation with a simple use case but it seems
> > that even without changing the slice, there is a fairness problem:
> > 
> > Task A always runs
> > Task B loops on: running 1ms then sleeping 1ms
> > default nice and latency nice prio for both
> > each task should get around 50% of the time.
> > 
> > The fairness is ok with tip/sched/core
> > but with eevdf, Task B only gets around 30%
> > 
> > I haven't identified the problem so far
> 
> Heh, this is actually the correct behaviour. If you have a u=1 and a
> u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.

Splitting like that sounds like starvation of the sleeper to me. If something
sleeps a lot, it will get even less CPU time on an average than it would if
there was no contention from the u=1 task.

And also CGroups will be even more weird than it already is in such a world,
2 different containers will not get CPU time distributed properly- say if
tasks in one container sleep a lot and tasks in another container are CPU
bound.

thanks,

 - Joel
  
Mike Galbraith April 5, 2023, 5:41 a.m. UTC | #5
On Tue, 2023-04-04 at 13:50 +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> > On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> >
> > >
> > > Task A always runs
> > > Task B loops on: running 1ms then sleeping 1ms
> > > default nice and latency nice prio for both
> > > each task should get around 50% of the time.
> > >
> > > The fairness is ok with tip/sched/core
> > > but with eevdf, Task B only gets around 30%
> > >
> > > I haven't identified the problem so far
> >
> > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>
> Splitting like that sounds like starvation of the sleeper to me. If something
> sleeps a lot, it will get even less CPU time on an average than it would if
> there was no contention from the u=1 task.
>
> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly- say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Lets take a quick peek at some group distribution numbers.

start tbench and massive_intr in their own VT (autogroup), then in
another, sleep 300;killall top massive_intr tbench_srv tbench.

(caveman method because perf's refusing to handle fast switchers well
for me.. top's plenty good enough for this anyway, and less intrusive)

massive_intr runs 8ms, sleeps 1, wants 88.8% of 8 runqueues.  tbench
buddy pairs want only a tad more CPU, 100% between them, but switch
orders of magnitude more frequently.  Very dissimilar breeds of hog.

master.today      accrued   of 2400s vs master
team massive_intr 1120.50s     .466      1.000
team tbench       1256.13s     .523      1.000

+eevdf
team massive_intr 1071.94s     .446       .956
team tbench       1301.56s     .542      1.036

There is of course a distribution delta.. but was it meaningful?

Between mostly idle but kinda noisy GUI perturbing things, and more
importantly, neither load having been manually distributed and pinned,
both schedulers came out pretty good, and both a tad shy of.. perfect
is the enemy of good.

Raw numbers below in case my mouse mucked up feeding of numbers to bc
(blame the inanimate, they can't do a damn thing about it).

6.3.0.g148341f-master
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 5641 root      20   0    2564    640    640 R 50.33 0.004   2:17.19 5 massive_intr
 5636 root      20   0    2564    640    640 S 49.00 0.004   2:20.05 5 massive_intr
 5647 root      20   0    2564    640    640 R 48.67 0.004   2:21.85 6 massive_intr
 5640 root      20   0    2564    640    640 R 48.00 0.004   2:21.13 6 massive_intr
 5645 root      20   0    2564    640    640 R 47.67 0.004   2:18.25 5 massive_intr
 5638 root      20   0    2564    640    640 R 46.67 0.004   2:22.39 2 massive_intr
 5634 root      20   0    2564    640    640 R 45.00 0.004   2:18.93 4 massive_intr
 5643 root      20   0    2564    640    640 R 44.00 0.004   2:20.71 7 massive_intr
 5639 root      20   0   23468   1664   1536 R 29.00 0.010   1:22.31 3 tbench
 5644 root      20   0   23468   1792   1664 R 28.67 0.011   1:22.32 3 tbench
 5637 root      20   0   23468   1664   1536 S 28.00 0.010   1:22.75 5 tbench
 5631 root      20   0   23468   1792   1664 R 27.00 0.011   1:21.47 4 tbench
 5632 root      20   0   23468   1536   1408 R 27.00 0.010   1:21.78 0 tbench
 5653 root      20   0    6748    896    768 S 26.67 0.006   1:15.26 3 tbench_srv
 5633 root      20   0   23468   1792   1664 R 26.33 0.011   1:22.53 0 tbench
 5635 root      20   0   23468   1920   1792 R 26.33 0.012   1:20.72 7 tbench
 5642 root      20   0   23468   1920   1792 R 26.00 0.012   1:21.73 2 tbench
 5650 root      20   0    6748    768    768 R 25.67 0.005   1:15.71 1 tbench_srv
 5652 root      20   0    6748    768    768 S 25.67 0.005   1:15.71 3 tbench_srv
 5646 root      20   0    6748    768    768 S 25.33 0.005   1:14.97 4 tbench_srv
 5648 root      20   0    6748    896    768 S 25.00 0.006   1:14.66 0 tbench_srv
 5651 root      20   0    6748    896    768 S 24.67 0.006   1:14.79 2 tbench_srv
 5654 root      20   0    6748    768    768 R 24.33 0.005   1:15.47 0 tbench_srv
 5649 root      20   0    6748    768    768 R 24.00 0.005   1:13.95 7 tbench_srv

6.3.0.g148341f-master-eevdf
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
10561 root      20   0    2564    768    640 R 49.83 0.005   2:14.86 3 massive_intr
10562 root      20   0    2564    768    640 R 49.50 0.005   2:14.00 3 massive_intr
10564 root      20   0    2564    896    768 R 49.50 0.006   2:14.11 6 massive_intr
10559 root      20   0    2564    768    640 R 47.84 0.005   2:14.03 2 massive_intr
10560 root      20   0    2564    768    640 R 45.51 0.005   2:13.92 7 massive_intr
10557 root      20   0    2564    896    768 R 44.85 0.006   2:13.59 7 massive_intr
10563 root      20   0    2564    896    768 R 44.85 0.006   2:13.53 6 massive_intr
10558 root      20   0    2564    768    640 R 43.52 0.005   2:13.90 2 massive_intr
10577 root      20   0   23468   1920   1792 R 35.22 0.012   1:37.06 0 tbench
10574 root      20   0   23468   1920   1792 R 32.23 0.012   1:32.89 4 tbench
10580 root      20   0   23468   1920   1792 R 29.57 0.012   1:34.95 0 tbench
10575 root      20   0   23468   1792   1664 R 29.24 0.011   1:31.66 4 tbench
10576 root      20   0   23468   1792   1664 S 28.57 0.011   1:34.55 5 tbench
10573 root      20   0   23468   1792   1664 R 28.24 0.011   1:33.17 5 tbench
10578 root      20   0   23468   1920   1792 S 28.24 0.012   1:33.97 1 tbench
10579 root      20   0   23468   1920   1792 R 28.24 0.012   1:36.09 1 tbench
10587 root      20   0    6748    768    640 S 26.91 0.005   1:09.45 0 tbench_srv
10582 root      20   0    6748    768    640 R 24.25 0.005   1:08.19 4 tbench_srv
10588 root      20   0    6748    640    640 R 22.59 0.004   1:09.15 0 tbench_srv
10583 root      20   0    6748    640    640 R 21.93 0.004   1:07.93 4 tbench_srv
10586 root      20   0    6748    640    640 S 21.59 0.004   1:07.92 1 tbench_srv
10581 root      20   0    6748    640    640 S 21.26 0.004   1:07.08 5 tbench_srv
10585 root      20   0    6748    640    640 R 21.26 0.004   1:08.89 5 tbench_srv
10584 root      20   0    6748    768    640 S 20.93 0.005   1:08.61 1 tbench_srv
  
Peter Zijlstra April 5, 2023, 8:35 a.m. UTC | #6
On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:

> > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> 
> Splitting like that sounds like starvation of the sleeper to me. If something
> sleeps a lot, it will get even less CPU time on an average than it would if
> there was no contention from the u=1 task.

No, sleeping, per definition, means you're not contending for CPU. What
CFS does, giving them a little boost, is strictly yuck and messes with
latency -- because suddenly you have a task that said it wasn't
competing appear as if it were, but you didn't run it (how could you, it
wasn't there to run) -- but it still needs to catch up.

The reason it does that, is mostly because at the time we didn't want to
do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
computing was still a thing :/ So hacks happened.

That said; I'm starting to regret not pushing the EEVDF thing harder
back in 2010 when I first wrote it :/

> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly- say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Cgroups are an abomination anyway :-) /me runs like hell. But no, I
don't actually expect too much trouble there.

Or rather, as per the above, time distribution is now more proper than
it was :-)
  
Joel Fernandes April 5, 2023, 8:05 p.m. UTC | #7
On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> > On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
>
> > > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> >
> > Splitting like that sounds like starvation of the sleeper to me. If something
> > sleeps a lot, it will get even less CPU time on an average than it would if
> > there was no contention from the u=1 task.
>
> No, sleeping, per definition, means you're not contending for CPU. What
> CFS does, giving them a little boost, is strictly yuck and messes with
> latency -- because suddenly you have a task that said it wasn't
> competing appear as if it were, but you didn't run it (how could you, it
> wasn't there to run) -- but it still needs to catch up.
>
> The reason it does that, is mostly because at the time we didn't want to
> do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
> computing was still a thing :/ So hacks happened.

Also you have the whole "boost tasks" that sleep a lot with CFS right?
 Like a task handling user input sleeps a lot, but when it wakes up,
it gets higher dynamic priority as its vruntime did not advance. I
guess EEVDF also gets you the same thing but still messes with the CPU
usage?

> That said; I'm starting to regret not pushing the EEVDF thing harder
> back in 2010 when I first wrote it :/
>
> > And also CGroups will be even more weird than it already is in such a world,
> > 2 different containers will not get CPU time distributed properly- say if
> > tasks in one container sleep a lot and tasks in another container are CPU
> > bound.
>
> Cgroups are an abomination anyway :-) /me runs like hell. But no, I
> don't actually expect too much trouble there.

So, with 2 equally weighted containers, if one has a task that sleeps
50% of the time, and another has a 100% task, then the sleeper will
only run 33% of the time? I can see people running containers having a
problem with that (a customer running one container gets less CPU than
the other.). Sorry if I missed something.

But yeah I do find the whole EEVDF idea interesting but I admit I have
to research it more.

 - Joel
  
Phil Auld April 14, 2023, 11:18 a.m. UTC | #8
On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:
> On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> > > On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> >
> > > > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > > > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> > >
> > > Splitting like that sounds like starvation of the sleeper to me. If something
> > > sleeps a lot, it will get even less CPU time on an average than it would if
> > > there was no contention from the u=1 task.
> >
> > No, sleeping, per definition, means you're not contending for CPU. What
> > CFS does, giving them a little boost, is strictly yuck and messes with
> > latency -- because suddenly you have a task that said it wasn't
> > competing appear as if it were, but you didn't run it (how could you, it
> > wasn't there to run) -- but it still needs to catch up.
> >
> > The reason it does that, is mostly because at the time we didn't want to
> > do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
> > computing was still a thing :/ So hacks happened.
> 
> Also you have the whole "boost tasks" that sleep a lot with CFS right?
>  Like a task handling user input sleeps a lot, but when it wakes up,
> it gets higher dynamic priority as its vruntime did not advance. I
> guess EEVDF also gets you the same thing but still messes with the CPU
> usage?
> 
> > That said; I'm starting to regret not pushing the EEVDF thing harder
> > back in 2010 when I first wrote it :/
> >
> > > And also CGroups will be even more weird than it already is in such a world,
> > > 2 different containers will not get CPU time distributed properly- say if
> > > tasks in one container sleep a lot and tasks in another container are CPU
> > > bound.
> >
> > Cgroups are an abomination anyway :-) /me runs like hell. But no, I
> > don't actually expect too much trouble there.
> 
> So, with 2 equally weighted containers, if one has a task that sleeps
> 50% of the time, and another has a 100% task, then the sleeper will
> only run 33% of the time? I can see people running containers having a
> problem with that (a customer running one container gets less CPU than
> the other.). Sorry if I missed something.
>

But the 50% sleeper is _asking_ for less CPU.  Doing 50% for each would
mean that when the sleeper task was awake it always ran, always won, to
the exclusion of any one else. (Assuming 1 CPU...)

Cheers,
Phil

> But yeah I do find the whole EEVDF idea interesting but I admit I have
> to research it more.
> 
>  - Joel
> 

--
  
Joel Fernandes April 16, 2023, 5:10 a.m. UTC | #9
> On Apr 14, 2023, at 1:18 PM, Phil Auld <pauld@redhat.com> wrote:
> 
> On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:
>>> On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
>>>>> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
>>> 
>>>>> Heh, this is actually the correct behaviour. If you have a u=1 and a
>>>>> u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>>>> 
>>>> Splitting like that sounds like starvation of the sleeper to me. If something
>>>> sleeps a lot, it will get even less CPU time on an average than it would if
>>>> there was no contention from the u=1 task.
>>> 
>>> No, sleeping, per definition, means you're not contending for CPU. What
>>> CFS does, giving them a little boost, is strictly yuck and messes with
>>> latency -- because suddenly you have a task that said it wasn't
>>> competing appear as if it were, but you didn't run it (how could you, it
>>> wasn't there to run) -- but it still needs to catch up.
>>> 
>>> The reason it does that, is mostly because at the time we didn't want to
>>> do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
>>> computing was still a thing :/ So hacks happened.
>> 
>> Also you have the whole "boost tasks" that sleep a lot with CFS right?
>> Like a task handling user input sleeps a lot, but when it wakes up,
>> it gets higher dynamic priority as its vruntime did not advance. I
>> guess EEVDF also gets you the same thing but still messes with the CPU
>> usage?
>> 
>>> That said; I'm starting to regret not pushing the EEVDF thing harder
>>> back in 2010 when I first wrote it :/
>>> 
>>>> And also CGroups will be even more weird than it already is in such a world,
>>>> 2 different containers will not get CPU time distributed properly- say if
>>>> tasks in one container sleep a lot and tasks in another container are CPU
>>>> bound.
>>> 
>>> Cgroups are an abomination anyway :-) /me runs like hell. But no, I
>>> don't actually expect too much trouble there.
>> 
>> So, with 2 equally weighted containers, if one has a task that sleeps
>> 50% of the time, and another has a 100% task, then the sleeper will
>> only run 33% of the time? I can see people running containers having a
>> problem with that (a customer running one container gets less CPU than
>> the other.). Sorry if I missed something.
>> 
> 
> But the 50% sleeper is _asking_ for less CPU.  Doing 50% for each would
> mean that when the sleeper task was awake it always ran, always won, to
> the exclusion of any one else. (Assuming 1 CPU...)
> 

It sounds like you are saying that if the task busy looped instead of
sleeping, it would get more CPU during the time it is not busy looping
but doing some real work. That sounds like encouraging abuse to get
more perf.

But again, I have not looked too closely at EEVDF or Peter's patches.
I was just going by Vincent's test and was cautioning not to break
users who depend on CFS shares.

Cheers,

- Joel


> Cheers,
> Phil
> 
>> But yeah I do find the whole EEVDF idea interesting but I admit I have
>> to research it more.
>> 
>> - Joel
>> 
> 
> -- 
>
  

Patch

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -616,6 +616,7 @@  avg_vruntime_add(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_slice += se->slice * weight;
 	cfs_rq->avg_load += weight;
 }
 
@@ -626,6 +627,7 @@  avg_vruntime_sub(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_slice -= se->slice * weight;
 	cfs_rq->avg_load -= weight;
 }
 
@@ -4832,6 +4834,18 @@  place_entity(struct cfs_rq *cfs_rq, stru
 		lag = se->vlag;
 
 		/*
+		 * For latency sensitive tasks; those that have a shorter than
+		 * average slice and do not fully consume the slice, transition
+		 * to EEVDF placement strategy #2.
+		 */
+		if (sched_feat(PLACE_FUDGE) &&
+		    cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
+			lag += vslice;
+			if (lag > 0)
+				lag = 0;
+		}
+
+		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
 		 * average and compensate for this, otherwise lag can quickly
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,6 +5,7 @@ 
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_FUDGE, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -559,6 +559,7 @@  struct cfs_rq {
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
 	s64			avg_vruntime;
+	u64			avg_slice;
 	u64			avg_load;
 
 	u64			exec_clock;