Message ID | 20230328110354.562078801@infradead.org |
---|---|
State | New |
Series | sched: EEVDF using latency-nice |
Commit Message
Peter Zijlstra
March 28, 2023, 9:26 a.m. UTC
In the case where (due to latency-nice) there are different request
sizes in the tree, the smaller requests tend to be dominated by the
larger. Also note how the EEVDF lag limits are based on r_max.
Therefore, add a heuristic that, for the mixed request size case, moves
smaller requests to placement strategy #2, which ensures they're
immediately eligible and, due to their smaller (virtual) deadline,
will cause preemption.
NOTE: this relies on update_entity_lag() to impose lag limits above
a single slice.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 14 ++++++++++++++
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 16 insertions(+)
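
As an aside, to illustrate the deadline mechanics the heuristic leans on: in EEVDF an entity's virtual deadline is its eligible time plus its request scaled by weight, so a smaller request yields an earlier deadline. A minimal standalone sketch (plain C with toy floating-point values; the kernel uses fixed-point u64 arithmetic):

	#include <stdio.h>

	/*
	 * Toy EEVDF virtual deadline: vd = ve + r/w, where r is the
	 * request (slice) size and w the weight.  Illustrative only.
	 */
	static double vdeadline(double ve, double slice, double weight)
	{
		return ve + slice / weight;
	}

	int main(void)
	{
		/* Two eligible entities, equal weight, mixed slice sizes. */
		printf("r=3ms: vd=%.1f\n", vdeadline(0.0, 3.0, 1.0));
		printf("r=1ms: vd=%.1f\n", vdeadline(0.0, 1.0, 1.0));
		/*
		 * EEVDF runs the eligible entity with the earliest virtual
		 * deadline, so once strategy #2 placement makes the 1ms
		 * request eligible, it preempts the larger request.
		 */
		return 0;
	}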
Comments
On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
>
> In the case where (due to latency-nice) there are different request
> sizes in the tree, the smaller requests tend to be dominated by the
> larger. Also note how the EEVDF lag limits are based on r_max.
>
> Therefore, add a heuristic that, for the mixed request size case, moves
> smaller requests to placement strategy #2, which ensures they're
> immediately eligible and, due to their smaller (virtual) deadline,
> will cause preemption.
>
> NOTE: this relies on update_entity_lag() to impose lag limits above
> a single slice.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---

[...]

> @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> 		lag = se->vlag;
>
> 		/*
> +		 * For latency sensitive tasks; those that have a shorter than
> +		 * average slice and do not fully consume the slice, transition
> +		 * to EEVDF placement strategy #2.
> +		 */
> +		if (sched_feat(PLACE_FUDGE) &&
> +		    cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> +			lag += vslice;
> +			if (lag > 0)
> +				lag = 0;

By using different lag policies for tasks, doesn't this create
unfairness between tasks ?

I wanted to stress this situation with a simple use case but it seems
that even without changing the slice, there is a fairness problem:

Task A always runs
Task B loops on : running 1ms then sleeping 1ms
default nice and latency nice prio for both
each task should get around 50% of the time.

The fairness is ok with tip/sched/core
but with eevdf, Task B only gets around 30%

I haven't identified the problem so far

> +		}

[...]
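For reference, Vincent's Task B pattern can be approximated with a loop along these lines (an illustrative sketch, not his actual test code; it spins on wall-clock time, which is close enough to 1ms of CPU while the task mostly has the CPU to itself):

	#include <time.h>
	#include <stdint.h>

	static uint64_t now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
	}

	int main(void)
	{
		const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };

		for (;;) {
			uint64_t end = now_ns() + 1000000;

			while (now_ns() < end)	/* burn ~1ms */
				;
			nanosleep(&one_ms, NULL);	/* sleep 1ms */
		}
	}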
On Sun, 2023-04-02 at 07:23 +0800, Hillf Danton wrote:
> On 31 Mar 2023 17:26:51 +0200 Vincent Guittot <vincent.guittot@linaro.org>
> >
> > I wanted to stress this situation with a simple use case but it seems
> > that even without changing the slice, there is a fairness problem:
> >
> > Task A always runs
> > Task B loops on : running 1ms then sleeping 1ms
> > default nice and latency nice prio for both
> > each task should get around 50% of the time.
> >
> > The fairness is ok with tip/sched/core
> > but with eevdf, Task B only gets around 30%
>
> Convincing evidence for glitch in wakeup preempt.

If you turn on PLACE_BONUS, it'll mimic FAIR_SLEEPERS.. but if you then
do some testing, you'll probably turn it right back off.

The 50/50 split in current code isn't really any more fair, as soon as
you leave the tiny bubble of fairness, it's not the least bit fair. Nor
is that tiny bubble all rainbows and unicorns, it brought with it
benchmark wins and losses, like everything that changes more than
comments, its price being service latency variance.

The short term split doesn't really mean all that much, some things
will like the current fair-bubble better, some eevdf virtual deadline
math and its less spiky service. We'll see. I'm kinda hoping eevdf
works out, FAIR_SLEEPERS is quite annoying to squabble with.

	-Mike
On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

> > +		if (sched_feat(PLACE_FUDGE) &&
> > +		    cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> > +			lag += vslice;
> > +			if (lag > 0)
> > +				lag = 0;
>
> By using different lag policies for tasks, doesn't this create
> unfairness between tasks ?

Possibly, I've just not managed to trigger it yet -- if it is an issue
it can be fixed by ensuring we don't place the entity before its
previous vruntime just like the sleeper hack later on.

> I wanted to stress this situation with a simple use case but it seems
> that even without changing the slice, there is a fairness problem:
>
> Task A always runs
> Task B loops on : running 1ms then sleeping 1ms
> default nice and latency nice prio for both
> each task should get around 50% of the time.
>
> The fairness is ok with tip/sched/core
> but with eevdf, Task B only gets around 30%
>
> I haven't identified the problem so far

Heh, this is actually the correct behaviour. If you have a u=1 and a
u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.

CFS has this sleeper bonus hack that makes it 50% vs 50% but that is
strictly not correct -- although it does help a number of weird
benchmarks.
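The 2:1 figure falls directly out of the run/sleep pattern; a back-of-the-envelope check (my arithmetic, not from the thread):

	#include <stdio.h>

	int main(void)
	{
		/*
		 * B needs 1ms of CPU, then sleeps 1ms.  While B is runnable
		 * it shares the CPU 50/50 with the always-running A, so its
		 * 1ms of work takes 2ms of wall time.  One full cycle is
		 * thus 2ms (runnable) + 1ms (asleep) = 3ms.
		 */
		double cycle = 3.0;

		printf("B: %2.0f%%\n", 100.0 * 1.0 / cycle);	/* ~33% */
		printf("A: %2.0f%%\n", 100.0 * 2.0 / cycle);	/* ~67% */
		return 0;
	}

Which matches both Peter's 67/33 split and, roughly, the ~30% Vincent measured.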
On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:

[...]

> > The fairness is ok with tip/sched/core
> > but with eevdf, Task B only gets around 30%
> >
> > I haven't identified the problem so far
>
> Heh, this is actually the correct behaviour. If you have a u=1 and a
> u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.

Splitting like that sounds like starvation of the sleeper to me. If something
sleeps a lot, it will get even less CPU time on an average than it would if
there was no contention from the u=1 task.

And also CGroups will be even more weird than it already is in such a world,
2 different containers will not get CPU time distributed properly - say if
tasks in one container sleep a lot and tasks in another container are CPU
bound.

thanks,

 - Joel
On Tue, 2023-04-04 at 13:50 +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:

[...]

> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly - say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Lets take a quick peek at some group distribution numbers. Start tbench
and massive_intr in their own VT (autogroup), then in another,
"sleep 300; killall top massive_intr tbench_srv tbench". (caveman method
because perf's refusing to handle fast switchers well for me.. top's
plenty good enough for this anyway, and less intrusive)

massive_intr runs 8ms, sleeps 1, wants 88.8% of 8 runqueues. tbench
buddy pairs want only a tad more CPU, 100% between them, but switch
orders of magnitude more frequently. Very dissimilar breeds of hog.

                        accrued of 2400s   vs master
master.today
  team massive_intr       1120.50s  .466       1.000
  team tbench             1256.13s  .523       1.000
+eevdf
  team massive_intr       1071.94s  .446        .956
  team tbench             1301.56s  .542       1.036

There is of course a distribution delta.. but was it meaningful? Between
mostly idle but kinda noisy GUI perturbing things, and more importantly,
neither load having been manually distributed and pinned, both schedulers
came out pretty good, and both a tad shy of.. perfect is the enemy of good.

Raw numbers below in case my mouse mucked up feeding of numbers to bc
(blame the inanimate, they can't do a damn thing about it).
6.3.0.g148341f-master

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM    TIME+ P COMMAND
 5641 root  20   0  2564  640  640 R 50.33 0.004  2:17.19 5 massive_intr
 5636 root  20   0  2564  640  640 S 49.00 0.004  2:20.05 5 massive_intr
 5647 root  20   0  2564  640  640 R 48.67 0.004  2:21.85 6 massive_intr
 5640 root  20   0  2564  640  640 R 48.00 0.004  2:21.13 6 massive_intr
 5645 root  20   0  2564  640  640 R 47.67 0.004  2:18.25 5 massive_intr
 5638 root  20   0  2564  640  640 R 46.67 0.004  2:22.39 2 massive_intr
 5634 root  20   0  2564  640  640 R 45.00 0.004  2:18.93 4 massive_intr
 5643 root  20   0  2564  640  640 R 44.00 0.004  2:20.71 7 massive_intr
 5639 root  20   0 23468 1664 1536 R 29.00 0.010  1:22.31 3 tbench
 5644 root  20   0 23468 1792 1664 R 28.67 0.011  1:22.32 3 tbench
 5637 root  20   0 23468 1664 1536 S 28.00 0.010  1:22.75 5 tbench
 5631 root  20   0 23468 1792 1664 R 27.00 0.011  1:21.47 4 tbench
 5632 root  20   0 23468 1536 1408 R 27.00 0.010  1:21.78 0 tbench
 5653 root  20   0  6748  896  768 S 26.67 0.006  1:15.26 3 tbench_srv
 5633 root  20   0 23468 1792 1664 R 26.33 0.011  1:22.53 0 tbench
 5635 root  20   0 23468 1920 1792 R 26.33 0.012  1:20.72 7 tbench
 5642 root  20   0 23468 1920 1792 R 26.00 0.012  1:21.73 2 tbench
 5650 root  20   0  6748  768  768 R 25.67 0.005  1:15.71 1 tbench_srv
 5652 root  20   0  6748  768  768 S 25.67 0.005  1:15.71 3 tbench_srv
 5646 root  20   0  6748  768  768 S 25.33 0.005  1:14.97 4 tbench_srv
 5648 root  20   0  6748  896  768 S 25.00 0.006  1:14.66 0 tbench_srv
 5651 root  20   0  6748  896  768 S 24.67 0.006  1:14.79 2 tbench_srv
 5654 root  20   0  6748  768  768 R 24.33 0.005  1:15.47 0 tbench_srv
 5649 root  20   0  6748  768  768 R 24.00 0.005  1:13.95 7 tbench_srv

6.3.0.g148341f-master-eevdf

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM    TIME+ P COMMAND
10561 root  20   0  2564  768  640 R 49.83 0.005  2:14.86 3 massive_intr
10562 root  20   0  2564  768  640 R 49.50 0.005  2:14.00 3 massive_intr
10564 root  20   0  2564  896  768 R 49.50 0.006  2:14.11 6 massive_intr
10559 root  20   0  2564  768  640 R 47.84 0.005  2:14.03 2 massive_intr
10560 root  20   0  2564  768  640 R 45.51 0.005  2:13.92 7 massive_intr
10557 root  20   0  2564  896  768 R 44.85 0.006  2:13.59 7 massive_intr
10563 root  20   0  2564  896  768 R 44.85 0.006  2:13.53 6 massive_intr
10558 root  20   0  2564  768  640 R 43.52 0.005  2:13.90 2 massive_intr
10577 root  20   0 23468 1920 1792 R 35.22 0.012  1:37.06 0 tbench
10574 root  20   0 23468 1920 1792 R 32.23 0.012  1:32.89 4 tbench
10580 root  20   0 23468 1920 1792 R 29.57 0.012  1:34.95 0 tbench
10575 root  20   0 23468 1792 1664 R 29.24 0.011  1:31.66 4 tbench
10576 root  20   0 23468 1792 1664 S 28.57 0.011  1:34.55 5 tbench
10573 root  20   0 23468 1792 1664 R 28.24 0.011  1:33.17 5 tbench
10578 root  20   0 23468 1920 1792 S 28.24 0.012  1:33.97 1 tbench
10579 root  20   0 23468 1920 1792 R 28.24 0.012  1:36.09 1 tbench
10587 root  20   0  6748  768  640 S 26.91 0.005  1:09.45 0 tbench_srv
10582 root  20   0  6748  768  640 R 24.25 0.005  1:08.19 4 tbench_srv
10588 root  20   0  6748  640  640 R 22.59 0.004  1:09.15 0 tbench_srv
10583 root  20   0  6748  640  640 R 21.93 0.004  1:07.93 4 tbench_srv
10586 root  20   0  6748  640  640 S 21.59 0.004  1:07.92 1 tbench_srv
10581 root  20   0  6748  640  640 S 21.26 0.004  1:07.08 5 tbench_srv
10585 root  20   0  6748  640  640 R 21.26 0.004  1:08.89 5 tbench_srv
10584 root  20   0  6748  768  640 S 20.93 0.005  1:08.61 1 tbench_srv
On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
>
> > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>
> Splitting like that sounds like starvation of the sleeper to me. If something
> sleeps a lot, it will get even less CPU time on an average than it would if
> there was no contention from the u=1 task.

No, sleeping, per definition, means you're not contending for CPU. What
CFS does, giving them a little boost, is strictly yuck and messes with
latency -- because suddenly you have a task that said it wasn't
competing appear as if it were, but you didn't run it (how could you, it
wasn't there to run) -- but it still needs to catch up.

The reason it does that, is mostly because at the time we didn't want to
do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
computing was still a thing :/ So hacks happened.

That said; I'm starting to regret not pushing the EEVDF thing harder
back in 2010 when I first wrote it :/

> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly - say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Cgroups are an abomination anyway :-) /me runs like hell. But no, I
don't actually expect too much trouble there. Or rather, as per the
above, time distribution is now more proper than it was :-)
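For context, the lag limiting that the commit message and this exchange lean on works roughly as follows (a paraphrase of update_entity_lag() as I read the series, not a verbatim copy; treat the exact limit as approximate):

	static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		s64 lag, limit;

		/* Lag: how far this entity trails (positive) or leads
		 * (negative) the queue's load-weighted average vruntime. */
		lag = avg_vruntime(cfs_rq) - se->vruntime;

		/* Clamp to about two slices, scaled to virtual time, so a
		 * long sleep cannot bank unbounded credit. */
		limit = calc_delta_fair(max_t(u64, 2 * se->slice, TICK_NSEC), se);
		se->vlag = clamp(lag, -limit, limit);
	}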
On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> > Splitting like that sounds like starvation of the sleeper to me. If something
> > sleeps a lot, it will get even less CPU time on an average than it would if
> > there was no contention from the u=1 task.
>
> No, sleeping, per definition, means you're not contending for CPU. What
> CFS does, giving them a little boost, is strictly yuck and messes with
> latency -- because suddenly you have a task that said it wasn't
> competing appear as if it were, but you didn't run it (how could you, it
> wasn't there to run) -- but it still needs to catch up.
>
> The reason it does that, is mostly because at the time we didn't want to
> do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
> computing was still a thing :/ So hacks happened.

Also you have the whole "boost tasks" that sleep a lot with CFS right?
Like a task handling user input sleeps a lot, but when it wakes up,
it gets higher dynamic priority as its vruntime did not advance. I
guess EEVDF also gets you the same thing but still messes with the CPU
usage?

> That said; I'm starting to regret not pushing the EEVDF thing harder
> back in 2010 when I first wrote it :/
>
> > And also CGroups will be even more weird than it already is in such a world,
> > 2 different containers will not get CPU time distributed properly - say if
> > tasks in one container sleep a lot and tasks in another container are CPU
> > bound.
>
> Cgroups are an abomination anyway :-) /me runs like hell. But no, I
> don't actually expect too much trouble there.

So, with 2 equally weighted containers, if one has a task that sleeps
50% of the time, and another has a 100% task, then the sleeper will
only run 33% of the time? I can see people running containers having a
problem with that (a customer running one container gets less CPU than
the other). Sorry if I missed something.

But yeah I do find the whole EEVDF idea interesting but I admit I have
to research it more.

 - Joel
On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:
> On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:

[...]

> So, with 2 equally weighted containers, if one has a task that sleeps
> 50% of the time, and another has a 100% task, then the sleeper will
> only run 33% of the time? I can see people running containers having a
> problem with that (a customer running one container gets less CPU than
> the other). Sorry if I missed something.

But the 50% sleeper is _asking_ for less CPU. Doing 50% for each would
mean that when the sleeper task was awake it always ran, always won, to
the exclusion of any one else. (Assuming 1 CPU...)

Cheers,
Phil

> But yeah I do find the whole EEVDF idea interesting but I admit I have
> to research it more.
>
> - Joel
> On Apr 14, 2023, at 1:18 PM, Phil Auld <pauld@redhat.com> wrote:
>
> On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:

[...]

>> So, with 2 equally weighted containers, if one has a task that sleeps
>> 50% of the time, and another has a 100% task, then the sleeper will
>> only run 33% of the time? I can see people running containers having a
>> problem with that (a customer running one container gets less CPU than
>> the other). Sorry if I missed something.
>
> But the 50% sleeper is _asking_ for less CPU. Doing 50% for each would
> mean that when the sleeper task was awake it always ran, always won, to
> the exclusion of any one else. (Assuming 1 CPU...)

It sounds like you are saying that if the task busy looped instead of
sleeping, it would get more CPU during the time it is not busy looping
but doing some real work. That sounds like encouraging abuse to get more
perf. But again, I have not looked too closely at EEVDF or Peter's
patches. I was just going by Vincent's test and was cautioning not to
break users who depend on CFS shares.

Cheers,

 - Joel
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_slice += se->slice * weight;
 	cfs_rq->avg_load += weight;
 }
 
@@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_slice -= se->slice * weight;
 	cfs_rq->avg_load -= weight;
 }
 
@@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		lag = se->vlag;
 
 		/*
+		 * For latency sensitive tasks; those that have a shorter than
+		 * average slice and do not fully consume the slice, transition
+		 * to EEVDF placement strategy #2.
+		 */
+		if (sched_feat(PLACE_FUDGE) &&
+		    cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
+			lag += vslice;
+			if (lag > 0)
+				lag = 0;
+		}
+
+		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
 		 * average and compensate for this, otherwise lag can quickly
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,6 +5,7 @@
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_FUDGE, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -559,6 +559,7 @@ struct cfs_rq {
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
 	s64			avg_vruntime;
+	u64			avg_slice;
 	u64			avg_load;
 
 	u64			exec_clock;
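
Two observations on the bookkeeping (mine, not stated in the thread): keeping avg_slice as a load-weighted sum alongside avg_load means the "shorter than average slice" test needs no division, since avg_slice / avg_load > se->slice can be evaluated as avg_slice > se->slice * avg_load; and, assuming the standard sched_feat() plumbing, PLACE_FUDGE should be toggleable at runtime via /sys/kernel/debug/sched/features on CONFIG_SCHED_DEBUG kernels. A small userspace sketch of the division-free comparison, with made-up numbers:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* Hypothetical queue: three entities at nice-0 weight 1024,
		 * two with 3ms slices and one with a 1ms slice. */
		uint64_t slice[]  = { 3000000, 3000000, 1000000 };	/* ns */
		uint64_t weight[] = { 1024, 1024, 1024 };
		uint64_t avg_slice = 0, avg_load = 0;

		for (int i = 0; i < 3; i++) {
			avg_slice += slice[i] * weight[i];
			avg_load  += weight[i];
		}

		/* avg_slice/avg_load > se->slice, without the division: */
		uint64_t se_slice = 1000000;	/* the 1ms entity */
		printf("shorter than average: %s\n",
		       avg_slice > se_slice * avg_load ? "yes" : "no");
		return 0;
	}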