Message ID | c7b706d30d6316c52853ca056db5beb82ba72863.1699095159.git.bristot@kernel.org |
State | New |
Headers |
From: Daniel Bristot de Oliveira <bristot@kernel.org>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, linux-kernel@vger.kernel.org, Luca Abeni <luca.abeni@santannapisa.it>, Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>, Thomas Gleixner <tglx@linutronix.de>, Joel Fernandes <joel@joelfernandes.org>, Vineeth Pillai <vineeth@bitbyteword.org>, Shuah Khan <skhan@linuxfoundation.org>, bristot@kernel.org, Phil Auld <pauld@redhat.com>
Subject: [PATCH v5 6/7] sched/deadline: Deferrable dl server
Date: Sat, 4 Nov 2023 11:59:23 +0100
Message-Id: <c7b706d30d6316c52853ca056db5beb82ba72863.1699095159.git.bristot@kernel.org>
In-Reply-To: <cover.1699095159.git.bristot@kernel.org>
References: <cover.1699095159.git.bristot@kernel.org>
Series | SCHED_DEADLINE server infrastructure
Commit Message
Daniel Bristot de Oliveira
Nov. 4, 2023, 10:59 a.m. UTC
Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after it runs
for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.

For example, an IRQ dispatches two tasks on an idle system, a fair one
and an RT one. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.

By setting the zerolax option, the dl_server will dispatch a
SCHED_DEADLINE reservation with replenished runtime, but throttled.
The dl_timer will be set for (period - runtime) ns from the start
time, thus boosting the fair rq at its 0-laxity time with respect to
the rt_rq.

If the fair scheduler has the opportunity to run while waiting for the
zerolax time, the dl server runtime will be consumed. If the runtime
is completely consumed before the zerolax time, the server will be
replenished while still in a throttled state, and the dl_timer will be
reset to the new zerolax time.

If the fair server reaches the zerolax time without consuming
its runtime, the server will be boosted, following CBS rules
(thus without breaking SCHED_DEADLINE).
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
include/linux/sched.h | 2 +
kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 3 ++
3 files changed, 103 insertions(+), 2 deletions(-)
Comments
On Sat, Nov 04, 2023 at 11:59:23AM +0100, Daniel Bristot de Oliveira wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b15f7f376a67..399237cd9f59 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  		account_group_exec_runtime(curtask, delta_exec);
>  		if (curtask->server)
>  			dl_server_update(curtask->server, delta_exec);
> +		else
> +			dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>  	}
> 
>  	account_cfs_rq_runtime(cfs_rq, delta_exec);

I've rewritten that something like so..

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1182,12 +1182,13 @@ s64 update_curr_common(struct rq *rq)
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
+	struct rq *rq = rq_of(cfs_rq);
 	s64 delta_exec;
 
 	if (unlikely(!curr))
 		return;
 
-	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	delta_exec = update_curr_se(rq, curr);
 	if (unlikely(delta_exec <= 0))
 		return;
 
@@ -1195,8 +1196,17 @@ static void update_curr(struct cfs_rq *c
 	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
-	if (entity_is_task(curr))
-		update_curr_task(task_of(curr), delta_exec);
+	if (entity_is_task(curr)) {
+		struct task_struct *p = task_of(curr);
+		update_curr_task(p, delta_exec);
+		/*
+		 * Any fair task that runs outside of fair_server should
+		 * account against fair_server such that it can account for
+		 * this time and possible avoid running this period.
+		 */
+		if (p->dl_server != &rq->fair_server)
+			dl_server_update(&rq->fair_server, delta_exec);
+	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
On 11/6/23 15:55, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:23AM +0100, Daniel Bristot de Oliveira wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b15f7f376a67..399237cd9f59 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>  		account_group_exec_runtime(curtask, delta_exec);
>>  		if (curtask->server)
>>  			dl_server_update(curtask->server, delta_exec);
>> +		else
>> +			dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>>  	}
>> 
>>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>
> @@ -1195,8 +1196,17 @@ static void update_curr(struct cfs_rq *c
> 	update_deadline(cfs_rq, curr);
> 	update_min_vruntime(cfs_rq);
> 
> -	if (entity_is_task(curr))
> -		update_curr_task(task_of(curr), delta_exec);
> +	if (entity_is_task(curr)) {
> +		struct task_struct *p = task_of(curr);
> +		update_curr_task(p, delta_exec);
> +		/*
> +		 * Any fair task that runs outside of fair_server should
> +		 * account against fair_server such that it can account for
> +		 * this time and possible avoid running this period.
> +		 */
> +		if (p->dl_server != &rq->fair_server)
> +			dl_server_update(&rq->fair_server, delta_exec);

aren't we missing:

	else
		dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);

or am I missing something? :-)

> +	}
> 
> 	account_cfs_rq_runtime(cfs_rq, delta_exec);
> }
Hi Daniel,

On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira <bristot@kernel.org> wrote:
>
> [...]
>
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> ---
>  include/linux/sched.h   |   2 +
>  kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/fair.c     |   3 ++
>  3 files changed, 103 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5ac1f252e136..56e53e6fd5a0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -660,6 +660,8 @@ struct sched_dl_entity {
>  	unsigned int			dl_non_contending : 1;
>  	unsigned int			dl_overrun	  : 1;
>  	unsigned int			dl_server         : 1;
> +	unsigned int			dl_zerolax	  : 1;
> +	unsigned int			dl_zerolax_armed  : 1;
> 
>  	/*
>  	 * Bandwidth enforcement timer. Each -deadline task has its
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1d7b96ca9011..69ee1fbd60e4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
>  	/* for non-boosted task, pi_of(dl_se) == dl_se */
>  	dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
>  	dl_se->runtime = pi_of(dl_se)->dl_runtime;
> +
> +	/*
> +	 * If it is a zerolax reservation, throttle it.
> +	 */
> +	if (dl_se->dl_zerolax) {
> +		dl_se->dl_throttled = 1;
> +		dl_se->dl_zerolax_armed = 1;
> +	}
>  }
> 
>  /*
> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   * could happen are, typically, a entity voluntarily trying to overcome its
>   * runtime, or it just underestimated it during sched_setattr().
>   */
> +static int start_dl_timer(struct sched_dl_entity *dl_se);
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	dl_se->dl_yielded = 0;
>  	if (dl_se->dl_throttled)
>  		dl_se->dl_throttled = 0;
> +
> +	/*
> +	 * If this is the replenishment of a zerolax reservation,
> +	 * clear the flag and return.
> +	 */
> +	if (dl_se->dl_zerolax_armed) {
> +		dl_se->dl_zerolax_armed = 0;
> +		return;
> +	}
> +
> +	/*
> +	 * A this point, if the zerolax server is not armed, and the deadline
> +	 * is in the future, throttle the server and arm the zerolax timer.
> +	 */
> +	if (dl_se->dl_zerolax &&
> +	    dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> +		if (!is_dl_boosted(dl_se)) {
> +			dl_se->dl_zerolax_armed = 1;
> +			dl_se->dl_throttled = 1;
> +			start_dl_timer(dl_se);
> +		}
> +	}
>  }
> 
>  /*
> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
>  		}
> 
>  		replenish_dl_new_period(dl_se, rq);
> +	} else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> +		/*
> +		 * The server can still use its previous deadline, so throttle
> +		 * and arm the zero-laxity timer.
> +		 */
> +		dl_se->dl_zerolax_armed = 1;
> +		dl_se->dl_throttled = 1;
>  	}
>  }
> 
> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
>  	 * We want the timer to fire at the deadline, but considering
>  	 * that it is actually coming from rq->clock and not from
>  	 * hrtimer's time base reading.
> +	 *
> +	 * The zerolax reservation will have its timer set to the
> +	 * deadline - runtime. At that point, the CBS rule will decide
> +	 * if the current deadline can be used, or if a replenishment
> +	 * is required to avoid add too much pressure on the system
> +	 * (current u > U).
>  	 */
> -	act = ns_to_ktime(dl_next_period(dl_se));
> +	if (dl_se->dl_zerolax_armed) {
> +		WARN_ON_ONCE(!dl_se->dl_throttled);
> +		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);

Just a question, here if dl_se->deadline - dl_se->runtime is large, then does that mean that server activation will be much more into the future? So say I want to give CFS 30%, then it will take 70% of the period before CFS preempts RT thus "starving" CFS for this duration. I think that's Ok for smaller periods and runtimes, though.

I think it does reserve the amount of required CFS bandwidth so it is probably OK, though it is perhaps letting RT run more initially (say if CFS tasks are not CPU bound and occasionally wake up, they will always be hit by the 70% latency AFAICS, which may be large for large periods and small runtimes).

I/we're currently trying these patches on ChromeOS as well. Just started going over it to understand the patch. Looking nice so far and thanks,

 - Joel
On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Daniel,
>
> On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> <bristot@kernel.org> wrote:
> >
> > [...]
> >
> > -	act = ns_to_ktime(dl_next_period(dl_se));
> > +	if (dl_se->dl_zerolax_armed) {
> > +		WARN_ON_ONCE(!dl_se->dl_throttled);
> > +		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
>
> Just a question, here if dl_se->deadline - dl_se->runtime is large,
> then does that mean that server activation will be much more into the
> future? So say I want to give CFS 30%, then it will take 70% of the
> period before CFS preempts RT thus "starving" CFS for this duration. I
> think that's Ok for smaller periods and runtimes, though.
>
> I think it does reserve the amount of required CFS bandwidth so it is
> probably OK, though it is perhaps letting RT run more initially (say
> if CFS tasks are not CPU bound and occasionally wake up, they will
> always be hit by the 70% latency AFAICS which may be large for large
> periods and small runtimes).

One more consideration I guess is, because the server is throttled till the 0-laxity time, it is possible that if CFS sleeps even a bit (after the DL-server is unthrottled), then it will be pushed out to a full current deadline + period due to CBS. In such a situation, if the CFS-server is the only DL task running, it might starve RT for a bit more time.

Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, the 0-laxity timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this point which is < the "time till deadline" of 0.005s). Now the runtime of the CFS-server will be replenished to the full 3s (due to CBS) and the deadline pushed out. The end result is the total runtime that the CFS-server actually gets is 0.0595s (though yes it did sleep for 5ms in between, still that's tiny -- say if it briefly blocked on a kernel mutex). On the other hand, if the CFS server started a bit earlier than the 0-laxity, it would probably not have had CBS pushing it out.

This is likely also not an issue for shorter runtime/period values, still the throttling till later has a small trade-off (Not saying we should not do this, this whole series is likely a huge improvement over the current RT throttling).

There is a chance I am uttering nonsense as I am not a DL expert, so apologies if so. Thanks.
On Mon, Nov 6, 2023 at 4:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > [...]
> >
> > Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
> > timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
> > at 0.295s (its remaining runtime is 0.01s at this point which is < the
> > "time till deadline" of 0.005s). Now the runtime of the CFS-server
> > will be replenished to the full 3s (due to CBS) and the deadline
> > pushed out. The end result is the total runtime that the CFS-server
> > actually gets is 0.0595s (though yes it did sleep for 0.005s in between,
> > still that's tiny -- say if it briefly blocked on a kernel mutex).

Blah, I got lost in decimal points. Here's the example again:

Say the CFS-server runtime is 0.3s and the period is 1s. At 0.7s, the 0-laxity timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up at 0.995s (its remaining runtime is 0.01s at this point, which is more than the 0.005s left until the deadline). Now the runtime of the CFS-server will be replenished to the full 0.3s (due to CBS) and the deadline pushed out.

The end result is that the total runtime the CFS-server actually gets around the period boundary is about 0.59s (though yes, it did sleep for 5ms in between, still that's tiny -- say if it briefly blocked on a kernel mutex). That's almost double the allocated runtime.

This is just theoretical and I have yet to see if it is actually an issue in practice. Thanks.
On 11/6/23 20:32, Joel Fernandes wrote: > Hi Daniel, > > On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira > <bristot@kernel.org> wrote: >> >> Among the motivations for the DL servers is the real-time throttling >> mechanism. This mechanism works by throttling the rt_rq after >> running for a long period without leaving space for fair tasks. >> >> The base dl server avoids this problem by boosting fair tasks instead >> of throttling the rt_rq. The point is that it boosts without waiting >> for potential starvation, causing some non-intuitive cases. >> >> For example, an IRQ dispatches two tasks on an idle system, a fair >> and an RT. The DL server will be activated, running the fair task >> before the RT one. This problem can be avoided by deferring the >> dl server activation. >> >> By setting the zerolax option, the dl_server will dispatch an >> SCHED_DEADLINE reservation with replenished runtime, but throttled. >> >> The dl_timer will be set for (period - runtime) ns from start time. >> Thus boosting the fair rq on its 0-laxity time with respect to >> rt_rq. >> >> If the fair scheduler has the opportunity to run while waiting >> for zerolax time, the dl server runtime will be consumed. If >> the runtime is completely consumed before the zerolax time, the >> server will be replenished while still in a throttled state. Then, >> the dl_timer will be reset to the new zerolax time >> >> If the fair server reaches the zerolax time without consuming >> its runtime, the server will be boosted, following CBS rules >> (thus without breaking SCHED_DEADLINE). 
>> >> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> >> --- >> include/linux/sched.h | 2 + >> kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++- >> kernel/sched/fair.c | 3 ++ >> 3 files changed, 103 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 5ac1f252e136..56e53e6fd5a0 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -660,6 +660,8 @@ struct sched_dl_entity { >> unsigned int dl_non_contending : 1; >> unsigned int dl_overrun : 1; >> unsigned int dl_server : 1; >> + unsigned int dl_zerolax : 1; >> + unsigned int dl_zerolax_armed : 1; >> >> /* >> * Bandwidth enforcement timer. Each -deadline task has its >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c >> index 1d7b96ca9011..69ee1fbd60e4 100644 >> --- a/kernel/sched/deadline.c >> +++ b/kernel/sched/deadline.c >> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se, >> /* for non-boosted task, pi_of(dl_se) == dl_se */ >> dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline; >> dl_se->runtime = pi_of(dl_se)->dl_runtime; >> + >> + /* >> + * If it is a zerolax reservation, throttle it. >> + */ >> + if (dl_se->dl_zerolax) { >> + dl_se->dl_throttled = 1; >> + dl_se->dl_zerolax_armed = 1; >> + } >> } >> >> /* >> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se) >> * could happen are, typically, a entity voluntarily trying to overcome its >> * runtime, or it just underestimated it during sched_setattr(). 
>> */ >> +static int start_dl_timer(struct sched_dl_entity *dl_se); >> static void replenish_dl_entity(struct sched_dl_entity *dl_se) >> { >> struct dl_rq *dl_rq = dl_rq_of_se(dl_se); >> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se) >> dl_se->dl_yielded = 0; >> if (dl_se->dl_throttled) >> dl_se->dl_throttled = 0; >> + >> + /* >> + * If this is the replenishment of a zerolax reservation, >> + * clear the flag and return. >> + */ >> + if (dl_se->dl_zerolax_armed) { >> + dl_se->dl_zerolax_armed = 0; >> + return; >> + } >> + >> + /* >> + * A this point, if the zerolax server is not armed, and the deadline >> + * is in the future, throttle the server and arm the zerolax timer. >> + */ >> + if (dl_se->dl_zerolax && >> + dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) { >> + if (!is_dl_boosted(dl_se)) { >> + dl_se->dl_zerolax_armed = 1; >> + dl_se->dl_throttled = 1; >> + start_dl_timer(dl_se); >> + } >> + } >> } >> >> /* >> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se) >> } >> >> replenish_dl_new_period(dl_se, rq); >> + } else if (dl_server(dl_se) && dl_se->dl_zerolax) { >> + /* >> + * The server can still use its previous deadline, so throttle >> + * and arm the zero-laxity timer. >> + */ >> + dl_se->dl_zerolax_armed = 1; >> + dl_se->dl_throttled = 1; >> } >> } >> >> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se) >> * We want the timer to fire at the deadline, but considering >> * that it is actually coming from rq->clock and not from >> * hrtimer's time base reading. >> + * >> + * The zerolax reservation will have its timer set to the >> + * deadline - runtime. At that point, the CBS rule will decide >> + * if the current deadline can be used, or if a replenishment >> + * is required to avoid add too much pressure on the system >> + * (current u > U). 
>> */
>> - act = ns_to_ktime(dl_next_period(dl_se));
>> + if (dl_se->dl_zerolax_armed) {
>> + WARN_ON_ONCE(!dl_se->dl_throttled);
>> + act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
>
> Just a question, here if dl_se->deadline - dl_se->runtime is large,
> then does that mean that server activation will be much more into the
> future? So say I want to give CFS 30%, then it will take 70% of the
> period before CFS preempts RT thus "starving" CFS for this duration. I
> think that's Ok for smaller periods and runtimes, though.

I think you are answering yourself here :-) If the default values are not
good, change them o/

The current interface allows you to have a more responsive/small chunk of CPU
or a less responsive/large chunk of CPU... you can even place RT below CFS
for a "bounded amount of time" by disabling the defer option.... per CPU.

All at once, with different period patterns on the CPUs to increase the
chances of having a cfs rq ready on another CPU... like...
[3/10 - 2/6 - 1.5/5 - 1/3 no defer] in a 4-CPU system :-).

The default setup is based on the throttling, to avoid changing the
historical behavior for those that... are happy with them.

-- Daniel

> - Joel
On 11/6/23 22:37, Joel Fernandes wrote: > On Mon, Nov 6, 2023 at 4:32 PM Joel Fernandes <joel@joelfernandes.org> wrote: >> >> On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <joel@joelfernandes.org> wrote: >>> >>> Hi Daniel, >>> >>> On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira >>> <bristot@kernel.org> wrote: >>>> >>>> Among the motivations for the DL servers is the real-time throttling >>>> mechanism. This mechanism works by throttling the rt_rq after >>>> running for a long period without leaving space for fair tasks. >>>> >>>> The base dl server avoids this problem by boosting fair tasks instead >>>> of throttling the rt_rq. The point is that it boosts without waiting >>>> for potential starvation, causing some non-intuitive cases. >>>> >>>> For example, an IRQ dispatches two tasks on an idle system, a fair >>>> and an RT. The DL server will be activated, running the fair task >>>> before the RT one. This problem can be avoided by deferring the >>>> dl server activation. >>>> >>>> By setting the zerolax option, the dl_server will dispatch an >>>> SCHED_DEADLINE reservation with replenished runtime, but throttled. >>>> >>>> The dl_timer will be set for (period - runtime) ns from start time. >>>> Thus boosting the fair rq on its 0-laxity time with respect to >>>> rt_rq. >>>> >>>> If the fair scheduler has the opportunity to run while waiting >>>> for zerolax time, the dl server runtime will be consumed. If >>>> the runtime is completely consumed before the zerolax time, the >>>> server will be replenished while still in a throttled state. Then, >>>> the dl_timer will be reset to the new zerolax time >>>> >>>> If the fair server reaches the zerolax time without consuming >>>> its runtime, the server will be boosted, following CBS rules >>>> (thus without breaking SCHED_DEADLINE). 
>>>> >>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> >>>> --- >>>> include/linux/sched.h | 2 + >>>> kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++- >>>> kernel/sched/fair.c | 3 ++ >>>> 3 files changed, 103 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h >>>> index 5ac1f252e136..56e53e6fd5a0 100644 >>>> --- a/include/linux/sched.h >>>> +++ b/include/linux/sched.h >>>> @@ -660,6 +660,8 @@ struct sched_dl_entity { >>>> unsigned int dl_non_contending : 1; >>>> unsigned int dl_overrun : 1; >>>> unsigned int dl_server : 1; >>>> + unsigned int dl_zerolax : 1; >>>> + unsigned int dl_zerolax_armed : 1; >>>> >>>> /* >>>> * Bandwidth enforcement timer. Each -deadline task has its >>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c >>>> index 1d7b96ca9011..69ee1fbd60e4 100644 >>>> --- a/kernel/sched/deadline.c >>>> +++ b/kernel/sched/deadline.c >>>> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se, >>>> /* for non-boosted task, pi_of(dl_se) == dl_se */ >>>> dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline; >>>> dl_se->runtime = pi_of(dl_se)->dl_runtime; >>>> + >>>> + /* >>>> + * If it is a zerolax reservation, throttle it. >>>> + */ >>>> + if (dl_se->dl_zerolax) { >>>> + dl_se->dl_throttled = 1; >>>> + dl_se->dl_zerolax_armed = 1; >>>> + } >>>> } >>>> >>>> /* >>>> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se) >>>> * could happen are, typically, a entity voluntarily trying to overcome its >>>> * runtime, or it just underestimated it during sched_setattr(). 
>>>> */ >>>> +static int start_dl_timer(struct sched_dl_entity *dl_se); >>>> static void replenish_dl_entity(struct sched_dl_entity *dl_se) >>>> { >>>> struct dl_rq *dl_rq = dl_rq_of_se(dl_se); >>>> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se) >>>> dl_se->dl_yielded = 0; >>>> if (dl_se->dl_throttled) >>>> dl_se->dl_throttled = 0; >>>> + >>>> + /* >>>> + * If this is the replenishment of a zerolax reservation, >>>> + * clear the flag and return. >>>> + */ >>>> + if (dl_se->dl_zerolax_armed) { >>>> + dl_se->dl_zerolax_armed = 0; >>>> + return; >>>> + } >>>> + >>>> + /* >>>> + * A this point, if the zerolax server is not armed, and the deadline >>>> + * is in the future, throttle the server and arm the zerolax timer. >>>> + */ >>>> + if (dl_se->dl_zerolax && >>>> + dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) { >>>> + if (!is_dl_boosted(dl_se)) { >>>> + dl_se->dl_zerolax_armed = 1; >>>> + dl_se->dl_throttled = 1; >>>> + start_dl_timer(dl_se); >>>> + } >>>> + } >>>> } >>>> >>>> /* >>>> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se) >>>> } >>>> >>>> replenish_dl_new_period(dl_se, rq); >>>> + } else if (dl_server(dl_se) && dl_se->dl_zerolax) { >>>> + /* >>>> + * The server can still use its previous deadline, so throttle >>>> + * and arm the zero-laxity timer. >>>> + */ >>>> + dl_se->dl_zerolax_armed = 1; >>>> + dl_se->dl_throttled = 1; >>>> } >>>> } >>>> >>>> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se) >>>> * We want the timer to fire at the deadline, but considering >>>> * that it is actually coming from rq->clock and not from >>>> * hrtimer's time base reading. >>>> + * >>>> + * The zerolax reservation will have its timer set to the >>>> + * deadline - runtime. 
At that point, the CBS rule will decide >>>> + * if the current deadline can be used, or if a replenishment >>>> + * is required to avoid add too much pressure on the system >>>> + * (current u > U). >>>> */ >>>> - act = ns_to_ktime(dl_next_period(dl_se)); >>>> + if (dl_se->dl_zerolax_armed) { >>>> + WARN_ON_ONCE(!dl_se->dl_throttled); >>>> + act = ns_to_ktime(dl_se->deadline - dl_se->runtime); >>> >>> Just a question, here if dl_se->deadline - dl_se->runtime is large, >>> then does that mean that server activation will be much more into the >>> future? So say I want to give CFS 30%, then it will take 70% of the >>> period before CFS preempts RT thus "starving" CFS for this duration. I >>> think that's Ok for smaller periods and runtimes, though. >>> >>> I think it does reserve the amount of required CFS bandwidth so it is >>> probably OK, though it is perhaps letting RT run more initially (say >>> if CFS tasks are not CPU bound and occasionally wake up, they will >>> always be hit by the 70% latency AFAICS which may be large for large >>> periods and small runtimes). >>> >> >> One more consideration I guess is, because the server is throttled >> till 0-laxity time, it is possible that if CFS sleeps even a bit >> (after the DL-server is unthrottled), then it will be pushed out to a >> full current deadline + period due to CBS. In such a situation, if >> CFS-server is the only DL task running, it might starve RT for a bit >> more time. >> >> Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity >> timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up >> at 0.295s (its remaining runtime is 0.01s at this point which is < the >> "time till deadline" of 0.005s). Now the runtime of the CFS-server >> will be replenished to the full 3s (due to CBS) and the deadline >> pushed out. 
>> The end result is the total runtime that the CFS-server
>> actually gets is 0.0595s (though yes it did sleep for 5ms in between,
>> still that's tiny -- say if it briefly blocked on a kernel mutex).
>
> Blah, I got lost in decimal points. Here's the example again:
>
> Say CFS-server runtime is 0.3s and period is 1s.
>
> At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> point which is < the "time till deadline" of 0.005s)
>
> Now the runtime of the CFS-server will be replenished to the full 0.3s
> (due to CBS) and the deadline pushed out.
>
> The end result is, the total runtime that the CFS-server actually gets
> is 0.595s (though yes it did sleep for 5ms in between, still that's
> tiny -- say if it briefly blocked on a kernel mutex). That's almost
> double the allocated runtime.

I think I got what you mean, and I think I took for granted that we were
doing overload control on the replenishment, but it seems that we are
not... I just got back from a doctor's appointment; I will do a proper
reply later today.

Thanks Joel!

-- Daniel

> This is just theoretical and I have yet to see if it is actually an
> issue in practice.
>
> Thanks.
On Sat, 4 Nov 2023 11:59:23 +0100 Daniel Bristot de Oliveira <bristot@kernel.org> wrote: > @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se) > * could happen are, typically, a entity voluntarily trying to overcome its > * runtime, or it just underestimated it during sched_setattr(). > */ > +static int start_dl_timer(struct sched_dl_entity *dl_se); > static void replenish_dl_entity(struct sched_dl_entity *dl_se) > { > struct dl_rq *dl_rq = dl_rq_of_se(dl_se); Nit, but you really shouldn't have a function prototype declaration right next to a function, and especially not between the function's comment and the function itself. -- Steve
On Mon, 6 Nov 2023 16:37:32 -0500
Joel Fernandes <joel@joelfernandes.org> wrote:

> Say CFS-server runtime is 0.3s and period is 1s.
>
> At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> point which is < the "time till deadline" of 0.005s)
>
> Now the runtime of the CFS-server will be replenished to the full 0.3s
> (due to CBS) and the deadline pushed out.
>
> The end result is, the total runtime that the CFS-server actually gets
> is 0.595s (though yes it did sleep for 5ms in between, still that's
> tiny -- say if it briefly blocked on a kernel mutex). That's almost
> double the allocated runtime.
>
> This is just theoretical and I have yet to see if it is actually an
> issue in practice.

Let me see if I understand what you are asking. By pushing the execution of
the CFS-server to the end of its period, if it was briefly blocked and
was not able to consume all of its zerolax time, its bandwidth gets
refreshed. Then it can run again, basically doubling its total time.

But this is basically saying that it ran for its runtime at the start of
one period and at the beginning of another, right?

Is that an issue? The CFS-server is still just consuming its time per
period. That means that an RT task was starving the system that much to
push it forward too much anyway. I wonder if we just document this
behavior, if that would be enough?

-- Steve
On Tue, 7 Nov 2023 11:47:32 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Let me see if I understand what you are asking. By pushing the execution of
> the CFS-server to the end of its period, if it was briefly blocked and
> was not able to consume all of its zerolax time, its bandwidth gets
> refreshed. Then it can run again, basically doubling its total time.
>
> But this is basically saying that it ran for its runtime at the start of
> one period and at the beginning of another, right?
>
> Is that an issue? The CFS-server is still just consuming its time per
> period. That means that an RT task was starving the system that much to
> push it forward too much anyway. I wonder if we just document this
> behavior, if that would be enough?

I may have even captured this scenario.

I ran my migrate[1] program which I use to test RT migration, and it kicks
off a bunch of RT tasks. I like this test because with the
/proc/sys/kernel/sched_rt_* options set, it shows the lines where they are
throttled really well. This time, I disabled those, and just kept the
default:

 ~# cat /sys/kernel/debug/sched/rq/cpu0/fair_server_defer
 1
 ~# cat /sys/kernel/debug/sched/rq/cpu0/fair_server_period
 1000000000
 ~# cat /sys/kernel/debug/sched/rq/cpu0/fair_server_runtime
 50000000

And ran my userspin[2] program. And recorded it with:

 trace-cmd record -e sched_switch

The kernelshark output shows the delay from userspin taking up 0.1 seconds
(double the time usually given), with a little preemption in between.

-- Steve
On 11/7/23 17:47, Steven Rostedt wrote:
> On Mon, 6 Nov 2023 16:37:32 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
>> Say CFS-server runtime is 0.3s and period is 1s.
>>
>> At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
>> 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
>> point which is < the "time till deadline" of 0.005s)
>>
>> Now the runtime of the CFS-server will be replenished to the full 0.3s
>> (due to CBS) and the deadline pushed out.
>>
>> The end result is, the total runtime that the CFS-server actually gets
>> is 0.595s (though yes it did sleep for 5ms in between, still that's
>> tiny -- say if it briefly blocked on a kernel mutex). That's almost
>> double the allocated runtime.
>>
>> This is just theoretical and I have yet to see if it is actually an
>> issue in practice.
>
> Let me see if I understand what you are asking. By pushing the execution of
> the CFS-server to the end of its period, if it was briefly blocked and
> was not able to consume all of its zerolax time, its bandwidth gets
> refreshed. Then it can run again, basically doubling its total time.
>
> But this is basically saying that it ran for its runtime at the start of
> one period and at the beginning of another, right?
>
> Is that an issue? The CFS-server is still just consuming its time per
> period. That means that an RT task was starving the system that much to
> push it forward too much anyway. I wonder if we just document this
> behavior, if that would be enough?

The code is not doing what I intended because I thought it was doing overload
control on the replenishment, but it is not (my bad).

Joel is seeing this timeline:

 - w=waiting
 - r=running
 - s=sleeping
 - T=throttled
 - 3/10 reservation (30%).

|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww|rrrrrrrrrrrrrrrrrrrrrrrrrrr|s|rrrrrrrrr+rrrrrrrr+rrrrrrrrr|TTTTTTTTTT  < CPU
|___________________________period 1_______________________________________________________________|________period 2_______________________  < internal-period
0---------1---------2---------3---------4---------5---------6--------7--------8---------9----------10.......11.......12.........13.........  < Real-time

It is not actually that bad because the ~2x runtime is over 2 periods. But
it is not what I intended... I intended this:

|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww|rrrrrrrrrrrrrrrrrrrrrrrrrrrrsr|TTTTTTTTTT[...]TTTTTTTTTTT|rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr|TTTTTTT
|___________________________period 1_________________________________|_________period 2________________________[...]___________|___period 3____________________|[....    < internal-period
0---------1---------2---------3---------4---------5---------6--------7--------8---------9----------10.......11.[...]16.........17........18........19........20|[....    < Real-time
---------------------------------------------------------------------+---------------------------------------------------------|
                                                                      |                                                         |
                                                                      +30/30 > 30/100, thus new period.                         +new period

At the replenishment time, if the runtime left / period left >
dl_runtime/dl_period, replenish with a new period to avoid adding too much
pressure to CBS/EDF.

One might say: but then the task period is different... or out of sync...
but it is not a problem: look at the "real-time"... the task starts and runs
at the "deadline - runtime...." emulating the "zerolax" (note, I do not like
the term zerolax here... but (Thomas voice:) whatever :-)).

One could say: in the presence of deadline tasks, this timeline will be
different... But that is intentional, as we do not want the fair server to
break DL. But more than that, if one has DL tasks, the FIFO latency
"property" is broken, and they should just disable the defer option....
That is what I mentioned in the log:

 "If the fair server reaches the zerolax time without consuming its
 runtime, the server will be boosted, following CBS rules (thus without
 breaking SCHED_DEADLINE)."

By the rule I meant doing the overload check... I thought it was there
already... but it was not... there was no need for it.

I am working on it... it is a simple change (but I need to test).

-- Daniel
On Tue, 7 Nov 2023 12:35:40 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > I ran my migrate[1] program which I use to test RT migration, and it kicks > > And ran my userspin[2] program. And recorded it with: I forgot to add the [1] and [2] [1] https://rostedt.org/code/migrate.c [2] https://rostedt.org/code/userspin.c -- Steve
What's more interesting, when looking at the userspin task, I see a lot of
this:

 migrate-1153  [003]  1272.988097: sched_switch: migrate:1153 [90] S ==> userspin:1135 [120]
 userspin-1135 [003]  1272.988111: sched_switch: userspin:1135 [120] R ==> migrate:1146 [97]

userspin sneaks in for 14 microseconds

 migrate-1146  [003]  1272.988141: sched_switch: migrate:1146 [97] R ==> migrate:1159 [84]
 migrate-1159  [003]  1272.988159: print: tracing_mark_write: thread 13 iter 15, took lock 15020 in 140726333419648 us
 migrate-1159  [003]  1272.992161: print: tracing_mark_write: thread 13 iter 15, unlock lock 6
 migrate-1159  [003]  1272.992169: print: tracing_mark_write: thread 13 iter 15 sleeping
 migrate-1159  [003]  1272.992177: sched_switch: migrate:1159 [84] S ==> userspin:1135 [120]
 userspin-1135 [003]  1272.992190: sched_switch: userspin:1135 [120] R ==> migrate:1150 [93]

Again for 13 microseconds.

 migrate-1150  [003]  1272.995118: sched_switch: migrate:1150 [93] R ==> migrate:1153 [90]
 migrate-1153  [003]  1272.995129: print: tracing_mark_write: thread 7 iter 15, taking lock 5
 migrate-1153  [003]  1272.995164: print: tracing_mark_write: thread 7 iter 15, took lock 32 in 140726333419648 us
 migrate-1153  [003]  1273.005166: print: tracing_mark_write: thread 7 iter 15, unlock lock 5
 migrate-1153  [003]  1273.005174: print: tracing_mark_write: thread 7 iter 15 sleeping
 migrate-1153  [003]  1273.005183: sched_switch: migrate:1153 [90] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.005204: sched_switch: userspin:1135 [120] R ==> migrate:1159 [84]

For 21 microseconds.
 migrate-1159  [003]  1273.005216: print: tracing_mark_write: thread 13 iter 15, taking lock 7
 migrate-1159  [003]  1273.005271: print: tracing_mark_write: thread 13 iter 15, took lock 53 in 140726333419648 us
 migrate-1159  [003]  1273.009273: print: tracing_mark_write: thread 13 iter 15, unlock lock 7
 migrate-1159  [003]  1273.009281: print: tracing_mark_write: thread 13 iter 15 sleeping
 migrate-1159  [003]  1273.009289: sched_switch: migrate:1159 [84] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.009301: sched_switch: userspin:1135 [120] R ==> migrate:1147 [96]

12 microseconds

 migrate-1147  [003]  1273.012205: sched_switch: migrate:1147 [96] R ==> migrate:1153 [90]
 migrate-1153  [003]  1273.012217: print: tracing_mark_write: thread 7 iter 15, taking lock 6
 migrate-1153  [003]  1273.012228: sched_switch: migrate:1153 [90] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.012242: sched_switch: userspin:1135 [120] R ==> migrate:1146 [97]
 migrate-1146  [003]  1273.014251: sched_switch: migrate:1146 [97] R ==> migrate:1148 [95]

2 milliseconds. (which is probably fine).

 migrate-1148  [003]  1273.020300: print: tracing_mark_write: thread 2 iter 14, unlock lock 2
 migrate-1148  [003]  1273.020302: print: tracing_mark_write: thread 2 iter 14 sleeping
 migrate-1148  [003]  1273.020309: sched_switch: migrate:1148 [95] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.020324: sched_switch: userspin:1135 [120] R ==> migrate:1147 [96]

15 microseconds.

 migrate-1147  [003]  1273.020360: print: tracing_mark_write: thread 1 iter 14, unlock lock 1
 migrate-1147  [003]  1273.020373: print: tracing_mark_write: thread 1 iter 14 sleeping
 migrate-1147  [003]  1273.020381: sched_switch: migrate:1147 [96] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.021397: sched_switch: userspin:1135 [120] R ==> migrate:1147 [96]

1 millisecond.
 migrate-1147  [003]  1273.021402: print: tracing_mark_write: thread 1 iter 14, taking lock 2
 migrate-1147  [003]  1273.021404: print: tracing_mark_write: thread 1 iter 14, took lock 1 in 140726333419648 us
 migrate-1147  [003]  1273.022200: sched_switch: migrate:1147 [96] R ==> migrate:1152 [91]
 migrate-1152  [003]  1273.022206: print: tracing_mark_write: thread 6 iter 15, taking lock 6
 migrate-1152  [003]  1273.022217: sched_switch: migrate:1152 [91] S ==> migrate:1147 [96]
 migrate-1147  [003]  1273.022289: sched_switch: migrate:1147 [96] R ==> migrate:1159 [84]
 migrate-1159  [003]  1273.022299: print: tracing_mark_write: thread 13 iter 16, taking lock 0
 migrate-1159  [003]  1273.022326: print: tracing_mark_write: thread 13 iter 16, took lock 25 in 140726333419648 us
 migrate-1159  [003]  1273.026328: print: tracing_mark_write: thread 13 iter 16, unlock lock 0
 migrate-1159  [003]  1273.026337: print: tracing_mark_write: thread 13 iter 16 sleeping
 migrate-1159  [003]  1273.026346: sched_switch: migrate:1159 [84] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.026359: sched_switch: userspin:1135 [120] R ==> migrate:1146 [97]

13 microseconds, and so on...
 migrate-1146  [003]  1273.027170: sched_switch: migrate:1146 [97] R ==> migrate:1149 [94]
 migrate-1149  [003]  1273.027189: print: tracing_mark_write: thread 3 iter 14, took lock 1927 in 140726333419648 us
 migrate-1149  [003]  1273.027335: sched_switch: migrate:1149 [94] R ==> migrate:1153 [90]
 migrate-1153  [003]  1273.027349: print: tracing_mark_write: thread 7 iter 15, took lock 15130 in 140726333419648 us
 migrate-1153  [003]  1273.037352: print: tracing_mark_write: thread 7 iter 15, unlock lock 6
 migrate-1153  [003]  1273.037362: print: tracing_mark_write: thread 7 iter 15 sleeping
 migrate-1153  [003]  1273.037370: sched_switch: migrate:1153 [90] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.037395: sched_switch: userspin:1135 [120] R ==> migrate:1147 [96]
 migrate-1147  [003]  1273.037406: print: tracing_mark_write: thread 1 iter 14, unlock lock 2
 migrate-1147  [003]  1273.037408: print: tracing_mark_write: thread 1 iter 14 sleeping
 migrate-1147  [003]  1273.037414: sched_switch: migrate:1147 [96] S ==> userspin:1135 [120]
 userspin-1135 [003]  1273.038428: sched_switch: userspin:1135 [120] R ==> migrate:1147 [96]

It looks like it sneaks in when it's about to schedule a new RT task.

Is this expected?

-- Steve
> The code is not doing what I intended because I thought it was doing overload
> control on the replenishment, but it is not (my bad).
>

I am still testing but... it is missing something like this (famous last words).

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1092ca8892e0..6e2d21c47a04 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * runtime, or it just underestimated it during sched_setattr().
  */
 static int start_dl_timer(struct sched_dl_entity *dl_se);
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
+
 static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	/*
 	 * This could be the case for a !-dl task that is boosted.
 	 * Just go with full inherited parameters.
+	 *
+	 * Or, it could be the case of a zerolax reservation that
+	 * was not able to consume its runtime in background and
+	 * reached this point with current u > U.
+	 *
+	 * In both cases, set a new period.
 	 */
-	if (dl_se->dl_deadline == 0)
-		replenish_dl_new_period(dl_se, rq);
+	if (dl_se->dl_deadline == 0 ||
+	    (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
+		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
+		dl_se->runtime = pi_of(dl_se)->dl_runtime;
+	}

 	if (dl_se->dl_yielded && dl_se->runtime > 0)
 		dl_se->runtime = 0;
Just got this too (with the 20% we talked about on IRC).

 migrate-991   6..... 713.996237: print: tracing_mark_write: thread 7 iter 3 sleeping

The above is 991 in userspace writing to trace_marker

 migrate-991   6d..2. 713.996251: bprint: __schedule: Pick userspin:973:120

I added the above printk in the core pick_next_task().

 migrate-991   6d..2. 713.996254: sched_switch: migrate:991 [90] S ==> userspin:973 [120]

We switch to userspin for just 16 microseconds, and notice, NEED_RESCHED is
not set.

 userspin-973  6dN.2. 713.996270: bprint: pick_task_rt: Pick RT migrate:988:93

The above printk is in pick_next_task_rt(), and NEED_RESCHED is now set!

 userspin-973  6dN.2. 713.996271: bprint: __schedule: Pick migrate:988:93
 userspin-973  6d..2. 713.996272: sched_switch: userspin:973 [120] R ==> migrate:988 [93]

I'll add your latest patch and see if that's different.

I'll also test this without any of the patches first.

-- Steve
On Tue, 7 Nov 2023 14:32:16 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> I'll also test this without any of the patches first.
And it still happens, and now I know why :-)
Duh! The program is "migrate" which stress tests how RT tasks migrate
between CPUs when there are more RT tasks running than CPUs to run them on.
This is the push/pull logic in action!
migrate-958 4d..2. 266.971936: sched_switch: migrate:958 [89] S ==> userspin:939 [120]
Task 958 of priority 89 (lower is higher) goes to sleep. There's no RT
tasks on CPU 4 to run, so it runs userspin.
migrate-953 2d.h2. 266.971938: sched_waking: comm=migrate pid=957 prio=90 target_cpu=002
migrate-953 2d..2. 266.971944: sched_switch: migrate:953 [94] R ==> migrate:957 [90]
On CPU 2, task 957 (prio 90) preempts 953 (prio 94).
userspin-939 4d..2. 266.971953: sched_switch: userspin:939 [120] R ==> migrate:953 [94]
Now 953 migrates over to CPU 4 as it's currently the CPU running the lowest
priority task.
There's other cases where another CPU was simply overloaded, and when the
RT task on the CPU with userspin went to sleep, it triggered an IPI to the
overloaded CPU to tell it to push it over here.
All is good. Nothing to see here ;-)
-- Steve
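Steve's explanation above can be sketched outside the kernel. A toy illustration (hypothetical helper name, not the kernel's actual cpupri code) of the push decision: the woken RT task goes to the CPU currently running the lowest-priority work.

```c
#include <assert.h>

/*
 * Toy model of the RT push decision: scan the CPUs and pick the one
 * whose current task has the lowest priority (highest numeric value,
 * since lower number = higher priority), but only if that task is
 * lower priority than the one we want to push.
 */
static int pick_push_cpu(const int *running_prio, int ncpus, int task_prio)
{
	int best_cpu = -1;
	int best_prio = task_prio;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (running_prio[cpu] > best_prio) {
			best_prio = running_prio[cpu];
			best_cpu = cpu;
		}
	}
	return best_cpu; /* -1 means: nowhere to push */
}
```

In the trace above, CPU 4 was running userspin at prio 120, which is why migrate:953 [94] migrated there.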
On Tue, Nov 07, 2023 at 11:47:32AM -0500, Steven Rostedt wrote:
> On Mon, 6 Nov 2023 16:37:32 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > Say CFS-server runtime is 0.3s and period is 1s.
> >
> > At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> > 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> > point which is < the "time till deadline" of 0.005s)
> >
> > Now the runtime of the CFS-server will be replenished to the full 0.3s
> > (due to CBS) and the deadline pushed out.
> >
> > The end result is, the total runtime that the CFS-server actually gets
> > is 0.595s (though yes it did sleep for 5ms in between, still that's
> > tiny -- say if it briefly blocked on a kernel mutex). That's almost
> > double the allocated runtime.
> >
> > This is just theoretical and I have yet to see if it is actually an
> > issue in practice.
>
> Let me see if I understand what you are asking. By pushing the execution of
> the CFS-server to the end of its period, if it was briefly blocked and
> was not able to consume all of its zerolax time, its bandwidth gets
> refreshed. Then it can run again, basically doubling its total time.

I think my assumption about what happens during blocking was wrong. If it
blocked, the server is actually stopped via dl_server_stop() and it starts
all over again on enqueue.

That makes me worry about the opposite issue now. If the server restarts
because it blocked briefly, that means again it starts in a throttled state
and has to wait to run till the zero-lax time. If CFS is a 99% load but
blocks very briefly after getting to run a little bit (totalling 1% of the
time), then it won't get 30% because it will keep getting delayed to the new
0-lax every time it wakes up from its very-brief nap. Is that really Ok?

> But this is basically saying that it ran for its runtime at the start of
> one period and at the beginning of another, right?
I am not sure if this can happen but I could be missing something. AFAICS,
there is no scenario where the DL server gets to run at the start of a new
period unless RT is not running.

The way the patch is written AFAICS, whenever the DL-server runs out of
runtime, it gets throttled and a timer is armed to go off at the beginning
of the next period (update_curr_dl_se() -> dl_runtime_exceeded() ->
start_dl_timer()).

In this timer handler (which fired at the next period's beginning), it will
actually replenish_dl_entity() to refresh the runtime and push the period
forward. Then it will throttle the server till the 0-lax time. That means we
always end up running at the 0-lax time when starting a new period if RT is
running, and never at the beginning. Did I miss something?

On the other hand, if it does not run out of runtime, it will keep running
within its 0-lax time. We know there is enough time within its 0-lax time
for it to run, because when we unthrottled it, we checked for that.

Switching gears, another (most likely theoretical) concern I had is: what if
the 0-lax timer interrupt gets delayed a little bit? Then we will always end
up not having enough 0-lax time and keep requeuing the timer, which means
CFS will always be starved as we keep pushing the execution to the next
period's 0-lax time.

Anyway, I guess I better get to testing this stuff tomorrow and the day
after on ChromeOS before LPC starts. Personally I feel this is a great first
cut and hope we can get v5 into mainline and iteratively improve. :)

thanks,

- Joel
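The throttle/replenish cycle walked through above can be written down as a small timeline computation. A sketch under the assumption that RT runs continuously and the timer fires on time (illustrative names, not kernel symbols): after throttling, the timer fires at the next period boundary, and the deferred server then waits until the zero-laxity instant of that period, i.e. its end minus the runtime.

```c
#include <assert.h>

/*
 * Given a period starting at period_start, the throttled server is
 * replenished at the next boundary (period_start + period) and, with RT
 * still running, only gets to execute at the zero-laxity time of the
 * new period: next boundary plus (period - runtime).
 */
static unsigned long long next_zerolax_time(unsigned long long period_start,
					    unsigned long long period,
					    unsigned long long runtime)
{
	unsigned long long next_start = period_start + period;

	return next_start + (period - runtime);
}
```

With a 3/10 reservation whose first period starts at 0, the next run would begin at 17, right before the boundary at 20 (matching Daniel's timeline later in the thread).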
On Tue, Nov 07, 2023 at 12:58:48PM +0100, Daniel Bristot de Oliveira wrote:
[...]
> >> One more consideration I guess is, because the server is throttled
> >> till 0-laxity time, it is possible that if CFS sleeps even a bit
> >> (after the DL-server is unthrottled), then it will be pushed out to a
> >> full current deadline + period due to CBS. In such a situation, if
> >> CFS-server is the only DL task running, it might starve RT for a bit
> >> more time.
> >>
> >> Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
> >> timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
> >> at 0.295s (its remaining runtime is 0.01s at this point which is < the
> >> "time till deadline" of 0.005s). Now the runtime of the CFS-server
> >> will be replenished to the full 3s (due to CBS) and the deadline
> >> pushed out. The end result is the total runtime that the CFS-server
> >> actually gets is 0.0595s (though yes it did sleep for 5ms in between,
> >> still that's tiny -- say if it briefly blocked on a kernel mutex).
> >
> > Blah, I got lost in decimal points. Here's the example again:
> >
> > Say CFS-server runtime is 0.3s and period is 1s.
> >
> > At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> > 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> > point which is < the "time till deadline" of 0.005s)
> >
> > Now the runtime of the CFS-server will be replenished to the full 0.3s
> > (due to CBS) and the deadline pushed out.
> >
> > The end result is, the total runtime that the CFS-server actually gets
> > is 0.595s (though yes it did sleep for 5ms in between, still that's
> > tiny -- say if it briefly blocked on a kernel mutex). That's almost
> > double the allocated runtime.
>
> I think I got what you mean, and I think I took for granted that we were
> doing overload control on the replenishment, but it seems that we are not..
> > I just got back from a doct appt, I will do a proper reply later today.

Ah ok! Thanks Daniel! And hope the appointment went well.

- Joel
Hi Daniel,

On Tue, Nov 7, 2023 at 1:50 PM Daniel Bristot de Oliveira
<bristot@kernel.org> wrote:
>
> > The code is not doing what I intended because I thought it was doing overload
> > control on the replenishment, but it is not (my bad).
> >
>
> I am still testing but... it is missing something like this (famous last words).
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1092ca8892e0..6e2d21c47a04 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   * runtime, or it just underestimated it during sched_setattr().
>   */
>  static int start_dl_timer(struct sched_dl_entity *dl_se);
> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> +
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	/*
>  	 * This could be the case for a !-dl task that is boosted.
>  	 * Just go with full inherited parameters.
> +	 *
> +	 * Or, it could be the case of a zerolax reservation that
> +	 * was not able to consume its runtime in background and
> +	 * reached this point with current u > U.
> +	 *
> +	 * In both cases, set a new period.
>  	 */
> -	if (dl_se->dl_deadline == 0)
> -		replenish_dl_new_period(dl_se, rq);
> +	if (dl_se->dl_deadline == 0 ||
> +	    (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> +		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> +		dl_se->runtime = pi_of(dl_se)->dl_runtime;
> +	}
>
>  	if (dl_se->dl_yielded && dl_se->runtime > 0)
>  		dl_se->runtime = 0;

I was wondering, does this mean GRUB needs to be enabled? Otherwise I can
see that "runtime / (deadline - t) > dl_runtime / dl_deadline" will be true
almost all the time due to the constraint of executing at the 0-lax time.

Because at the 0-lax time, AFAICS this will be 100% > 30% (say if CFS has a
30% reservation).

And I think even if GRUB is enabled, it is possible other DL tasks may have
reserved bandwidth.

Or is there a subtlety that makes that not possible?

thanks,

- Joel
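For reference, the check Joel quotes can be evaluated without division by cross-multiplying, which is roughly what the kernel's dl_entity_overflow() does (modulo its overflow-avoiding shifts). A sketch, using Joel's 0.3s/1s example in milliseconds:

```c
#include <assert.h>

/*
 * CBS wakeup overflow test: true when the remaining bandwidth
 * runtime / (deadline - t) exceeds the reserved
 * dl_runtime / dl_deadline.  Cross-multiplied to stay in integers.
 */
static int entity_overflows(unsigned long long runtime,
			    unsigned long long deadline,
			    unsigned long long t,
			    unsigned long long dl_runtime,
			    unsigned long long dl_deadline)
{
	return runtime * dl_deadline > (deadline - t) * dl_runtime;
}
```

At the 0-lax instant the remaining laxity equals the full runtime, so a 30% reservation evaluates as 100% > 30% and the check fires, which is exactly Joel's point.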
On 11/8/23 04:20, Joel Fernandes wrote:
> Hi Daniel,
>
> On Tue, Nov 7, 2023 at 1:50 PM Daniel Bristot de Oliveira
> <bristot@kernel.org> wrote:
>>
>>> The code is not doing what I intended because I thought it was doing overload
>>> control on the replenishment, but it is not (my bad).
>>>
>>
>> I am still testing but... it is missing something like this (famous last words).
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 1092ca8892e0..6e2d21c47a04 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>>   * runtime, or it just underestimated it during sched_setattr().
>>   */
>>  static int start_dl_timer(struct sched_dl_entity *dl_se);
>> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
>> +
>>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>  {
>>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>  	/*
>>  	 * This could be the case for a !-dl task that is boosted.
>>  	 * Just go with full inherited parameters.
>> +	 *
>> +	 * Or, it could be the case of a zerolax reservation that
>> +	 * was not able to consume its runtime in background and
>> +	 * reached this point with current u > U.
>> +	 *
>> +	 * In both cases, set a new period.
>>  	 */
>> -	if (dl_se->dl_deadline == 0)
>> -		replenish_dl_new_period(dl_se, rq);
>> +	if (dl_se->dl_deadline == 0 ||
>> +	    (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
>> +		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
>> +		dl_se->runtime = pi_of(dl_se)->dl_runtime;
>> +	}
>>
>>  	if (dl_se->dl_yielded && dl_se->runtime > 0)
>>  		dl_se->runtime = 0;
>
> I was wondering does this mean GRUB needs to be enabled? Otherwise I
> can see that "runtime / (deadline - t) > dl_runtime / dl_deadline"
> will be true almost all the time due to the constraint of executing at
> the 0-lax time.

No grub needed. It will only happen if the fair server did not have any
chance to run.

If it happens, it is not a problem, see the timeline I replied with in the
previous email.

We do not want a zerolax scheduler, because it breaks everything else. It is
a deferred EDF that, looking from the wall clock, composes a "zerolaxish"
timeline.

> Because at the 0-lax time, AFAICS this will be 100% > 30% (say if CFS
> has a 30% reservation).
>
> And I think even if GRUB is enabled, it is possible other DL task may
> have reserved bandwidth.
>
> Or is there a subtlety that makes that not possible?
>
> thanks,
>
> - Joel
On Tue, Nov 07, 2023 at 07:50:28PM +0100, Daniel Bristot de Oliveira wrote:
> > The code is not doing what I intended because I thought it was doing overload
> > control on the replenishment, but it is not (my bad).
> >
>
> I am still testing but... it is missing something like this (famous last words).
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1092ca8892e0..6e2d21c47a04 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   * runtime, or it just underestimated it during sched_setattr().
>   */
>  static int start_dl_timer(struct sched_dl_entity *dl_se);
> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> +
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	/*
>  	 * This could be the case for a !-dl task that is boosted.
>  	 * Just go with full inherited parameters.
> +	 *
> +	 * Or, it could be the case of a zerolax reservation that
> +	 * was not able to consume its runtime in background and
> +	 * reached this point with current u > U.
> +	 *
> +	 * In both cases, set a new period.
>  	 */
> -	if (dl_se->dl_deadline == 0)
> -		replenish_dl_new_period(dl_se, rq);
> +	if (dl_se->dl_deadline == 0 ||
> +	    (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> +		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> +		dl_se->runtime = pi_of(dl_se)->dl_runtime;
> +	}
>
>  	if (dl_se->dl_yielded && dl_se->runtime > 0)
>  		dl_se->runtime = 0;

Should we rather not cap the runtime, something like so?

Because the above also causes period drift, which we do not want.

---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 58b542bf2893..1453a2cd0680 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  */
 static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 {
+	struct sched_dl_entity *pi_se = pi_of(dl_se);
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	u64 dl_runtime = pi_se->dl_runtime;

-	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
+	WARN_ON_ONCE(dl_runtime <= 0);

 	/*
 	 * This could be the case for a !-dl task that is boosted.
@@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += pi_of(dl_se)->dl_period;
-		dl_se->runtime += pi_of(dl_se)->dl_runtime;
+		dl_se->deadline += pi_se->dl_period;
+		dl_se->runtime += dl_runtime;
 	}

+	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
+		dl_se->runtime = dl_runtime;
+
 	/*
 	 * At this point, the deadline really should be "in
 	 * the future" with respect to rq->clock. If it's
On Wed, Nov 08, 2023 at 01:44:01PM +0100, Peter Zijlstra wrote:
> Should we rather not cap the runtime, something like so?
>

Clearly I should've done the patch against a tree that includes the changes...

> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 58b542bf2893..1453a2cd0680 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   */
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
> +	struct sched_dl_entity *pi_se = pi_of(dl_se);
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
> +	u64 dl_runtime = pi_se->dl_runtime;
>
> -	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
> +	WARN_ON_ONCE(dl_runtime <= 0);
>
>  	/*
>  	 * This could be the case for a !-dl task that is boosted.
> @@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	 * arbitrary large.
>  	 */
>  	while (dl_se->runtime <= 0) {
> -		dl_se->deadline += pi_of(dl_se)->dl_period;
> -		dl_se->runtime += pi_of(dl_se)->dl_runtime;
> +		dl_se->deadline += pi_se->dl_period;
> +		dl_se->runtime += dl_runtime;
>  	}
>
> +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
> +		dl_se->runtime = dl_runtime;
> +

This should ofcourse go in the if (dl_se->dl_zerolax_armed) branch a
little down from here.

>  	/*
>  	 * At this point, the deadline really should be "in
>  	 * the future" with respect to rq->clock. If it's
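The while loop being discussed in Peter's patch is self-contained enough to lift out and exercise on its own. A standalone sketch (plain C, not kernel code) of how a large overrun is amortized one period at a time, before any zerolax capping:

```c
#include <assert.h>

struct repl_state {
	long long runtime;            /* may be very negative after an overrun */
	unsigned long long deadline;
};

/*
 * The CBS replenishment loop from replenish_dl_entity(): push the
 * deadline out one period at a time, adding one runtime quantum per
 * period, until the runtime is positive again.
 */
static struct repl_state replenish(struct repl_state s,
				   long long dl_runtime,
				   unsigned long long dl_period)
{
	while (s.runtime <= 0) {
		s.deadline += dl_period;
		s.runtime += dl_runtime;
	}
	return s;
}
```

For a 3/10 reservation that overran by 5 units, two iterations are needed: the deadline moves out two periods and the entity restarts with 1 unit of runtime.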
On 11/8/23 13:44, Peter Zijlstra wrote:
> Because the above also causes period drift, which we do not want.
The period drift is not a problem when we do not have DL tasks because
... we do not have dl tasks. The task will run for runtime.
But not doing the period drift is bad if we have DL tasks because we
break the (current u <= U) rule... which breaks CBS/EDF.
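The "u <= U" rule Daniel refers to is the usual deadline-class admission test: the summed utilization of all reservations must stay within the available bandwidth. A minimal integer sketch for two reservations against a 100% cap (illustrative only; the kernel tracks this per root domain with scaled fixed-point arithmetic):

```c
#include <assert.h>

/*
 * Admission sketch: rt1/p1 + rt2/p2 <= 1, cross-multiplied so no
 * floating point is needed:  rt1*p2 + rt2*p1 <= p1*p2.
 */
static int admits(unsigned long long rt1, unsigned long long p1,
		  unsigned long long rt2, unsigned long long p2)
{
	return rt1 * p2 + rt2 * p1 <= p1 * p2;
}
```

This is the invariant the period drift preserves: by pushing the period instead of handing out extra runtime, no reservation ever consumes more than its admitted share while DL tasks are present.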
On 11/8/23 13:44, Peter Zijlstra wrote:
> Should we rather not cap the runtime, something like so?
>
> Because the above also causes period drift, which we do not want.

like in the example I showed before:

 - 3/10 reservation (30%).
 - w=waiting
 - r=running
 - s=sleeping
 - T=throttled
 - fair server dispatched at 0, starvation from RT.

|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww|rrrrrrrrrrrrrrrrrrrrrrrrrrrrrr|TTTTTTTTTT[...]TTTTTTTTTTT|rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr|TTTTTTT
|___________________________period 1_________________________________|_________period 2________________________[...]___________|___period 3____________________|[....
internal-period 0---------1---------2---------3---------4---------5---------6--------7--------8---------9----------10.......11.[...]16.........17........18........19........20|[....
< Real-time ---------------------------------------------------------------------+---------------------------------------------------------|
                                                                                 |
                                                                                 +new period

From the "real-world/wall clock" view, the internal period shift produces
the "zerolax" timeline: it runs 3 units of time before each multiple of 10.

If one has a mix of DL and FIFO tasks and wants to enforce a given response
time for the fair server, they can reduce the fair server period to achieve
that.

-- Daniel
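The timeline above can also be reproduced numerically. A sketch (assumed arithmetic, not kernel code) computing the wall-clock run window of the fully deferred 3/10 server in period k, when RT starves CFS the whole time: armed at the zero-laxity instant, the server runs its runtime right up against each period boundary.

```c
#include <assert.h>

/*
 * Wall-clock run window of a fully deferred server in period k:
 * it starts at the zero-laxity instant (boundary - runtime) and
 * runs until the period boundary.
 */
static void deferred_window(unsigned long long k,
			    unsigned long long runtime,
			    unsigned long long period,
			    unsigned long long *start,
			    unsigned long long *end)
{
	*end = (k + 1) * period;
	*start = *end - runtime;
}
```

For the 3/10 example this yields runs over [7, 10) and [17, 20), the `r` stretches in Daniel's diagram.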
On 11/8/23 13:50, Peter Zijlstra wrote:
>> ---
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 58b542bf2893..1453a2cd0680 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>>   */
>>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>  {

assuming starting rt, 3/10 params: it arrives here with:

	runtime = 3
	laxity = 10 - 7 = 3
	u = 1

>> +	struct sched_dl_entity *pi_se = pi_of(dl_se);
>>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>>  	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +	u64 dl_runtime = pi_se->dl_runtime;
>>
>> -	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
>> +	WARN_ON_ONCE(dl_runtime <= 0);
>>
>>  	/*
>>  	 * This could be the case for a !-dl task that is boosted.
>> @@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>  	 * arbitrary large.
>>  	 */

skip the while because runtime = 3 > 0

>>  	while (dl_se->runtime <= 0) {
>> -		dl_se->deadline += pi_of(dl_se)->dl_period;
>> -		dl_se->runtime += pi_of(dl_se)->dl_runtime;
>> +		dl_se->deadline += pi_se->dl_period;
>> +		dl_se->runtime += dl_runtime;
>>  	}

runtime is already = dl_runtime...

>> +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
>> +		dl_se->runtime = dl_runtime;
>> +

There is a way to cap it: it is doing the revised wakeup rule... the runtime
will become 1. That is not what we want... and we would have to keep arming
the server... while shifting the (internal) period puts the scheduler in the
regular case :-)

Externally, e.g., for the user with the mouse on their laptop, it composes
the "zerolax" timeline... :-) i.e., after at most 7, they get 3, before 10.

It is simpler... and breaking the U thing is breaking GRUB, admission
control... and so on... by default - not in a DL overload scenario... it is
by default :-/.

> This should ofcourse go in the if (dl_se->dl_zerolax_armed) branch a
> little down from here.
Hi Peter,

On 08/11/23 13:44, Peter Zijlstra wrote:
> On Tue, Nov 07, 2023 at 07:50:28PM +0100, Daniel Bristot de Oliveira wrote:
> > > The code is not doing what I intended because I thought it was doing overload
> > > control on the replenishment, but it is not (my bad).
> > >
> >
> > I am still testing but... it is missing something like this (famous last words).
> >
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 1092ca8892e0..6e2d21c47a04 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> >   * runtime, or it just underestimated it during sched_setattr().
> >   */
> >  static int start_dl_timer(struct sched_dl_entity *dl_se);
> > +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> > +
> >  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >  {
> >  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >  	/*
> >  	 * This could be the case for a !-dl task that is boosted.
> >  	 * Just go with full inherited parameters.
> > +	 *
> > +	 * Or, it could be the case of a zerolax reservation that
> > +	 * was not able to consume its runtime in background and
> > +	 * reached this point with current u > U.
> > +	 *
> > +	 * In both cases, set a new period.
> >  	 */
> > -	if (dl_se->dl_deadline == 0)
> > -		replenish_dl_new_period(dl_se, rq);
> > +	if (dl_se->dl_deadline == 0 ||
> > +	    (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> > +		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> > +		dl_se->runtime = pi_of(dl_se)->dl_runtime;
> > +	}
> >
> >  	if (dl_se->dl_yielded && dl_se->runtime > 0)
> >  		dl_se->runtime = 0;
>
> Should we rather not cap the runtime, something like so?
>
> Because the above also causes period drift, which we do not want.

I was honestly also concerned with the drift, but then thought it might not
be an issue for the dl_server (zerolax), as it doesn't have a userspace
counterpart that relies on synchronized clocks?

>
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 58b542bf2893..1453a2cd0680 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   */
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
> +	struct sched_dl_entity *pi_se = pi_of(dl_se);
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
> +	u64 dl_runtime = pi_se->dl_runtime;
>
> -	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
> +	WARN_ON_ONCE(dl_runtime <= 0);
>
>  	/*
>  	 * This could be the case for a !-dl task that is boosted.
> @@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	 * arbitrary large.
>  	 */
>  	while (dl_se->runtime <= 0) {
> -		dl_se->deadline += pi_of(dl_se)->dl_period;
> -		dl_se->runtime += pi_of(dl_se)->dl_runtime;
> +		dl_se->deadline += pi_se->dl_period;
> +		dl_se->runtime += dl_runtime;
>  	}
>
> +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
> +		dl_se->runtime = dl_runtime;
> +

Anyway, I have the impression that this breaks EDF/CBS, as we are letting
the dl_server run with full dl_runtime w/o postponing the period
(essentially an u = 1 reservation until runtime is depleted).

I would say we need to either do

	dl_se->deadline += pi_of(dl_se)->dl_period;
	dl_se->runtime = pi_of(dl_se)->dl_runtime;

or (as Daniel proposed)

	dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
	dl_se->runtime = pi_of(dl_se)->dl_runtime;

and I seem to be inclined towards the latter, as the former would
essentially reduce dl_server bandwidth under dl_runtime/dl_period at times.

Best,
Juri
On Wed, Nov 08, 2023 at 04:14:18PM +0100, Juri Lelli wrote:
> > +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
> > +		dl_se->runtime = dl_runtime;
> > +
>
> Anyway, I have the impression that this breaks EDF/CBS, as we are letting
> the dl_server run with full dl_runtime w/o postponing the period
> (essentially an u = 1 reservation until runtime is depleted).

Yeah, I sorted it with Daniel, we were not trying to fix the same problem :-)
On Wed, Nov 08, 2023 at 09:01:17AM +0100, Daniel Bristot de Oliveira wrote:
> On 11/8/23 04:20, Joel Fernandes wrote:
> > Hi Daniel,
> >
> > On Tue, Nov 7, 2023 at 1:50 PM Daniel Bristot de Oliveira
> > <bristot@kernel.org> wrote:
> >>
> >>> The code is not doing what I intended because I thought it was doing overload
> >>> control on the replenishment, but it is not (my bad).
> >>>
> >>
> >> I am still testing but... it is missing something like this (famous last words).
> >>
> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >> index 1092ca8892e0..6e2d21c47a04 100644
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> >>   * runtime, or it just underestimated it during sched_setattr().
> >>   */
> >>  static int start_dl_timer(struct sched_dl_entity *dl_se);
> >> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> >> +
> >>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >>  {
> >>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >>  	/*
> >>  	 * This could be the case for a !-dl task that is boosted.
> >>  	 * Just go with full inherited parameters.
> >> +	 *
> >> +	 * Or, it could be the case of a zerolax reservation that
> >> +	 * was not able to consume its runtime in background and
> >> +	 * reached this point with current u > U.
> >> +	 *
> >> +	 * In both cases, set a new period.
> >>  	 */
> >> -	if (dl_se->dl_deadline == 0)
> >> -		replenish_dl_new_period(dl_se, rq);
> >> +	if (dl_se->dl_deadline == 0 ||
> >> +	    (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> >> +		dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> >> +		dl_se->runtime = pi_of(dl_se)->dl_runtime;
> >> +	}
> >>
> >>  	if (dl_se->dl_yielded && dl_se->runtime > 0)
> >>  		dl_se->runtime = 0;
> >
> > I was wondering does this mean GRUB needs to be enabled? Otherwise I
> > can see that "runtime / (deadline - t) > dl_runtime / dl_deadline"
> > will be true almost all the time due to the constraint of executing at
> > the 0-lax time.
>
> No grub needed. It will only happen if the fair server did not have any chance to run.
>
> If it happens, it is not a problem, see that timeline I replied in the previous
> email.

Ah you're right, I mistakenly read your diff assuming you were calling
replenish_dl_new_period() on dl_entity_overflow(). Indeed the diff is needed
(I was actually wondering why that was not done in my initial review as
well -- so it's good we found it in discussion).

> We do not want a zerolax scheduler, because it breaks everything else. It is
> a deferred EDF, that looking from wall clock, composes an "zerolaxish" timeline.

Indeed. I was not intending that we do a zerolax scheduler, I was merely
misreading the diff, assuming you were throttling the DL-server once again
at the zerolax time.

thanks,

- Joel
Hello,

kernel test robot noticed "WARNING:at_kernel/sched/deadline.c:#enqueue_dl_entity" on:

commit: dea46af8e193ed4f23c37123bfd4a825399aedfe ("[PATCH v5 6/7] sched/deadline: Deferrable dl server")
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 984ffb6a4366752c949f7b39640aecdce222607f
patch link: https://lore.kernel.org/all/c7b706d30d6316c52853ca056db5beb82ba72863.1699095159.git.bristot@kernel.org/
patch subject: [PATCH v5 6/7] sched/deadline: Deferrable dl server

in testcase: trinity
version: trinity-i386-abe9de86-1_20230429
with following parameters:

	runtime: 600s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/

compiler: gcc-9
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)

the issue does not always happen in our tests (show 9 times out of 20 runs),
but keeps clean on parent.

6f69498ee58c052e dea46af8e193ed4f23c37123bfd
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :20          45%           9:20   dmesg.RIP:enqueue_dl_entity
           :20          45%           9:20   dmesg.WARNING:at_kernel/sched/deadline.c:#enqueue_dl_entity

If you fix the issue in a separate patch/commit (i.e.
not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311132217.2a9a4aac-oliver.sang@intel.com

[ 59.623267][ C0] ------------[ cut here ]------------
[ 59.627229][ C0] WARNING: CPU: 0 PID: 1 at kernel/sched/deadline.c:1803 enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1))
[ 59.627229][ C0] Modules linked in:
[ 59.627229][ C0] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G T 6.6.0-rc7-00090-gdea46af8e193 #1
[ 59.627229][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 59.627229][ C0] RIP: 0010:enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1))
[ 59.627229][ C0] Code: 8e 74 ed ff ff 45 84 f6 0f 85 fd 08 00 00 48 8d b5 08 d8 ff ff 48 8d 95 c8 ee ff ff 4c 89 ff e8 40 f8 01 00 e9 50 ed ff ff 90 <0f> 0b 90 e9 9b ec ff ff 48 8d bd 1c d8 ff ff 48 c7 c3 40 fa 1f 00

All code
========
   0:	8e 74 ed ff         	mov    -0x1(%rbp,%rbp,8),%?
   4:	ff 45 84            	incl   -0x7c(%rbp)
   7:	f6 0f 85            	testb  $0x85,(%rdi)
   a:	fd                  	std
   b:	08 00               	or     %al,(%rax)
   d:	00 48 8d            	add    %cl,-0x73(%rax)
  10:	b5 08               	mov    $0x8,%ch
  12:	d8 ff               	fdivr  %st(7),%st
  14:	ff 48 8d            	decl   -0x73(%rax)
  17:	95                  	xchg   %eax,%ebp
  18:	c8 ee ff ff         	enterq $0xffee,$0xff
  1c:	4c 89 ff            	mov    %r15,%rdi
  1f:	e8 40 f8 01 00      	callq  0x1f864
  24:	e9 50 ed ff ff      	jmpq   0xffffffffffffed79
  29:	90                  	nop
  2a:*	0f 0b               	ud2	<-- trapping instruction
  2c:	90                  	nop
  2d:	e9 9b ec ff ff      	jmpq   0xffffffffffffeccd
  32:	48 8d bd 1c d8 ff ff	lea    -0x27e4(%rbp),%rdi
  39:	48 c7 c3 40 fa 1f 00	mov    $0x1ffa40,%rbx

Code starting with the faulting instruction
===========================================
   0:	0f 0b               	ud2
   2:	90                  	nop
   3:	e9 9b ec ff ff      	jmpq   0xffffffffffffeca3
   8:	48 8d bd 1c d8 ff ff	lea    -0x27e4(%rbp),%rdi
   f:	48 c7 c3 40 fa 1f 00	mov    $0x1ffa40,%rbx

[ 59.627229][ C0] RSP: 0000:ffffc90000007d28 EFLAGS: 00010092
[ 59.627229][ C0] RAX: dffffc0000000000 RBX: ffff8883aec00418 RCX: 1ffffffff096d168
[ 59.627229][ C0] RDX: 1ffff11075d80078 RSI: 0000000000000020 RDI: ffff8883aec003c0
[ 59.627229][ C0] RBP: ffff8883aec003c0 R08: ffff8883aec004f0 R09: ffff8883aec00500
[ 59.627229][ C0] R10: 0000000000000000 R11: ffffffff873fba8f R12: ffff8883aec00414
[ 59.627229][ C0] R13: 0000000000000020 R14: ffff8883aebffa58 R15: ffff8883aec003c0
[ 59.627229][ C0] FS: 0000000000000000(0000) GS:ffff8883aea00000(0000) knlGS:0000000000000000
[ 59.627229][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 59.627229][ C0] CR2: ffff88843ffff000 CR3: 0000000004e70000 CR4: 00000000000006b0
[ 59.627229][ C0] Call Trace:
[ 59.627229][ C0]  <IRQ>
[ 59.627229][ C0]  ? __warn (kernel/panic.c:673)
[ 59.627229][ C0]  ? enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1))
[ 59.627229][ C0]  ? report_bug (lib/bug.c:201 lib/bug.c:219)
[ 59.627229][ C0]  ? handle_bug (arch/x86/kernel/traps.c:237)
[ 59.627229][ C0]  ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1))
[ 59.627229][ C0]  ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 59.627229][ C0]  ? enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1))
[ 59.627229][ C0]  ? update_rq_clock (kernel/sched/core.c:765 kernel/sched/core.c:750)
[ 59.627229][ C0]  ? kvm_sched_clock_read (arch/x86/kernel/kvmclock.c:91)
[ 59.627229][ C0]  ? sched_clock_tick (kernel/sched/clock.c:270 kernel/sched/clock.c:426 kernel/sched/clock.c:412)
[ 59.627229][ C0]  dl_task_timer (kernel/sched/deadline.c:1193)
[ 59.627229][ C0]  ? pick_task_dl (kernel/sched/deadline.c:1174)
[ 59.627229][ C0]  __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752)
[ 59.627229][ C0]  ? enqueue_hrtimer (kernel/time/hrtimer.c:1722)
[ 59.627229][ C0]  ? kvm_clock_read (arch/x86/include/asm/preempt.h:95 arch/x86/kernel/kvmclock.c:80)
[ 59.627229][ C0]  ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:292 (discriminator 4) kernel/time/timekeeping.c:388 (discriminator 4) kernel/time/timekeeping.c:2320 (discriminator 4))
[ 59.627229][ C0]  hrtimer_interrupt (kernel/time/hrtimer.c:1817)
[ 59.627229][ C0]  __sysvec_apic_timer_interrupt (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:444 include/linux/jump_label.h:260 include/linux/jump_label.h:270 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1081)
[ 59.627229][ C0]  sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
[ 59.627229][ C0]  </IRQ>
[ 59.627229][ C0]  <TASK>
[ 59.627229][ C0]  asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645)
[ 59.627229][ C0] RIP: 0010:_raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
[ 59.627229][ C0] Code: 83 c7 18 e8 ca 80 74 fd 48 89 ef e8 82 18 75 fd 81 e3 00 02 00 00 75 25 9c 58 f6 c4 02 75 2d 48 85 db 74 01 fb bf 01 00 00 00 <e8> 93 55 69 fd 65 8b 05 54 1a 66 7c 85 c0 74 0a 5b 5d c3 e8 30 63

All code
========
   0:	83 c7 18            	add    $0x18,%edi
   3:	e8 ca 80 74 fd      	callq  0xfffffffffd7480d2
   8:	48 89 ef            	mov    %rbp,%rdi
   b:	e8 82 18 75 fd      	callq  0xfffffffffd751892
  10:	81 e3 00 02 00 00   	and    $0x200,%ebx
  16:	75 25               	jne    0x3d
  18:	9c                  	pushfq
  19:	58                  	pop    %rax
  1a:	f6 c4 02            	test   $0x2,%ah
  1d:	75 2d               	jne    0x4c
  1f:	48 85 db            	test   %rbx,%rbx
  22:	74 01               	je     0x25
  24:	fb                  	sti
  25:	bf 01 00 00 00      	mov    $0x1,%edi
  2a:*	e8 93 55 69 fd      	callq  0xfffffffffd6955c2	<-- trapping instruction
  2f:	65 8b 05 54 1a 66 7c	mov    %gs:0x7c661a54(%rip),%eax	# 0x7c661a8a
  36:	85 c0               	test   %eax,%eax
  38:	74 0a               	je     0x44
  3a:	5b                  	pop    %rbx
  3b:	5d                  	pop    %rbp
  3c:	c3                  	retq
  3d:	e8                  	.byte 0xe8
  3e:	30                  	.byte 0x30
  3f:	63                  	.byte 0x63

Code starting with the faulting instruction
===========================================
   0:	e8 93 55 69 fd      	callq  0xfffffffffd695598
   5:	65 8b 05 54 1a 66 7c	mov    %gs:0x7c661a54(%rip),%eax	# 0x7c661a60
   c:	85 c0               	test   %eax,%eax
   e:	74 0a               	je     0x1a
  10:	5b                  	pop    %rbx
  11:	5d                  	pop    %rbp
  12:	c3                  	retq
  13:	e8                  	.byte 0xe8
  14:	30                  	.byte 0x30
  15:	63                  	.byte 0x63

[ 59.627229][ C0] RSP: 0000:ffffc9000001fbb8 EFLAGS: 00000206
[ 59.627229][ C0] RAX: 0000000000000006 RBX: 0000000000000200 RCX: ffffffff812e2631
[ 59.627229][ C0] RDX: 0000000000000000 RSI: ffffffff83eaa940 RDI: 0000000000000001
[ 59.627229][ C0] RBP: ffff88812d07ed00 R08: 0000000000000001 R09: fffffbfff0e7f757
[ 59.627229][ C0] R10: fffffbfff0e7f756 R11: ffffffff873fbab7 R12: 0000000000000000
[ 59.627229][ C0] R13: 0000000000000246 R14: ffff888195a701a8 R15: ffff88812cd23350
[ 59.627229][ C0]  ? mark_lock (arch/x86/include/asm/bitops.h:228 (discriminator 3) arch/x86/include/asm/bitops.h:240 (discriminator 3) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 3) kernel/locking/lockdep.c:228 (discriminator 3) kernel/locking/lockdep.c:4655 (discriminator 3))
[ 59.627229][ C0]  dma_fence_signal (drivers/dma-buf/dma-fence.c:327 drivers/dma-buf/dma-fence.c:476)
[ 59.627229][ C0]  wait_backward (drivers/dma-buf/st-dma-fence-chain.c:621)
[ 59.627229][ C0]  ? find_gap (drivers/dma-buf/st-dma-fence-chain.c:603)
[ 59.627229][ C0]  ? __cond_resched (kernel/sched/core.c:8521)
[ 59.627229][ C0]  __subtests (drivers/dma-buf/selftest.c:106 (discriminator 1))
[ 59.627229][ C0]  ? kmem_cache_open (mm/slub.c:2479 mm/slub.c:4232 mm/slub.c:4560)
[ 59.627229][ C0]  ? __sanitycheck__ (drivers/dma-buf/selftest.c:92)
[ 59.627229][ C0]  ? kmem_cache_create_usercopy (mm/slab_common.c:351)
[ 59.627229][ C0]  dma_fence_chain (drivers/dma-buf/st-dma-fence-chain.c:708)
[ 59.627229][ C0]  st_init (drivers/dma-buf/selftest.c:141 drivers/dma-buf/selftest.c:155)
[ 59.627229][ C0]  ? udmabuf_dev_init (drivers/dma-buf/selftest.c:154)
[ 59.627229][ C0]  do_one_initcall (init/main.c:1232)
[ 59.627229][ C0]  ? trace_event_raw_event_initcall_level (init/main.c:1223)
[ 59.627229][ C0]  ? parameq (kernel/params.c:171)
[ 59.627229][ C0]  ? strcpy (lib/string.c:83 (discriminator 1))
[ 59.627229][ C0]  kernel_init_freeable (init/main.c:1293 init/main.c:1310 init/main.c:1329 init/main.c:1547)
[ 59.627229][ C0]  ? finish_task_switch (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/sched/sched.h:1390 kernel/sched/core.c:5129 kernel/sched/core.c:5247)
[ 59.627229][ C0]  ? rest_init (init/main.c:1429)
[ 59.627229][ C0]  kernel_init (init/main.c:1439)
[ 59.627229][ C0]  ? _raw_spin_unlock_irq (arch/x86/include/asm/preempt.h:104 include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:202)
[ 59.627229][ C0]  ret_from_fork (arch/x86/kernel/process.c:153)
[ 59.627229][ C0]  ?
rest_init (init/main.c:1429) [ 59.627229][ C0] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) [ 59.627229][ C0] </TASK> [ 59.627229][ C0] irq event stamp: 842784 [ 59.627229][ C0] hardirqs last enabled at (842783): irqentry_exit (kernel/entry/common.c:436) [ 59.627229][ C0] hardirqs last disabled at (842784): sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074) [ 59.627229][ C0] softirqs last enabled at (842772): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) [ 59.627229][ C0] softirqs last disabled at (842761): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:622 kernel/softirq.c:644) [ 59.627229][ C0] ---[ end trace 0000000000000000 ]--- The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20231113/202311132217.2a9a4aac-oliver.sang@intel.com
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5ac1f252e136..56e53e6fd5a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -660,6 +660,8 @@ struct sched_dl_entity {
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
 	unsigned int			dl_server         : 1;
+	unsigned int			dl_zerolax	  : 1;
+	unsigned int			dl_zerolax_armed  : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1d7b96ca9011..69ee1fbd60e4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
 	/* for non-boosted task, pi_of(dl_se) == dl_se */
 	dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
 	dl_se->runtime = pi_of(dl_se)->dl_runtime;
+
+	/*
+	 * If it is a zerolax reservation, throttle it.
+	 */
+	if (dl_se->dl_zerolax) {
+		dl_se->dl_throttled = 1;
+		dl_se->dl_zerolax_armed = 1;
+	}
 }
 
 /*
@@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * could happen are, typically, a entity voluntarily trying to overcome its
  * runtime, or it just underestimated it during sched_setattr().
  */
+static int start_dl_timer(struct sched_dl_entity *dl_se);
 static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 		dl_se->dl_yielded = 0;
 	if (dl_se->dl_throttled)
 		dl_se->dl_throttled = 0;
+
+	/*
+	 * If this is the replenishment of a zerolax reservation,
+	 * clear the flag and return.
+	 */
+	if (dl_se->dl_zerolax_armed) {
+		dl_se->dl_zerolax_armed = 0;
+		return;
+	}
+
+	/*
+	 * At this point, if the zerolax server is not armed, and the deadline
+	 * is in the future, throttle the server and arm the zerolax timer.
+	 */
+	if (dl_se->dl_zerolax &&
+	    dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
+		if (!is_dl_boosted(dl_se)) {
+			dl_se->dl_zerolax_armed = 1;
+			dl_se->dl_throttled = 1;
+			start_dl_timer(dl_se);
+		}
+	}
 }
 
 /*
@@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 		}
 
 		replenish_dl_new_period(dl_se, rq);
+	} else if (dl_server(dl_se) && dl_se->dl_zerolax) {
+		/*
+		 * The server can still use its previous deadline, so throttle
+		 * and arm the zero-laxity timer.
+		 */
+		dl_se->dl_zerolax_armed = 1;
+		dl_se->dl_throttled = 1;
 	}
 }
 
@@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
 	 * hrtimer's time base reading.
+	 *
+	 * The zerolax reservation will have its timer set to the
+	 * deadline - runtime. At that point, the CBS rule will decide
+	 * if the current deadline can be used, or if a replenishment
+	 * is required to avoid adding too much pressure on the system
+	 * (current u > U).
 	 */
-	act = ns_to_ktime(dl_next_period(dl_se));
+	if (dl_se->dl_zerolax_armed) {
+		WARN_ON_ONCE(!dl_se->dl_throttled);
+		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+	} else {
+		act = ns_to_ktime(dl_next_period(dl_se));
+	}
+
 	now = hrtimer_cb_get_time(timer);
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
@@ -1333,6 +1383,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		return;
 	}
 
+	if (dl_server(dl_se) && dl_se->dl_throttled && !dl_se->dl_zerolax)
+		return;
+
 	if (dl_entity_is_special(dl_se))
 		return;
 
@@ -1356,6 +1409,39 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 
 	dl_se->runtime -= scaled_delta_exec;
 
+	/*
+	 * The fair server can consume its runtime while throttled (not queued).
+	 *
+	 * If the server consumes its entire runtime in this state, the server
+	 * is not required for the current period. Thus, reset the server by
+	 * starting a new period, pushing the activation to the zero-lax time.
+	 */
+	if (dl_se->dl_zerolax && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
+		s64 runtime_diff = dl_se->runtime + dl_se->dl_runtime;
+
+		/*
+		 * If this is a regular throttling case, let it run negative until
+		 * the dl_runtime - runtime > 0. The reason is that the next
+		 * replenishment will result in a positive runtime one period ahead.
+		 *
+		 * Otherwise, the deadline will be pushed more than one period, not
+		 * providing runtime/period anymore.
+		 *
+		 * If the dl_runtime - runtime < 0, then the server was able to get
+		 * the runtime/period before the replenishment. So it is safe
+		 * to start a new deferred period.
+		 */
+		if (!dl_se->dl_zerolax_armed && runtime_diff > 0)
+			return;
+
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+
+		replenish_dl_new_period(dl_se, dl_se->rq);
+		start_dl_timer(dl_se);
+
+		return;
+	}
+
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
 		dl_se->dl_throttled = 1;
@@ -1432,6 +1518,9 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+	hrtimer_try_to_cancel(&dl_se->dl_timer);
+	dl_se->dl_zerolax_armed = 0;
+	dl_se->dl_throttled = 0;
 }
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
@@ -1743,7 +1832,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * be counted in the active utilization; hence, we need to call
 	 * add_running_bw().
 	 */
-	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+	if (!dl_se->dl_zerolax && dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
 		if (flags & ENQUEUE_WAKEUP)
 			task_contending(dl_se, flags);
 
@@ -1765,6 +1854,13 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 		setup_new_dl_entity(dl_se);
 	}
 
+	/*
+	 * If we are still throttled, e.g. we got replenished but are a
+	 * zero-laxity task and still got to wait, don't enqueue.
+	 */
+	if (dl_se->dl_throttled && start_dl_timer(dl_se))
+		return;
+
 	__enqueue_dl_entity(dl_se);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b15f7f376a67..399237cd9f59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		account_group_exec_runtime(curtask, delta_exec);
 		if (curtask->server)
 			dl_server_update(curtask->server, delta_exec);
+		else
+			dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
@@ -8421,6 +8423,7 @@ void fair_server_init(struct rq *rq)
 	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
 	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
 	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+	dl_se->dl_zerolax = 1;
 
 	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
 }