| Message ID | 20231020014031.919742-3-joel@joelfernandes.org |
|---|---|
| State | New |
| Headers | (see below) |
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
To: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>
Cc: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>, Suleiman Souhlal <suleiman@google.com>, Frederic Weisbecker <frederic@kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Joel Fernandes <joel@joelfernandes.org>
Subject: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
Date: Fri, 20 Oct 2023 01:40:28 +0000
Message-ID: <20231020014031.919742-3-joel@joelfernandes.org>
In-Reply-To: <20231020014031.919742-1-joel@joelfernandes.org>
References: <20231020014031.919742-1-joel@joelfernandes.org>
Series: [1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2)
Commit Message
Joel Fernandes
Oct. 20, 2023, 1:40 a.m. UTC
From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>

When newidle balancing triggers, we see that it constantly clobbers
rq->next_balance even when there is no newidle balance happening due to
the cost estimates. Due to this, we see that periodic load balance
(rebalance_domains) may trigger far more often when the CPU is going in
and out of idle at a high rate but is not really idle. Repeatedly
triggering load balance there is a bad idea, as it is a heavy operation.
It also increases softirq activity.

Another issue is that ->last_balance is not updated after newidle
balance, causing mistakes in the ->next_balance calculations.

Fix by updating last_balance when a newidle load balance actually
happens, and then updating next_balance. This is also how it is done in
the other load balance paths.

Testing shows a significant drop in softirqs when running:

    cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q

The count goes from ~6k to ~800.

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)
Comments
* Joel Fernandes (Google) <joel@joelfernandes.org> wrote:

> From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
>
> When newidle balancing triggers, we see that it constantly clobbers
> rq->next_balance even when there is no newidle balance happening due to
> the cost estimates. Due to this, we see that periodic load balance
> (rebalance_domains) may trigger way more often when the CPU is going in
> and out of idle at a high rate but is no really idle. Repeatedly
> triggering load balance there is a bad idea as it is a heavy operation.
> It also causes increases in softirq.
>
> Another issue is ->last_balance is not updated after newidle balance
> causing mistakes in the ->next_balance calculations.
>
> Fix by updating last_balance when a newidle load balance actually happens
> and then updating next_balance. This is also how it is done in other load
> balance paths.
>
> Testing shows a significant drop in softirqs when running:
> cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q
>
> Goes from ~6k to ~800.
>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e276d12c3cb..b147ad09126a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> -
> -		if (sd)
> -			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
> -
>  		goto out;
>  	}
>  	rcu_read_unlock();
> @@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		int continue_balancing = 1;
>  		u64 domain_cost;
>
> -		update_next_balance(sd, &next_balance);
> -
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>
> @@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		t1 = sched_clock_cpu(this_cpu);
>  		domain_cost = t1 - t0;
>  		update_newidle_cost(sd, domain_cost);
> +		sd->last_balance = jiffies;
> +		update_next_balance(sd, &next_balance);
>
>  		curr_cost += domain_cost;
>  		t0 = t1;

Okay, I'm applying patches #2 and #3, without #1: it should be safe
out of order, but let me know if I missed something ...

Thanks,

	Ingo
On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
>
> When newidle balancing triggers, we see that it constantly clobbers
> rq->next_balance even when there is no newidle balance happening due to
> the cost estimates. Due to this, we see that periodic load balance
> (rebalance_domains) may trigger way more often when the CPU is going in
> and out of idle at a high rate but is no really idle. Repeatedly
> triggering load balance there is a bad idea as it is a heavy operation.
> It also causes increases in softirq.

We have 2 balance intervals:
- one when idle, based on sd->balance_interval = sd_weight
- one when busy, which increases the period by multiplying it with
  busy_factor = 16

When becoming idle, rq->next_balance can have been set using the
16*sd_weight period, so load_balance can wait for a long time before
running an idle load balance for this cpu.

As a typical example, instead of waiting at most 8ms, we would wait
128ms before we try to pull a task on the idle CPU.

That's the reason for updating rq->next_balance in newidle_balance().

> Another issue is ->last_balance is not updated after newidle balance
> causing mistakes in the ->next_balance calculations.

Newly idle load balance is not equal to idle load balance. It's a
light load balance trying to pull one task, and you can't really
compare it to the normal load balance.

> Fix by updating last_balance when a newidle load balance actually
> happens and then updating next_balance. This is also how it is done in
> other load balance paths.
>
> Testing shows a significant drop in softirqs when running:
> cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q
>
> Goes from ~6k to ~800.

Even if your figures look interesting, your patch adds a regression in
the load balance and the fairness.

We can probably improve the current behavior to decrease the number of
ILBs, but your proposal is not the right solution IMO.

> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e276d12c3cb..b147ad09126a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> -
> -		if (sd)
> -			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
> -
>  		goto out;
>  	}
>  	rcu_read_unlock();
> @@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		int continue_balancing = 1;
>  		u64 domain_cost;
>
> -		update_next_balance(sd, &next_balance);
> -
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>
> @@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		t1 = sched_clock_cpu(this_cpu);
>  		domain_cost = t1 - t0;
>  		update_newidle_cost(sd, domain_cost);
> +		sd->last_balance = jiffies;
> +		update_next_balance(sd, &next_balance);
>
>  		curr_cost += domain_cost;
>  		t0 = t1;
> --
> 2.42.0.655.g421f12c284-goog
>
Hi Ingo,

On Fri, 20 Oct 2023 at 09:53, Ingo Molnar <mingo@kernel.org> wrote:
>
> ...
> >		curr_cost += domain_cost;
> >		t0 = t1;
>
> Okay, I'm applying patches #2 and #3, without #1: it should be safe
> out of order, but let me know if I missed something ...

Could you hold on for patch 3? As replied to Joel, the patch introduces
a regression in the update of rq->next_balance.

>
> Thanks,
>
> 	Ingo
* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> Even if your figures look interesting, your patch adds regression in
> the load balance and the fairness.

Indeed - I've removed it from tip:sched/core for the time being.

Thanks,

	Ingo
On Fri, Oct 20, 2023 at 9:57 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> > Even if your figures look interesting, your patch adds regression in
> > the load balance and the fairness.
>
> Indeed - I've removed it from tip:sched/core for the time being.

Sorry, we should have marked it as RFC. We still feel there are issues in
the newidle balance code and that the fix is in the right direction; we'll
continue discussing with Vincent on the other thread and try to come up
with a proper solution.

Thank you Ingo and Vincent!

 - Joel & Vineeth
* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> Hi Ingo,
>
> On Fri, 20 Oct 2023 at 09:53, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > ...
> > >		curr_cost += domain_cost;
> > >		t0 = t1;
> >
> > Okay, I'm applying patches #2 and #3, without #1: it should be safe
> > out of order, but let me know if I missed something ...
>
> Could you hold on for patch 3 ? As replied to Joel, the patch had
> regression in the update of rq->next_balance

Yeah, patch #3 is gone already from tip:sched/core.

Thanks,

	Ingo
On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> >
> > When newidle balancing triggers, we see that it constantly clobbers
> > rq->next_balance even when there is no newidle balance happening due to
> > the cost estimates. Due to this, we see that periodic load balance
> > (rebalance_domains) may trigger way more often when the CPU is going in
> > and out of idle at a high rate but is no really idle. Repeatedly
> > triggering load balance there is a bad idea as it is a heavy operation.
> > It also causes increases in softirq.
>
> we have 2 balance intervals:
> - one when idle based on the sd->balance_interval = sd_weight
> - one when busy which increases the period by multiplying it with
> busy_factor = 16

On my production system I see load balance triggering every 4 jiffies! In a
Qemu system (which I use for the traces below), I see every 8 jiffies. I'll
go look into that more as well; it could be something going on in
get_sd_balance_interval().

> When becoming idle, the rq->next_balance can have been set using the
> 16*sd_weight period so load_balance can wait for a long time before
> running idle load balance for this cpu.

Got it.

> As a typical example, instead of waiting at most 8ms, we will wait
> 128ms before we try to pull a task on the idle CPU.
>
> That's the reason for updating rq->next_balance in newidle_balance()

Got it, makes sense. But I still feel the mechanism is too aggressive, see
below.

> > Another issue is ->last_balance is not updated after newidle balance
> > causing mistakes in the ->next_balance calculations.
>
> newly idle load balance is not equal to idle load balance. It's a
> light load balance trying to pull one task and you can't really
> consider it to the normal load balance

True. However, the point is that it is coupled with the other load balance
mechanism and the two are not independent. As you can see below, modifying
rq->next_balance in newidle also causes the periodic balance to happen more
aggressively when there is a high rate of transitions from busy to idle and
vice versa.

> > Fix by updating last_balance when a newidle load balance actually
> > happens and then updating next_balance. This is also how it is done in
> > other load balance paths.
> >
> > Testing shows a significant drop in softirqs when running:
> > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q
> >
> > Goes from ~6k to ~800.
>
> Even if your figures look interesting, your patch adds regression in
> the load balance and the fairness.

Yes, I see that now. However, it does illustrate the problem IMO.

> We can probably do improve the current behavior for decreasing number
> of ILB but your proposal is not the right solution IMO

One of the problems is that if a task goes idle a lot, then the
newidle_balance mechanism triggers the periodic balance every jiffy (once
per millisecond on HZ=1000).

Following are some traces I collected:

cyclictest-123 [003] 522.650574 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
<idle>-0      [003] 522.651443 trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
<idle>-0      [003] 522.651461 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
cyclictest-123 [003] 522.651494 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
<idle>-0      [003] 522.652522 trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
<idle>-0      [003] 522.652560 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
cyclictest-124 [003] 522.652586 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
<idle>-0      [003] 522.654492 trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
<idle>-0      [003] 522.654534 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)

Triggering it so aggressively is likely useless; it increases the softirq
count and hurts power when no real work is done, IMHO. And it probably
makes things worse for power on ARM, where you have the uclamp stuff
happening in the load balance paths, which was quite heavy when I last
traced that.

Further, we have observed in our tracing on a real device that the update
of rq->next_balance from the newidle path is itself buggy... we observed
that because newidle balance may not update rq->last_balance, it is
possible that rq->next_balance, when updated by update_next_balance(), will
be set to a value that is in the past, and it will be stuck there for a
long time! Perhaps we should investigate more and fix that bug separately.
Vineeth could provide more details on the "getting stuck in the past"
behavior as well.

I hope these patches highlight some of the issues we found and can trigger
improvements by us or others. From our side we'll continue to work on it;
thank you for explaining some of the load balance code!

thanks,

 - Joel & Vineeth

> > Cc: Suleiman Souhlal <suleiman@google.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/sched/fair.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8e276d12c3cb..b147ad09126a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> >  	if (!READ_ONCE(this_rq->rd->overload) ||
> >  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > -
> > -		if (sd)
> > -			update_next_balance(sd, &next_balance);
> >  		rcu_read_unlock();
> > -
> >  		goto out;
> >  	}
> >  	rcu_read_unlock();
> > @@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >  		int continue_balancing = 1;
> >  		u64 domain_cost;
> >
> > -		update_next_balance(sd, &next_balance);
> > -
> >  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> >  			break;
> >
> > @@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >  		t1 = sched_clock_cpu(this_cpu);
> >  		domain_cost = t1 - t0;
> >  		update_newidle_cost(sd, domain_cost);
> > +		sd->last_balance = jiffies;
> > +		update_next_balance(sd, &next_balance);
> >
> >  		curr_cost += domain_cost;
> >  		t0 = t1;
> > --
> > 2.42.0.655.g421f12c284-goog
> >
On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> > On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> > >
> > > When newidle balancing triggers, we see that it constantly clobbers
> > > rq->next_balance even when there is no newidle balance happening due to
> > > the cost estimates. Due to this, we see that periodic load balance
> > > (rebalance_domains) may trigger way more often when the CPU is going in
> > > and out of idle at a high rate but is no really idle. Repeatedly
> > > triggering load balance there is a bad idea as it is a heavy operation.
> > > It also causes increases in softirq.
> >
> > we have 2 balance intervals:
> > - one when idle based on the sd->balance_interval = sd_weight
> > - one when busy which increases the period by multiplying it with
> > busy_factor = 16
>
> On my production system I see load balance triggering every 4 jiffies! In a

Which kind of system do you have? sd->balance_interval is in ms.

> Qemu system (which I use for traces below), I see every 8 jiffies. I'll go
> look into that more as well, it could be something going on in
> get_sd_balance_interval().
>
> > When becoming idle, the rq->next_balance can have been set using the
> > 16*sd_weight period so load_balance can wait for a long time before
> > running idle load balance for this cpu.
>
> Got it.
>
> > As a typical example, instead of waiting at most 8ms, we will wait
> > 128ms before we try to pull a task on the idle CPU.
> >
> > That's the reason for updating rq->next_balance in newidle_balance()
>
> Got it, makes sense. But I still feel the mechanism is too aggressive, see
> below.
>
> > > Another issue is ->last_balance is not updated after newidle balance
> > > causing mistakes in the ->next_balance calculations.
> >
> > newly idle load balance is not equal to idle load balance. It's a
> > light load balance trying to pull one task and you can't really
> > consider it to the normal load balance
>
> True. However the point is that it is coupled with the other load balance
> mechanism and the two are not independent. As you can see below, modifying
> rq->next_balance in newidle also causes the periodic balance to happen more
> aggressively as well if there is a high transition from busy to idle and
> viceversa.

As mentioned, rq->next_balance is updated whenever the cpu enters idle
(i.e. in newidle_balance()), but that's not related to doing a newly
idle load balance.

But your problem is more related to the fact that nohz.needs_update is
set when stopping the cpu timer in order to update nohz.next_balance,
which is then used to kick a "real" idle load balance.

> > > Fix by updating last_balance when a newidle load balance actually
> > > happens and then updating next_balance. This is also how it is done in
> > > other load balance paths.
> > >
> > > Testing shows a significant drop in softirqs when running:
> > > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q
> > >
> > > Goes from ~6k to ~800.
> >
> > Even if your figures look interesting, your patch adds regression in
> > the load balance and the fairness.
>
> Yes I see that now. However it does illustrate the problem IMO.
>
> > We can probably do improve the current behavior for decreasing number
> > of ILB but your proposal is not the right solution IMO
>
> One of the problems is if you have task goes idle a lot, then the
> newidle_balance mechanism triggers the periodic balance every jiffie (once
> per millisecond on HZ=1000).

Every msec seems quite a lot.

> Following are some traces I collected.
>
> cyclictest-123 [003] 522.650574 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
> <idle>-0      [003] 522.651443 trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
> <idle>-0      [003] 522.651461 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
> cyclictest-123 [003] 522.651494 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
> <idle>-0      [003] 522.652522 trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
> <idle>-0      [003] 522.652560 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
> cyclictest-124 [003] 522.652586 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
> <idle>-0      [003] 522.654492 trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
> <idle>-0      [003] 522.654534 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)

Ok, so IIUC your trace above, this happens because the tick is not
stopped after entering idle, so it continues to fire and triggers a load
balance without checking if there is a need, like what is done for nohz
mode.

> Triggering it so aggressively likely that is useless, it increases softirq
> count and hurts power when no real work is done IMHO. And it probably makes

In the case above, the tick has already fired because it was not
stopped. I mean that it's not the ilb which triggers this and wakes up
the cpu; it takes advantage of the tick, so most of the pain has already
happened (waking up a cpu which was probably in a shallow c-state to let
its tick fire).

> things worse for power on ARM where you have uclamp stuff happening in the
> load balance paths which is quite heavy when I last traced that..
>
> Further, we have observed in our tracing on real device that the update of
> rq->next_balance from the newidle path is itself buggy... we observed that
> because newidle balance may not update rq->last_balance, it is possible that
> rq->next_balance when updated by update_next_balance() will be updated to a
> value that is in the past and it will be stuck there for a long time! Perhaps
> we should investigate more and fix that bug separately. Vineeth could provide
> more details on the "getting stuck in the past" behavior as well.

sd->last_balance reflects the last time an idle/busy load_balance happened
(newly idle is out of the scope of the points that I mentioned previously).
So if no load balance happens for a while, rq->next_balance can be in the
past, but I don't see a problem here. It just means that a load balance
hasn't happened for a while. It can even move backward if it was set when
busy but the cpu is now idle.

> I hope these patches highlight some of the issues we find and can trigger any
> improvements by us or others. From our side we'll continue to work on it and
> thank you for explaining some of the load balance code!
>
> thanks,
>
> - Joel & Vineeth
>
> > > Cc: Suleiman Souhlal <suleiman@google.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > >  kernel/sched/fair.c | 8 ++------
> > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 8e276d12c3cb..b147ad09126a 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >
> > >  	if (!READ_ONCE(this_rq->rd->overload) ||
> > >  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > > -
> > > -		if (sd)
> > > -			update_next_balance(sd, &next_balance);
> > >  		rcu_read_unlock();
> > > -
> > >  		goto out;
> > >  	}
> > >  	rcu_read_unlock();
> > > @@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >  		int continue_balancing = 1;
> > >  		u64 domain_cost;
> > >
> > > -		update_next_balance(sd, &next_balance);
> > > -
> > >  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> > >  			break;
> > >
> > > @@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >  		t1 = sched_clock_cpu(this_cpu);
> > >  		domain_cost = t1 - t0;
> > >  		update_newidle_cost(sd, domain_cost);
> > > +		sd->last_balance = jiffies;
> > > +		update_next_balance(sd, &next_balance);
> > >
> > >  		curr_cost += domain_cost;
> > >  		t0 = t1;
> > > --
> > > 2.42.0.655.g421f12c284-goog
> > >
Hi Vincent,

Sorry for the late reply, I was in Tokyo all these days and was waiting to get to writing a proper reply. See my replies below:

On Thu, Oct 26, 2023 at 04:23:35PM +0200, Vincent Guittot wrote:
> On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> > > On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> > > <joel@joelfernandes.org> wrote:
> > > >
> > > > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> > > >
> > > > When newidle balancing triggers, we see that it constantly clobbers
> > > > rq->next_balance even when there is no newidle balance happening due to
> > > > the cost estimates. Due to this, we see that periodic load balance
> > > > (rebalance_domains) may trigger way more often when the CPU is going in
> > > > and out of idle at a high rate but is not really idle. Repeatedly
> > > > triggering load balance there is a bad idea as it is a heavy operation.
> > > > It also causes increases in softirq.
> > >
> > > we have 2 balance intervals:
> > > - one when idle based on the sd->balance_interval = sd_weight
> > > - one when busy which increases the period by multiplying it with
> > >   busy_factor = 16
> >
> > On my production system I see load balance triggering every 4 jiffies! In a
>
> Which kind of system do you have? sd->balance_interval is in ms

Yes, sorry, I meant it triggers every jiffy, which is extreme sometimes. It is an ADL SoC (12th gen Intel, 4 P cores 8 E cores); get_sd_balance_interval() returns 4 jiffies there. On my Qemu system, I see 8 jiffies.

[...]

> > > > Another issue is ->last_balance is not updated after newidle balance
> > > > causing mistakes in the ->next_balance calculations.
> > >
> > > newly idle load balance is not equal to idle load balance. It's a
> > > light load balance trying to pull one task and you can't really
> > > compare it to the normal load balance
> >
> > True. However the point is that it is coupled with the other load balance
> > mechanism and the two are not independent. As you can see below, modifying
> > rq->next_balance in newidle also causes the periodic balance to happen more
> > aggressively as well if there is a high transition from busy to idle and
> > vice versa.
>
> As mentioned, rq->next_balance is updated whenever cpu enters idle
> (i.e. in newidle_balance()) but it's not related with doing a newly
> idle load balance.

Yes, I understand that. But my point was that the update of rq->next_balance from the newidle path is itself buggy and interferes with the load balance happening from the tick (trigger_load_balance -> run_rebalance_domains).

> But your problem is more related with the fact that
> nohz.needs_update is set when stopping cpu timer in order to update
> nohz.next_balance which is then used to kick a "real" idle load
> balance

Well, independent of nohz idle balance, I think we need to fix this issue as mentioned above. This affects the periodic one as mentioned in the commit log.

See here another trace I collected, this time dumping the 'interval'. There is a tug of war happening between the newidle balance and the periodic balance.

The periodic one sets rq->next_balance for cpu 0 to 760,143 and then the newidle path comes in and pulls it back 118 jiffies to 760,024. This is actually in the past because jiffies is currently 760,045!

This triggers the periodic balance again, which sets rq->next_balance back to 760,143.

Rinse and repeat. End result is you have a periodic balance every jiffy. With this patch the issue goes away, but we could fix it differently: as you mentioned, we need to pull newidle balance back, but perhaps not so aggressively. How about something like the untested diff I enclosed at the end of this email?
<idle>-0 [000] 13.081781: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,045)
cyclictest-120 [000] 13.081806: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
cyclictest-120 [000] 13.081807: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,045)
cyclictest-120 [000] 13.082130: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
cyclictest-120 [000] 13.082338: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
cyclictest-120 [000] 13.082636: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
<idle>-0 [000] 13.082823: trigger_load_balance: time_after_eq(jiffies=760,046, rq->next_balance=760,024) = 1
<idle>-0 [000] 13.082823: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [000] 13.082871: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,046)
trace-cmd-114 [000] 13.082876: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
trace-cmd-114 [000] 13.082876: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,046)
cyclictest-120 [000] 13.083333: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
cyclictest-120 [000] 13.083633: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
<idle>-0 [000] 13.083656: trigger_load_balance: time_after_eq(jiffies=760,047, rq->next_balance=760,024) = 1
<idle>-0 [000] 13.083656: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [000] 13.083702: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,047)
cyclictest-120 [000] 13.083729: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120 [000] 13.083730: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,047)
cyclictest-120 [000] 13.083960: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120 [000] 13.084069: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120 [000] 13.084423: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120 [000] 13.084633: trigger_load_balance: time_after_eq(jiffies=760,048, rq->next_balance=760,024) = 1
cyclictest-120 [000] 13.084633: softirq_raise: vec=7 [action=SCHED]
cyclictest-120 [000] 13.084678: rebalance_domains: rq[cpu=0]->next_balance:

> > > > > > Fix by updating last_balance when a newidle load balance actually > > > > > happens and then updating next_balance. This is also how it is done in > > > > > other load balance paths. > > > > > > > > > > Testing shows a significant drop in softirqs when running: > > > > > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q > > > > > > > > > > Goes from ~6k to ~800. > > > > > > > > Even if your figures look interesting, your patch adds regression in > > > > the load balance and the fairness. > > > > > > Yes I see that now. However it does illustrate the problem IMO. > > > > > > > We can probably do improve the current behavior for decreasing number > > > > of ILB but your proposal is not the right solution IMO > > > > > > One of the problems is if you have task goes idle a lot, then the > > > newidle_balance mechanism triggers the periodic balance every jiffie (once > > > per millisecond on HZ=1000). > > > > every msec seems quite a lot. > > Yeah! > > > > Following are some traces I collected.
> >
> > cyclictest-123 [003] 522.650574 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
> > <idle>-0 [003] 522.651443 trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
> > <idle>-0 [003] 522.651461 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
> > cyclictest-123 [003] 522.651494 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
> > <idle>-0 [003] 522.652522 trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
> > <idle>-0 [003] 522.652560 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
> > cyclictest-124 [003] 522.652586 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
> > <idle>-0 [003] 522.654492 trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
> > <idle>-0 [003] 522.654534 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)

> Ok, so IIUC your trace above, this happens because the tick is not
> stopped after entering idle, so it continues to fire and triggers a load
> balance without checking if there is a need, like what is done for nohz
> mode

The tick is normally not stopped if the CPU is awakened too soon by a timer. That's pretty normal AFAIK. As you can see in the traces above, cyclictest keeps waking up.

> > things worse for power on ARM where you have uclamp stuff happening in the
> > load balance paths which is quite heavy when I last traced that..
> >
> > Further, we have observed in our tracing on real device that the update of
> > rq->next_balance from the newidle path is itself buggy... we observed that
> > because newidle balance may not update rq->last_balance, it is possible that
> > rq->next_balance when updated by update_next_balance() will be updated to a
> > value that is in the past and it will be stuck there for a long time! Perhaps
> > we should investigate more and fix that bug separately. Vineeth could provide
> > more details on the "getting stuck in the past" behavior as well.
>
> sd->last_balance reflects the last time an idle/busy load_balance happened
> (newly idle is out of the scope for the points that I mentioned
> previously). So if no load balance happens for a while, the
> rq->next_balance can be in the past but I don't see a problem here. It
> just means that a load balance hasn't happened for a while. It can
> even move backward if it has been set when busy but the cpu is now
> idle

Sure, but I think it should at least set it by get_sd_balance_interval() into the future. Like so (untested)? Let me know what you think, and thanks!

---8<-----------------------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3318aeff9e8..0d6667d31c51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 	return interval;
 }
 
+/*
+ * Update the next balance from newidle balance.
+ * The update of next_balance from newidle balance tries to make sure that
+ * we don't trigger periodic balance too far in the future on a now-idle
+ * system. This is just like update_next_balance except that since
+ * sd->last_balance may not have been updated for a while, we're careful to
+ * not set next_balance in the past.
+ */
+static inline void
+update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance)
+{
+	unsigned long interval, next;
+
+	/* used by new idle balance, so cpu_busy = 0 */
+	interval = get_sd_balance_interval(sd, 0);
+	next = sd->last_balance + interval;
+
+	next = max(next, jiffies + interval);
+
+	if (time_after(*next_balance, next)) {
+		*next_balance = next;
+	}
+}
+
 static inline void
 update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
 {
@@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
-			update_next_balance(sd, &next_balance);
+			update_next_balance_newidle(sd, &next_balance);
 		rcu_read_unlock();
 
 		goto out;
@@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		int continue_balancing = 1;
 		u64 domain_cost;
 
-		update_next_balance(sd, &next_balance);
+		update_next_balance_newidle(sd, &next_balance);
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
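The clamping arithmetic in the diff above can be exercised in isolation with a userspace sketch (hypothetical helper names; time_after_m() mirrors the kernel's wrap-safe time_after() macro, and the numbers in the note below come from the trace earlier in this email):

```c
#include <assert.h>

/* Wrap-safe "is a after b?", mirroring the kernel's time_after() macro. */
static int time_after_m(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Plain update_next_balance(): next can land in the past if last_balance is stale. */
static void next_balance_plain(unsigned long last_balance, unsigned long interval,
			       unsigned long *next_balance)
{
	unsigned long next = last_balance + interval;

	if (time_after_m(*next_balance, next))
		*next_balance = next;
}

/* The proposed newidle variant: clamp to at least jiffies + interval. */
static void next_balance_newidle(unsigned long jiffies, unsigned long last_balance,
				 unsigned long interval, unsigned long *next_balance)
{
	unsigned long next = last_balance + interval;

	if (time_after_m(jiffies + interval, next))
		next = jiffies + interval;	/* next = max(next, jiffies + interval) */

	if (time_after_m(*next_balance, next))
		*next_balance = next;
}
```

With the trace's numbers (jiffies=760,045, a stale last_balance of 760,016, interval 8), the plain update drags rq->next_balance from 760,143 back to 760,024 — already in the past — while the clamped variant yields 760,053, one interval into the future.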
On Thu, Nov 9, 2023 at 5:02 AM Joel Fernandes <joel@joelfernandes.org> wrote: [...] > > > things worse for power on ARM where you have uclamp stuff happening in the > > > load balance paths which is quite heavy when I last traced that.. > > > > > > Further, we have observed in our tracing on real device that the update of > > > rq->next_balance from the newidle path is itself buggy... we observed that > > > because newidle balance may not update rq->last_balance, it is possible that > > > rq->next_balance when updated by update_next_balance() will be updated to a > > > value that is in the past and it will be stuck there for a long time! Perhaps > > > we should investigate more and fix that bug separately. Vineeth could provide > > > more details on the "getting stuck in the past" behavior as well. > > > > sd->last_balance reflects last time an idle/busy load_balance happened > > (newly idle is out of the scope for the points that I mentioned > > previously). So if no load balance happens for a while, the > > rq->next_balance can be in the past but I don't see a problem here. It > > just means that a load balance hasn't happened for a while. It can > > even move backward if it has been set when busy but the cpu is now > > idle > > Sure, but I think it should at least set it by get_sd_balance_interval() into > the future. Like so (untested)? Let me know what you think and thanks! Btw, I also drew a graph showing the issue without patch: https://i.imgur.com/RgTr45l.png Each "x" mark is run_rebalance_domains() running on a CPU. As can be seen, there were some 10 occurrences in a span of 15ms in one instance. 
Thanks, - Joel > ---8<----------------------- > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a3318aeff9e8..0d6667d31c51 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy) > return interval; > } > > +/* > + * Update the next balance from newidle balance. > + * The update of next_balance from newidle balance tries to make sure that > + * we don't trigger periodic balance too far in the future on a now-idle > + * system. This is just like update_next_balance except that since > + * sd->last_balance may not have been updated for a while, we're careful to > + * not set next_balance in the past. > + */ > +static inline void > +update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance) > +{ > + unsigned long interval, next; > + > + /* used by new idle balance, so cpu_busy = 0 */ > + interval = get_sd_balance_interval(sd, 0); > + next = sd->last_balance + interval; > + > + next = max(next, jiffies + interval); > + > + if (time_after(*next_balance, next)) { > + *next_balance = next; > + } > +} > + > static inline void > update_next_balance(struct sched_domain *sd, unsigned long *next_balance) > { > @@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { > > if (sd) > - update_next_balance(sd, &next_balance); > + update_next_balance_newidle(sd, &next_balance); > rcu_read_unlock(); > > goto out; > @@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > int continue_balancing = 1; > u64 domain_cost; > > - update_next_balance(sd, &next_balance); > + update_next_balance_newidle(sd, &next_balance); > > if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) > break;
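The every-jiffy storm visible in the graph can also be reproduced with a toy model of trigger_load_balance()'s time_after_eq() check. This is a deliberate simplification: one CPU, one tick per jiffy, an 8-jiffy interval as in the traces, and a 'stuck' flag standing in for newidle rewinding rq->next_balance to the same stale value right after every rebalance.

```c
#include <assert.h>

/* Wrap-safe "a >= b", mirroring the kernel's time_after_eq() macro. */
static int time_after_eq_m(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Count how many ticks would raise the SCHED softirq over a window.
 * When 'stuck' is set, next_balance is never pushed forward (modelling
 * newidle rewinding it to a stale value after each rebalance), so the
 * time_after_eq() check fires on every single tick.
 */
static int count_rebalances(unsigned long start, int ticks,
			    unsigned long next_balance, int stuck)
{
	int raised = 0;
	int i;

	for (i = 0; i < ticks; i++) {
		unsigned long jiffies = start + i;

		if (time_after_eq_m(jiffies, next_balance)) {
			raised++;
			if (!stuck)
				next_balance = jiffies + 8; /* healthy: push an interval ahead */
		}
	}
	return raised;
}
```

Over a 16-tick window starting at the trace's jiffies=760,045 with next_balance stuck at 760,024, all 16 ticks raise the softirq; if next_balance is pushed an interval ahead after each rebalance, only 2 do.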
Le jeudi 09 nov. 2023 à 10:02:54 (+0000), Joel Fernandes a écrit : > Hi Vincent, > > Sorry for late reply, I was in Tokyo all these days and was waiting to get to > writing a proper reply. See my replies below: > > On Thu, Oct 26, 2023 at 04:23:35PM +0200, Vincent Guittot wrote: > > On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote: > > > > On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google) > > > > <joel@joelfernandes.org> wrote: > > > > > > > > > > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> > > > > > > > > > > When newidle balancing triggers, we see that it constantly clobbers > > > > > rq->next_balance even when there is no newidle balance happening due to > > > > > the cost estimates. Due to this, we see that periodic load balance > > > > > (rebalance_domains) may trigger way more often when the CPU is going in > > > > > and out of idle at a high rate but is no really idle. Repeatedly > > > > > triggering load balance there is a bad idea as it is a heavy operation. > > > > > It also causes increases in softirq. > > > > > > > > we have 2 balance intervals: > > > > - one when idle based on the sd->balance_interval = sd_weight > > > > - one when busy which increases the period by multiplying it with > > > > busy_factor = 16 > > > > > > On my production system I see load balance triggering every 4 jiffies! In a > > > > Which kind of system do you have? sd->balance_interval is in ms > > Yes, sorry I meant it triggers every jiffies which is extreme sometimes. It > is an ADL SoC (12th gen Intel, 4 P cores 8 E cores) get_sd_balance_interval() > returns 4 jiffies there. On my Qemu system, I see 8 jiffies. Do you have details about the sched_domain hierarchy ? That could be part of your problem (see below) > > [...] 
> > > > > Another issue is ->last_balance is not updated after newidle balance
> > > > > causing mistakes in the ->next_balance calculations.
> > > >
> > > > newly idle load balance is not equal to idle load balance. It's a
> > > > light load balance trying to pull one task and you can't really
> > > > compare it to the normal load balance
> > >
> > > True. However the point is that it is coupled with the other load balance
> > > mechanism and the two are not independent. As you can see below, modifying
> > > rq->next_balance in newidle also causes the periodic balance to happen more
> > > aggressively as well if there is a high transition from busy to idle and
> > > vice versa.
> >
> > As mentioned, rq->next_balance is updated whenever cpu enters idle
> > (i.e. in newidle_balance()) but it's not related with doing a newly
> > idle load balance.
>
> Yes, I understand that. But my point was that the update of rq->next_balance
> from the newidle path is itself buggy and interferes with the load balance
> happening from the tick (trigger_load_balance -> run_rebalance_domains).

Newidle path is not buggy. It only uses sd->last_balance + interval to estimate the next balance, which is the correct thing to do. Your problem comes from the update of sd->last_balance, which never happens and remains in the past, whereas you call run_rebalance_domains(), which should run load_balance for all domains with a sd->last_balance + interval in the past.

Your problem most probably comes from should_we_balance(), which always or "almost always" returns false in your use case for some sched_domain and prevents sd->last_balance from being updated. Could you try the patch below? It should fix your problem of rebalancing every tick even though rebalance_domains() is called. At least this should show if it's your problem, but I'm not sure it's the right thing to do all the time ...
---
 kernel/sched/fair.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3745ca289240..9ea1f42e5362 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11568,17 +11568,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		need_decay = update_newidle_cost(sd, 0);
 		max_cost += sd->max_newidle_lb_cost;
 
-		/*
-		 * Stop the load balance at this level. There is another
-		 * CPU in our sched group which is doing load balancing more
-		 * actively.
-		 */
-		if (!continue_balancing) {
-			if (need_decay)
-				continue;
-			break;
-		}
-
 		interval = get_sd_balance_interval(sd, busy);
 
 		need_serialize = sd->flags & SD_SERIALIZE;
@@ -11588,7 +11577,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}
 
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
+			/*
+			 * Stop the load balance at this level. There is another
+			 * CPU in our sched group which is doing load balancing more
+			 * actively.
+			 */
+			if (continue_balancing && load_balance(cpu, rq, sd, idle, &continue_balancing)) {
 				/*
 				 * The LBF_DST_PINNED logic could have changed
 				 * env->dst_cpu, so we can't know our idle
--
2.34.1

> > > But your problem is more related with the fact that > > nohz.needs_update is set when stopping cpu timer in order to update > > nohz.next_balance which is then used to kick a "real" idle load > > balance > > Well, independent of nohz idle balance, I think we need to fix this issue as > mentioned above. This effect the periodic one as mentioned in the commit log. > > See here another trace I collected this time dumping the 'interval'. There is > a tug of war happening between the newidle balance and the periodic balance. > > The periodic one sets rq->next_balance for cpu 0 to 760,143 and then the > newidle comes in pulls it back a 118 jiffies to 760,024.
This is actually in > the past because jiffies is currently 760,045 !! > > This triggers the periodic balance againwhich sets rq->next_balance back to > 760,143. > > Rinse and repeat. End result is you have periodic balance every jiffies. With > this patch the issue goes away but we could fix it differently as you > mentioned, we need to pull newidle balance back but perhaps not so > aggressively. How about something like the untested diff I enclosed at the > end of this email? > > <idle>-0 [000] 13.081781: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,045) > cyclictest-120 [000] 13.081806: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8) > cyclictest-120 [000] 13.081807: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,045) > cyclictest-120 [000] 13.082130: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8) > cyclictest-120 [000] 13.082338: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8) > cyclictest-120 [000] 13.082636: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8) > <idle>-0 [000] 13.082823: trigger_load_balance: time_after_eq(jiffies=760,046, rq->next_balance=760,024) = 1 > <idle>-0 [000] 13.082823: softirq_raise: vec=7 [action=SCHED] > <idle>-0 [000] 13.082871: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,046) > trace-cmd-114 [000] 13.082876: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8) > trace-cmd-114 [000] 13.082876: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,046) > cyclictest-120 [000] 13.083333: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8) > cyclictest-120 [000] 13.083633: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8) > <idle>-0 [000] 13.083656: 
trigger_load_balance: time_after_eq(jiffies=760,047, rq->next_balance=760,024) = 1 > <idle>-0 [000] 13.083656: softirq_raise: vec=7 [action=SCHED] > <idle>-0 [000] 13.083702: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,047) > cyclictest-120 [000] 13.083729: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8) > cyclictest-120 [000] 13.083730: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,047) > cyclictest-120 [000] 13.083960: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8) > cyclictest-120 [000] 13.084069: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8) > cyclictest-120 [000] 13.084423: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8) > cyclictest-120 [000] 13.084633: trigger_load_balance: time_after_eq(jiffies=760,048, rq->next_balance=760,024) = 1 > cyclictest-120 [000] 13.084633: softirq_raise: vec=7 [action=SCHED] > cyclictest-120 [000] 13.084678: rebalance_domains: rq[cpu=0]->next_balance: > > > > > > Fix by updating last_balance when a newidle load balance actually > > > > > happens and then updating next_balance. This is also how it is done in > > > > > other load balance paths. > > > > > > > > > > Testing shows a significant drop in softirqs when running: > > > > > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q > > > > > > > > > > Goes from ~6k to ~800. > > > > > > > > Even if your figures look interesting, your patch adds regression in > > > > the load balance and the fairness. > > > > > > Yes I see that now. However it does illustrate the problem IMO. 
> > > > > > > We can probably do improve the current behavior for decreasing number > > > > of ILB but your proposal is not the right solution IMO > > > > > > One of the problems is if you have task goes idle a lot, then the > > > newidle_balance mechanism triggers the periodic balance every jiffie (once > > > per millisecond on HZ=1000). > > > > every msec seems quite a lot. > > Yeah! > > > > > > > Following are some traces I collected. > > > > > > cyclictest-123 [003] 522.650574 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157) > > > <idle>-0 [003] 522.651443 trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1 > > > <idle>-0 [003] 522.651461 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158) > > > cyclictest-123 [003] 522.651494 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158) > > > <idle>-0 [003] 522.652522 trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1 > > > <idle>-0 [003] 522.652560 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159) > > > cyclictest-124 [003] 522.652586 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159) > > > <idle>-0 [003] 522.654492 trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1 > > > <idle>-0 [003] 522.654534 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161) > > > > Ok, so IIUC your trace above, this happens because the tick is not > > stop after entering idle so it continues to fire and triggers a load > > balance without checking if there is a need like what is done for nohz > > mode > > The tick is normally not stopped if the CPU is awakened too soon by a timer. > That's pretty normal AFAIK. As you can see in the traces above, cyclictest > keeps waking up. 
> > > > things worse for power on ARM where you have uclamp stuff happening in the > > > load balance paths which is quite heavy when I last traced that.. > > > > > > Further, we have observed in our tracing on real device that the update of > > > rq->next_balance from the newidle path is itself buggy... we observed that > > > because newidle balance may not update rq->last_balance, it is possible that > > > rq->next_balance when updated by update_next_balance() will be updated to a > > > value that is in the past and it will be stuck there for a long time! Perhaps > > > we should investigate more and fix that bug separately. Vineeth could provide > > > more details on the "getting stuck in the past" behavior as well. > > > > sd->last_balance reflects last time an idle/busy load_balance happened > > (newly idle is out of the scope for the points that I mentioned > > previously). So if no load balance happens for a while, the > > rq->next_balance can be in the past but I don't see a problem here. It > > just means that a load balance hasn't happened for a while. It can > > even move backward if it has been set when busy but the cpu is now > > idle > > Sure, but I think it should at least set it by get_sd_balance_interval() into > the future. Like so (untested)? Let me know what you think and thanks! > > ---8<----------------------- > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a3318aeff9e8..0d6667d31c51 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy) > return interval; > } > > +/* > + * Update the next balance from newidle balance. > + * The update of next_balance from newidle balance tries to make sure that > + * we don't trigger periodic balance too far in the future on a now-idle > + * system. 
This is just like update_next_balance except that since > + * sd->last_balance may not have been updated for a while, we're careful to > + * not set next_balance in the past. > + */ > +static inline void > +update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance) > +{ > + unsigned long interval, next; > + > + /* used by new idle balance, so cpu_busy = 0 */ > + interval = get_sd_balance_interval(sd, 0); > + next = sd->last_balance + interval; > + > + next = max(next, jiffies + interval); > + > + if (time_after(*next_balance, next)) { > + *next_balance = next; > + } > +} > + > static inline void > update_next_balance(struct sched_domain *sd, unsigned long *next_balance) > { > @@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { > > if (sd) > - update_next_balance(sd, &next_balance); > + update_next_balance_newidle(sd, &next_balance); > rcu_read_unlock(); > > goto out; > @@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > int continue_balancing = 1; > u64 domain_cost; > > - update_next_balance(sd, &next_balance); > + update_next_balance_newidle(sd, &next_balance); > > if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) > break;
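Vincent's suggestion in this thread — gating only the load_balance() call on continue_balancing instead of breaking out of the domain walk — can be modelled in miniature. This is a sketch, not the kernel code: two fake domains, fixed should_we_balance() outcomes, need_decay and the actual balancing work ignored.

```c
#include <assert.h>

struct sd_m {
	unsigned long last_balance;
	unsigned long interval;
	int swb;		/* what should_we_balance() would return here */
};

/*
 * Miniature rebalance_domains(): walks the domains for one CPU.
 * With fix == 0 (old code), !continue_balancing breaks out of the loop,
 * so higher domains never get their last_balance refreshed.
 * With fix == 1 (the proposed change), only the "load_balance()" call is
 * gated, so every due domain still records last_balance = jiffies.
 */
static void rebalance_domains_m(struct sd_m *sds, int n, unsigned long jiffies, int fix)
{
	int continue_balancing = 1;
	int i;

	for (i = 0; i < n; i++) {
		struct sd_m *sd = &sds[i];

		if (!fix && !continue_balancing)
			break;		/* old code: bail before updating */

		if (jiffies - sd->last_balance >= sd->interval) {
			if (!fix || continue_balancing)
				continue_balancing = sd->swb;	/* "load_balance()" ran */
			sd->last_balance = jiffies;	/* advances even if nothing pulled */
		}
	}
}
```

With should_we_balance() false at the lowest level, the old ordering leaves the second domain's last_balance stale forever (so its last_balance + interval stays in the past), while the reordered version lets it advance on every due pass.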
Hi Vincent, > On Nov 14, 2023, at 10:43 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > Le jeudi 09 nov. 2023 à 10:02:54 (+0000), Joel Fernandes a écrit : >> Hi Vincent, >> >> Sorry for late reply, I was in Tokyo all these days and was waiting to get to >> writing a proper reply. See my replies below: >> >>> On Thu, Oct 26, 2023 at 04:23:35PM +0200, Vincent Guittot wrote: >>> On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote: >>>> >>>> On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote: >>>>> On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google) >>>>> <joel@joelfernandes.org> wrote: >>>>>> >>>>>> From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> >>>>>> >>>>>> When newidle balancing triggers, we see that it constantly clobbers >>>>>> rq->next_balance even when there is no newidle balance happening due to >>>>>> the cost estimates. Due to this, we see that periodic load balance >>>>>> (rebalance_domains) may trigger way more often when the CPU is going in >>>>>> and out of idle at a high rate but is no really idle. Repeatedly >>>>>> triggering load balance there is a bad idea as it is a heavy operation. >>>>>> It also causes increases in softirq. >>>>> >>>>> we have 2 balance intervals: >>>>> - one when idle based on the sd->balance_interval = sd_weight >>>>> - one when busy which increases the period by multiplying it with >>>>> busy_factor = 16 >>>> >>>> On my production system I see load balance triggering every 4 jiffies! In a >>> >>> Which kind of system do you have? sd->balance_interval is in ms >> >> Yes, sorry I meant it triggers every jiffies which is extreme sometimes. It >> is an ADL SoC (12th gen Intel, 4 P cores 8 E cores) get_sd_balance_interval() >> returns 4 jiffies there. On my Qemu system, I see 8 jiffies. > > Do you have details about the sched_domain hierarchy ? 
> That could be part of your problem (see below)

Since I am at LPC, I am not at that machine right now, but I could provide it next week. I replied below:

>
>>
>> [...]
>>>>>> Another issue is ->last_balance is not updated after newidle balance
>>>>>> causing mistakes in the ->next_balance calculations.
>>>>>
>>>>> newly idle load balance is not equal to idle load balance. It's a
>>>>> light load balance trying to pull one task and you can't really
>>>>> compare it to the normal load balance
>>>>
>>>> True. However the point is that it is coupled with the other load balance
>>>> mechanism and the two are not independent. As you can see below, modifying
>>>> rq->next_balance in newidle also causes the periodic balance to happen more
>>>> aggressively as well if there is a high transition from busy to idle and
>>>> vice versa.
>>>
>>> As mentioned, rq->next_balance is updated whenever cpu enters idle
>>> (i.e. in newidle_balance()) but it's not related with doing a newly
>>> idle load balance.
>>
>> Yes, I understand that. But my point was that the update of rq->next_balance
>> from the newidle path is itself buggy and interferes with the load balance
>> happening from the tick (trigger_load_balance -> run_rebalance_domains).
>
> Newidle path is not buggy.

Sure, perhaps not the newidle path directly, but something else is buggy, as you indicated below:

> It only uses sd->last_balance + interval to
> estimate the next balance which is the correct thing to do. Your problem
> comes from the update of sd->last_balance which never happens and remains
> in the past whereas you call run_rebalance_domains() which should
> run load_balance for all domains with a sd->last_balance + interval in the
> past.
> Your problem most probably comes from the should_we_balance which always or
> "almost always" returns false in your use case for some sched_domain and
> prevents sd->last_balance from being updated. Could you try the patch below ?
> It should fix your problem of trying to rebalance every tick even though
> rebalance_domains is called.
> At least this should show if it's your problem but I'm not sure it's the right
> thing to do all the time ...

You raise a good point; the root cause is indeed last_balance being stuck in
the past, or something like that. I will try the patch below and let you know.
My previous diff, which caps the next_balance setting, also made the issue go
away when I last tested.

Thanks Vincent!

 - Joel

>
> ---
> kernel/sched/fair.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3745ca289240..9ea1f42e5362 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11568,17 +11568,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  need_decay = update_newidle_cost(sd, 0);
>  max_cost += sd->max_newidle_lb_cost;
>
> - /*
> -  * Stop the load balance at this level. There is another
> -  * CPU in our sched group which is doing load balancing more
> -  * actively.
> -  */
> - if (!continue_balancing) {
> -  if (need_decay)
> -   continue;
> -  break;
> - }
> -
>  interval = get_sd_balance_interval(sd, busy);
>
>  need_serialize = sd->flags & SD_SERIALIZE;
> @@ -11588,7 +11577,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  }
>
>  if (time_after_eq(jiffies, sd->last_balance + interval)) {
> - if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
> + /*
> +  * Stop the load balance at this level. There is another
> +  * CPU in our sched group which is doing load balancing more
> +  * actively.
> +  */
> + if (continue_balancing && load_balance(cpu, rq, sd, idle, &continue_balancing)) {
>  /*
>   * The LBF_DST_PINNED logic could have changed
>   * env->dst_cpu, so we can't know our idle
> --
> 2.34.1
>
>>
>>> But your problem is more related with the fact that
>>> nohz.needs_update is set when stopping cpu timer in order to update
>>> nohz.next_balance which is then used to kick a "real" idle load
>>> balance
>>
>> Well, independent of nohz idle balance, I think we need to fix this issue as
>> mentioned above. This affects the periodic one, as mentioned in the commit log.
>>
>> See here another trace I collected, this time dumping the 'interval'. There is
>> a tug of war happening between the newidle balance and the periodic balance.
>>
>> The periodic one sets rq->next_balance for cpu 0 to 760,143 and then the
>> newidle path comes in and pulls it back 118 jiffies to 760,024. This is
>> actually in the past because jiffies is currently 760,045 !!
>>
>> This triggers the periodic balance again, which sets rq->next_balance back to
>> 760,143.
>>
>> Rinse and repeat. End result is you have a periodic balance every jiffy. With
>> this patch the issue goes away, but we could fix it differently as you
>> mentioned; we need to pull newidle balance back, but perhaps not so
>> aggressively. How about something like the untested diff I enclosed at the
>> end of this email?
>>
>> <idle>-0 [000] 13.081781: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,045)
>> cyclictest-120 [000] 13.081806: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> cyclictest-120 [000] 13.081807: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,045)
>> cyclictest-120 [000] 13.082130: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> cyclictest-120 [000] 13.082338: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> cyclictest-120 [000] 13.082636: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> <idle>-0 [000] 13.082823: trigger_load_balance: time_after_eq(jiffies=760,046, rq->next_balance=760,024) = 1
>> <idle>-0 [000] 13.082823: softirq_raise: vec=7 [action=SCHED]
>> <idle>-0 [000] 13.082871: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,046)
>> trace-cmd-114 [000] 13.082876: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
>> trace-cmd-114 [000] 13.082876: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,046)
>> cyclictest-120 [000] 13.083333: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
>> cyclictest-120 [000] 13.083633: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
>> <idle>-0 [000] 13.083656: trigger_load_balance: time_after_eq(jiffies=760,047, rq->next_balance=760,024) = 1
>> <idle>-0 [000] 13.083656: softirq_raise: vec=7 [action=SCHED]
>> <idle>-0 [000] 13.083702: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,047)
>> cyclictest-120 [000] 13.083729: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120 [000] 13.083730: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,047)
>> cyclictest-120 [000] 13.083960: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120 [000] 13.084069: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120 [000] 13.084423: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120 [000] 13.084633: trigger_load_balance: time_after_eq(jiffies=760,048, rq->next_balance=760,024) = 1
>> cyclictest-120 [000] 13.084633: softirq_raise: vec=7 [action=SCHED]
>> cyclictest-120 [000] 13.084678: rebalance_domains: rq[cpu=0]->next_balance:
>>
>>>>>> Fix by updating last_balance when a newidle load balance actually
>>>>>> happens and then updating next_balance. This is also how it is done in
>>>>>> other load balance paths.
>>>>>>
>>>>>> Testing shows a significant drop in softirqs when running:
>>>>>> cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q
>>>>>>
>>>>>> Goes from ~6k to ~800.
>>>>>
>>>>> Even if your figures look interesting, your patch adds a regression in
>>>>> the load balance and the fairness.
>>>>
>>>> Yes, I see that now. However it does illustrate the problem IMO.
>>>>
>>>>> We can probably improve the current behavior to decrease the number
>>>>> of ILBs, but your proposal is not the right solution IMO
>>>>
>>>> One of the problems is if you have a task that goes idle a lot, then the
>>>> newidle_balance mechanism triggers the periodic balance every jiffy (once
>>>> per millisecond on HZ=1000).
>>>
>>> every msec seems quite a lot.
>>
>> Yeah!
>>
>>>>
>>>> Following are some traces I collected.
>>>>
>>>> cyclictest-123 [003] 522.650574 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
>>>> <idle>-0 [003] 522.651443 trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
>>>> <idle>-0 [003] 522.651461 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
>>>> cyclictest-123 [003] 522.651494 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
>>>> <idle>-0 [003] 522.652522 trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
>>>> <idle>-0 [003] 522.652560 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
>>>> cyclictest-124 [003] 522.652586 newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
>>>> <idle>-0 [003] 522.654492 trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
>>>> <idle>-0 [003] 522.654534 rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)
>>>
>>> Ok, so IIUC your trace above, this happens because the tick is not
>>> stopped after entering idle, so it continues to fire and triggers a load
>>> balance without checking if there is a need, like what is done for nohz
>>> mode
>>
>> The tick is normally not stopped if the CPU is awakened too soon by a timer.
>> That's pretty normal AFAIK. As you can see in the traces above, cyclictest
>> keeps waking up.
>>
>>>> things worse for power on ARM where you have uclamp stuff happening in the
>>>> load balance paths which is quite heavy when I last traced that..
>>>>
>>>> Further, we have observed in our tracing on a real device that the update of
>>>> rq->next_balance from the newidle path is itself buggy... we observed that
>>>> because newidle balance may not update rq->last_balance, it is possible that
>>>> rq->next_balance, when updated by update_next_balance(), will be updated to a
>>>> value that is in the past and it will be stuck there for a long time! Perhaps
>>>> we should investigate more and fix that bug separately. Vineeth could provide
>>>> more details on the "getting stuck in the past" behavior as well.
>>>
>>> sd->last_balance reflects the last time an idle/busy load_balance happened
>>> (newly idle is out of scope for the points that I mentioned
>>> previously). So if no load balance happens for a while,
>>> rq->next_balance can be in the past, but I don't see a problem here. It
>>> just means that a load balance hasn't happened for a while. It can
>>> even move backward if it was set when busy but the cpu is now
>>> idle
>>
>> Sure, but I think it should at least be set ahead by get_sd_balance_interval()
>> into the future. Like so (untested)? Let me know what you think, and thanks!
>>
>> ---8<-----------------------
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a3318aeff9e8..0d6667d31c51 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>>  return interval;
>> }
>>
>> +/*
>> + * Update the next balance from newidle balance.
>> + * The update of next_balance from newidle balance tries to make sure that
>> + * we don't trigger periodic balance too far in the future on a now-idle
>> + * system. This is just like update_next_balance() except that since
>> + * sd->last_balance may not have been updated for a while, we're careful to
>> + * not set next_balance in the past.
>> + */
>> +static inline void
>> +update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance)
>> +{
>> + unsigned long interval, next;
>> +
>> + /* used by new idle balance, so cpu_busy = 0 */
>> + interval = get_sd_balance_interval(sd, 0);
>> + next = sd->last_balance + interval;
>> +
>> + next = max(next, jiffies + interval);
>> +
>> + if (time_after(*next_balance, next)) {
>> +  *next_balance = next;
>> + }
>> +}
>> +
>> static inline void
>> update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>> {
>> @@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>      (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>
>>  if (sd)
>> -  update_next_balance(sd, &next_balance);
>> +  update_next_balance_newidle(sd, &next_balance);
>>  rcu_read_unlock();
>>
>>  goto out;
>> @@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  int continue_balancing = 1;
>>  u64 domain_cost;
>>
>> - update_next_balance(sd, &next_balance);
>> + update_next_balance_newidle(sd, &next_balance);
>>
>>  if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>>   break;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e276d12c3cb..b147ad09126a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 if (!READ_ONCE(this_rq->rd->overload) ||
     (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
-
- if (sd)
-  update_next_balance(sd, &next_balance);
  rcu_read_unlock();
-
  goto out;
 }
 rcu_read_unlock();
@@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
  int continue_balancing = 1;
  u64 domain_cost;
- update_next_balance(sd, &next_balance);
-
  if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
   break;
@@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
  t1 = sched_clock_cpu(this_cpu);
  domain_cost = t1 - t0;
  update_newidle_cost(sd, domain_cost);
+ sd->last_balance = jiffies;
+ update_next_balance(sd, &next_balance);
  curr_cost += domain_cost;
  t0 = t1;