Message ID | 20231005161727.1855004-1-joel@joelfernandes.org |
---|---|
State | New |
Headers |
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
To: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>
Cc: Vineeth Pillai <vineethrp@google.com>, Suleiman Souhlal <suleiman@google.com>, Hsin Yi <hsinyi@google.com>, Frederic Weisbecker <frederic@kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Joel Fernandes <joel@joelfernandes.org>
Subject: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB
Date: Thu, 5 Oct 2023 16:17:26 +0000
Message-ID: <20231005161727.1855004-1-joel@joelfernandes.org> |
Series | [RFC] sched/fair: Avoid unnecessary IPIs for ILB |
Commit Message
Joel Fernandes
Oct. 5, 2023, 4:17 p.m. UTC
From: Vineeth Pillai <vineethrp@google.com>

Whenever a CPU stops its tick, it now requires another idle CPU to handle the balancing for it, because it can't perform its own periodic load balancing. This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if the upcoming nohz-idle load balancing is too distant in the future. This update is done by triggering an ILB, as the general ILB handler (_nohz_idle_balance) that manages regular nohz balancing also refreshes 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs and selecting the smallest value.

Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This primarily results in the ILB handler updating 'nohz.next_balance' while possibly not doing any load balancing at all. However, sending an IPI merely to refresh 'nohz.next_balance' seems excessive, and there ought to be a more efficient way to update 'nohz.next_balance' from the local CPU.

Fortunately, there already exists a mechanism to directly invoke the ILB handler (_nohz_idle_balance) without initiating an IPI: setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" balancing and solely exists to update a CPU's blocked load if it couldn't pull more tasks during regular "newly idle" balancing - and it does so without having to send any IPIs. Once the flag is set, the ILB handler is called directly from do_idle()->nohz_run_idle_balance(). While its goal is to update the blocked load without an IPI, in our situation we aim to refresh 'nohz.next_balance' without an IPI, so we can piggyback on this.

So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK flag to indicate that nohz.next_balance needs an update via this direct-call shortcut. Note that we set this flag without knowing that the tick is about to be stopped, because at the point we do it, we have no way of knowing that. However, we do know that the CPU is about to enter idle. In our testing, the reduction in IPIs is well worth updating nohz.next_balance a few more times.

Also, just to note, without this patch we observe the following pattern:

1. A CPU is about to stop its tick.
2. It sets nohz.needs_update to 1.
3. It then stops its tick and goes idle.
4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed.
5. The ILB CPU ends up being the one that just stopped its tick!
6. This results in an IPI to the tick-stopped CPU, which ends up waking it up and disturbing it!

Testing shows a considerable reduction in IPIs when doing this. Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4-vCPU VM, the IPI call count profiled over a 10s period is as follows:

    without fix: ~10500
    with fix:    ~1000

Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle")

[ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ]

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Hsin Yi <hsinyi@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Vineeth Pillai <vineethrp@google.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)
Comments
* Joel Fernandes (Google) <joel@joelfernandes.org> wrote:

> From: Vineeth Pillai <vineethrp@google.com>
>
> Whenever a CPU stops its tick, it now requires another idle CPU to handle
> the balancing for it because it can't perform its own periodic load
> balancing.
> [...snip...]
>
>  kernel/sched/fair.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)

Ok, judging by your IPI reduction numbers this is definitely an optimization we want to do.

The patch does make _nohz_idle_balance() run more parallel, as previously it would generally be run by the first-idle CPU in nohz.idle_cpus_mask (at least for next_balance updates), but I think it's still SMP-safe, as all key data structure updates are already rq-locked AFAICS.

One thing I noticed is that we now use nohz.needs_update in only a single remaining case, when _nohz_idle_balance() "self-defers":

	/*
	 * If this CPU gets work to do, stop the load balancing
	 * work being done for other CPUs. Next load
	 * balancing owner will pick it up.
	 */
	if (need_resched()) {
		if (flags & NOHZ_STATS_KICK)
			has_blocked_load = true;
		if (flags & NOHZ_NEXT_KICK)
			WRITE_ONCE(nohz.needs_update, 1);
		goto abort;
	}

Getting a need-resched flag set on this CPU is a pretty dubious reason to skip an ILB run IMO, and we could do entirely without that complication, allowing us to remove the nohz.needs_update flag handling logic altogether?

If we do that then the !need_resched() check needs to go from nohz_run_idle_balance() too:

	/*
	 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
	 * (ie NOHZ_STATS_KICK set) and will do the same.
	 */
	if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
	    !need_resched())
		_nohz_idle_balance(cpu_rq(cpu), flags);

... if I'm reading the code right, that is.

Thanks,

	Ingo
On Thu, 5 Oct 2023 at 18:17, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
>
> From: Vineeth Pillai <vineethrp@google.com>
>
> Whenever a CPU stops its tick, it now requires another idle CPU to handle the
> balancing for it because it can't perform its own periodic load balancing.
> [...snip...]
>
> @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
>         /*
>          * Ensures that if nohz_idle_balance() fails to observe our
>          * @idle_cpus_mask store, it must observe the @has_blocked
> -        * and @needs_update stores.
> +        * stores.
>          */
>         smp_mb__after_atomic();
>
>         set_cpu_sd_state_idle(cpu);
>
> -       WRITE_ONCE(nohz.needs_update, 1);

The set of needs_update here is the main way to set nohz.needs_update and trigger an update of next_balance, so it would be good to explain why we still need to keep it instead of removing it entirely.

>  /*
> - * Check if we need to run the ILB for updating blocked load before entering
> - * idle state.
> + * Check if we need to run the ILB for updating blocked load and/or updating
> + * nohz.next_balance before entering idle state.
>   */
>  void nohz_run_idle_balance(int cpu)
>  {
>         unsigned int flags;
>
> -       flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
> +       flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
> +
> +       if (!flags)
> +               return;
>
>         /*
>          * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
>          * (ie NOHZ_STATS_KICK set) and will do the same.
>          */
> -       if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> -               _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> +       if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> +           !need_resched())
> +               _nohz_idle_balance(cpu_rq(cpu), flags);

This breaks the update of the blocked load. nohz_newidle_balance() only sets NOHZ_NEWILB_KICK (and not NOHZ_STATS_KICK) when it wants to update blocked load before going idle, but it then sets NOHZ_STATS_KICK when calling _nohz_idle_balance(). We only clear NOHZ_NEWILB_KICK when fetching flags, but we test whether other bits have been set by a pending softirq, which will do the same. That's why we use NOHZ_NEWILB_KICK and not NOHZ_STATS_KICK directly. Similarly, you can't directly use NOHZ_NEXT_KICK.

Also note that _nohz_idle_balance() is not robust against parallel runs.

> +       /* If rq->next_balance before nohz.next_balance, trigger ILB */
> +       if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
> +               atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
> [...snip...]
> --
> 2.42.0.609.gbb76f46606-goog
On Fri, Oct 06, 2023 at 12:51:41PM +0200, Ingo Molnar wrote:
>
> * Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
>
> > From: Vineeth Pillai <vineethrp@google.com>
> >
> > Whenever a CPU stops its tick, it now requires another idle CPU to handle the
> > balancing for it because it can't perform its own periodic load balancing.
> > [...snip...]
>
> Ok, judging by your IPI reduction numbers this is definitely an
> optimization we want to do.

Great, thanks.

> The patch does make _nohz_idle_balance() run more parallel, as previously
> it would generally be run by the first-idle CPU in nohz.idle_cpus_mask (at
> least for next_balance updates), but I think it's still SMP-safe, as all
> key data structure updates are already rq-locked AFAICS.

Yes, true. We are looking into the parallelism aspect more and will update on how it goes. Ideally, we'd like to ensure that nohz.next_balance is set to the earliest rq->next_balance even in the presence of concurrency. Theoretically, we feel the parallelism should not increase beyond what the current code allows, but we'll look more into it.

> One thing I noticed is that we now use nohz.needs_update in only a single
> remaining case, when _nohz_idle_balance() "self-defers":
> [...snip...]
>
> Getting a need-resched flag set on this CPU is a pretty dubious reason to
> skip an ILB run IMO, and we could do entirely without that complication,
> allowing us to remove the nohz.needs_update flag handling logic altogether?

Yes, you are right, I think. We can continue doing the ILB run in the case where we are only doing the lighter-weight stats update and not full-blown idle balancing:

	if (need_resched() && (flags & NOHZ_BALANCE_KICK))
		goto abort;

That way we can get rid of the needs_update variable as well, as you and Vincent pointed out. We could also add this as a separate patch in a series. Thanks for pointing out this idea.

> If we do that then the !need_resched() check needs to go from
> nohz_run_idle_balance() too:
> [...snip...]
>
> ... if I'm reading the code right that is.

Yes, that sounds right. We will work more on this and post the next revision soon. Thank you!

 - Joel & Vineeth
Hi Vincent,

On Fri, Oct 06, 2023 at 03:46:44PM +0200, Vincent Guittot wrote:
[...]
> >         set_cpu_sd_state_idle(cpu);
> >
> > -       WRITE_ONCE(nohz.needs_update, 1);
>
> The set of needs_update here is the main way to set nohz.needs_update
> and trigger an update of next_balance, so it would be good to explain
> why we still need to keep it instead of removing it entirely.

Ok, we are thinking of getting rid of it as Ingo suggested.

> > -       if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> > -               _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> > +       if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> > +           !need_resched())
> > +               _nohz_idle_balance(cpu_rq(cpu), flags);
>
> This breaks the update of the blocked load. nohz_newidle_balance() only
> sets NOHZ_NEWILB_KICK (and not NOHZ_STATS_KICK) when it wants to update
> blocked load before going idle, but it then sets NOHZ_STATS_KICK when
> calling _nohz_idle_balance(). We only clear NOHZ_NEWILB_KICK when
> fetching flags, but we test whether other bits have been set by a
> pending softirq which will do the same.

Yes, we realized this just after sending the RFC. Will fix, thank you!

> That's why we use NOHZ_NEWILB_KICK and not NOHZ_STATS_KICK directly.
> Similarly you can't directly use NOHZ_NEXT_KICK.

This is not an issue actually, as NOHZ_NEXT_KICK is only set by this path (nohz_newidle_balance()) after this patch. The NEWILB_KICK and STATS_KICK case is different, where it can be updated in more than one path. Right?

> Also note that _nohz_idle_balance() is not robust against parallel runs.

True, though the parallelism is already happening. However, your point is well taken and we'll try to improve the existing code to make it more robust as well (if needed). Will dig deeper into it.

thanks,

 - Joel & Vineeth
On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: > From: Vineeth Pillai <vineethrp@google.com> > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > balancing for it because it can't perform its own periodic load balancing. > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > the upcoming nohz-idle load balancing is too distant in the future. This update > process is done by triggering an ILB, as the general ILB handler > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > and selecting the smallest value. > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > primarily results in the ILB handler updating 'nohz.next_balance' while > possibly not doing any load balancing at all. However, sending an IPI merely to > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > efficient method to update 'nohz.next_balance' from the local CPU. > > Fortunately, there already exists a mechanism to directly invoke the ILB > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > balancing and solely exists to update a CPU's blocked load if it couldn't pull > more tasks during regular "newly idle balancing" - and it does so without > having to send any IPIs. Once the flag is set, the ILB handler is called > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > the blocked load without an IPI, in our situation, we aim to refresh > 'nohz.next_balance' without an IPI but we can piggy back on this. > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > indicate nohz.next_balance needs an update via this direct call shortcut. 
Note > that we set this flag without knowledge that the tick is about to be stopped, > because at the point we do it, we have no way of knowing that. However we do > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > is well worth updating nohz.next_balance a few more times. > > Also just to note, without this patch we observe the following pattern: > > 1. A CPU is about to stop its tick. > 2. It sets nohz.needs_update to 1. > 3. It then stops its tick and goes idle. > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > 5. The ILB CPU ends up being the one that just stopped its tick! > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > and disturbing it! > > Testing shows a considerable reduction in IPIs when doing this: > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > the IPI call count profiled over 10s period is as follows: > without fix: ~10500 > with fix: ~1000 > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] Hi Joel/Vineeth. It's an interesting patch. Gave this patch a try on a powerpc system with 96 CPUs (12 cores, SMT8). Was debugging an issue where the ILB count goes up significantly at a specific busy CPU count. Haven't yet found out why. It's WIP. Was curious if this patch would address that issue. Cloned the rt-tests repo, ran the same cyclictest command, and collected softirq counts using a bcc tool. That count remains more or less the same with the patch. Is what I am checking incorrect? Any other way to check IPI count? 
         base 6.6_rc4   +patch
block    31.00          48.86
net_rx   475.90         348.90
timer    2213.20        2405.00
rcu      33057.30       34738.10
sched    175904.70      169695.60

> > Cc: Suleiman Souhlal <suleiman@google.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Hsin Yi <hsinyi@google.com> > Cc: Frederic Weisbecker <frederic@kernel.org> > Cc: Paul E. McKenney <paulmck@kernel.org> > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > kernel/sched/fair.c | 21 ++++++++++++++------- > 1 file changed, 14 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index cb225921bbca..2ece55f32782 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > /* > * Ensures that if nohz_idle_balance() fails to observe our > * @idle_cpus_mask store, it must observe the @has_blocked > - * and @needs_update stores. > + * stores. > */ > smp_mb__after_atomic(); > > set_cpu_sd_state_idle(cpu); > > - WRITE_ONCE(nohz.needs_update, 1); > out: > /* > * Each time a cpu enter idle, we assume that it has blocked load and > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > } > > /* > - * Check if we need to run the ILB for updating blocked load before entering > - * idle state. > + * Check if we need to run the ILB for updating blocked load and/or updating > + * nohz.next_balance before entering idle state. 
> */ > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > + !need_resched()) > + _nohz_idle_balance(cpu_rq(cpu), flags); > } > > static void nohz_newidle_balance(struct rq *this_rq) > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > if (this_rq->avg_idle < sysctl_sched_migration_cost) > return; > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > + > /* Don't need to update blocked load of idle CPUs*/ > if (!READ_ONCE(nohz.has_blocked) || > time_before(jiffies, READ_ONCE(nohz.next_blocked)))
On Thu, Oct 05, 2023 at 04:17:26PM +0000, Joel Fernandes (Google) wrote: > From: Vineeth Pillai <vineethrp@google.com> > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > balancing for it because it can't perform its own periodic load balancing. > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > the upcoming nohz-idle load balancing is too distant in the future. This update > process is done by triggering an ILB, as the general ILB handler > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > and selecting the smallest value. > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > primarily results in the ILB handler updating 'nohz.next_balance' while > possibly not doing any load balancing at all. However, sending an IPI merely to > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > efficient method to update 'nohz.next_balance' from the local CPU. > > Fortunately, there already exists a mechanism to directly invoke the ILB > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > balancing and solely exists to update a CPU's blocked load if it couldn't pull > more tasks during regular "newly idle balancing" - and it does so without > having to send any IPIs. Once the flag is set, the ILB handler is called > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > the blocked load without an IPI, in our situation, we aim to refresh > 'nohz.next_balance' without an IPI but we can piggy back on this. > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > indicate nohz.next_balance needs an update via this direct call shortcut. 
Note > that we set this flag without knowledge that the tick is about to be stopped, > because at the point we do it, we have no way of knowing that. However we do > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > is well worth updating nohz.next_balance a few more times. > > Also just to note, without this patch we observe the following pattern: > > 1. A CPU is about to stop its tick. > 2. It sets nohz.needs_update to 1. > 3. It then stops its tick and goes idle. > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > 5. The ILB CPU ends up being the one that just stopped its tick! > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > and disturbing it! > > Testing shows a considerable reduction in IPIs when doing this: > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > the IPI call count profiled over 10s period is as follows: > without fix: ~10500 > with fix: ~1000 > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently broken -- this is a pure optimization question, no?
On 10/7/23 12:48 AM, Shrikanth Hegde wrote: > > > On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: >> From: Vineeth Pillai <vineethrp@google.com> >> >> Whenever a CPU stops its tick, it now requires another idle CPU to handle the >> balancing for it because it can't perform its own periodic load balancing. >> This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if >> the upcoming nohz-idle load balancing is too distant in the future. This update >> process is done by triggering an ILB, as the general ILB handler >> (_nohz_idle_balance) that manages regular nohz balancing also refreshes >> 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs >> and selecting the smallest value. >> >> Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This >> primarily results in the ILB handler updating 'nohz.next_balance' while >> possibly not doing any load balancing at all. However, sending an IPI merely to >> refresh 'nohz.next_balance' seems excessive, and there ought to be a more >> efficient method to update 'nohz.next_balance' from the local CPU. >> >> Fortunately, there already exists a mechanism to directly invoke the ILB >> handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by >> setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" >> balancing and solely exists to update a CPU's blocked load if it couldn't pull >> more tasks during regular "newly idle balancing" - and it does so without >> having to send any IPIs. Once the flag is set, the ILB handler is called >> directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update >> the blocked load without an IPI, in our situation, we aim to refresh >> 'nohz.next_balance' without an IPI but we can piggy back on this. >> >> So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to >> indicate nohz.next_balance needs an update via this direct call shortcut. 
Note >> that we set this flag without knowledge that the tick is about to be stopped, >> because at the point we do it, we have no way of knowing that. However we do >> know that the CPU is about to enter idle. In our testing, the reduction in IPIs >> is well worth updating nohz.next_balance a few more times. >> >> Also just to note, without this patch we observe the following pattern: >> >> 1. A CPU is about to stop its tick. >> 2. It sets nohz.needs_update to 1. >> 3. It then stops its tick and goes idle. >> 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. >> 5. The ILB CPU ends up being the one that just stopped its tick! >> 6. This results in an IPI to the tick-stopped CPU which ends up waking it up >> and disturbing it! >> >> Testing shows a considerable reduction in IPIs when doing this: >> >> Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM >> the IPI call count profiled over 10s period is as follows: >> without fix: ~10500 >> with fix: ~1000 >> >> Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") >> >> [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > Hi Joel/Vineeth. > > Its an interesting patch. > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > Was debugging an issue where ILB count goes up significantly at a specific > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > would address that issue. > > cloned rt-test repo and ran same cyclictest command and collected > softirq's count using bcc tool. That count remains same more or less with patch. > Is what I am checking incorrect? Any other way to check IPI count? > > base 6.6_rc4 +patch > > block 31.00 48.86 > net_rx 475.90 348.90 > timer 2213.20 2405.00 > rcu 33057.30 34738.10 > sched 175904.70 169695.60 > Ah, there is hardirq which shows IPI count. Didnt think of it. 
This is the average of 10 runs where hardirq counts were collected over 10s while running cyclictest. This shows a nice improvement. In base 6.6 there were a few instances where the number of IPIs was much higher.

         base 6.6_rc4   +patch
IPI-1    2741.1         1382.3

> >> >> Cc: Suleiman Souhlal <suleiman@google.com> >> Cc: Steven Rostedt <rostedt@goodmis.org> >> Cc: Hsin Yi <hsinyi@google.com> >> Cc: Frederic Weisbecker <frederic@kernel.org> >> Cc: Paul E. McKenney <paulmck@kernel.org> >> Signed-off-by: Vineeth Pillai <vineethrp@google.com> >> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> >> --- >> kernel/sched/fair.c | 21 ++++++++++++++------- >> 1 file changed, 14 insertions(+), 7 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index cb225921bbca..2ece55f32782 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) >> /* >> * Ensures that if nohz_idle_balance() fails to observe our >> * @idle_cpus_mask store, it must observe the @has_blocked >> - * and @needs_update stores. >> + * stores. >> */ >> smp_mb__after_atomic(); >> >> set_cpu_sd_state_idle(cpu); >> >> - WRITE_ONCE(nohz.needs_update, 1); >> out: >> /* >> * Each time a cpu enter idle, we assume that it has blocked load and >> @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) >> } >> >> /* >> - * Check if we need to run the ILB for updating blocked load before entering >> - * idle state. >> + * Check if we need to run the ILB for updating blocked load and/or updating >> + * nohz.next_balance before entering idle state. 
>> */ >> void nohz_run_idle_balance(int cpu) >> { >> unsigned int flags; >> >> - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); >> + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); >> + >> + if (!flags) >> + return; >> >> /* >> * Update the blocked load only if no SCHED_SOFTIRQ is about to happen >> * (ie NOHZ_STATS_KICK set) and will do the same. >> */ >> - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) >> - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); >> + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && >> + !need_resched()) >> + _nohz_idle_balance(cpu_rq(cpu), flags); >> } >> >> static void nohz_newidle_balance(struct rq *this_rq) >> @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) >> if (this_rq->avg_idle < sysctl_sched_migration_cost) >> return; >> >> + /* If rq->next_balance before nohz.next_balance, trigger ILB */ >> + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) >> + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); >> + >> /* Don't need to update blocked load of idle CPUs*/ >> if (!READ_ONCE(nohz.has_blocked) || >> time_before(jiffies, READ_ONCE(nohz.next_blocked)))
Hi Shrikanth, > Hi Joel/Vineeth. > > Its an interesting patch. > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > Was debugging an issue where ILB count goes up significantly at a specific > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > would address that issue. > Increased SCHED_SOFTIRQ was another problem that we noticed and we have a separate fix for that in the chromeos tree. We were planning to send the patch after a bit more testing. You could try the patch and see if it solves the problem that you are seeing: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/918f229ba2ff720f5dfae4768469acbb6ea39fe2 Thanks
Hi Peter, On Fri, Oct 06, 2023 at 10:01:29PM +0200, Peter Zijlstra wrote: > On Thu, Oct 05, 2023 at 04:17:26PM +0000, Joel Fernandes (Google) wrote: > > From: Vineeth Pillai <vineethrp@google.com> > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > balancing for it because it can't perform its own periodic load balancing. > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > the upcoming nohz-idle load balancing is too distant in the future. This update > > process is done by triggering an ILB, as the general ILB handler > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > and selecting the smallest value. > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > primarily results in the ILB handler updating 'nohz.next_balance' while > > possibly not doing any load balancing at all. However, sending an IPI merely to > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > more tasks during regular "newly idle balancing" - and it does so without > > having to send any IPIs. Once the flag is set, the ILB handler is called > > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > > the blocked load without an IPI, in our situation, we aim to refresh > > 'nohz.next_balance' without an IPI but we can piggy back on this. 
> > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > that we set this flag without knowledge that the tick is about to be stopped, > > because at the point we do it, we have no way of knowing that. However we do > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > is well worth updating nohz.next_balance a few more times. > > > > Also just to note, without this patch we observe the following pattern: > > > > 1. A CPU is about to stop its tick. > > 2. It sets nohz.needs_update to 1. > > 3. It then stops its tick and goes idle. > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > 5. The ILB CPU ends up being the one that just stopped its tick! > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > and disturbing it! > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > the IPI call count profiled over 10s period is as follows: > > without fix: ~10500 > > with fix: ~1000 > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently broken -- this is a pure optimization question, no? IMHO it is a breakage -- a lot of times the ILB kicks the CPU that just stopped its tick back out of idle, effectively breaking NOHZ. The large number of IPIs also wrecks power and it happens only on 6.1 and after. Having the Fixes tag means it will also go to all stable kernels >= 6.1. Hope that sounds reasonable and thank you for taking a look! thanks, - Joel
On Sat, Oct 07, 2023 at 12:48:57AM +0530, Shrikanth Hegde wrote: > > > On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: > > From: Vineeth Pillai <vineethrp@google.com> > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > balancing for it because it can't perform its own periodic load balancing. > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > the upcoming nohz-idle load balancing is too distant in the future. This update > > process is done by triggering an ILB, as the general ILB handler > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > and selecting the smallest value. > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > primarily results in the ILB handler updating 'nohz.next_balance' while > > possibly not doing any load balancing at all. However, sending an IPI merely to > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > more tasks during regular "newly idle balancing" - and it does so without > > having to send any IPIs. Once the flag is set, the ILB handler is called > > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > > the blocked load without an IPI, in our situation, we aim to refresh > > 'nohz.next_balance' without an IPI but we can piggy back on this. 
> > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > that we set this flag without knowledge that the tick is about to be stopped, > > because at the point we do it, we have no way of knowing that. However we do > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > is well worth updating nohz.next_balance a few more times. > > > > Also just to note, without this patch we observe the following pattern: > > > > 1. A CPU is about to stop its tick. > > 2. It sets nohz.needs_update to 1. > > 3. It then stops its tick and goes idle. > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > 5. The ILB CPU ends up being the one that just stopped its tick! > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > and disturbing it! > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > the IPI call count profiled over 10s period is as follows: > > without fix: ~10500 > > with fix: ~1000 > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > Hi Joel/Vineeth. > > Its an interesting patch. > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > Was debugging an issue where ILB count goes up significantly at a specific > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > would address that issue. Do you see that only from a specific kernel version onward? Have you tried pre-6.1 kernels? > cloned rt-test repo and ran same cyclictest command and collected > softirq's count using bcc tool. That count remains same more or less with patch. > Is what I am checking incorrect? 
Any other way to check IPI count? > > base 6.6_rc4 +patch > > block 31.00 48.86 > net_rx 475.90 348.90 > timer 2213.20 2405.00 > rcu 33057.30 34738.10 > sched 175904.70 169695.60 So, we ran "perf stat -e irq:softirq_*" and counted those. For the sched softirq, a majority of them were raised from the ILB path. This gave a clue that it must be a flood of ILB. For IPI counts, we run "perf record -a -g" and then "perf script" and look for call stacks involving an smp_call. There's also /proc/interrupts that can read out IPI counts. We also used trace_printk in the ILB path, collected a trace, and count how many times the ILB happens. Vineeth collected the final data in this patch so I'll let him add any details if he used some other method. thanks, - Joel > > > > > > Cc: Suleiman Souhlal <suleiman@google.com> > > Cc: Steven Rostedt <rostedt@goodmis.org> > > Cc: Hsin Yi <hsinyi@google.com> > > Cc: Frederic Weisbecker <frederic@kernel.org> > > Cc: Paul E. McKenney <paulmck@kernel.org> > > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > kernel/sched/fair.c | 21 ++++++++++++++------- > > 1 file changed, 14 insertions(+), 7 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index cb225921bbca..2ece55f32782 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > > /* > > * Ensures that if nohz_idle_balance() fails to observe our > > * @idle_cpus_mask store, it must observe the @has_blocked > > - * and @needs_update stores. > > + * stores. 
> > */ > > smp_mb__after_atomic(); > > > > set_cpu_sd_state_idle(cpu); > > > > - WRITE_ONCE(nohz.needs_update, 1); > > out: > > /* > > * Each time a cpu enter idle, we assume that it has blocked load and > > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > > } > > > > /* > > - * Check if we need to run the ILB for updating blocked load before entering > > - * idle state. > > + * Check if we need to run the ILB for updating blocked load and/or updating > > + * nohz.next_balance before entering idle state. > > */ > > void nohz_run_idle_balance(int cpu) > > { > > unsigned int flags; > > > > - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); > > + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); > > + > > + if (!flags) > > + return; > > > > /* > > * Update the blocked load only if no SCHED_SOFTIRQ is about to happen > > * (ie NOHZ_STATS_KICK set) and will do the same. > > */ > > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > > + !need_resched()) > > + _nohz_idle_balance(cpu_rq(cpu), flags); > > } > > > > static void nohz_newidle_balance(struct rq *this_rq) > > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > > if (this_rq->avg_idle < sysctl_sched_migration_cost) > > return; > > > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > > + > > /* Don't need to update blocked load of idle CPUs*/ > > if (!READ_ONCE(nohz.has_blocked) || > > time_before(jiffies, READ_ONCE(nohz.next_blocked)))
On Sat, Oct 07, 2023 at 01:40:53AM +0530, Shrikanth Hegde wrote: > > > On 10/7/23 12:48 AM, Shrikanth Hegde wrote: > > > > > > On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: > >> From: Vineeth Pillai <vineethrp@google.com> > >> > >> Whenever a CPU stops its tick, it now requires another idle CPU to handle the > >> balancing for it because it can't perform its own periodic load balancing. > >> This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > >> the upcoming nohz-idle load balancing is too distant in the future. This update > >> process is done by triggering an ILB, as the general ILB handler > >> (_nohz_idle_balance) that manages regular nohz balancing also refreshes > >> 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > >> and selecting the smallest value. > >> > >> Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > >> primarily results in the ILB handler updating 'nohz.next_balance' while > >> possibly not doing any load balancing at all. However, sending an IPI merely to > >> refresh 'nohz.next_balance' seems excessive, and there ought to be a more > >> efficient method to update 'nohz.next_balance' from the local CPU. > >> > >> Fortunately, there already exists a mechanism to directly invoke the ILB > >> handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > >> setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > >> balancing and solely exists to update a CPU's blocked load if it couldn't pull > >> more tasks during regular "newly idle balancing" - and it does so without > >> having to send any IPIs. Once the flag is set, the ILB handler is called > >> directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > >> the blocked load without an IPI, in our situation, we aim to refresh > >> 'nohz.next_balance' without an IPI but we can piggy back on this. 
> >> > >> So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > >> indicate nohz.next_balance needs an update via this direct call shortcut. Note > >> that we set this flag without knowledge that the tick is about to be stopped, > >> because at the point we do it, we have no way of knowing that. However we do > >> know that the CPU is about to enter idle. In our testing, the reduction in IPIs > >> is well worth updating nohz.next_balance a few more times. > >> > >> Also just to note, without this patch we observe the following pattern: > >> > >> 1. A CPU is about to stop its tick. > >> 2. It sets nohz.needs_update to 1. > >> 3. It then stops its tick and goes idle. > >> 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > >> 5. The ILB CPU ends up being the one that just stopped its tick! > >> 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > >> and disturbing it! > >> > >> Testing shows a considerable reduction in IPIs when doing this: > >> > >> Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > >> the IPI call count profiled over 10s period is as follows: > >> without fix: ~10500 > >> with fix: ~1000 > >> > >> Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > >> > >> [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > > > Hi Joel/Vineeth. > > > > Its an interesting patch. > > > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > > Was debugging an issue where ILB count goes up significantly at a specific > > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > > would address that issue. > > > > cloned rt-test repo and ran same cyclictest command and collected > > softirq's count using bcc tool. That count remains same more or less with patch. > > Is what I am checking incorrect? Any other way to check IPI count? 
> >
> >          base 6.6_rc4   +patch
> >
> > block          31.00      48.86
> > net_rx        475.90     348.90
> > timer        2213.20    2405.00
> > rcu         33057.30   34738.10
> > sched      175904.70  169695.60
>
> Ah, there is hardirq, which shows the IPI count. Didn't think of it.
> This is the average of 10 runs where hardirq was collected over a 10s window
> while running cyclictest. This shows a nice improvement. In base 6.6 there
> were a few instances where the number of IPIs was much higher.
>
>          base 6.6_rc4   +patch
> IPI-1        2741.1     1382.3

Very cool! So I'll go ahead and add this data as well for the next revision.
(I hope to post a new version in a few days after addressing all the review
comments; I am unfortunately a bit slow this week due to travel and other
things.)

Thanks,

 - Joel

> >
> >>
> >> Cc: Suleiman Souhlal <suleiman@google.com>
> >> Cc: Steven Rostedt <rostedt@goodmis.org>
> >> Cc: Hsin Yi <hsinyi@google.com>
> >> Cc: Frederic Weisbecker <frederic@kernel.org>
> >> Cc: Paul E. McKenney <paulmck@kernel.org>
> >> Signed-off-by: Vineeth Pillai <vineethrp@google.com>
> >> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>  kernel/sched/fair.c | 21 ++++++++++++++-------
> >>  1 file changed, 14 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index cb225921bbca..2ece55f32782 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
> >>  	/*
> >>  	 * Ensures that if nohz_idle_balance() fails to observe our
> >>  	 * @idle_cpus_mask store, it must observe the @has_blocked
> >> -	 * and @needs_update stores.
> >> +	 * stores.
> >>  	 */
> >>  	smp_mb__after_atomic();
> >>
> >>  	set_cpu_sd_state_idle(cpu);
> >>
> >> -	WRITE_ONCE(nohz.needs_update, 1);
> >>  out:
> >>  	/*
> >>  	 * Each time a cpu enter idle, we assume that it has blocked load and
> >> @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> >>  }
> >>
> >>  /*
> >> - * Check if we need to run the ILB for updating blocked load before entering
> >> - * idle state.
> >> + * Check if we need to run the ILB for updating blocked load and/or updating
> >> + * nohz.next_balance before entering idle state.
> >>   */
> >>  void nohz_run_idle_balance(int cpu)
> >>  {
> >>  	unsigned int flags;
> >>
> >> -	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
> >> +	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
> >> +
> >> +	if (!flags)
> >> +		return;
> >>
> >>  	/*
> >>  	 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
> >>  	 * (ie NOHZ_STATS_KICK set) and will do the same.
> >>  	 */
> >> -	if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> >> -		_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> >> +	if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> >> +	    !need_resched())
> >> +		_nohz_idle_balance(cpu_rq(cpu), flags);
> >>  }
> >>
> >>  static void nohz_newidle_balance(struct rq *this_rq)
> >> @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >>  	if (this_rq->avg_idle < sysctl_sched_migration_cost)
> >>  		return;
> >>
> >> +	/* If rq->next_balance before nohz.next_balance, trigger ILB */
> >> +	if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
> >> +		atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
> >> +
> >>  	/* Don't need to update blocked load of idle CPUs*/
> >>  	if (!READ_ONCE(nohz.has_blocked) ||
> >>  	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
On Fri, Oct 06, 2023 at 12:51:41PM +0200, Ingo Molnar wrote: > > * Joel Fernandes (Google) <joel@joelfernandes.org> wrote: > > > From: Vineeth Pillai <vineethrp@google.com> > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > balancing for it because it can't perform its own periodic load balancing. > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > the upcoming nohz-idle load balancing is too distant in the future. This update > > process is done by triggering an ILB, as the general ILB handler > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > and selecting the smallest value. > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > primarily results in the ILB handler updating 'nohz.next_balance' while > > possibly not doing any load balancing at all. However, sending an IPI merely to > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > more tasks during regular "newly idle balancing" - and it does so without > > having to send any IPIs. Once the flag is set, the ILB handler is called > > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > > the blocked load without an IPI, in our situation, we aim to refresh > > 'nohz.next_balance' without an IPI but we can piggy back on this. 
> > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > that we set this flag without knowledge that the tick is about to be stopped, > > because at the point we do it, we have no way of knowing that. However we do > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > is well worth updating nohz.next_balance a few more times. > > > > Also just to note, without this patch we observe the following pattern: > > > > 1. A CPU is about to stop its tick. > > 2. It sets nohz.needs_update to 1. > > 3. It then stops its tick and goes idle. > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > 5. The ILB CPU ends up being the one that just stopped its tick! > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > and disturbing it! > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > the IPI call count profiled over 10s period is as follows: > > without fix: ~10500 > > with fix: ~1000 > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > > > Cc: Suleiman Souhlal <suleiman@google.com> > > Cc: Steven Rostedt <rostedt@goodmis.org> > > Cc: Hsin Yi <hsinyi@google.com> > > Cc: Frederic Weisbecker <frederic@kernel.org> > > Cc: Paul E. 
McKenney <paulmck@kernel.org> > > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > kernel/sched/fair.c | 21 ++++++++++++++------- > > 1 file changed, 14 insertions(+), 7 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index cb225921bbca..2ece55f32782 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > > /* > > * Ensures that if nohz_idle_balance() fails to observe our > > * @idle_cpus_mask store, it must observe the @has_blocked > > - * and @needs_update stores. > > + * stores. > > */ > > smp_mb__after_atomic(); > > > > set_cpu_sd_state_idle(cpu); > > > > - WRITE_ONCE(nohz.needs_update, 1); > > out: > > /* > > * Each time a cpu enter idle, we assume that it has blocked load and > > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > > } > > > > /* > > - * Check if we need to run the ILB for updating blocked load before entering > > - * idle state. > > + * Check if we need to run the ILB for updating blocked load and/or updating > > + * nohz.next_balance before entering idle state. > > */ > > void nohz_run_idle_balance(int cpu) > > { > > unsigned int flags; > > > > - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); > > + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); > > + > > + if (!flags) > > + return; > > > > /* > > * Update the blocked load only if no SCHED_SOFTIRQ is about to happen > > * (ie NOHZ_STATS_KICK set) and will do the same. 
> > */ > > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > > + !need_resched()) > > + _nohz_idle_balance(cpu_rq(cpu), flags); > > } > > > > static void nohz_newidle_balance(struct rq *this_rq) > > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > > if (this_rq->avg_idle < sysctl_sched_migration_cost) > > return; > > > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > > + > > /* Don't need to update blocked load of idle CPUs*/ > > if (!READ_ONCE(nohz.has_blocked) || > > time_before(jiffies, READ_ONCE(nohz.next_blocked))) > > Ok, judging by your IPI reduction numbers this is definitely an > optimization we want to do. > > The patch does make _nohz_idle_balance() run more parallel, as previously > it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at > least for next_balance updates), but I think it's still SMP-safe, as all > key data structure updates are already rq-locked AFAICS. One thing I am confused about in the original code is: tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask. However, nohz_run_idle_balance() is called before that can happen, in do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is not yet in the mask. So will this code in _nohz_idle_balance() really run in such a scenario? if (flags & NOHZ_STATS_KICK) has_blocked_load |= update_nohz_stats(rq); AFAICS, this loop may not select the CPU due to its absence from the mask: for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1) I must be missing something. I'll go trace this path later as well. thanks, - Joel
* Joel Fernandes <joel@joelfernandes.org> wrote:

> > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle")
> >
> > Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently
> > broken -- this is a pure optimization question, no?
>
> IMHO it is a breakage, as it breaks NOHZ -- a lot of the time the ILB kicks
> the CPU that is stopping its tick back out of idle (effectively breaking
> NOHZ). The large number of IPIs also wrecks power, and it happens only on
> 6.1 and after. Having the Fixes tag means it will also go to all stable
> kernels >= 6.1. Hope that sounds reasonable, and thank you for taking a look!

So it's basically a fix of a NOHZ performance regression, introduced by
7fd7a9e0caba or so, correct?

As long as the fixes have a good hope of being backported with a low amount
of overhead, a Fixes: tag for a ~2-year-old performance regression is
unusual but not unprecedented.

We just need to make sure we don't put too much of a burden on the
shoulders of -stable maintainers ...

Thanks,

	Ingo
On Sun, Oct 8, 2023 at 1:35 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> [...snip...]
> > The patch does make _nohz_idle_balance() run more parallel, as previously
> > it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at
> > least for next_balance updates), but I think it's still SMP-safe, as all
> > key data structure updates are already rq-locked AFAICS.
>
> One thing I am confused about in the original code is:
>
> tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask.
> However, nohz_run_idle_balance() is called before that can happen, in
> do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is
> not yet in the mask.
>
> So will this code in _nohz_idle_balance() really run in such a scenario?
>
> if (flags & NOHZ_STATS_KICK)
>         has_blocked_load |= update_nohz_stats(rq);
>
> AFAICS, this loop may not select the CPU due to its absence from the mask:
> for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1)
>

I have traced this a bit further. As Joel mentioned, nohz.idle_cpus_mask
shouldn't contain this cpu when nohz_run_idle_balance() is called from
do_idle(), but in tracing I have seen that it often does, mostly with
HIGHRES. And I feel this is a bug.

We call nohz_balance_enter_idle() when we turn off the tick, but we don't
always call nohz_balance_exit_idle() when we turn the tick back on; we call
it only on the next tick on this cpu, in nohz_balancer_kick(). If a wakeup
happens on this cpu while the tick is off, we re-enable the tick but do not
remove ourselves from nohz.idle_cpus_mask. So the ILB will consider this cpu
a valid pick until the next tick on this cpu, where it gets removed. I am
not sure if this is intentional.

If this is a bug and we fix it by calling nohz_balance_exit_idle() during
restart_tick, then we probably would not need the NOHZ_NEWIDLE_KICK flag and
could use NOHZ_STATS_KICK, as there would not be any overlap between
nohz_run_idle_balance() and nohz_idle_balance().
Thanks, Vineeth
On Mon, 9 Oct 2023 13:25:23 +0200 Ingo Molnar <mingo@kernel.org> wrote: > We just need to make sure we don't put too much of a burden on the > shoulders of -stable maintainers ... My experience is that changes like this usually find their way into stable with or without the Fixes tag. -- Steve
On Sun, 8 Oct 2023 at 19:35, Joel Fernandes <joel@joelfernandes.org> wrote: > > On Fri, Oct 06, 2023 at 12:51:41PM +0200, Ingo Molnar wrote: > > > > * Joel Fernandes (Google) <joel@joelfernandes.org> wrote: > > > > > From: Vineeth Pillai <vineethrp@google.com> > > > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > > balancing for it because it can't perform its own periodic load balancing. > > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > > the upcoming nohz-idle load balancing is too distant in the future. This update > > > process is done by triggering an ILB, as the general ILB handler > > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > > and selecting the smallest value. > > > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > > primarily results in the ILB handler updating 'nohz.next_balance' while > > > possibly not doing any load balancing at all. However, sending an IPI merely to > > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > > more tasks during regular "newly idle balancing" - and it does so without > > > having to send any IPIs. Once the flag is set, the ILB handler is called > > > directly from do_idle()-> nohz_run_idle_balance(). 
While its goal is to update > > > the blocked load without an IPI, in our situation, we aim to refresh > > > 'nohz.next_balance' without an IPI but we can piggy back on this. > > > > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > > that we set this flag without knowledge that the tick is about to be stopped, > > > because at the point we do it, we have no way of knowing that. However we do > > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > > is well worth updating nohz.next_balance a few more times. > > > > > > Also just to note, without this patch we observe the following pattern: > > > > > > 1. A CPU is about to stop its tick. > > > 2. It sets nohz.needs_update to 1. > > > 3. It then stops its tick and goes idle. > > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > > 5. The ILB CPU ends up being the one that just stopped its tick! > > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > > and disturbing it! > > > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > > the IPI call count profiled over 10s period is as follows: > > > without fix: ~10500 > > > with fix: ~1000 > > > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > > > > > Cc: Suleiman Souhlal <suleiman@google.com> > > > Cc: Steven Rostedt <rostedt@goodmis.org> > > > Cc: Hsin Yi <hsinyi@google.com> > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > > Cc: Paul E. 
McKenney <paulmck@kernel.org> > > > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > --- > > > kernel/sched/fair.c | 21 ++++++++++++++------- > > > 1 file changed, 14 insertions(+), 7 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index cb225921bbca..2ece55f32782 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > > > /* > > > * Ensures that if nohz_idle_balance() fails to observe our > > > * @idle_cpus_mask store, it must observe the @has_blocked > > > - * and @needs_update stores. > > > + * stores. > > > */ > > > smp_mb__after_atomic(); > > > > > > set_cpu_sd_state_idle(cpu); > > > > > > - WRITE_ONCE(nohz.needs_update, 1); > > > out: > > > /* > > > * Each time a cpu enter idle, we assume that it has blocked load and > > > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > > > } > > > > > > /* > > > - * Check if we need to run the ILB for updating blocked load before entering > > > - * idle state. > > > + * Check if we need to run the ILB for updating blocked load and/or updating > > > + * nohz.next_balance before entering idle state. > > > */ > > > void nohz_run_idle_balance(int cpu) > > > { > > > unsigned int flags; > > > > > > - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); > > > + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); > > > + > > > + if (!flags) > > > + return; > > > > > > /* > > > * Update the blocked load only if no SCHED_SOFTIRQ is about to happen > > > * (ie NOHZ_STATS_KICK set) and will do the same. 
> > > */ > > > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > > > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > > > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > > > + !need_resched()) > > > + _nohz_idle_balance(cpu_rq(cpu), flags); > > > } > > > > > > static void nohz_newidle_balance(struct rq *this_rq) > > > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > > > if (this_rq->avg_idle < sysctl_sched_migration_cost) > > > return; > > > > > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > > > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > > > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > > > + > > > /* Don't need to update blocked load of idle CPUs*/ > > > if (!READ_ONCE(nohz.has_blocked) || > > > time_before(jiffies, READ_ONCE(nohz.next_blocked))) > > > > Ok, judging by your IPI reduction numbers this is definitely an > > optimization we want to do. > > > > The patch does make _nohz_idle_balance() run more parallel, as previously > > it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at > > least for next_balance updates), but I think it's still SMP-safe, as all > > key data structure updates are already rq-locked AFAICS. > > One thing I am confused about in the original code is: > > tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask. > However, nohz_run_idle_balance() is called before that can happen, in > do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is > not yet in the mask. 2 things: - update of nohz.idle_cpus_mask is not strictly aligned with cpu entering/exiting idle state. It is set when entering but only cleared during the next tick on the cpu because nohz.idle_cpus_mask is a bottleneck when all CPUs enter/exit idle at high rate (usec). 
  This implies that a cpu entering idle can already be set in
  nohz.idle_cpus_mask.
- NOHZ_NEWILB_KICK is more about updating the blocked load of other,
  already-idle CPUs than that of the one entering idle, which has already
  updated its blocked load in newidle_balance().

The goal of nohz_run_idle_balance() is to run the ILB only to update the
blocked load of already-idle CPUs, without waking up one of those idle CPUs,
and outside the preempt-disable / irq-off phase of the local cpu about to
enter idle, because that can take a long time.

>
> So will this code in _nohz_idle_balance() really run in such a scenario?
>
> if (flags & NOHZ_STATS_KICK)
>         has_blocked_load |= update_nohz_stats(rq);
>
> AFAICS, this loop may not select the CPU due to its absence from the mask:
> for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1)
>
> I must be missing something. I'll go trace this path later as well.
>
> thanks,
>
> - Joel
>
On Mon, Oct 9, 2023 at 7:25 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > > > Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently > > > broken -- this is a pure optimization question, no? > > > > IMHO it is a breakage as it breaks NOHZ -- a lot of times the ILB kicks > > back the CPU stopping the tick out of idle (effectively breaking NOHZ). > > The large number of IPIs also wrecks power and it happens only on 6.1 and > > after. Having the fixes tag means it will also goto all stable kernels >= > > 6.1. Hope that sounds reasonable and thank you for taking a look! > > So it's basically a fix of a NOHZ performance regression, introduced by > 7fd7a9e0caba or so, correct? Yes. > As long as the fixes have a good hope of being backported with a low amount > of overhead, a Fixes: tag for a ~2 years old performance regression is > unusual but not unprecedented. > > We just need to make sure we don't put too much of a burden on the > shoulders of -stable maintainers ... Sure, that sounds good. After we upstream, I am happy to assist in any stable backporting work related to this as well. In any case, I have to backport it to ChromeOS kernels which are based on stable. There's only the 6.1 stable kernel at the moment that is affected. thanks, - Joel
On Tue, Oct 10, 2023 at 3:15 AM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Sun, 8 Oct 2023 at 19:35, Joel Fernandes <joel@joelfernandes.org> wrote: [...] > > One thing I am confused about in the original code is: > > > > tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask. > > However, nohz_run_idle_balance() is called before that can happen, in > > do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is > > not yet in the mask. > > 2 things: > - update of nohz.idle_cpus_mask is not strictly aligned with cpu > entering/exiting idle state. It is set when entering but only cleared > during the next tick on the cpu because nohz.idle_cpus_mask is a > bottleneck when all CPUs enter/exit idle at high rate (usec). This > implies that a cpu entering idle can already be set in > nohz.idle_cpus_mask > - NOHZ_NEWILB_KICK is more about updating blocked load of others > already idle CPUs than the one entering idle which has already updated > its blocked load in newidle_balance() > > The goal of nohz_run_idle_balance() is to run ILB only for updating > the blocked load of already idle CPUs without waking up one of those > idle CPUs and outside the preempt disable / irq off phase of the local > cpu about to enter idle because it can take a long time. This makes complete sense, thank you for the background on this! Vineeth was just telling me in a 1:1 that he also tried doing the removal of the CPU from the idle mask in the restart-tick path. The result was that even though the mask modification is not as often as when doing it during the CPU coming out of idle, it is still higher than just doing it from the next busy tick, like in current mainline. As next steps we are looking into: 1. Monitor how often we set NEXT_KICK -- we think we can reduce the frequency of these even more and keep the overhead low. 2. 
Look more into the parallelism of nohz.next_balance updates (due to our
NEXT_KICK setting) and handle any race conditions. We are at the moment
looking into whether nohz.next_balance can fail to get set to the earliest
value because of a race, and if so, retrying the operation. Something like
(untested):

	if (likely(update_next_balance)) {
		do {
			WRITE_ONCE(nohz.next_balance, next_balance);
			if (likely(READ_ONCE(nohz.next_balance) <= next_balance))
				break;
			cpu_relax();
		} while (1);
	}

Or a try_cmpxchg loop.

I think after these items and a bit more testing, we should be good to send
v2. Thanks.
Hello,

kernel test robot noticed "WARNING:at_kernel/sched/core.c:#nohz_csd_func" on:

commit: 7b0c45f5095f8868fb14cc4e1745befdf58d173c ("[PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB")
url: https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/sched-fair-Avoid-unnecessary-IPIs-for-ILB/20231006-003907
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3006adf3be79cde4d14b1800b963b82b6e5572e0
patch link: https://lore.kernel.org/all/20231005161727.1855004-1-joel@joelfernandes.org/
patch subject: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB

in testcase: blktests
version: blktests-x86_64-3f75e62-1_20231017
with following parameters:

	disk: 1SSD
	test: nvme-group-00
	nvme_trtype: rdma

compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)

+----------------+------------+------------+
|                | 3006adf3be | 7b0c45f509 |
+----------------+------------+------------+
| boot_successes | 0          | 3          |
+----------------+------------+------------+

If you fix the issue in a separate patch/commit (i.e.
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202310192232.750e5c5b-oliver.sang@intel.com [ 55.309389][ C1] ------------[ cut here ]------------ [ 55.315508][ C1] WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:1182 nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.325508][ C1] Modules linked in: intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp btrfs blake2b_generic kvm_intel xor kvm raid6_pq zstd_compress irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel libcrc32c sha512_ssse3 crc32c_intel ipmi_ssif rapl nvme intel_cstate nvme_core mei_me ast t10_pi dax_hmem drm_shmem_helper crc64_rocksoft_generic idxd crc64_rocksoft mei drm_kms_helper wmi idxd_bus joydev i2c_ismt crc64 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad drm fuse ip_tables [ 55.380240][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.6.0-rc4-00038-g7b0c45f5095f #1 [ 55.390037][ C1] RIP: 0010:nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.396018][ C1] Code: 84 c0 74 06 0f 8e d3 00 00 00 45 88 b4 24 28 0a 00 00 48 83 c4 08 bf 07 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 22 b0 f6 ff <0f> 0b e9 1b fe ff ff e8 76 6e 72 00 e9 66 fd ff ff e8 cc 6e 72 00 All code ======== 0: 84 c0 test %al,%al 2: 74 06 je 0xa 4: 0f 8e d3 00 00 00 jle 0xdd a: 45 88 b4 24 28 0a 00 mov %r14b,0xa28(%r12) 11: 00 12: 48 83 c4 08 add $0x8,%rsp 16: bf 07 00 00 00 mov $0x7,%edi 1b: 5b pop %rbx 1c: 41 5c pop %r12 1e: 41 5d pop %r13 20: 41 5e pop %r14 22: 41 5f pop %r15 24: 5d pop %rbp 25: e9 22 b0 f6 ff jmpq 0xfffffffffff6b04c 2a:* 0f 0b ud2 <-- trapping instruction 2c: e9 1b fe ff ff jmpq 0xfffffffffffffe4c 31: e8 76 6e 72 00 callq 0x726eac 36: e9 66 fd ff ff jmpq 0xfffffffffffffda1 3b: e8 cc 6e 72 00 callq 0x726f0c Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 
2: e9 1b fe ff ff jmpq 0xfffffffffffffe22 7: e8 76 6e 72 00 callq 0x726e82 c: e9 66 fd ff ff jmpq 0xfffffffffffffd77 11: e8 cc 6e 72 00 callq 0x726ee2 [ 55.418037][ C1] RSP: 0018:ffa00000001f8f58 EFLAGS: 00010046 [ 55.424802][ C1] RAX: 0000000000000000 RBX: 000000000003a100 RCX: ffffffff8444c928 [ 55.433718][ C1] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ff110017fc8ba164 [ 55.442631][ C1] RBP: ffa00000001f8f88 R08: 0000000000000001 R09: ffe21c02ff91742c [ 55.451542][ C1] R10: ff110017fc8ba167 R11: ffa00000001f8ff8 R12: ff110017fc8ba100 [ 55.460461][ C1] R13: ff110017fc8ba164 R14: 0000000000000000 R15: 0000000000000001 [ 55.470959][ C1] FS: 0000000000000000(0000) GS:ff110017fc880000(0000) knlGS:0000000000000000 [ 55.482348][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.491067][ C1] CR2: 00007fabd7bff699 CR3: 000000407de46006 CR4: 0000000000f71ee0 [ 55.501337][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 55.511601][ C1] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 55.521859][ C1] PKRU: 55555554 [ 55.527131][ C1] Call Trace: [ 55.532072][ C1] <IRQ> [ 55.536527][ C1] ? __warn (kernel/panic.c:673) [ 55.542341][ C1] ? nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.548935][ C1] ? report_bug (lib/bug.c:180 lib/bug.c:219) [ 55.555241][ C1] ? handle_bug (arch/x86/kernel/traps.c:237) [ 55.561323][ C1] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) [ 55.567792][ C1] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) [ 55.574671][ C1] ? nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.581230][ C1] ? nohz_csd_func (arch/x86/include/asm/atomic.h:23 arch/x86/include/asm/atomic.h:135 include/linux/atomic/atomic-arch-fallback.h:1433 include/linux/atomic/atomic-arch-fallback.h:1565 include/linux/atomic/atomic-instrumented.h:862 kernel/sched/core.c:1181) [ 55.587667][ C1] ? 
task_mm_cid_work (kernel/sched/core.c:1173)
[ 55.594511][ C1] __flush_smp_call_function_queue (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/csd.h:64 kernel/smp.c:134 kernel/smp.c:531)
[ 55.602619][ C1] __sysvec_call_function_single (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:99 arch/x86/kernel/smp.c:293)
[ 55.610431][ C1] sysvec_call_function_single (arch/x86/kernel/smp.c:287 (discriminator 14))
[ 55.617918][ C1] </IRQ>
[ 55.622373][ C1] <TASK>
[ 55.624388][ C2] ------------[ cut here ]------------
[ 55.625607][ C1] asm_sysvec_call_function_single (arch/x86/include/asm/idtentry.h:652)
[ 55.631669][ C2] WARNING: CPU: 2 PID: 0 at kernel/sched/core.c:1182 nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.638279][ C1] RIP: _nohz_idle_balance+0xd9/0x7f0
[ 55.648220][ C2] Modules linked in:
[ 55.655250][ C1] Code: 48 74 0a c7 05 c0 0f ce 04 00 00 00 00 8b 44 24 2c 83 e0 08 89 44 24 14 74 0a c7 05 ad 0f ce 04 00 00 00 00 f0 83 44 24 fc 00 <49> c7 c5 10 c4 3f 85 41 83 c4 01 48 b8 00 00 00 00 00 fc ff df 4c
All code
========
   0:	48 74 0a             	rex.W je 0xd
   3:	c7 05 c0 0f ce 04 00 	movl   $0x0,0x4ce0fc0(%rip)        # 0x4ce0fcd
   a:	00 00 00
   d:	8b 44 24 2c          	mov    0x2c(%rsp),%eax
  11:	83 e0 08             	and    $0x8,%eax
  14:	89 44 24 14          	mov    %eax,0x14(%rsp)
  18:	74 0a                	je     0x24
  1a:	c7 05 ad 0f ce 04 00 	movl   $0x0,0x4ce0fad(%rip)        # 0x4ce0fd1
  21:	00 00 00
  24:	f0 83 44 24 fc 00    	lock addl $0x0,-0x4(%rsp)
  2a:*	49 c7 c5 10 c4 3f 85 	mov    $0xffffffff853fc410,%r13		<-- trapping instruction
  31:	41 83 c4 01          	add    $0x1,%r12d
  35:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  3c:	fc ff df
  3f:	4c                   	rex.WR

Code starting with the faulting instruction
===========================================
   0:	49 c7 c5 10 c4 3f 85 	mov    $0xffffffff853fc410,%r13
   7:	41 83 c4 01          	add    $0x1,%r12d
   b:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  12:	fc ff df
  15:	4c                   	rex.WR
[ 55.656534][ C2] intel_rapl_msr
[ 55.660852][ C1] RSP: 0018:ffa000000865fdb0 EFLAGS: 00000246
[ 55.682783][ C2] intel_rapl_common
[ 55.686779][ C1] RAX: 0000000000000008 RBX: 0000000000000001 RCX: ffffffff812b76c7
[ 55.693493][ C2] x86_pkg_temp_thermal
[ 55.697774][ C1] RDX: dffffc0000000000 RSI: 0000000000000008 RDI: 0000000000000001
[ 55.706653][ C2] intel_powerclamp
[ 55.711232][ C1] RBP: ffa000000865fe90 R08: 0000000000000001 R09: ffe21c02ff91742c
[ 55.720120][ C2] coretemp btrfs
[ 55.724298][ C1] R10: ff110017fc8ba167 R11: 0000000000000014 R12: 0000000000000001
[ 55.733156][ C2] blake2b_generic
[ 55.737157][ C1] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 55.746019][ C2] kvm_intel xor
[ 55.750115][ C1] ? nohz_run_idle_balance (arch/x86/include/asm/atomic.h:23 arch/x86/include/asm/atomic.h:135 include/linux/atomic/atomic-arch-fallback.h:1433 include/linux/atomic/atomic-arch-fallback.h:1565 include/linux/atomic/atomic-instrumented.h:862 kernel/sched/fair.c:11954)
[ 55.758991][ C2] kvm
[ 55.762902][ C1] ? clockevents_program_event (kernel/time/clockevents.c:336 (discriminator 3))
[ 55.768839][ C2] raid6_pq zstd_compress
[ 55.771772][ C1] ? rebalance_domains (kernel/sched/fair.c:11826)
[ 55.778197][ C2] irqbypass
[ 55.782972][ C1] ? __flush_smp_call_function_queue (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/csd.h:64 kernel/smp.c:134 kernel/smp.c:531)
[ 55.788612][ C2] crct10dif_pclmul crc32_pclmul
[ 55.792132][ C1] do_idle (arch/x86/include/asm/current.h:41 include/linux/sched/idle.h:31 kernel/sched/idle.c:255)
[ 55.799153][ C2] ghash_clmulni_intel
[ 55.804630][ C1] cpu_startup_entry (kernel/sched/idle.c:379 (discriminator 1))
[ 55.809028][ C2] libcrc32c sha512_ssse3
[ 55.813499][ C1] start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
[ 55.818765][ C2] crc32c_intel
[ 55.823547][ C1] ? set_cpu_sibling_map (arch/x86/kernel/smpboot.c:240)
[ 55.828795][ C2] ipmi_ssif
[ 55.832605][ C1] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
[ 55.838652][ C2] rapl
[ 55.842150][ C1] </TASK>
[ 55.848889][ C2] nvme
[ 55.851904][ C1] ---[ end trace 0000000000000000 ]---
[ 55.855226][ C2] intel_cstate
[ 55.856376][ T1] systemd[1]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 55.929716][ C2] nvme_core mei_me ast t10_pi dax_hmem drm_shmem_helper crc64_rocksoft_generic idxd crc64_rocksoft mei drm_kms_helper wmi idxd_bus joydev i2c_ismt crc64 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad drm fuse ip_tables
[ 55.958456][ C2] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G W 6.6.0-rc4-00038-g7b0c45f5095f #1
[ 55.971146][ C2] RIP: 0010:nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.978359][ C2] Code: 84 c0 74 06 0f 8e d3 00 00 00 45 88 b4 24 28 0a 00 00 48 83 c4 08 bf 07 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 22 b0 f6 ff <0f> 0b e9 1b fe ff ff e8 76 6e 72 00 e9 66 fd ff ff e8 cc 6e 72 00
All code
========
   0:	84 c0                	test   %al,%al
   2:	74 06                	je     0xa
   4:	0f 8e d3 00 00 00    	jle    0xdd
   a:	45 88 b4 24 28 0a 00 	mov    %r14b,0xa28(%r12)
  11:	00
  12:	48 83 c4 08          	add    $0x8,%rsp
  16:	bf 07 00 00 00       	mov    $0x7,%edi
  1b:	5b                   	pop    %rbx
  1c:	41 5c                	pop    %r12
  1e:	41 5d                	pop    %r13
  20:	41 5e                	pop    %r14
  22:	41 5f                	pop    %r15
  24:	5d                   	pop    %rbp
  25:	e9 22 b0 f6 ff       	jmpq   0xfffffffffff6b04c
  2a:*	0f 0b                	ud2    	<-- trapping instruction
  2c:	e9 1b fe ff ff       	jmpq   0xfffffffffffffe4c
  31:	e8 76 6e 72 00       	callq  0x726eac
  36:	e9 66 fd ff ff       	jmpq   0xfffffffffffffda1
  3b:	e8 cc 6e 72 00       	callq  0x726f0c

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 1b fe ff ff       	jmpq   0xfffffffffffffe22
   7:	e8 76 6e 72 00       	callq  0x726e82
   c:	e9 66 fd ff ff       	jmpq   0xfffffffffffffd77
  11:	e8 cc 6e 72 00       	callq  0x726ee2

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231019/202310192232.750e5c5b-oliver.sang@intel.com
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb225921bbca..2ece55f32782 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
 	 * @idle_cpus_mask store, it must observe the @has_blocked
-	 * and @needs_update stores.
+	 * stores.
 	 */
 	smp_mb__after_atomic();

 	set_cpu_sd_state_idle(cpu);
-	WRITE_ONCE(nohz.needs_update, 1);
 out:
 	/*
 	 * Each time a cpu enter idle, we assume that it has blocked load and
@@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 }

 /*
- * Check if we need to run the ILB for updating blocked load before entering
- * idle state.
+ * Check if we need to run the ILB for updating blocked load and/or updating
+ * nohz.next_balance before entering idle state.
  */
 void nohz_run_idle_balance(int cpu)
 {
 	unsigned int flags;

-	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
+	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
+
+	if (!flags)
+		return;

 	/*
 	 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
 	 * (ie NOHZ_STATS_KICK set) and will do the same.
 	 */
-	if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
-		_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
+	if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
+	    !need_resched())
+		_nohz_idle_balance(cpu_rq(cpu), flags);
 }

 static void nohz_newidle_balance(struct rq *this_rq)
@@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;

+	/* If rq->next_balance before nohz.next_balance, trigger ILB */
+	if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
+		atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
+
 	/* Don't need to update blocked load of idle CPUs*/
 	if (!READ_ONCE(nohz.has_blocked) ||
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))