Message ID | 20231005161727.1855004-1-joel@joelfernandes.org |
---|---|
State | New |
Headers |
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
To: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>
Cc: Vineeth Pillai <vineethrp@google.com>, Suleiman Souhlal <suleiman@google.com>, Hsin Yi <hsinyi@google.com>, Frederic Weisbecker <frederic@kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Joel Fernandes <joel@joelfernandes.org>
Subject: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB
Date: Thu, 5 Oct 2023 16:17:26 +0000
Message-ID: <20231005161727.1855004-1-joel@joelfernandes.org> |
Series | [RFC] sched/fair: Avoid unnecessary IPIs for ILB |
Commit Message
Joel Fernandes
Oct. 5, 2023, 4:17 p.m. UTC
From: Vineeth Pillai <vineethrp@google.com>

Whenever a CPU stops its tick, it now requires another idle CPU to handle the balancing for it, because it can't perform its own periodic load balancing. This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if the upcoming nohz-idle load balancing is too distant in the future. This update is done by triggering an ILB, as the general ILB handler (_nohz_idle_balance) that manages regular nohz balancing also refreshes 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs and selecting the smallest value.

Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This primarily results in the ILB handler updating 'nohz.next_balance' while possibly not doing any load balancing at all. However, sending an IPI merely to refresh 'nohz.next_balance' seems excessive, and there ought to be a more efficient way to update 'nohz.next_balance' from the local CPU.

Fortunately, there already exists a mechanism to directly invoke the ILB handler (_nohz_idle_balance) without initiating an IPI: setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" balancing and solely exists to update a CPU's blocked load if it couldn't pull more tasks during regular "newly idle" balancing - and it does so without having to send any IPIs. Once the flag is set, the ILB handler is called directly from do_idle()->nohz_run_idle_balance(). While its goal is to update the blocked load without an IPI, in our situation we aim to refresh 'nohz.next_balance' without an IPI, so we can piggyback on this.

So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK flag to indicate that nohz.next_balance needs an update via this direct-call shortcut. Note that we set this flag without knowing that the tick is about to be stopped, because at the point we do it, we have no way of knowing that. However, we do know that the CPU is about to enter idle. In our testing, the reduction in IPIs is well worth updating nohz.next_balance a few more times.

Also, just to note, without this patch we observe the following pattern:

1. A CPU is about to stop its tick.
2. It sets nohz.needs_update to 1.
3. It then stops its tick and goes idle.
4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed.
5. The ILB CPU ends up being the one that just stopped its tick!
6. This results in an IPI to the tick-stopped CPU, which ends up waking it up and disturbing it!

Testing shows a considerable reduction in IPIs when doing this. Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4-vCPU VM, the IPI call count profiled over a 10s period is as follows:

    without fix: ~10500
    with fix:    ~1000

Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle")

[ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ]

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Hsin Yi <hsinyi@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Vineeth Pillai <vineethrp@google.com>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)
Comments
* Joel Fernandes (Google) <joel@joelfernandes.org> wrote:

> From: Vineeth Pillai <vineethrp@google.com>
>
> Whenever a CPU stops its tick, it now requires another idle CPU to handle
> the balancing for it because it can't perform its own periodic load
> balancing.
> [...snip...]
>
>  kernel/sched/fair.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)

Ok, judging by your IPI reduction numbers this is definitely an optimization we want to do.

The patch does make _nohz_idle_balance() run more parallel, as previously it would generally be run by the first-idle CPU in nohz.idle_cpus_mask (at least for next_balance updates), but I think it's still SMP-safe, as all key data structure updates are already rq-locked AFAICS.

One thing I noticed is that we now use nohz.needs_update in only a single remaining case, when _nohz_idle_balance() "self-defers":

	/*
	 * If this CPU gets work to do, stop the load balancing
	 * work being done for other CPUs. Next load
	 * balancing owner will pick it up.
	 */
	if (need_resched()) {
		if (flags & NOHZ_STATS_KICK)
			has_blocked_load = true;
		if (flags & NOHZ_NEXT_KICK)
			WRITE_ONCE(nohz.needs_update, 1);
		goto abort;
	}

Getting a need-resched flag set on this CPU is a pretty dubious reason to skip an ILB run IMO, and we could do entirely without that complication, allowing us to remove the nohz.needs_update flag handling logic altogether?

If we do that then the !need_resched() check needs to go from nohz_run_idle_balance() too:

	/*
	 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
	 * (ie NOHZ_STATS_KICK set) and will do the same.
	 */
	if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
	    !need_resched())
		_nohz_idle_balance(cpu_rq(cpu), flags);

... if I'm reading the code right, that is.

Thanks,

	Ingo
On Thu, 5 Oct 2023 at 18:17, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
>
> From: Vineeth Pillai <vineethrp@google.com>
>
> Whenever a CPU stops its tick, it now requires another idle CPU to handle the
> balancing for it because it can't perform its own periodic load balancing.
> [...snip...]
>
> @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
>         /*
>          * Ensures that if nohz_idle_balance() fails to observe our
>          * @idle_cpus_mask store, it must observe the @has_blocked
> -        * and @needs_update stores.
> +        * stores.
>          */
>         smp_mb__after_atomic();
>
>         set_cpu_sd_state_idle(cpu);
>
> -       WRITE_ONCE(nohz.needs_update, 1);

The set of needs_update here is the main way to set nohz.needs_update and trigger an update of next_balance, so it would be good to explain why we still need to keep it instead of removing it entirely.

>  /*
> - * Check if we need to run the ILB for updating blocked load before entering
> - * idle state.
> + * Check if we need to run the ILB for updating blocked load and/or updating
> + * nohz.next_balance before entering idle state.
>   */
>  void nohz_run_idle_balance(int cpu)
>  {
>         unsigned int flags;
>
> -       flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
> +       flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
> +
> +       if (!flags)
> +               return;
>
>         /*
>          * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
>          * (ie NOHZ_STATS_KICK set) and will do the same.
>          */
> -       if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> -               _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> +       if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> +           !need_resched())
> +               _nohz_idle_balance(cpu_rq(cpu), flags);

This breaks the update of the blocked load. nohz_newidle_balance() only sets NOHZ_NEWILB_KICK (and not NOHZ_STATS_KICK) when it wants to update blocked load before going idle, but it then sets NOHZ_STATS_KICK when calling _nohz_idle_balance(). We only clear NOHZ_NEWILB_KICK when fetching flags, but we test whether other bits have been set by a pending softirq, which will do the same. That's why we use NOHZ_NEWILB_KICK and not NOHZ_STATS_KICK directly. Similarly, you can't directly use NOHZ_NEXT_KICK.

Also note that _nohz_idle_balance() is not robust against parallel runs.

> +       /* If rq->next_balance before nohz.next_balance, trigger ILB */
> +       if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
> +               atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
> [...snip...]
> --
> 2.42.0.609.gbb76f46606-goog
On Fri, Oct 06, 2023 at 12:51:41PM +0200, Ingo Molnar wrote:
>
> * Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
>
> > From: Vineeth Pillai <vineethrp@google.com>
> >
> > Whenever a CPU stops its tick, it now requires another idle CPU to handle the
> > balancing for it because it can't perform its own periodic load balancing.
> > [...snip...]
>
> Ok, judging by your IPI reduction numbers this is definitely an
> optimization we want to do.

Great, thanks.

> The patch does make _nohz_idle_balance() run more parallel, as previously
> it would generally be run by the first-idle CPU in nohz.idle_cpus_mask (at
> least for next_balance updates), but I think it's still SMP-safe, as all
> key data structure updates are already rq-locked AFAICS.

Yes, true. We are looking into the parallelism aspect more and will update on how it goes. Ideally, we'd like to ensure that nohz.next_balance is set to the earliest rq->next_balance even in the presence of concurrency. Theoretically, we feel the parallelism should not increase beyond what the current code allows, but we'll look more into it.

> One thing I noticed is that we now use nohz.needs_update in only a single
> remaining case, when _nohz_idle_balance() "self-defers":
> [...snip...]
>
> Getting a need-resched flag set on this CPU is a pretty dubious reason to
> skip an ILB run IMO, and we could do entirely without that complication,
> allowing us to remove the nohz.needs_update flag handling logic altogether?

Yes, you are right, I think. We can continue doing the ILB run in the case where we are only doing the lighter-weight stats update and not full-blown idle balancing:

	if (need_resched() && (flags & NOHZ_BALANCE_KICK))
		goto abort;

That way we can get rid of the needs_update variable as well, as you and Vincent pointed out. We could also add this as a separate patch in a series. Thanks for pointing out this idea.

> If we do that then the !need_resched() check needs to go from
> nohz_run_idle_balance() too:
> [...snip...]
>
> ... if I'm reading the code right that is.

Yes, that sounds right. We will work more on this and post the next revision soon. Thank you!

 - Joel & Vineeth
Hi Vincent,

On Fri, Oct 06, 2023 at 03:46:44PM +0200, Vincent Guittot wrote:
[...]
> >         set_cpu_sd_state_idle(cpu);
> >
> > -       WRITE_ONCE(nohz.needs_update, 1);
>
> The set of needs_update here is the main way to set nohz.needs_update
> and trigger an update of next_balance, so it would be good to explain
> why we still need to keep it instead of removing it entirely.

Ok, we are thinking of getting rid of it as Ingo suggested.

> > -       if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> > -               _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> > +       if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> > +           !need_resched())
> > +               _nohz_idle_balance(cpu_rq(cpu), flags);
>
> This breaks the update of the blocked load. nohz_newidle_balance() only
> sets NOHZ_NEWILB_KICK (and not NOHZ_STATS_KICK) when it wants to update
> blocked load before going idle, but it then sets NOHZ_STATS_KICK when
> calling _nohz_idle_balance(). We only clear NOHZ_NEWILB_KICK when
> fetching flags, but we test whether other bits have been set by a
> pending softirq which will do the same.

Yes, we realized this just after sending the RFC. Will fix, thank you!

> That's why we use NOHZ_NEWILB_KICK and not NOHZ_STATS_KICK directly.
> Similarly you can't directly use NOHZ_NEXT_KICK.

This is not an issue actually, as NOHZ_NEXT_KICK is only set by this path (nohz_newidle_balance()) after this patch. The NEWILB_KICK and STATS_KICK case is different, where it can be updated in more than one path. Right?

> Also note that _nohz_idle_balance() is not robust against parallel runs.

True, though the parallelism is already happening. However, your point is well taken and we'll try to improve the existing code to make it more robust as well (if needed). Will dig deeper into it.

thanks,

 - Joel & Vineeth
On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: > From: Vineeth Pillai <vineethrp@google.com> > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > balancing for it because it can't perform its own periodic load balancing. > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > the upcoming nohz-idle load balancing is too distant in the future. This update > process is done by triggering an ILB, as the general ILB handler > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > and selecting the smallest value. > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > primarily results in the ILB handler updating 'nohz.next_balance' while > possibly not doing any load balancing at all. However, sending an IPI merely to > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > efficient method to update 'nohz.next_balance' from the local CPU. > > Fortunately, there already exists a mechanism to directly invoke the ILB > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > balancing and solely exists to update a CPU's blocked load if it couldn't pull > more tasks during regular "newly idle balancing" - and it does so without > having to send any IPIs. Once the flag is set, the ILB handler is called > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > the blocked load without an IPI, in our situation, we aim to refresh > 'nohz.next_balance' without an IPI but we can piggy back on this. > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > indicate nohz.next_balance needs an update via this direct call shortcut. 
Note > that we set this flag without knowledge that the tick is about to be stopped, > because at the point we do it, we have no way of knowing that. However we do > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > is well worth updating nohz.next_balance a few more times. > > Also just to note, without this patch we observe the following pattern: > > 1. A CPU is about to stop its tick. > 2. It sets nohz.needs_update to 1. > 3. It then stops its tick and goes idle. > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > 5. The ILB CPU ends up being the one that just stopped its tick! > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > and disturbing it! > > Testing shows a considerable reduction in IPIs when doing this: > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > the IPI call count profiled over 10s period is as follows: > without fix: ~10500 > with fix: ~1000 > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] Hi Joel/Vineeth. It's an interesting patch. Gave this patch a try on a powerpc system with 96 CPUs (12 cores, SMT8). Was debugging an issue where the ILB count goes up significantly at a specific busy CPU count. Haven't yet found out why. It's WIP. Was curious if this patch would address that issue. Cloned the rt-tests repo, ran the same cyclictest command, and collected softirq counts using a bcc tool. That count remains more or less the same with the patch. Is what I am checking incorrect? Any other way to check IPI count? 
         base 6.6_rc4   +patch
block    31.00          48.86
net_rx   475.90         348.90
timer    2213.20        2405.00
rcu      33057.30       34738.10
sched    175904.70      169695.60

> > Cc: Suleiman Souhlal <suleiman@google.com> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Hsin Yi <hsinyi@google.com> > Cc: Frederic Weisbecker <frederic@kernel.org> > Cc: Paul E. McKenney <paulmck@kernel.org> > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > kernel/sched/fair.c | 21 ++++++++++++++------- > 1 file changed, 14 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index cb225921bbca..2ece55f32782 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > /* > * Ensures that if nohz_idle_balance() fails to observe our > * @idle_cpus_mask store, it must observe the @has_blocked > - * and @needs_update stores. > + * stores. > */ > smp_mb__after_atomic(); > > set_cpu_sd_state_idle(cpu); > > - WRITE_ONCE(nohz.needs_update, 1); > out: > /* > * Each time a cpu enter idle, we assume that it has blocked load and > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > } > > /* > - * Check if we need to run the ILB for updating blocked load before entering > - * idle state. > + * Check if we need to run the ILB for updating blocked load and/or updating > + * nohz.next_balance before entering idle state. 
> */ > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > + !need_resched()) > + _nohz_idle_balance(cpu_rq(cpu), flags); > } > > static void nohz_newidle_balance(struct rq *this_rq) > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > if (this_rq->avg_idle < sysctl_sched_migration_cost) > return; > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > + > /* Don't need to update blocked load of idle CPUs*/ > if (!READ_ONCE(nohz.has_blocked) || > time_before(jiffies, READ_ONCE(nohz.next_blocked)))
On Thu, Oct 05, 2023 at 04:17:26PM +0000, Joel Fernandes (Google) wrote: > From: Vineeth Pillai <vineethrp@google.com> > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > balancing for it because it can't perform its own periodic load balancing. > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > the upcoming nohz-idle load balancing is too distant in the future. This update > process is done by triggering an ILB, as the general ILB handler > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > and selecting the smallest value. > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > primarily results in the ILB handler updating 'nohz.next_balance' while > possibly not doing any load balancing at all. However, sending an IPI merely to > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > efficient method to update 'nohz.next_balance' from the local CPU. > > Fortunately, there already exists a mechanism to directly invoke the ILB > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > balancing and solely exists to update a CPU's blocked load if it couldn't pull > more tasks during regular "newly idle balancing" - and it does so without > having to send any IPIs. Once the flag is set, the ILB handler is called > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > the blocked load without an IPI, in our situation, we aim to refresh > 'nohz.next_balance' without an IPI but we can piggy back on this. > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > indicate nohz.next_balance needs an update via this direct call shortcut. 
Note > that we set this flag without knowledge that the tick is about to be stopped, > because at the point we do it, we have no way of knowing that. However we do > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > is well worth updating nohz.next_balance a few more times. > > Also just to note, without this patch we observe the following pattern: > > 1. A CPU is about to stop its tick. > 2. It sets nohz.needs_update to 1. > 3. It then stops its tick and goes idle. > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > 5. The ILB CPU ends up being the one that just stopped its tick! > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > and disturbing it! > > Testing shows a considerable reduction in IPIs when doing this: > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > the IPI call count profiled over 10s period is as follows: > without fix: ~10500 > with fix: ~1000 > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently broken -- this is a pure optimization question, no?
On 10/7/23 12:48 AM, Shrikanth Hegde wrote: > > > On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: >> From: Vineeth Pillai <vineethrp@google.com> >> >> Whenever a CPU stops its tick, it now requires another idle CPU to handle the >> balancing for it because it can't perform its own periodic load balancing. >> This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if >> the upcoming nohz-idle load balancing is too distant in the future. This update >> process is done by triggering an ILB, as the general ILB handler >> (_nohz_idle_balance) that manages regular nohz balancing also refreshes >> 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs >> and selecting the smallest value. >> >> Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This >> primarily results in the ILB handler updating 'nohz.next_balance' while >> possibly not doing any load balancing at all. However, sending an IPI merely to >> refresh 'nohz.next_balance' seems excessive, and there ought to be a more >> efficient method to update 'nohz.next_balance' from the local CPU. >> >> Fortunately, there already exists a mechanism to directly invoke the ILB >> handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by >> setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" >> balancing and solely exists to update a CPU's blocked load if it couldn't pull >> more tasks during regular "newly idle balancing" - and it does so without >> having to send any IPIs. Once the flag is set, the ILB handler is called >> directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update >> the blocked load without an IPI, in our situation, we aim to refresh >> 'nohz.next_balance' without an IPI but we can piggy back on this. >> >> So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to >> indicate nohz.next_balance needs an update via this direct call shortcut. 
Note >> that we set this flag without knowledge that the tick is about to be stopped, >> because at the point we do it, we have no way of knowing that. However we do >> know that the CPU is about to enter idle. In our testing, the reduction in IPIs >> is well worth updating nohz.next_balance a few more times. >> >> Also just to note, without this patch we observe the following pattern: >> >> 1. A CPU is about to stop its tick. >> 2. It sets nohz.needs_update to 1. >> 3. It then stops its tick and goes idle. >> 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. >> 5. The ILB CPU ends up being the one that just stopped its tick! >> 6. This results in an IPI to the tick-stopped CPU which ends up waking it up >> and disturbing it! >> >> Testing shows a considerable reduction in IPIs when doing this: >> >> Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM >> the IPI call count profiled over 10s period is as follows: >> without fix: ~10500 >> with fix: ~1000 >> >> Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") >> >> [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > Hi Joel/Vineeth. > > Its an interesting patch. > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > Was debugging an issue where ILB count goes up significantly at a specific > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > would address that issue. > > cloned rt-test repo and ran same cyclictest command and collected > softirq's count using bcc tool. That count remains same more or less with patch. > Is what I am checking incorrect? Any other way to check IPI count? > > base 6.6_rc4 +patch > > block 31.00 48.86 > net_rx 475.90 348.90 > timer 2213.20 2405.00 > rcu 33057.30 34738.10 > sched 175904.70 169695.60 > Ah, there is hardirq which shows IPI count. Didnt think of it. 
This is the average of 10 runs where hardirq counts were collected over 10s while running cyclictest. This shows a nice improvement. In base 6.6 there were a few instances where the number of IPIs was much higher.

         base 6.6_rc4   +patch
IPI-1    2741.1         1382.3

> >> >> Cc: Suleiman Souhlal <suleiman@google.com> >> Cc: Steven Rostedt <rostedt@goodmis.org> >> Cc: Hsin Yi <hsinyi@google.com> >> Cc: Frederic Weisbecker <frederic@kernel.org> >> Cc: Paul E. McKenney <paulmck@kernel.org> >> Signed-off-by: Vineeth Pillai <vineethrp@google.com> >> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> >> --- >> kernel/sched/fair.c | 21 ++++++++++++++------- >> 1 file changed, 14 insertions(+), 7 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index cb225921bbca..2ece55f32782 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) >> /* >> * Ensures that if nohz_idle_balance() fails to observe our >> * @idle_cpus_mask store, it must observe the @has_blocked >> - * and @needs_update stores. >> + * stores. >> */ >> smp_mb__after_atomic(); >> >> set_cpu_sd_state_idle(cpu); >> >> - WRITE_ONCE(nohz.needs_update, 1); >> out: >> /* >> * Each time a cpu enter idle, we assume that it has blocked load and >> @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) >> } >> >> /* >> - * Check if we need to run the ILB for updating blocked load before entering >> - * idle state. >> + * Check if we need to run the ILB for updating blocked load and/or updating >> + * nohz.next_balance before entering idle state. 
>> */ >> void nohz_run_idle_balance(int cpu) >> { >> unsigned int flags; >> >> - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); >> + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); >> + >> + if (!flags) >> + return; >> >> /* >> * Update the blocked load only if no SCHED_SOFTIRQ is about to happen >> * (ie NOHZ_STATS_KICK set) and will do the same. >> */ >> - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) >> - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); >> + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && >> + !need_resched()) >> + _nohz_idle_balance(cpu_rq(cpu), flags); >> } >> >> static void nohz_newidle_balance(struct rq *this_rq) >> @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) >> if (this_rq->avg_idle < sysctl_sched_migration_cost) >> return; >> >> + /* If rq->next_balance before nohz.next_balance, trigger ILB */ >> + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) >> + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); >> + >> /* Don't need to update blocked load of idle CPUs*/ >> if (!READ_ONCE(nohz.has_blocked) || >> time_before(jiffies, READ_ONCE(nohz.next_blocked)))
Hi Shrikanth, > Hi Joel/Vineeth. > > Its an interesting patch. > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > Was debugging an issue where ILB count goes up significantly at a specific > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > would address that issue. > Increased SCHED_SOFTIRQ was another problem that we noticed and we have a separate fix for that in the chromeos tree. We were planning to send the patch after a bit more testing. You could try the patch and see if it solves the problem that you are seeing: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/918f229ba2ff720f5dfae4768469acbb6ea39fe2 Thanks
Hi Peter, On Fri, Oct 06, 2023 at 10:01:29PM +0200, Peter Zijlstra wrote: > On Thu, Oct 05, 2023 at 04:17:26PM +0000, Joel Fernandes (Google) wrote: > > From: Vineeth Pillai <vineethrp@google.com> > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > balancing for it because it can't perform its own periodic load balancing. > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > the upcoming nohz-idle load balancing is too distant in the future. This update > > process is done by triggering an ILB, as the general ILB handler > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > and selecting the smallest value. > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > primarily results in the ILB handler updating 'nohz.next_balance' while > > possibly not doing any load balancing at all. However, sending an IPI merely to > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > more tasks during regular "newly idle balancing" - and it does so without > > having to send any IPIs. Once the flag is set, the ILB handler is called > > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > > the blocked load without an IPI, in our situation, we aim to refresh > > 'nohz.next_balance' without an IPI but we can piggy back on this. 
> > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > that we set this flag without knowledge that the tick is about to be stopped, > > because at the point we do it, we have no way of knowing that. However we do > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > is well worth updating nohz.next_balance a few more times. > > > > Also just to note, without this patch we observe the following pattern: > > > > 1. A CPU is about to stop its tick. > > 2. It sets nohz.needs_update to 1. > > 3. It then stops its tick and goes idle. > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > 5. The ILB CPU ends up being the one that just stopped its tick! > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > and disturbing it! > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > the IPI call count profiled over 10s period is as follows: > > without fix: ~10500 > > with fix: ~1000 > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently broken -- this is a pure optimization question, no? IMHO it is a breakage -- a lot of times the ILB kicks the CPU that just stopped its tick back out of idle, effectively breaking NOHZ. The large number of IPIs also wrecks power and it happens only on 6.1 and after. Having the Fixes tag means it will also go to all stable kernels >= 6.1. Hope that sounds reasonable and thank you for taking a look! thanks, - Joel
On Sat, Oct 07, 2023 at 12:48:57AM +0530, Shrikanth Hegde wrote: > > > On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: > > From: Vineeth Pillai <vineethrp@google.com> > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > balancing for it because it can't perform its own periodic load balancing. > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > the upcoming nohz-idle load balancing is too distant in the future. This update > > process is done by triggering an ILB, as the general ILB handler > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > and selecting the smallest value. > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > primarily results in the ILB handler updating 'nohz.next_balance' while > > possibly not doing any load balancing at all. However, sending an IPI merely to > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > more tasks during regular "newly idle balancing" - and it does so without > > having to send any IPIs. Once the flag is set, the ILB handler is called > > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > > the blocked load without an IPI, in our situation, we aim to refresh > > 'nohz.next_balance' without an IPI but we can piggy back on this. 
> > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > that we set this flag without knowledge that the tick is about to be stopped, > > because at the point we do it, we have no way of knowing that. However we do > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > is well worth updating nohz.next_balance a few more times. > > > > Also just to note, without this patch we observe the following pattern: > > > > 1. A CPU is about to stop its tick. > > 2. It sets nohz.needs_update to 1. > > 3. It then stops its tick and goes idle. > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > 5. The ILB CPU ends up being the one that just stopped its tick! > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > and disturbing it! > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > the IPI call count profiled over 10s period is as follows: > > without fix: ~10500 > > with fix: ~1000 > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > Hi Joel/Vineeth. > > Its an interesting patch. > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > Was debugging an issue where ILB count goes up significantly at a specific > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > would address that issue. Do you see that only from a specific kernel version onward? Have you tried pre-6.1 kernels? > cloned rt-test repo and ran same cyclictest command and collected > softirq's count using bcc tool. That count remains same more or less with patch. > Is what I am checking incorrect? 
Any other way to check IPI count? > > base 6.6_rc4 +patch > > block 31.00 48.86 > net_rx 475.90 348.90 > timer 2213.20 2405.00 > rcu 33057.30 34738.10 > sched 175904.70 169695.60 So, we ran "perf stat -e irq:softirq_*" and counted those. For the sched softirq, a majority of them were raised from the ILB path. This gave a clue that it must be a flood of ILB. For IPI counts, we run "perf record -a -g" and then "perf script" and look for call stacks involving an smp_call. There's also /proc/interrupts that can read out IPI counts. We also used trace_printk in the ILB path, collected a trace, and count how many times the ILB happens. Vineeth collected the final data in this patch so I'll let him add any details if he used some other method. thanks, - Joel > > > > > > Cc: Suleiman Souhlal <suleiman@google.com> > > Cc: Steven Rostedt <rostedt@goodmis.org> > > Cc: Hsin Yi <hsinyi@google.com> > > Cc: Frederic Weisbecker <frederic@kernel.org> > > Cc: Paul E. McKenney <paulmck@kernel.org> > > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > kernel/sched/fair.c | 21 ++++++++++++++------- > > 1 file changed, 14 insertions(+), 7 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index cb225921bbca..2ece55f32782 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > > /* > > * Ensures that if nohz_idle_balance() fails to observe our > > * @idle_cpus_mask store, it must observe the @has_blocked > > - * and @needs_update stores. > > + * stores. 
> > */ > > smp_mb__after_atomic(); > > > > set_cpu_sd_state_idle(cpu); > > > > - WRITE_ONCE(nohz.needs_update, 1); > > out: > > /* > > * Each time a cpu enter idle, we assume that it has blocked load and > > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > > } > > > > /* > > - * Check if we need to run the ILB for updating blocked load before entering > > - * idle state. > > + * Check if we need to run the ILB for updating blocked load and/or updating > > + * nohz.next_balance before entering idle state. > > */ > > void nohz_run_idle_balance(int cpu) > > { > > unsigned int flags; > > > > - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); > > + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); > > + > > + if (!flags) > > + return; > > > > /* > > * Update the blocked load only if no SCHED_SOFTIRQ is about to happen > > * (ie NOHZ_STATS_KICK set) and will do the same. > > */ > > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > > + !need_resched()) > > + _nohz_idle_balance(cpu_rq(cpu), flags); > > } > > > > static void nohz_newidle_balance(struct rq *this_rq) > > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > > if (this_rq->avg_idle < sysctl_sched_migration_cost) > > return; > > > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > > + > > /* Don't need to update blocked load of idle CPUs*/ > > if (!READ_ONCE(nohz.has_blocked) || > > time_before(jiffies, READ_ONCE(nohz.next_blocked)))
On Sat, Oct 07, 2023 at 01:40:53AM +0530, Shrikanth Hegde wrote: > > > On 10/7/23 12:48 AM, Shrikanth Hegde wrote: > > > > > > On 10/5/23 9:47 PM, Joel Fernandes (Google) wrote: > >> From: Vineeth Pillai <vineethrp@google.com> > >> > >> Whenever a CPU stops its tick, it now requires another idle CPU to handle the > >> balancing for it because it can't perform its own periodic load balancing. > >> This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > >> the upcoming nohz-idle load balancing is too distant in the future. This update > >> process is done by triggering an ILB, as the general ILB handler > >> (_nohz_idle_balance) that manages regular nohz balancing also refreshes > >> 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > >> and selecting the smallest value. > >> > >> Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > >> primarily results in the ILB handler updating 'nohz.next_balance' while > >> possibly not doing any load balancing at all. However, sending an IPI merely to > >> refresh 'nohz.next_balance' seems excessive, and there ought to be a more > >> efficient method to update 'nohz.next_balance' from the local CPU. > >> > >> Fortunately, there already exists a mechanism to directly invoke the ILB > >> handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > >> setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > >> balancing and solely exists to update a CPU's blocked load if it couldn't pull > >> more tasks during regular "newly idle balancing" - and it does so without > >> having to send any IPIs. Once the flag is set, the ILB handler is called > >> directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > >> the blocked load without an IPI, in our situation, we aim to refresh > >> 'nohz.next_balance' without an IPI but we can piggy back on this. 
> >> > >> So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > >> indicate nohz.next_balance needs an update via this direct call shortcut. Note > >> that we set this flag without knowledge that the tick is about to be stopped, > >> because at the point we do it, we have no way of knowing that. However we do > >> know that the CPU is about to enter idle. In our testing, the reduction in IPIs > >> is well worth updating nohz.next_balance a few more times. > >> > >> Also just to note, without this patch we observe the following pattern: > >> > >> 1. A CPU is about to stop its tick. > >> 2. It sets nohz.needs_update to 1. > >> 3. It then stops its tick and goes idle. > >> 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > >> 5. The ILB CPU ends up being the one that just stopped its tick! > >> 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > >> and disturbing it! > >> > >> Testing shows a considerable reduction in IPIs when doing this: > >> > >> Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > >> the IPI call count profiled over 10s period is as follows: > >> without fix: ~10500 > >> with fix: ~1000 > >> > >> Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > >> > >> [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > > > Hi Joel/Vineeth. > > > > Its an interesting patch. > > > > Gave this patch a try on powerpc system with 96 CPU. (12 cores SMT8) > > Was debugging an issue where ILB count goes up significantly at a specific > > busy CPU count. Haven't yet found out why. Its WIP. Was curious if this patch > > would address that issue. > > > > cloned rt-test repo and ran same cyclictest command and collected > > softirq's count using bcc tool. That count remains same more or less with patch. > > Is what I am checking incorrect? Any other way to check IPI count? 
> >
> >          base 6.6_rc4   +patch
> >
> > block          31.00      48.86
> > net_rx        475.90     348.90
> > timer        2213.20    2405.00
> > rcu         33057.30   34738.10
> > sched      175904.70  169695.60
>
> Ah, there is hardirq, which shows the IPI count. Didn't think of it.
> This is the average of 10 runs where hardirq was collected over a 10s window
> while running cyclictest. This shows a nice improvement. In base 6.6 there
> were a few instances where the number of IPIs was much higher.
>
>          base 6.6_rc4   +patch
> IPI-1        2741.1     1382.3

Very cool! So I'll go ahead and add this data as well for the next revision.
(I hope to post a new version in a few days after addressing all the review
comments; I am unfortunately a bit slow this week due to travel and other
things.)

Thanks,

 - Joel

> >
> >>
> >> Cc: Suleiman Souhlal <suleiman@google.com>
> >> Cc: Steven Rostedt <rostedt@goodmis.org>
> >> Cc: Hsin Yi <hsinyi@google.com>
> >> Cc: Frederic Weisbecker <frederic@kernel.org>
> >> Cc: Paul E. McKenney <paulmck@kernel.org>
> >> Signed-off-by: Vineeth Pillai <vineethrp@google.com>
> >> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>  kernel/sched/fair.c | 21 ++++++++++++++-------
> >>  1 file changed, 14 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index cb225921bbca..2ece55f32782 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
> >>  	/*
> >>  	 * Ensures that if nohz_idle_balance() fails to observe our
> >>  	 * @idle_cpus_mask store, it must observe the @has_blocked
> >> -	 * and @needs_update stores.
> >> +	 * stores.
> >>  	 */
> >>  	smp_mb__after_atomic();
> >>
> >>  	set_cpu_sd_state_idle(cpu);
> >>
> >> -	WRITE_ONCE(nohz.needs_update, 1);
> >>  out:
> >>  	/*
> >>  	 * Each time a cpu enter idle, we assume that it has blocked load and
> >> @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> >>  }
> >>
> >>  /*
> >> - * Check if we need to run the ILB for updating blocked load before entering
> >> - * idle state.
> >> + * Check if we need to run the ILB for updating blocked load and/or updating
> >> + * nohz.next_balance before entering idle state.
> >>   */
> >>  void nohz_run_idle_balance(int cpu)
> >>  {
> >>  	unsigned int flags;
> >>
> >> -	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
> >> +	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
> >> +
> >> +	if (!flags)
> >> +		return;
> >>
> >>  	/*
> >>  	 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
> >>  	 * (ie NOHZ_STATS_KICK set) and will do the same.
> >>  	 */
> >> -	if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> >> -		_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
> >> +	if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
> >> +	    !need_resched())
> >> +		_nohz_idle_balance(cpu_rq(cpu), flags);
> >>  }
> >>
> >>  static void nohz_newidle_balance(struct rq *this_rq)
> >> @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq)
> >>  	if (this_rq->avg_idle < sysctl_sched_migration_cost)
> >>  		return;
> >>
> >> +	/* If rq->next_balance before nohz.next_balance, trigger ILB */
> >> +	if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
> >> +		atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
> >> +
> >>  	/* Don't need to update blocked load of idle CPUs*/
> >>  	if (!READ_ONCE(nohz.has_blocked) ||
> >>  	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
On Fri, Oct 06, 2023 at 12:51:41PM +0200, Ingo Molnar wrote: > > * Joel Fernandes (Google) <joel@joelfernandes.org> wrote: > > > From: Vineeth Pillai <vineethrp@google.com> > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > balancing for it because it can't perform its own periodic load balancing. > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > the upcoming nohz-idle load balancing is too distant in the future. This update > > process is done by triggering an ILB, as the general ILB handler > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > and selecting the smallest value. > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > primarily results in the ILB handler updating 'nohz.next_balance' while > > possibly not doing any load balancing at all. However, sending an IPI merely to > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > more tasks during regular "newly idle balancing" - and it does so without > > having to send any IPIs. Once the flag is set, the ILB handler is called > > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > > the blocked load without an IPI, in our situation, we aim to refresh > > 'nohz.next_balance' without an IPI but we can piggy back on this. 
> > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > that we set this flag without knowledge that the tick is about to be stopped, > > because at the point we do it, we have no way of knowing that. However we do > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > is well worth updating nohz.next_balance a few more times. > > > > Also just to note, without this patch we observe the following pattern: > > > > 1. A CPU is about to stop its tick. > > 2. It sets nohz.needs_update to 1. > > 3. It then stops its tick and goes idle. > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > 5. The ILB CPU ends up being the one that just stopped its tick! > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > and disturbing it! > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > the IPI call count profiled over 10s period is as follows: > > without fix: ~10500 > > with fix: ~1000 > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > > > Cc: Suleiman Souhlal <suleiman@google.com> > > Cc: Steven Rostedt <rostedt@goodmis.org> > > Cc: Hsin Yi <hsinyi@google.com> > > Cc: Frederic Weisbecker <frederic@kernel.org> > > Cc: Paul E. 
McKenney <paulmck@kernel.org> > > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > kernel/sched/fair.c | 21 ++++++++++++++------- > > 1 file changed, 14 insertions(+), 7 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index cb225921bbca..2ece55f32782 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > > /* > > * Ensures that if nohz_idle_balance() fails to observe our > > * @idle_cpus_mask store, it must observe the @has_blocked > > - * and @needs_update stores. > > + * stores. > > */ > > smp_mb__after_atomic(); > > > > set_cpu_sd_state_idle(cpu); > > > > - WRITE_ONCE(nohz.needs_update, 1); > > out: > > /* > > * Each time a cpu enter idle, we assume that it has blocked load and > > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > > } > > > > /* > > - * Check if we need to run the ILB for updating blocked load before entering > > - * idle state. > > + * Check if we need to run the ILB for updating blocked load and/or updating > > + * nohz.next_balance before entering idle state. > > */ > > void nohz_run_idle_balance(int cpu) > > { > > unsigned int flags; > > > > - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); > > + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); > > + > > + if (!flags) > > + return; > > > > /* > > * Update the blocked load only if no SCHED_SOFTIRQ is about to happen > > * (ie NOHZ_STATS_KICK set) and will do the same. 
> > */ > > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > > + !need_resched()) > > + _nohz_idle_balance(cpu_rq(cpu), flags); > > } > > > > static void nohz_newidle_balance(struct rq *this_rq) > > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > > if (this_rq->avg_idle < sysctl_sched_migration_cost) > > return; > > > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > > + > > /* Don't need to update blocked load of idle CPUs*/ > > if (!READ_ONCE(nohz.has_blocked) || > > time_before(jiffies, READ_ONCE(nohz.next_blocked))) > > Ok, judging by your IPI reduction numbers this is definitely an > optimization we want to do. > > The patch does make _nohz_idle_balance() run more parallel, as previously > it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at > least for next_balance updates), but I think it's still SMP-safe, as all > key data structure updates are already rq-locked AFAICS. One thing I am confused about in the original code is: tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask. However, nohz_run_idle_balance() is called before that can happen, in do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is not yet in the mask. So will this code in _nohz_idle_balance() really run in such a scenario? if (flags & NOHZ_STATS_KICK) has_blocked_load |= update_nohz_stats(rq); AFAICS, this loop may not select the CPU due to its absence from the mask: for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1) I must be missing something. I'll go trace this path later as well. thanks, - Joel
* Joel Fernandes <joel@joelfernandes.org> wrote:

> > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle")
> >
> > Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently
> > broken -- this is a pure optimization question, no?
>
> IMHO it is a breakage, as it breaks NOHZ -- a lot of the time the ILB kicks
> the CPU that is stopping its tick back out of idle (effectively breaking
> NOHZ). The large number of IPIs also wrecks power, and it happens only on
> 6.1 and after. Having the Fixes tag means it will also go to all stable
> kernels >= 6.1. Hope that sounds reasonable, and thank you for taking a look!

So it's basically a fix of a NOHZ performance regression, introduced by
7fd7a9e0caba or so, correct?

As long as the fixes have a good hope of being backported with a low amount
of overhead, a Fixes: tag for a ~2-year-old performance regression is
unusual but not unprecedented.

We just need to make sure we don't put too much of a burden on the
shoulders of -stable maintainers ...

Thanks,

	Ingo
On Sun, Oct 8, 2023 at 1:35 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> [...snip...]
> > The patch does make _nohz_idle_balance() run more parallel, as previously
> > it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at
> > least for next_balance updates), but I think it's still SMP-safe, as all
> > key data structure updates are already rq-locked AFAICS.
>
> One thing I am confused about in the original code is:
>
> tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask.
> However, nohz_run_idle_balance() is called before that can happen, in
> do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is
> not yet in the mask.
>
> So will this code in _nohz_idle_balance() really run in such a scenario?
>
> if (flags & NOHZ_STATS_KICK)
>         has_blocked_load |= update_nohz_stats(rq);
>
> AFAICS, this loop may not select the CPU due to its absence from the mask:
> for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1)
>

I have traced this a bit further. As Joel mentioned, nohz.idle_cpus_mask
shouldn't contain this cpu when nohz_run_idle_balance() is called from
do_idle(), but in tracing I have seen that it often does, mostly with
HIGHRES. And I feel this is a bug.

We call nohz_balance_enter_idle() when we turn off the tick, but we don't
always call nohz_balance_exit_idle() when we turn the tick back on; we call
it only on the next tick on this cpu, in nohz_balancer_kick(). If a wakeup
happens on this cpu while the tick is off, we re-enable the tick but do not
remove ourselves from nohz.idle_cpus_mask. So the ILB will consider this cpu
a valid pick until the next tick on this cpu, where it gets removed. I am
not sure if this is intentional.

If this is a bug and we fix it by calling nohz_balance_exit_idle() during
restart_tick, then we probably would not need the NOHZ_NEWIDLE_KICK flag and
could use NOHZ_STATS_KICK, as there would not be any overlap between
nohz_run_idle_balance() and nohz_idle_balance().
Thanks, Vineeth
On Mon, 9 Oct 2023 13:25:23 +0200 Ingo Molnar <mingo@kernel.org> wrote: > We just need to make sure we don't put too much of a burden on the > shoulders of -stable maintainers ... My experience is that changes like this usually find their way into stable with or without the Fixes tag. -- Steve
On Sun, 8 Oct 2023 at 19:35, Joel Fernandes <joel@joelfernandes.org> wrote: > > On Fri, Oct 06, 2023 at 12:51:41PM +0200, Ingo Molnar wrote: > > > > * Joel Fernandes (Google) <joel@joelfernandes.org> wrote: > > > > > From: Vineeth Pillai <vineethrp@google.com> > > > > > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > > > balancing for it because it can't perform its own periodic load balancing. > > > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > > > the upcoming nohz-idle load balancing is too distant in the future. This update > > > process is done by triggering an ILB, as the general ILB handler > > > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > > > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > > > and selecting the smallest value. > > > > > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > > > primarily results in the ILB handler updating 'nohz.next_balance' while > > > possibly not doing any load balancing at all. However, sending an IPI merely to > > > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > > > efficient method to update 'nohz.next_balance' from the local CPU. > > > > > > Fortunately, there already exists a mechanism to directly invoke the ILB > > > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > > > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > > > balancing and solely exists to update a CPU's blocked load if it couldn't pull > > > more tasks during regular "newly idle balancing" - and it does so without > > > having to send any IPIs. Once the flag is set, the ILB handler is called > > > directly from do_idle()-> nohz_run_idle_balance(). 
While its goal is to update > > > the blocked load without an IPI, in our situation, we aim to refresh > > > 'nohz.next_balance' without an IPI but we can piggy back on this. > > > > > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > > > indicate nohz.next_balance needs an update via this direct call shortcut. Note > > > that we set this flag without knowledge that the tick is about to be stopped, > > > because at the point we do it, we have no way of knowing that. However we do > > > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > > > is well worth updating nohz.next_balance a few more times. > > > > > > Also just to note, without this patch we observe the following pattern: > > > > > > 1. A CPU is about to stop its tick. > > > 2. It sets nohz.needs_update to 1. > > > 3. It then stops its tick and goes idle. > > > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > > > 5. The ILB CPU ends up being the one that just stopped its tick! > > > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > > > and disturbing it! > > > > > > Testing shows a considerable reduction in IPIs when doing this: > > > > > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > > > the IPI call count profiled over 10s period is as follows: > > > without fix: ~10500 > > > with fix: ~1000 > > > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > > > > > Cc: Suleiman Souhlal <suleiman@google.com> > > > Cc: Steven Rostedt <rostedt@goodmis.org> > > > Cc: Hsin Yi <hsinyi@google.com> > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > > Cc: Paul E. 
McKenney <paulmck@kernel.org> > > > Signed-off-by: Vineeth Pillai <vineethrp@google.com> > > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > --- > > > kernel/sched/fair.c | 21 ++++++++++++++------- > > > 1 file changed, 14 insertions(+), 7 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index cb225921bbca..2ece55f32782 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > > > /* > > > * Ensures that if nohz_idle_balance() fails to observe our > > > * @idle_cpus_mask store, it must observe the @has_blocked > > > - * and @needs_update stores. > > > + * stores. > > > */ > > > smp_mb__after_atomic(); > > > > > > set_cpu_sd_state_idle(cpu); > > > > > > - WRITE_ONCE(nohz.needs_update, 1); > > > out: > > > /* > > > * Each time a cpu enter idle, we assume that it has blocked load and > > > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > > > } > > > > > > /* > > > - * Check if we need to run the ILB for updating blocked load before entering > > > - * idle state. > > > + * Check if we need to run the ILB for updating blocked load and/or updating > > > + * nohz.next_balance before entering idle state. > > > */ > > > void nohz_run_idle_balance(int cpu) > > > { > > > unsigned int flags; > > > > > > - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); > > > + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); > > > + > > > + if (!flags) > > > + return; > > > > > > /* > > > * Update the blocked load only if no SCHED_SOFTIRQ is about to happen > > > * (ie NOHZ_STATS_KICK set) and will do the same. 
> > > */ > > > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > > > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > > > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > > > + !need_resched()) > > > + _nohz_idle_balance(cpu_rq(cpu), flags); > > > } > > > > > > static void nohz_newidle_balance(struct rq *this_rq) > > > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > > > if (this_rq->avg_idle < sysctl_sched_migration_cost) > > > return; > > > > > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > > > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > > > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > > > + > > > /* Don't need to update blocked load of idle CPUs*/ > > > if (!READ_ONCE(nohz.has_blocked) || > > > time_before(jiffies, READ_ONCE(nohz.next_blocked))) > > > > Ok, judging by your IPI reduction numbers this is definitely an > > optimization we want to do. > > > > The patch does make _nohz_idle_balance() run more parallel, as previously > > it would be generally run by the first-idle CPU in nohz.idle_cpus_mask (at > > least for next_balance updates), but I think it's still SMP-safe, as all > > key data structure updates are already rq-locked AFAICS. > > One thing I am confused about in the original code is: > > tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask. > However, nohz_run_idle_balance() is called before that can happen, in > do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is > not yet in the mask. 2 things: - update of nohz.idle_cpus_mask is not strictly aligned with cpu entering/exiting idle state. It is set when entering but only cleared during the next tick on the cpu because nohz.idle_cpus_mask is a bottleneck when all CPUs enter/exit idle at high rate (usec). 
  This implies that a cpu entering idle can already be set in
  nohz.idle_cpus_mask.
- NOHZ_NEWILB_KICK is more about updating the blocked load of other,
  already-idle CPUs than that of the one entering idle, which has already
  updated its blocked load in newidle_balance().

The goal of nohz_run_idle_balance() is to run the ILB only to update the
blocked load of already-idle CPUs, without waking up one of those idle CPUs,
and outside the preempt-disable / irq-off phase of the local cpu about to
enter idle, because that can take a long time.

>
> So will this code in _nohz_idle_balance() really run in such a scenario?
>
> if (flags & NOHZ_STATS_KICK)
>         has_blocked_load |= update_nohz_stats(rq);
>
> AFAICS, this loop may not select the CPU due to its absence from the mask:
> for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1)
>
> I must be missing something. I'll go trace this path later as well.
>
> thanks,
>
> - Joel
>
On Mon, Oct 9, 2023 at 7:25 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > > > > > Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently > > > broken -- this is a pure optimization question, no? > > > > IMHO it is a breakage as it breaks NOHZ -- a lot of times the ILB kicks > > back the CPU stopping the tick out of idle (effectively breaking NOHZ). > > The large number of IPIs also wrecks power and it happens only on 6.1 and > > after. Having the fixes tag means it will also goto all stable kernels >= > > 6.1. Hope that sounds reasonable and thank you for taking a look! > > So it's basically a fix of a NOHZ performance regression, introduced by > 7fd7a9e0caba or so, correct? Yes. > As long as the fixes have a good hope of being backported with a low amount > of overhead, a Fixes: tag for a ~2 years old performance regression is > unusual but not unprecedented. > > We just need to make sure we don't put too much of a burden on the > shoulders of -stable maintainers ... Sure, that sounds good. After we upstream, I am happy to assist in any stable backporting work related to this as well. In any case, I have to backport it to ChromeOS kernels which are based on stable. There's only the 6.1 stable kernel at the moment that is affected. thanks, - Joel
On Tue, Oct 10, 2023 at 3:15 AM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Sun, 8 Oct 2023 at 19:35, Joel Fernandes <joel@joelfernandes.org> wrote: [...] > > One thing I am confused about in the original code is: > > > > tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask. > > However, nohz_run_idle_balance() is called before that can happen, in > > do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is > > not yet in the mask. > > 2 things: > - update of nohz.idle_cpus_mask is not strictly aligned with cpu > entering/exiting idle state. It is set when entering but only cleared > during the next tick on the cpu because nohz.idle_cpus_mask is a > bottleneck when all CPUs enter/exit idle at high rate (usec). This > implies that a cpu entering idle can already be set in > nohz.idle_cpus_mask > - NOHZ_NEWILB_KICK is more about updating blocked load of others > already idle CPUs than the one entering idle which has already updated > its blocked load in newidle_balance() > > The goal of nohz_run_idle_balance() is to run ILB only for updating > the blocked load of already idle CPUs without waking up one of those > idle CPUs and outside the preempt disable / irq off phase of the local > cpu about to enter idle because it can take a long time. This makes complete sense, thank you for the background on this! Vineeth was just telling me in a 1:1 that he also tried doing the removal of the CPU from the idle mask in the restart-tick path. The result was that even though the mask modification is not as often as when doing it during the CPU coming out of idle, it is still higher than just doing it from the next busy tick, like in current mainline. As next steps we are looking into: 1. Monitor how often we set NEXT_KICK -- we think we can reduce the frequency of these even more and keep the overhead low. 2. 
Look more into the parallelism of nohz.next_balance updates (due to our
NEXT_KICK setting) and handle any race conditions. We are at the moment
looking into whether nohz.next_balance can fail to get set to the earliest
value because of a race, and if so, retrying the operation. Something like
(untested):

	if (likely(update_next_balance)) {
		do {
			WRITE_ONCE(nohz.next_balance, next_balance);
			if (likely(READ_ONCE(nohz.next_balance) <= next_balance))
				break;
			cpu_relax();
		} while (1);
	}

Or a try_cmpxchg loop.

I think after these items and a bit more testing, we should be good to send
v2. Thanks.
Hello,

kernel test robot noticed "WARNING:at_kernel/sched/core.c:#nohz_csd_func" on:

commit: 7b0c45f5095f8868fb14cc4e1745befdf58d173c ("[PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB")
url: https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/sched-fair-Avoid-unnecessary-IPIs-for-ILB/20231006-003907
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3006adf3be79cde4d14b1800b963b82b6e5572e0
patch link: https://lore.kernel.org/all/20231005161727.1855004-1-joel@joelfernandes.org/
patch subject: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB

in testcase: blktests
version: blktests-x86_64-3f75e62-1_20231017
with following parameters:

	disk: 1SSD
	test: nvme-group-00
	nvme_trtype: rdma

compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)

+----------------+------------+------------+
|                | 3006adf3be | 7b0c45f509 |
+----------------+------------+------------+
| boot_successes | 0          | 3          |
+----------------+------------+------------+

If you fix the issue in a separate patch/commit (i.e.
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202310192232.750e5c5b-oliver.sang@intel.com [ 55.309389][ C1] ------------[ cut here ]------------ [ 55.315508][ C1] WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:1182 nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.325508][ C1] Modules linked in: intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp btrfs blake2b_generic kvm_intel xor kvm raid6_pq zstd_compress irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel libcrc32c sha512_ssse3 crc32c_intel ipmi_ssif rapl nvme intel_cstate nvme_core mei_me ast t10_pi dax_hmem drm_shmem_helper crc64_rocksoft_generic idxd crc64_rocksoft mei drm_kms_helper wmi idxd_bus joydev i2c_ismt crc64 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad drm fuse ip_tables [ 55.380240][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.6.0-rc4-00038-g7b0c45f5095f #1 [ 55.390037][ C1] RIP: 0010:nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.396018][ C1] Code: 84 c0 74 06 0f 8e d3 00 00 00 45 88 b4 24 28 0a 00 00 48 83 c4 08 bf 07 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 22 b0 f6 ff <0f> 0b e9 1b fe ff ff e8 76 6e 72 00 e9 66 fd ff ff e8 cc 6e 72 00 All code ======== 0: 84 c0 test %al,%al 2: 74 06 je 0xa 4: 0f 8e d3 00 00 00 jle 0xdd a: 45 88 b4 24 28 0a 00 mov %r14b,0xa28(%r12) 11: 00 12: 48 83 c4 08 add $0x8,%rsp 16: bf 07 00 00 00 mov $0x7,%edi 1b: 5b pop %rbx 1c: 41 5c pop %r12 1e: 41 5d pop %r13 20: 41 5e pop %r14 22: 41 5f pop %r15 24: 5d pop %rbp 25: e9 22 b0 f6 ff jmpq 0xfffffffffff6b04c 2a:* 0f 0b ud2 <-- trapping instruction 2c: e9 1b fe ff ff jmpq 0xfffffffffffffe4c 31: e8 76 6e 72 00 callq 0x726eac 36: e9 66 fd ff ff jmpq 0xfffffffffffffda1 3b: e8 cc 6e 72 00 callq 0x726f0c Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 
2: e9 1b fe ff ff jmpq 0xfffffffffffffe22 7: e8 76 6e 72 00 callq 0x726e82 c: e9 66 fd ff ff jmpq 0xfffffffffffffd77 11: e8 cc 6e 72 00 callq 0x726ee2 [ 55.418037][ C1] RSP: 0018:ffa00000001f8f58 EFLAGS: 00010046 [ 55.424802][ C1] RAX: 0000000000000000 RBX: 000000000003a100 RCX: ffffffff8444c928 [ 55.433718][ C1] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ff110017fc8ba164 [ 55.442631][ C1] RBP: ffa00000001f8f88 R08: 0000000000000001 R09: ffe21c02ff91742c [ 55.451542][ C1] R10: ff110017fc8ba167 R11: ffa00000001f8ff8 R12: ff110017fc8ba100 [ 55.460461][ C1] R13: ff110017fc8ba164 R14: 0000000000000000 R15: 0000000000000001 [ 55.470959][ C1] FS: 0000000000000000(0000) GS:ff110017fc880000(0000) knlGS:0000000000000000 [ 55.482348][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.491067][ C1] CR2: 00007fabd7bff699 CR3: 000000407de46006 CR4: 0000000000f71ee0 [ 55.501337][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 55.511601][ C1] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 55.521859][ C1] PKRU: 55555554 [ 55.527131][ C1] Call Trace: [ 55.532072][ C1] <IRQ> [ 55.536527][ C1] ? __warn (kernel/panic.c:673) [ 55.542341][ C1] ? nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.548935][ C1] ? report_bug (lib/bug.c:180 lib/bug.c:219) [ 55.555241][ C1] ? handle_bug (arch/x86/kernel/traps.c:237) [ 55.561323][ C1] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) [ 55.567792][ C1] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) [ 55.574671][ C1] ? nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1)) [ 55.581230][ C1] ? nohz_csd_func (arch/x86/include/asm/atomic.h:23 arch/x86/include/asm/atomic.h:135 include/linux/atomic/atomic-arch-fallback.h:1433 include/linux/atomic/atomic-arch-fallback.h:1565 include/linux/atomic/atomic-instrumented.h:862 kernel/sched/core.c:1181) [ 55.587667][ C1] ? 
task_mm_cid_work (kernel/sched/core.c:1173)
[ 55.594511][ C1] __flush_smp_call_function_queue (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/csd.h:64 kernel/smp.c:134 kernel/smp.c:531)
[ 55.602619][ C1] __sysvec_call_function_single (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:99 arch/x86/kernel/smp.c:293)
[ 55.610431][ C1] sysvec_call_function_single (arch/x86/kernel/smp.c:287 (discriminator 14))
[ 55.617918][ C1] </IRQ>
[ 55.622373][ C1] <TASK>
[ 55.624388][ C2] ------------[ cut here ]------------
[ 55.625607][ C1] asm_sysvec_call_function_single (arch/x86/include/asm/idtentry.h:652)
[ 55.631669][ C2] WARNING: CPU: 2 PID: 0 at kernel/sched/core.c:1182 nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.638279][ C1] RIP: _nohz_idle_balance+0xd9/0x7f0
[ 55.648220][ C2] Modules linked in:
[ 55.655250][ C1] Code: 48 74 0a c7 05 c0 0f ce 04 00 00 00 00 8b 44 24 2c 83 e0 08 89 44 24 14 74 0a c7 05 ad 0f ce 04 00 00 00 00 f0 83 44 24 fc 00 <49> c7 c5 10 c4 3f 85 41 83 c4 01 48 b8 00 00 00 00 00 fc ff df 4c
All code
========
   0:	48 74 0a             	rex.W je 0xd
   3:	c7 05 c0 0f ce 04 00 	movl   $0x0,0x4ce0fc0(%rip)        # 0x4ce0fcd
   a:	00 00 00
   d:	8b 44 24 2c          	mov    0x2c(%rsp),%eax
  11:	83 e0 08             	and    $0x8,%eax
  14:	89 44 24 14          	mov    %eax,0x14(%rsp)
  18:	74 0a                	je     0x24
  1a:	c7 05 ad 0f ce 04 00 	movl   $0x0,0x4ce0fad(%rip)        # 0x4ce0fd1
  21:	00 00 00
  24:	f0 83 44 24 fc 00    	lock addl $0x0,-0x4(%rsp)
  2a:*	49 c7 c5 10 c4 3f 85 	mov    $0xffffffff853fc410,%r13		<-- trapping instruction
  31:	41 83 c4 01          	add    $0x1,%r12d
  35:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  3c:	fc ff df
  3f:	4c                   	rex.WR

Code starting with the faulting instruction
===========================================
   0:	49 c7 c5 10 c4 3f 85 	mov    $0xffffffff853fc410,%r13
   7:	41 83 c4 01          	add    $0x1,%r12d
   b:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  12:	fc ff df
  15:	4c                   	rex.WR
[ 55.656534][ C2] intel_rapl_msr
[ 55.660852][ C1] RSP: 0018:ffa000000865fdb0 EFLAGS: 00000246
[ 55.682783][ C2] intel_rapl_common
[ 55.686779][ C1] RAX: 0000000000000008 RBX: 0000000000000001 RCX: ffffffff812b76c7
[ 55.693493][ C2] x86_pkg_temp_thermal
[ 55.697774][ C1] RDX: dffffc0000000000 RSI: 0000000000000008 RDI: 0000000000000001
[ 55.706653][ C2] intel_powerclamp
[ 55.711232][ C1] RBP: ffa000000865fe90 R08: 0000000000000001 R09: ffe21c02ff91742c
[ 55.720120][ C2] coretemp btrfs
[ 55.724298][ C1] R10: ff110017fc8ba167 R11: 0000000000000014 R12: 0000000000000001
[ 55.733156][ C2] blake2b_generic
[ 55.737157][ C1] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 55.746019][ C2] kvm_intel xor
[ 55.750115][ C1] ? nohz_run_idle_balance (arch/x86/include/asm/atomic.h:23 arch/x86/include/asm/atomic.h:135 include/linux/atomic/atomic-arch-fallback.h:1433 include/linux/atomic/atomic-arch-fallback.h:1565 include/linux/atomic/atomic-instrumented.h:862 kernel/sched/fair.c:11954)
[ 55.758991][ C2] kvm
[ 55.762902][ C1] ? clockevents_program_event (kernel/time/clockevents.c:336 (discriminator 3))
[ 55.768839][ C2] raid6_pq zstd_compress
[ 55.771772][ C1] ? rebalance_domains (kernel/sched/fair.c:11826)
[ 55.778197][ C2] irqbypass
[ 55.782972][ C1] ? __flush_smp_call_function_queue (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/csd.h:64 kernel/smp.c:134 kernel/smp.c:531)
[ 55.788612][ C2] crct10dif_pclmul crc32_pclmul
[ 55.792132][ C1] do_idle (arch/x86/include/asm/current.h:41 include/linux/sched/idle.h:31 kernel/sched/idle.c:255)
[ 55.799153][ C2] ghash_clmulni_intel
[ 55.804630][ C1] cpu_startup_entry (kernel/sched/idle.c:379 (discriminator 1))
[ 55.809028][ C2] libcrc32c sha512_ssse3
[ 55.813499][ C1] start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
[ 55.818765][ C2] crc32c_intel
[ 55.823547][ C1] ? set_cpu_sibling_map (arch/x86/kernel/smpboot.c:240)
[ 55.828795][ C2] ipmi_ssif
[ 55.832605][ C1] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)
[ 55.838652][ C2] rapl
[ 55.842150][ C1] </TASK>
[ 55.848889][ C2] nvme
[ 55.851904][ C1] ---[ end trace 0000000000000000 ]---
[ 55.855226][ C2] intel_cstate
[ 55.856376][ T1] systemd[1]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 55.929716][ C2] nvme_core mei_me ast t10_pi dax_hmem drm_shmem_helper crc64_rocksoft_generic idxd crc64_rocksoft mei drm_kms_helper wmi idxd_bus joydev i2c_ismt crc64 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad drm fuse ip_tables
[ 55.958456][ C2] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G W 6.6.0-rc4-00038-g7b0c45f5095f #1
[ 55.971146][ C2] RIP: 0010:nohz_csd_func (kernel/sched/core.c:1182 (discriminator 1))
[ 55.978359][ C2] Code: 84 c0 74 06 0f 8e d3 00 00 00 45 88 b4 24 28 0a 00 00 48 83 c4 08 bf 07 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 22 b0 f6 ff <0f> 0b e9 1b fe ff ff e8 76 6e 72 00 e9 66 fd ff ff e8 cc 6e 72 00
All code
========
   0:	84 c0                	test   %al,%al
   2:	74 06                	je     0xa
   4:	0f 8e d3 00 00 00    	jle    0xdd
   a:	45 88 b4 24 28 0a 00 	mov    %r14b,0xa28(%r12)
  11:	00
  12:	48 83 c4 08          	add    $0x8,%rsp
  16:	bf 07 00 00 00       	mov    $0x7,%edi
  1b:	5b                   	pop    %rbx
  1c:	41 5c                	pop    %r12
  1e:	41 5d                	pop    %r13
  20:	41 5e                	pop    %r14
  22:	41 5f                	pop    %r15
  24:	5d                   	pop    %rbp
  25:	e9 22 b0 f6 ff       	jmpq   0xfffffffffff6b04c
  2a:*	0f 0b                	ud2    	<-- trapping instruction
  2c:	e9 1b fe ff ff       	jmpq   0xfffffffffffffe4c
  31:	e8 76 6e 72 00       	callq  0x726eac
  36:	e9 66 fd ff ff       	jmpq   0xfffffffffffffda1
  3b:	e8 cc 6e 72 00       	callq  0x726f0c

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 1b fe ff ff       	jmpq   0xfffffffffffffe22
   7:	e8 76 6e 72 00       	callq  0x726e82
   c:	e9 66 fd ff ff       	jmpq   0xfffffffffffffd77
  11:	e8 cc 6e 72 00       	callq  0x726ee2

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231019/202310192232.750e5c5b-oliver.sang@intel.com
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb225921bbca..2ece55f32782 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu)
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
 	 * @idle_cpus_mask store, it must observe the @has_blocked
-	 * and @needs_update stores.
+	 * stores.
 	 */
 	smp_mb__after_atomic();

 	set_cpu_sd_state_idle(cpu);
-	WRITE_ONCE(nohz.needs_update, 1);
 out:
 	/*
 	 * Each time a cpu enter idle, we assume that it has blocked load and
@@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 }

 /*
- * Check if we need to run the ILB for updating blocked load before entering
- * idle state.
+ * Check if we need to run the ILB for updating blocked load and/or updating
+ * nohz.next_balance before entering idle state.
  */
 void nohz_run_idle_balance(int cpu)
 {
 	unsigned int flags;

-	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
+	flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu));
+
+	if (!flags)
+		return;

 	/*
 	 * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
 	 * (ie NOHZ_STATS_KICK set) and will do the same.
 	 */
-	if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
-		_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK);
+	if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) &&
+	    !need_resched())
+		_nohz_idle_balance(cpu_rq(cpu), flags);
 }

 static void nohz_newidle_balance(struct rq *this_rq)
@@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;

+	/* If rq->next_balance before nohz.next_balance, trigger ILB */
+	if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance)))
+		atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu));
+
 	/* Don't need to update blocked load of idle CPUs*/
 	if (!READ_ONCE(nohz.has_blocked) ||
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))