Stopping the tick on a fully loaded system

Message ID 80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de
State New
Series Stopping the tick on a fully loaded system

Commit Message

Anna-Maria Behnsen July 20, 2023, 6:51 a.m. UTC
  Hi,

during tests of the timer pull model, Gautham observed regressions under
load. With the timer pull model in place, going idle is more expensive. My
naive assumption that a fully loaded system will not go idle was simply
wrong. On a fully loaded system (top shows ~1% idle), some CPUs go idle,
stop the tick for several microseconds, come back to work, and this repeats
heavily.

Peter and tglx helped me track down the reason: the governor which decides
whether the tick will be stopped only looks at the next timer but does not
take into account how busy the system is. Here Peter pointed to the
scheduler's avg_idle value.

Besides the existing avg_idle, I introduced really_avg_idle, which is not
limited to twice max_idle_balance_cost but is updated in ttwu_do_activate()
whenever avg_idle is updated.
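
As a rough stand-alone sketch of the relationship between the two values
(my own user-space toy model, not the kernel source: the 1/8 weighting
mirrors the kernel's update_avg() helper, the clamping mirrors the
twice-max_idle_balance_cost limit described above, and all names are
illustrative):

```c
#include <stdint.h>

/* Toy model of the clamped avg_idle and the unclamped really_avg_idle. */
struct rq_model {
	uint64_t avg_idle;              /* clamped average, as upstream */
	uint64_t really_avg_idle;       /* unclamped variant from the PoC */
	uint64_t max_idle_balance_cost;
};

/* Nudge the average by 1/8 of the error, like the kernel's update_avg(). */
static void model_update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)sample - (int64_t)*avg;

	*avg += diff / 8;
}

/* Models the update done on wakeup: avg_idle is capped at twice
 * max_idle_balance_cost, really_avg_idle is not. */
static void model_update_idle_avgs(struct rq_model *rq, uint64_t idle_ns)
{
	uint64_t max = 2 * rq->max_idle_balance_cost;

	model_update_avg(&rq->avg_idle, idle_ns);
	if (rq->avg_idle > max)
		rq->avg_idle = max;

	model_update_avg(&rq->really_avg_idle, idle_ns);
}
```

With long idle samples, avg_idle saturates at the cap while
really_avg_idle keeps tracking the actual idle durations, which is why
only the latter can meaningfully be compared against TICK_NSEC.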

With tracing, I was able to see that in the fully loaded case, 75%-80% of
the idle periods were shorter than the really_avg_idle value. (trace_printk
of the really_avg_idle value directly at the beginning of
tick_nohz_next_event(); enabling the sched_wakeup tracepoint; taking the
delta between the timestamps of the former and the latter as idle time).

A generalized approach to prevent going idle when the system is loaded
would be to add a check of how busy the system is to
tick_nohz_next_event().

In my PoC (find it at the end) it is simply checked whether the
really_avg_idle value is smaller than TICK_NSEC. It's not possible to use
the existing avg_idle value, as it is always smaller than TICK_NSEC on
HZ=250 systems. But regressions still occurred under load, and the standard
deviation of the test results was in the same range as the regression
itself (between 5 and 10%).
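
The check itself can be sketched stand-alone like this (a hedged
illustration of the PoC's idea, not the patch itself; the MODEL_* names
are mine):

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_HZ	250
#define NSEC_PER_SEC	1000000000ULL
#define MODEL_TICK_NSEC	(NSEC_PER_SEC / MODEL_HZ)	/* 4,000,000 ns */

/*
 * The PoC idea as described above: when the recent (unclamped) average
 * idle period is shorter than one tick, stopping the tick is unlikely
 * to pay off, so don't bother.
 */
static bool model_may_stop_tick(uint64_t really_avg_idle_ns)
{
	return really_avg_idle_ns >= MODEL_TICK_NSEC;
}
```

A 20us average (typical under full load, per the numbers later in the
thread) keeps the tick running; a 5ms average allows stopping it.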

So I wanted to understand the avg_idle values and examined their
distribution under different load scenarios. My next naive assumption was
that under load mainly short values would be seen. This is true when the
system is halfway loaded (top shows ~50% idle). But when the system is
fully loaded, the avg_idle values are no longer 'focused' on small values.

Here I stopped and started to write this mail. I don't know the reason for
the distribution under load, and I also don't know whether the idea of
checking the system load in tick_nohz_next_event() is good or not. And last
but not least, the scheduler is a closed book for me...

Thanks,

	Anna-Maria

---8<---
  

Comments

Anna-Maria Behnsen July 20, 2023, 1 p.m. UTC | #1
Hi Vincent,

On Thu, 20 Jul 2023, Vincent Guittot wrote:

> On Thu, 20 Jul 2023 at 08:51, Anna-Maria Behnsen
> <anna-maria@linutronix.de> wrote:
> >
> > Hi,
> >
> > during tests of the timer pull model, Gautham observed regressions under
> > load. With the timer pull model in place, going idle is more expensive. My
> > naive assumption, that a system which is fully loaded will not go idle was
> > simply wrong. Under a fully loaded system (top shows ~1% idle), some CPUs
> > go idle and stop the tick for several us and come back to work and this
> > heavily repeats.
> >
> > Peter and tglx helped me to track it down to find the reason: The governor
> > which decides if the tick will be stopped only looks at the next timer but
> > does not take into account how busy the system is. Here Peter pointed to
> > the scheduler avg_idle value.
> >
> > Beside the existing avg_idle, I introduced really_avg_idle which is not
> > limited to twice max_idle_balance_cost but also updated in
> > ttwu_do_activate() when avg_idle is updated.
> >
> > With tracing, I was able to see that in the fully loaded case, 75%-80% of
> > the idle periods have been shorter than the really_avg_idle value. (trace
> > printk of really_avg_idle values directly at the begin of
> > tick_nohz_next_event(); enabling sched_wakeup tracepoint; take the delta
> > between the timestamps of the first and the latter as idle time).
> >
> > A generalized approach to prevent going idle, when the system is loaded,
> > would be to add a check how busy the system is to tick_nohz_next_event().
> 
> Would it be better to let the cpuidle governor decide whether to stop
> the tick or not instead?
> With your change, tick_nohz_next_event() becomes a random value which
> might return something else than the real next event
> 
> you might be interested in this:
> https://lore.kernel.org/all/20230105145159.1089531-1-kajetan.puchalski@arm.com/
> 
> They use cpu utilization to stay in shallow c-states some of which
> don't stop the tick. Maybe you can extend this to make sure not to stop
> the tick under high load

I also had a look at teo. It makes things better but does not solve the
underlying problem that I see here - please correct me if I missed
something or if I'm simply wrong:

Yes, the governors have to decide in the end whether it makes sense to
stop the tick or not. For this decision, the governors require information
about the current state of the core and how long nothing will probably
have to be done. At the moment the governors therefore call
tick_nohz_get_sleep_length(). This first checks whether the tick can be
stopped. Then it takes into account whether rcu, irq_work or arch_work
needs the CPU or a timer softirq is pending. If none of this is true, the
timers are checked. So tick_nohz_get_sleep_length() isn't only based on
timers already.
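
The order of checks described here can be modeled stand-alone like this (a
simplified sketch; the struct fields are hypothetical stand-ins for the
real per-CPU state, not the kernel's data structures):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the per-CPU state consulted by the checks. */
struct model_cpu_state {
	bool can_stop_tick;		/* basic nohz preconditions hold */
	bool rcu_needs_cpu;
	bool irq_work_needs_cpu;
	bool arch_needs_cpu;
	bool timer_softirq_pending;
	uint64_t now_ns;
	uint64_t next_timer_ns;		/* expiry of the earliest timer */
	uint64_t next_tick_ns;		/* time of the next periodic tick */
};

/* Estimated sleep length in ns: tick preconditions first, then the
 * "something still relies on the tick" booleans, then the timers. */
static uint64_t model_sleep_length(const struct model_cpu_state *s)
{
	if (!s->can_stop_tick)
		return s->next_tick_ns - s->now_ns;
	if (s->rcu_needs_cpu || s->irq_work_needs_cpu ||
	    s->arch_needs_cpu || s->timer_softirq_pending)
		return s->next_tick_ns - s->now_ns;
	return s->next_timer_ns - s->now_ns;
}
```

Only when none of the boolean "the tick is still needed" conditions hold
does the next timer actually determine the estimate.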

The information about the sleep length from the scheduler's perspective is
completely missing in the currently existing check for the probable sleep
length.

Sure, teo takes scheduler utilization into account directly in the
governor. But for me it is not comprehensible why the CPU utilization check
is done after asking for the possible sleep length, where timers are taken
into account. If the CPU is busy anyway, the information generated by
tick_nohz_next_event() is irrelevant. And when the CPU is not busy, then it
makes sense to ask for the sleep length also from a timer perspective.

When this CPU utilization check is implemented directly inside the
governor, every governor has to implement it on its own. So wouldn't it
make sense to implement a "how utilized is the CPU from a scheduler
perspective" check in one place and use it as the first check in
tick_nohz_get_sleep_length()/tick_nohz_next_event()?

> >
> > In my PoC (find it at the end) it's simply checked whether the
> > really_avg_idle value is smaller than TICK_NSEC. It's not possible to use
> > the existing avg_idle value as it is always smaller than TICK_NSEC on 250HZ
> > systems. But there regressions occur under load and the standard deviation
> > of the test results were in the same range as the regression (between 5 and
> > 10%).
> >
> > So I wanted to understand the avg_idle values and examined the distribution
> > with different load scenarios. There my next naive assumption was, that
> > under load mainly short values will be seen. This is true, when the system
> > is halfway loaded (top shows ~50% idle). But when the system is fully
> > loaded, the avg_idle values are no longer 'focused' on small values.
> 
> avg_idle is broken for what you want to do. It is updated only when
> you leave an idle state which means that it stays stuck to the old avg
> idle time when your system is fully busy
>

As I said, I'm not familiar with scheduler internals. If avg_idle is not
the right thing, there might be another possibility to generate the
information about a possible sleep length from the already existing
scheduler data.

Thanks,

	Anna-Maria
  
Vincent Guittot July 20, 2023, 1:55 p.m. UTC | #2
On Thu, 20 Jul 2023 at 15:00, Anna-Maria Behnsen
<anna-maria@linutronix.de> wrote:
>
> Hi Vincent,
>
> On Thu, 20 Jul 2023, Vincent Guittot wrote:
>
> > On Thu, 20 Jul 2023 at 08:51, Anna-Maria Behnsen
> > <anna-maria@linutronix.de> wrote:
> > >
> > > Hi,
> > >
> > > during tests of the timer pull model, Gautham observed regressions under
> > > load. With the timer pull model in place, going idle is more expensive. My
> > > naive assumption, that a system which is fully loaded will not go idle was
> > > simply wrong. Under a fully loaded system (top shows ~1% idle), some CPUs
> > > go idle and stop the tick for several us and come back to work and this
> > > heavily repeats.
> > >
> > > Peter and tglx helped me to track it down to find the reason: The governor
> > > which decides if the tick will be stopped only looks at the next timer but
> > > does not take into account how busy the system is. Here Peter pointed to
> > > the scheduler avg_idle value.
> > >
> > > Beside the existing avg_idle, I introduced really_avg_idle which is not
> > > limited to twice max_idle_balance_cost but also updated in
> > > ttwu_do_activate() when avg_idle is updated.
> > >
> > > With tracing, I was able to see that in the fully loaded case, 75%-80% of
> > > the idle periods have been shorter than the really_avg_idle value. (trace
> > > printk of really_avg_idle values directly at the begin of
> > > tick_nohz_next_event(); enabling sched_wakeup tracepoint; take the delta
> > > between the timestamps of the first and the latter as idle time).
> > >
> > > A generalized approach to prevent going idle, when the system is loaded,
> > > would be to add a check how busy the system is to tick_nohz_next_event().
> >
> > Would it be better to let the cpuidle governor decide whether to stop
> > the tick or not instead?
> > With your change, tick_nohz_next_event() becomes a random value which
> > might return something else than the real next event
> >
> > you might be interested in this:
> > https://lore.kernel.org/all/20230105145159.1089531-1-kajetan.puchalski@arm.com/
> >
> > They use cpu utilization to stay in shallow c-states some of which
> > don't stop the tick. Maybe you can extend this to make sure not to stop
> > the tick under high load
>
> I had also a look at teo. It makes things better but does not solve the
> underlying problem that I see here - please correct me if I missed
> something or if I'm simply wrong:
>
> Yes, the governors have to decide in the end, whether it makes sense to
> stop the tick or not. For this decision, the governors require information
> about the current state of the core and how long nothing has to be done
> probably. At the moment the governors therefore call
> tick_nohz_get_sleep_length(). This checks first whether the tick can be
> stopped. Then it takes into account whether rcu, irq_work, arch_work needs
> the CPU or a timer softirq is pending. If none of this is true, then the
> timers are checked. So tick_nohz_get_sleep_length() isn't only based on
> timers already.

Yes, but all these are booleans saying that they rely on the tick to
work correctly.

>
> The information about the sleep length of the scheduler perspective is
> completely missing in the current existing check for the probable sleep
> length.

The scheduler is not able to estimate the sleep length, and its avg_idle
is just a rough and wrong estimate which already fails to do its job in
several cases. It's quite difficult, if not impossible, to predict a sleep
duration. Even the cpuidle governors, which are the experts for this, often
fail. That's partly why teo has been added: the menu governor was failing
to correctly predict sleep duration for some systems.

>
> Sure, teo takes scheduler utilization into account directly in the
> governor. But for me it is not comprehensible, why the CPU utilization
> check is done after asking for the possible sleep length where timers are
> taken into account. If the CPU is busy anyway, the information generated by
> tick_nohz_next_event() is irrelevant. And when the CPU is not busy, then it
> makes sense to ask for the sleep length also from a timer perspective.

You should ask teo's maintainer and contributors for a detailed
explanation of the policy and why they check CPU utilization after getting
the next timer event.

>
> When this CPU utilization check is implemented directly inside the
> governor, every governor has to implement it on it's own. So wouldn't it
> make sense to implement a "how utilized is the CPU out of a scheduler
> perspective" in one place and use this as the first check in
> tick_nohz_get_sleep_length()/tick_nohz_next_event()?

tick_nohz_get_sleep_length() is not the probable sleep length but the
next timer event.

Whereas it's not clear what a utilized CPU means.

The next timer event and a CPU's utilization are two different parameters,
and the governors might want to apply different "weights" to these
parameters in their policy.

>
> > >
> > > In my PoC (find it at the end) it's simply checked whether the
> > > really_avg_idle value is smaller than TICK_NSEC. It's not possible to use
> > > the existing avg_idle value as it is always smaller than TICK_NSEC on 250HZ
> > > systems. But there regressions occur under load and the standard deviation
> > > of the test results were in the same range as the regression (between 5 and
> > > 10%).
> > >
> > > So I wanted to understand the avg_idle values and examined the distribution
> > > with different load scenarios. There my next naive assumption was, that
> > > under load mainly short values will be seen. This is true, when the system
> > > is halfway loaded (top shows ~50% idle). But when the system is fully
> > > loaded, the avg_idle values are no longer 'focused' on small values.
> >
> > avg_idle is broken for what you want to do. It is updated only when
> > you leave an idle state which means that it stays stuck to the old avg
> > idle time when your system is fully busy

Trying to predict sleep duration is quite complex and needs other inputs
than just the scheduler's. As an example, predicting irqs is another source
of uncertainty in your sleep duration. As a side note, I have already seen
some test results where randomly selecting an idle state did as well as the
current governors.

Vincent

> >
>
> As I said, I'm not familiar with scheduler internals. If avg_idle is not
> the right thing, there might be another possibility to generate the
> information about a possible sleep length out of the already existing
> scheduler data.
>
> Thanks,
>
>         Anna-Maria
>
  
Frederic Weisbecker July 23, 2023, 9:21 p.m. UTC | #3
(Adding Rafael in Cc)

On Thu, Jul 20, 2023 at 03:00:37PM +0200, Anna-Maria Behnsen wrote:
> I had also a look at teo. It makes things better but does not solve the
> underlying problem that I see here - please correct me if I missed
> something or if I'm simply wrong:
> 
> Yes, the governors have to decide in the end, whether it makes sense to
> stop the tick or not. For this decision, the governors require information
> about the current state of the core and how long nothing has to be done
> probably. At the moment the governors therefore call
> tick_nohz_get_sleep_length(). This checks first whether the tick can be
> stopped. Then it takes into account whether rcu, irq_work, arch_work needs
> the CPU or a timer softirq is pending. If none of this is true, then the
> timers are checked. So tick_nohz_get_sleep_length() isn't only based on
> timers already.

Right, but those things (rcu/irq work, etc...) act kind of like timers
here and they should be considered as exceptions.

The timer infrastructure shouldn't take into account the idle activity,
this is really a job for the cpuidle governors.
 
> The information about the sleep length of the scheduler perspective is
> completely missing in the current existing check for the probable sleep
> length.
> 
> Sure, teo takes scheduler utilization into account directly in the
> governor. But for me it is not comprehensible, why the CPU utilization
> check is done after asking for the possible sleep length where timers are
> taken into account. If the CPU is busy anyway, the information generated by
> tick_nohz_next_event() is irrelevant. And when the CPU is not busy, then it
> makes sense to ask for the sleep length also from a timer perspective.
> 
> When this CPU utilization check is implemented directly inside the
> governor, every governor has to implement it on its own. So wouldn't it
> make sense to implement a "how utilized is the CPU out of a scheduler
> perspective" in one place and use this as the first check in
> tick_nohz_get_sleep_length()/tick_nohz_next_event()?
> 

Well, beyond that, there might be other situations where the governor may
decide not to stop the tick even if tick_nohz_next_event() says it's possible
to do so. That's the purpose of having that next event as an input among many
others for the cpuidle governors.

As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time instead of
tick _stop_ time is always going to be problematic.

Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
instead? This will change a bit the locking scenario because
tick_nohz_stop_tick() doesn't hold the base lock. Is it a problem though?
In the worst case a remote tick happens and handles the earliest timer
for the current CPU while it's between tick_nohz_next_event() and
tick_nohz_stop_tick(), but then the current CPU would just propagate
an earlier deadline than needed. No big deal.

Though I could be overlooking some race or something else making that
not possible, of course...

Thanks.
  
Rafael J. Wysocki July 24, 2023, 8:23 a.m. UTC | #4
On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker
<frederic@kernel.org> wrote:
>
> (Adding Rafael in Cc)
>
> On Thu, Jul 20, 2023 at 03:00:37PM +0200, Anna-Maria Behnsen wrote:
> > I had also a look at teo. It makes things better but does not solve the
> > underlying problem that I see here - please correct me if I missed
> > something or if I'm simply wrong:
> >
> > Yes, the governors have to decide in the end, whether it makes sense to
> > stop the tick or not. For this decision, the governors require information
> > about the current state of the core and how long nothing has to be done
> > probably. At the moment the governors therefore call
> > tick_nohz_get_sleep_length(). This checks first whether the tick can be
> > stopped. Then it takes into account whether rcu, irq_work, arch_work needs
> > the CPU or a timer softirq is pending. If none of this is true, then the
> > timers are checked. So tick_nohz_get_sleep_length() isn't only based on
> > timers already.
>
> Right but those things (rcu/irq work, etc...) act kind of like timers here
> and they should be considered as exceptions.
>
> The timer infrastructure shouldn't take into account the idle activity,
> this is really a job for the cpuidle governors.
>
> > The information about the sleep length of the scheduler perspective is
> > completely missing in the current existing check for the probable sleep
> > length.

Well, not exactly.

From the governor's perspective, tick_nohz_get_sleep_length() is
supposed to return a deterministic upper bound on the upcoming idle
duration (or at least it is used in the governors this way).  IOW, it
is expected that the CPU will not be idle longer than the
tick_nohz_get_sleep_length() return value.

There are other factors that may cause the governor to predict a
shorter idle duration and the information coming from the scheduler
may be regarded as one of them, but they are not deterministic.

> > Sure, teo takes scheduler utilization into account directly in the
> > governor. But for me it is not comprehensible, why the CPU utilization
> > check is done after asking for the possible sleep length where timers are
> > taken into account. If the CPU is busy anyway, the information generated by
> > tick_nohz_next_event() is irrelevant.

Why isn't it?

The CPU is idle at that point and it has gone idle for a reason.
Surely, there was nothing to run on it.

The scheduler only knows that the CPU has been busy recently, which may
not imply anything about whether and when it is going to be busy again.

> > And when the CPU is not busy, then it
> > makes sense to ask for the sleep length also from a timer perspective.
> >
> > When this CPU utilization check is implemented directly inside the
> > governor, every governor has to implement it on its own. So wouldn't it
> > make sense to implement a "how utilized is the CPU out of a scheduler

Yes, it would make sense to do that, but I thought that PELT was that
thing.  Wasn't it?

> > perspective" in one place and use this as the first check in
> > tick_nohz_get_sleep_length()/tick_nohz_next_event()?

No, IMO it should be a separate check, because
tick_nohz_get_sleep_length() is expected to return a deterministic
upper bound on the idle duration whereas the scheduler information is
not deterministic.  It is the governor's role to decide how to take it
into account.

> Well, beyond that, there might be other situations where the governor may
> decide not to stop the tick even if tick_nohz_next_event() says it's possible
> to do so. That's the purpose of having that next event as an input among many
> others for the cpuidle governors.

Exactly.

> As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time instead of
> tick _stop_ time is always going to be problematic.
>
> Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
> instead? This will change a bit the locking scenario because
> tick_nohz_stop_tick() doesn't hold the base lock. Is it a problem though?
> In the worst case a remote tick happens and handles the earliest timer
> for the current CPU while it's between tick_nohz_next_event() and
> tick_nohz_stop_tick(), but then the current CPU would just propagate
> an earlier deadline than needed. No big deal.

FWIW, this sounds reasonable to me.
  
Anna-Maria Behnsen July 25, 2023, 1:07 p.m. UTC | #5
Hi,

On Mon, 24 Jul 2023, Rafael J. Wysocki wrote:

> On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker
> <frederic@kernel.org> wrote:
> >
> From the governor's perspective, tick_nohz_get_sleep_length() is
> supposed to return a deterministic upper bound on the upcoming idle
> duration (or at least it is used in the governors this way).  IOW, it
> is expected that the CPU will not be idle longer than the
> tick_nohz_get_sleep_length() return value.
> 
> There are other factors that may cause the governor to predict a
> shorter idle duration and the information coming from the scheduler
> may be regarded as one of them, but they are not deterministic.

Ok. Thanks for this explanation why two separate checks are required.

> > > Sure, teo takes scheduler utilization into account directly in the
> > > governor. But for me it is not comprehensible, why the CPU utilization
> > > check is done after asking for the possible sleep length where timers are
> > > taken into account. If the CPU is busy anyway, the information generated by
> > > tick_nohz_next_event() is irrelevant.
> 
> Why isn't it?
> 
> The CPU is idle at that point and it has gone idle for a reason.
> > Surely, there was nothing to run on it.
> 
> The scheduler only knows that the CPU has been busy recently, which
> > may not imply anything about whether or not and when it is going to be
> busy again.
>

I went back one step and simply had a look at current upstream to
understand the behavior when going idle under load in more detail. I wanted
to see the distribution of idle durations when the tick is stopped. I used
dbench tests with a script provided by Gautham to generate three different
loads: 100% load, 50% load and 25% load. The kernel is configured with
HZ=250. The system has 2 sockets, 64 cores per socket, 2 threads each ->
256 CPUs. Three minutes of trace data - idle periods larger than three
minutes are not tracked. The governor is teo.

I added tracepoints to the point where the tick is stopped and where the
tick is started again. I calculated the delta between the timestamps of
those two trace points and had a look at the distribution:


			100% load		50% load		25% load
			(top: ~2% idle)		(top: ~49% idle)	(top: ~74% idle;
									33 CPUs are completely idle)
			---------------		----------------	----------------------------
Idle Total		1658703	100%		3150522	100%		2377035 100%
x >= 4ms		2504	0.15%		2	0.00%		53	0.00%
4ms> x >= 2ms		390	0.02%		0	0.00%		4563	0.19%
2ms > x >= 1ms		62	0.00%		1	0.00%		54	0.00%
1ms > x >= 500us	67	0.00%		6	0.00%		2	0.00%
500us > x >= 250us	93	0.01%		39	0.00%		11	0.00%
250us > x >=100us	280	0.02%		1145	0.04%		633	0.03%
100us > x >= 50us	942	0.06%		30722	0.98%		13347	0.56%
50us > x >= 25us	26728	1.61%		310932	9.87%		106083	4.46%
25us > x >= 10us	825920	49.79%		2320683	73.66%		1722505	72.46%
10us > x > 5us		795197	47.94%		442991	14.06%		506008	21.29%
5us > x			6520	0.39%		43994	1.40%		23645	0.99%


99% of the tick stops have an idle period shorter than 50us (50us is
1.25% of a tick length).
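
For reference, the bucketing of the measured deltas in the table above can
be expressed as a small helper (my own stand-alone reconstruction of the
table rows; boundary handling at exactly 5us is simplified):

```c
#include <stdint.h>

/*
 * Map one idle period to a row of the table above, from ">= 4ms"
 * (index 0) down to "< 5us" (index 10).
 */
static int idle_bucket(uint64_t idle_ns)
{
	static const uint64_t lower_ns[] = {
		4000000, 2000000, 1000000, 500000, 250000,
		100000, 50000, 25000, 10000, 5000,
	};
	int i;

	for (i = 0; i < 10; i++) {
		if (idle_ns >= lower_ns[i])
			return i;
	}
	return 10;
}
```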

This is also the reason for my opinion that the return value of
tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
the return value is in the range of ticks, and a tick is ~100 times longer
than the duration of the majority of idle periods.


> > > And when the CPU is not busy, then it
> > > makes sense to ask for the sleep length also from a timer perspective.
> > >
> > > When this CPU utilization check is implemented directly inside the
> > > governor, every governor has to implement it on its own. So wouldn't it
> > > make sense to implement a "how utilized is the CPU out of a scheduler
> 
> Yes, it would make sense to do that, but I thought that PELT was that
> thing.  Wasn't it?
> 
> > > perspective" in one place and use this as the first check in
> > > tick_nohz_get_sleep_length()/tick_nohz_next_event()?

[...]

> > As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time instead of
> > tick _stop_ time is always going to be problematic.
> >
> > Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
> > instead? This will change a bit the locking scenario because
> > tick_nohz_stop_tick() doesn't hold the base lock. Is it a problem though?
> > In the worst case a remote tick happens and handles the earliest timer
> > for the current CPU while it's between tick_nohz_next_event() and
> > tick_nohz_stop_tick(), but then the current CPU would just propagate
> > an earlier deadline than needed. No big deal.
> 
> FWIW, this sounds reasonable to me.
> 

The worst case scenario will not happen, because remote timer expiry only
happens when the CPU is not active in the hierarchy. And with your
proposal, that is the case only after tick_nohz_stop_tick().

Nevertheless, I see some problems with this. But this also depends on
whether there is a need to change the current idle behavior or not. Right
now, these are my concerns:

- The determinism of tick_nohz_next_event() will break: its return value
  will not take into account whether this is the last CPU going idle, which
  then has to take care of remote timers. So the first timer of the CPU
  (regardless of global or local) has to be handed back even if it could be
  handled by the hierarchy.

- When tmigr_cpu_deactivate() is moved to tick_nohz_stop_tick() and its
  return value is before ts->next_tick, the expiry has to be modified in
  tick_nohz_stop_tick().

- The load is simply moved to a later place - tick_nohz_stop_tick() is
  never called without a preceding tick_nohz_next_event() call. Yes,
  tick_nohz_next_event() is called ~8% more often under load than
  tick_nohz_stop_tick(), but the 'quality' of the return value of
  tick_nohz_next_event() gets worse.

- The timer migration hierarchy is not a standalone timer infrastructure.
  It only makes sense in combination with the existing timer wheel. When
  the timer base is idle, the timer migration hierarchy with the migrators
  will do the job for global timers. So I'm not sure about the impact of
  the changed locking - but I'm pretty sure changing it increases the
  probability of ugly races hidden somewhere between the lines.

Thanks,

	Anna-Maria
  
Rafael J. Wysocki July 25, 2023, 2:27 p.m. UTC | #6
On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
<anna-maria@linutronix.de> wrote:
>
> Hi,
>
> On Mon, 24 Jul 2023, Rafael J. Wysocki wrote:
>
> > On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker
> > <frederic@kernel.org> wrote:
> > >
> > From the governor's perspective, tick_nohz_get_sleep_length() is
> > supposed to return a deterministic upper bound on the upcoming idle
> > duration (or at least it is used in the governors this way).  IOW, it
> > is expected that the CPU will not be idle longer than the
> > tick_nohz_get_sleep_length() return value.
> >
> > There are other factors that may cause the governor to predict a
> > shorter idle duration and the information coming from the scheduler
> > may be regarded as one of them, but they are not deterministic.
>
> Ok. Thanks for this explanation why two separate checks are required.
>
> > > > Sure, teo takes scheduler utilization into account directly in the
> > > > governor. But for me it is not comprehensible, why the CPU utilization
> > > > check is done after asking for the possible sleep length where timers are
> > > > taken into account. If the CPU is busy anyway, the information generated by
> > > > tick_nohz_next_event() is irrelevant.
> >
> > Why isn't it?
> >
> > The CPU is idle at that point and it has gone idle for a reason.
> > Surely, there was nothing to run on it.
> >
> > The scheduler only knows that the CPU has been busy recently, which
> > may not imply anything about whether or not and when it is going to be
> > busy again.
> >
>
> I went back one step and simply had a look at current upstream to
> understand the behavior when going idle under load more detailed. I wanted
> to see the distribution of idle time duration when the tick is stopped. I
> used dbench tests with a script provided by Gautham to generate three
> different loads: 100% load, 50% load and 25% load. The kernel is configured
> with HZ=250. The system has 2 sockets, 64 cores per socket, 2 threads each
> -> 256 CPUs. Three minutes trace data - idle periods larger than three
> minutes are not tracked. The governor is teo.
>
> I added tracepoints to the point where the tick is stopped and where the
> tick is started again. I calculated the delta between the timestamps of
> those two trace points and had a look at the distribution:
>
>
>                         100% load               50% load                25% load
>                         (top: ~2% idle)         (top: ~49% idle)        (top: ~74% idle;
>                                                                         33 CPUs are completely idle)
>                         ---------------         ----------------        ----------------------------
> Idle Total              1658703 100%            3150522 100%            2377035 100%
> x >= 4ms                2504    0.15%           2       0.00%           53      0.00%
> 4ms> x >= 2ms           390     0.02%           0       0.00%           4563    0.19%
> 2ms > x >= 1ms          62      0.00%           1       0.00%           54      0.00%
> 1ms > x >= 500us        67      0.00%           6       0.00%           2       0.00%
> 500us > x >= 250us      93      0.01%           39      0.00%           11      0.00%
> 250us > x >=100us       280     0.02%           1145    0.04%           633     0.03%
> 100us > x >= 50us       942     0.06%           30722   0.98%           13347   0.56%
> 50us > x >= 25us        26728   1.61%           310932  9.87%           106083  4.46%
> 25us > x >= 10us        825920  49.79%          2320683 73.66%          1722505 72.46%
> 10us > x > 5us          795197  47.94%          442991  14.06%          506008  21.29%
> 5us > x                 6520    0.39%           43994   1.40%           23645   0.99%
>
>
> 99% of the tick stops only have an idle period shorter than 50us (50us is
> 1.25% of a tick length).

Well, this just means that the governor predicts overly long idle
durations quite often under this workload.

The governor's decision on whether or not to stop the tick is based on
its idle duration prediction.  If it overshoots, that's how it goes.

> This is also the reason for my opinion, that the return of
> tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:

It is an upper bound and in a fully loaded case it may be way off.

> The return is in the range of ticks, and a tick is ~100 times longer than
> the duration of the majority of idle periods.

Overall, this means that there is room for improvements in the idle governor.

> > > > And when the CPU is not busy, then it
> > > > makes sense to ask for the sleep length also from a timer perspective.
> > > >
> > > > When this CPU utilization check is implemented directly inside the
> > > > governor, every governor has to implement it on its own. So wouldn't it
> > > > make sense to implement a "how utilized is the CPU out of a scheduler
> >
> > Yes, it would make sense to do that, but I thought that PELT was that
> > thing.  Wasn't it?
> >
> > > > perspective" in one place and use this as the first check in
> > > > tick_nohz_get_sleep_length()/tick_nohz_next_event()?
>
> [...]
>
> > > As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time instead of
> > > tick _stop_ time is always going to be problematic.
> > >
> > > Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
> > > instead? This will change a bit the locking scenario because
> > > tick_nohz_stop_tick() doesn't hold the base lock. Is it a problem though?
> > > In the worst case a remote tick happens and handles the earliest timer
> > > for the current CPU while it's between tick_nohz_next_event() and
> > > tick_nohz_stop_tick(), but then the current CPU would just propagate
> > > an earlier deadline than needed. No big deal.
> >
> > FWIW, this sounds reasonable to me.
> >
>
> The worst case scenario will not happen, because remote timer expiry only
> happens when CPU is not active in the hierarchy. And with your proposal
> this is valid after tick_nohz_stop_tick().
>
> Nevertheless, I see some problems with this. But this also depends on
> whether there is a need to change the current idle behavior. Right now,
> these are my concerns:
>
> - The determinism of tick_nohz_next_event() will break:

That would be bad.

>   The return of
>   tick_nohz_next_event() will not take into account, if it is the last CPU
>   going idle and then has to take care of remote timers. So the first timer
>   of the CPU (regardless of global or local) has to be handed back even if
>   it could be handled by the hierarchy.
>
> - When moving the tmigr_cpu_deactivate() to tick_nohz_stop_tick() and the
>   return value of tmigr_cpu_deactivate() is before the ts->next_tick, the
>   expiry has to be modified in tick_nohz_stop_tick().
>
> - The load is simply moved to a later place - tick_nohz_stop_tick() is
>   never called without a preceding tick_nohz_next_event() call. Yes,
>   tick_nohz_next_event() is called under load ~8% more than
>   tick_nohz_stop_tick(), but the 'quality' of the return value of
>   tick_nohz_next_event() is getting worse.
>
> - timer migration hierarchy is not a standalone timer infrastructure. It
>   only makes sense to handle it in combination with the existing timer
>   wheel. When the timer base is idle, the timer migration hierarchy with
>   the migrators will do the job for global timers. So, I'm not sure about
>   the impact of the changed locking - but I'm pretty sure changing that
>   increases the probability for ugly races hidden somewhere between the
>   lines.

I'll let Frederic respond to the above, but from my perspective it all
just means that the idle governors in use today are not perfect.

However, they will never be perfect, because they only have a little
time to make a decision, so it's a matter of balancing that with
precision.
  
Peter Zijlstra July 25, 2023, 10:28 p.m. UTC | #7
On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen

> >                         100% load               50% load                25% load
> >                         (top: ~2% idle)         (top: ~49% idle)        (top: ~74% idle;
> >                                                                         33 CPUs are completely idle)
> >                         ---------------         ----------------        ----------------------------
> > Idle Total              1658703 100%            3150522 100%            2377035 100%
> > x >= 4ms                2504    0.15%           2       0.00%           53      0.00%
> > 4ms> x >= 2ms           390     0.02%           0       0.00%           4563    0.19%
> > 2ms > x >= 1ms          62      0.00%           1       0.00%           54      0.00%
> > 1ms > x >= 500us        67      0.00%           6       0.00%           2       0.00%
> > 500us > x >= 250us      93      0.01%           39      0.00%           11      0.00%
> > 250us > x >=100us       280     0.02%           1145    0.04%           633     0.03%
> > 100us > x >= 50us       942     0.06%           30722   0.98%           13347   0.56%
> > 50us > x >= 25us        26728   1.61%           310932  9.87%           106083  4.46%
> > 25us > x >= 10us        825920  49.79%          2320683 73.66%          1722505 72.46%
> > 10us > x > 5us          795197  47.94%          442991  14.06%          506008  21.29%
> > 5us > x                 6520    0.39%           43994   1.40%           23645   0.99%
> >
> >
> > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > 1,25% of a tick length).
> 
> Well, this just means that the governor predicts overly long idle
> durations quite often under this workload.
> 
> The governor's decision on whether or not to stop the tick is based on
> its idle duration prediction.  If it overshoots, that's how it goes.

This is abysmal; IIRC TEO tracks a density function in C state buckets
and if it finds it's more likely to be shorter than 'predicted' by the
timer it should pick something shallower.

Given we have this density function, picking something that's <1% likely
is insane. In fact, it seems to suggest the whole pick-alternative thing
is utterly broken.

> > This is also the reason for my opinion, that the return of
> > tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
> 
> It is an upper bound and in a fully loaded case it may be way off.

But given we have our density function, we should be able to do much
better.


Oooh,... I think I see the problem. Our bins are strictly the available
C-state, but if you run this on a Zen3 that has ACPI-idle, then you end
up with something that only has 3 C states, like:

$ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
state0/residency: 0
state1/residency: 2
state2/residency: 36

Which means we only have buckets: (0,0] (0,2000], (2000,36000] or somesuch. All
of them very much smaller than TICK_NSEC.

That means we don't track nearly enough data to reliably tell anything
about disabling the tick or not. We should have at least one bucket
beyond TICK_NSEC for this.

Hmm.. it is getting very late, but how about I get the cpuidle framework
to pad the drv states with a few 'disabled' C states so that we have at
least enough data to cross the TICK_NSEC boundary and say something
usable about things.

Because as things stand, it's very likely we determine @stop_tick purely
based on what tick_nohz_get_sleep_length() tells us, not on what we've
learnt from recent history.


(FWIW intel_idle seems to not have an entry for Tigerlake !?! -- my poor
laptop, it feels neglected)
  
Frederic Weisbecker July 26, 2023, 10:47 a.m. UTC | #8
On Tue, Jul 25, 2023 at 03:07:05PM +0200, Anna-Maria Behnsen wrote:
> The worst case scenario will not happen, because remote timer expiry only
> happens when CPU is not active in the hierarchy. And with your proposal
> this is valid after tick_nohz_stop_tick().
> 
> Nevertheless, I see some problems with this. But this also depends on
> whether there is a need to change the current idle behavior. Right now,
> these are my concerns:
> 
> - The determinism of tick_nohz_next_event() will break: The return of
>   tick_nohz_next_event() will not take into account, if it is the last CPU
>   going idle and then has to take care of remote timers. So the first timer
>   of the CPU (regardless of global or local) has to be handed back even if
>   it could be handled by the hierarchy.

Bah, of course...

> 
> - When moving the tmigr_cpu_deactivate() to tick_nohz_stop_tick() and the
>   return value of tmigr_cpu_deactivate() is before the ts->next_tick, the
>   expiry has to be modified in tick_nohz_stop_tick().
> 
> - The load is simply moved to a later place - tick_nohz_stop_tick() is
>   never called without a preceding tick_nohz_next_event() call. Yes,
>   tick_nohz_next_event() is called under load ~8% more than
>   tick_nohz_stop_tick(), but the 'quality' of the return value of
>   tick_nohz_next_event() is getting worse.
> 
> - timer migration hierarchy is not a standalone timer infrastructure. It
>   only makes sense to handle it in combination with the existing timer
>   wheel. When the timer base is idle, the timer migration hierarchy with
>   the migrators will do the job for global timers. So, I'm not sure about
>   the impact of the changed locking - but I'm pretty sure changing that
>   increases the probability for ugly races hidden somewhere between the
>   lines.

Sure thing, and this won't be pretty.

> 
> Thanks,
> 
> 	Anna-Maria
  
Frederic Weisbecker July 26, 2023, 10:59 a.m. UTC | #9
On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> I'll let Frederic respond to the above, but from my perspective it all
> just means that the idle governors in use today are not perfect.
> 
> However, they will never be perfect, because they only have a little
> time to make a decision, so it's a matter of balancing that with
> precision.

So, even without considering Anna-Maria's patchset, sparing the next
timer lookup if we already know we won't stop the tick would show
quite a benefit, because that lookup involves locking and quite some
overhead walking through all levels.

Thanks.
  
Rafael J. Wysocki July 26, 2023, 3:07 p.m. UTC | #10
On Wed, Jul 26, 2023 at 12:59 PM Frederic Weisbecker
<frederic@kernel.org> wrote:
>
> On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> > I'll let Frederic respond to the above, but from my perspective it all
> > just means that the idle governors in use today are not perfect.
> >
> > However, they will never be perfect, because they only have a little
> > time to make a decision, so it's a matter of balancing that with
> > precision.
>
> So, even without considering Anna-Maria's patchset, sparing
> the next timer lookup if we already know we won't stop it would show
> quite a benefit because that lookup involves locking and quite some
> overhead walking through all levels.

You are right.

In principle, this can be done by using TICK_NSEC as the initial
estimation of the idle duration and only calling
tick_nohz_get_sleep_length() if the candidate state ends up in the
higher bins.  Interesting.
  
Rafael J. Wysocki July 26, 2023, 3:10 p.m. UTC | #11
On Wed, Jul 26, 2023 at 12:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
>
> > >                         100% load               50% load                25% load
> > >                         (top: ~2% idle)         (top: ~49% idle)        (top: ~74% idle;
> > >                                                                         33 CPUs are completely idle)
> > >                         ---------------         ----------------        ----------------------------
> > > Idle Total              1658703 100%            3150522 100%            2377035 100%
> > > x >= 4ms                2504    0.15%           2       0.00%           53      0.00%
> > > 4ms> x >= 2ms           390     0.02%           0       0.00%           4563    0.19%
> > > 2ms > x >= 1ms          62      0.00%           1       0.00%           54      0.00%
> > > 1ms > x >= 500us        67      0.00%           6       0.00%           2       0.00%
> > > 500us > x >= 250us      93      0.01%           39      0.00%           11      0.00%
> > > 250us > x >=100us       280     0.02%           1145    0.04%           633     0.03%
> > > 100us > x >= 50us       942     0.06%           30722   0.98%           13347   0.56%
> > > 50us > x >= 25us        26728   1.61%           310932  9.87%           106083  4.46%
> > > 25us > x >= 10us        825920  49.79%          2320683 73.66%          1722505 72.46%
> > > 10us > x > 5us          795197  47.94%          442991  14.06%          506008  21.29%
> > > 5us > x                 6520    0.39%           43994   1.40%           23645   0.99%
> > >
> > >
> > > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > > 1,25% of a tick length).
> >
> > Well, this just means that the governor predicts overly long idle
> > durations quite often under this workload.
> >
> > The governor's decision on whether or not to stop the tick is based on
> > its idle duration prediction.  If it overshoots, that's how it goes.
>
> This is abysmal; IIRC TEO tracks a density function in C state buckets
> and if it finds it's more likely to be shorter than 'predicted' by the
> timer it should pick something shallower.
>
> Given we have this density function, picking something that's <1% likely
> is insane. In fact, it seems to suggest the whole pick-alternative thing
> is utterly broken.
>
> > > This is also the reason for my opinion, that the return of
> > > tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
> >
> > It is an upper bound and in a fully loaded case it may be way off.
>
> But given we have our density function, we should be able to do much
> better.
>
>
> Oooh,... I think I see the problem. Our bins are strictly the available
> C-state, but if you run this on a Zen3 that has ACPI-idle, then you end
> up with something that only has 3 C states, like:
>
> $ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
> state0/residency: 0
> state1/residency: 2
> state2/residency: 36
>
> Which means we only have buckets: (0,0] (0,2000], (2000,36000] or somesuch. All
> of them very much smaller than TICK_NSEC.
>
> That means we don't track nearly enough data to reliably tell anything
> about disabling the tick or not. We should have at least one bucket
> beyond TICK_NSEC for this.

Quite likely.

> Hmm.. it is getting very late, but how about I get the cpuidle framework
> to pad the drv states with a few 'disabled' C states so that we have at
> least enough data to cross the TICK_NSEC boundary and say something
> usable about things.
>
> Because as things stand, it's very likely we determine @stop_tick purely
> based on what tick_nohz_get_sleep_length() tells us, not on what we've
> learnt from recent history.
>
>
> (FWIW intel_idle seems to not have an entry for Tigerlake !?! -- my poor
> laptop, it feels neglected)

It should then use ACPI _CST idle states.
  
Rafael J. Wysocki July 26, 2023, 3:53 p.m. UTC | #12
On Wed, Jul 26, 2023 at 5:10 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Jul 26, 2023 at 12:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> >
> > > >                         100% load               50% load                25% load
> > > >                         (top: ~2% idle)         (top: ~49% idle)        (top: ~74% idle;
> > > >                                                                         33 CPUs are completely idle)
> > > >                         ---------------         ----------------        ----------------------------
> > > > Idle Total              1658703 100%            3150522 100%            2377035 100%
> > > > x >= 4ms                2504    0.15%           2       0.00%           53      0.00%
> > > > 4ms> x >= 2ms           390     0.02%           0       0.00%           4563    0.19%
> > > > 2ms > x >= 1ms          62      0.00%           1       0.00%           54      0.00%
> > > > 1ms > x >= 500us        67      0.00%           6       0.00%           2       0.00%
> > > > 500us > x >= 250us      93      0.01%           39      0.00%           11      0.00%
> > > > 250us > x >=100us       280     0.02%           1145    0.04%           633     0.03%
> > > > 100us > x >= 50us       942     0.06%           30722   0.98%           13347   0.56%
> > > > 50us > x >= 25us        26728   1.61%           310932  9.87%           106083  4.46%
> > > > 25us > x >= 10us        825920  49.79%          2320683 73.66%          1722505 72.46%
> > > > 10us > x > 5us          795197  47.94%          442991  14.06%          506008  21.29%
> > > > 5us > x                 6520    0.39%           43994   1.40%           23645   0.99%
> > > >
> > > >
> > > > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > > > 1,25% of a tick length).
> > >
> > > Well, this just means that the governor predicts overly long idle
> > > durations quite often under this workload.
> > >
> > > The governor's decision on whether or not to stop the tick is based on
> > > its idle duration prediction.  If it overshoots, that's how it goes.
> >
> > This is abysmal; IIRC TEO tracks a density function in C state buckets
> > and if it finds it's more likely to be shorter than 'predicted' by the
> > timer it should pick something shallower.
> >
> > Given we have this density function, picking something that's <1% likely
> > is insane. In fact, it seems to suggest the whole pick-alternative thing
> > is utterly broken.
> >
> > > > This is also the reason for my opinion, that the return of
> > > > tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
> > >
> > > It is an upper bound and in a fully loaded case it may be way off.
> >
> > But given we have our density function, we should be able to do much
> > better.
> >
> >
> > Oooh,... I think I see the problem. Our bins are strictly the available
> > C-state, but if you run this on a Zen3 that has ACPI-idle, then you end
> > up with something that only has 3 C states, like:
> >
> > $ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
> > state0/residency: 0
> > state1/residency: 2
> > state2/residency: 36
> >
> > Which means we only have buckets: (0,0] (0,2000], (2000,36000] or somesuch. All
> > of them very much smaller than TICK_NSEC.
> >
> > That means we don't track nearly enough data to reliably tell anything
> > about disabling the tick or not. We should have at least one bucket
> > beyond TICK_NSEC for this.
>
> Quite likely.

So the reasoning here was that those additional bins would not be
necessary for idle state selection, but the problem of whether or not
to stop the tick is kind of separate from the idle state selection
problem if the target residency values for all of the idle states are
relatively short.  And so it should be addressed separately, which
currently it is not.  Admittedly, this is a mistake.
  
Peter Zijlstra July 26, 2023, 4:14 p.m. UTC | #13
On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:

> > > That means we don't track nearly enough data to reliably tell anything
> > > about disabling the tick or not. We should have at least one bucket
> > > beyond TICK_NSEC for this.
> >
> > Quite likely.
> 
> So the reasoning here was that those additional bins would not be
> necessary for idle state selection, but the problem of whether or not
> to stop the tick is kind of separate from the idle state selection
> problem if the target residency values for all of the idle states are
> relatively short.  And so it should be addressed separately, which
> currently it is not.  Admittedly, this is a mistake.

Right, the C state buckets are enough to pick a state, but not to handle
the tick thing.

The below hack boots on my ivb-ep with extra (disabled) states. Now let
me go hack up teo to make use of that.

name		residency

POLL		0
C1              1
C1E             80
C3              156
C6              300
TICK            1000
POST-TICK       2000


---
diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index d9cda7f6ccb9..5f435fb8b89f 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -147,13 +147,37 @@ static void cpuidle_setup_broadcast_timer(void *arg)
 		tick_broadcast_disable();
 }
 
+static int tick_enter(struct cpuidle_device *dev,
+		      struct cpuidle_driver *drv,
+		      int index)
+{
+	return -ENODEV;
+}
+
+static void __cpuidle_state_init_tick(struct cpuidle_state *s)
+{
+	strcpy(s->name, "TICK");
+	strcpy(s->desc, "(no-op)");
+
+	s->target_residency_ns = TICK_NSEC;
+	s->target_residency = div_u64(TICK_NSEC, NSEC_PER_USEC);
+
+	s->exit_latency_ns = 0;
+	s->exit_latency = 0;
+
+	s->flags |= CPUIDLE_FLAG_UNUSABLE;
+
+	s->enter = tick_enter;
+	s->enter_s2idle = tick_enter;
+}
+
 /**
  * __cpuidle_driver_init - initialize the driver's internal data
  * @drv: a valid pointer to a struct cpuidle_driver
  */
 static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 {
-	int i;
+	int tick = 0, post_tick = 0, i;
 
 	/*
 	 * Use all possible CPUs as the default, because if the kernel boots
@@ -192,6 +216,39 @@ static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 			s->exit_latency_ns =  0;
 		else
 			s->exit_latency = div_u64(s->exit_latency_ns, NSEC_PER_USEC);
+
+		if (!tick && s->target_residency_ns >= TICK_NSEC) {
+			tick = 1;
+
+			if (s->target_residency_ns == TICK_NSEC)
+				continue;
+
+			post_tick = 1;
+
+			memmove(&drv->states[i+1], &drv->states[i],
+				sizeof(struct cpuidle_state) * (CPUIDLE_STATE_MAX - i - 1));
+			drv->state_count++;
+
+			__cpuidle_state_init_tick(s);
+			i++;
+		}
+	}
+
+	if (!tick) {
+		struct cpuidle_state *s = &drv->states[i];
+		__cpuidle_state_init_tick(s);
+		drv->state_count++;
+		i++;
+	}
+
+	if (!post_tick) {
+		struct cpuidle_state *s = &drv->states[i];
+		__cpuidle_state_init_tick(s);
+		strcpy(s->name, "POST-TICK");
+		s->target_residency_ns *= 2;
+		s->target_residency *= 2;
+		drv->state_count++;
+		i++;
 	}
 }
 
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3183aeb7f5b4..a642ee9e916c 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -16,7 +16,7 @@
 #include <linux/hrtimer.h>
 #include <linux/context_tracking.h>
 
-#define CPUIDLE_STATE_MAX	10
+#define CPUIDLE_STATE_MAX	16
 #define CPUIDLE_NAME_LEN	16
 #define CPUIDLE_DESC_LEN	32
  
Anna-Maria Behnsen July 26, 2023, 4:40 p.m. UTC | #14
Hi,

On Wed, 26 Jul 2023, Peter Zijlstra wrote:

> On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> 
> > >                         100% load               50% load                25% load
> > >                         (top: ~2% idle)         (top: ~49% idle)        (top: ~74% idle;
> > >                                                                         33 CPUs are completely idle)
> > >                         ---------------         ----------------        ----------------------------
> > > Idle Total              1658703 100%            3150522 100%            2377035 100%
> > > x >= 4ms                2504    0.15%           2       0.00%           53      0.00%
> > > 4ms> x >= 2ms           390     0.02%           0       0.00%           4563    0.19%
> > > 2ms > x >= 1ms          62      0.00%           1       0.00%           54      0.00%
> > > 1ms > x >= 500us        67      0.00%           6       0.00%           2       0.00%
> > > 500us > x >= 250us      93      0.01%           39      0.00%           11      0.00%
> > > 250us > x >=100us       280     0.02%           1145    0.04%           633     0.03%
> > > 100us > x >= 50us       942     0.06%           30722   0.98%           13347   0.56%
> > > 50us > x >= 25us        26728   1.61%           310932  9.87%           106083  4.46%
> > > 25us > x >= 10us        825920  49.79%          2320683 73.66%          1722505 72.46%
> > > 10us > x > 5us          795197  47.94%          442991  14.06%          506008  21.29%
> > > 5us > x                 6520    0.39%           43994   1.40%           23645   0.99%
> > >
> > >
> > > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > > 1,25% of a tick length).
> > 
> > Well, this just means that the governor predicts overly long idle
> > durations quite often under this workload.
> > 
> > The governor's decision on whether or not to stop the tick is based on
> > its idle duration prediction.  If it overshoots, that's how it goes.
> 
> This is abysmal; IIRC TEO tracks a density function in C state buckets
> and if it finds it's more likely to be shorter than 'predicted' by the
> timer it should pick something shallower.
> 
> Given we have this density function, picking something that's <1% likely
> is insane. In fact, it seems to suggest the whole pick-alternative thing
> is utterly broken.
> 

When I tried to understand the cstates, I noticed cstates had been
disabled on the zen3 machine I used for testing - I'm sorry, pilot error.

So the numbers above are caused by calling tick_nohz_idle_stop_tick()
unconditionally in cpuidle_idle_call() when cpuidle_not_available() is
true.

The regression Gautham observed was then caused by tons of
tick_nohz_next_event() calls, which are more expensive with the current
implementation of the timer migration hierarchy (if he tested with cstates
enabled...).

Nevertheless, I reran the tests on current upstream with cstates enabled on
the zen3 machine and on a SKL-X with the governors teo, menu and ladder, and
generated the following numbers (100% load):

Zen3:
			teo			menu			ladder
			------------------	------------------	------------------
Idle Total		2533	100.00%		5123	100.00%		1333746	100.00%
x >= 4ms		1458	57.56%		2764	53.95%		2304	0.17%
4ms> x >= 2ms		91	3.59%		95	1.85%		98	0.01%
2ms > x >= 1ms		56	2.21%		66	1.29%		57	0.00%
1ms > x >= 500us	64	2.53%		74	1.44%		61	0.00%
500us > x >= 250us	73	2.88%		39	0.76%		69	0.01%
250us > x >=100us	76	3.00%		88	1.72%		502	0.04%
100us > x >= 50us	33	1.30%		104	2.03%		3976	0.30%
50us > x >= 25us	39	1.54%		289	5.64%		64463	4.83%
25us > x >= 10us	199	7.86%		830	16.20%		1245946	93.42%
10us > x > 5us		156	6.16%		231	4.51%		9452	0.71%
5us > x			288	11.37%		543	10.60%		6818	0.51%

tick_nohz_next_event()
total count		8839790			2113357			1363896



SKL-X:
			teo			menu			ladder
			------------------	------------------	------------------
Idle Total		2388	100.00%		2047	100.00%		693514	100.00%
x >= 4ms		2047	85.72%		1347	65.80%		1141	0.16%
4ms> x >= 2ms		29	1.21%		47	2.30%		18	0.00%
2ms > x >= 1ms		20	0.84%		9	0.44%		10	0.00%
1ms > x >= 500us	21	0.88%		17	0.83%		10	0.00%
500us > x >= 250us	15	0.63%		26	1.27%		9	0.00%
250us > x >=100us	67	2.81%		39	1.91%		24	0.00%
100us > x >= 50us	18	0.75%		26	1.27%		17	0.00%
50us > x >= 25us	15	0.63%		28	1.37%		2141	0.31%
25us > x >= 10us	31	1.30%		61	2.98%		108208	15.60%
10us > x > 5us		37	1.55%		195	9.53%		242809	35.01%
5us > x			88	3.69%		252	12.31%		339127	48.90%

tick_nohz_next_event()
total count		2317973			2481724			701069



With this (and hopefully without another pilot error), I see the following
'open points' where improvement or more thoughts might be good:

- Without cstates enabled, it is possible to change the cpuidle governors
  even though they have no impact on idle behavior, so at first glance it
  looks like the cpuidle governors are in use. Is this behavior intended?

- When there is no cpuidle driver, tick_nohz_idle_stop_tick() is called
  unconditionally - is there a possibility to make a cheap check whether
  the CPU is loaded?

- The governors teo and menu do the tick_nohz_next_event() check even if
  the CPU is fully loaded, but the check is not for free.

- timer bases are marked idle in tick_nohz_next_event() when the next
  expiry is more than a tick away. But when the tick cannot be stopped
  because the CPU is loaded and the timer base is already marked idle, a
  remote timer enqueue before clearing the timer base idle information
  will lead to an IPI, which is also expensive.

  It might be worth a try to do only a (maybe leaner) check for the next
  timer in tick_nohz_next_event() and do the actual idle dance in
  tick_nohz_stop_tick(). When a timer is enqueued remotely between the
  tick_nohz_next_event() and tick_nohz_stop_tick() calls, there is no need
  for an IPI - the CPU might simply be prevented from stopping the tick.
  This is also the case at the moment and is only solved by an IPI after
  the tick is already stopped.

  With regard to the timer migration hierarchy, it might also be possible
  to do a quick check in tick_nohz_next_event() and do the final
  tmigr_cpu_deactivate() when stopping the tick and marking the timer
  bases idle. So no lock ordering change would be required here...

- Side note: When testing 'ladder' on the SKL-X machine there was a strange
  pattern: All CPUs on the second socket stopped the tick quite often
  (~12000 times) and all of the idle durations were below 50us. All CPUs
  on the first socket stopped the tick at most ~100 times and most of the
  idle durations were longer than 4ms (HZ=250).


Thanks,

	Anna-Maria
  
Peter Zijlstra July 26, 2023, 4:49 p.m. UTC | #15
On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
> 
> > > > That means we don't track nearly enough data to reliably tell anything
> > > > about disabling the tick or not. We should have at least one bucket
> > > > beyond TICK_NSEC for this.
> > >
> > > Quite likely.
> > 
> > So the reasoning here was that those additional bins would not be
> > necessary for idle state selection, but the problem of whether or not
> > to stop the tick is kind of separate from the idle state selection
> > problem if the target residency values for all of the idle states are
> > relatively short.  And so it should be addressed separately which
> > currently it is not.  Admittedly, this is a mistake.
> 
> Right, the C state buckets are enough to pick a state, but not to handle
> the tick thing.
> 
> The below hack boots on my ivb-ep with extra (disabled) states. Now let
> me go hack up teo to make use of that.
> 
> name		residency
> 
> POLL		0
> C1              1
> C1E             80
> C3              156
> C6              300
> TICK            1000
> POST-TICK       2000
> 

builds and boots, further untested -- I need to see to dinner.

The idea is that teo_update() should account to the highest state if
measured_ns is 'large'.

Then, when the tick is on, see if the majority (50%) of the
hit+intercepts are below the TICK threshold; if so, don't stop the tick
and assume duration_ns = TICK_NSEC -- iow. don't call
tick_nohz_get_sleep_length().

I'll check if the below code actually does as intended once the loonies
are in bed.


---

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 987fc5f3997d..362470c8c273 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -197,7 +197,6 @@ struct teo_cpu {
 	int next_recent_idx;
 	int recent_idx[NR_RECENT];
 	unsigned long util_threshold;
-	bool utilized;
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -227,7 +226,7 @@ static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
 static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
-	int i, idx_timer = 0, idx_duration = 0;
+	int i, idx_timer = 0, idx_duration = drv->state_count-1;
 	u64 measured_ns;
 
 	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
@@ -362,11 +361,12 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	unsigned int recent_sum = 0;
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
+	unsigned int tick_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
 	bool alt_intercepts, alt_recent;
 	ktime_t delta_tick;
-	s64 duration_ns;
+	s64 duration_ns = TICK_NSEC;
 	int i;
 
 	if (dev->last_state_idx >= 0) {
@@ -376,36 +376,26 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 
 	cpu_data->time_span_ns = local_clock();
 
-	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
-	cpu_data->sleep_length_ns = duration_ns;
+	if (!tick_nohz_tick_stopped()) {
+		for (i = 1; i < drv->state_count; i++) {
+			struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
+			struct cpuidle_state *s = &drv->states[i];
 
-	/* Check if there is any choice in the first place. */
-	if (drv->state_count < 2) {
-		idx = 0;
-		goto end;
-	}
-	if (!dev->states_usage[0].disable) {
-		idx = 0;
-		if (drv->states[1].target_residency_ns > duration_ns)
-			goto end;
-	}
+			tick_sum += prev_bin->intercepts;
+			tick_sum += prev_bin->hits;
 
-	cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
-	/*
-	 * If the CPU is being utilized over the threshold and there are only 2
-	 * states to choose from, the metrics need not be considered, so choose
-	 * the shallowest non-polling state and exit.
-	 */
-	if (drv->state_count < 3 && cpu_data->utilized) {
-		for (i = 0; i < drv->state_count; ++i) {
-			if (!dev->states_usage[i].disable &&
-			    !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
-				idx = i;
-				goto end;
-			}
+			if (s->target_residency_ns >= TICK_NSEC)
+				break;
 		}
+
+		if (2*tick_sum > cpu_data->total)
+			*stop_tick = false;
 	}
 
+	if (*stop_tick)
+		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+	cpu_data->sleep_length_ns = duration_ns;
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -541,7 +531,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	 * If the CPU is being utilized over the threshold, choose a shallower
 	 * non-polling state to improve latency
 	 */
-	if (cpu_data->utilized)
+	if (teo_cpu_is_utilized(dev->cpu, cpu_data))
 		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
 
 end:
@@ -549,8 +539,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	 * Don't stop the tick if the selected state is a polling one or if the
 	 * expected idle duration is shorter than the tick period length.
 	 */
-	if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
-	    duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
+	if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) || !*stop_tick) {
 		*stop_tick = false;
 
 		/*
  
Rafael J. Wysocki July 26, 2023, 6:30 p.m. UTC | #16
On Wed, Jul 26, 2023 at 6:40 PM Anna-Maria Behnsen
<anna-maria@linutronix.de> wrote:
>
> Hi,
>
> On Wed, 26 Jul 2023, Peter Zijlstra wrote:
>
> > On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> >
> > > >                         100% load               50% load                25% load
> > > >                         (top: ~2% idle)         (top: ~49% idle)        (top: ~74% idle;
> > > >                                                                         33 CPUs are completely idle)
> > > >                         ---------------         ----------------        ----------------------------
> > > > Idle Total              1658703 100%            3150522 100%            2377035 100%
> > > > x >= 4ms                2504    0.15%           2       0.00%           53      0.00%
> > > > 4ms> x >= 2ms           390     0.02%           0       0.00%           4563    0.19%
> > > > 2ms > x >= 1ms          62      0.00%           1       0.00%           54      0.00%
> > > > 1ms > x >= 500us        67      0.00%           6       0.00%           2       0.00%
> > > > 500us > x >= 250us      93      0.01%           39      0.00%           11      0.00%
> > > > 250us > x >=100us       280     0.02%           1145    0.04%           633     0.03%
> > > > 100us > x >= 50us       942     0.06%           30722   0.98%           13347   0.56%
> > > > 50us > x >= 25us        26728   1.61%           310932  9.87%           106083  4.46%
> > > > 25us > x >= 10us        825920  49.79%          2320683 73.66%          1722505 72.46%
> > > > 10us > x > 5us          795197  47.94%          442991  14.06%          506008  21.29%
> > > > 5us > x                 6520    0.39%           43994   1.40%           23645   0.99%
> > > >
> > > >
> > > > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > > > 1,25% of a tick length).
> > >
> > > Well, this just means that the governor predicts overly long idle
> > > durations quite often under this workload.
> > >
> > > The governor's decision on whether or not to stop the tick is based on
> > > its idle duration prediction.  If it overshoots, that's how it goes.
> >
> > This is abysmal; IIRC TEO tracks a density function in C state buckets
> > and if it finds it's more likely to be shorter than 'predicted' by the
> > timer it should pick something shallower.
> >
> > Given we have this density function, picking something that's <1% likely
> > is insane. In fact, it seems to suggest the whole pick-alternative thing
> > is utterly broken.
> >
>
> When I tried to understand the cstates, I noticed cstates have been
> disabled on the zen3 machine I used for testing - I'm sorry, pilot error.
>
> So the numbers above are caused by calling tick_nohz_idle_stop_tick()
> unconditionally in cpuidle_idle_call() when cpuidle_not_available() is
> true.
>
> The regression Gautham observed was then caused by tons of
> tick_nohz_next_event() calls, which are more expensive with the current
> implementation of the timer migration hierarchy (if he tested with cstates
> enabled...).
>
> Nevertheless, I reran the tests on current upstream with cstates enabled on
> the zen3 machine and on a SKL-X with the teo, menu and ladder governors, and
> generated the following numbers (100% load):
>
> Zen3:
>                         teo                     menu                    ladder
>                         ------------------      ------------------      ------------------
> Idle Total              2533    100.00%         5123    100.00%         1333746 100.00%
> x >= 4ms                1458    57.56%          2764    53.95%          2304    0.17%
> 4ms> x >= 2ms           91      3.59%           95      1.85%           98      0.01%
> 2ms > x >= 1ms          56      2.21%           66      1.29%           57      0.00%
> 1ms > x >= 500us        64      2.53%           74      1.44%           61      0.00%
> 500us > x >= 250us      73      2.88%           39      0.76%           69      0.01%
> 250us > x >=100us       76      3.00%           88      1.72%           502     0.04%
> 100us > x >= 50us       33      1.30%           104     2.03%           3976    0.30%
> 50us > x >= 25us        39      1.54%           289     5.64%           64463   4.83%
> 25us > x >= 10us        199     7.86%           830     16.20%          1245946 93.42%
> 10us > x > 5us          156     6.16%           231     4.51%           9452    0.71%
> 5us > x                 288     11.37%          543     10.60%          6818    0.51%
>
> tick_nohz_next_event()
> total count             8839790                 2113357                 1363896
>
>
>
> SKL-X:
>                         teo                     menu                    ladder
>                         ------------------      ------------------      ------------------
> Idle Total              2388    100.00%         2047    100.00%         693514  100.00%
> x >= 4ms                2047    85.72%          1347    65.80%          1141    0.16%
> 4ms> x >= 2ms           29      1.21%           47      2.30%           18      0.00%
> 2ms > x >= 1ms          20      0.84%           9       0.44%           10      0.00%
> 1ms > x >= 500us        21      0.88%           17      0.83%           10      0.00%
> 500us > x >= 250us      15      0.63%           26      1.27%           9       0.00%
> 250us > x >=100us       67      2.81%           39      1.91%           24      0.00%
> 100us > x >= 50us       18      0.75%           26      1.27%           17      0.00%
> 50us > x >= 25us        15      0.63%           28      1.37%           2141    0.31%
> 25us > x >= 10us        31      1.30%           61      2.98%           108208  15.60%
> 10us > x > 5us          37      1.55%           195     9.53%           242809  35.01%
> 5us > x                 88      3.69%           252     12.31%          339127  48.90%
>
> tick_nohz_next_event()
> total count             2317973                 2481724                 701069

This looks better indeed.  What's the HZ value?

> With this (and hopefully without another pilot error), I see the following
> 'open points' where improvement or more thought might be worthwhile:
>
> - Without cstates enabled, it is possible to change the cpuidle governors
>   even though they have no impact on idle behavior, while at first
>   glance it looks like the cpuidle governors are in use. Is this behavior
>   intended?

Not really intended and it can be improved.  I guess it's this way,
because there were no complaints. :-)

> - When there is no cpuidle driver, tick_nohz_idle_stop_tick() is called
>   unconditionally - is there an easy way to check whether the CPU is
>   loaded?

PELT could be asked like in the teo governor.  Or it may be a better
idea to simply never stop the tick in that case (the default idle
state is going to be shallow anyway, so running the tick shouldn't
matter from the energy POV).

> - The governors teo and menu do the tick_nohz_next_event() check even if
>   the CPU is fully loaded, but the check is not for free.

Let me have a look at teo in that respect.

The problem is when tick_nohz_get_sleep_length() should not be called.
The easy case is when the governor would select the shallowest idle
state without taking it into account, but what about the deeper ones?
I guess this depends on the exit latency of the current candidate idle
state, but what exit latency would be low enough?  I guess 2 us would
be fine, but what about 10 us, or even 20 us for that matter?

> - timer bases are marked idle in tick_nohz_next_event() when the next
>   expiry is more than a tick away. But when the tick cannot be stopped
>   because the CPU is loaded and the timer base is already marked idle, a
>   remote timer enqueue before the timer base idle information is cleared
>   will lead to an IPI, which is also expensive.
>
>   It might be worth a try to do only a (maybe leaner) check for the next
>   timer in tick_nohz_next_event() and do the actual idle dance in
>   tick_nohz_stop_tick(). When a timer is enqueued remotely between the
>   tick_nohz_next_event() and tick_nohz_stop_tick() calls, there is no need
>   for an IPI - the CPU might simply be prevented from stopping the tick.
>   This can happen today as well and is only solved by an IPI after the
>   tick is already stopped.
>
>   With regard to the timer migration hierarchy, there might be the
>   possibility to do also a quick check in tick_nohz_next_event() and do the
>   final tmigr_cpu_deactivate() thing when stopping the tick and marking the
>   timer bases idle. So no lock ordering change would be required here...
>
> - Side note: When testing 'ladder' on the SKL-X machine there was a strange
>   pattern: All CPUs on the second socket stopped the tick quite often
>   (~12000 times) and all of the idle durations were below 50us. All CPUs on
>   the first socket stopped the tick at most ~100 times and most of the idle
>   durations were longer than 4ms (HZ=250).

I don't really have a quick explanation for that.  It's probably due
to the work distribution pattern.

Not that I think that ladder is interesting any more, TBH.
  
Peter Zijlstra July 26, 2023, 8:09 p.m. UTC | #17
On Wed, Jul 26, 2023 at 08:30:01PM +0200, Rafael J. Wysocki wrote:

> > - The governors teo and menu do the tick_nohz_next_event() check even if
> >   the CPU is fully loaded, but the check is not for free.
> 
> Let me have a look at teo in that respect.
> 
> The problem is when tick_nohz_get_sleep_length() should not be called.
> The easy case is when the governor would select the shallowest idle
> state without taking it into account, but what about the deeper ones?
> I guess this depends on the exit latency of the current candidate idle
> state, but what exit latency would be low enough?  I guess 2 us would
> be fine, but what about 10 us, or even 20 us for that matter?

The patch I send here:

  https://lkml.kernel.org/r/20230726164958.GV38236@hirez.programming.kicks-ass.net

(which was stuck in a mailqueue :/) tries to address that.

Additionally, I think we can do something like this on top of all that,
stop going deeper when 66% of wakeups is at or below the current state.


--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -362,6 +362,7 @@ static int teo_select(struct cpuidle_dri
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
 	unsigned int tick_sum = 0;
+	unsigned int thresh_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
 	bool alt_intercepts, alt_recent;
@@ -396,6 +397,8 @@ static int teo_select(struct cpuidle_dri
 		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
 	cpu_data->sleep_length_ns = duration_ns;
 
+	thresh_sum = 2 * cpu_data->total / 3; /* 66% */
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -426,6 +429,9 @@ static int teo_select(struct cpuidle_dri
 		if (s->target_residency_ns > duration_ns)
 			break;
 
+		if (intercept_sum + hit_sum > thresh_sum)
+			break;
+
 		idx = i;
 
 		if (s->exit_latency_ns <= latency_req)
  
Peter Zijlstra July 26, 2023, 9:26 p.m. UTC | #18
On Wed, Jul 26, 2023 at 06:49:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
> > 
> > > > > That means we don't track nearly enough data to reliably tell anything
> > > > > about disabling the tick or not. We should have at least one bucket
> > > > > beyond TICK_NSEC for this.
> > > >
> > > > Quite likely.
> > > 
> > > So the reasoning here was that those additional bins would not be
> > > necessary for idle state selection, but the problem of whether or not
> > > to stop the tick is kind of separate from the idle state selection
> > > problem if the target residency values for all of the idle states are
> > > relatively short.  And so it should be addressed separately which
> > > currently it is not.  Admittedly, this is a mistake.
> > 
> > Right, the C state buckets are enough to pick a state, but not to handle
> > the tick thing.
> > 
> > The below hack boots on my ivb-ep with extra (disabled) states. Now let
> > me go hack up teo to make use of that.
> > 
> > name		residency
> > 
> > POLL		0
> > C1              1
> > C1E             80
> > C3              156
> > C6              300
> > TICK            1000
> > POST-TICK       2000
> > 
> 
> builds and boots, further untested -- I need to see to dinner.
> 
> The idea is that teo_update() should account to the highest state if
> measured_ns is 'large'.
> 
> Then, when the tick is on, see if the majority (50%) of the
> hits+intercepts are below the TICK threshold; if so, don't stop the tick
> and assume duration_ns = TICK_NSEC -- iow. don't call
> tick_nohz_get_sleep_length().
> 
> I'll check if the below code actually does as intended once the loonies
> are in bed.

D'oh, it suffers the 'obvious' problem. Since the tick is not disabled,
we'll never log sleeps longer than the tick and hence never decide to
disable the tick.

Clever of me... This needs a wee bit of refining it does :-)
  
Peter Zijlstra July 27, 2023, 7:59 a.m. UTC | #19
On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
> 
> > > > That means we don't track nearly enough data to reliably tell anything
> > > > about disabling the tick or not. We should have at least one bucket
> > > > beyond TICK_NSEC for this.
> > >
> > > Quite likely.
> > 
> > So the reasoning here was that those additional bins would not be
> > necessary for idle state selection, but the problem of whether or not
> > to stop the tick is kind of separate from the idle state selection
> > problem if the target residency values for all of the idle states are
> > relatively short.  And so it should be addressed separately which
> > currently it is not.  Admittedly, this is a mistake.
> 
> Right, the C state buckets are enough to pick a state, but not to handle
> the tick thing.
> 
> The below hack boots on my ivb-ep with extra (disabled) states. Now let
> me go hack up teo to make use of that.
> 
> name		residency
> 
> POLL		0
> C1              1
> C1E             80
> C3              156
> C6              300
> TICK            1000
> POST-TICK       2000
> 

Ah, so last night (or rather, somewhat early today) I realized I had the
buckets wrong.

We don't have buckets to the left, but buckets to the right, so the
above would give:

0: [0,1)
1: [1,80)
2: [80,156)
3: [156,300)
4: [300,1000)
5: [1000,2000)
6: [2000,...)

Which also means I can ditch the whole POST-TICK bucket. Let me get
breakfast and try all this again.
  
Rafael J. Wysocki July 27, 2023, 8:10 p.m. UTC | #20
On Thu, Jul 27, 2023 at 9:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
> >
> > > > > That means we don't track nearly enough data to reliably tell anything
> > > > > about disabling the tick or not. We should have at least one bucket
> > > > > beyond TICK_NSEC for this.
> > > >
> > > > Quite likely.
> > >
> > > So the reasoning here was that those additional bins would not be
> > > necessary for idle state selection, but the problem of whether or not
> > > to stop the tick is kind of separate from the idle state selection
> > > problem if the target residency values for all of the idle states are
> > > relatively short.  And so it should be addressed separately which
> > > currently it is not.  Admittedly, this is a mistake.
> >
> > Right, the C state buckets are enough to pick a state, but not to handle
> > the tick thing.
> >
> > The below hack boots on my ivb-ep with extra (disabled) states. Now let
> > me go hack up teo to make use of that.
> >
> > name          residency
> >
> > POLL          0
> > C1              1
> > C1E             80
> > C3              156
> > C6              300
> > TICK            1000
> > POST-TICK       2000
> >
>
> Ah, so last night (or rather, somewhat early today) I realized I had the
> buckets wrong.
>
> We don't have buckets to the left, but buckets to the right, so the
> above would give:
>
> 0: [0,1)
> 1: [1,80)
> 2: [80,156)
> 3: [156,300)
> 4: [300,1000)
> 5: [1000,2000)
> 6: [2000,...)
>
> Which also means I can ditch the whole POST-TICK bucket. Let me get
> breakfast and try all this again.

Note that this is about the cases between the target residency of the
deepest enabled idle state and the tick period, and from Anna-Maria's
results it looks like that's about 0.1% of the total number.
  

Patch

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,6 +2307,7 @@  static inline bool owner_on_cpu(struct t
 
 /* Returns effective CPU energy utilization, as seen by the scheduler */
 unsigned long sched_cpu_util(int cpu);
+u64 sched_cpu_really_avg_idle(int cpu);
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_RSEQ
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3753,6 +3753,7 @@  ttwu_do_activate(struct rq *rq, struct t
 		u64 max = 2*rq->max_idle_balance_cost;
 
 		update_avg(&rq->avg_idle, delta);
+		update_avg(&rq->really_avg_idle, delta);
 
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
@@ -7455,6 +7456,12 @@  unsigned long sched_cpu_util(int cpu)
 {
 	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
+
+u64 sched_cpu_really_avg_idle(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	return rq->really_avg_idle;
+}
 #endif /* CONFIG_SMP */
 
 /**
@@ -9988,6 +9995,7 @@  void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->really_avg_idle = 2*sysctl_sched_migration_cost;
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1073,6 +1073,7 @@  struct rq {
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
+	u64			really_avg_idle;
 
 	unsigned long		wake_stamp;
 	u64			wake_avg_idle;
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -800,7 +800,7 @@  static inline bool local_timer_softirq_p
 
 static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	u64 basemono, next_tick, delta, expires;
+	u64 basemono, next_tick, delta, expires, sched_avg_idle;
 	unsigned long basejiff;
 	unsigned int seq;
 
@@ -823,8 +823,11 @@  static ktime_t tick_nohz_next_event(stru
 	 * minimal delta which brings us back to this place
 	 * immediately. Lather, rinse and repeat...
 	 */
-	if (rcu_needs_cpu() || arch_needs_cpu() ||
-	    irq_work_needs_cpu() || local_timer_softirq_pending()) {
+	sched_avg_idle = sched_cpu_really_avg_idle(cpu);
+	if (sched_avg_idle <= (u64)TICK_NSEC) {
+		next_tick = basemono + sched_avg_idle;
+	} else if (rcu_needs_cpu() || arch_needs_cpu() ||
+		   irq_work_needs_cpu() || local_timer_softirq_pending()) {
 		next_tick = basemono + TICK_NSEC;
 	} else {
 		/*