[v8,10/25] timers: Move marking timer bases idle into tick_nohz_stop_tick()

Message ID 20231004123454.15691-11-anna-maria@linutronix.de
State New
Series timer: Move from a push remote at enqueue to a pull at expiry model

Commit Message

Anna-Maria Behnsen Oct. 4, 2023, 12:34 p.m. UTC
  The timer base is marked idle when get_next_timer_interrupt() is
executed. But the decision whether the tick will be stopped and whether the
system is able to go idle is made later. When the timer base is marked
idle and a new first timer is enqueued remotely, an IPI is raised even
though it is not required: the tick is not stopped and the timer base is
evaluated again at the next tick anyway.

To prevent this, the timer base is marked idle in tick_nohz_stop_tick() and
get_next_timer_interrupt() is streamlined by only looking for the next
timer interrupt. All other work is postponed to timer_set_idle(), which is
called by tick_nohz_stop_tick().
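
In short, the resulting split of work looks like this (a simplified sketch of
the call flow, not the literal code):

	tick_nohz_next_event()
	    get_next_timer_interrupt()	/* only computes the next expiry */

	tick_nohz_stop_tick()
	    timer_set_idle()		/* re-evaluates the next expiry and
					   marks the timer base idle */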

While at it, some whitespace damage is fixed as well.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 kernel/time/tick-internal.h |  1 +
 kernel/time/tick-sched.c    | 38 ++++++++++++++++++++++++-----------
 kernel/time/timer.c         | 40 +++++++++++++++++++++++++++++++++----
 3 files changed, 63 insertions(+), 16 deletions(-)
  

Comments

Frederic Weisbecker Oct. 12, 2023, 3:52 p.m. UTC | #1
On Wed, Oct 04, 2023 at 02:34:39PM +0200, Anna-Maria Behnsen wrote:
>  static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  {
>  	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
> +	unsigned long basejiff = ts->last_jiffies;
>  	u64 basemono = ts->timer_expires_base;
> -	u64 expires = ts->timer_expires;
> +	bool timer_idle = ts->tick_stopped;
> +	u64 expires;
>  
>  	/* Make sure we won't be trying to stop it twice in a row. */
>  	ts->timer_expires_base = 0;
>  
> +	/*
> +	 * Now the tick should be stopped definitely - so timer base needs to be
> +	 * marked idle as well to not miss a newly queued timer.
> +	 */
> +	expires = timer_set_idle(basejiff, basemono, &timer_idle);
> +	if (!timer_idle) {
> +		/*
> +		 * Do not clear tick_stopped here when it was already set - it will
> +		 * be retained on next idle iteration when tick expired earlier
> +		 * than expected.
> +		 */
> +		expires = basemono + TICK_NSEC;
> +
> +		/* Undo the effect of timer_set_idle() */
> +		timer_clear_idle();

Looks like you don't even need to clear ->is_idle on failure. timer_set_idle()
does it for you.
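
The relevant assignment in timer_set_idle() (from the patch below) already
writes false in that case, under base->lock:

	base->is_idle = *idle = time_after(nextevt, basej + 1);

so the extra timer_clear_idle() here is redundant.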

> +	} else if (expires < ts->timer_expires) {
> +		ts->timer_expires = expires;
> +	} else {
> +		expires = ts->timer_expires;

Is it because timer_set_idle() doesn't recalculate the next hrtimer (as opposed
to get_next_timer_interrupt())? And since tick_nohz_next_event() did, the fact
that ts->timer_expires has a lower value may mean there is an hrtimer to take
into account, so you'd rather use the old calculation?

If so, please add a comment explaining that, because it's not that obvious.
It's also worth noting the side effect that the nearest timer may have been
cancelled in between and we might reprogram too early, but the event should be
rare enough that we don't care.

Another reason is that cpuidle may have programmed a shallow C-state because it
saw an early next expiration estimate. If the related timer is cancelled in
between and we didn't keep the old estimate, we would stop the tick for a long
time with a shallow C-state.
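
A possible wording for that comment, as a sketch only:

	} else {
		/*
		 * Keep the estimate from tick_nohz_next_event():
		 * timer_set_idle() does not recalculate the next hrtimer, so a
		 * smaller ts->timer_expires may point to an hrtimer which has
		 * to be taken into account. Also cpuidle may have chosen a
		 * shallow C-state based on that early estimate; if the related
		 * timer was cancelled in between, keeping the old value avoids
		 * stopping the tick for a long time with a shallow C-state.
		 * We may reprogram too early then, but that is rare enough
		 * not to matter.
		 */
		expires = ts->timer_expires;
	}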

> @@ -926,7 +944,7 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  	 * first call we save the current tick time, so we can restart
>  	 * the scheduler tick in nohz_restart_sched_tick.
>  	 */
> -	if (!ts->tick_stopped) {
> +	if (!ts->tick_stopped && timer_idle) {

In fact, if (!ts->tick_stopped && !timer_idle) then you
should return now and avoid the reprogramming.
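
I.e. something along these lines (just a sketch):

	/* Tick not stopped yet and the base cannot go idle: keep the tick. */
	if (!ts->tick_stopped && !timer_idle)
		return;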

> @@ -1950,6 +1950,40 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
>  	if (cpu_is_offline(smp_processor_id()))
>  		return expires;
>  
> +	raw_spin_lock(&base->lock);
> +	nextevt = __get_next_timer_interrupt(basej, base);
> +	raw_spin_unlock(&base->lock);

It's unfortunate we have to lock here, which means we lock twice
on the idle path. But I can't think of a better way and I guess
the follow-up patches rely on that.

Thanks.
  
Anna-Maria Behnsen Oct. 19, 2023, 1:37 p.m. UTC | #2
Frederic Weisbecker <frederic@kernel.org> writes:

> On Wed, Oct 04, 2023 at 02:34:39PM +0200, Anna-Maria Behnsen wrote:
>>  static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>>  {
>>  	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
>> +	unsigned long basejiff = ts->last_jiffies;
>>  	u64 basemono = ts->timer_expires_base;
>> -	u64 expires = ts->timer_expires;
>> +	bool timer_idle = ts->tick_stopped;
>> +	u64 expires;
>>  
>>  	/* Make sure we won't be trying to stop it twice in a row. */
>>  	ts->timer_expires_base = 0;
>>  
>> +	/*
>> +	 * Now the tick should be stopped definitely - so timer base needs to be
>> +	 * marked idle as well to not miss a newly queued timer.
>> +	 */
>> +	expires = timer_set_idle(basejiff, basemono, &timer_idle);
>> +	if (!timer_idle) {
>> +		/*
>> +		 * Do not clear tick_stopped here when it was already set - it will
>> +		 * be retained on next idle iteration when tick expired earlier
>> +		 * than expected.
>> +		 */
>> +		expires = basemono + TICK_NSEC;
>> +
>> +		/* Undo the effect of timer_set_idle() */
>> +		timer_clear_idle();
>
> Looks like you don't even need to clear ->is_idle on failure. timer_set_idle()
> does it for you.

You are right. I tried several approaches and then forgot to remove it
here.

>> +	} else if (expires < ts->timer_expires) {
>> +		ts->timer_expires = expires;
>> +	} else {
>> +		expires = ts->timer_expires;
>
> Is it because timer_set_idle() doesn't recalculate the next hrtimer (as opposed
> to get_next_timer_interrupt())? And since tick_nohz_next_event() did, the fact
> that ts->timer_expires has a lower value may mean there is an hrtimer to take
> into account, so you'd rather use the old calculation?

Yes, and because power things rely on it.

> If so, please add a comment explaining that, because it's not that obvious.
> It's also worth noting the side effect that the nearest timer may have been
> cancelled in between and we might reprogram too early, but the event should be
> rare enough that we don't care.
>
> Another reason is that cpuidle may have programmed a shallow C-state because it
> saw an early next expiration estimate. If the related timer is cancelled in
> between and we didn't keep the old estimate, we would stop the tick for a long
> time with a shallow C-state.

I'll add a comment covering all your input! Thanks!
The probability that a lot of timers are enqueued and dequeued between
get_next_timer_interrupt() and marking the timer base idle is not very high.
But we have to make sure that we do not miss a new first timer there.
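
For illustration, the window that has to be covered (a simplified sketch):

	CPU0 (going idle)			CPU1
	get_next_timer_interrupt()
	  -> computes the next expiry
						enqueues a new first timer on
						CPU0; base->is_idle is not yet
						set, so no IPI is sent
	timer_set_idle()
	  -> re-reads the next expiry under
	     base->lock, so the new first
	     timer is seen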

>> @@ -926,7 +944,7 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>>  	 * first call we save the current tick time, so we can restart
>>  	 * the scheduler tick in nohz_restart_sched_tick.
>>  	 */
>> -	if (!ts->tick_stopped) {
>> +	if (!ts->tick_stopped && timer_idle) {
>
> In fact, if (!ts->tick_stopped && !timer_idle) then you
> should return now and avoid the reprogramming.

You are right. I'll add it and test it.

>> @@ -1950,6 +1950,40 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
>>  	if (cpu_is_offline(smp_processor_id()))
>>  		return expires;
>>  
>> +	raw_spin_lock(&base->lock);
>> +	nextevt = __get_next_timer_interrupt(basej, base);
>> +	raw_spin_unlock(&base->lock);
>
> It's unfortunate we have to lock here, which means we lock twice
> on the idle path. But I can't think of a better way and I guess
> the follow-up patches rely on that.

We have to do it like this, because power people need the sleep length
information to be able to decide whether to stop the tick or not. If we do not
want to have the timer base locked twice in the idle path, we will not be able
to move the timer base idle marking into tick_nohz_stop_tick(). But the good
thing is that with this approach we do not mark timer bases idle when the tick
is not stopped.
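
Roughly, the idle path then looks like this (simplified):

	tick_nohz_next_event()
	  get_next_timer_interrupt()	/* lock #1: sleep length for cpuidle */
	<cpuidle decides whether the tick should be stopped>
	tick_nohz_stop_tick()
	  timer_set_idle()		/* lock #2: re-check and mark base idle */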

Btw, I'm trying to rewrite this patch completely as tglx was not happy about
some of the code duplication. I'll make sure that your remarks are also
covered.

Thanks,

	Anna-Maria
  

Patch

diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 649f2b48e8f0..b035606a6f5e 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -164,6 +164,7 @@ static inline void timers_update_nohz(void) { }
 DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
 
 extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
+u64 timer_set_idle(unsigned long basej, u64 basem, bool *idle);
 void timer_clear_idle(void);
 
 #define CLOCK_SET_WALL							\
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b3cf535881a4..7e1fdbc6d5f0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -846,11 +846,6 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 
 	delta = next_tick - basemono;
 	if (delta <= (u64)TICK_NSEC) {
-		/*
-		 * Tell the timer code that the base is not idle, i.e. undo
-		 * the effect of get_next_timer_interrupt():
-		 */
-		timer_clear_idle();
 		/*
 		 * We've not stopped the tick yet, and there's a timer in the
 		 * next period, so no point in stopping it either, bail.
@@ -886,12 +881,35 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 {
 	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+	unsigned long basejiff = ts->last_jiffies;
 	u64 basemono = ts->timer_expires_base;
-	u64 expires = ts->timer_expires;
+	bool timer_idle = ts->tick_stopped;
+	u64 expires;
 
 	/* Make sure we won't be trying to stop it twice in a row. */
 	ts->timer_expires_base = 0;
 
+	/*
+	 * Now the tick should be stopped definitely - so timer base needs to be
+	 * marked idle as well to not miss a newly queued timer.
+	 */
+	expires = timer_set_idle(basejiff, basemono, &timer_idle);
+	if (!timer_idle) {
+		/*
+		 * Do not clear tick_stopped here when it was already set - it will
+		 * be retained on next idle iteration when tick expired earlier
+		 * than expected.
+		 */
+		expires = basemono + TICK_NSEC;
+
+		/* Undo the effect of timer_set_idle() */
+		timer_clear_idle();
+	} else if (expires < ts->timer_expires) {
+		ts->timer_expires = expires;
+	} else {
+		expires = ts->timer_expires;
+	}
+
 	/*
 	 * If this CPU is the one which updates jiffies, then give up
 	 * the assignment and let it be taken by the CPU which runs
@@ -926,7 +944,7 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 	 * first call we save the current tick time, so we can restart
 	 * the scheduler tick in nohz_restart_sched_tick.
 	 */
-	if (!ts->tick_stopped) {
+	if (!ts->tick_stopped && timer_idle) {
 		calc_load_nohz_start();
 		quiet_vmstat();
 
@@ -989,7 +1007,7 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	/*
 	 * Cancel the scheduled timer and restore the tick
 	 */
-	ts->tick_stopped  = 0;
+	ts->tick_stopped = 0;
 	tick_nohz_restart(ts, now);
 }
 
@@ -1145,10 +1163,6 @@ void tick_nohz_idle_stop_tick(void)
 void tick_nohz_idle_retain_tick(void)
 {
 	tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
-	/*
-	 * Undo the effect of get_next_timer_interrupt() called from
-	 * tick_nohz_next_event().
-	 */
 	timer_clear_idle();
 }
 
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index f443aa807fbc..8518f7aa7319 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1950,6 +1950,40 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 	if (cpu_is_offline(smp_processor_id()))
 		return expires;
 
+	raw_spin_lock(&base->lock);
+	nextevt = __get_next_timer_interrupt(basej, base);
+	raw_spin_unlock(&base->lock);
+
+	expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
+
+	return cmp_next_hrtimer_event(basem, expires);
+}
+
+/**
+ * timer_set_idle - Set the idle state of the timer bases (if possible)
+ * @basej:	base time jiffies
+ * @basem:	base time clock monotonic
+ * @idle:	pointer to store the value of timer_base->in_idle
+ *
+ * Returns the next timer expiry.
+ *
+ * hrtimers are not taken into account once more, as they already have been
+ * taken into account when asking for the next timer expiry.
+ */
+u64 timer_set_idle(unsigned long basej, u64 basem, bool *idle)
+{
+	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	unsigned long nextevt;
+
+	/*
+	 * Pretend that there is no timer pending if the cpu is offline.
+	 * Possible pending timers will be migrated later to an active cpu.
+	 */
+	if (cpu_is_offline(smp_processor_id())) {
+		*idle = true;
+		return KTIME_MAX;
+	}
+
 	raw_spin_lock(&base->lock);
 	nextevt = __get_next_timer_interrupt(basej, base);
 
@@ -1966,13 +2000,11 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 	 * maintained for the BASE_STD base, deferrable timers may still
 	 * see large granularity skew (by design).
 	 */
-	base->is_idle = time_after(nextevt, basej + 1);
+	base->is_idle = *idle = time_after(nextevt, basej + 1);
 
 	raw_spin_unlock(&base->lock);
 
-	expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
-
-	return cmp_next_hrtimer_event(basem, expires);
+	return basem + (u64)(nextevt - basej) * TICK_NSEC;
 }
 
 /**