[v9,22/32] timers: Keep the pinned timers separate from the others

Message ID 20231201092654.34614-23-anna-maria@linutronix.de
State New
Series timers: Move from a push remote at enqueue to a pull at expiry model

Commit Message

Anna-Maria Behnsen Dec. 1, 2023, 9:26 a.m. UTC
Separate the storage space for pinned timers. Deferrable timers (no
matter whether they are pinned or not) are still enqueued into their own
base.

This is preparatory work for changing the NOHZ timer placement from a push
at enqueue time to a pull at expiry time model.
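
The resulting per-CPU timer base layout is roughly the following (a
sketch of the hunks below; with !CONFIG_NO_HZ_COMMON all three indices
map to the same single base):

	BASE_LOCAL	pinned timers (TIMER_PINNED set)
	BASE_GLOBAL	non pinned timers
	BASE_DEF	deferrable timers, pinned or not

and the base selection in get_timer_cpu_base()/get_timer_this_cpu_base()
boils down to:

	int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;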

Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
v9:
  - Update was required (changes in preceding patches)

v6:
  - Drop setting the TIMER_PINNED flag in add_timer_on() and drop the
    related warning. The add_timer_on() fix is split out into a separate
    patch. Therefore also drop the "Reviewed-by" of Frederic Weisbecker

v5:
  - Add WARN_ONCE() in add_timer_on()
  - Decrease patch size by splitting into three patches (this patch and the
    two before)

v4:
  - Split out the logic to forward the base clock into a helper function
    forward_base_clk() (Frederic)
  - Simplify the code in run_local_timers() and timer_clear_idle() (Frederic)
---
 kernel/time/timer.c | 95 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 68 insertions(+), 27 deletions(-)
  

Comments

Sebastian Andrzej Siewior Dec. 5, 2023, 9:11 p.m. UTC | #1
On 2023-12-01 10:26:44 [+0100], Anna-Maria Behnsen wrote:
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1985,10 +1998,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
>  		return expires;
>  	}
>  
> -	raw_spin_lock(&base->lock);
> -	nextevt = next_timer_interrupt(base, basej);
> +	base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
> +	base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
> +
> +	raw_spin_lock(&base_local->lock);
> +	raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
> +
> +	nextevt_local = next_timer_interrupt(base_local, basej);
> +	nextevt_global = next_timer_interrupt(base_global, basej);
>  
> -	if (base->timers_pending) {
> +	/*
> +	 * Check whether the local event is expiring before or at the same
> +	 * time as the global event.
> +	 *
> +	 * Note, that nextevt_global and nextevt_local might be based on
> +	 * different base->clk values. So it's not guaranteed that
> +	 * comparing with empty bases results in a correct local_first.

This ends like an unsolved mystery case. Could you add why one should
not worry about an incorrect local_first?

But seriously, how far apart can they get and what difference does it
make? At timer enqueue time clk equals jiffies. At this point one base's
clk could be at jiffies and the other's might be a few jiffies before
that.
The next event (as in next_expiry) should be valid for both,
comparison-wise. Both must be larger than jiffies. The delta between
jiffies and the next event has to be less than NEXT_TIMER_MAX_DELTA for
each base.
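
For example (made up numbers, not taken from the patch):

	jiffies = 1000
	base_local:  clk =  998, next_expiry = 1005
	base_global: clk = 1000, next_expiry = 1100

Both next_expiry values are past jiffies and within NEXT_TIMER_MAX_DELTA
of their base, so time_before_eq(1005, 1100) picks the local event as
expected, no matter that the two clk values differ by a couple of
jiffies.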

> +	 */
> +	if (base_local->timers_pending && base_global->timers_pending)
> +		local_first = time_before_eq(nextevt_local, nextevt_global);
> +	else
> +		local_first = base_local->timers_pending;
> +
> +	nextevt = local_first ? nextevt_local : nextevt_global;
> +
> +	if (base_local->timers_pending || base_global->timers_pending) {
>  		/* If we missed a tick already, force 0 delta */
>  		if (time_before(nextevt, basej))
>  			nextevt = basej;

So if nextevt_local missed a tick and nextevt_global is
NEXT_TIMER_MAX_DELTA-1 (so we get the largest difference possible
between those two) then the time_before_eq() should still come out
right. We could still miss more than one tick.
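
Even in that worst case (again made up numbers, assuming a 32bit long):

	nextevt_local  = basej - 1			/* tick already missed */
	nextevt_global = basej + NEXT_TIMER_MAX_DELTA - 1

time_before_eq(a, b) is ((long)(b - a) >= 0) and the difference here is
NEXT_TIMER_MAX_DELTA, roughly 2^30, which is still positive as a signed
32bit value, so local_first comes out true as it should.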

This looks good. I just don't understand the (above) comment.

Sebastian
  
Anna-Maria Behnsen Dec. 6, 2023, 10:23 a.m. UTC | #2
Sebastian Siewior <bigeasy@linutronix.de> writes:

> On 2023-12-01 10:26:44 [+0100], Anna-Maria Behnsen wrote:
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -1985,10 +1998,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
>>  		return expires;
>>  	}
>>  
>> -	raw_spin_lock(&base->lock);
>> -	nextevt = next_timer_interrupt(base, basej);
>> +	base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
>> +	base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
>> +
>> +	raw_spin_lock(&base_local->lock);
>> +	raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
>> +
>> +	nextevt_local = next_timer_interrupt(base_local, basej);
>> +	nextevt_global = next_timer_interrupt(base_global, basej);
>>  
>> -	if (base->timers_pending) {
>> +	/*
>> +	 * Check whether the local event is expiring before or at the same
>> +	 * time as the global event.
>> +	 *
>> +	 * Note, that nextevt_global and nextevt_local might be based on
>> +	 * different base->clk values. So it's not guaranteed that
>> +	 * comparing with empty bases results in a correct local_first.
>
> This ends like an unsolved mystery case. Could you add why one should
> not worry about an incorrect local_first?
>
> But seriously, how far apart can they get and what difference does it
> make? At timer enqueue time clk equals jiffies. At this point one base's
> clk could be at jiffies and the other's might be a few jiffies before
> that.
> The next event (as in next_expiry) should be valid for both,
> comparison-wise. Both must be larger than jiffies. The delta between
> jiffies and the next event has to be less than NEXT_TIMER_MAX_DELTA for
> each base.
>
>> +	 */
>> +	if (base_local->timers_pending && base_global->timers_pending)
>> +		local_first = time_before_eq(nextevt_local, nextevt_global);
>> +	else
>> +		local_first = base_local->timers_pending;
>> +
>> +	nextevt = local_first ? nextevt_local : nextevt_global;
>> +
>> +	if (base_local->timers_pending || base_global->timers_pending) {
>>  		/* If we missed a tick already, force 0 delta */
>>  		if (time_before(nextevt, basej))
>>  			nextevt = basej;
>
> So if nextevt_local missed a tick and nextevt_global is
> NEXT_TIMER_MAX_DELTA-1 (so we get the largest difference possible
> between those two) then the time_before_eq() should still come out
> right. We could still miss more than one tick.
>

This problem was only there when comparing _empty_ bases
(!timer_base::timers_pending) because of the different base clocks and
the stale next_expiry.

But I didn't update the check and the comment after introducing the
forwarding of next_expiry for !timer_base::timers_pending in
next_timer_interrupt(). So now it is sufficient to replace the
local_first detection by simply doing:

	local_first = time_before_eq(nextevt_local, nextevt_global);

Will fix it and will also add a comment to next_timer_interrupt() where
the next_expiry is updated when !timer_base::timers_pending.
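
Roughly (just a sketch, not the final version):

	/*
	 * When base->timers_pending is false, next_timer_interrupt()
	 * forwards next_expiry to basej + NEXT_TIMER_MAX_DELTA, so the
	 * comparison is valid for empty bases as well.
	 */
	local_first = time_before_eq(nextevt_local, nextevt_global);
	nextevt = local_first ? nextevt_local : nextevt_global;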

Thanks,

	Anna-Maria
  

Patch

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index eda4972ca862..366ea26ce3ba 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -187,12 +187,18 @@  EXPORT_SYMBOL(jiffies_64);
 #define WHEEL_SIZE	(LVL_SIZE * LVL_DEPTH)
 
 #ifdef CONFIG_NO_HZ_COMMON
-# define NR_BASES	2
-# define BASE_STD	0
-# define BASE_DEF	1
+/*
+ * If multiple bases need to be locked, use the base ordering for lock
+ * nesting, i.e. lowest number first.
+ */
+# define NR_BASES	3
+# define BASE_LOCAL	0
+# define BASE_GLOBAL	1
+# define BASE_DEF	2
 #else
 # define NR_BASES	1
-# define BASE_STD	0
+# define BASE_LOCAL	0
+# define BASE_GLOBAL	0
 # define BASE_DEF	0
 #endif
 
@@ -899,7 +905,10 @@  static int detach_if_pending(struct timer_list *timer, struct timer_base *base,
 
 static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
 {
-	struct timer_base *base = per_cpu_ptr(&timer_bases[BASE_STD], cpu);
+	int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
+	struct timer_base *base;
+
+	base = per_cpu_ptr(&timer_bases[index], cpu);
 
 	/*
 	 * If the timer is deferrable and NO_HZ_COMMON is set then we need
@@ -912,7 +921,10 @@  static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
 
 static inline struct timer_base *get_timer_this_cpu_base(u32 tflags)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
+	struct timer_base *base;
+
+	base = this_cpu_ptr(&timer_bases[index]);
 
 	/*
 	 * If the timer is deferrable and NO_HZ_COMMON is set then we need
@@ -1971,9 +1983,10 @@  static unsigned long next_timer_interrupt(struct timer_base *base,
 static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
 					     bool *idle)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	unsigned long nextevt, nextevt_local, nextevt_global;
+	struct timer_base *base_local, *base_global;
 	u64 expires = KTIME_MAX;
-	unsigned long nextevt;
+	bool local_first;
 
 	/*
 	 * Pretend that there is no timer pending if the cpu is offline.
@@ -1985,10 +1998,31 @@  static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
 		return expires;
 	}
 
-	raw_spin_lock(&base->lock);
-	nextevt = next_timer_interrupt(base, basej);
+	base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
+	base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
+
+	raw_spin_lock(&base_local->lock);
+	raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
+
+	nextevt_local = next_timer_interrupt(base_local, basej);
+	nextevt_global = next_timer_interrupt(base_global, basej);
 
-	if (base->timers_pending) {
+	/*
+	 * Check whether the local event is expiring before or at the same
+	 * time as the global event.
+	 *
+	 * Note, that nextevt_global and nextevt_local might be based on
+	 * different base->clk values. So it's not guaranteed that
+	 * comparing with empty bases results in a correct local_first.
+	 */
+	if (base_local->timers_pending && base_global->timers_pending)
+		local_first = time_before_eq(nextevt_local, nextevt_global);
+	else
+		local_first = base_local->timers_pending;
+
+	nextevt = local_first ? nextevt_local : nextevt_global;
+
+	if (base_local->timers_pending || base_global->timers_pending) {
 		/* If we missed a tick already, force 0 delta */
 		if (time_before(nextevt, basej))
 			nextevt = basej;
@@ -1999,28 +2033,33 @@  static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
 	 * We have a fresh next event. Check whether we can forward the
 	 * base.
 	 */
-	__forward_timer_base(base, basej);
+	__forward_timer_base(base_local, basej);
+	__forward_timer_base(base_global, basej);
 
 	/*
 	 * Set base->is_idle only when caller is timer_base_try_to_set_idle()
 	 */
 	if (idle) {
 		/*
-		 * Base is idle if the next event is more than a tick away.
+		 * Bases are idle if the next event is more than a tick away.
 		 *
 		 * If the base is marked idle then any timer add operation must
 		 * forward the base clk itself to keep granularity small. This
-		 * idle logic is only maintained for the BASE_STD base,
-		 * deferrable timers may still see large granularity skew (by
-		 * design).
+		 * idle logic is only maintained for the BASE_LOCAL and
+		 * BASE_GLOBAL base, deferrable timers may still see large
+		 * granularity skew (by design).
 		 */
-		if (!base->is_idle)
-			base->is_idle = time_after(nextevt, basej + 1);
-		*idle = base->is_idle;
-		trace_timer_base_idle(base->is_idle, base->cpu);
+		if (!base_local->is_idle) {
+			bool is_idle = time_after(nextevt, basej + 1);
+
+			base_local->is_idle = base_global->is_idle = is_idle;
+		}
+		*idle = base_local->is_idle;
+		trace_timer_base_idle(base_local->is_idle, base_local->cpu);
 	}
 
-	raw_spin_unlock(&base->lock);
+	raw_spin_unlock(&base_global->lock);
+	raw_spin_unlock(&base_local->lock);
 
 	return cmp_next_hrtimer_event(basem, expires);
 }
@@ -2064,15 +2103,15 @@  u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
  */
 void timer_clear_idle(void)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
-
 	/*
 	 * We do this unlocked. The worst outcome is a remote enqueue sending
 	 * a pointless IPI, but taking the lock would just make the window for
 	 * sending the IPI a few instructions smaller for the cost of taking
 	 * the lock in the exit from idle path.
 	 */
-	base->is_idle = false;
+	__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
+	__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
+
 	trace_timer_base_idle(0, smp_processor_id());
 }
 #endif
@@ -2123,11 +2162,13 @@  static inline void __run_timers(struct timer_base *base)
  */
 static __latent_entropy void run_timer_softirq(struct softirq_action *h)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
 
 	__run_timers(base);
-	if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
+	if (IS_ENABLED(CONFIG_NO_HZ_COMMON)) {
+		__run_timers(this_cpu_ptr(&timer_bases[BASE_GLOBAL]));
 		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
+	}
 }
 
 /*
@@ -2135,7 +2176,7 @@  static __latent_entropy void run_timer_softirq(struct softirq_action *h)
  */
 static void run_local_timers(void)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
 
 	hrtimer_run_queues();