[5/5] timers: Tag (hr)timer softirq as hotplug safe

Message ID 20230912104406.312185-6-frederic@kernel.org
State New
Headers
Series tick/nohz: cleanups and fixes v2 |

Commit Message

Frederic Weisbecker Sept. 12, 2023, 10:44 a.m. UTC
  Specific stress involving frequent CPU-hotplug operations, such as
running rcutorture for example, may trigger the following message:

	"NOHZ tick-stop error: local softirq work is pending, handler #02!!!"

This happens in the CPU-down hotplug process, after
CPUHP_AP_SMPBOOT_THREADS whose teardown callback parks ksoftirqd, and
before the target CPU shuts down through CPUHP_AP_IDLE_DEAD. In this
fragile intermediate state, softirqs waiting for threaded handling may
be forever ignored and eventually reported by the idle task as in the
above example.

However some vectors are known to be safe as long as the corresponding
subsystems have teardown callbacks handling the migration of their
events. The above error message reports pending timers softirq although
this vector can be considered as hotplug safe because the
CPUHP_TIMERS_PREPARE teardown callback performs the necessary migration
of timers after the death of the CPU. Hrtimers also have a similar
hotplug handling.

Therefore this error message, as far as (hr-)timers are concerned, can
be considered spurious and the relevant softirq vectors can be marked as
hotplug safe.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/interrupt.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
  

Comments

Joel Fernandes Sept. 16, 2023, 1:38 a.m. UTC | #1
On Tue, Sep 12, 2023 at 6:44 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> Specific stress involving frequent CPU-hotplug operations, such as
> running rcutorture for example, may trigger the following message:
>
>         "NOHZ tick-stop error: local softirq work is pending, handler #02!!!"
>
> This happens in the CPU-down hotplug process, after
> CPUHP_AP_SMPBOOT_THREADS whose teardown callback parks ksoftirqd, and
> before the target CPU shuts down through CPUHP_AP_IDLE_DEAD. In this
> fragile intermediate state, softirqs waiting for threaded handling may
> be forever ignored and eventually reported by the idle task as in the
> above example.
>
> However some vectors are known to be safe as long as the corresponding
> subsystems have teardown callbacks handling the migration of their
> events. The above error message reports pending timers softirq although
> this vector can be considered as hotplug safe because the
> CPUHP_TIMERS_PREPARE teardown callback performs the necessary migration
> of timers after the death of the CPU. Hrtimers also have a similar
> hotplug handling.
>
> Therefore this error message, as far as (hr-)timers are concerned, can
> be considered spurious and the relevant softirq vectors can be marked as
> hotplug safe.

We could:
Cc: stable@vger.kernel.org

Since hell is breaking loose a bit because of:
https://lore.kernel.org/all/20230831133214.XF2yjiEb@linutronix.de/T/

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  include/linux/interrupt.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index a92bce40b04b..4a1dc88ddbff 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -569,8 +569,12 @@ enum
>   *     2) rcu_report_dead() reports the final quiescent states.
>   *
>   * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
> + *
> + * _ (HR)TIMER_SOFTIRQ: (hr)timers_dead_cpu() migrates the queue
>   */
> -#define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
> +#define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(TIMER_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ) |\
> +                                  BIT(HRTIMER_SOFTIRQ) | BIT(RCU_SOFTIRQ))
> +
>

Perhaps

>  /* map softirq index to softirq name. update 'softirq_to_name' in
>   * kernel/softirq.c when adding a new softirq.
> --
> 2.34.1
>
  
Thomas Gleixner Sept. 18, 2023, 5:04 p.m. UTC | #2
On Fri, Sep 15 2023 at 21:38, Joel Fernandes wrote:
>> Therefore this error message, as far as (hr-)timers are concerned, can
>> be considered spurious and the relevant softirq vectors can be marked as
>> hotplug safe.
>
> We could:

We should :)

> Cc: stable@vger.kernel.org
>
> Since hell is breaking loose a bit because of:
> https://lore.kernel.org/all/20230831133214.XF2yjiEb@linutronix.de/T/
>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

I'll pick that up tomorrow.

Thanks,

        tglx
  

Patch

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a92bce40b04b..4a1dc88ddbff 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -569,8 +569,12 @@  enum
  * 	2) rcu_report_dead() reports the final quiescent states.
  *
  * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
+ *
+ * _ (HR)TIMER_SOFTIRQ: (hr)timers_dead_cpu() migrates the queue
  */
-#define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
+#define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(TIMER_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ) |\
+				   BIT(HRTIMER_SOFTIRQ) | BIT(RCU_SOFTIRQ))
+
 
 /* map softirq index to softirq name. update 'softirq_to_name' in
  * kernel/softirq.c when adding a new softirq.