watchdog/hardlockup: Avoid large stack frames in watchdog_hardlockup_check()

Message ID 20230731091754.1.I501ab68cb926ee33a7c87e063d207abf09b9943c@changeid
State New
Series watchdog/hardlockup: Avoid large stack frames in watchdog_hardlockup_check()

Commit Message

Doug Anderson July 31, 2023, 4:17 p.m. UTC
  After commit 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to
watchdog_hardlockup_check()") we started storing a `struct cpumask` on
the stack in watchdog_hardlockup_check(). On systems with
CONFIG_NR_CPUS set to 8192 this takes up 1K on the stack. That
triggers warnings with `CONFIG_FRAME_WARN` set to 1024.

Instead of putting this `struct cpumask` on the stack, let's declare
it as `static`. This has the downside of taking up 1K of memory all
the time on systems with `CONFIG_NR_CPUS` set to 8192, but on systems with
smaller `CONFIG_NR_CPUS` it's not much memory (with 128 CPUs it's only
16 bytes of memory). Presumably anyone building a system with
`CONFIG_NR_CPUS=8192` can afford the extra 1K of memory.

NOTE: as part of this change, we no longer check the return value of
trigger_single_cpu_backtrace(). While we could do this and only call
cpumask_clear_cpu() if trigger_single_cpu_backtrace() didn't fail,
that's probably not worth it. There's no reason to believe that
trigger_cpumask_backtrace() will succeed at backtracing the CPU when
trigger_single_cpu_backtrace() failed.

Alternatives considered:
- Use kmalloc with GFP_ATOMIC to allocate. I decided against this
  since relying on kmalloc when the system is hard locked up seems
  like a bad idea.
- Change the arch_trigger_cpumask_backtrace() across all architectures
  to take an extra parameter to get the needed behavior. This seems
  like a lot of churn for a small savings.

Fixes: 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 kernel/watchdog.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)
  

Comments

Michal Hocko Aug. 1, 2023, 12:58 p.m. UTC | #1
On Mon 31-07-23 09:17:59, Douglas Anderson wrote:
> After commit 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to
> watchdog_hardlockup_check()") we started storing a `struct cpumask` on
> the stack in watchdog_hardlockup_check(). On systems with
> CONFIG_NR_CPUS set to 8192 this takes up 1K on the stack. That
> triggers warnings with `CONFIG_FRAME_WARN` set to 1024.
> 
> Instead of putting this `struct cpumask` on the stack, let's declare
> it as `static`. This has the downside of taking up 1K of memory all
> the time on systems with `CONFIG_NR_CPUS` set to 8192, but on systems with
> smaller `CONFIG_NR_CPUS` it's not much memory (with 128 CPUs it's only
> 16 bytes of memory). Presumably anyone building a system with
> `CONFIG_NR_CPUS=8192` can afford the extra 1K of memory.
> 
> NOTE: as part of this change, we no longer check the return value of
> trigger_single_cpu_backtrace(). While we could do this and only call
> cpumask_clear_cpu() if trigger_single_cpu_backtrace() didn't fail,
> that's probably not worth it. There's no reason to believe that
> trigger_cpumask_backtrace() will succeed at backtracing the CPU when
> trigger_single_cpu_backtrace() failed.
> 
> Alternatives considered:
> - Use kmalloc with GFP_ATOMIC to allocate. I decided against this
>   since relying on kmalloc when the system is hard locked up seems
>   like a bad idea.
> - Change the arch_trigger_cpumask_backtrace() across all architectures
>   to take an extra parameter to get the needed behavior. This seems
>   like a lot of churn for a small savings.
> 
> Fixes: 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()")
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
>  kernel/watchdog.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index be38276a365f..19db2357969a 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -151,9 +151,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>  	 */
>  	if (is_hardlockup(cpu)) {
>  		unsigned int this_cpu = smp_processor_id();
> -		struct cpumask backtrace_mask;
> -
> -		cpumask_copy(&backtrace_mask, cpu_online_mask);
>  
>  		/* Only print hardlockups once. */
>  		if (per_cpu(watchdog_hardlockup_warned, cpu))
> @@ -167,10 +164,8 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>  				show_regs(regs);
>  			else
>  				dump_stack();
> -			cpumask_clear_cpu(cpu, &backtrace_mask);
>  		} else {
> -			if (trigger_single_cpu_backtrace(cpu))
> -				cpumask_clear_cpu(cpu, &backtrace_mask);
> +			trigger_single_cpu_backtrace(cpu);
>  		}
>  
>  		/*
> @@ -178,8 +173,13 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>  		 * hardlockups generating interleaving traces
>  		 */
>  		if (sysctl_hardlockup_all_cpu_backtrace &&
> -		    !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
> +		    !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped)) {
> +			static struct cpumask backtrace_mask;
> +
> +			cpumask_copy(&backtrace_mask, cpu_online_mask);
> +			cpumask_clear_cpu(cpu, &backtrace_mask);
>  			trigger_cpumask_backtrace(&backtrace_mask);

This looks rather wasteful to just copy the cpumask over to
backtrace_mask in nmi_trigger_cpumask_backtrace (which all but sparc
arches do AFAICS).

Would it be possible to use arch_trigger_cpumask_backtrace(cpu_online_mask, false)
and special case cpu != this_cpu && sysctl_hardlockup_all_cpu_backtrace?

> +		}
>  
>  		if (hardlockup_panic)
>  			nmi_panic(regs, "Hard LOCKUP");
> -- 
> 2.41.0.487.g6d72f3e995-goog
>
  
Doug Anderson Aug. 1, 2023, 2:16 p.m. UTC | #2
Hi,

On Tue, Aug 1, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 31-07-23 09:17:59, Douglas Anderson wrote:
> > After commit 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to
> > watchdog_hardlockup_check()") we started storing a `struct cpumask` on
> > the stack in watchdog_hardlockup_check(). On systems with
> > CONFIG_NR_CPUS set to 8192 this takes up 1K on the stack. That
> > triggers warnings with `CONFIG_FRAME_WARN` set to 1024.
> >
> > Instead of putting this `struct cpumask` on the stack, let's declare
> > it as `static`. This has the downside of taking up 1K of memory all
> > the time on systems with `CONFIG_NR_CPUS` set to 8192, but on systems with
> > smaller `CONFIG_NR_CPUS` it's not much memory (with 128 CPUs it's only
> > 16 bytes of memory). Presumably anyone building a system with
> > `CONFIG_NR_CPUS=8192` can afford the extra 1K of memory.
> >
> > NOTE: as part of this change, we no longer check the return value of
> > trigger_single_cpu_backtrace(). While we could do this and only call
> > cpumask_clear_cpu() if trigger_single_cpu_backtrace() didn't fail,
> > that's probably not worth it. There's no reason to believe that
> > trigger_cpumask_backtrace() will succeed at backtracing the CPU when
> > trigger_single_cpu_backtrace() failed.
> >
> > Alternatives considered:
> > - Use kmalloc with GFP_ATOMIC to allocate. I decided against this
> >   since relying on kmalloc when the system is hard locked up seems
> >   like a bad idea.
> > - Change the arch_trigger_cpumask_backtrace() across all architectures
> >   to take an extra parameter to get the needed behavior. This seems
> >   like a lot of churn for a small savings.
> >
> > Fixes: 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()")
> > Reported-by: kernel test robot <lkp@intel.com>
> > Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> >
> >  kernel/watchdog.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > index be38276a365f..19db2357969a 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -151,9 +151,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> >        */
> >       if (is_hardlockup(cpu)) {
> >               unsigned int this_cpu = smp_processor_id();
> > -             struct cpumask backtrace_mask;
> > -
> > -             cpumask_copy(&backtrace_mask, cpu_online_mask);
> >
> >               /* Only print hardlockups once. */
> >               if (per_cpu(watchdog_hardlockup_warned, cpu))
> > @@ -167,10 +164,8 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> >                               show_regs(regs);
> >                       else
> >                               dump_stack();
> > -                     cpumask_clear_cpu(cpu, &backtrace_mask);
> >               } else {
> > -                     if (trigger_single_cpu_backtrace(cpu))
> > -                             cpumask_clear_cpu(cpu, &backtrace_mask);
> > +                     trigger_single_cpu_backtrace(cpu);
> >               }
> >
> >               /*
> > @@ -178,8 +173,13 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> >                * hardlockups generating interleaving traces
> >                */
> >               if (sysctl_hardlockup_all_cpu_backtrace &&
> > -                 !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
> > +                 !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped)) {
> > +                     static struct cpumask backtrace_mask;
> > +
> > +                     cpumask_copy(&backtrace_mask, cpu_online_mask);
> > +                     cpumask_clear_cpu(cpu, &backtrace_mask);
> >                       trigger_cpumask_backtrace(&backtrace_mask);
>
> This looks rather wasteful to just copy the cpumask over to
> backtrace_mask in nmi_trigger_cpumask_backtrace (which all but sparc
> arches do AFAICS).
>
> Would it be possible to use arch_trigger_cpumask_backtrace(cpu_online_mask, false)
> and special case cpu != this_cpu && sysctl_hardlockup_all_cpu_backtrace?

So you're saying optimize the case where cpu == this_cpu and then have
a special case (where we still copy) for cpu != this_cpu? I can do
that if that's what people want, but (assuming I understand correctly)
that's making the wrong tradeoff. Specifically, this code runs one
time right as we're crashing and if it takes an extra millisecond to
run it's not a huge deal. It feels better to avoid the special case
and keep the code smaller.

Let me know if I misunderstood.

-Doug
  
Michal Hocko Aug. 1, 2023, 3:26 p.m. UTC | #3
On Tue 01-08-23 07:16:15, Doug Anderson wrote:
> Hi,
> 
> On Tue, Aug 1, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 31-07-23 09:17:59, Douglas Anderson wrote:
> > > After commit 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to
> > > watchdog_hardlockup_check()") we started storing a `struct cpumask` on
> > > the stack in watchdog_hardlockup_check(). On systems with
> > > CONFIG_NR_CPUS set to 8192 this takes up 1K on the stack. That
> > > triggers warnings with `CONFIG_FRAME_WARN` set to 1024.
> > >
> > > Instead of putting this `struct cpumask` on the stack, let's declare
> > > it as `static`. This has the downside of taking up 1K of memory all
> > > the time on systems with `CONFIG_NR_CPUS` set to 8192, but on systems with
> > > smaller `CONFIG_NR_CPUS` it's not much memory (with 128 CPUs it's only
> > > 16 bytes of memory). Presumably anyone building a system with
> > > `CONFIG_NR_CPUS=8192` can afford the extra 1K of memory.
> > >
> > > NOTE: as part of this change, we no longer check the return value of
> > > trigger_single_cpu_backtrace(). While we could do this and only call
> > > cpumask_clear_cpu() if trigger_single_cpu_backtrace() didn't fail,
> > > that's probably not worth it. There's no reason to believe that
> > > trigger_cpumask_backtrace() will succeed at backtracing the CPU when
> > > trigger_single_cpu_backtrace() failed.
> > >
> > > Alternatives considered:
> > > - Use kmalloc with GFP_ATOMIC to allocate. I decided against this
> > >   since relying on kmalloc when the system is hard locked up seems
> > >   like a bad idea.
> > > - Change the arch_trigger_cpumask_backtrace() across all architectures
> > >   to take an extra parameter to get the needed behavior. This seems
> > >   like a lot of churn for a small savings.
> > >
> > > Fixes: 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()")
> > > Reported-by: kernel test robot <lkp@intel.com>
> > > Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com
> > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > ---
> > >
> > >  kernel/watchdog.c | 14 +++++++-------
> > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > > index be38276a365f..19db2357969a 100644
> > > --- a/kernel/watchdog.c
> > > +++ b/kernel/watchdog.c
> > > @@ -151,9 +151,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > >        */
> > >       if (is_hardlockup(cpu)) {
> > >               unsigned int this_cpu = smp_processor_id();
> > > -             struct cpumask backtrace_mask;
> > > -
> > > -             cpumask_copy(&backtrace_mask, cpu_online_mask);
> > >
> > >               /* Only print hardlockups once. */
> > >               if (per_cpu(watchdog_hardlockup_warned, cpu))
> > > @@ -167,10 +164,8 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > >                               show_regs(regs);
> > >                       else
> > >                               dump_stack();
> > > -                     cpumask_clear_cpu(cpu, &backtrace_mask);
> > >               } else {
> > > -                     if (trigger_single_cpu_backtrace(cpu))
> > > -                             cpumask_clear_cpu(cpu, &backtrace_mask);
> > > +                     trigger_single_cpu_backtrace(cpu);
> > >               }
> > >
> > >               /*
> > > @@ -178,8 +173,13 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > >                * hardlockups generating interleaving traces
> > >                */
> > >               if (sysctl_hardlockup_all_cpu_backtrace &&
> > > -                 !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
> > > +                 !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped)) {
> > > +                     static struct cpumask backtrace_mask;
> > > +
> > > +                     cpumask_copy(&backtrace_mask, cpu_online_mask);
> > > +                     cpumask_clear_cpu(cpu, &backtrace_mask);
> > >                       trigger_cpumask_backtrace(&backtrace_mask);
> >
> > This looks rather wasteful to just copy the cpumask over to
> > backtrace_mask in nmi_trigger_cpumask_backtrace (which all but sparc
> > arches do AFAICS).
> >
> > Would it be possible to use arch_trigger_cpumask_backtrace(cpu_online_mask, false)
> > and special case cpu != this_cpu && sysctl_hardlockup_all_cpu_backtrace?
> 
> So you're saying optimize the case where cpu == this_cpu and then have
> a special case (where we still copy) for cpu != this_cpu? I can do
> that if that's what people want, but (assuming I understand correctly)
> that's making the wrong tradeoff. Specifically, this code runs one
> time right as we're crashing and if it takes an extra millisecond to
> run it's not a huge deal. It feels better to avoid the special case
> and keep the code smaller.
> 
> Let me know if I misunderstood.

I meant something like this (untested but it should give an idea what I
mean hopefully). Maybe it can be simplified even further. TBH I am not
entirely sure why cpu == this_cpu needs this special handling.
--- 
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index be38276a365f..0eedac9f1019 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -151,9 +151,7 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 	 */
 	if (is_hardlockup(cpu)) {
 		unsigned int this_cpu = smp_processor_id();
-		struct cpumask backtrace_mask;
-
-		cpumask_copy(&backtrace_mask, cpu_online_mask);
+		bool dump_all = false;
 
 		/* Only print hardlockups once. */
 		if (per_cpu(watchdog_hardlockup_warned, cpu))
@@ -167,10 +165,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 				show_regs(regs);
 			else
 				dump_stack();
-			cpumask_clear_cpu(cpu, &backtrace_mask);
-		} else {
-			if (trigger_single_cpu_backtrace(cpu))
-				cpumask_clear_cpu(cpu, &backtrace_mask);
 		}
 
 		/*
@@ -179,7 +173,12 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 		 */
 		if (sysctl_hardlockup_all_cpu_backtrace &&
 		    !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
-			trigger_cpumask_backtrace(&backtrace_mask);
+			dump_all = true;
+
+		if (dump_all)
+			arch_trigger_cpumask_backtrace(cpu_online_mask, cpu != this_cpu);
+		else if (cpu != this_cpu)
+			trigger_single_cpu_backtrace(cpu);
 
 		if (hardlockup_panic)
 			nmi_panic(regs, "Hard LOCKUP");
  
Doug Anderson Aug. 1, 2023, 3:41 p.m. UTC | #4
Hi,

On Tue, Aug 1, 2023 at 8:26 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 01-08-23 07:16:15, Doug Anderson wrote:
> > Hi,
> >
> > On Tue, Aug 1, 2023 at 5:58 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 31-07-23 09:17:59, Douglas Anderson wrote:
> > > > After commit 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to
> > > > watchdog_hardlockup_check()") we started storing a `struct cpumask` on
> > > > the stack in watchdog_hardlockup_check(). On systems with
> > > > CONFIG_NR_CPUS set to 8192 this takes up 1K on the stack. That
> > > > triggers warnings with `CONFIG_FRAME_WARN` set to 1024.
> > > >
> > > > Instead of putting this `struct cpumask` on the stack, let's declare
> > > > it as `static`. This has the downside of taking up 1K of memory all
> > > > the time on systems with `CONFIG_NR_CPUS` set to 8192, but on systems with
> > > > smaller `CONFIG_NR_CPUS` it's not much memory (with 128 CPUs it's only
> > > > 16 bytes of memory). Presumably anyone building a system with
> > > > `CONFIG_NR_CPUS=8192` can afford the extra 1K of memory.
> > > >
> > > > NOTE: as part of this change, we no longer check the return value of
> > > > trigger_single_cpu_backtrace(). While we could do this and only call
> > > > cpumask_clear_cpu() if trigger_single_cpu_backtrace() didn't fail,
> > > > that's probably not worth it. There's no reason to believe that
> > > > trigger_cpumask_backtrace() will succeed at backtracing the CPU when
> > > > trigger_single_cpu_backtrace() failed.
> > > >
> > > > Alternatives considered:
> > > > - Use kmalloc with GFP_ATOMIC to allocate. I decided against this
> > > >   since relying on kmalloc when the system is hard locked up seems
> > > >   like a bad idea.
> > > > - Change the arch_trigger_cpumask_backtrace() across all architectures
> > > >   to take an extra parameter to get the needed behavior. This seems
> > > >   like a lot of churn for a small savings.
> > > >
> > > > Fixes: 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()")
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com
> > > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > > ---
> > > >
> > > >  kernel/watchdog.c | 14 +++++++-------
> > > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > > > index be38276a365f..19db2357969a 100644
> > > > --- a/kernel/watchdog.c
> > > > +++ b/kernel/watchdog.c
> > > > @@ -151,9 +151,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > > >        */
> > > >       if (is_hardlockup(cpu)) {
> > > >               unsigned int this_cpu = smp_processor_id();
> > > > -             struct cpumask backtrace_mask;
> > > > -
> > > > -             cpumask_copy(&backtrace_mask, cpu_online_mask);
> > > >
> > > >               /* Only print hardlockups once. */
> > > >               if (per_cpu(watchdog_hardlockup_warned, cpu))
> > > > @@ -167,10 +164,8 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > > >                               show_regs(regs);
> > > >                       else
> > > >                               dump_stack();
> > > > -                     cpumask_clear_cpu(cpu, &backtrace_mask);
> > > >               } else {
> > > > -                     if (trigger_single_cpu_backtrace(cpu))
> > > > -                             cpumask_clear_cpu(cpu, &backtrace_mask);
> > > > +                     trigger_single_cpu_backtrace(cpu);
> > > >               }
> > > >
> > > >               /*
> > > > @@ -178,8 +173,13 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > > >                * hardlockups generating interleaving traces
> > > >                */
> > > >               if (sysctl_hardlockup_all_cpu_backtrace &&
> > > > -                 !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
> > > > +                 !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped)) {
> > > > +                     static struct cpumask backtrace_mask;
> > > > +
> > > > +                     cpumask_copy(&backtrace_mask, cpu_online_mask);
> > > > +                     cpumask_clear_cpu(cpu, &backtrace_mask);
> > > >                       trigger_cpumask_backtrace(&backtrace_mask);
> > >
> > > This looks rather wasteful to just copy the cpumask over to
> > > backtrace_mask in nmi_trigger_cpumask_backtrace (which all but sparc
> > > arches do AFAICS).
> > >
> > > Would it be possible to use arch_trigger_cpumask_backtrace(cpu_online_mask, false)
> > > and special case cpu != this_cpu && sysctl_hardlockup_all_cpu_backtrace?
> >
> > So you're saying optimize the case where cpu == this_cpu and then have
> > a special case (where we still copy) for cpu != this_cpu? I can do
> > that if that's what people want, but (assuming I understand correctly)
> > that's making the wrong tradeoff. Specifically, this code runs one
> > time right as we're crashing and if it takes an extra millisecond to
> > run it's not a huge deal. It feels better to avoid the special case
> > and keep the code smaller.
> >
> > Let me know if I misunderstood.
>
> I meant something like this (untested but it should give an idea what I
> mean hopefully). Maybe it can be simplified even further. TBH I am not
> entirely sure why cpu == this_cpu needs this special handling.
> ---
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index be38276a365f..0eedac9f1019 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -151,9 +151,7 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>          */
>         if (is_hardlockup(cpu)) {
>                 unsigned int this_cpu = smp_processor_id();
> -               struct cpumask backtrace_mask;
> -
> -               cpumask_copy(&backtrace_mask, cpu_online_mask);
> +               bool dump_all = false;
>
>                 /* Only print hardlockups once. */
>                 if (per_cpu(watchdog_hardlockup_warned, cpu))
> @@ -167,10 +165,6 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>                                 show_regs(regs);
>                         else
>                                 dump_stack();
> -                       cpumask_clear_cpu(cpu, &backtrace_mask);
> -               } else {
> -                       if (trigger_single_cpu_backtrace(cpu))
> -                               cpumask_clear_cpu(cpu, &backtrace_mask);
>                 }
>
>                 /*
> @@ -179,7 +173,12 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>                  */
>                 if (sysctl_hardlockup_all_cpu_backtrace &&
>                     !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
> -                       trigger_cpumask_backtrace(&backtrace_mask);
> +                       dump_all = true;
> +
> +               if (dump_all)
> +                       arch_trigger_cpumask_backtrace(cpu_online_mask, cpu != this_cpu);
> +               else if (cpu != this_cpu)
> +                       trigger_single_cpu_backtrace(cpu);

Ah, I see what you mean. The one issue I have with your solution is
that the ordering of the stack crawls is less ideal in the "dump all"
case when cpu != this_cpu. We really want to see the stack crawl of
the locked up CPU first and _then_ see the stack crawls of other CPUs.
With your solution the locked up CPU will be interspersed with all the
others and will be harder to find in the output (you've got to match
it up with the "Watchdog detected hard LOCKUP on cpu N" message).
While that's probably not a huge deal, it's nicer to make the output
easy to understand for someone trying to parse it...

-Doug
  
Michal Hocko Aug. 2, 2023, 7:27 a.m. UTC | #5
On Tue 01-08-23 08:41:49, Doug Anderson wrote:
[...]
> Ah, I see what you mean. The one issue I have with your solution is
> that the ordering of the stack crawls is less ideal in the "dump all"
> case when cpu != this_cpu. We really want to see the stack crawl of
> the locked up CPU first and _then_ see the stack crawls of other CPUs.
> With your solution the locked up CPU will be interspersed with all the
> others and will be harder to find in the output (you've got to match
> it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> While that's probably not a huge deal, it's nicer to make the output
> easy to understand for someone trying to parse it...

Is it worth wasting memory for this arguably nicer output? Identifying
the stack of the locked up CPU is trivial.
  
Doug Anderson Aug. 2, 2023, 2:12 p.m. UTC | #6
Hi,

On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> [...]
> > Ah, I see what you mean. The one issue I have with your solution is
> > that the ordering of the stack crawls is less ideal in the "dump all"
> > case when cpu != this_cpu. We really want to see the stack crawl of
> > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > With your solution the locked up CPU will be interspersed with all the
> > others and will be harder to find in the output (you've got to match
> > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > While that's probably not a huge deal, it's nicer to make the output
> > easy to understand for someone trying to parse it...
>
> Is it worth wasting memory for this arguably nicer output? Identifying
> the stack of the locked up CPU is trivial.

I guess it's debatable, but as someone who has spent time trawling
through reports generated like this, I'd say "yes", it's
super helpful in understanding the problem to have the hung CPU first.
Putting the memory usage in perspective:

* On a kernel built with a more normal number of max CPUs, like 256,
this is only a use of 32 bytes of memory. That's 8 CPU instructions
worth of memory.

* Even on a system with the largest number of max CPUs we currently
allow (8192), this is only a use of 1024 bytes of memory. Sure, that's
a big chunk, but this is also only on our largest systems.

In any case, how about this. We only need the memory allocated if
`sysctl_hardlockup_all_cpu_backtrace` is non-zero. I can hook in
whenever that's changed (should be just at bootup) and kmalloc the
memory then. This really limits the extra memory to just cases when
it's useful. Presumably on systems that are designed to run massively
SMP they wouldn't want to turn this knob on anyway since it would spew
far too much data. If you took a kernel compiled for max SMP, ran it
on a machine with only a few cores, and wanted this feature turned on
then at most you'd be chewing up 1K. In the average case this would
chew up some extra memory (extra CPU instructions to implement the
function take code space, extra overhead around kmalloc) but it would
avoid the 1K chunk in most cases.
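
Something like this completely untested sketch (the hook and variable
names here are made up, just to illustrate the idea):

	static struct cpumask *hardlockup_backtrace_mask;

	/*
	 * Hypothetical hook, run whenever the sysctl is written
	 * (normally just once at boot). Only allocates when the
	 * feature is on, so systems that never enable it pay nothing.
	 */
	void hardlockup_all_cpu_backtrace_update(void)
	{
		if (sysctl_hardlockup_all_cpu_backtrace &&
		    !hardlockup_backtrace_mask)
			hardlockup_backtrace_mask =
				kmalloc(cpumask_size(), GFP_KERNEL);
	}

watchdog_hardlockup_check() would then only do the copy/clear/trigger
dance if hardlockup_backtrace_mask ended up non-NULL.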

Does that sound reasonable?

I guess my last alternative would be to keep the special case of
tracing the hung CPU first (this is the most important part IMO) and
then accept the double trace, AKA:

/* Try to avoid re-dumping the stack on the hung CPU if possible */
if (cpu == this_cpu)
  trigger_allbutself_cpu_backtrace();
else
  trigger_all_cpu_backtrace();

-Doug
  
Michal Hocko Aug. 2, 2023, 4:05 p.m. UTC | #7
On Wed 02-08-23 07:12:29, Doug Anderson wrote:
> Hi,
> 
> On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > [...]
> > > Ah, I see what you mean. The one issue I have with your solution is
> > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > With your solution the locked up CPU will be interspersed with all the
> > > others and will be harder to find in the output (you've got to match
> > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > While that's probably not a huge deal, it's nicer to make the output
> > > easy to understand for someone trying to parse it...
> >
> > Is it worth wasting memory for this arguably nicer output? Identifying
> > the stack of the locked up CPU is trivial.
> 
> I guess it's debatable, but as someone who has spent time trawling
> through reports generated like this, I'd say "yes", it's
> super helpful in understanding the problem to have the hung CPU first.

Well, I have to admit that most lockdep splats I have dealt with
recently do not come with sysctl_hardlockup_all_cpu_backtrace so I
cannot really judge.

> Putting the memory usage in perspective:
> 
> * On a kernel built with a more normal number of max CPUs, like 256,
> this is only a use of 32 bytes of memory. That's 8 CPU instructions
> worth of memory.

Think of distribution kernels that many people use. E.g. the SLES kernel
uses 8k CONFIG_NR_CPUS.

> * Even on a system with the largest number of max CPUs we currently
> allow (8192), this is only a use of 1024 bytes of memory. Sure, that's
> a big chunk, but this is also only on our largest systems.

This is independent of the size of the machine if you are using
pre-built kernels.

> In any case, how about this. We only need the memory allocated if
> `sysctl_hardlockup_all_cpu_backtrace` is non-zero. I can hook in
> whenever that's changed (should be just at bootup) and kmalloc the
> memory then.

this is certainly better than the original proposal

> This really limits the extra memory to just cases when
> it's useful. Presumably on systems that are designed to run massively
> SMP they wouldn't want to turn this knob on anyway since it would spew
> far too much data. If you took a kernel compiled for max SMP, ran it
> on a machine with only a few cores, and wanted this feature turned on
> then at most you'd be chewing up 1K. In the average case this would
> chew up some extra memory (extra CPU instructions to implement the
> function take code space, extra overhead around kmalloc) but it would
> avoid the 1K chunk in most cases.
> 
> Does that sound reasonable?

If the locked-up CPU being first is a real requirement (and this
seems debatable) then sure, why not. I do not feel strongly enough to
argue one way or the other; maybe others have an opinion on that.

> I guess my last alternative would be to keep the special case of
> tracing the hung CPU first (this is the most important part IMO) and
> then accept the double trace, AKA:

That sounds wrong.
 
> /* Try to avoid re-dumping the stack on the hung CPU if possible */
> if (cpu == this_cpu))
>   trigger_allbutself_cpu_backtrace();
> else
>   trigger_all_cpu_backtrace();
> 
> -Doug
  
Petr Mladek Aug. 3, 2023, 8:12 a.m. UTC | #8
On Wed 2023-08-02 07:12:29, Doug Anderson wrote:
> Hi,
> 
> On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > [...]
> > > Ah, I see what you mean. The one issue I have with your solution is
> > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > With your solution the locked up CPU will be interspersed with all the
> > > others and will be harder to find in the output (you've got to match
> > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > While that's probably not a huge deal, it's nicer to make the output
> > > easy to understand for someone trying to parse it...
> >
> > Is it worth wasting memory for this arguably nicer output? Identifying
> > the stack of the locked up CPU is trivial.
> 
> I guess it's debatable, but as someone who has spent time trawling
> through reports generated like this, I'd say "yes", it's
> super helpful in understanding the problem to have the hung CPU first.
> Putting the memory usage in perspective:

nmi_trigger_cpumask_backtrace() has its own copy of the cpu mask.
What about changing the @exclude_self parameter to @exclude_cpu
and do:

	if (exclude_cpu >= 0)
		cpumask_clear_cpu(exclude_cpu, to_cpumask(backtrace_mask));


It would require changing also arch_trigger_cpumask_backtrace() to

	void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
				    int exclude_cpu);

but it looks doable.
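
With that, the watchdog side wouldn't need its own mask at all. Very
roughly (an untested sketch of the call site under the proposed
signature):

	/*
	 * The hung CPU was already dumped above via show_regs(),
	 * dump_stack(), or trigger_single_cpu_backtrace(cpu), so
	 * exclude it here; this also keeps the hung CPU's trace first.
	 */
	if (sysctl_hardlockup_all_cpu_backtrace &&
	    !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
		arch_trigger_cpumask_backtrace(cpu_online_mask, cpu);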


It might theoretically change the behavior because
nmi_trigger_cpumask_backtrace() does

	this_cpu = get_cpu();

I guess that it was used to make sure that smp_processor_id() stays
the same. But it is too late. It should ignore the caller's CPU.
So, this might actually fix a possible race.

Note that get_cpu() also disables preemption. IMHO, it is not strictly needed.
But it might make sense to keep it because the backtraces are printed
when something goes wrong and it should print backtraces from the current state.

Best Regards,
Petr
  
Michal Hocko Aug. 3, 2023, 8:30 a.m. UTC | #9
On Thu 03-08-23 10:12:12, Petr Mladek wrote:
> On Wed 2023-08-02 07:12:29, Doug Anderson wrote:
> > Hi,
> > 
> > On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > > [...]
> > > > Ah, I see what you mean. The one issue I have with your solution is
> > > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > > With your solution the locked up CPU will be interspersed with all the
> > > > others and will be harder to find in the output (you've got to match
> > > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > > While that's probably not a huge deal, it's nicer to make the output
> > > > easy to understand for someone trying to parse it...
> > >
> > > Is it worth wasting memory for this arguably nicer output? Identifying
> > > the stack of the locked up CPU is trivial.
> > 
> > I guess it's debatable, but as someone who has spent time trawling
> > through reports generated like this, I'd say "yes", it's
> > super helpful in understanding the problem to have the hung CPU first.
> > Putting the memory usage in perspective:
> 
> nmi_trigger_cpumask_backtrace() has its own copy of the cpu mask.
> What about changing the @exclude_self parameter to @exclude_cpu
> and do:
> 
> 	if (exclude_cpu >= 0)
> 		cpumask_clear_cpu(exclude_cpu, to_cpumask(backtrace_mask));
> 
> 
> It would require changing also arch_trigger_cpumask_backtrace() to
> 
> 	void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
> 				    int exclude_cpu);
> 
> but it looks doable.

Yes, but sparc is doing its own thing so it would require changing that
as well. But this looks reasonable as well.
  
Doug Anderson Aug. 3, 2023, 11:10 p.m. UTC | #10
Hi,

On Thu, Aug 3, 2023 at 1:30 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 03-08-23 10:12:12, Petr Mladek wrote:
> > On Wed 2023-08-02 07:12:29, Doug Anderson wrote:
> > > Hi,
> > >
> > > On Wed, Aug 2, 2023 at 12:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 01-08-23 08:41:49, Doug Anderson wrote:
> > > > [...]
> > > > > Ah, I see what you mean. The one issue I have with your solution is
> > > > > that the ordering of the stack crawls is less ideal in the "dump all"
> > > > > case when cpu != this_cpu. We really want to see the stack crawl of
> > > > > the locked up CPU first and _then_ see the stack crawls of other CPUs.
> > > > > With your solution the locked up CPU will be interspersed with all the
> > > > > others and will be harder to find in the output (you've got to match
> > > > > it up with the "Watchdog detected hard LOCKUP on cpu N" message).
> > > > > While that's probably not a huge deal, it's nicer to make the output
> > > > > easy to understand for someone trying to parse it...
> > > >
> > > > Is it worth wasting memory for this arguably nicer output? Identifying
> > > > the stack of the locked up CPU is trivial.
> > >
> > > I guess it's debatable, but as someone who has spent time trawling
> > > through reports generated like this, I'd say "yes", it's
> > > super helpful in understanding the problem to have the hung CPU first.
> > > Putting the memory usage in perspective:
> >
> > nmi_trigger_cpumask_backtrace() has its own copy of the cpu mask.
> > What about changing the @exclude_self parameter to @exclude_cpu
> > and do:
> >
> >       if (exclude_cpu >= 0)
> >               cpumask_clear_cpu(exclude_cpu, to_cpumask(backtrace_mask));
> >
> >
> > It would require changing also arch_trigger_cpumask_backtrace() to
> >
> >       void arch_trigger_cpumask_backtrace(const struct cpumask *mask,
> >                                   int exclude_cpu);
> >
> > but it looks doable.
>
> Yes, but sparc is doing its own thing so it would require changing that
> as well. But this looks reasonable as well.

OK. I've tried a v3 with that:

https://lore.kernel.org/r/20230803160649.v3.2.I501ab68cb926ee33a7c87e063d207abf09b9943c@changeid

-Doug
  

Patch

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index be38276a365f..19db2357969a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -151,9 +151,6 @@  void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 	 */
 	if (is_hardlockup(cpu)) {
 		unsigned int this_cpu = smp_processor_id();
-		struct cpumask backtrace_mask;
-
-		cpumask_copy(&backtrace_mask, cpu_online_mask);
 
 		/* Only print hardlockups once. */
 		if (per_cpu(watchdog_hardlockup_warned, cpu))
@@ -167,10 +164,8 @@  void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 				show_regs(regs);
 			else
 				dump_stack();
-			cpumask_clear_cpu(cpu, &backtrace_mask);
 		} else {
-			if (trigger_single_cpu_backtrace(cpu))
-				cpumask_clear_cpu(cpu, &backtrace_mask);
+			trigger_single_cpu_backtrace(cpu);
 		}
 
 		/*
@@ -178,8 +173,13 @@  void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 		 * hardlockups generating interleaving traces
 		 */
 		if (sysctl_hardlockup_all_cpu_backtrace &&
-		    !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
+		    !test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped)) {
+			static struct cpumask backtrace_mask;
+
+			cpumask_copy(&backtrace_mask, cpu_online_mask);
+			cpumask_clear_cpu(cpu, &backtrace_mask);
 			trigger_cpumask_backtrace(&backtrace_mask);
+		}
 
 		if (hardlockup_panic)
 			nmi_panic(regs, "Hard LOCKUP");