[RFC] smp,csd: throw an error if a CSD lock is stuck for too long

Message ID 20230821160409.663b8ba9@imladris.surriel.com
State New
Series [RFC] smp,csd: throw an error if a CSD lock is stuck for too long

Commit Message

Rik van Riel Aug. 21, 2023, 8:04 p.m. UTC
  The CSD lock seems to get stuck in two "modes". When it gets stuck
temporarily, it usually gets released in a few seconds, and sometimes
up to one or two minutes.

If the CSD lock stays stuck for more than a few minutes, it never
seems to get unstuck, and gradually more and more things in the system
end up also getting stuck.

In the latter case, we should just give up, so the system can dump out
a little more information about what went wrong, and, with panic_on_oops
and a kdump kernel loaded, dump a whole bunch more information about
what might have gone wrong.

Question: should this have its own panic_on_ipistall switch in
/proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
way than via BUG_ON?

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/smp.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
  

Comments

Paul E. McKenney Aug. 21, 2023, 8:29 p.m. UTC | #1
On Mon, Aug 21, 2023 at 04:04:09PM -0400, Rik van Riel wrote:
> The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> temporarily, it usually gets released in a few seconds, and sometimes
> up to one or two minutes.
> 
> If the CSD lock stays stuck for more than several minutes, it never
> seems to get unstuck, and gradually more and more things in the system
> end up also getting stuck.
> 
> In the latter case, we should just give up, so the system can dump out
> a little more information about what went wrong, and, with panic_on_oops
> and a kdump kernel loaded, dump a whole bunch more information about
> what might have gone wrong.
> 
> Question: should this have its own panic_on_ipistall switch in
> /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> way than via BUG_ON?

Keep it simple until there is a demonstrated need?

> Signed-off-by: Rik van Riel <riel@surriel.com>

Looks good to me!

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

I am also pulling this into -rcu for further testing.

							Thanx, Paul

> ---
>  kernel/smp.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 385179dae360..8b808bff15e6 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -228,6 +228,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
>  	}
>  
>  	ts2 = sched_clock();
> +	/* How long since we last checked for a stuck CSD lock.*/
>  	ts_delta = ts2 - *ts1;
>  	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
>  		return false;
> @@ -241,9 +242,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
>  	else
>  		cpux = cpu;
>  	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
> +	/* How long since this CSD lock was stuck. */
> +	ts_delta = ts2 - ts0;
>  	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
> -		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
> +		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
>  		 cpu, csd->func, csd->info);
> +	/*
> +	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
> +	 * to become unstuck. Use a signed comparison to avoid triggering
> +	 * on underflows when the TSC is out of sync between sockets.
> +	 */
> +	BUG_ON((s64)ts_delta > 300000000000LL);
>  	if (cpu_cur_csd && csd != cpu_cur_csd) {
>  		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
>  			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
> -- 
> 2.41.0
> 
>
  
Peter Zijlstra Sept. 13, 2023, 1:22 p.m. UTC | #2
On Mon, Aug 21, 2023 at 04:04:09PM -0400, Rik van Riel wrote:
> The CSD lock seems to get stuck in 2 "modes". When it gets stuck
> temporarily, it usually gets released in a few seconds, and sometimes
> up to one or two minutes.
> 
> If the CSD lock stays stuck for more than several minutes, it never
> seems to get unstuck, and gradually more and more things in the system
> end up also getting stuck.
> 
> In the latter case, we should just give up, so the system can dump out
> a little more information about what went wrong, and, with panic_on_oops
> and a kdump kernel loaded, dump a whole bunch more information about
> what might have gone wrong.
> 
> Question: should this have its own panic_on_ipistall switch in
> /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
> way than via BUG_ON?
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/smp.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 385179dae360..8b808bff15e6 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -228,6 +228,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
>  	}
>  
>  	ts2 = sched_clock();
> +	/* How long since we last checked for a stuck CSD lock.*/
>  	ts_delta = ts2 - *ts1;
>  	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
>  		return false;
> @@ -241,9 +242,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
>  	else
>  		cpux = cpu;
>  	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
> +	/* How long since this CSD lock was stuck. */
> +	ts_delta = ts2 - ts0;
>  	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
> -		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
> +		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
>  		 cpu, csd->func, csd->info);
> +	/*
> +	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
> +	 * to become unstuck. Use a signed comparison to avoid triggering
> +	 * on underflows when the TSC is out of sync between sockets.
> +	 */
> +	BUG_ON((s64)ts_delta > 300000000000LL);
>  	if (cpu_cur_csd && csd != cpu_cur_csd) {
>  		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
>  			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),

How are you guys still seeing this? I thought the KVM APIC thing was
fixed a while ago?
  
Rik van Riel Sept. 13, 2023, 2:33 p.m. UTC | #3
On Wed, 2023-09-13 at 15:22 +0200, Peter Zijlstra wrote:
> On Mon, Aug 21, 2023 at 04:04:09PM -0400, Rik van Riel wrote:
> > 
> > +       /* How long since this CSD lock was stuck. */
> > +       ts_delta = ts2 - ts0;
> > 
> > +       /*
> > +        * If the CSD lock is still stuck after 5 minutes, it is
> > unlikely
> > +        * to become unstuck. Use a signed comparison to avoid
> > triggering
> > +        * on underflows when the TSC is out of sync between
> > sockets.
> > +        */
> > +       BUG_ON((s64)ts_delta > 300000000000LL);
> >         if (cpu_cur_csd && csd != cpu_cur_csd) {
> >                 pr_alert("\tcsd: CSD lock (#%d) handling prior
> > %pS(%ps) request.\n",
> >                          *bug_id, READ_ONCE(per_cpu(cur_csd_func,
> > cpux)),
> 
> How are you guys still seeing this? I thought the KVM APIC thing was
> fixed a while ago?
> 
It's more fun than that. We're seeing this on bare metal.

Unfortunately, when a system gets wedged that way currently,
it ends up being power cycled automatically, and we aren't
getting crash dumps with clues on what causes the issue.

Doing a BUG_ON() + panic, followed by a kexec into the kdump
kernel will hopefully give us some clues on what might be
causing the issue.
  
Peter Zijlstra Sept. 13, 2023, 4:17 p.m. UTC | #4
On Wed, Sep 13, 2023 at 10:33:51AM -0400, Rik van Riel wrote:
> On Wed, 2023-09-13 at 15:22 +0200, Peter Zijlstra wrote:
> > On Mon, Aug 21, 2023 at 04:04:09PM -0400, Rik van Riel wrote:
> > > 
> > > +       /* How long since this CSD lock was stuck. */
> > > +       ts_delta = ts2 - ts0;
> > > 
> > > +       /*
> > > +        * If the CSD lock is still stuck after 5 minutes, it is
> > > unlikely
> > > +        * to become unstuck. Use a signed comparison to avoid
> > > triggering
> > > +        * on underflows when the TSC is out of sync between
> > > sockets.
> > > +        */
> > > +       BUG_ON((s64)ts_delta > 300000000000LL);
> > >         if (cpu_cur_csd && csd != cpu_cur_csd) {
> > >                 pr_alert("\tcsd: CSD lock (#%d) handling prior
> > > %pS(%ps) request.\n",
> > >                          *bug_id, READ_ONCE(per_cpu(cur_csd_func,
> > > cpux)),
> > 
> > How are you guys still seeing this? I thought the KVM APIC thing was
> > fixed a while ago?
> > 
> It's more fun than that. We're seeing this on bare metal.

Oh, 'fun' indeed, *groan*.

> Unfortunately, when a system gets wedged that way currently,
> it ends up being power cycled automatically, and we aren't
> getting crash dumps with clues on what causes the issue.
> 
> Doing a BUG_ON() + panic, followed by a kexec into the kdump
> kernel will hopefully give us some clues on what might be
> causing the issue.

I'm conflicted on the need to push such a debug patch upstream; OTOH,
given the amount of debug code already in csd, why not.

But yeah, curious to hear what comes out of this.
  
Rik van Riel Sept. 13, 2023, 8:17 p.m. UTC | #5
On Wed, 2023-09-13 at 18:17 +0200, Peter Zijlstra wrote:
> On Wed, Sep 13, 2023 at 10:33:51AM -0400, Rik van Riel wrote:
> > > 
> > It's more fun than that. We're seeing this on bare metal.
> 
> Oh, 'fun' indeed, *groan*.
> 
> > Unfortunately, when a system gets wedged that way currently,
> > it ends up being power cycled automatically, and we aren't
> > getting crash dumps with clues on what causes the issue.
> > 
> > Doing a BUG_ON() + panic, followed by a kexec into the kdump
> > kernel will hopefully give us some clues on what might be
> > causing the issue.
> 
> I'm conflicted on the need to push such a debug patch upstream; OTOH,
> given the amount of debug code already in csd, why not.
> 
> But yeah, curious to hear what comes out of this.
> 
Oh, there's more to it than just debugging the issue.

This will also help recover systems faster, since they
will end up panicking, kdumping, and rebooting, faster
than the "hey, that system looks like it's stuck"
power cycling scripts can get to it.
  

Patch

diff --git a/kernel/smp.c b/kernel/smp.c
index 385179dae360..8b808bff15e6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -228,6 +228,7 @@  static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	}
 
 	ts2 = sched_clock();
+	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
 		return false;
@@ -241,9 +242,17 @@  static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	else
 		cpux = cpu;
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
+	/* How long since this CSD lock was stuck. */
+	ts_delta = ts2 - ts0;
 	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
 		 cpu, csd->func, csd->info);
+	/*
+	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
+	 * to become unstuck. Use a signed comparison to avoid triggering
+	 * on underflows when the TSC is out of sync between sockets.
+	 */
+	BUG_ON((s64)ts_delta > 300000000000LL);
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),