Message ID | 20230613121615.639116359@linutronix.de |
---|---|
State | New |
Headers | show |
Series | x86/smp: Cure stop_other_cpus() and kexec() troubles |
Commit Message
Thomas Gleixner
June 13, 2023, 12:17 p.m. UTC
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.
That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.
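For illustration, a guarded readout would first check the highest supported extended leaf. A minimal sketch using the kernel's cpuid_eax() helper (the placement inside stop_this_cpu() and the exact form are assumptions here, not the patch itself):

	/*
	 * Sketch, not the patch: trust leaf 0x8000001f only if the CPU
	 * actually reports it. cpuid_eax(0x80000000) returns the highest
	 * supported extended leaf number.
	 */
	if (cpuid_eax(0x80000000) >= 0x8000001f) {
		/* Leaf 0x8000001f, EAX bit 0 advertises SME. */
		if (cpuid_eax(0x8000001f) & BIT(0))
			native_wbinvd();
	}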
That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:
    CPU0                                    CPU1

 stop_other_cpus()
   send_IPIs(REBOOT);                       stop_this_cpu()
   while (num_online_cpus() > 1);             set_online(false);
   proceed... -> hang
                                              wbinvd()
WBINVD is an expensive operation, and if multiple CPUs issue it at the same
time the resulting delays are even larger.

But CPU0 has already observed num_online_cpus() going down to 1 and
proceeds, which causes the system to hang.
Make this more robust by adding a counter which is set to the number of
online CPUs before sending the IPIs and decremented in stop_this_cpu()
after the WBINVD has completed. Check that counter in stop_other_cpus()
instead of watching num_online_cpus().
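Condensed, the resulting handshake looks like this (a sketch assembled from the diff below; the names and helpers are the patch's own):

	/* CPU0 in native_stop_other_cpus(), before sending the IPIs: */
	atomic_set(&stop_cpus_count, num_online_cpus() - 1);

	/* Each stopped CPU in stop_this_cpu(), after the WBINVD: */
	atomic_dec(&stop_cpus_count);

	/* CPU0 waits on the counter instead of num_online_cpus(): */
	while (atomic_read(&stop_cpus_count) > 0 && timeout--)
		udelay(1);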
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
---
arch/x86/include/asm/cpu.h |  2 ++
arch/x86/kernel/process.c  | 10 ++++++++++
arch/x86/kernel/smp.c      | 15 ++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)
Comments
Hi Thomas,

On Tue, Jun 13, 2023 at 02:17:55PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
>
> That wbinvd() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
>
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
>
>     CPU0                                    CPU1
>
>  stop_other_cpus()
>    send_IPIs(REBOOT);                       stop_this_cpu()
>    while (num_online_cpus() > 1);             set_online(false);
>    proceed... -> hang
>                                              wbinvd()
>
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.

Is this situation similar to what happened with the unexpected wakeups from
mwait_play_dead()?

i.e. the wbinvd() takes a while, and when CPU0 moves ahead, the previous
kernel marches past the wbinvd() instruction since we didn't wait to ensure
this has indeed completed?

native_machine_halt()
{
	machine_shutdown()->stop_other_cpus()
	stop_this_cpu();  <---- Unbalanced atomic_dec()?
}

But the last stop_this_cpu() in native_machine_halt() would make the count
go negative? Maybe that's OK since no one is waiting for it to go zero at
that point?

> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
>
> Make this more robust by adding a counter which is set to the number of
> online CPUs before sending the IPIs and decremented in stop_this_cpu()
> after the WBINVD completed. Check for that counter in stop_other_cpus()
> instead of watching num_online_cpus().
>
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <tonyb@cybernetics.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
> ---
>  arch/x86/include/asm/cpu.h |  2 ++
>  arch/x86/kernel/process.c  | 10 ++++++++++
>  arch/x86/kernel/smp.c      | 15 ++++++++++++---
>  3 files changed, 24 insertions(+), 3 deletions(-)
>
>
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
>  int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
>  int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
>
> +extern atomic_t stop_cpus_count;
> +
>  #endif /* _ASM_X86_CPU_H */
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -759,6 +759,8 @@ bool xen_set_default_idle(void)
>  }
>  #endif
>
> +atomic_t stop_cpus_count;
> +
>  void __noreturn stop_this_cpu(void *dummy)
>  {
>  	local_irq_disable();
> @@ -783,6 +785,14 @@ void __noreturn stop_this_cpu(void *dumm
>  	 */
>  	if (cpuid_eax(0x8000001f) & BIT(0))
>  		native_wbinvd();
> +
> +	/*
> +	 * native_stop_other_cpus() will write to @stop_cpus_count after
> +	 * observing that it went down to zero, which will invalidate the
> +	 * cacheline on this CPU.
> +	 */
> +	atomic_dec(&stop_cpus_count);
> +
>  	for (;;) {
>  		/*
>  		 * Use native_halt() so that memory contents don't change
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -27,6 +27,7 @@
>  #include <asm/mmu_context.h>
>  #include <asm/proto.h>
>  #include <asm/apic.h>
> +#include <asm/cpu.h>
>  #include <asm/idtentry.h>
>  #include <asm/nmi.h>
>  #include <asm/mce.h>
> @@ -171,6 +172,8 @@ static void native_stop_other_cpus(int w
>  	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
>  		return;
>
> +	atomic_set(&stop_cpus_count, num_online_cpus() - 1);
> +
>  	/* sync above data before sending IRQ */
>  	wmb();
>
> @@ -183,12 +186,12 @@ static void native_stop_other_cpus(int w
>  		 * CPUs reach shutdown state.
>  		 */
>  		timeout = USEC_PER_SEC;
> -		while (num_online_cpus() > 1 && timeout--)
> +		while (atomic_read(&stop_cpus_count) > 0 && timeout--)
>  			udelay(1);
>  	}
>
>  	/* if the REBOOT_VECTOR didn't work, try with the NMI */
> -	if (num_online_cpus() > 1) {
> +	if (atomic_read(&stop_cpus_count) > 0) {
>  		/*
>  		 * If NMI IPI is enabled, try to register the stop handler
>  		 * and send the IPI. In any case try to wait for the other
> @@ -208,7 +211,7 @@ static void native_stop_other_cpus(int w
>  		 * one or more CPUs do not reach shutdown state.
>  		 */
>  		timeout = USEC_PER_MSEC * 10;
> -		while (num_online_cpus() > 1 && (wait || timeout--))
> +		while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
>  			udelay(1);
>  	}

If we go down the INIT path, life is less complicated..

After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
Won't this completely update the atomic_dec() since CPUs in hlt() will also
take the NMI correct? I'm not sure if this is problematic.

Or should we reinitialize stop_cpus_count before the NMI hurrah?

> @@ -216,6 +219,12 @@ static void native_stop_other_cpus(int w
>  	disable_local_APIC();
>  	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
>  	local_irq_restore(flags);
> +
> +	/*
> +	 * Ensure that the cache line is invalidated on the other CPUs. See
> +	 * comment vs. SME in stop_this_cpu().
> +	 */
> +	atomic_set(&stop_cpus_count, INT_MAX);

Didn't understand why INT_MAX here?
On Wed, Jun 14 2023 at 12:42, Ashok Raj wrote:
> On Tue, Jun 13, 2023 at 02:17:55PM +0200, Thomas Gleixner wrote:
>>
>> WBINVD is an expensive operation and if multiple CPUs issue it at the same
>> time the resulting delays are even larger.
>
> Is this situation similar to what happened with the unexpected wakeups from
> mwait_play_dead()?
>
> i.e. the wbinvd() takes a while, and when CPU0 moves ahead, the previous
> kernel marches past the wbinvd() instruction since we didn't wait to ensure
> this has indeed completed?

This is about reboot and I don't know how wbinvd() interacts with that.
But yes, this could be an issue vs. kexec() too.

> native_machine_halt()
> {
> 	machine_shutdown()->stop_other_cpus()
> 	stop_this_cpu();  <---- Unbalanced atomic_dec()?
> }
>
> But the last stop_this_cpu() in native_machine_halt() would
> make the count go negative? Maybe that's OK since no one is waiting for it
> to go zero at that point?

Correct, it's the CPU which stopped the others.

>>  	timeout = USEC_PER_MSEC * 10;
>> -	while (num_online_cpus() > 1 && (wait || timeout--))
>> +	while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
>>  		udelay(1);
>>  	}
>
> If we go down the INIT path, life is less complicated..
>
> After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
> Won't this completely update the atomic_dec() since CPUs in hlt() will also
> take the NMI correct? I'm not sure if this is problematic.
>
> Or should we reinitialize stop_cpus_count before the NMI hurrah?

Bah. Didn't think about HLT. Let me go back to the drawing board. Good catch!

>> +	/*
>> +	 * Ensure that the cache line is invalidated on the other CPUs. See
>> +	 * comment vs. SME in stop_this_cpu().
>> +	 */
>> +	atomic_set(&stop_cpus_count, INT_MAX);
>
> Didn't understand why INT_MAX here?

Any random number will do. The only purpose is to ensure that there is no
dirty cache line on the other (stopped) CPUs.

Now let me look into this NMI cruft.

Thanks,

        tglx
On Wed, Jun 14, 2023 at 09:53:21PM +0200, Thomas Gleixner wrote:
> > If we go down the INIT path, life is less complicated..
> >
> > After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
> > Won't this completely update the atomic_dec() since CPUs in hlt() will also
> > take the NMI correct? I'm not sure if this is problematic.
> >
> > Or should we reinitialize stop_cpus_count before the NMI hurrah
>
> Bah. Didn't think about HLT. Let me go back to the drawing board. Good catch!
>
> >> +	/*
> >> +	 * Ensure that the cache line is invalidated on the other CPUs. See
> >> +	 * comment vs. SME in stop_this_cpu().
> >> +	 */
> >> +	atomic_set(&stop_cpus_count, INT_MAX);
> >
> > Didn't understand why INT_MAX here?
>
> Any random number will do. The only purpose is to ensure that there is
> no dirty cache line on the other (stopped) CPUs.
>
> Now let me look into this NMI cruft.

Maybe if each CPU going down can set their mask, we can simply hit NMI to
only the problematic ones?

The simple count doesn't capture the CPUs in trouble.
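For illustration, one hypothetical shape of that suggestion, using a cpumask instead of a bare count so the NMI fallback can target only the stragglers (names and placement here are invented for the sketch, not from the patch):

	/* Hypothetical sketch: track the not-yet-stopped CPUs in a mask. */
	static struct cpumask cpus_stop_mask;

	/* native_stop_other_cpus(): mark everyone but the stopping CPU. */
	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
	cpumask_clear_cpu(smp_processor_id(), &cpus_stop_mask);

	/* stop_this_cpu(): each CPU clears its own bit after the WBINVD. */
	cpumask_clear_cpu(smp_processor_id(), &cpus_stop_mask);

	/* The NMI fallback then only hits CPUs which did not check in: */
	for_each_cpu(cpu, &cpus_stop_mask)
		apic->send_IPI(cpu, NMI_VECTOR);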
On Wed, Jun 14 2023 at 13:47, Ashok Raj wrote:
> On Wed, Jun 14, 2023 at 09:53:21PM +0200, Thomas Gleixner wrote:
>>
>> Now let me look into this NMI cruft.
>>
>
> Maybe if each CPU going down can set their mask, we can simply hit NMI to
> only the problematic ones?
>
> The simple count doesn't capture the CPUs in trouble.

Even a mask is not cutting it. If CPUs did not react on the reboot vector
then there is no guarantee that they are not going to do so concurrently
to the NMI IPI:

CPU0                                    CPU1

IPI(BROADCAST, REBOOT);
wait()
  // timeout                            stop_this_cpu()

if (!all_stopped()) {
    for_each_cpu(cpu, mask) {               mark_stopped(); <- all_stopped() == true now
        IPI(cpu, NMI);
    }                                   --> NMI()

// no wait() because all_stopped() == true

proceed_and_hope()
....

On bare metal this is likely to "work" by chance, but in a guest all bets
are off.

I'm not surprised at all. The approach of piling hardware and firmware
legacy on top of hardware and firmware legacy in the hope that we can
"fix" that in software was wrong from the very beginning.

What's surprising is that this worked for a really long time. Though with
increasing complexity the thereby produced debris is starting to rear its
ugly head.

I'm sure the marketing departments of _all_ x86 vendors will come up with
a brilliant slogan for that. Something like:

  "We are committed to ensure that you are able to experience the failures
   of the past forever with increasingly improved performance and new
   exciting features which are fully backwards failure compatible."

TBH, the (OS) software industry has proliferated that by joining the
'features first' choir without much thought and push back. See
arch/x86/kernel/cpu/* for prime examples.

Ranted enough. I'm going to sleep now and look at this mess tomorrow
morning with brain awake again. Though that will not change the underlying
problem, which is unfixable.

Thanks,

        tglx
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
 int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
 int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
 
+extern atomic_t stop_cpus_count;
+
 #endif /* _ASM_X86_CPU_H */
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -759,6 +759,8 @@ bool xen_set_default_idle(void)
 }
 #endif
 
+atomic_t stop_cpus_count;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
 	local_irq_disable();
@@ -783,6 +785,14 @@ void __noreturn stop_this_cpu(void *dumm
 	 */
 	if (cpuid_eax(0x8000001f) & BIT(0))
 		native_wbinvd();
+
+	/*
+	 * native_stop_other_cpus() will write to @stop_cpus_count after
+	 * observing that it went down to zero, which will invalidate the
+	 * cacheline on this CPU.
+	 */
+	atomic_dec(&stop_cpus_count);
+
 	for (;;) {
 		/*
 		 * Use native_halt() so that memory contents don't change
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -171,6 +172,8 @@ static void native_stop_other_cpus(int w
 	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
 		return;
 
+	atomic_set(&stop_cpus_count, num_online_cpus() - 1);
+
 	/* sync above data before sending IRQ */
 	wmb();
 
@@ -183,12 +186,12 @@ static void native_stop_other_cpus(int w
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && timeout--)
+		while (atomic_read(&stop_cpus_count) > 0 && timeout--)
 			udelay(1);
 	}
 
 	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if (num_online_cpus() > 1) {
+	if (atomic_read(&stop_cpus_count) > 0) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
 		 * and send the IPI. In any case try to wait for the other
@@ -208,7 +211,7 @@ static void native_stop_other_cpus(int w
 		 * one or more CPUs do not reach shutdown state.
 		 */
 		timeout = USEC_PER_MSEC * 10;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
 			udelay(1);
 	}
 
@@ -216,6 +219,12 @@ static void native_stop_other_cpus(int w
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 	local_irq_restore(flags);
+
+	/*
+	 * Ensure that the cache line is invalidated on the other CPUs. See
+	 * comment vs. SME in stop_this_cpu().
+	 */
+	atomic_set(&stop_cpus_count, INT_MAX);
 }
 
 /*