[v3,5/7] x86/smp: Cure kexec() vs. mwait_play_dead() breakage

Message ID: 20230615193330.492257119@linutronix.de
State: New
Series: x86/smp: Cure stop_other_cpus() and kexec() troubles

Commit Message

Thomas Gleixner June 15, 2023, 8:33 p.m. UTC
  TLDR: It's a mess.

When kexec() is executed on a system with "offline" CPUs, which are parked
in mwait_play_dead() it can end up in a triple fault during the bootup of
the kexec kernel or cause hard to diagnose data corruption.

The reason is that kexec() eventually overwrites the previous kernels text,
page tables, data and stack, If it writes to the cache line which is
monitored by an previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.

Cure this by bringing the offline CPUs out of MWAIT into HLT.

Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offline CPUs to issue
HLT, which does not have the MWAIT problem.
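
For illustration, here is the handshake the patch below implements, condensed
into one place (a simplified sketch, not the literal code; the names match
mwait_play_dead() and smp_kick_mwait_play_dead() further down):

	/* "Offline" CPU, parked in mwait_play_dead(): */

	md->status  = CPUDEAD_MWAIT_WAIT;
	md->control = CPUDEAD_MWAIT_WAIT;

	while (1) {
		__monitor(md, 0, 0);	/* arm the monitor on this CPU's cache line */
		mb();
		__mwait(eax, 0);	/* eax = C-state hint computed earlier; any
					 * write to the monitored line resumes here */

		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
			while (1)
				native_halt();	/* HLT does not watch memory */
		}
		/* Otherwise: spurious wakeup, re-arm and sleep again */
	}

	/* kexec()ing CPU, before the old kernel's memory is overwritten: */

	WRITE_ONCE(md->control, CPUDEAD_MWAIT_KEXEC_HLT);  /* wakes the MWAIT above */
	/* ... then poll md->status for the acknowledgement, see the patch */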

That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
those make it come out of HLT.

A follow up change will put them into INIT, which protects at least against
NMI and SMI.

Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
 arch/x86/include/asm/smp.h |    2 +
 arch/x86/kernel/smp.c      |    5 +++
 arch/x86/kernel/smpboot.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)
  

Comments

Borislav Petkov June 20, 2023, 9:23 a.m. UTC | #1
On Thu, Jun 15, 2023 at 10:33:57PM +0200, Thomas Gleixner wrote:
> TLDR: It's a mess.

LOL. You don't need the rest of the commit message - this says it all.
:-P

> When kexec() is executed on a system with "offline" CPUs, which are parked

I'd say simply "with offlined CPUs"

> in mwait_play_dead() it can end up in a triple fault during the bootup of
> the kexec kernel or cause hard to diagnose data corruption.
> 
> The reason is that kexec() eventually overwrites the previous kernels text,

kernel's

> page tables, data and stack, If it writes to the cache line which is

... and stack. If it ...

> monitored by an previously offlined CPU, MWAIT resumes execution and ends

... by a previously...

> up executing the wrong text, dereferencing overwritten page tables or
> corrupting the kexec kernels data.

Lovely.

> Cure this by bringing the offline CPUs out of MWAIT into HLT.

... offlined CPUs... :

> Write to the monitored cache line of each offline CPU, which makes MWAIT

ditto.

> resume execution. The written control word tells the offline CPUs to issue

ditto and so on.

> HLT, which does not have the MWAIT problem.
> 
> That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
> those make it come out of HLT.
> 
> A follow up change will put them into INIT, which protects at least against
> NMI and SMI.

...

> @@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
>  			(highest_subcstate - 1);
>  	}
>  
> +	/* Set up state for the kexec() hack below */
> +	md->status = CPUDEAD_MWAIT_WAIT;
> +	md->control = CPUDEAD_MWAIT_WAIT;
> +
>  	wbinvd();
>  
>  	while (1) {
> @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)

JFYI: that last hunk has some conflicts applying to latest tip/master.
Might need merge resolving...

>  		mb();
>  		__mwait(eax, 0);
>  
> +		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> +			/*
> +			 * Kexec is about to happen. Don't go back into mwait() as
> +			 * the kexec kernel might overwrite text and data including
> +			 * page tables and stack. So mwait() would resume when the
> +			 * monitor cache line is written to and then the CPU goes
> +			 * south due to overwritten text, page tables and stack.
> +			 *
> +			 * Note: This does _NOT_ protect against a stray MCE, NMI,
> +			 * SMI. They will resume execution at the instruction
> +			 * following the HLT instruction and run into the problem
> +			 * which this is trying to prevent.
> +			 */
> +			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> +			while(1)
> +				native_halt();
> +		}
> +
>  		cond_wakeup_cpu0();
>  	}
>  }
>  
> +/*
> + * Kick all "offline" CPUs out of mwait on kexec(). See comment in
> + * mwait_play_dead().
> + */
> +void smp_kick_mwait_play_dead(void)
> +{
> +	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;

Do you even need this newstate thing?

> +	struct mwait_cpu_dead *md;
> +	unsigned int cpu, i;
> +
> +	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
> +		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
> +
> +		/* Does it sit in mwait_play_dead() ? */
> +		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
> +			continue;
> +
> +		/* Wait maximal 5ms */

		/* Wait 5ms maximum */

> +		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
> +			/* Bring it out of mwait */
> +			WRITE_ONCE(md->control, newstate);
> +			udelay(5);
> +		}
> +
> +		if (READ_ONCE(md->status) != newstate)
> +			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);

Shouldn't this be a pr_err_once thing so that it doesn't flood the
console unnecessarily?
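
For reference, the _once variant would be the minimal change (pr_err_once() is
the standard printk helper); the trade-off is that only the first stuck CPU
would ever be reported:

		if (READ_ONCE(md->status) != newstate)
			pr_err_once("CPU%u is stuck in mwait_play_dead()\n", cpu);

pr_err_ratelimited() would be the middle ground if per-CPU reports are wanted
without risking a console flood.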
  
Thomas Gleixner June 20, 2023, 12:25 p.m. UTC | #2
On Tue, Jun 20 2023 at 11:23, Borislav Petkov wrote:
> On Thu, Jun 15, 2023 at 10:33:57PM +0200, Thomas Gleixner wrote:
>> TLDR: It's a mess.
>>  	while (1) {
>> @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)
>
> JFYI: that last hunk has some conflicts applying to latest tip/master.
> Might need merge resolving...

Yes, I know.

>> +/*
>> + * Kick all "offline" CPUs out of mwait on kexec(). See comment in
>> + * mwait_play_dead().
>> + */
>> +void smp_kick_mwait_play_dead(void)
>> +{
>> +	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
>
> Do you even need this newstate thing?

Yes, for two reasons:

  1) To explicitly tell the other CPU to go into HLT. MWAIT can resume
     execution due to SMIs or NMIs, so we don't want to send it into HLT
     unconditionally. TLDR; .... :)

  2) To have the state feedback from the other CPU.
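
To make those two points concrete, here is a small user-space toy model of the
control/status handshake (purely illustrative, not kernel code; the thread
stands in for the parked CPU and the constants merely mirror the patch):

/* Build with: cc -pthread handshake-toy.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define WAIT		0xDEADBEEF	/* mirrors CPUDEAD_MWAIT_WAIT */
#define KEXEC_HLT	0x4A17DEAD	/* mirrors CPUDEAD_MWAIT_KEXEC_HLT */

static _Atomic unsigned int control = WAIT;
static _Atomic unsigned int ack     = WAIT;

static void *parked_cpu(void *arg)
{
	(void)arg;
	for (;;) {
		usleep(100);	/* stand-in for MWAIT; may wake spuriously */

		/* 1) only the explicit command sends us into "HLT" */
		if (atomic_load(&control) != KEXEC_HLT)
			continue;

		/* 2) acknowledge, so the other side gets state feedback */
		atomic_store(&ack, KEXEC_HLT);
		return NULL;	/* "halt forever" */
	}
}

int main(void)
{
	pthread_t t;
	int i;

	pthread_create(&t, NULL, parked_cpu, NULL);

	/* the "kick": tell the parked CPU to stop mwaiting and halt */
	atomic_store(&control, KEXEC_HLT);

	/* poll the acknowledgement with a timeout, like the 5ms loop */
	for (i = 0; atomic_load(&ack) != KEXEC_HLT && i < 1000; i++)
		usleep(5);

	if (atomic_load(&ack) != KEXEC_HLT)
		fprintf(stderr, "parked CPU did not acknowledge\n");
	else
		printf("parked CPU acknowledged, safe to proceed\n");

	pthread_join(t, NULL);
	return 0;
}

The load of control is what keeps a spurious wakeup from dropping into HLT
prematurely (point 1), and the store to ack is what the 5ms polling loop in
smp_kick_mwait_play_dead() relies on (point 2).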

>> +
>> +		if (READ_ONCE(md->status) != newstate)
>> +			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
>
> Shouldn't this be a pr_err_once thing so that it doesn't flood the
> console unnecessarily?

Yes, no, do not know :)
  

Patch

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -132,6 +132,8 @@  void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 void cond_wakeup_cpu0(void);
 
+void smp_kick_mwait_play_dead(void);
+
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -21,6 +21,7 @@ 
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/kexec.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -157,6 +158,10 @@  static void native_stop_other_cpus(int w
 	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
 		return;
 
+	/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+	if (kexec_in_progress)
+		smp_kick_mwait_play_dead();
+
 	/*
 	 * 1) Send an IPI on the reboot vector to all other CPUs.
 	 *
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,6 +53,7 @@ 
 #include <linux/tboot.h>
 #include <linux/gfp.h>
 #include <linux/cpuidle.h>
+#include <linux/kexec.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/overflow.h>
@@ -106,6 +107,9 @@  struct mwait_cpu_dead {
 	unsigned int	status;
 };
 
+#define CPUDEAD_MWAIT_WAIT	0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT	0x4A17DEAD
+
 /*
  * Cache line aligned data for mwait_play_dead(). Separate on purpose so
  * that it's unlikely to be touched by other CPUs.
@@ -173,6 +177,10 @@  static void smp_callin(void)
 {
 	int cpuid;
 
+	/* Mop up eventual mwait_play_dead() wreckage */
+	this_cpu_write(mwait_cpu_dead.status, 0);
+	this_cpu_write(mwait_cpu_dead.control, 0);
+
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
 	 * cpu_callout_mask guarantees we don't get here before
@@ -1807,6 +1815,10 @@  static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
+	/* Set up state for the kexec() hack below */
+	md->status = CPUDEAD_MWAIT_WAIT;
+	md->control = CPUDEAD_MWAIT_WAIT;
+
 	wbinvd();
 
 	while (1) {
@@ -1824,10 +1836,57 @@  static inline void mwait_play_dead(void)
 		mb();
 		__mwait(eax, 0);
 
+		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+			/*
+			 * Kexec is about to happen. Don't go back into mwait() as
+			 * the kexec kernel might overwrite text and data including
+			 * page tables and stack. So mwait() would resume when the
+			 * monitor cache line is written to and then the CPU goes
+			 * south due to overwritten text, page tables and stack.
+			 *
+			 * Note: This does _NOT_ protect against a stray MCE, NMI,
+			 * SMI. They will resume execution at the instruction
+			 * following the HLT instruction and run into the problem
+			 * which this is trying to prevent.
+			 */
+			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+			while(1)
+				native_halt();
+		}
+
 		cond_wakeup_cpu0();
 	}
 }
 
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
+{
+	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+	struct mwait_cpu_dead *md;
+	unsigned int cpu, i;
+
+	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+		/* Does it sit in mwait_play_dead() ? */
+		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+			continue;
+
+		/* Wait maximal 5ms */
+		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+			/* Bring it out of mwait */
+			WRITE_ONCE(md->control, newstate);
+			udelay(5);
+		}
+
+		if (READ_ONCE(md->status) != newstate)
+			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);
+	}
+}
+
 void __noreturn hlt_play_dead(void)
 {
 	if (__this_cpu_read(cpu_info.x86) >= 4)