[V2,4/8] x86/smp: Acquire stopping_cpu unconditionally

Message ID 20230613121615.820042015@linutronix.de
State New
Series x86/smp: Cure stop_other_cpus() and kexec() troubles

Commit Message

Thomas Gleixner June 13, 2023, 12:17 p.m. UTC
There is no reason to acquire the stopping_cpu atomic_t only when there is
more than one online CPU.

Make it unconditional to prepare for fixing the kexec() problem when there
are CPUs which are present but "offline", playing dead in mwait_play_dead().

They need to be brought out of mwait before kexec(), as kexec() can
overwrite text, pagetables, stacks and the monitored cacheline of the
original kernel. The latter causes mwait to resume execution, which
causes havoc in the kexec kernel and usually results in triple
faults.

Move the acquire out of the num_online_cpus() > 1 condition so the upcoming
'kick mwait' fixup is properly protected.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
---
 arch/x86/kernel/smp.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)
  

Comments

Peter Zijlstra June 15, 2023, 9:02 a.m. UTC | #1
On Tue, Jun 13, 2023 at 02:17:59PM +0200, Thomas Gleixner wrote:
> There is no reason to acquire the stopping_cpu atomic_t only when there is
> more than one online CPU.
> 
> Make it unconditional to prepare for fixing the kexec() problem when there
> are present but "offline" CPUs which play dead in mwait_play_dead().
> 
> They need to be brought out of mwait before kexec() as kexec() can
> overwrite text, pagetables, stacks and the monitored cacheline of the
> original kernel. The latter causes mwait to resume execution which
> obviously causes havoc on the kexec kernel which results usually in triple
> faults.
> 
> Move the acquire out of the num_online_cpus() > 1 condition so the upcoming
> 'kick mwait' fixup is properly protected.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  arch/x86/kernel/smp.c |   14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -153,6 +153,12 @@ static void native_stop_other_cpus(int w
>  	if (reboot_force)
>  		return;
>  
> +	/* Only proceed if this is the first CPU to reach this code */
> +	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
> +		return;
> +
> +	atomic_set(&stop_cpus_count, num_online_cpus() - 1);
> +

	if (({ int old = -1; !atomic_try_cmpxchg(&stopping_cpu, &old, safe_smp_processor_id()); }))
		return;

Doesn't really roll off the tongue, does it :/

Also, I don't think anybody cares about performance at this point, so
ignore I wrote this email.

/me presses send anyway.
  

Patch

--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -153,6 +153,12 @@  static void native_stop_other_cpus(int w
 	if (reboot_force)
 		return;
 
+	/* Only proceed if this is the first CPU to reach this code */
+	if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
+		return;
+
+	atomic_set(&stop_cpus_count, num_online_cpus() - 1);
+
 	/*
 	 * Use an own vector here because smp_call_function
 	 * does lots of things not suitable in a panic situation.
@@ -167,13 +173,7 @@  static void native_stop_other_cpus(int w
 	 * code.  By syncing, we give the cpus up to one second to
 	 * finish their work before we force them off with the NMI.
 	 */
-	if (num_online_cpus() > 1) {
-		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-			return;
-
-		atomic_set(&stop_cpus_count, num_online_cpus() - 1);
-
+	if (atomic_read(&stop_cpus_count) > 0) {
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
 
 		/*