[v3,08/36] x86/smpboot: Split up native_cpu_up() into separate phases and document them

Message ID 20230508185217.671595388@linutronix.de
State New
Series cpu/hotplug, x86: Reworked parallel CPU bringup

Commit Message

Thomas Gleixner May 8, 2023, 7:43 p.m. UTC
  From: David Woodhouse <dwmw@amazon.co.uk>

There are four logical parts to what native_cpu_up() does on the BSP (or
on the controlling CPU for a later hotplug):

 1) Wake the AP by sending the INIT/SIPI/SIPI sequence.

 2) Wait for the AP to make it as far as wait_for_master_cpu() which
    sets that CPU's bit in cpu_initialized_mask, then sets the bit in
    cpu_callout_mask to let the AP proceed through cpu_init().

 3) Wait for the AP to finish cpu_init() and get as far as the
    smp_callin() call, which sets that CPU's bit in cpu_callin_mask.

 4) Perform the TSC synchronization and wait for the AP to actually
    mark itself online in cpu_online_mask.

In preparation to allow these phases to operate in parallel on multiple
APs, split them out into separate functions and document the interactions
a little more clearly in both the BP and AP code paths.
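
Condensed, the resulting flow on the controlling CPU looks like this
(lifted from the patch below, with the phase numbers added as comments;
error handling otherwise unchanged):

int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
{
	int ret;

	ret = native_kick_ap(cpu, tidle);	/* 1) wake the AP (INIT/SIPI/SIPI) */
	if (ret)
		goto out;

	ret = wait_cpu_initialized(cpu);	/* 2) release AP via cpu_callout_mask */
	if (ret)
		goto out;

	wait_cpu_callin(cpu);			/* 3) AP reached smp_callin() */
	wait_cpu_online(cpu);			/* 4) TSC sync + cpu_online_mask */

out:
	/* Cleanup possible dangling ends... */
	if (x86_platform.legacy.warm_reset)
		smpboot_restore_warm_reset_vector();

	return ret;
}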

No functional change intended.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>


---
 arch/x86/kernel/smpboot.c |  187 +++++++++++++++++++++++++++++-----------------
 1 file changed, 121 insertions(+), 66 deletions(-)
---
  

Comments

Peter Zijlstra May 9, 2023, 10:04 a.m. UTC | #1
On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:

> @@ -233,14 +237,31 @@ static void notrace start_secondary(void
>  	load_cr3(swapper_pg_dir);
>  	__flush_tlb_all();
>  #endif
> +	/*
> +	 * Sync point with wait_cpu_initialized(). Before proceeding through
> +	 * cpu_init(), the AP will call wait_for_master_cpu() which sets its
> +	 * own bit in cpu_initialized_mask and then waits for the BSP to set
> +	 * its bit in cpu_callout_mask to release it.
> +	 */
>  	cpu_init_secondary();
>  	rcu_cpu_starting(raw_smp_processor_id());
>  	x86_cpuinit.early_percpu_clock_init();
> +
> +	/*
> +	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
> +	 * but just sets the bit to let the controlling CPU (BSP) know that
> +	 * it's got this far.
> +	 */
>  	smp_callin();
>  
> -	/* otherwise gcc will move up smp_processor_id before the cpu_init */
> +	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
>  	barrier();

Not to the detriment of this patch, but this barrier() and its comment
seem weird vs smp_callin(). That function ends with an atomic bitop (it
has to, at the very least it must not be weaker than store-release) but
also has an explicit wmb() to order setup vs CPU_STARTING.

(arguably that should be a full fence *AND* get a comment)

There is no way the smp_processor_id() referred to in this comment can
land before cpu_init() even without the barrier().

> -	/* Check TSC synchronization with the control CPU: */
> +
> +	/*
> +	 * Check TSC synchronization with the control CPU, which will do
> +	 * its part of this from wait_cpu_online(), making it an implicit
> +	 * synchronization point.
> +	 */
>  	check_tsc_sync_target();
>  
>  	/*
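
For reference, the tail of smp_callin() that Peter refers to ends roughly
like this (paraphrased from smpboot.c of that time; the barrier note in
the comment is added here for illustration and is not part of the patch):

	/*
	 * Allow the master to continue. This is the atomic bitop mentioned
	 * above: a LOCK-prefixed RMW, hence a full barrier on x86 and a
	 * compiler barrier in any case, which is why the barrier() after
	 * smp_callin() in start_secondary() adds nothing.
	 */
	cpumask_set_cpu(cpuid, cpu_callin_mask);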
  
Peter Zijlstra May 9, 2023, 10:19 a.m. UTC | #2
Again, not really this patch, but since I had to look at this code ....

On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> @@ -1048,60 +1066,89 @@ static int do_boot_cpu(int apicid, int c

	/*
	 * AP might wait on cpu_callout_mask in cpu_init() with
	 * cpu_initialized_mask set if previous attempt to online
	 * it timed-out. Clear cpu_initialized_mask so that after
	 * INIT/SIPI it could start with a clean state.
	 */
	cpumask_clear_cpu(cpu, cpu_initialized_mask);
	smp_mb();

^^^ that barrier is weird too, cpumask_clear_cpu() is an atomic op and
implies much the same (this is x86 after all). If you want to be super
explicit about it write:

	smp_mb__after_atomic();

(which is a no-op) but then it still very much requires a comment as to
what exactly it orders against what.


	/*
	 * Wake up a CPU in difference cases:
	 * - Use a method from the APIC driver if one defined, with wakeup
	 *   straight to 64-bit mode preferred over wakeup to RM.
	 * Otherwise,
>  	 * - Use an INIT boot APIC message
>  	 */
>  	if (apic->wakeup_secondary_cpu_64)
> +		return apic->wakeup_secondary_cpu_64(apicid, start_ip);
>  	else if (apic->wakeup_secondary_cpu)
> +		return apic->wakeup_secondary_cpu(apicid, start_ip);
>  
> +	return wakeup_secondary_cpu_via_init(apicid, start_ip);
> +}
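
For illustration only, the explicit form suggested above would read
roughly as follows in do_boot_cpu() (not part of the posted series, and
tglx notes below that cpu_initialized_mask disappears later in the
series anyway):

	cpumask_clear_cpu(cpu, cpu_initialized_mask);
	/*
	 * Assumed intent: make the cleared bit visible before INIT/SIPI
	 * wakes the AP. cpumask_clear_cpu() is a LOCK-prefixed RMW, so on
	 * x86 this compiles to nothing and merely documents the ordering.
	 */
	smp_mb__after_atomic();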
  
Peter Zijlstra May 9, 2023, 10:31 a.m. UTC | #3
And since I'm commenting on existing things anyway, let me continue...

On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:

> +static int wait_cpu_cpumask(unsigned int cpu, const struct cpumask *mask)
> +{
> +	unsigned long timeout;
>  
> +	/*
> +	 * Wait up to 10s for the CPU to report in.
> +	 */
> +	timeout = jiffies + 10*HZ;
> +	while (time_before(jiffies, timeout)) {
> +		if (cpumask_test_cpu(cpu, mask))
> +			return 0;
> +
> +		schedule();
>  	}
> +	return -1;
> +}

> +/*
> + * Bringup step three: Wait for the target AP to reach smp_callin().
> + * The AP is not waiting for us here so we don't need to parallelise
> + * this step. Not entirely clear why we care about this, since we just
> + * proceed directly to TSC synchronization which is the next sync
> + * point with the AP anyway.
> + */
> +static void wait_cpu_callin(unsigned int cpu)
> +{
> +	while (!cpumask_test_cpu(cpu, cpu_callin_mask))
> +		schedule();
> +}
> +
> +/*
> + * Bringup step four: Synchronize the TSC and wait for the target AP
> + * to reach set_cpu_online() in start_secondary().
> + */
> +static void wait_cpu_online(unsigned int cpu)
>  {
>  	unsigned long flags;
> +
> +	/*
> +	 * Check TSC synchronization with the AP (keep irqs disabled
> +	 * while doing so):
> +	 */
> +	local_irq_save(flags);
> +	check_tsc_sync_source(cpu);
> +	local_irq_restore(flags);
> +
> +	/*
> +	 * Wait for the AP to mark itself online, so the core caller
> +	 * can drop sparse_irq_lock.
> +	 */
> +	while (!cpu_online(cpu))
> +		schedule();
> +}

These schedule() loops make me itch... this is basically Ye Olde yield()
loop with all its known 'benefits'.

Now, I don't think it's horribly broken, we're explicitly waiting on
another CPU and can't have priority inversions, but yuck!

It could all be somewhat cleaned up with wait_var_event{_timeout}() and
wake_up_var(), but I'm really not sure that's worth it. But at least it
requires a comment to justify.
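
For illustration, the wait_var_event_timeout() shape mentioned above
could look roughly like this for wait_cpu_cpumask() (an untested sketch,
not part of the series; it would also require the AP side to issue a
matching wake_up_var() on the mask after setting its bit):

static int wait_cpu_cpumask(unsigned int cpu, const struct cpumask *mask)
{
	/* Sleep until the AP sets its bit in @mask, or give up after 10s. */
	if (!wait_var_event_timeout((void *)mask,
				    cpumask_test_cpu(cpu, mask), 10 * HZ))
		return -1;

	return 0;
}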
  
Thomas Gleixner May 9, 2023, 12:07 p.m. UTC | #4
On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> +	/*
>> +	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
>> +	 * but just sets the bit to let the controlling CPU (BSP) know that
>> +	 * it's got this far.
>> +	 */
>>  	smp_callin();
>>  
>> -	/* otherwise gcc will move up smp_processor_id before the cpu_init */
>> +	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
>>  	barrier();
>
> Not to the detriment of this patch, but this barrier() and its comment
> seem weird vs smp_callin(). That function ends with an atomic bitop (it
> has to, at the very least it must not be weaker than store-release) but
> also has an explicit wmb() to order setup vs CPU_STARTING.
>
> (arguably that should be a full fence *AND* get a comment)
>
> There is no way the smp_processor_id() referred to in this comment can
> land before cpu_init() even without the barrier().

Right. Let me clean that up.

Thanks,

        tglx
  
Thomas Gleixner May 9, 2023, 12:08 p.m. UTC | #5
On Tue, May 09 2023 at 12:19, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> @@ -1048,60 +1066,89 @@ static int do_boot_cpu(int apicid, int c
>
> 	/*
> 	 * AP might wait on cpu_callout_mask in cpu_init() with
> 	 * cpu_initialized_mask set if previous attempt to online
> 	 * it timed-out. Clear cpu_initialized_mask so that after
> 	 * INIT/SIPI it could start with a clean state.
> 	 */
> 	cpumask_clear_cpu(cpu, cpu_initialized_mask);
> 	smp_mb();
>
> ^^^ that barrier is weird too, cpumask_clear_cpu() is an atomic op and
> implies much the same (this is x86 after all). If you want to be super
> explicit about it write:
>
> 	smp_mb__after_atomic();
>
> (which is a no-op) but then it still very much requires a comment as to
> what exactly it orders against what.

As this is gone a few patches later, I'll just be lazy and leave it alone.
  
Thomas Gleixner May 9, 2023, 12:09 p.m. UTC | #6
On Tue, May 09 2023 at 12:31, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> +	/*
>> +	 * Wait for the AP to mark itself online, so the core caller
>> +	 * can drop sparse_irq_lock.
>> +	 */
>> +	while (!cpu_online(cpu))
>> +		schedule();
>> +}
>
> These schedule() loops make me itch... this is basically Ye Olde yield()
> loop with all its known 'benefits'.
>
> Now, I don't think it's horribly broken, we're explicitly waiting on
> another CPU and can't have priority inversions, but yuck!
>
> It could all be somewhat cleaned up with wait_var_event{_timeout}() and
> wake_up_var(), but I'm really not sure that's worth it. But at least it
> requires a comment to justify.

All of them are gone with the later patches, and the control CPU will
just go straight to waiting for the completion in the core code.
  
Thomas Gleixner May 9, 2023, 5:59 p.m. UTC | #7
On Tue, May 09 2023 at 14:07, Thomas Gleixner wrote:
> On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
>> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>>> +	/*
>>> +	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
>>> +	 * but just sets the bit to let the controlling CPU (BSP) know that
>>> +	 * it's got this far.
>>> +	 */
>>>  	smp_callin();
>>>  
>>> -	/* otherwise gcc will move up smp_processor_id before the cpu_init */
>>> +	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
>>>  	barrier();
>>
>> Not to the detriment of this patch, but this barrier() and its comment
>> seem weird vs smp_callin(). That function ends with an atomic bitop (it
>> has to, at the very least it must not be weaker than store-release) but
>> also has an explicit wmb() to order setup vs CPU_STARTING.
>>
>> (arguably that should be a full fence *AND* get a comment)
>>
>> There is no way the smp_processor_id() referred to in this comment can
>> land before cpu_init() even without the barrier().
>
> Right. Let me clean that up.

So I went and tried to figure out where this comes from. It's from
d8f19f2cac70 ("[PATCH] x86-64 merge") in the history tree. One of those
well-documented combo patches which change the world and some more. The
context back then was:

	/*
	 * Dont put anything before smp_callin(), SMP
	 * booting is too fragile that we want to limit the
	 * things done here to the most necessary things.
	 */
	cpu_init();
	smp_callin();

	Dprintk("cpu %d: waiting for commence\n", smp_processor_id()); 

That still does not explain what the barrier is doing. I tried to
harvest the mailing list archives, but did not find anything. The list
back then, discuss@x86-64.org, was never publicly archived... Boris gave
me a tarball, but this 'barrier()' add-on was nowhere discussed in public.

As the barrier has no obvious value, I'm adding a patch upfront which
removes it.

Thanks,

        tglx
  
Thomas Gleixner May 9, 2023, 6:03 p.m. UTC | #8
On Tue, May 09 2023 at 12:19, Peter Zijlstra wrote:
> Again, not really this patch, but since I had to look at this code ....
>
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
>> @@ -1048,60 +1066,89 @@ static int do_boot_cpu(int apicid, int c
>
> 	/*
> 	 * AP might wait on cpu_callout_mask in cpu_init() with
> 	 * cpu_initialized_mask set if previous attempt to online
> 	 * it timed-out. Clear cpu_initialized_mask so that after
> 	 * INIT/SIPI it could start with a clean state.
> 	 */
> 	cpumask_clear_cpu(cpu, cpu_initialized_mask);
> 	smp_mb();
>
> ^^^ that barrier is weird too, cpumask_clear_cpu() is an atomic op and
> implies much the same (this is x86 after all). If you want to be super
> explicit about it write:
>
> 	smp_mb__after_atomic();
>
> (which is a no-op) but then it still very much requires a comment as to
> what exactly it orders against what.

Won't bother either as that mask is gone a few patches later.
  
Thomas Gleixner May 9, 2023, 8:11 p.m. UTC | #9
On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
> On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> Not to the detriment of this patch, but this barrier() and its comment
> seem weird vs smp_callin(). That function ends with an atomic bitop (it
> has to, at the very least it must not be weaker than store-release) but
> also has an explicit wmb() to order setup vs CPU_STARTING.
>
> (arguably that should be a full fence *AND* get a comment)

TBH: I'm grasping for something 'arguable': What's the point of this
wmb() or even a mb()?

Most of the [w]mb()'s in smpboot.c except those in mwait_play_dead()
have a very distinct voodoo programming smell.

Thanks,

        tglx
  
Peter Zijlstra May 10, 2023, 8:39 a.m. UTC | #10
On Tue, May 09, 2023 at 10:11:05PM +0200, Thomas Gleixner wrote:
> On Tue, May 09 2023 at 12:04, Peter Zijlstra wrote:
> > On Mon, May 08, 2023 at 09:43:39PM +0200, Thomas Gleixner wrote:
> > Not to the detriment of this patch, but this barrier() and its comment
> > seem weird vs smp_callin(). That function ends with an atomic bitop (it
> > has to, at the very least it must not be weaker than store-release) but
> > also has an explicit wmb() to order setup vs CPU_STARTING.
> >
> > (arguably that should be a full fence *AND* get a comment)
> 
> TBH: I'm grasping for something 'arguable': What's the point of this
> wmb() or even a mb()?
> 
> Most of the [w]mb()'s in smpboot.c except those in mwait_play_dead()
> have a very distinct voodoo programming smell.

Oh, fully agreed; esp. without a comment these things are hugely suspect.
I could not immediately see a purpose either.

My 'arguably' argument was that IF it was needed at all, it would make
more sense to me to also constrain loads. But I'd be more than happy to
see the whole thing go. But perhaps not in this series?
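
For reference, the full-fence alternative alluded to here would simply
replace the wmb() ahead of notify_cpu_starting() in smp_callin(), roughly
(illustrative only, not proposed in this series):

	/*
	 * Order this CPU's setup (smp_store_cpu_info() etc.) against the
	 * CPU_STARTING notifiers; a full fence also constrains loads.
	 */
	mb();

	notify_cpu_starting(cpuid);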
  

Patch

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -193,6 +193,10 @@  static void smp_callin(void)
 
 	wmb();
 
+	/*
+	 * This runs the AP through all the cpuhp states to its target
+	 * state (CPUHP_ONLINE in the case of serial bringup).
+	 */
 	notify_cpu_starting(cpuid);
 
 	/*
@@ -233,14 +237,31 @@  static void notrace start_secondary(void
 	load_cr3(swapper_pg_dir);
 	__flush_tlb_all();
 #endif
+	/*
+	 * Sync point with wait_cpu_initialized(). Before proceeding through
+	 * cpu_init(), the AP will call wait_for_master_cpu() which sets its
+	 * own bit in cpu_initialized_mask and then waits for the BSP to set
+	 * its bit in cpu_callout_mask to release it.
+	 */
 	cpu_init_secondary();
 	rcu_cpu_starting(raw_smp_processor_id());
 	x86_cpuinit.early_percpu_clock_init();
+
+	/*
+	 * Sync point with wait_cpu_callin(). The AP doesn't wait here
+	 * but just sets the bit to let the controlling CPU (BSP) know that
+	 * it's got this far.
+	 */
 	smp_callin();
 
-	/* otherwise gcc will move up smp_processor_id before the cpu_init */
+	/* Otherwise gcc will move up smp_processor_id() before cpu_init() */
 	barrier();
-	/* Check TSC synchronization with the control CPU: */
+
+	/*
+	 * Check TSC synchronization with the control CPU, which will do
+	 * its part of this from wait_cpu_online(), making it an implicit
+	 * synchronization point.
+	 */
 	check_tsc_sync_target();
 
 	/*
@@ -259,6 +280,7 @@  static void notrace start_secondary(void
 	 * half valid vector space.
 	 */
 	lock_vector_lock();
+	/* Sync point with do_wait_cpu_online() */
 	set_cpu_online(smp_processor_id(), true);
 	lapic_online();
 	unlock_vector_lock();
@@ -981,17 +1003,13 @@  int common_cpu_up(unsigned int cpu, stru
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
  * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
- * Returns zero if CPU booted OK, else error code from
+ * Returns zero if startup was successfully sent, else error code from
  * ->wakeup_secondary_cpu.
  */
 static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
 {
-	/* start_ip had better be page-aligned! */
 	unsigned long start_ip = real_mode_header->trampoline_start;
 
-	unsigned long boot_error = 0;
-	unsigned long timeout;
-
 #ifdef CONFIG_X86_64
 	/* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
 	if (apic->wakeup_secondary_cpu_64)
@@ -1048,60 +1066,89 @@  static int do_boot_cpu(int apicid, int c
 	 * - Use an INIT boot APIC message
 	 */
 	if (apic->wakeup_secondary_cpu_64)
-		boot_error = apic->wakeup_secondary_cpu_64(apicid, start_ip);
+		return apic->wakeup_secondary_cpu_64(apicid, start_ip);
 	else if (apic->wakeup_secondary_cpu)
-		boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
-	else
-		boot_error = wakeup_secondary_cpu_via_init(apicid, start_ip);
+		return apic->wakeup_secondary_cpu(apicid, start_ip);
 
-	if (!boot_error) {
-		/*
-		 * Wait 10s total for first sign of life from AP
-		 */
-		boot_error = -1;
-		timeout = jiffies + 10*HZ;
-		while (time_before(jiffies, timeout)) {
-			if (cpumask_test_cpu(cpu, cpu_initialized_mask)) {
-				/*
-				 * Tell AP to proceed with initialization
-				 */
-				cpumask_set_cpu(cpu, cpu_callout_mask);
-				boot_error = 0;
-				break;
-			}
-			schedule();
-		}
-	}
+	return wakeup_secondary_cpu_via_init(apicid, start_ip);
+}
 
-	if (!boot_error) {
-		/*
-		 * Wait till AP completes initial initialization
-		 */
-		while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
-			/*
-			 * Allow other tasks to run while we wait for the
-			 * AP to come online. This also gives a chance
-			 * for the MTRR work(triggered by the AP coming online)
-			 * to be completed in the stop machine context.
-			 */
-			schedule();
-		}
-	}
+static int wait_cpu_cpumask(unsigned int cpu, const struct cpumask *mask)
+{
+	unsigned long timeout;
 
-	if (x86_platform.legacy.warm_reset) {
-		/*
-		 * Cleanup possible dangling ends...
-		 */
-		smpboot_restore_warm_reset_vector();
+	/*
+	 * Wait up to 10s for the CPU to report in.
+	 */
+	timeout = jiffies + 10*HZ;
+	while (time_before(jiffies, timeout)) {
+		if (cpumask_test_cpu(cpu, mask))
+			return 0;
+
+		schedule();
 	}
+	return -1;
+}
 
-	return boot_error;
+/*
+ * Bringup step two: Wait for the target AP to reach cpu_init_secondary()
+ * and thus wait_for_master_cpu(), then set cpu_callout_mask to allow it
+ * to proceed.  The AP will then proceed past setting its 'callin' bit
+ * and end up waiting in check_tsc_sync_target() until we reach
+ * do_wait_cpu_online() to tend to it.
+ */
+static int wait_cpu_initialized(unsigned int cpu)
+{
+	/*
+	 * Wait for first sign of life from AP.
+	 */
+	if (wait_cpu_cpumask(cpu, cpu_initialized_mask))
+		return -1;
+
+	cpumask_set_cpu(cpu, cpu_callout_mask);
+	return 0;
 }
 
-int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+/*
+ * Bringup step three: Wait for the target AP to reach smp_callin().
+ * The AP is not waiting for us here so we don't need to parallelise
+ * this step. Not entirely clear why we care about this, since we just
+ * proceed directly to TSC synchronization which is the next sync
+ * point with the AP anyway.
+ */
+static void wait_cpu_callin(unsigned int cpu)
+{
+	while (!cpumask_test_cpu(cpu, cpu_callin_mask))
+		schedule();
+}
+
+/*
+ * Bringup step four: Synchronize the TSC and wait for the target AP
+ * to reach set_cpu_online() in start_secondary().
+ */
+static void wait_cpu_online(unsigned int cpu)
 {
-	int apicid = apic->cpu_present_to_apicid(cpu);
 	unsigned long flags;
+
+	/*
+	 * Check TSC synchronization with the AP (keep irqs disabled
+	 * while doing so):
+	 */
+	local_irq_save(flags);
+	check_tsc_sync_source(cpu);
+	local_irq_restore(flags);
+
+	/*
+	 * Wait for the AP to mark itself online, so the core caller
+	 * can drop sparse_irq_lock.
+	 */
+	while (!cpu_online(cpu))
+		schedule();
+}
+
+static int native_kick_ap(unsigned int cpu, struct task_struct *tidle)
+{
+	int apicid = apic->cpu_present_to_apicid(cpu);
 	int err;
 
 	lockdep_assert_irqs_enabled();
@@ -1142,25 +1189,33 @@  int native_cpu_up(unsigned int cpu, stru
 		return err;
 
 	err = do_boot_cpu(apicid, cpu, tidle);
-	if (err) {
+	if (err)
 		pr_err("do_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu);
-		return err;
-	}
 
-	/*
-	 * Check TSC synchronization with the AP (keep irqs disabled
-	 * while doing so):
-	 */
-	local_irq_save(flags);
-	check_tsc_sync_source(cpu);
-	local_irq_restore(flags);
+	return err;
+}
 
-	while (!cpu_online(cpu)) {
-		cpu_relax();
-		touch_nmi_watchdog();
-	}
+int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+{
+	int ret;
 
-	return 0;
+	ret = native_kick_ap(cpu, tidle);
+	if (ret)
+		goto out;
+
+	ret = wait_cpu_initialized(cpu);
+	if (ret)
+		goto out;
+
+	wait_cpu_callin(cpu);
+	wait_cpu_online(cpu);
+
+out:
+	/* Cleanup possible dangling ends... */
+	if (x86_platform.legacy.warm_reset)
+		smpboot_restore_warm_reset_vector();
+
+	return ret;
 }
 
 /**