[v15,03/12] cpu/hotplug: Add dynamic parallel bringup states before CPUHP_BRINGUP_CPU

Message ID: 20230316222109.1940300-4-usama.arif@bytedance.com
State: New
Series: Parallel CPU bringup for x86_64

Commit Message

Usama Arif March 16, 2023, 10:21 p.m. UTC
  From: David Woodhouse <dwmw@amazon.co.uk>

There is often significant latency in the early stages of CPU bringup,
and time is wasted by waking each CPU (e.g. with INIT/SIPI/SIPI on x86)
and then waiting for it to respond before moving on to the next.

Allow a platform to register a set of pre-bringup CPUHP states to which
each CPU can be stepped in parallel, thus absorbing some of that latency.
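
As an illustration (not part of this patch), a platform could claim one
of the new states through the existing dynamic-state API. A minimal
sketch, assuming a hypothetical cpu_kick() callback that merely starts
waking the target CPU without waiting for it to respond:

#include <linux/cpuhotplug.h>
#include <linux/init.h>

/*
 * Hypothetical: kick 'cpu' awake (e.g. send INIT/SIPI/SIPI) and return
 * immediately instead of waiting for the AP.
 */
static int cpu_kick(unsigned int cpu)
{
	return 0;
}

static int __init register_parallel_kick(void)
{
	/*
	 * Reserves the first free slot in the new range
	 * CPUHP_BP_PARALLEL_DYN..CPUHP_BP_PARALLEL_DYN_END; a negative
	 * return means the range is exhausted.
	 */
	int ret = cpuhp_setup_state_nocalls(CPUHP_BP_PARALLEL_DYN,
					    "cpu:kick", cpu_kick, NULL);
	return ret < 0 ? ret : 0;
}
/* early_initcall runs before smp_init(), i.e. before bringup_nonboot_cpus(). */
early_initcall(register_parallel_kick);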

There is a subtlety here: even registering an empty CPUHP_BP_PARALLEL_DYN
step means that *all* CPUs are brought through the prepare states and up
to CPUHP_BP_PREPARE_DYN before any of them is taken to CPUHP_BRINGUP_CPU
and then allowed to run for itself to CPUHP_ONLINE.

So any combination of prepare/start calls which depends on A-B ordering
for each CPU in turn would explode horribly. The x2apic code is one
example: it used to allocate a cluster mask 'just in case', store it in
a global variable in the prep stage, then potentially consume that
preallocated structure from the AP and set the global pointer to NULL so
that it would be reallocated in CPUHP_X2APIC_PREPARE for the next CPU.
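
A condensed sketch of that pattern (simplified from the pre-fix x2apic
cluster code; the bodies are illustrative, not the real driver):

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/slab.h>

static struct cpumask *cluster_hotplug_mask;	/* BP-side spare */
static DEFINE_PER_CPU(struct cpumask *, cluster_masks);

/* Prepare step, runs on the control CPU before the AP is woken. */
static int x2apic_prepare_cpu(unsigned int cpu)
{
	/* With serialized bringup, at most one spare is ever outstanding. */
	if (!cluster_hotplug_mask) {
		cluster_hotplug_mask = kzalloc(cpumask_size(), GFP_KERNEL);
		if (!cluster_hotplug_mask)
			return -ENOMEM;
	}
	return 0;
}

/* Startup step, runs on the AP itself. */
static void init_x2apic_ldr(void)
{
	/*
	 * Consume the spare and clear the global so the *next* CPU's
	 * prepare call reallocates it. If all prepare calls have already
	 * run back-to-back (parallel pre-bringup), only one spare was
	 * allocated and every AP after the first sees NULL here.
	 */
	this_cpu_write(cluster_masks, cluster_hotplug_mask);
	cluster_hotplug_mask = NULL;
}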

Any platform enabling the CPUHP_BP_PARALLEL_DYN steps must be reviewed
and tested to ensure that such issues do not exist. Unless such a state
is registered, the existing behaviour does not change: each CPU is
brought to CPUHP_BP_PREPARE_DYN and then immediately through
CPUHP_BRINGUP_CPU to CPUHP_ONLINE, one at a time.

Note that the new parallel stages do *not* yet bring each AP to the
CPUHP_BRINGUP_CPU state at the same time, only to the new states which
exist before it. The final loop in bringup_nonboot_cpus() is untouched,
bringing each AP in turn from the final PARALLEL_DYN state (or all the
way from CPUHP_OFFLINE) to CPUHP_BRINGUP_CPU and then waiting for that
AP to do its own processing and reach CPUHP_ONLINE before releasing the
next.

Parallelising that part by bringing them all to CPUHP_BRINGUP_CPU
and then waiting for them all is an exercise for the future.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---
 include/linux/cpuhotplug.h |  2 ++
 kernel/cpu.c               | 31 +++++++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)
  

Comments

Thomas Gleixner March 20, 2023, 2:30 p.m. UTC | #1
On Thu, Mar 16 2023 at 22:21, Usama Arif wrote:
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 6b3dccb4a888..6ccc64defd47 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1497,8 +1497,30 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu)
>  
>  void bringup_nonboot_cpus(unsigned int setup_max_cpus)
>  {
> +	unsigned int n = setup_max_cpus - num_online_cpus();
>  	unsigned int cpu;
>  
> +	/*
> +	 * An architecture may have registered parallel pre-bringup states to
> +	 * which each CPU may be brought in parallel. For each such state,
> +	 * bring N CPUs to it in turn before the final round of bringing them
> +	 * online.
> +	 */
> +	if (n > 0) {
> +		enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
> +
> +		while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) {
> +			int i = n;
> +
> +			for_each_present_cpu(cpu) {
> +				cpu_up(cpu, st);
> +				if (!--i)
> +					break;
> +			}
> +			st++;
> +		}
> +	}
> +
>  	for_each_present_cpu(cpu) {
>  		if (num_online_cpus() >= setup_max_cpus)
>  			break;

This causes a subtle issue. The bringup loop above moves all CPUs to
cpuhp_state == CPUHP_BP_PARALLEL_DYN_END. So the serial bootup will
start from there and bring them fully up.

Now if a bringup fails, then the rollback will only go back down to
CPUHP_BP_PARALLEL_DYN_END, which means that the control CPU won't do any
cleanups below CPUHP_BP_PARALLEL_DYN_END.

That 'fail' is a common case for SMT soft disable via the 'nosmt'
command line parameter. Due to the marvelous MCE broadcast 'feature',
we need to bring up the SMT siblings at least to the
CPUHP_AP_ONLINE_IDLE state once and then roll them back.

While this is not necessarily a fatal problem, it changes behaviour,
and with quite a few of the details hidden in the (then not issued)
teardown callbacks it might cause some hard-to-decode subtle surprises.

So that second for_each_present_cpu() loop needs to check the return
value of cpu_up() and issue a full rollback to CPUHP_OFFLINE in case of
fail.

Thanks,

        tglx
  
David Woodhouse March 21, 2023, 7:14 p.m. UTC | #2
On Mon, 2023-03-20 at 15:30 +0100, Thomas Gleixner wrote:
> 
> This causes a subtle issue. The bringup loop above moves all CPUs to
> cpuhp_state == CPUHP_BP_PARALLEL_DYN_END. So the serial bootup will
> start from there and bring them fully up.
> 
> Now if a bringup fails, then the rollback will only go back down to
> CPUHP_BP_PARALLEL_DYN_END, which means that the control CPU won't do any
> cleanups below CPUHP_BP_PARALLEL_DYN_END.
> 
> That 'fail' is a common case for SMT soft disable via the 'nosmt'
> command line parameter. Due to the marvelous MCE broadcast 'feature',
> we need to bring up the SMT siblings at least to the
> CPUHP_AP_ONLINE_IDLE state once and then roll them back.
> 
> While this is not necessarily a fatal problem, it changes behaviour,
> and with quite a few of the details hidden in the (then not issued)
> teardown callbacks it might cause some hard-to-decode subtle surprises.
> 
> So that second for_each_present_cpu() loop needs to check the return
> value of cpu_up() and issue a full rollback to CPUHP_OFFLINE in case of
> fail.

@@ -1524,8 +1531,22 @@ void bringup_nonboot_cpus(unsigned int setup_max_cpus)
        for_each_present_cpu(cpu) {
                if (num_online_cpus() >= setup_max_cpus)
                        break;
-               if (!cpu_online(cpu))
-                       cpu_up(cpu, CPUHP_ONLINE);
+               if (!cpu_online(cpu)) {
+                       int ret = cpu_up(cpu, CPUHP_ONLINE);
+
+                       /*
+                        * For the parallel bringup case, roll all the way back
+                        * to CPUHP_OFFLINE on failure; don't leave them in the
+                        * parallel stages. This happens in the nosmt case for
+                        * non-primary threads.
+                        */
+                       if (ret && cpuhp_hp_states[CPUHP_BP_PARALLEL_DYN].name) {
+                               struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
+                               if (can_rollback_cpu(st))
+                                       WARN_ON(cpuhp_invoke_callback_range(false, cpu, st,
+                                                                           CPUHP_OFFLINE));
+                       }
+               }
        }
 }
  

Patch

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6c6859bfc454..e5a73ae6ccc0 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -133,6 +133,8 @@ enum cpuhp_state {
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
+	CPUHP_BP_PARALLEL_DYN,
+	CPUHP_BP_PARALLEL_DYN_END		= CPUHP_BP_PARALLEL_DYN + 4,
 	CPUHP_BRINGUP_CPU,
 
 	/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6b3dccb4a888..6ccc64defd47 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1497,8 +1497,30 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu)
 
 void bringup_nonboot_cpus(unsigned int setup_max_cpus)
 {
+	unsigned int n = setup_max_cpus - num_online_cpus();
 	unsigned int cpu;
 
+	/*
+	 * An architecture may have registered parallel pre-bringup states to
+	 * which each CPU may be brought in parallel. For each such state,
+	 * bring N CPUs to it in turn before the final round of bringing them
+	 * online.
+	 */
+	if (n > 0) {
+		enum cpuhp_state st = CPUHP_BP_PARALLEL_DYN;
+
+		while (st <= CPUHP_BP_PARALLEL_DYN_END && cpuhp_hp_states[st].name) {
+			int i = n;
+
+			for_each_present_cpu(cpu) {
+				cpu_up(cpu, st);
+				if (!--i)
+					break;
+			}
+			st++;
+		}
+	}
+
 	for_each_present_cpu(cpu) {
 		if (num_online_cpus() >= setup_max_cpus)
 			break;
@@ -1875,6 +1897,10 @@ static int cpuhp_reserve_state(enum cpuhp_state state)
 		step = cpuhp_hp_states + CPUHP_BP_PREPARE_DYN;
 		end = CPUHP_BP_PREPARE_DYN_END;
 		break;
+	case CPUHP_BP_PARALLEL_DYN:
+		step = cpuhp_hp_states + CPUHP_BP_PARALLEL_DYN;
+		end = CPUHP_BP_PARALLEL_DYN_END;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1899,14 +1925,15 @@ static int cpuhp_store_callbacks(enum cpuhp_state state, const char *name,
 	/*
 	 * If name is NULL, then the state gets removed.
 	 *
-	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_PREPARE_DYN are handed out on
+	 * CPUHP_AP_ONLINE_DYN and CPUHP_BP_P*_DYN are handed out on
 	 * the first allocation from these dynamic ranges, so the removal
 	 * would trigger a new allocation and clear the wrong (already
 	 * empty) state, leaving the callbacks of the to be cleared state
 	 * dangling, which causes wreckage on the next hotplug operation.
 	 */
 	if (name && (state == CPUHP_AP_ONLINE_DYN ||
-		     state == CPUHP_BP_PREPARE_DYN)) {
+		     state == CPUHP_BP_PREPARE_DYN ||
+		     state == CPUHP_BP_PARALLEL_DYN)) {
 		ret = cpuhp_reserve_state(state);
 		if (ret < 0)
 			return ret;
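
For context, the case added to cpuhp_reserve_state() above feeds an
existing first-fit scan; abridged from kernel/cpu.c with the new case in
place (with CPUHP_BP_PARALLEL_DYN_END = CPUHP_BP_PARALLEL_DYN + 4, the
range holds at most five parallel pre-bringup states):

static int cpuhp_reserve_state(enum cpuhp_state state)
{
	enum cpuhp_state i, end;
	struct cpuhp_step *step;

	switch (state) {
	case CPUHP_AP_ONLINE_DYN:
		step = cpuhp_hp_states + CPUHP_AP_ONLINE_DYN;
		end = CPUHP_AP_ONLINE_DYN_END;
		break;
	case CPUHP_BP_PREPARE_DYN:
		step = cpuhp_hp_states + CPUHP_BP_PREPARE_DYN;
		end = CPUHP_BP_PREPARE_DYN_END;
		break;
	case CPUHP_BP_PARALLEL_DYN:		/* added by this patch */
		step = cpuhp_hp_states + CPUHP_BP_PARALLEL_DYN;
		end = CPUHP_BP_PARALLEL_DYN_END;
		break;
	default:
		return -EINVAL;
	}

	/* Hand out the first slot in the range that has no name yet. */
	for (i = state; i <= end; i++, step++) {
		if (!step->name)
			return i;
	}
	WARN(1, "No more dynamic states available for CPU hotplug\n");
	return -ENOSPC;
}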