[v7,0/9] Parallel CPU bringup for x86_64

Message ID 20230207230436.2690891-1-usama.arif@bytedance.com
Headers
Series Parallel CPU bringup for x86_64 |

Message

Usama Arif Feb. 7, 2023, 11:04 p.m. UTC
  Tested on v7, doing INIT/SIPI/SIPI in parallel brings down the time for
smpboot from ~700ms to 100ms (85% improvement) on a server with 128 CPUs
split across 2 NUMA nodes.

The major change over v6 is keeping parallel smp support enabled in AMD.
APIC ID for parallel CPU bringup is now obtained from CPUID leaf 0x0B
(for x2APIC mode) otherwise CPUID leaf 0x1 (8 bits).

The patch for reusing timer calibration for secondary CPUs is also removed
from the series as its not part of parallel smp bringup and needs to be
further thought about.

Thanks,
Usama

Changes across versions:
v2: Cut it back to just INIT/SIPI/SIPI in parallel for now, nothing more
v3: Clean up x2apic patch, add MTRR optimisation, lock topology update
    in preparation for more parallelisation.
v4: Fixes to the real mode parallelisation patch spotted by SeanC, to
    avoid scribbling on initial_gs in common_cpu_up(), and to allow all
    24 bits of the physical X2APIC ID to be used. That patch still needs
    a Signed-off-by from its original author, who once claimed not to
    remember writing it at all. But now we've fixed it, hopefully he'll
    admit it now :)
v5: rebase to v6.1 and remeasure performance, disable parallel bringup
    for AMD CPUs.
v6: rebase to v6.2-rc6, disabled parallel boot on amd as a cpu bug and
    reused timer calibration for secondary CPUs.
v7: [David Woodhouse] iterate over all possible CPUs to find any existing
    cluster mask in alloc_clustermask. (patch 1/9)
    Keep parallel AMD support enabled in AMD, using APIC ID in CPUID leaf
    0x0B (for x2APIC mode) or CPUID leaf 0x01 where 8 bits are sufficient.
    Included sanity checks for APIC id from 0x0B. (patch 6/9)
    Removed patch for reusing timer calibration for secondary CPUs.
    commit message and code improvements.

David Woodhouse (8):
  x86/apic/x2apic: Allow CPU cluster_mask to be populated in parallel
  cpu/hotplug: Move idle_thread_get() to <linux/smpboot.h>
  cpu/hotplug: Add dynamic parallel bringup states before
    CPUHP_BRINGUP_CPU
  x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
  x86/smpboot: Split up native_cpu_up into separate phases and document
    them
  x86/smpboot: Send INIT/SIPI/SIPI to secondary CPUs in parallel
  x86/mtrr: Avoid repeated save of MTRRs on boot-time CPU bringup
  x86/smpboot: Serialize topology updates for secondary bringup

Thomas Gleixner (1):
  x86/smpboot: Support parallel startup of secondary CPUs

 arch/x86/include/asm/realmode.h       |   3 +
 arch/x86/include/asm/smp.h            |  14 +-
 arch/x86/include/asm/topology.h       |   2 -
 arch/x86/kernel/acpi/sleep.c          |   1 +
 arch/x86/kernel/apic/apic.c           |   2 +-
 arch/x86/kernel/apic/x2apic_cluster.c | 130 ++++++----
 arch/x86/kernel/cpu/common.c          |   6 +-
 arch/x86/kernel/cpu/mtrr/mtrr.c       |   9 +
 arch/x86/kernel/head_64.S             |  84 +++++++
 arch/x86/kernel/smpboot.c             | 349 +++++++++++++++++++-------
 arch/x86/realmode/init.c              |   3 +
 arch/x86/realmode/rm/trampoline_64.S  |  14 ++
 arch/x86/xen/smp_pv.c                 |   4 +-
 include/linux/cpuhotplug.h            |   2 +
 include/linux/smpboot.h               |   7 +
 kernel/cpu.c                          |  31 ++-
 kernel/smpboot.c                      |   2 +-
 kernel/smpboot.h                      |   2 -
 18 files changed, 506 insertions(+), 159 deletions(-)
  

Comments

Paul E. McKenney Feb. 9, 2023, 3:53 a.m. UTC | #1
On Tue, Feb 07, 2023 at 11:04:27PM +0000, Usama Arif wrote:
> Tested on v7, doing INIT/SIPI/SIPI in parallel brings down the time for
> smpboot from ~700ms to 100ms (85% improvement) on a server with 128 CPUs
> split across 2 NUMA nodes.
> 
> The major change over v6 is keeping parallel smp support enabled in AMD.
> APIC ID for parallel CPU bringup is now obtained from CPUID leaf 0x0B
> (for x2APIC mode) otherwise CPUID leaf 0x1 (8 bits).
> 
> The patch for reusing timer calibration for secondary CPUs is also removed
> from the series as its not part of parallel smp bringup and needs to be
> further thought about.

Running rcutorture on this got me the following NULL pointer dereference
on scenario TREE01:

------------------------------------------------------------------------

[   34.662066] smpboot: CPU 0 is now offline
[   34.674075] rcu: NOCB: Cannot CB-offload offline CPU 25
[   35.038003] rcu: De-offloading 5
[   35.112997] rcu: Offloading 12
[   35.716011] smpboot: Booting Node 0 Processor 0 APIC 0x0
[   35.762685] BUG: kernel NULL pointer dereference, address: 0000000000000001
[   35.764278] #PF: supervisor instruction fetch in kernel mode
[   35.765530] #PF: error_code(0x0010) - not-present page
[   35.766700] PGD 0 P4D 0
[   35.767278] Oops: 0010 [#1] PREEMPT SMP PTI
[   35.768223] CPU: 36 PID: 0 Comm: swapper/36 Not tainted 6.2.0-rc1-00206-g18a37610b632-dirty #3563
[   35.770201] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

------------------------------------------------------------------------

Given an x86 system with KVM and qemu, this can be reproduced by running
the following from the top-level directory in the Linux-kernel source
tree:

	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --configs "TREE01 TINY01" --trust-make

Out of 15 runs, 14 blew up just after the first attempt to bring CPU
0 back online.  The 15th run blew up just after the second attempt to
bring CPU 0 online, the first attempt having succeeded.

My guess is that the CONFIG_BOOTPARAM_HOTPLUG_CPU0=y Kconfig option is
tickling this bug.  This Kconfig option has been added to the TREE01
scenario in the -rcu tree's "dev" branch, which might mean that this test
would pass on mainline.  But CONFIG_BOOTPARAM_HOTPLUG_CPU0=y is not new,
only rcutorture's testing of it.

Thoughts?

							Thanx, Paul
  
Usama Arif Feb. 9, 2023, 9:49 a.m. UTC | #2
On 09/02/2023 03:53, Paul E. McKenney wrote:
> On Tue, Feb 07, 2023 at 11:04:27PM +0000, Usama Arif wrote:
>> Tested on v7, doing INIT/SIPI/SIPI in parallel brings down the time for
>> smpboot from ~700ms to 100ms (85% improvement) on a server with 128 CPUs
>> split across 2 NUMA nodes.
>>
>> The major change over v6 is keeping parallel smp support enabled in AMD.
>> APIC ID for parallel CPU bringup is now obtained from CPUID leaf 0x0B
>> (for x2APIC mode) otherwise CPUID leaf 0x1 (8 bits).
>>
>> The patch for reusing timer calibration for secondary CPUs is also removed
>> from the series as its not part of parallel smp bringup and needs to be
>> further thought about.
> 
> Running rcutorture on this got me the following NULL pointer dereference
> on scenario TREE01:
> 
> ------------------------------------------------------------------------
> 
> [   34.662066] smpboot: CPU 0 is now offline
> [   34.674075] rcu: NOCB: Cannot CB-offload offline CPU 25
> [   35.038003] rcu: De-offloading 5
> [   35.112997] rcu: Offloading 12
> [   35.716011] smpboot: Booting Node 0 Processor 0 APIC 0x0
> [   35.762685] BUG: kernel NULL pointer dereference, address: 0000000000000001
> [   35.764278] #PF: supervisor instruction fetch in kernel mode
> [   35.765530] #PF: error_code(0x0010) - not-present page
> [   35.766700] PGD 0 P4D 0
> [   35.767278] Oops: 0010 [#1] PREEMPT SMP PTI
> [   35.768223] CPU: 36 PID: 0 Comm: swapper/36 Not tainted 6.2.0-rc1-00206-g18a37610b632-dirty #3563
> [   35.770201] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> 
> ------------------------------------------------------------------------
> 
> Given an x86 system with KVM and qemu, this can be reproduced by running
> the following from the top-level directory in the Linux-kernel source
> tree:
> 
> 	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --configs "TREE01 TINY01" --trust-make
> 
> Out of 15 runs, 14 blew up just after the first attempt to bring CPU
> 0 back online.  The 15th run blew up just after the second attempt to
> bring CPU 0 online, the first attempt having succeeded.
> 
> My guess is that the CONFIG_BOOTPARAM_HOTPLUG_CPU0=y Kconfig option is
> tickling this bug.  This Kconfig option has been added to the TREE01
> scenario in the -rcu tree's "dev" branch, which might mean that this test
> would pass on mainline.  But CONFIG_BOOTPARAM_HOTPLUG_CPU0=y is not new,
> only rcutorture's testing of it.
> 
> Thoughts?
> 
> 							Thanx, Paul

It looks like its because of the initial_gs, initial_stack and
early_gdt_descr not being setup properly for CPU0 hotplug, i.e. 
init_cpu_data isnt called in cpu0 hotplug case.

Its easy to test, just by doing
echo 0 > /sys/devices/system/cpu/cpu0/online;
echo 1 > /sys/devices/system/cpu/cpu0/online;

As a quick check, if we do something like below (probably there is a 
much better place to set these..), the above hotplug commands will work.

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 3ec5182d9698..184135c47ee5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1080,6 +1080,7 @@ wakeup_cpu_via_init_nmi(int cpu, unsigned long 
start_ip, int apicid,
                                           wakeup_cpu0_nmi, 0, "wake_cpu0");

         if (!boot_error) {
+               initial_gs = per_cpu_offset(cpu);
                 enable_start_cpu0 = 1;
                 *cpu0_nmi_registered = 1;
                 id = apic->dest_mode_logical ? cpu0_logical_apicid : 
apicid;
@@ -1188,10 +1189,14 @@ static int do_boot_cpu(int apicid, int cpu, 
struct task_struct *idle,
                 boot_error = apic->wakeup_secondary_cpu_64(apicid, 
start_ip);
         else if (apic->wakeup_secondary_cpu)
                 boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
-       else
+       else {
+               if(!cpu) {
+                       early_gdt_descr.address = (unsigned 
long)get_cpu_gdt_rw(cpu);
+                       initial_stack  = idle->thread.sp;
+               }
                 boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
                                                      cpu0_nmi_registered);
-
+       }
         return boot_error;
  }
  
David Woodhouse Feb. 9, 2023, 10:06 a.m. UTC | #3
On Thu, 2023-02-09 at 09:49 +0000, Usama Arif wrote:
> 
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1080,6 +1080,7 @@ wakeup_cpu_via_init_nmi(int cpu, unsigned long 
> start_ip, int apicid,
>                                            wakeup_cpu0_nmi, 0, "wake_cpu0");
> 
>          if (!boot_error) {
> +               initial_gs = per_cpu_offset(cpu);

That's for 64-bit.

>                  enable_start_cpu0 = 1;
>                  *cpu0_nmi_registered = 1;
>                  id = apic->dest_mode_logical ? cpu0_logical_apicid :
> apicid;
> @@ -1188,10 +1189,14 @@ static int do_boot_cpu(int apicid, int cpu, 
> struct task_struct *idle,
>                  boot_error = apic->wakeup_secondary_cpu_64(apicid, 
> start_ip);
>          else if (apic->wakeup_secondary_cpu)
>                  boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
> -       else
> +       else {
> +               if(!cpu) {
> +                       early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
> +                       initial_stack  = idle->thread.sp;
> +               }

And that's 32-bit.

These were previously done in common_cpu_up() or do_boot_cpu(), which
means they were done not only for the wakeup_cpu_via_init_nmi() code
path, but also when we call apic->wakeup_secondary_cpu() (or _64()) for
the esoteric UV/NumaConnect machines.

Thanks for diagnosing it so quickly; I'll work up a subtly different
fix.

>                  boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
>                                                       cpu0_nmi_registered);
> -
> +       }
>          return boot_error;
>   }
>
  
Usama Arif Feb. 9, 2023, 10:19 a.m. UTC | #4
On 09/02/2023 10:06, David Woodhouse wrote:
> On Thu, 2023-02-09 at 09:49 +0000, Usama Arif wrote:
>>
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -1080,6 +1080,7 @@ wakeup_cpu_via_init_nmi(int cpu, unsigned long
>> start_ip, int apicid,
>>                                             wakeup_cpu0_nmi, 0, "wake_cpu0");
>>
>>           if (!boot_error) {
>> +               initial_gs = per_cpu_offset(cpu);
> 
> That's for 64-bit.
> 
>>                   enable_start_cpu0 = 1;
>>                   *cpu0_nmi_registered = 1;
>>                   id = apic->dest_mode_logical ? cpu0_logical_apicid :
>> apicid;
>> @@ -1188,10 +1189,14 @@ static int do_boot_cpu(int apicid, int cpu,
>> struct task_struct *idle,
>>                   boot_error = apic->wakeup_secondary_cpu_64(apicid,
>> start_ip);
>>           else if (apic->wakeup_secondary_cpu)
>>                   boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
>> -       else
>> +       else {
>> +               if(!cpu) {
>> +                       early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
>> +                       initial_stack  = idle->thread.sp;
>> +               }
> 
> And that's 32-bit.
> 
> These were previously done in common_cpu_up() or do_boot_cpu(), which
> means they were done not only for the wakeup_cpu_via_init_nmi() code
> path, but also when we call apic->wakeup_secondary_cpu() (or _64()) for
> the esoteric UV/NumaConnect machines.
> 
> Thanks for diagnosing it so quickly; I'll work up a subtly different
> fix.
> 

Yeah, was just a hacky fix while I was trying to figure out the issue.

do_boot_cpu is only called in cpu0 in hotplug case, so I think this a 
much better diff:

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 3ec5182d9698..f7ae82969ee4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1136,13 +1136,16 @@ static int do_boot_cpu(int apicid, int cpu, 
struct task_struct *idle,
         idle->thread.sp = (unsigned long)task_pt_regs(idle);
         initial_code = (unsigned long)start_secondary;

-       if (IS_ENABLED(CONFIG_X86_32)) {
+       if (IS_ENABLED(CONFIG_X86_32) || !cpu) {
                 early_gdt_descr.address = (unsigned 
long)get_cpu_gdt_rw(cpu);
                 initial_stack  = idle->thread.sp;
         } else if (!do_parallel_bringup) {
                 smpboot_control = STARTUP_SECONDARY | apicid;
         }

+       if (!cpu)
+               initial_gs = per_cpu_offset(cpu);
+
         /* Enable the espfix hack for this CPU */
         init_espfix_ap(cpu);



>>                   boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
>>                                                        cpu0_nmi_registered);
>> -
>> +       }
>>           return boot_error;
>>    }
>>
>
  
David Woodhouse Feb. 9, 2023, 10:22 a.m. UTC | #5
On Thu, 2023-02-09 at 10:19 +0000, Usama Arif wrote:
> 
> 
> On 09/02/2023 10:06, David Woodhouse wrote:
> > On Thu, 2023-02-09 at 09:49 +0000, Usama Arif wrote:
> > > 
> > > --- a/arch/x86/kernel/smpboot.c
> > > +++ b/arch/x86/kernel/smpboot.c
> > > @@ -1080,6 +1080,7 @@ wakeup_cpu_via_init_nmi(int cpu, unsigned long
> > > start_ip, int apicid,
> > >                                             wakeup_cpu0_nmi, 0, "wake_cpu0");
> > > 
> > >           if (!boot_error) {
> > > +               initial_gs = per_cpu_offset(cpu);
> > 
> > That's for 64-bit.
> > 
> > >                   enable_start_cpu0 = 1;
> > >                   *cpu0_nmi_registered = 1;
> > >                   id = apic->dest_mode_logical ? cpu0_logical_apicid :
> > > apicid;
> > > @@ -1188,10 +1189,14 @@ static int do_boot_cpu(int apicid, int cpu,
> > > struct task_struct *idle,
> > >                   boot_error = apic->wakeup_secondary_cpu_64(apicid,
> > > start_ip);
> > >           else if (apic->wakeup_secondary_cpu)
> > >                   boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
> > > -       else
> > > +       else {
> > > +               if(!cpu) {
> > > +                       early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
> > > +                       initial_stack  = idle->thread.sp;
> > > +               }
> > 
> > And that's 32-bit.
> > 
> > These were previously done in common_cpu_up() or do_boot_cpu(), which
> > means they were done not only for the wakeup_cpu_via_init_nmi() code
> > path, but also when we call apic->wakeup_secondary_cpu() (or _64()) for
> > the esoteric UV/NumaConnect machines.
> > 
> > Thanks for diagnosing it so quickly; I'll work up a subtly different
> > fix.
> > 
> 
> Yeah, was just a hacky fix while I was trying to figure out the issue.
> 
> do_boot_cpu is only called in cpu0 in hotplug case, so I think this a
> much better diff:
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 3ec5182d9698..f7ae82969ee4 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1136,13 +1136,16 @@ static int do_boot_cpu(int apicid, int cpu, 
> struct task_struct *idle,
>          idle->thread.sp = (unsigned long)task_pt_regs(idle);
>          initial_code = (unsigned long)start_secondary;
> 
> -       if (IS_ENABLED(CONFIG_X86_32)) {
> +       if (IS_ENABLED(CONFIG_X86_32) || !cpu) {
>                  early_gdt_descr.address = (unsigned 
> long)get_cpu_gdt_rw(cpu);
>                  initial_stack  = idle->thread.sp;
>          } else if (!do_parallel_bringup) {
>                  smpboot_control = STARTUP_SECONDARY | apicid;
>          }
> 
> +       if (!cpu)
> +               initial_gs = per_cpu_offset(cpu);
> +
>          /* Enable the espfix hack for this CPU */
>          init_espfix_ap(cpu);
> 
> 
> 
> > >                   boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
> > >                                                        cpu0_nmi_registered);
> > > -
> > > +       }
> > >           return boot_error;
> > >    }
> > > 
> > 


I'm trying to find the actual startup path that CPU0 takes when woken
by an NMI.

Can't we make it take the secondary_startup_64 path that actually
checks smpboot_control, sees the STARTUP_SECONDARY flag set, and then
gets all these things from the per-cpu data as $DEITY (well, Thomas)
intended?
  
David Woodhouse Feb. 9, 2023, 11:03 a.m. UTC | #6
On Thu, 2023-02-09 at 09:49 +0000, Usama Arif wrote:
> 
> Its easy to test, just by doing
> echo 0 > /sys/devices/system/cpu/cpu0/online;
> echo 1 > /sys/devices/system/cpu/cpu0/online;

This one also fixes it for me. If we're happy with this approach, I'll
work it into Thomas's original patch (and hopefully eventually he'll be
happy enough with it and the commit message that he'll give us his
Signed-off-by for it.)


I could probably add a Co-developed-by: tglx for that first x2apic
patch in the series too, but then it would *also* need his SoB and I
didn't want to be owed two, so I just pasted his suggested code and
didn't credit him.


diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 5462464fe3ef..ea6052a97619 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -450,7 +450,16 @@ SYM_CODE_END(secondary_startup_64)
 SYM_CODE_START(start_cpu0)
        ANNOTATE_NOENDBR
        UNWIND_HINT_EMPTY
-       movq    initial_stack(%rip), %rsp
+       /* Load the per-cpu base for CPU#0 */
+       leaq    __per_cpu_offset(%rip), %rbx
+       movq    (%rbx), %rbx
+
+       /* Find the idle task stack */
+       movq    $idle_threads, %rcx
+       addq    %rbx, %rcx
+       movq    (%rcx), %rcx
+       movq    TASK_threadsp(%rcx), %rsp
+
        jmp     .Ljump_to_C_code
 SYM_CODE_END(start_cpu0)
 #endif

I cut and pasted some of that, I'm not entirely sure why we have three
instructions to do the equivalent of 'movq idle_threads(%ebx), %ecx'
and may fix that in the original as I work this in.
  
Usama Arif Feb. 9, 2023, 11:45 a.m. UTC | #7
On 09/02/2023 11:03, David Woodhouse wrote:
> On Thu, 2023-02-09 at 09:49 +0000, Usama Arif wrote:
>>
>> Its easy to test, just by doing
>> echo 0 > /sys/devices/system/cpu/cpu0/online;
>> echo 1 > /sys/devices/system/cpu/cpu0/online;
> 
> This one also fixes it for me. If we're happy with this approach, I'll
> work it into Thomas's original patch (and hopefully eventually he'll be
> happy enough with it and the commit message that he'll give us his
> Signed-off-by for it.)
> 

Yes, I think its better!

> 
> I could probably add a Co-developed-by: tglx for that first x2apic
> patch in the series too, but then it would *also* need his SoB and I
> didn't want to be owed two, so I just pasted his suggested code and
> didn't credit him.
> 
> 
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 5462464fe3ef..ea6052a97619 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -450,7 +450,16 @@ SYM_CODE_END(secondary_startup_64)
>   SYM_CODE_START(start_cpu0)
>          ANNOTATE_NOENDBR
>          UNWIND_HINT_EMPTY
> -       movq    initial_stack(%rip), %rsp
> +       /* Load the per-cpu base for CPU#0 */
> +       leaq    __per_cpu_offset(%rip), %rbx
> +       movq    (%rbx), %rbx
> +
> +       /* Find the idle task stack */
> +       movq    $idle_threads, %rcx
> +       addq    %rbx, %rcx
> +       movq    (%rcx), %rcx
> +       movq    TASK_threadsp(%rcx), %rsp
> +
>          jmp     .Ljump_to_C_code
>   SYM_CODE_END(start_cpu0)
>   #endif
> 
> I cut and pasted some of that, I'm not entirely sure why we have three
> instructions to do the equivalent of 'movq idle_threads(%ebx), %ecx'
> and may fix that in the original as I work this in.
  
Thomas Gleixner Feb. 9, 2023, 11:53 a.m. UTC | #8
On Thu, Feb 09 2023 at 11:03, David Woodhouse wrote:
> This one also fixes it for me. If we're happy with this approach, I'll
> work it into Thomas's original patch (and hopefully eventually he'll be
> happy enough with it and the commit message that he'll give us his
> Signed-off-by for it.)

I'm happy enough by now, but I'm not sure how much of the original patch
is still left. Also you did the heavy lifting of making it work and
writing the nice changelog. So please make this:

From: David Woodhouse <dwmw2@infradead.org>

Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>

> I could probably add a Co-developed-by: tglx for that first x2apic
> patch in the series too, but then it would *also* need his SoB and I
> didn't want to be owed two, so I just pasted his suggested code and
> didn't credit him.

That's what Suggested-by: is for. For that I don't owe you anything. :)

Thanks

        tglx
  
David Woodhouse Feb. 9, 2023, 12:10 p.m. UTC | #9
On Thu, 2023-02-09 at 12:53 +0100, Thomas Gleixner wrote:
> On Thu, Feb 09 2023 at 11:03, David Woodhouse wrote:
> > This one also fixes it for me. If we're happy with this approach, I'll
> > work it into Thomas's original patch (and hopefully eventually he'll be
> > happy enough with it and the commit message that he'll give us his
> > Signed-off-by for it.)
> 
> I'm happy enough by now, but I'm not sure how much of the original patch
> is still left. Also you did the heavy lifting of making it work and
> writing the nice changelog. So please make this:
> 
> From: David Woodhouse <dwmw2@infradead.org>
> 
> Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Woodhouse <dwmw2@infradead.org>

Thanks. I'll flip that to the Amazon address, of course. It's useless
for actual email (until I apply that LART some more) but I should still
use it for that.

I think I'm going to make one more change to that as I review it; in
the "should never happen" case of not finding the APIC ID in the
cpuid_to_apicid[] array it would just keep searching for ever. I don't
know if there's a better thing to do other than just dropping the
trampoline lock and 1:cli;hlt;jmp 1b but at least it's *attempting* to
handle the failure.

Patch below, and I'll update the tree shortly. There's a "what if
there's noise in the top 32 bits of %rcx" fix in there too.

> > I could probably add a Co-developed-by: tglx for that first x2apic
> > patch in the series too, but then it would *also* need his SoB and I
> > didn't want to be owed two, so I just pasted his suggested code and
> > didn't credit him.
> 
> That's what Suggested-by: is for. For that I don't owe you anything. :)

Well, I didn't think that was really for suggestions which come in
'diff -up' form, but OK :)

--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -280,15 +280,25 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
        cpuid
 
 .Lsetup_AP:
-       /* EDX contains the APICID of the current CPU */
-       xorl    %ecx, %ecx
+       /* EDX contains the APIC ID of the current CPU */
+       xorq    %rcx, %rcx
        leaq    cpuid_to_apicid(%rip), %rbx
 
 .Lfind_cpunr:
        cmpl    (%rbx,%rcx,4), %edx
        jz      .Linit_cpu_data
-       inc     %rcx
-       jmp     .Lfind_cpunr
+       inc     %ecx
+       cmpl    nr_cpu_ids(%rip), %ecx
+       jb      .Lfind_cpunr
+
+       /*  APIC ID not found in the table. Drop the trampoline lock and bail. */
+       movq    trampoline_lock(%rip), %rax
+       lock
+       btrl    $0, (%rax)
+
+1:     cli
+       hlt
+       jmp     1b
 
 .Linit_cpu_data:
        /* Get the per cpu offset for the given CPU# which is in ECX */
@@ -303,9 +313,7 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
        movq    %rcx, early_gdt_descr_base(%rip)
 
        /* Find the idle task stack */
-       movq    $idle_threads, %rcx
-       addq    %rbx, %rcx
-       movq    (%rcx), %rcx
+       movq    idle_threads(%rbx), %rcx
        movq    TASK_threadsp(%rcx), %rcx
        movq    %rcx, initial_stack(%rip)
 #endif /* CONFIG_SMP */
@@ -450,7 +458,14 @@ SYM_CODE_END(secondary_startup_64)
 SYM_CODE_START(start_cpu0)
        ANNOTATE_NOENDBR
        UNWIND_HINT_EMPTY
-       movq    initial_stack(%rip), %rsp
+       /* Load the per-cpu base for CPU#0 */
+       leaq    __per_cpu_offset(%rip), %rbx
+       movq    (%rbx), %rbx
+
+       /* Find the idle task stack */
+       movq    idle_threads(%rbx), %rcx
+       movq    TASK_threadsp(%rcx), %rsp
+
        jmp     .Ljump_to_C_code
 SYM_CODE_END(start_cpu0)
 #endif
  
David Woodhouse Feb. 9, 2023, 1:48 p.m. UTC | #10
On Thu, 2023-02-09 at 12:10 +0000, David Woodhouse wrote:
> On Thu, 2023-02-09 at 12:53 +0100, Thomas Gleixner wrote:
> > On Thu, Feb 09 2023 at 11:03, David Woodhouse wrote:
> > > This one also fixes it for me. If we're happy with this approach, I'll
> > > work it into Thomas's original patch (and hopefully eventually he'll be
> > > happy enough with it and the commit message that he'll give us his
> > > Signed-off-by for it.)
> > 
> > I'm happy enough by now, but I'm not sure how much of the original patch
> > is still left. Also you did the heavy lifting of making it work and
> > writing the nice changelog. So please make this:
> > 
> > From: David Woodhouse <dwmw2@infradead.org>
> > 
> > Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: David Woodhouse <dwmw2@infradead.org>
> 
> Thanks. I'll flip that to the Amazon address, of course. It's useless
> for actual email (until I apply that LART some more) but I should still
> use it for that.
> 
> I think I'm going to make one more change to that as I review it; in
> the "should never happen" case of not finding the APIC ID in the
> cpuid_to_apicid[] array it would just keep searching for ever. I don't
> know if there's a better thing to do other than just dropping the
> trampoline lock and 1:cli;hlt;jmp 1b but at least it's *attempting* to
> handle the failure.
> 
> Patch below, and I'll update the tree shortly. There's a "what if
> there's noise in the top 32 bits of %rcx" fix in there too.

All done and pushed out to parallel-6.2-rc7-part1 (and not -part1)
branches. Usama, are you able to redo the testing and take it from
here? Thanks for that; it's saving me a lot of time!


I'm mostly done for the week now as by this time tomorrow, I need to
have the skis on the roof of the car and be ready to pick the family up
from school and start driving south...
  
Usama Arif Feb. 9, 2023, 2:01 p.m. UTC | #11
On 09/02/2023 13:48, David Woodhouse wrote:
> On Thu, 2023-02-09 at 12:10 +0000, David Woodhouse wrote:
>> On Thu, 2023-02-09 at 12:53 +0100, Thomas Gleixner wrote:
>>> On Thu, Feb 09 2023 at 11:03, David Woodhouse wrote:
>>>> This one also fixes it for me. If we're happy with this approach, I'll
>>>> work it into Thomas's original patch (and hopefully eventually he'll be
>>>> happy enough with it and the commit message that he'll give us his
>>>> Signed-off-by for it.)
>>>
>>> I'm happy enough by now, but I'm not sure how much of the original patch
>>> is still left. Also you did the heavy lifting of making it work and
>>> writing the nice changelog. So please make this:
>>>
>>> From: David Woodhouse <dwmw2@infradead.org>
>>>
>>> Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
>>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
>>
>> Thanks. I'll flip that to the Amazon address, of course. It's useless
>> for actual email (until I apply that LART some more) but I should still
>> use it for that.
>>
>> I think I'm going to make one more change to that as I review it; in
>> the "should never happen" case of not finding the APIC ID in the
>> cpuid_to_apicid[] array it would just keep searching for ever. I don't
>> know if there's a better thing to do other than just dropping the
>> trampoline lock and 1:cli;hlt;jmp 1b but at least it's *attempting* to
>> handle the failure.
>>
>> Patch below, and I'll update the tree shortly. There's a "what if
>> there's noise in the top 32 bits of %rcx" fix in there too.
> 
> All done and pushed out to parallel-6.2-rc7-part1 (and not -part1)
> branches. Usama, are you able to redo the testing and take it from
> here? Thanks for that; it's saving me a lot of time!
> 
Thanks! Will retest and repost now!

Usama

> 
> I'm mostly done for the week now as by this time tomorrow, I need to
> have the skis on the roof of the car and be ready to pick the family up
> from school and start driving south...
  
Thomas Gleixner Feb. 9, 2023, 3:48 p.m. UTC | #12
On Thu, Feb 09 2023 at 12:10, David Woodhouse wrote:
> On Thu, 2023-02-09 at 12:53 +0100, Thomas Gleixner wrote:
>> > I could probably add a Co-developed-by: tglx for that first x2apic
>> > patch in the series too, but then it would *also* need his SoB and I
>> > didn't want to be owed two, so I just pasted his suggested code and
>> > didn't credit him.
>> 
>> That's what Suggested-by: is for. For that I don't owe you anything. :)
>
> Well, I didn't think that was really for suggestions which come in
> 'diff -up' form, but OK :)

I can't even remember, but probably diff -up was way faster than writing
a novel :)

Thanks,

        tglx