diff mbox series

[v3] RISC-V: Probe misaligned access speed in parallel

Message ID	20231106225855.3121724-1-evan@rivosinc.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:8 as permitted sender) client-ip=2620:137:e000::3:8; From: Evan Green <evan@rivosinc.com> To: Palmer Dabbelt <palmer@rivosinc.com> Cc: Jisheng Zhang <jszhang@kernel.org>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, David Laight <David.Laight@aculab.com>, Evan Green <evan@rivosinc.com>, Albert Ou <aou@eecs.berkeley.edu>, Andrew Jones <ajones@ventanamicro.com>, Anup Patel <apatel@ventanamicro.com>, =?utf-8?b?Q2zDqW1lbnQgTMOpZ2Vy?= <cleger@rivosinc.com>, Conor Dooley <conor.dooley@microchip.com>, Greentime Hu <greentime.hu@sifive.com>, Heiko Stuebner <heiko@sntech.de>, Ley Foon Tan <leyfoon.tan@starfivetech.com>, Marc Zyngier <maz@kernel.org>, Palmer Dabbelt <palmer@dabbelt.com>, Paul Walmsley <paul.walmsley@sifive.com>, Sunil V L <sunilvl@ventanamicro.com>, linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org Subject: [PATCH v3] RISC-V: Probe misaligned access speed in parallel Date: Mon, 6 Nov 2023 14:58:55 -0800 Message-Id: <20231106225855.3121724-1-evan@rivosinc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v3] RISC-V: Probe misaligned access speed in parallel \| [v3] RISC-V: Probe misaligned access speed in parallel

Commit Message

Evan Green Nov. 6, 2023, 10:58 p.m. UTC

  Probing for misaligned access speed takes about 0.06 seconds. On a
system with 64 cores, doing this in smp_callin() means it's done
serially, extending boot time by 3.8 seconds. That's a lot of boot time.

Instead of measuring each CPU serially, let's do the measurements on
all CPUs in parallel. If we disable preemption on all CPUs, the
jiffies stop ticking, so we can do this in stages of 1) everybody
except core 0, then 2) core 0. The allocations are all done outside of
on_each_cpu() to avoid calling alloc_pages() with interrupts disabled.

For hotplugged CPUs that come in after the boot time measurement,
register CPU hotplug callbacks, and do the measurement there. Interrupts
are enabled in those callbacks, so they're fine to do alloc_pages() in.

Reported-by: Jisheng Zhang <jszhang@kernel.org>
Closes: https://lore.kernel.org/all/mhng-9359993d-6872-4134-83ce-c97debe1cf9a@palmer-ri-x1c9/T/#mae9b8f40016f9df428829d33360144dc5026bcbf
Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
Signed-off-by: Evan Green <evan@rivosinc.com>

---

Changes in v3:
 - Avoid alloc_pages() with interrupts disabled (Sebastien)
 - Use cpuhp callbacks instead of hooking into smp_callin() (Sebastien).
 - Move cached answer check in check_unaligned_access() out to the
   hotplug callback, both to save the work of a useless allocation, and
   since check_unaligned_access_emulated() resets the answer to unknown.

Changes in v2:
 - Removed new global, used system_state == SYSTEM_RUNNING instead
   (Jisheng)
 - Added tags

 arch/riscv/include/asm/cpufeature.h |  1 -
 arch/riscv/kernel/cpufeature.c      | 96 +++++++++++++++++++++++------
 arch/riscv/kernel/smpboot.c         |  1 -
 3 files changed, 77 insertions(+), 21 deletions(-)

Comments

Sebastian Andrzej Siewior Nov. 7, 2023, 8:34 a.m. UTC | #1

On 2023-11-06 14:58:55 [-0800], Evan Green wrote:
> Probing for misaligned access speed takes about 0.06 seconds. On a
> system with 64 cores, doing this in smp_callin() means it's done
> serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> 
> Instead of measuring each CPU serially, let's do the measurements on
> all CPUs in parallel. If we disable preemption on all CPUs, the
> jiffies stop ticking, so we can do this in stages of 1) everybody
> except core 0, then 2) core 0. The allocations are all done outside of
> on_each_cpu() to avoid calling alloc_pages() with interrupts disabled.
> 
> For hotplugged CPUs that come in after the boot time measurement,
> register CPU hotplug callbacks, and do the measurement there. Interrupts
> are enabled in those callbacks, so they're fine to do alloc_pages() in.

I think this is dragged out of proportion. I would do this (if needed
can can't be identified by CPU-ID or so) on boot CPU only. If there is
evidence/ proof/ blessing from the high RiscV council that different
types of CPU cores are mixed together then this could be extended.
You brought Big-Little up in the other thread. This is actually known.
Same as with hyper-threads on x86, you know which CPU is the core and
which hyper thread (CPU) belongs to it.
So in terms of BigLittle you _could_ limit this to one Big and one
Little core instead running it on all.

But this is just my few on this. From PREEMPT_RT's point of view, the
way you restructured the memory allocation should work now.

> Reported-by: Jisheng Zhang <jszhang@kernel.org>
> Closes: https://lore.kernel.org/all/mhng-9359993d-6872-4134-83ce-c97debe1cf9a@palmer-ri-x1c9/T/#mae9b8f40016f9df428829d33360144dc5026bcbf
> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> Signed-off-by: Evan Green <evan@rivosinc.com>
> 
> 
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index 6a01ded615cd..fe59e18dbd5b 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
…
>  
> -static int __init check_unaligned_access_boot_cpu(void)
> +/* Measure unaligned access on all CPUs present at boot in parallel. */
> +static int check_unaligned_access_all_cpus(void)
>  {
> -	check_unaligned_access(0);
> +	unsigned int cpu;
> +	unsigned int cpu_count = num_possible_cpus();
> +	struct page **bufs = kzalloc(cpu_count * sizeof(struct page *),
> +				     GFP_KERNEL);

kcalloc(). For beauty reasons you could try a reverse xmas tree. 

> +
> +	if (!bufs) {
> +		pr_warn("Allocation failure, not measuring misaligned performance\n");
> +		return 0;
> +	}
> +
> +	/*
> +	 * Allocate separate buffers for each CPU so there's no fighting over
> +	 * cache lines.
> +	 */
> +	for_each_cpu(cpu, cpu_online_mask) {
> +		bufs[cpu] = alloc_pages(GFP_KERNEL, MISALIGNED_BUFFER_ORDER);
> +		if (!bufs[cpu]) {
> +			pr_warn("Allocation failure, not measuring misaligned performance\n");
> +			goto out;
> +		}
> +	}
> +
> +	/* Check everybody except 0, who stays behind to tend jiffies. */
> +	on_each_cpu(check_unaligned_access_nonboot_cpu, bufs, 1);

comments! _HOW_ do you ensure that CPU0 is left out? You don't. CPU0
does this and the leaves which is a waste. Using on_each_cpu_cond()
could deal with this. And you have the check within the wrapper
(check_unaligned_access_nonboot_cpu()) anyway.

> +	/* Check core 0. */
> +	smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);

Now that comment is obvious. If you want to add a comment, why not state
why CPU0 has to be done last?

> +
> +	/* Setup hotplug callback for any new CPUs that come online. */
> +	cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "riscv:online",
> +				  riscv_online_cpu, NULL);
Instead riscv:online you could use riscv:unaliged_check or something
that pin points the callback to something obvious. This is exported via
sysfs.

Again, comment is obvious. For that to make sense would require RiscV to
support physical-hotplug. For KVM like environment (where you can plug in
CPUs later) this probably doesn't make sense at all. Why not? Because

- without explicit CPU pinning your slow/ fast CPU mapping (host <->
  guest) could change if the scheduler on the host moves the threads
  around.

- without explicit task offload and resource partitioning on the host
  your guest thread might get interrupt during the measurement. This is
  done during boot so chances are high that it runs 100% of its time
  slice and will be preempted once other tasks on the host ask for CPU
  run time.

Both points mean, that the results may not be accurate over time. Maybe
a KVM-hint if this is so important (along with other restrictions).

> +
> +out:
>  	unaligned_emulation_finish();
> +	for_each_cpu(cpu, cpu_online_mask) {
> +		if (bufs[cpu])
> +			__free_pages(bufs[cpu], MISALIGNED_BUFFER_ORDER);
> +	}
> +
> +	kfree(bufs);
>  	return 0;
>  }
>  
> -arch_initcall(check_unaligned_access_boot_cpu);
> +arch_initcall(check_unaligned_access_all_cpus);
>  
>  void riscv_user_isa_enable(void)
>  {

Sebastian

Evan Green Nov. 7, 2023, 5:26 p.m. UTC | #2

On Tue, Nov 7, 2023 at 12:34 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-11-06 14:58:55 [-0800], Evan Green wrote:
> > Probing for misaligned access speed takes about 0.06 seconds. On a
> > system with 64 cores, doing this in smp_callin() means it's done
> > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> >
> > Instead of measuring each CPU serially, let's do the measurements on
> > all CPUs in parallel. If we disable preemption on all CPUs, the
> > jiffies stop ticking, so we can do this in stages of 1) everybody
> > except core 0, then 2) core 0. The allocations are all done outside of
> > on_each_cpu() to avoid calling alloc_pages() with interrupts disabled.
> >
> > For hotplugged CPUs that come in after the boot time measurement,
> > register CPU hotplug callbacks, and do the measurement there. Interrupts
> > are enabled in those callbacks, so they're fine to do alloc_pages() in.
>
> I think this is dragged out of proportion. I would do this (if needed
> can can't be identified by CPU-ID or so) on boot CPU only. If there is
> evidence/ proof/ blessing from the high RiscV council that different
> types of CPU cores are mixed together then this could be extended.
> You brought Big-Little up in the other thread. This is actually known.
> Same as with hyper-threads on x86, you know which CPU is the core and
> which hyper thread (CPU) belongs to it.
> So in terms of BigLittle you _could_ limit this to one Big and one
> Little core instead running it on all.

Doing it on one per cluster might also happen to work, but I still see
nothing that prevents variety within a cluster, so I'm not comfortable
with that assumption. It also doesn't buy much. I'm not sure what kind
of guidance RVI is providing on integrating multiple CPUs into a
system. I haven't seen any myself, but am happy to reassess if there's
documentation banning the scenarios I'm imagining.

>
> But this is just my few on this. From PREEMPT_RT's point of view, the
> way you restructured the memory allocation should work now.

Thanks!

>
> > Reported-by: Jisheng Zhang <jszhang@kernel.org>
> > Closes: https://lore.kernel.org/all/mhng-9359993d-6872-4134-83ce-c97debe1cf9a@palmer-ri-x1c9/T/#mae9b8f40016f9df428829d33360144dc5026bcbf
> > Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> > Signed-off-by: Evan Green <evan@rivosinc.com>
> >
> >
> > diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> > index 6a01ded615cd..fe59e18dbd5b 100644
> > --- a/arch/riscv/kernel/cpufeature.c
> > +++ b/arch/riscv/kernel/cpufeature.c
> …
> >
> > -static int __init check_unaligned_access_boot_cpu(void)
> > +/* Measure unaligned access on all CPUs present at boot in parallel. */
> > +static int check_unaligned_access_all_cpus(void)
> >  {
> > -     check_unaligned_access(0);
> > +     unsigned int cpu;
> > +     unsigned int cpu_count = num_possible_cpus();
> > +     struct page **bufs = kzalloc(cpu_count * sizeof(struct page *),
> > +                                  GFP_KERNEL);
>
> kcalloc(). For beauty reasons you could try a reverse xmas tree.
>
> > +
> > +     if (!bufs) {
> > +             pr_warn("Allocation failure, not measuring misaligned performance\n");
> > +             return 0;
> > +     }
> > +
> > +     /*
> > +      * Allocate separate buffers for each CPU so there's no fighting over
> > +      * cache lines.
> > +      */
> > +     for_each_cpu(cpu, cpu_online_mask) {
> > +             bufs[cpu] = alloc_pages(GFP_KERNEL, MISALIGNED_BUFFER_ORDER);
> > +             if (!bufs[cpu]) {
> > +                     pr_warn("Allocation failure, not measuring misaligned performance\n");
> > +                     goto out;
> > +             }
> > +     }
> > +
> > +     /* Check everybody except 0, who stays behind to tend jiffies. */
> > +     on_each_cpu(check_unaligned_access_nonboot_cpu, bufs, 1);
>
> comments! _HOW_ do you ensure that CPU0 is left out? You don't. CPU0
> does this and the leaves which is a waste. Using on_each_cpu_cond()
> could deal with this. And you have the check within the wrapper
> (check_unaligned_access_nonboot_cpu()) anyway.
>
> > +     /* Check core 0. */
> > +     smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);
>
> Now that comment is obvious. If you want to add a comment, why not state
> why CPU0 has to be done last?
>
> > +
> > +     /* Setup hotplug callback for any new CPUs that come online. */
> > +     cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "riscv:online",
> > +                               riscv_online_cpu, NULL);
> Instead riscv:online you could use riscv:unaliged_check or something
> that pin points the callback to something obvious. This is exported via
> sysfs.
>
> Again, comment is obvious. For that to make sense would require RiscV to
> support physical-hotplug. For KVM like environment (where you can plug in
> CPUs later) this probably doesn't make sense at all. Why not? Because
>
> - without explicit CPU pinning your slow/ fast CPU mapping (host <->
>   guest) could change if the scheduler on the host moves the threads
>   around.

Taking a system with non-identical cores and allowing vcpus to bounce
between them sounds like a hypervisor configuration issue to me,
regardless of this patch.

>
> - without explicit task offload and resource partitioning on the host
>   your guest thread might get interrupt during the measurement. This is
>   done during boot so chances are high that it runs 100% of its time
>   slice and will be preempted once other tasks on the host ask for CPU
>   run time.

The measurement takes the best (lowest time) iteration. So unless
every iteration gets interrupted, I should get a good read in there
somewhere.
-Evan

Palmer Dabbelt Nov. 7, 2023, 5:45 p.m. UTC | #3

On Tue, 07 Nov 2023 09:26:03 PST (-0800), Evan Green wrote:
> On Tue, Nov 7, 2023 at 12:34 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2023-11-06 14:58:55 [-0800], Evan Green wrote:
>> > Probing for misaligned access speed takes about 0.06 seconds. On a
>> > system with 64 cores, doing this in smp_callin() means it's done
>> > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
>> >
>> > Instead of measuring each CPU serially, let's do the measurements on
>> > all CPUs in parallel. If we disable preemption on all CPUs, the
>> > jiffies stop ticking, so we can do this in stages of 1) everybody
>> > except core 0, then 2) core 0. The allocations are all done outside of
>> > on_each_cpu() to avoid calling alloc_pages() with interrupts disabled.
>> >
>> > For hotplugged CPUs that come in after the boot time measurement,
>> > register CPU hotplug callbacks, and do the measurement there. Interrupts
>> > are enabled in those callbacks, so they're fine to do alloc_pages() in.
>>
>> I think this is dragged out of proportion. I would do this (if needed
>> can can't be identified by CPU-ID or so) on boot CPU only. If there is
>> evidence/ proof/ blessing from the high RiscV council that different
>> types of CPU cores are mixed together then this could be extended.
>> You brought Big-Little up in the other thread. This is actually known.
>> Same as with hyper-threads on x86, you know which CPU is the core and
>> which hyper thread (CPU) belongs to it.
>> So in terms of BigLittle you _could_ limit this to one Big and one
>> Little core instead running it on all.
>
> Doing it on one per cluster might also happen to work, but I still see
> nothing that prevents variety within a cluster, so I'm not comfortable
> with that assumption. It also doesn't buy much. I'm not sure what kind
> of guidance RVI is providing on integrating multiple CPUs into a
> system. I haven't seen any myself, but am happy to reassess if there's
> documentation banning the scenarios I'm imagining.

IIUC there's pretty much no rules here, and vendors are already building 
wacky systems (the K230 just showed up with heterogenous-ISA cores, 
we've got a handful now).  I guess we could write up some guidance in 
Documentation/riscv describing what sort of systems we generally test 
on, but given how RISC-V generally goes vendors are just going to build 
the crazy stuff anyway and we'll have to deal with it.

>
>>
>> But this is just my few on this. From PREEMPT_RT's point of view, the
>> way you restructured the memory allocation should work now.
>
> Thanks!
>
>>
>> > Reported-by: Jisheng Zhang <jszhang@kernel.org>
>> > Closes: https://lore.kernel.org/all/mhng-9359993d-6872-4134-83ce-c97debe1cf9a@palmer-ri-x1c9/T/#mae9b8f40016f9df428829d33360144dc5026bcbf
>> > Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
>> > Signed-off-by: Evan Green <evan@rivosinc.com>
>> >
>> >
>> > diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
>> > index 6a01ded615cd..fe59e18dbd5b 100644
>> > --- a/arch/riscv/kernel/cpufeature.c
>> > +++ b/arch/riscv/kernel/cpufeature.c
>> …
>> >
>> > -static int __init check_unaligned_access_boot_cpu(void)
>> > +/* Measure unaligned access on all CPUs present at boot in parallel. */
>> > +static int check_unaligned_access_all_cpus(void)
>> >  {
>> > -     check_unaligned_access(0);
>> > +     unsigned int cpu;
>> > +     unsigned int cpu_count = num_possible_cpus();
>> > +     struct page **bufs = kzalloc(cpu_count * sizeof(struct page *),
>> > +                                  GFP_KERNEL);
>>
>> kcalloc(). For beauty reasons you could try a reverse xmas tree.
>>
>> > +
>> > +     if (!bufs) {
>> > +             pr_warn("Allocation failure, not measuring misaligned performance\n");
>> > +             return 0;
>> > +     }
>> > +
>> > +     /*
>> > +      * Allocate separate buffers for each CPU so there's no fighting over
>> > +      * cache lines.
>> > +      */
>> > +     for_each_cpu(cpu, cpu_online_mask) {
>> > +             bufs[cpu] = alloc_pages(GFP_KERNEL, MISALIGNED_BUFFER_ORDER);
>> > +             if (!bufs[cpu]) {
>> > +                     pr_warn("Allocation failure, not measuring misaligned performance\n");
>> > +                     goto out;
>> > +             }
>> > +     }
>> > +
>> > +     /* Check everybody except 0, who stays behind to tend jiffies. */
>> > +     on_each_cpu(check_unaligned_access_nonboot_cpu, bufs, 1);
>>
>> comments! _HOW_ do you ensure that CPU0 is left out? You don't. CPU0
>> does this and the leaves which is a waste. Using on_each_cpu_cond()
>> could deal with this. And you have the check within the wrapper
>> (check_unaligned_access_nonboot_cpu()) anyway.
>>
>> > +     /* Check core 0. */
>> > +     smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);
>>
>> Now that comment is obvious. If you want to add a comment, why not state
>> why CPU0 has to be done last?
>>
>> > +
>> > +     /* Setup hotplug callback for any new CPUs that come online. */
>> > +     cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "riscv:online",
>> > +                               riscv_online_cpu, NULL);
>> Instead riscv:online you could use riscv:unaliged_check or something
>> that pin points the callback to something obvious. This is exported via
>> sysfs.
>>
>> Again, comment is obvious. For that to make sense would require RiscV to
>> support physical-hotplug. For KVM like environment (where you can plug in
>> CPUs later) this probably doesn't make sense at all. Why not? Because
>>
>> - without explicit CPU pinning your slow/ fast CPU mapping (host <->
>>   guest) could change if the scheduler on the host moves the threads
>>   around.
>
> Taking a system with non-identical cores and allowing vcpus to bounce
> between them sounds like a hypervisor configuration issue to me,
> regardless of this patch.
>
>>
>> - without explicit task offload and resource partitioning on the host
>>   your guest thread might get interrupt during the measurement. This is
>>   done during boot so chances are high that it runs 100% of its time
>>   slice and will be preempted once other tasks on the host ask for CPU
>>   run time.
>
> The measurement takes the best (lowest time) iteration. So unless
> every iteration gets interrupted, I should get a good read in there
> somewhere.
> -Evan

patchwork-bot+linux-riscv@kernel.org Nov. 8, 2023, 3:10 p.m. UTC | #4

Hello:

This patch was applied to riscv/linux.git (for-next)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Mon,  6 Nov 2023 14:58:55 -0800 you wrote:
> Probing for misaligned access speed takes about 0.06 seconds. On a
> system with 64 cores, doing this in smp_callin() means it's done
> serially, extending boot time by 3.8 seconds. That's a lot of boot time.
> 
> Instead of measuring each CPU serially, let's do the measurements on
> all CPUs in parallel. If we disable preemption on all CPUs, the
> jiffies stop ticking, so we can do this in stages of 1) everybody
> except core 0, then 2) core 0. The allocations are all done outside of
> on_each_cpu() to avoid calling alloc_pages() with interrupts disabled.
> 
> [...]

Here is the summary with links:
  - [v3] RISC-V: Probe misaligned access speed in parallel
    https://git.kernel.org/riscv/c/55e0bf49a0d0

You are awesome, thank you!

diff mbox series

Patch

diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
index 7f1e46a9d445..69f2cae96f0b 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -30,7 +30,6 @@  DECLARE_PER_CPU(long, misaligned_access_speed);
 /* Per-cpu ISA extensions. */
 extern struct riscv_isainfo hart_isa[NR_CPUS];
 
-void check_unaligned_access(int cpu);
 void riscv_user_isa_enable(void);
 
 #ifdef CONFIG_RISCV_MISALIGNED
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 6a01ded615cd..fe59e18dbd5b 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -8,6 +8,7 @@ 
 
 #include <linux/acpi.h>
 #include <linux/bitmap.h>
+#include <linux/cpuhotplug.h>
 #include <linux/ctype.h>
 #include <linux/log2.h>
 #include <linux/memory.h>
@@ -29,6 +30,7 @@ 
 
 #define MISALIGNED_ACCESS_JIFFIES_LG2 1
 #define MISALIGNED_BUFFER_SIZE 0x4000
+#define MISALIGNED_BUFFER_ORDER get_order(MISALIGNED_BUFFER_SIZE)
 #define MISALIGNED_COPY_SIZE ((MISALIGNED_BUFFER_SIZE / 2) - 0x80)
 
 unsigned long elf_hwcap __read_mostly;
@@ -557,30 +559,21 @@  unsigned long riscv_get_elf_hwcap(void)
 	return hwcap;
 }
 
-void check_unaligned_access(int cpu)
+static int check_unaligned_access(void *param)
 {
+	int cpu = smp_processor_id();
 	u64 start_cycles, end_cycles;
 	u64 word_cycles;
 	u64 byte_cycles;
 	int ratio;
 	unsigned long start_jiffies, now;
-	struct page *page;
+	struct page *page = param;
 	void *dst;
 	void *src;
 	long speed = RISCV_HWPROBE_MISALIGNED_SLOW;
 
 	if (check_unaligned_access_emulated(cpu))
-		return;
-
-	/* We are already set since the last check */
-	if (per_cpu(misaligned_access_speed, cpu) != RISCV_HWPROBE_MISALIGNED_UNKNOWN)
-		return;
-
-	page = alloc_pages(GFP_NOWAIT, get_order(MISALIGNED_BUFFER_SIZE));
-	if (!page) {
-		pr_warn("Can't alloc pages to measure memcpy performance");
-		return;
-	}
+		return 0;
 
 	/* Make an unaligned destination buffer. */
 	dst = (void *)((unsigned long)page_address(page) | 0x1);
@@ -634,7 +627,7 @@  void check_unaligned_access(int cpu)
 		pr_warn("cpu%d: rdtime lacks granularity needed to measure unaligned access speed\n",
 			cpu);
 
-		goto out;
+		return 0;
 	}
 
 	if (word_cycles < byte_cycles)
@@ -648,19 +641,84 @@  void check_unaligned_access(int cpu)
 		(speed == RISCV_HWPROBE_MISALIGNED_FAST) ? "fast" : "slow");
 
 	per_cpu(misaligned_access_speed, cpu) = speed;
+	return 0;
+}
 
-out:
-	__free_pages(page, get_order(MISALIGNED_BUFFER_SIZE));
+static void check_unaligned_access_nonboot_cpu(void *param)
+{
+	unsigned int cpu = smp_processor_id();
+	struct page **pages = param;
+
+	if (smp_processor_id() != 0)
+		check_unaligned_access(pages[cpu]);
+}
+
+static int riscv_online_cpu(unsigned int cpu)
+{
+	static struct page *buf;
+
+	/* We are already set since the last check */
+	if (per_cpu(misaligned_access_speed, cpu) != RISCV_HWPROBE_MISALIGNED_UNKNOWN)
+		return 0;
+
+	buf = alloc_pages(GFP_KERNEL, MISALIGNED_BUFFER_ORDER);
+	if (!buf) {
+		pr_warn("Allocation failure, not measuring misaligned performance\n");
+		return -ENOMEM;
+	}
+
+	check_unaligned_access(buf);
+	__free_pages(buf, MISALIGNED_BUFFER_ORDER);
+	return 0;
 }
 
-static int __init check_unaligned_access_boot_cpu(void)
+/* Measure unaligned access on all CPUs present at boot in parallel. */
+static int check_unaligned_access_all_cpus(void)
 {
-	check_unaligned_access(0);
+	unsigned int cpu;
+	unsigned int cpu_count = num_possible_cpus();
+	struct page **bufs = kzalloc(cpu_count * sizeof(struct page *),
+				     GFP_KERNEL);
+
+	if (!bufs) {
+		pr_warn("Allocation failure, not measuring misaligned performance\n");
+		return 0;
+	}
+
+	/*
+	 * Allocate separate buffers for each CPU so there's no fighting over
+	 * cache lines.
+	 */
+	for_each_cpu(cpu, cpu_online_mask) {
+		bufs[cpu] = alloc_pages(GFP_KERNEL, MISALIGNED_BUFFER_ORDER);
+		if (!bufs[cpu]) {
+			pr_warn("Allocation failure, not measuring misaligned performance\n");
+			goto out;
+		}
+	}
+
+	/* Check everybody except 0, who stays behind to tend jiffies. */
+	on_each_cpu(check_unaligned_access_nonboot_cpu, bufs, 1);
+
+	/* Check core 0. */
+	smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);
+
+	/* Setup hotplug callback for any new CPUs that come online. */
+	cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "riscv:online",
+				  riscv_online_cpu, NULL);
+
+out:
 	unaligned_emulation_finish();
+	for_each_cpu(cpu, cpu_online_mask) {
+		if (bufs[cpu])
+			__free_pages(bufs[cpu], MISALIGNED_BUFFER_ORDER);
+	}
+
+	kfree(bufs);
 	return 0;
 }
 
-arch_initcall(check_unaligned_access_boot_cpu);
+arch_initcall(check_unaligned_access_all_cpus);
 
 void riscv_user_isa_enable(void)
 {
diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index d69c628c24f4..d162bf339beb 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -247,7 +247,6 @@  asmlinkage __visible void smp_callin(void)
 	riscv_ipi_enable();
 
 	numa_add_cpu(curr_cpuid);
-	check_unaligned_access(curr_cpuid);
 	set_cpu_online(curr_cpuid, 1);
 
 	if (has_vector()) {