x86/hyperv: Pass on the lpj value from host to guest

Message ID 167571656510.2157946.174424531449774007.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net
State New

Commit Message

Stanislav Kinsburskii Feb. 6, 2023, 8:49 p.m. UTC
  From: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>

And have it preset.
This change significantly reduces the time needed to bring up the guest SMP
configuration and makes sure the guest won't get inaccurate calibration
results in a "noisy neighbour" situation.

Below are the numbers for a 16-vCPU guest before the patch (~1300 msec):

[    0.562938] x86: Booting SMP configuration:
...
[    1.859447] smp: Brought up 1 node, 16 CPUs

and after the patch (~130 msec):

[    0.445079] x86: Booting SMP configuration:
...
[    0.575035] smp: Brought up 1 node, 16 CPUs

This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
paravirt function to calculate cpu khz").

Signed-off-by: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Wei Liu <wei.liu@kernel.org>
CC: Dexuan Cui <decui@microsoft.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: x86@kernel.org
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: linux-hyperv@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/cpu/mshyperv.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
  

Comments

Nuno Das Neves Feb. 7, 2023, 11:24 p.m. UTC | #1
On 2/6/2023 12:49 PM, Stanislav Kinsburskii wrote:
> From: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>
> 
> And have it preset.
> This change allows to significantly reduce time to bring up guest SMP
> configuration as well as make sure the guest won't get inaccurate
> calibration results due to "noisy neighbour" situation.
> 
> Below are the numbers for 16 VCPU guest before the patch (~1300 msec)
> 
> [    0.562938] x86: Booting SMP configuration:
> ...
> [    1.859447] smp: Brought up 1 node, 16 CPUs
> 
> and after the patch (~130 msec):
> 
> [    0.445079] x86: Booting SMP configuration:
> ...
> [    0.575035] smp: Brought up 1 node, 16 CPUs
> 
> This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> paravirt function to calculate cpu khz").
> 
> Signed-off-by: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>
> CC: "K. Y. Srinivasan" <kys@microsoft.com>
> CC: Haiyang Zhang <haiyangz@microsoft.com>
> CC: Wei Liu <wei.liu@kernel.org>
> CC: Dexuan Cui <decui@microsoft.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: Dave Hansen <dave.hansen@linux.intel.com>
> CC: x86@kernel.org
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: linux-hyperv@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/cpu/mshyperv.c |   21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index dedec2f23ad1..0282b2e96cc2 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -320,6 +320,21 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
>  }
>  #endif
>  
> +static void __init __maybe_unused hv_preset_lpj(void)
> +{
> +	unsigned long khz;
> +	u64 lpj;
> +
> +	if (!x86_platform.calibrate_tsc)
> +		return;
> +
> +	khz = x86_platform.calibrate_tsc();
> +
> +	lpj = ((u64)khz * 1000);
> +	do_div(lpj, HZ);
> +	preset_lpj = lpj;
> +}
> +
>  static void __init ms_hyperv_init_platform(void)
>  {
>  	int hv_max_functions_eax;
> @@ -521,6 +536,12 @@ static void __init ms_hyperv_init_platform(void)
>  
>  	/* Register Hyper-V specific clocksource */
>  	hv_init_clocksource();
> +
> +	/*
> +	 * Preset lpj to make calibrate_delay a no-op, which in turn helps to
> +	 * speed up secondary cores initialization.
> +	 */
> +	hv_preset_lpj();
>  #endif
>  	/*
>  	 * TSC should be marked as unstable only after Hyper-V
> 

Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
  
Wei Liu Feb. 13, 2023, 3:54 p.m. UTC | #2
On Tue, Feb 07, 2023 at 03:24:47PM -0800, Nuno Das Neves wrote:
> On 2/6/2023 12:49 PM, Stanislav Kinsburskii wrote:
> > From: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>
> > 
> > And have it preset.

In the future please add a blank line between two paragraphs.

> > This change allows to significantly reduce time to bring up guest SMP
> > configuration as well as make sure the guest won't get inaccurate
> > calibration results due to "noisy neighbour" situation.
> > 

This looks like a good idea. 0293615f3fb9 was committed in 2008, so
we're very late to the party. Better late than never though.

If I hear no objections in a few days' time I will apply this to
hyperv-next with Nuno's Rb tag.

Thanks,
Wei.

> > Below are the numbers for 16 VCPU guest before the patch (~1300 msec)
> > 
> > [    0.562938] x86: Booting SMP configuration:
> > ...
> > [    1.859447] smp: Brought up 1 node, 16 CPUs
> > 
> > and after the patch (~130 msec):
> > 
> > [    0.445079] x86: Booting SMP configuration:
> > ...
> > [    0.575035] smp: Brought up 1 node, 16 CPUs
> > 
> > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > paravirt function to calculate cpu khz").
> > 
> > Signed-off-by: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>
> > CC: "K. Y. Srinivasan" <kys@microsoft.com>
> > CC: Haiyang Zhang <haiyangz@microsoft.com>
> > CC: Wei Liu <wei.liu@kernel.org>
> > CC: Dexuan Cui <decui@microsoft.com>
> > CC: Thomas Gleixner <tglx@linutronix.de>
> > CC: Ingo Molnar <mingo@redhat.com>
> > CC: Borislav Petkov <bp@alien8.de>
> > CC: Dave Hansen <dave.hansen@linux.intel.com>
> > CC: x86@kernel.org
> > CC: "H. Peter Anvin" <hpa@zytor.com>
> > CC: linux-hyperv@vger.kernel.org
> > CC: linux-kernel@vger.kernel.org
> > ---
> >  arch/x86/kernel/cpu/mshyperv.c |   21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > index dedec2f23ad1..0282b2e96cc2 100644
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -320,6 +320,21 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
> >  }
> >  #endif
> >  
> > +static void __init __maybe_unused hv_preset_lpj(void)
> > +{
> > +	unsigned long khz;
> > +	u64 lpj;
> > +
> > +	if (!x86_platform.calibrate_tsc)
> > +		return;
> > +
> > +	khz = x86_platform.calibrate_tsc();
> > +
> > +	lpj = ((u64)khz * 1000);
> > +	do_div(lpj, HZ);
> > +	preset_lpj = lpj;
> > +}
> > +
> >  static void __init ms_hyperv_init_platform(void)
> >  {
> >  	int hv_max_functions_eax;
> > @@ -521,6 +536,12 @@ static void __init ms_hyperv_init_platform(void)
> >  
> >  	/* Register Hyper-V specific clocksource */
> >  	hv_init_clocksource();
> > +
> > +	/*
> > +	 * Preset lpj to make calibrate_delay a no-op, which in turn helps to
> > +	 * speed up secondary cores initialization.
> > +	 */
> > +	hv_preset_lpj();
> >  #endif
> >  	/*
> >  	 * TSC should be marked as unstable only after Hyper-V
> > 
> 
> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
  
Michael Kelley (LINUX) Feb. 14, 2023, 4:19 p.m. UTC | #3
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> And have it preset.
> This change allows to significantly reduce time to bring up guest SMP
> configuration as well as make sure the guest won't get inaccurate
> calibration results due to "noisy neighbour" situation.
> 
> Below are the numbers for 16 VCPU guest before the patch (~1300 msec)
> 
> [    0.562938] x86: Booting SMP configuration:
> ...
> [    1.859447] smp: Brought up 1 node, 16 CPUs
> 
> and after the patch (~130 msec):
> 
> [    0.445079] x86: Booting SMP configuration:
> ...
> [    0.575035] smp: Brought up 1 node, 16 CPUs
> 
> This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> paravirt function to calculate cpu khz").

This patch has been nagging at me a bit, and I finally did some further
checking.   Looking at Linux guests on local Hyper-V and in Azure, I see
a dmesg output line like this during boot:

Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)

We're already skipping the delay loop calculation because lpj_fine
is set in tsc_init(), using the results of get_loops_per_jiffy().  The
latter does exactly the same calculation as hv_preset_lpj() in
this patch.
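
As a cross-check of the figures in that dmesg line, the arithmetic can be
reproduced in a small userspace sketch.  It assumes HZ=1000, which is not
stated anywhere in this thread but is consistent with the printed BogoMIPS
value.  The helper names lpj_from_khz() and bogomips_x100() are hypothetical,
and the BogoMIPS scaling is an assumption about how init/calibrate.c formats
its output.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the same arithmetic as hv_preset_lpj() in the patch,
 * lpj = khz * 1000 / HZ, done in 64 bits so the multiplication cannot
 * overflow on 32-bit builds.
 */
uint64_t lpj_from_khz(uint64_t khz, unsigned int hz)
{
	return khz * 1000ULL / hz;
}

/*
 * Hypothetical helper: BogoMIPS scaled by 100, so 518781 means 5187.81.
 * Assumes hz <= 5000 (otherwise the divisor would be zero).
 */
uint64_t bogomips_x100(uint64_t lpj, unsigned int hz)
{
	return lpj / (5000 / hz);
}
```

With HZ=1000, lpj is numerically equal to the TSC frequency in kHz, so
lpj=2593905 corresponds to a 2593905 kHz TSC, and bogomips_x100(2593905, 1000)
yields 518781, i.e. the 5187.81 BogoMIPS shown above.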

Is this patch arising from an environment where tsc_init() is
skipped for some reason?  Just trying to make sure we fully
understand when this patch is applicable, and when not.

Michael

> 
> Signed-off-by: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>
> CC: "K. Y. Srinivasan" <kys@microsoft.com>
> CC: Haiyang Zhang <haiyangz@microsoft.com>
> CC: Wei Liu <wei.liu@kernel.org>
> CC: Dexuan Cui <decui@microsoft.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: Dave Hansen <dave.hansen@linux.intel.com>
> CC: x86@kernel.org
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: linux-hyperv@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/cpu/mshyperv.c |   21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index dedec2f23ad1..0282b2e96cc2 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -320,6 +320,21 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
>  }
>  #endif
> 
> +static void __init __maybe_unused hv_preset_lpj(void)
> +{
> +	unsigned long khz;
> +	u64 lpj;
> +
> +	if (!x86_platform.calibrate_tsc)
> +		return;
> +
> +	khz = x86_platform.calibrate_tsc();
> +
> +	lpj = ((u64)khz * 1000);
> +	do_div(lpj, HZ);
> +	preset_lpj = lpj;
> +}
> +
>  static void __init ms_hyperv_init_platform(void)
>  {
>  	int hv_max_functions_eax;
> @@ -521,6 +536,12 @@ static void __init ms_hyperv_init_platform(void)
> 
>  	/* Register Hyper-V specific clocksource */
>  	hv_init_clocksource();
> +
> +	/*
> +	 * Preset lpj to make calibrate_delay a no-op, which in turn helps to
> +	 * speed up secondary cores initialization.
> +	 */
> +	hv_preset_lpj();
>  #endif
>  	/*
>  	 * TSC should be marked as unstable only after Hyper-V
>
  
Stanislav Kinsburskii Feb. 16, 2023, 7:41 p.m. UTC | #4
On Tue, Feb 14, 2023 at 04:19:13PM +0000, Michael Kelley (LINUX) wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > 
> > And have it preset.
> > This change allows to significantly reduce time to bring up guest SMP
> > configuration as well as make sure the guest won't get inaccurate
> > calibration results due to "noisy neighbour" situation.
> > 
> > Below are the numbers for 16 VCPU guest before the patch (~1300 msec)
> > 
> > [    0.562938] x86: Booting SMP configuration:
> > ...
> > [    1.859447] smp: Brought up 1 node, 16 CPUs
> > 
> > and after the patch (~130 msec):
> > 
> > [    0.445079] x86: Booting SMP configuration:
> > ...
> > [    0.575035] smp: Brought up 1 node, 16 CPUs
> > 
> > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > paravirt function to calculate cpu khz").
> 
> This patch has been nagging at me a bit, and I finally did some further
> checking.   Looking at Linux guests on local Hyper-V and in Azure, I see
> a dmesg output line like this during boot:
> 
> Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)
> 
> We're already skipping the delay loop calculation because lpj_fine
> is set in tsc_init(), using the results of get_loops_per_jiffy().  The
> latter does exactly the same calculation as hv_preset_lpj() in
> this patch.
> 
> Is this patch arising from an environment where tsc_init() is
> skipped for some reason?  Just trying to make sure we fully
> understand when this patch is applicable, and when not.
> 

The problem here is a bit different: "lpj_fine" is considered only for
the boot CPU (from init/calibrate.c):

        } else if ((!printed) && lpj_fine) {
                lpj = lpj_fine;
                pr_info("Calibrating delay loop (skipped), "
                        "value calculated using timer frequency.. ");

while all the secondary ones use the timer to calibrate.

With this change preset_lpj will be used for all cores (from
init/calibrate.c):

        } else if (preset_lpj) {
                lpj = preset_lpj;
                if (!printed)
                        pr_info("Calibrating delay loop (skipped) "
                                "preset value.. ");

This logic with lpj_fine comes from commit 3da757daf86e ("x86: use
cpu_khz for loops_per_jiffy calculation"), where the commit message
states the following:

    We do this only for the boot processor because the AP's can have
    different base frequencies or the BIOS might boot a AP at a different
    frequency.
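
The selection order described above can be modelled in a few lines of
userspace C.  This is only a simplified sketch: the enum, the function
pick_lpj_source() and its parameters are hypothetical, and the branch order
(per-CPU value, then preset_lpj, then lpj_fine on the first call only, then a
known value, then timer-based measurement) is an assumption about
init/calibrate.c rather than a copy of it.

```c
#include <stdbool.h>

/* Where a CPU's loops-per-jiffy value comes from in this model. */
enum lpj_source {
	LPJ_ALREADY_CALIBRATED,	/* this CPU already has a value */
	LPJ_PRESET,		/* preset_lpj, applies to every CPU */
	LPJ_FINE,		/* lpj_fine, first caller (boot CPU) only */
	LPJ_KNOWN,		/* value copied from an already calibrated CPU */
	LPJ_MEASURED,		/* slow path: timer-based calibration */
};

/*
 * Simplified model of the if/else chain in calibrate_delay().
 * first_call stands in for "!printed", which is true only for the boot CPU.
 */
enum lpj_source pick_lpj_source(unsigned long cpu_lpj,
				unsigned long preset_lpj,
				unsigned long lpj_fine,
				unsigned long known_lpj,
				bool first_call)
{
	if (cpu_lpj)
		return LPJ_ALREADY_CALIBRATED;
	if (preset_lpj)
		return LPJ_PRESET;
	if (first_call && lpj_fine)
		return LPJ_FINE;
	if (known_lpj)
		return LPJ_KNOWN;
	return LPJ_MEASURED;
}
```

In this model a secondary CPU (first_call == false) with only lpj_fine set
ends up in LPJ_MEASURED, while setting preset_lpj moves every CPU to
LPJ_PRESET.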

Hope this helps.

Thanks,
Stanislav

> Michael
> 
> > 
> > Signed-off-by: Stanislav Kinsburskiy <stanislav.kinsburskiy@gmail.com>
> > CC: "K. Y. Srinivasan" <kys@microsoft.com>
> > CC: Haiyang Zhang <haiyangz@microsoft.com>
> > CC: Wei Liu <wei.liu@kernel.org>
> > CC: Dexuan Cui <decui@microsoft.com>
> > CC: Thomas Gleixner <tglx@linutronix.de>
> > CC: Ingo Molnar <mingo@redhat.com>
> > CC: Borislav Petkov <bp@alien8.de>
> > CC: Dave Hansen <dave.hansen@linux.intel.com>
> > CC: x86@kernel.org
> > CC: "H. Peter Anvin" <hpa@zytor.com>
> > CC: linux-hyperv@vger.kernel.org
> > CC: linux-kernel@vger.kernel.org
> > ---
> >  arch/x86/kernel/cpu/mshyperv.c |   21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > index dedec2f23ad1..0282b2e96cc2 100644
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -320,6 +320,21 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
> >  }
> >  #endif
> > 
> > +static void __init __maybe_unused hv_preset_lpj(void)
> > +{
> > +	unsigned long khz;
> > +	u64 lpj;
> > +
> > +	if (!x86_platform.calibrate_tsc)
> > +		return;
> > +
> > +	khz = x86_platform.calibrate_tsc();
> > +
> > +	lpj = ((u64)khz * 1000);
> > +	do_div(lpj, HZ);
> > +	preset_lpj = lpj;
> > +}
> > +
> >  static void __init ms_hyperv_init_platform(void)
> >  {
> >  	int hv_max_functions_eax;
> > @@ -521,6 +536,12 @@ static void __init ms_hyperv_init_platform(void)
> > 
> >  	/* Register Hyper-V specific clocksource */
> >  	hv_init_clocksource();
> > +
> > +	/*
> > +	 * Preset lpj to make calibrate_delay a no-op, which in turn helps to
> > +	 * speed up secondary cores initialization.
> > +	 */
> > +	hv_preset_lpj();
> >  #endif
> >  	/*
> >  	 * TSC should be marked as unstable only after Hyper-V
> > 
>
  
Michael Kelley (LINUX) Feb. 17, 2023, 2:34 a.m. UTC | #5
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, February 16, 2023 11:41 AM
> 
> On Tue, Feb 14, 2023 at 04:19:13PM +0000, Michael Kelley (LINUX) wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > >
> > > And have it preset.
> > > This change allows to significantly reduce time to bring up guest SMP
> > > configuration as well as make sure the guest won't get inaccurate
> > > calibration results due to "noisy neighbour" situation.
> > >
> > > Below are the numbers for 16 VCPU guest before the patch (~1300 msec)
> > >
> > > [    0.562938] x86: Booting SMP configuration:
> > > ...
> > > [    1.859447] smp: Brought up 1 node, 16 CPUs
> > >
> > > and after the patch (~130 msec):
> > >
> > > [    0.445079] x86: Booting SMP configuration:
> > > ...
> > > [    0.575035] smp: Brought up 1 node, 16 CPUs
> > >
> > > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > > paravirt function to calculate cpu khz").
> >
> > This patch has been nagging at me a bit, and I finally did some further
> > checking.   Looking at Linux guests on local Hyper-V and in Azure, I see
> > a dmesg output line like this during boot:
> >
> > Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)
> >
> > We're already skipping the delay loop calculation because lpj_fine
> > is set in tsc_init(), using the results of get_loops_per_jiffy().  The
> > latter does exactly the same calculation as hv_preset_lpj() in
> > this patch.
> >
> > Is this patch arising from an environment where tsc_init() is
> > skipped for some reason?  Just trying to make sure we fully
> > understand when this patch is applicable, and when not.
> >
> 
> The problem here is a bit different: "lpj_fine" is considered only for
> the boot CPU (from init/calibrate.c):
> 
>         } else if ((!printed) && lpj_fine) {
>                 lpj = lpj_fine;
>                 pr_info("Calibrating delay loop (skipped), "
>                         "value calculated using timer frequency.. ");
> 
> while all the secondary ones use the timer to calibrate.
> 
> With this change preset_lpj will be used for all cores (from
> init/calibrate.c):
> 
>         } else if (preset_lpj) {
>                 lpj = preset_lpj;
>                 if (!printed)
>                         pr_info("Calibrating delay loop (skipped) "
>                                 "preset value.. ");
> 
> This logic with lpj_fine comes from commit 3da757daf86e ("x86: use
> cpu_khz for loops_per_jiffy calculation"), where the commit message
> states the following:
> 
>     We do this only for the boot processor because the AP's can have
>     different base frequencies or the BIOS might boot a AP at a different
>     frequency.
> 
> Hope this helps.
> 

Indeed, you are right about lpj_fine being applied only to the boot
CPU.  So I've looked a little closer because I don't see the 1300
milliseconds you see for a 16 vCPU guest.

I've been experimenting with a 32 vCPU guest, and without your
patch, it takes only 26 milliseconds to get all 32 vCPUs started.  I
think the trick is in the call to calibrate_delay_is_known().  This
function copies the lpj value from a CPU in the same NUMA node
that has already been calibrated, assuming that constant_tsc is
set, which is the case in my test VM.  So the boot CPU sets lpj
based on lpj_fine, and all other CPUs effectively copy the value
from the boot CPU without doing calibration.
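
The copy behaviour described above can be illustrated with a small userspace
model.  Everything here is hypothetical (struct cpu_info, its fields and
known_lpj()); in particular, the "group" field stands for whatever topology
grouping the kernel actually uses, which is described in this thread as the
NUMA node but is not shown in code.

```c
#include <stddef.h>

/* Hypothetical per-CPU record for this model. */
struct cpu_info {
	int group;		/* topology group, e.g. NUMA node */
	unsigned long lpj;	/* 0 means "not calibrated yet" */
};

/*
 * Return the lpj of an already calibrated CPU in the same group as `cpu`,
 * or 0 if there is none or if the TSC is not constant.  A return value of 0
 * means the caller has to fall back to real calibration.
 */
unsigned long known_lpj(const struct cpu_info *cpus, size_t ncpus,
			size_t cpu, int constant_tsc)
{
	if (!constant_tsc)
		return 0;

	for (size_t i = 0; i < ncpus; i++) {
		if (i != cpu && cpus[i].group == cpus[cpu].group && cpus[i].lpj)
			return cpus[i].lpj;
	}
	return 0;
}
```

In this model, the first CPU brought up in each group finds no calibrated
sibling and has to calibrate, which would match one calibration per NUMA node.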

I also experimented with multiple NUMA nodes.  In that case, it
does take longer.  Dividing the 32 vCPUs into 4 NUMA nodes,
it takes about 210 milliseconds to boot all 32 vCPUs.  Presumably the
extra time is due to timer-based calibration being done once for each
NUMA node, plus probably some misc NUMA accounting overhead.
With preset_lpj set, that 210 milliseconds drops to 32 milliseconds,
which is more like the case with only 1 NUMA node, so there's some
modest benefit with multiple NUMA nodes.

Could you check if constant_tsc is set in your test environment?  It
really should be set in a Hyper-V VM.

Michael
  
Stanislav Kinsburskii Feb. 17, 2023, 10:07 p.m. UTC | #6
On Fri, Feb 17, 2023 at 02:34:21AM +0000, Michael Kelley (LINUX) wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, February 16, 2023 11:41 AM
> > 
> > On Tue, Feb 14, 2023 at 04:19:13PM +0000, Michael Kelley (LINUX) wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > >
> > > > And have it preset.
> > > > This change allows to significantly reduce time to bring up guest SMP
> > > > configuration as well as make sure the guest won't get inaccurate
> > > > calibration results due to "noisy neighbour" situation.
> > > >
> > > > Below are the numbers for 16 VCPU guest before the patch (~1300 msec)
> > > >
> > > > [    0.562938] x86: Booting SMP configuration:
> > > > ...
> > > > [    1.859447] smp: Brought up 1 node, 16 CPUs
> > > >
> > > > and after the patch (~130 msec):
> > > >
> > > > [    0.445079] x86: Booting SMP configuration:
> > > > ...
> > > > [    0.575035] smp: Brought up 1 node, 16 CPUs
> > > >
> > > > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > > > paravirt function to calculate cpu khz").
> > >
> > > This patch has been nagging at me a bit, and I finally did some further
> > > checking.   Looking at Linux guests on local Hyper-V and in Azure, I see
> > > a dmesg output line like this during boot:
> > >
> > > Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)
> > >
> > > We're already skipping the delay loop calculation because lpj_fine
> > > is set in tsc_init(), using the results of get_loops_per_jiffy().  The
> > > latter does exactly the same calculation as hv_preset_lpj() in
> > > this patch.
> > >
> > > Is this patch arising from an environment where tsc_init() is
> > > skipped for some reason?  Just trying to make sure we fully
> > > understand when this patch is applicable, and when not.
> > >
> > 
> > The problem here is a bit different: "lpj_fine" is considered only for
> > the boot CPU (from init/calibrate.c):
> > 
> >         } else if ((!printed) && lpj_fine) {
> >                 lpj = lpj_fine;
> >                 pr_info("Calibrating delay loop (skipped), "
> >                         "value calculated using timer frequency.. ");
> > 
> > while all the secondary ones use the timer to calibrate.
> > 
> > With this change preset_lpj will be used for all cores (from
> > init/calibrate.c):
> > 
> >         } else if (preset_lpj) {
> >                 lpj = preset_lpj;
> >                 if (!printed)
> >                         pr_info("Calibrating delay loop (skipped) "
> >                                 "preset value.. ");
> > 
> > This logic with lpj_fine comes from commit 3da757daf86e ("x86: use
> > cpu_khz for loops_per_jiffy calculation"), where the commit message
> > states the following:
> > 
> >     We do this only for the boot processor because the AP's can have
> >     different base frequencies or the BIOS might boot a AP at a different
> >     frequency.
> > 
> > Hope this helps.
> > 
> 
> Indeed, you are right about lpj_fine being applied only to the boot
> CPU.  So I've looked a little closer because I don't see the 1300
> milliseconds you see for a 16 vCPU guest.
> 
> I've been experimenting with a 32 vCPU guest, and without your
> patch, it takes only 26 milliseconds to get all 32 vCPUs started.  I
> think the trick is in the call to calibrate_delay_is_known().  This
> function copies the lpj value from a CPU in the same NUMA node
> that has already been calibrated, assuming that constant_tsc is
> set, which is the case in my test VM.  So the boot CPU sets lpj
> based on lpj_fine, and all other CPUs effectively copy the value
> from the boot CPU without doing calibration.
> 
> I also experimented with multiple NUMA nodes.  In that case, it
> does take longer.  Dividing the 32 vCPUs into 4 NUMA nodes,
> it takes about 210 milliseconds to boot all 32 vCPUs.  Presumably the
> extra time is due to timer-based calibration being done once for each
> NUMA node, plus probably some misc NUMA accounting overhead.
> With preset_lpj set, that 210 milliseconds drops to 32 milliseconds,
> which is more like the case with only 1 NUMA node, so there's some
> modest benefit with multiple NUMA nodes.
> 
> Could you check if constant_tsc is set in your test environment?  It
> really should be set in a Hyper-V VM.
> 

I guess I should have mentioned that the results presented in the
commit message are from an L2 guest, where there are no NUMA nodes, so
every core is calibrated individually and boot time grows linearly
with the number of cores assigned.
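
The linear growth can be put into rough numbers using the two dmesg excerpts
from the commit message.  The sketch below is plain arithmetic on those
timestamps; the helper names are hypothetical, and it assumes the whole
interval is spent bringing up the 15 secondary CPUs one after another, which
the logs alone do not prove.

```c
#include <stdint.h>

/* Elapsed time between two dmesg timestamps, both given in microseconds. */
uint64_t elapsed_us(uint64_t start_us, uint64_t end_us)
{
	return end_us - start_us;
}

/*
 * Average bring-up cost per secondary CPU in microseconds (integer
 * division), assuming the boot CPU is already running so only ncpus - 1
 * CPUs contribute.  Requires ncpus >= 2.
 */
uint64_t per_secondary_us(uint64_t total_us, unsigned int ncpus)
{
	return total_us / (ncpus - 1);
}
```

For the 16-vCPU guest this gives 1296509 us in total before the patch (about
86 ms per secondary CPU) and 129956 us after it (about 8.7 ms per secondary
CPU).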

I'm not sure, though, whether NUMA emulation would be the right choice
here, or whether this boot time penalty should be left as is because we
can't guarantee that all the processors are in the same NUMA node and
thus their lpj values have to be measured.

What do you think, Michael?

Thanks,
Stanislav

> Michael
  

Patch

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index dedec2f23ad1..0282b2e96cc2 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -320,6 +320,21 @@  static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
 }
 #endif
 
+static void __init __maybe_unused hv_preset_lpj(void)
+{
+	unsigned long khz;
+	u64 lpj;
+
+	if (!x86_platform.calibrate_tsc)
+		return;
+
+	khz = x86_platform.calibrate_tsc();
+
+	lpj = ((u64)khz * 1000);
+	do_div(lpj, HZ);
+	preset_lpj = lpj;
+}
+
 static void __init ms_hyperv_init_platform(void)
 {
 	int hv_max_functions_eax;
@@ -521,6 +536,12 @@  static void __init ms_hyperv_init_platform(void)
 
 	/* Register Hyper-V specific clocksource */
 	hv_init_clocksource();
+
+	/*
+	 * Preset lpj to make calibrate_delay a no-op, which in turn helps to
+	 * speed up secondary cores initialization.
+	 */
+	hv_preset_lpj();
 #endif
 	/*
 	 * TSC should be marked as unstable only after Hyper-V