[RFC,1/1] KVM: x86: add param to update master clock periodically

Message ID 20230926230649.67852-1-dongli.zhang@oracle.com
State New
Series [RFC,1/1] KVM: x86: add param to update master clock periodically

Commit Message

Dongli Zhang Sept. 26, 2023, 11:06 p.m. UTC
  This is to minimize the kvmclock drift during CPU hotplug (or when the
master clock and pvclock_vcpu_time_info are updated). The drift arises
because kvmclock and raw monotonic (tsc) use different
equations/mult/shift values to calculate how many nanoseconds have
passed for a given tsc value.

The calculation of the kvmclock is based on the pvclock_vcpu_time_info
provided by the host side.

struct pvclock_vcpu_time_info {
	u32   version;
	u32   pad0;
	u64   tsc_timestamp;     --> by host raw monotonic
	u64   system_time;       --> by host raw monotonic
	u32   tsc_to_system_mul; --> by host KVM
	s8    tsc_shift;         --> by host KVM
	u8    flags;
	u8    pad[2];
} __attribute__((__packed__));

To calculate the current guest kvmclock:

1. Obtain the guest tsc: tsc = rdtsc().

2. If tsc_shift < 0:
    tmp = tsc >> -tsc_shift
   If tsc_shift >= 0:
    tmp = tsc << tsc_shift

3. The kvmclock value will be: (tmp * tsc_to_system_mul) >> 32

Therefore, the current kvmclock will be either:

(rdtsc() >> -tsc_shift) * tsc_to_system_mul >> 32

... or ...

(rdtsc() << tsc_shift) * tsc_to_system_mul >> 32

The 'tsc_to_system_mul' and 'tsc_shift' are calculated by the host KVM.
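
For reference, a condensed sketch of the guest-side read. This mirrors
pvclock_scale_delta() and __pvclock_read_cycles() in the kernel; the
real code uses a hand-rolled 64x32->96-bit multiply rather than
__int128, and 'scale_delta'/'kvmclock_read' are illustrative names. It
also makes explicit the system_time/tsc_timestamp base that the
simplified formulas above omit:

static u64 scale_delta(u64 delta, u32 mul_frac, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* (delta * mul_frac) >> 32, without losing the high bits */
	return (u64)(((unsigned __int128)delta * mul_frac) >> 32);
}

static u64 kvmclock_read(const struct pvclock_vcpu_time_info *src)
{
	u64 delta = rdtsc() - src->tsc_timestamp;

	return src->system_time +
	       scale_delta(delta, src->tsc_to_system_mul, src->tsc_shift);
}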

When the master clock is actively used, the 'tsc_timestamp' and
'system_time' are derived from the host raw monotonic time, which is
calculated based on the 'mult' and 'shift' of clocksource_tsc:

elapsed_time = (tsc * mult) >> shift
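
For reference, a minimal sketch of that conversion as implemented by
clocksource_cyc2ns() (the real timekeeping code also tracks a base
cycle count and accumulated nanoseconds, omitted here):

static inline u64 cyc2ns(u64 cycles, u32 mult, u32 shift)
{
	return ((u64)cycles * mult) >> shift;
}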

Since kvmclock and raw monotonic (clocksource_tsc) use different
equations/mult/shift values to convert the tsc to nanoseconds, there may
be a clock drift issue during CPU hotplug (when the master clock is
updated).

1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
(suppose the master clock is used).

2. Since the master clock is never updated, the periodic kvmclock_sync_work
does not update the values in 'pvclock_vcpu_time_info'.

3. Suppose a very long period has passed (e.g., 30 days).

4. The user adds another vcpu. Both master clock and
'pvclock_vcpu_time_info' are updated, based on the raw monotonic.

(Ideally, we expect the update to be based on 'tsc_to_system_mul' and
'tsc_shift' from kvmclock.)


Because kvmclock and raw monotonic (clocksource_tsc) use different
equations/mult/shift values to convert the tsc to nanoseconds, there will
be drift between:

(1) kvmclock based on current rdtsc and old 'pvclock_vcpu_time_info'
(2) kvmclock based on current rdtsc and new 'pvclock_vcpu_time_info'

In testing, there is a drift of 4502145 ns between (1) and (2) after
about 40 hours. The longer the elapsed time, the larger the drift.
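
To put that in perspective, 4502145 ns over ~40 hours corresponds to an
effective frequency difference of only ~31 parts per billion between the
two mult/shift pairs. A self-contained toy program illustrates how such
a small rounding difference accumulates; the mult/shift values below are
made up, chosen only to round the same nominal 0.4 ns/cycle rate
differently, and are not taken from any real host:

#include <stdint.h>
#include <stdio.h>

static uint64_t scale(uint64_t cycles, uint32_t mult, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}

int main(void)
{
	/* 40 hours of cycles on a hypothetical 2.5 GHz constant TSC */
	uint64_t tsc = 2500000000ull * 40 * 3600;

	/* two roundings of 0.4 ns/cycle: 0.4 * 2^32 and 0.4 * 2^26 */
	uint64_t ns_a = scale(tsc, 1717986918u, 32); /* kvmclock-style    */
	uint64_t ns_b = scale(tsc, 26843546u, 26);   /* clocksource-style */

	/* prints ~2.2 ms of drift, the same order of magnitude as above */
	printf("drift = %lld ns\n", (long long)(ns_b - ns_a));
	return 0;
}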

This adds a module parameter to allow the user to configure how often to
refresh the master clock, in order to reduce the kvmclock drift according
to the user's requirements (e.g., from every 5 minutes to every day). The
more often the master clock is refreshed, the smaller the kvmclock drift
during vcpu hotplug.
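
A hypothetical usage sketch, assuming this patch is applied: the
parameter lives in kvm.ko, and the period is in seconds, since the work
is rescheduled every masterclock_sync_period * HZ jiffies.

# refresh the master clock every 5 minutes
modprobe kvm masterclock_sync_period=300

# or, if kvm is built in, on the kernel command line:
#   kvm.masterclock_sync_period=300

# the value is read-only (0444) at runtime
cat /sys/module/kvm/parameters/masterclock_sync_period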

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
Other options are to update the masterclock in:
- kvmclock_sync_work, or
- pvclock_gtod_notify()

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 34 +++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)
  

Comments

Joe Jin Sept. 27, 2023, 12:29 a.m. UTC | #1
On 9/26/23 4:06 PM, Dongli Zhang wrote:
> This is to minimize the kvmclock drift during CPU hotplug (or when the
> master clock and pvclock_vcpu_time_info are updated). The drift is
> because kvmclock and raw monotonic (tsc) use different
> equation/mult/shift to calculate that how many nanoseconds (given the tsc
> as input) has passed.
>
> The calculation of the kvmclock is based on the pvclock_vcpu_time_info
> provided by the host side.
>
> struct pvclock_vcpu_time_info {
> 	u32   version;
> 	u32   pad0;
> 	u64   tsc_timestamp;     --> by host raw monotonic
> 	u64   system_time;       --> by host raw monotonic
> 	u32   tsc_to_system_mul; --> by host KVM
> 	s8    tsc_shift;         --> by host KVM
> 	u8    flags;
> 	u8    pad[2];
> } __attribute__((__packed__));
>
> To calculate the current guest kvmclock:
>
> 1. Obtain the tsc = rdtsc() of guest.
>
> 2. If shift < 0:
>     tmp = tsc >> tsc_shift
>    if shift > 0:
>     tmp = tsc << tsc_shift
>
> 3. The kvmclock value will be: (tmp * tsc_to_system_mul) >> 32
>
> Therefore, the current kvmclock will be either:
>
> (rdtsc() >> tsc_shift) * tsc_to_system_mul >> 32
>
> ... or ...
>
> (rdtsc() << tsc_shift) * tsc_to_system_mul >> 32
>
> The 'tsc_to_system_mul' and 'tsc_shift' are calculated by the host KVM.
>
> When the master clock is actively used, the 'tsc_timestamp' and
> 'system_time' are derived from the host raw monotonic time, which is
> calculated based on the 'mult' and 'shift' of clocksource_tsc:
>
> elapsed_time = (tsc * mult) >> shift
>
> Since kvmclock and raw monotonic (clocksource_tsc) use different
> equation/mult/shift to convert the tsc to nanosecond, there may be clock
> drift issue during CPU hotplug (when the master clock is updated).
>
> 1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
> (suppose the master clock is used).
>
> 2. Since the master clock is never updated, the periodic kvmclock_sync_work
> does not update the values in 'pvclock_vcpu_time_info'.
>
> 3. Suppose a very long period has passed (e.g., 30-day).
>
> 4. The user adds another vcpu. Both master clock and
> 'pvclock_vcpu_time_info' are updated, based on the raw monotonic.
>
> (Ideally, we expect the update is based on 'tsc_to_system_mul' and
> 'tsc_shift' from kvmclock).
>
>
> Because kvmclock and raw monotonic (clocksource_tsc) use different
> equation/mult/shift to convert the tsc to nanosecond, there will be drift
> between:
>
> (1) kvmclock based on current rdtsc and old 'pvclock_vcpu_time_info'
> (2) kvmclock based on current rdtsc and new 'pvclock_vcpu_time_info'
>
> According to the test, there is a drift of 4502145ns between (1) and (2)
> after about 40 hours. The longer the time, the large the drift.
>
> This is to add a module param to allow the user to configure for how often
> to refresh the master clock, in order to reduce the kvmclock drift based on
> user requirement (e.g., every 5-min to every day). The more often that the
> master clock is refreshed, the smaller the kvmclock drift during the vcpu
> hotplug.
>
> Cc: Joe Jin <joe.jin@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> Other options are to update the masterclock in:
> - kvmclock_sync_work, or
> - pvclock_gtod_notify()
>
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/x86.c              | 34 +++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 17715cb8731d..57409dce5d73 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1331,6 +1331,7 @@ struct kvm_arch {
>  	u64 master_cycle_now;
>  	struct delayed_work kvmclock_update_work;
>  	struct delayed_work kvmclock_sync_work;
> +	struct delayed_work masterclock_sync_work;
>  
>  	struct kvm_xen_hvm_config xen_hvm_config;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9f18b06bbda6..0b71dc3785eb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  static bool __read_mostly kvmclock_periodic_sync = true;
>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>  
> +unsigned int __read_mostly masterclock_sync_period;
> +module_param(masterclock_sync_period, uint, 0444);

Can the mode be 0644, to allow it to be changed at runtime?

Thanks,
Joe
> +
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> @@ -3298,6 +3301,31 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  					KVMCLOCK_SYNC_PERIOD);
>  }
>  
> +static void masterclock_sync_fn(struct work_struct *work)
> +{
> +	unsigned long i;
> +	struct delayed_work *dwork = to_delayed_work(work);
> +	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> +					   masterclock_sync_work);
> +	struct kvm *kvm = container_of(ka, struct kvm, arch);
> +	struct kvm_vcpu *vcpu;
> +
> +	if (!masterclock_sync_period)
> +		return;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		/*
> +		 * It is not required to kick the vcpu because it is not
> +		 * expected to update the master clock immediately.
> +		 */
> +		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> +	}
> +
> +	schedule_delayed_work(&ka->masterclock_sync_work,
> +			      masterclock_sync_period * HZ);
> +}
> +
> +
>  /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
>  static bool is_mci_control_msr(u32 msr)
>  {
> @@ -11970,6 +11998,10 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
>  	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
>  		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
>  						KVMCLOCK_SYNC_PERIOD);
> +
> +	if (masterclock_sync_period && vcpu->vcpu_idx == 0)
> +		schedule_delayed_work(&kvm->arch.masterclock_sync_work,
> +				      masterclock_sync_period * HZ);
>  }
>  
>  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
> @@ -12344,6 +12376,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
> +	INIT_DELAYED_WORK(&kvm->arch.masterclock_sync_work, masterclock_sync_fn);
>  
>  	kvm_apicv_init(kvm);
>  	kvm_hv_init_vm(kvm);
> @@ -12383,6 +12416,7 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
>  
>  void kvm_arch_sync_events(struct kvm *kvm)
>  {
> +	cancel_delayed_work_sync(&kvm->arch.masterclock_sync_work);
>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
>  	kvm_free_pit(kvm);
  
Dongli Zhang Sept. 27, 2023, 12:36 a.m. UTC | #2
Hi Joe,

On 9/26/23 17:29, Joe Jin wrote:
> On 9/26/23 4:06 PM, Dongli Zhang wrote:
>> This is to minimize the kvmclock drift during CPU hotplug (or when the
>> master clock and pvclock_vcpu_time_info are updated). The drift is
>> because kvmclock and raw monotonic (tsc) use different
>> equation/mult/shift to calculate that how many nanoseconds (given the tsc
>> as input) has passed.
>>
>> The calculation of the kvmclock is based on the pvclock_vcpu_time_info
>> provided by the host side.
>>
>> struct pvclock_vcpu_time_info {
>> 	u32   version;
>> 	u32   pad0;
>> 	u64   tsc_timestamp;     --> by host raw monotonic
>> 	u64   system_time;       --> by host raw monotonic
>> 	u32   tsc_to_system_mul; --> by host KVM
>> 	s8    tsc_shift;         --> by host KVM
>> 	u8    flags;
>> 	u8    pad[2];
>> } __attribute__((__packed__));
>>
>> To calculate the current guest kvmclock:
>>
>> 1. Obtain the tsc = rdtsc() of guest.
>>
>> 2. If shift < 0:
>>     tmp = tsc >> tsc_shift
>>    if shift > 0:
>>     tmp = tsc << tsc_shift
>>
>> 3. The kvmclock value will be: (tmp * tsc_to_system_mul) >> 32
>>
>> Therefore, the current kvmclock will be either:
>>
>> (rdtsc() >> tsc_shift) * tsc_to_system_mul >> 32
>>
>> ... or ...
>>
>> (rdtsc() << tsc_shift) * tsc_to_system_mul >> 32
>>
>> The 'tsc_to_system_mul' and 'tsc_shift' are calculated by the host KVM.
>>
>> When the master clock is actively used, the 'tsc_timestamp' and
>> 'system_time' are derived from the host raw monotonic time, which is
>> calculated based on the 'mult' and 'shift' of clocksource_tsc:
>>
>> elapsed_time = (tsc * mult) >> shift
>>
>> Since kvmclock and raw monotonic (clocksource_tsc) use different
>> equation/mult/shift to convert the tsc to nanosecond, there may be clock
>> drift issue during CPU hotplug (when the master clock is updated).
>>
>> 1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
>> (suppose the master clock is used).
>>
>> 2. Since the master clock is never updated, the periodic kvmclock_sync_work
>> does not update the values in 'pvclock_vcpu_time_info'.
>>
>> 3. Suppose a very long period has passed (e.g., 30-day).
>>
>> 4. The user adds another vcpu. Both master clock and
>> 'pvclock_vcpu_time_info' are updated, based on the raw monotonic.
>>
>> (Ideally, we expect the update is based on 'tsc_to_system_mul' and
>> 'tsc_shift' from kvmclock).
>>
>>
>> Because kvmclock and raw monotonic (clocksource_tsc) use different
>> equation/mult/shift to convert the tsc to nanosecond, there will be drift
>> between:
>>
>> (1) kvmclock based on current rdtsc and old 'pvclock_vcpu_time_info'
>> (2) kvmclock based on current rdtsc and new 'pvclock_vcpu_time_info'
>>
>> According to the test, there is a drift of 4502145ns between (1) and (2)
>> after about 40 hours. The longer the time, the large the drift.
>>
>> This is to add a module param to allow the user to configure for how often
>> to refresh the master clock, in order to reduce the kvmclock drift based on
>> user requirement (e.g., every 5-min to every day). The more often that the
>> master clock is refreshed, the smaller the kvmclock drift during the vcpu
>> hotplug.
>>
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>> Other options are to update the masterclock in:
>> - kvmclock_sync_work, or
>> - pvclock_gtod_notify()
>>
>>  arch/x86/include/asm/kvm_host.h |  1 +
>>  arch/x86/kvm/x86.c              | 34 +++++++++++++++++++++++++++++++++
>>  2 files changed, 35 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 17715cb8731d..57409dce5d73 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1331,6 +1331,7 @@ struct kvm_arch {
>>  	u64 master_cycle_now;
>>  	struct delayed_work kvmclock_update_work;
>>  	struct delayed_work kvmclock_sync_work;
>> +	struct delayed_work masterclock_sync_work;
>>  
>>  	struct kvm_xen_hvm_config xen_hvm_config;
>>  
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 9f18b06bbda6..0b71dc3785eb 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>>  static bool __read_mostly kvmclock_periodic_sync = true;
>>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>>  
>> +unsigned int __read_mostly masterclock_sync_period;
>> +module_param(masterclock_sync_period, uint, 0444);
> 
> Can the mode be 0644 and allow it be changed at runtime?

It can be RW.

So far I just copied from kvmclock_periodic_sync, as most of the code
comes from the kvmclock_periodic_sync mechanism.


static bool __read_mostly kvmclock_periodic_sync = true;
module_param(kvmclock_periodic_sync, bool, S_IRUGO);


Thank you very much!

Dongli Zhang

> 
> Thanks,
> Joe
>> +
>>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
>> @@ -3298,6 +3301,31 @@ static void kvmclock_sync_fn(struct work_struct *work)
>>  					KVMCLOCK_SYNC_PERIOD);
>>  }
>>  
>> +static void masterclock_sync_fn(struct work_struct *work)
>> +{
>> +	unsigned long i;
>> +	struct delayed_work *dwork = to_delayed_work(work);
>> +	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
>> +					   masterclock_sync_work);
>> +	struct kvm *kvm = container_of(ka, struct kvm, arch);
>> +	struct kvm_vcpu *vcpu;
>> +
>> +	if (!masterclock_sync_period)
>> +		return;
>> +
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		/*
>> +		 * It is not required to kick the vcpu because it is not
>> +		 * expected to update the master clock immediately.
>> +		 */
>> +		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>> +	}
>> +
>> +	schedule_delayed_work(&ka->masterclock_sync_work,
>> +			      masterclock_sync_period * HZ);
>> +}
>> +
>> +
>>  /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
>>  static bool is_mci_control_msr(u32 msr)
>>  {
>> @@ -11970,6 +11998,10 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
>>  	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
>>  		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
>>  						KVMCLOCK_SYNC_PERIOD);
>> +
>> +	if (masterclock_sync_period && vcpu->vcpu_idx == 0)
>> +		schedule_delayed_work(&kvm->arch.masterclock_sync_work,
>> +				      masterclock_sync_period * HZ);
>>  }
>>  
>>  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>> @@ -12344,6 +12376,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>  
>>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
>> +	INIT_DELAYED_WORK(&kvm->arch.masterclock_sync_work, masterclock_sync_fn);
>>  
>>  	kvm_apicv_init(kvm);
>>  	kvm_hv_init_vm(kvm);
>> @@ -12383,6 +12416,7 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
>>  
>>  void kvm_arch_sync_events(struct kvm *kvm)
>>  {
>> +	cancel_delayed_work_sync(&kvm->arch.masterclock_sync_work);
>>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
>>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
>>  	kvm_free_pit(kvm);
>
  
Sean Christopherson Sept. 28, 2023, 4:18 p.m. UTC | #3
On Tue, Sep 26, 2023, Dongli Zhang wrote:
> Hi Joe,
> 
> On 9/26/23 17:29, Joe Jin wrote:
> > On 9/26/23 4:06 PM, Dongli Zhang wrote:
> >> This is to minimize the kvmclock drift during CPU hotplug (or when the
> >> master clock and pvclock_vcpu_time_info are updated).

Updated by who?

> >> Since kvmclock and raw monotonic (clocksource_tsc) use different
> >> equation/mult/shift to convert the tsc to nanosecond, there may be clock
> >> drift issue during CPU hotplug (when the master clock is updated).

Based on #4, I assume you mean "vCPU hotplug from the host", but from this and
the above it's not clear if this means "vCPU hotplug from the host", "pCPU hotplug
in the host", or "CPU hotplug in the guest".

> >> 1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
> >> (suppose the master clock is used).
> >>
> >> 2. Since the master clock is never updated, the periodic kvmclock_sync_work
> >> does not update the values in 'pvclock_vcpu_time_info'.
> >>
> >> 3. Suppose a very long period has passed (e.g., 30-day).
> >>
> >> 4. The user adds another vcpu. Both master clock and
> >> 'pvclock_vcpu_time_info' are updated, based on the raw monotonic.

So why can't KVM simply force a KVM_REQ_MASTERCLOCK_UPDATE request when a vCPU
is added?  I'm missing why this needs a persistent, periodic refresh.

> >> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> >>  static bool __read_mostly kvmclock_periodic_sync = true;
> >>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
> >>  
> >> +unsigned int __read_mostly masterclock_sync_period;
> >> +module_param(masterclock_sync_period, uint, 0444);
> > 
> > Can the mode be 0644 and allow it be changed at runtime?
> 
> It can be RW.
> 
> So far I just copy from kvmclock_periodic_sync as most code are from the
> mechanism of kvmclock_periodic_sync.
> 
> static bool __read_mostly kvmclock_periodic_sync = true;
> module_param(kvmclock_periodic_sync, bool, S_IRUGO);

Unless there's a very good reason for making it writable, I vote to keep it RO
to simplify the code.

> >>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
> >>  static u32 __read_mostly tsc_tolerance_ppm = 250;
> >>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> >> @@ -3298,6 +3301,31 @@ static void kvmclock_sync_fn(struct work_struct *work)
> >>  					KVMCLOCK_SYNC_PERIOD);
> >>  }
> >>  
> >> +static void masterclock_sync_fn(struct work_struct *work)
> >> +{
> >> +	unsigned long i;
> >> +	struct delayed_work *dwork = to_delayed_work(work);
> >> +	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> >> +					   masterclock_sync_work);
> >> +	struct kvm *kvm = container_of(ka, struct kvm, arch);
> >> +	struct kvm_vcpu *vcpu;
> >> +
> >> +	if (!masterclock_sync_period)

This function should never be called if masterclock_sync_period=0.  The param
is RO and so kvm_arch_vcpu_postcreate() shouldn't create the work in the first
place.

> >> +		return;
> >> +
> >> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> >> +		/*
> >> +		 * It is not required to kick the vcpu because it is not
> >> +		 * expected to update the master clock immediately.
> >> +		 */

This comment needs to explain *why* it's ok for vCPUs to lazily handle the
masterclock update.  Saying "it is not expected" doesn't help understand who/what
expects anything, or why.

> >> +		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> >> +	}
> >> +
> >> +	schedule_delayed_work(&ka->masterclock_sync_work,
> >> +			      masterclock_sync_period * HZ);
> >> +}
  
Dongli Zhang Sept. 29, 2023, 8:15 p.m. UTC | #4
Hi Sean,

On 9/28/23 09:18, Sean Christopherson wrote:
> On Tue, Sep 26, 2023, Dongli Zhang wrote:
>> Hi Joe,
>>
>> On 9/26/23 17:29, Joe Jin wrote:
>>> On 9/26/23 4:06 PM, Dongli Zhang wrote:
>>>> This is to minimize the kvmclock drift during CPU hotplug (or when the
>>>> master clock and pvclock_vcpu_time_info are updated).
> 
> Updated by who?

This is about vCPU hotplug, e.g., when we run "device_add
host-x86_64-cpu,id=core4,socket-id=0,core-id=4,thread-id=0" on the QEMU side.

When we add a new vCPU to KVM, the
kvm_synchronize_tsc()-->kvm_track_tsc_matching() triggers
KVM_REQ_MASTERCLOCK_UPDATE.

When the new vCPU is onlined for the first time on the VM side, the handler of
KVM_REQ_MASTERCLOCK_UPDATE (that is, kvm_update_masterclock()) updates the
master clock (e.g., master_kernel_ns and master_cycle_now, based on the host
raw monotonic clock).

The kvm_update_masterclock() finally triggers KVM_REQ_CLOCK_UPDATE so that the
master_kernel_ns and the master_cycle_now are propagated to:

- pvclock_vcpu_time_info->system_time
- pvclock_vcpu_time_info->tsc_timestamp

That is ...

- vcpu->hv_clock.system_time
- vcpu->hv_clock.tsc_timestamp

> 
>>>> Since kvmclock and raw monotonic (clocksource_tsc) use different
>>>> equation/mult/shift to convert the tsc to nanosecond, there may be clock
>>>> drift issue during CPU hotplug (when the master clock is updated).
> 
> Based on #4, I assume you mean "vCPU hotplug from the host", but from this and
> the above it's not clear if this means "vCPU hotplug from the host", "pCPU hotplug
> in the host", or "CPU hotplug in the guest".

It is about vCPU hotplug from the host (e.g., QEMU), although the
KVM_REQ_MASTERCLOCK_UPDATE handler (kvm_update_masterclock()) is triggered when
the new vCPU is onlined (usually via udev) on the VM side for the first time.

1. QEMU adds a new vCPU to the KVM VM.
2. The VM side uses udev or a manual echo to sysfs to online the vCPU.

> 
>>>> 1. The guest boots and all vcpus have the same 'pvclock_vcpu_time_info'
>>>> (suppose the master clock is used).
>>>>
>>>> 2. Since the master clock is never updated, the periodic kvmclock_sync_work
>>>> does not update the values in 'pvclock_vcpu_time_info'.
>>>>
>>>> 3. Suppose a very long period has passed (e.g., 30-day).
>>>>
>>>> 4. The user adds another vcpu. Both master clock and
>>>> 'pvclock_vcpu_time_info' are updated, based on the raw monotonic.
> 
> So why can't KVM simply force a KVM_REQ_MASTERCLOCK_UPDATE request when a vCPU
> is added?  I'm missing why this needs a persistent, periodic refresh.

Sorry for making the commit message confusing.

There is always a KVM_REQ_MASTERCLOCK_UPDATE request when a vCPU is added.
However, the problem is that only vCPU hotplug triggers
KVM_REQ_MASTERCLOCK_UPDATE (supposing no live migration). That is, generally
(e.g., without migration), there will be no master clock update if we do not
hot-add a vCPU during a long period of time.

We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.

This is because:

1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.

2. The raw monotonic clock (tsc_clocksource) uses a different
mult/shift/equation.

3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
different results (this is expected because they have different
mult/shift/equations).

4. However, the base values in the kvmclock calculation (tsc_timestamp and
system_time) are derived from the raw monotonic clock (master clock).

When tsc_timestamp and system_time are updated:

tsc_diff = tsc_timestamp_new - tsc_timestamp_old
system_time_new = system_time_old + (incremental from raw clock source) <--- (1)

However, kvmclock expects:

system_time_new = system_time_old + kvmclock_equation(tsc_diff) <--- (2)


There is a difference between (1) and (2). That is the reason for the
kvmclock drift when we add a new vCPU.

Indeed, the drift is between:

(3) incremental from raw clock source (that is, tsc_equation(tsc_diff)), and

(4) kvmclock_equation(tsc_diff)

The less frequently the master clock is updated, the larger the tsc_diff. As a
result, the larger the drift.
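
In pseudo-C (the names below are illustrative, not actual KVM variables;
pvclock_scale_delta() is the real guest-side scaling helper, while
cyc2ns() stands in for the clocksource conversion):

u64 tsc_diff = tsc_timestamp_new - tsc_timestamp_old;

/* (1) what the masterclock update does: advance system_time by the
 *     host raw-monotonic delta, i.e. tsc_equation(tsc_diff) */
system_time_new = system_time_old + cyc2ns(tsc_diff, raw_mult, raw_shift);

/* (2) what the guest computes from the old parameters, i.e.
 *     kvmclock_equation(tsc_diff) */
expected = system_time_old +
	   pvclock_scale_delta(tsc_diff, tsc_to_system_mul, tsc_shift);

/* the jump the guest observes at the update */
drift = expected - system_time_new;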


We would like to update the master clock more frequently to reduce the tsc_diff.

> 
>>>> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>>>>  static bool __read_mostly kvmclock_periodic_sync = true;
>>>>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>>>>  
>>>> +unsigned int __read_mostly masterclock_sync_period;
>>>> +module_param(masterclock_sync_period, uint, 0444);
>>>
>>> Can the mode be 0644 and allow it be changed at runtime?
>>
>> It can be RW.
>>
>> So far I just copy from kvmclock_periodic_sync as most code are from the
>> mechanism of kvmclock_periodic_sync.
>>
>> static bool __read_mostly kvmclock_periodic_sync = true;
>> module_param(kvmclock_periodic_sync, bool, S_IRUGO);
> 
> Unless there's a very good reason for making it writable, I vote to keep it RO
> to simplify the code.

I will keep it read-only. Thank you very much!

> 
>>>>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>>>>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>>>>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
>>>> @@ -3298,6 +3301,31 @@ static void kvmclock_sync_fn(struct work_struct *work)
>>>>  					KVMCLOCK_SYNC_PERIOD);
>>>>  }
>>>>  
>>>> +static void masterclock_sync_fn(struct work_struct *work)
>>>> +{
>>>> +	unsigned long i;
>>>> +	struct delayed_work *dwork = to_delayed_work(work);
>>>> +	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
>>>> +					   masterclock_sync_work);
>>>> +	struct kvm *kvm = container_of(ka, struct kvm, arch);
>>>> +	struct kvm_vcpu *vcpu;
>>>> +
>>>> +	if (!masterclock_sync_period)
> 
> This function should never be called if masterclock_sync_period=0.  The param
> is RO and so kvm_arch_vcpu_postcreate() shouldn't create the work in the first
> place.

Thank you very much for pointing that out. I just copied the code from
kvmclock_sync_fn() (although I had noticed that :) )

I think I may need to send a cleanup patch to remove lines 3296-3297 from the
existing code as well.

> 
>>>> +		return;
>>>> +
>>>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>>>> +		/*
>>>> +		 * It is not required to kick the vcpu because it is not
>>>> +		 * expected to update the master clock immediately.
>>>> +		 */
> 
> This comment needs to explain *why* it's ok for vCPUs to lazily handle the
> masterclock update.  Saying "it is not expected" doesn't help understand who/what
> expects anything, or why.

I will update the comment as:

The objective of updating the master clock is to reduce (but not to avoid) the
clock drift when there is a long period of time between two master clock
updates. The update is not expected to happen immediately; it is fine to wait
until the next vCPU entry.


Please let me know if any clarification is needed. I used the below patch to
help diagnose the drift issue.

@@ -3068,6 +3110,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	u64 tsc_timestamp, host_tsc;
 	u8 pvclock_flags;
 	bool use_master_clock;
+	struct pvclock_vcpu_time_info old_hv_clock;
+	u64 tsc_raw, tsc, old_ns, new_ns, diff;
+	bool backward;
+
+	memcpy(&old_hv_clock, &vcpu->hv_clock, sizeof(old_hv_clock));

 	kernel_ns = 0;
 	host_tsc = 0;
@@ -3144,6 +3191,25 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)

 	vcpu->hv_clock.flags = pvclock_flags;

+	tsc_raw = rdtsc();
+	tsc = kvm_read_l1_tsc(v, tsc_raw);
+	old_ns = __pvclock_read_cycles(&old_hv_clock, tsc);
+	new_ns = __pvclock_read_cycles(&vcpu->hv_clock, tsc);
+	if (old_ns > new_ns) {
+		backward = true;
+		diff = old_ns - new_ns;
+	} else {
+		backward = false;
+		diff = new_ns - old_ns;
+	}
+	pr_alert("backward=%d, diff=%llu, old_ns=%llu, new_ns=%llu\n",
+		 backward, diff, old_ns, new_ns);


Thank you very much!

Dongli Zhang

> 
>>>> +		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>>>> +	}
>>>> +
>>>> +	schedule_delayed_work(&ka->masterclock_sync_work,
>>>> +			      masterclock_sync_period * HZ);
>>>> +}
  
David Woodhouse Oct. 2, 2023, 8:33 a.m. UTC | #5
On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
> 
> 
> We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.
> 
> This is because:
> 
> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
> 
> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
> 
> 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
> different results (this is expected because they have different
> mult/shift/equation).
> 
> 4. However, the base in  kvmclock calculation (tsc_timestamp and system_time)
> are derived from raw monotonic clock (master clock)

That just seems wrong. I don't mean that you're incorrect; it seems
*morally* wrong.

In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
a *different* mult/shift/equation (your #1) to convert TSC ticks to
nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).

I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.

Fix that, and the whole problem goes away, doesn't it?

What am I missing here, that means we can't do that?

Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all?
If KVM wants to decide that the TSC runs at a different frequency to
the frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM
just *stick* to that?
  
Sean Christopherson Oct. 2, 2023, 4:37 p.m. UTC | #6
On Mon, Oct 02, 2023, David Woodhouse wrote:
> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
> > 
> > 
> > We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.
> > 
> > This is because:
> > 
> > 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
> > 
> > 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
> > 
> > 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
> > different results (this is expected because they have different
> > mult/shift/equation).
> > 
> > 4. However, the base in  kvmclock calculation (tsc_timestamp and system_time)
> > are derived from raw monotonic clock (master clock)
> 
> That just seems wrong. I don't mean that you're incorrect; it seems
> *morally* wrong.
> 
> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
> a *different* mult/shift/equation (your #1) to convert TSC ticks to
> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
> 
> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
> 
> Fix that, and the whole problem goes away, doesn't it?
> 
> What am I missing here, that means we can't do that?

I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
ABI between KVM and KVM guests.

Like many of the older bits of KVM, my guess is that KVM's behavior is the product
of making things kinda sorta work with old hardware, i.e. was probably the least
awful solution in the days before constant TSCs, but is completely nonsensical on
modern hardware.

> Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all?
> If KVM wants to decide that the TSC runs at a different frequency to
> the frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM
> just *stick* to that?

Yeah, bouncing around guest time when the TSC is constant seems counterproductive.

However, why does any of this matter if the host has a constant TSC?  If that's
the case, a sane setup will expose a constant TSC to the guest and the guest will
use the TSC instead of kvmclock for the guest clocksource.

Dongli, is this for long-lived "legacy" guests that were created on hosts without
a constant TSC?  If not, then why is kvmclock being used?  Or heaven forbid, are
you running on hardware without a constant TSC? :-)

Not saying we shouldn't sanitize the kvmclock behavior, but knowing the exact
problematic configuration(s) will help us make a better decision on how to fix
the mess.
  
Dongli Zhang Oct. 2, 2023, 5:17 p.m. UTC | #7
Hi Sean and David,

On 10/2/23 09:37, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>
>>>
>>> We want more frequent KVM_REQ_MASTERCLOCK_UPDATE.
>>>
>>> This is because:
>>>
>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>
>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
>>>
>>> 3. As a result, given the same rdtsc, kvmclock and raw monotonic may return
>>> different results (this is expected because they have different
>>> mult/shift/equation).
>>>
>>> 4. However, the base in  kvmclock calculation (tsc_timestamp and system_time)
>>> are derived from raw monotonic clock (master clock)
>>
>> That just seems wrong. I don't mean that you're incorrect; it seems
>> *morally* wrong.
>>
>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
>>
>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>
>> Fix that, and the whole problem goes away, doesn't it?
>>
>> What am I missing here, that means we can't do that?
> 
> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
> ABI between KVM and KVM guests.
> 
> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
> of making things kinda sorta work with old hardware, i.e. was probably the least
> awful solution in the days before constant TSCs, but is completely nonsensical on
> modern hardware.
> 
>> Alternatively... with X86_FEATURE_CONSTANT_TSC, why do the sync at all?
>> If KVM wants to decide that the TSC runs at a different frequency to
>> the frequency that the host uses for CLOCK_MONOTONIC_RAW, why can't KVM
>> just *stick* to that?
> 
> Yeah, bouncing around guest time when the TSC is constant seems counterproductive.
> 
> However, why does any of this matter if the host has a constant TSC?  If that's
> the case, a sane setup will expose a constant TSC to the guest and the guest will
> use the TSC instead of kvmclock for the guest clocksource.
> 
> Dongli, is this for long-lived "legacy" guests that were created on hosts without
> a constant TSC?  If not, then why is kvmclock being used?  Or heaven forbid, are
> you running on hardware without a constant TSC? :-)

This is for test guests, and the host has all of the below:

tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust


A clocksource is used for two things.


1. current_clocksource. Yes, I agree we should always use tsc on modern hardware.

Do we need to update the documentation to always suggest TSC when it is
constant, as I believe many users still prefer the pv clock over tsc?

Thanks to tsc ratio scaling, live migration will not impact the tsc.

From the source code, the rating of kvm-clock is still higher than tsc.

BTW, how about decreasing the rating if the guest detects a constant tsc?

166 struct clocksource kvm_clock = {
167         .name   = "kvm-clock",
168         .read   = kvm_clock_get_cycles,
169         .rating = 400,
170         .mask   = CLOCKSOURCE_MASK(64),
171         .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
172         .enable = kvm_cs_enable,
173 };

1196 static struct clocksource clocksource_tsc = {
1197         .name                   = "tsc",
1198         .rating                 = 300,
1199         .read                   = read_tsc,


2. The sched_clock.

The scheduling is impacted if there is a big drift.

Fortunately, according to my general testing (without RT requirements), the
impact is trivial unless the two masterclock *updates* are separated by a very
long period.

In the past, the scheduling was impacted a lot when the masterclock was still
based on monotonic (not raw).

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53fafdbb8b21fa99dfd8376ca056bffde8cafc11


Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
operations (not only sched_clock), e.g., after line 300.

296 void __init kvmclock_init(void)
297 {
298         u8 flags;
299
300         if (!kvm_para_available() || !kvmclock)
301                 return;
302
303         if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
304                 msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
305                 msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
306         } else if (!kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
307                 return;
308         }
309
310         if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
311                               kvmclock_setup_percpu, NULL) < 0) {
312                 return;
313         }
314
315         pr_info("kvm-clock: Using msrs %x and %x",
316                 msr_kvm_system_time, msr_kvm_wall_clock);
317
318         this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
319         kvm_register_clock("primary cpu clock");
320         pvclock_set_pvti_cpu0_va(hv_clock_boot);
321
322         if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
323                 pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
324
325         flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
326         kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
327
328         x86_platform.calibrate_tsc = kvm_get_tsc_khz;
329         x86_platform.calibrate_cpu = kvm_get_tsc_khz;
330         x86_platform.get_wallclock = kvm_get_wallclock;
331         x86_platform.set_wallclock = kvm_set_wallclock;
332 #ifdef CONFIG_X86_LOCAL_APIC
333         x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
334 #endif
335         x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
336         x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
337         kvm_get_preset_lpj();


Should I introduce a new param to disable only the kvm sched_clock, or a new
param to allow the selection of the sched_clock?


Those may not resolve the issue in the other thread. (I guess there is a
chance that refreshing the masterclock may reduce the drift in that case,
although I have never tried.)

https://lore.kernel.org/all/00fba193-238e-49dc-fdc4-0b93f20569ec@oracle.com/

Thank you very much!

Dongli Zhang

> 
> Not saying we shouldn't sanitize the kvmclock behavior, but knowing the exact
> problematic configuration(s) will help us make a better decision on how to fix
> the mess.
  
David Woodhouse Oct. 2, 2023, 6:16 p.m. UTC | #8
On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
> > On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
> > > 
> > > 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
> > > 
> > > 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
> > > 
> > 
> > That just seems wrong. I don't mean that you're incorrect; it seems
> > *morally* wrong.
> > 
> > In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
> > a *different* mult/shift/equation (your #1) to convert TSC ticks to
> > nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
> > 
> > I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
> > adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
> > 
> > Fix that, and the whole problem goes away, doesn't it?
> > 
> > What am I missing here, that means we can't do that?
> 
> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
> ABI between KVM and KVM guests.
> 
> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
> of making things kinda sorta work with old hardware, i.e. was probably the least
> awful solution in the days before constant TSCs, but is completely nonsensical on
> modern hardware.

I still don't understand. The ABI and its math are fine. The ABI is just
 "at time X the TSC was Y, and the TSC frequency is Z"

I understand why on older hardware, those values needed to *change*
occasionally when TSC stupidity happened. 

But on newer hardware, surely we can set them precisely *once* when the
VM starts, and never ever have to change them again? Theoretically not
even when we pause the VM, kexec into a new kernel, and resume the VM!

But we *are* having to change it, because apparently
CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at
precisely the frequency of the known and constant TSC.

But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole
point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by
NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock,
with no skew at all?

And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really
have to keep resetting the kvmclock to it at all? On modern hardware
can't the kvmclock be defined by the TSC alone?
  
Sean Christopherson Oct. 2, 2023, 6:18 p.m. UTC | #9
+PeterZ

Thomas and Peter,

We're trying to address an issue where KVM's paravirt kvmclock drifts from the
host's TSC-based monotonic raw clock because of historical reasons (at least, AFAICT),
even when the TSC is constant.  Due to some dubious KVM behavior, KVM may sometimes
re-sync kvmclock against the host's monotonic raw clock, which causes non-trivial
jumps in time from the guest's perspective.

Linux-as-a-guest demotes all paravirt clock sources when the TSC is constant and
nonstop, and so the goofy KVM behavior isn't likely to affect the guest's clocksource,
but the guest's sched_clock() implementation keeps using the paravirt clock.

Irrespective of if/how we fix the KVM host-side mess, using a paravirt clock for
the scheduler when using a constant, nonstop TSC for the clocksource seems at best
inefficient, and at worst unnecessarily complex and risky.

Is there any reason not to prefer native_sched_clock() over whatever paravirt
clock is present when the TSC is the preferred clocksource?  Assuming the desirable
thing to do is to use native_sched_clock() in this scenario, do we need a separate
rating system, or can we simply tie the sched clock selection to the clocksource
selection, e.g. override the paravirt stuff if the TSC clock has higher priority
and is chosen?

Some more details below (and in the rest of the thread).

Thanks!

On Mon, Oct 02, 2023, Dongli Zhang wrote:
> Hi Sean and David,
> 
> On 10/2/23 09:37, Sean Christopherson wrote:
> > However, why does any of this matter if the host has a constant TSC?  If that's
> > the case, a sane setup will expose a constant TSC to the guest and the guest will
> > use the TSC instead of kvmclock for the guest clocksource.
> > 
> > Dongli, is this for long-lived "legacy" guests that were created on hosts without
> > a constant TSC?  If not, then why is kvmclock being used?  Or heaven forbid, are
> > you running on hardware without a constant TSC? :-)
> 
> This is for test guests, and the host has all of below:
> 
> tsc, rdtscp, constant_tsc, nonstop_tsc, tsc_deadline_timer, tsc_adjust
> 
> A clocksource is used for two things.
> 
> 
> 1. current_clocksource. Yes, I agree we should always use tsc on modern hardware.
> 
> Do we need to update the documentation to always suggest TSC when it is
> constant, as I believe many users still prefer pv clock than tsc?
> 
> Thanks to tsc ratio scaling, the live migration will not impact tsc.
> 
> From the source code, the rating of kvm-clock is still higher than tsc.
> 
> BTW., how about to decrease the rating if guest detects constant tsc?
> 
> 166 struct clocksource kvm_clock = {
> 167         .name   = "kvm-clock",
> 168         .read   = kvm_clock_get_cycles,
> 169         .rating = 400,
> 170         .mask   = CLOCKSOURCE_MASK(64),
> 171         .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
> 172         .enable = kvm_cs_enable,
> 173 };
> 
> 1196 static struct clocksource clocksource_tsc = {
> 1197         .name                   = "tsc",
> 1198         .rating                 = 300,
> 1199         .read                   = read_tsc,

That's already done in kvmclock_init().

	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
	    !check_tsc_unstable())
		kvm_clock.rating = 299;

See also: https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@google.com

> 2. The sched_clock.
> 
> The scheduling is impacted if there is big drift.

...

> Unfortunately, the "no-kvmclock" kernel parameter disables all pv clock
> operations (not only sched_clock), e.g., after line 300.

...
 
> Should I introduce a new param to disable no-kvm-sched-clock only, or to
> introduce a new param to allow the selection of sched_clock?

I don't think we want a KVM-specific knob, because every flavor of paravirt guest
would need to do the same thing.  And unless there's a good reason to use a
paravirt clock, this really shouldn't be something the guest admin needs to opt
into using.
  
Peter Zijlstra Oct. 2, 2023, 9:06 p.m. UTC | #10
On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
> +PeterZ
> 
> Thomas and Peter,
> 
> We're trying to address an issue where KVM's paravirt kvmclock drifts from the
> host's TSC-based monotonic raw clock because of historical reasons (at least, AFAICT),
> even when the TSC is constant.  Due to some dubious KVM behavior, KVM may sometimes
> re-sync kvmclock against the host's monotonic raw clock, which causes non-trivial
> jumps in time from the guest's perspective.
> 
> Linux-as-a-guest demotes all paravirt clock sources when the TSC is constant and
> nonstop, and so the goofy KVM behavior isn't likely to affect the guest's clocksource,
> but the guest's sched_clock() implementation keeps using the paravirt clock.
> 
> Irrespective of if/how we fix the KVM host-side mess, using a paravirt clock for
> the scheduler when using a constant, nonstop TSC for the clocksource seems at best
> inefficient, and at worst unnecessarily complex and risky.
> 
> Is there any reason not to prefer native_sched_clock() over whatever paravirt
> clock is present when the TSC is the preferred clocksource? 

I see none, that whole pv_clock thing is horrible crap.

> Assuming the desirable
> thing to do is to use native_sched_clock() in this scenario, do we need a separate
> rating system, or can we simply tie the sched clock selection to the clocksource
> selection, e.g. override the paravirt stuff if the TSC clock has higher priority
> and is chosen?

Yeah, I see no point of another rating system. Just force the thing back
to native (or don't set it to that other thing).
  
Peter Zijlstra Oct. 2, 2023, 9:16 p.m. UTC | #11
On Mon, Oct 02, 2023 at 11:06:07PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 11:18:50AM -0700, Sean Christopherson wrote:
> > +PeterZ
> > 
> > Thomas and Peter,
> > 
> > We're trying to address an issue where KVM's paravirt kvmclock drifts from the
> > host's TSC-based monotonic raw clock because of historical reasons (at least, AFAICT),
> > even when the TSC is constant.  Due to some dubious KVM behavior, KVM may sometimes
> > re-sync kvmclock against the host's monotonic raw clock, which causes non-trivial
> > jumps in time from the guest's perspective.
> > 
> > Linux-as-a-guest demotes all paravirt clock sources when the TSC is constant and
> > nonstop, and so the goofy KVM behavior isn't likely to affect the guest's clocksource,
> > but the guest's sched_clock() implementation keeps using the paravirt clock.
> > 
> > Irrespective of if/how we fix the KVM host-side mess, using a paravirt clock for
> > the scheduler when using a constant, nonstop TSC for the clocksource seems at best
> > inefficient, and at worst unnecessarily complex and risky.
> > 
> > Is there any reason not to prefer native_sched_clock() over whatever paravirt
> > clock is present when the TSC is the preferred clocksource? 
> 
> I see none, that whole pv_clock thing is horrible crap.

In fact, I don't really see a reason to ever use pv_clock, even on
non-constant TSC. The sched_clock machinery used on x86 (and ia64 at
some point) reverts to tick-based + 'TSC-with-monotonicity-filter
refinement' once it detects the TSC is crap.

And that should work in a guest too I suppose.

Also, I really should clean all that up -- it's all static_key based,
but I think I can do a saner version with static_call. But that's stuck
somewhere on the eternal todo list.
  
Sean Christopherson Oct. 3, 2023, 12:53 a.m. UTC | #12
On Mon, Oct 02, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote:
> > On Mon, Oct 02, 2023, David Woodhouse wrote:
> > > On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
> > > > 
> > > > 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
> > > > 
> > > > 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
> > > > 
> > > 
> > > That just seems wrong. I don't mean that you're incorrect; it seems
> > > *morally* wrong.
> > > 
> > > In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
> > > a *different* mult/shift/equation (your #1) to convert TSC ticks to
> > > nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
> > > 
> > > I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
> > > adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
> > > 
> > > Fix that, and the whole problem goes away, doesn't it?
> > > 
> > > What am I missing here, that means we can't do that?
> > 
> > I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
> > ABI between KVM and KVM guests.
> > 
> > Like many of the older bits of KVM, my guess is that KVM's behavior is the product
> > of making things kinda sorta work with old hardware, i.e. was probably the least
> > awful solution in the days before constant TSCs, but is completely nonsensical on
> > modern hardware.
> 
> I still don't understand. The ABI and its math are fine. The ABI is just
>  "at time X the TSC was Y, and the TSC frequency is Z"
> 
> I understand why on older hardware, those values needed to *change*
> occasionally when TSC stupidity happened. 
> 
> But on newer hardware, surely we can set them precisely *once* when the
> VM starts, and never ever have to change them again? Theoretically not
> even when we pause the VM, kexec into a new kernel, and resume the VM!
> 
> But we *are* having to change it, because apparently
> CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at
> precisely the frequency of the known and constant TSC.
>
> But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole
> point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by
> NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock,
> with no skew at all?

IIUC, the issue is that the paravirt clock ends up mixing time domains, i.e. is
a weird bastardization of the host's monotonic raw clock and the paravirt clock.

Despite a name that suggests otherwise (to me at least), __pvclock_read_cycles()
counts "cycles" in nanoseconds, not TSC ticks.
 
  u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
  {
	u64 delta = tsc - src->tsc_timestamp;
	u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
					     src->tsc_shift);
	return src->system_time + offset;
  }

In the above, "offset" is computed in the kvmclock domain, whereas system_time
comes from the host's CLOCK_MONOTONIC_RAW domain by way of master_kernel_ns.
The goofy math is much more obvious in __get_kvmclock(), i.e. KVM's host-side
retrieval of the guest's view of kvmclock:

  hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;

The two domains use the same "clock" (constant TSC), but different math to compute
nanoseconds from a given TSC value.  For decently large TSC values, this results
in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.

When KVM decides to refresh the masterclock, e.g. vCPU hotplug in Dongli's case,
it resets the base time, a.k.a. system_time.  I.e. KVM takes all of the time that
was accumulated in the kvmclock domain and recomputes it in the CLOCK_MONOTONIC_RAW
domain.  The more time that passes between refreshes, the bigger the time jump
from the guest's perspective.

E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by
undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add"
and the "subtract", but the underlying train wreck of mixing time domains is
still there.

Without a constant TSC, deferring the reference time to the host's computation
makes sense (or at least, is less silly), because the effective TSC would be per
physical CPU, whereas the reference time is per VM.

[*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org

> And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really
> have to keep resetting the kvmclock to it at all? On modern hardware
> can't the kvmclock be defined by the TSC alone?

I think there is still use for synchronizing with the host's view of time, e.g.
to deal with lost time across host suspend+resume.

So I don't think we can completely sever KVM's paravirt clocks from host time,
at least not without harming use cases that rely on the host's view to keep
accurate time.  And honestly at that point, the right answer would be to stop
advertising paravirt clocks entirely.

But I do think we can address the issues that Dongli and David are observing
where guest time drifts even though the host kernel's base time hasn't changed.
If I've pieced everything together correctly, the drift can be eliminated simply
by using the paravirt clock algorithm when converting the delta from the raw TSC
to nanoseconds.

This is *very* lightly tested, as in it compiles and doesn't explode, but that's
about all I've tested.

---
 arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6573c89c35a9..3ba7edfca47c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2417,6 +2417,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
 static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
 #endif
 
+static u32 host_tsc_to_system_mul;
+static s8 host_tsc_shift;
+
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 static unsigned long max_tsc_khz;
 
@@ -2812,27 +2815,18 @@ static u64 read_tsc(void)
 static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
 			  int *mode)
 {
-	u64 tsc_pg_val;
-	long v;
+	u64 ns, v;
 
 	switch (clock->vclock_mode) {
 	case VDSO_CLOCKMODE_HVCLOCK:
-		if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
-					 tsc_timestamp, &tsc_pg_val)) {
-			/* TSC page valid */
+		if (hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp, &v))
 			*mode = VDSO_CLOCKMODE_HVCLOCK;
-			v = (tsc_pg_val - clock->cycle_last) &
-				clock->mask;
-		} else {
-			/* TSC page invalid */
+		else
 			*mode = VDSO_CLOCKMODE_NONE;
-		}
 		break;
 	case VDSO_CLOCKMODE_TSC:
 		*mode = VDSO_CLOCKMODE_TSC;
-		*tsc_timestamp = read_tsc();
-		v = (*tsc_timestamp - clock->cycle_last) &
-			clock->mask;
+		v = *tsc_timestamp = read_tsc();
 		break;
 	default:
 		*mode = VDSO_CLOCKMODE_NONE;
@@ -2840,8 +2834,36 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
 
 	if (*mode == VDSO_CLOCKMODE_NONE)
 		*tsc_timestamp = v = 0;
+	else
+		v = (v - clock->cycle_last) & clock->mask;
 
-	return v * clock->mult;
+	ns = clock->base_cycles;
+
+	/*
+	 * When the clock source is a raw, constant TSC, do the conversion to
+	 * nanoseconds using the paravirt clock math so that the delta in ns is
+	 * consistent regardless of whether the delta is converted in the host
+	 * or the guest.
+	 *
+	 * The base for paravirt clocks is the kernel's base time in ns, plus
+	 * the delta since the last sync.   E.g. if a masterclock update occurs,
+	 * KVM will shift some amount of delta from the guest to the host.
+	 * Conversions from TSC to ns for the host's CLOCK_MONOTONIC_RAW and
+	 * paravirt clocks aren't equivalent, and so shifting the delta can
+	 * cause time to jump from the guest's view of the paravirt clock.
+	 * This only works for a constant TSC, otherwise the calculation would
+	 * only be valid for the current physical CPU, whereas the base of the
+	 * clock must be valid for all vCPUs in the VM.
+	 */
+	if (static_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+	    *mode == VDSO_CLOCKMODE_TSC && clock == &pvclock_gtod_data.raw_clock) {
+		ns >>= clock->shift;
+		ns += pvclock_scale_delta(v, host_tsc_to_system_mul, host_tsc_shift);
+	} else {
+		ns += v * clock->mult;
+		ns >>= clock->shift;
+	}
+	return ns;
 }
 
 static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
@@ -2853,9 +2875,7 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		ns = gtod->raw_clock.base_cycles;
-		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
-		ns >>= gtod->raw_clock.shift;
+		ns = vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
 		ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	*t = ns;
@@ -2873,9 +2893,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
 		ts->tv_sec = gtod->wall_time_sec;
-		ns = gtod->clock.base_cycles;
-		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
-		ns >>= gtod->clock.shift;
+		ns = vgettsc(&gtod->clock, tsc_timestamp, &mode);
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
 	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
@@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
 	if (ret != 0)
 		return ret;
 
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
+				   &host_tsc_shift, &host_tsc_to_system_mul);
+
 	local_tsc = rdtsc();
 	stable = !kvm_check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {

base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
--
  
Dongli Zhang Oct. 3, 2023, 1:32 a.m. UTC | #13
Hi Sean,

A quick question ...

On 10/2/23 17:53, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, David Woodhouse wrote:
>> On Mon, 2023-10-02 at 09:37 -0700, Sean Christopherson wrote:
>>> On Mon, Oct 02, 2023, David Woodhouse wrote:
>>>> On Fri, 2023-09-29 at 13:15 -0700, Dongli Zhang wrote:
>>>>>
>>>>> 1. The vcpu->hv_clock (kvmclock) is based on its own mult/shift/equation.
>>>>>
>>>>> 2. The raw monotonic (tsc_clocksource) uses different mult/shift/equation.
>>>>>
>>>>
>>>> That just seems wrong. I don't mean that you're incorrect; it seems
>>>> *morally* wrong.
>>>>
>>>> In a system with X86_FEATURE_CONSTANT_TSC, why would KVM choose to use
>>>> a *different* mult/shift/equation (your #1) to convert TSC ticks to
>>>> nanoseconds than the host CLOCK_MONOTONIC_RAW does (your #2).
>>>>
>>>> I understand that KVM can't track the host's CLOCK_MONOTONIC, as it's
>>>> adjusted by NTP. But CLOCK_MONOTONIC_RAW is supposed to be consistent.
>>>>
>>>> Fix that, and the whole problem goes away, doesn't it?
>>>>
>>>> What am I missing here, that means we can't do that?
>>>
>>> I believe the answer is that "struct pvclock_vcpu_time_info" and its math are
>>> ABI between KVM and KVM guests.
>>>
>>> Like many of the older bits of KVM, my guess is that KVM's behavior is the product
>>> of making things kinda sorta work with old hardware, i.e. was probably the least
>>> awful solution in the days before constant TSCs, but is completely nonsensical on
>>> modern hardware.
>>
>> I still don't understand. The ABI and its math are fine. The ABI is just
>>  "at time X the TSC was Y, and the TSC frequency is Z"
>>
>> I understand why on older hardware, those values needed to *change*
>> occasionally when TSC stupidity happened. 
>>
>> But on newer hardware, surely we can set them precisely *once* when the
>> VM starts, and never ever have to change them again? Theoretically not
>> even when we pause the VM, kexec into a new kernel, and resume the VM!
>>
>> But we *are* having to change it, because apparently
>> CLOCK_MONOTONIC_RAW is doing something *other* than incrementing at
>> precisely the frequency of the known and constant TSC.
>>
>> But *why* is CLOCK_MONOTONIC_RAW doing that? I thought that the whole
>> point of CLOCK_MONOTONIC_RAW was to be consistent and not adjusted by
>> NTP etc.? Shouldn't it run at precisely the same rate as the kvmclock,
>> with no skew at all?
> 
> IIUC, the issue is that the paravirt clock ends up mixing time domains, i.e. is
> a weird bastardization of the host's monotonic raw clock and the paravirt clock.
> 
> Despite a name that suggests otherwise (to me at least), __pvclock_read_cycles()
> counts "cycles" in nanoseconds, not TSC ticks.
>  
>   u64 __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, u64 tsc)
>   {
> 	u64 delta = tsc - src->tsc_timestamp;
> 	u64 offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
> 					     src->tsc_shift);
> 	return src->system_time + offset;
>   }
> 
> In the above, "offset" is computed in the kvmclock domain, whereas system_time
> comes from the host's CLOCK_MONOTONIC_RAW domain by way of master_kernel_ns.
> The goofy math is much more obvious in __get_kvmclock(), i.e. KVM's host-side
> retrieval of the guest's view of kvmclock:
> 
>   hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> 
> The two domains use the same "clock" (constant TSC), but different math to compute
> nanoseconds from a given TSC value.  For decently large TSC values, this results
> in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> 
> When KVM decides to refresh the masterclock, e.g. vCPU hotplug in Dongli's case,
> it resets the base time, a.k.a. system_time.  I.e. KVM takes all of the time that
> was accumulated in the kvmclock domain and recomputes it in the CLOCK_MONOTONIC_RAW
> domain.  The more time that passes between refreshes, the bigger the time jump
> from the guest's perspective.
> 
> E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by
> undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add"
> and the "subtract", but the underlying train wreck of mixing time domains is
> still there.
> 
> Without a constant TSC, deferring the reference time to the host's computation
> makes sense (or at least, is less silly), because the effective TSC would be per
> physical CPU, whereas the reference time is per VM.
> 
> [*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org
> 
>> And if CLOCK_MONOTONIC_RAW is not what I thought it was... do we really
>> have to keep resetting the kvmclock to it at all? On modern hardware
>> can't the kvmclock be defined by the TSC alone?
> 
> I think there is still use for synchronizing with the host's view of time, e.g.
> to deal with lost time across host suspend+resume.
> 
> So I don't think we can completely sever KVM's paravirt clocks from host time,
> at least not without harming use cases that rely on the host's view to keep
> accurate time.  And honestly at that point, the right answer would be to stop
> advertising paravirt clocks entirely.
> 
> But I do think we can address the issues that Dongli and David are observing
> where guest time drifts even though the host kernel's base time hasn't changed.
> If I've pieced everything together correctly, the drift can be eliminated simply
> by using the paravirt clock algorithm when converting the delta from the raw TSC
> to nanoseconds.
> 
> This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> about all I've tested.
> 
> ---
>  arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++++---------------
>  1 file changed, 42 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..3ba7edfca47c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2417,6 +2417,9 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
>  static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
>  #endif
>  
> +static u32 host_tsc_to_system_mul;
> +static s8 host_tsc_shift;
> +
>  static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>  static unsigned long max_tsc_khz;
>  
> @@ -2812,27 +2815,18 @@ static u64 read_tsc(void)
>  static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
>  			  int *mode)
>  {
> -	u64 tsc_pg_val;
> -	long v;
> +	u64 ns, v;
>  
>  	switch (clock->vclock_mode) {
>  	case VDSO_CLOCKMODE_HVCLOCK:
> -		if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
> -					 tsc_timestamp, &tsc_pg_val)) {
> -			/* TSC page valid */
> +		if (hv_read_tsc_page_tsc(hv_get_tsc_page(), tsc_timestamp, &v))
>  			*mode = VDSO_CLOCKMODE_HVCLOCK;
> -			v = (tsc_pg_val - clock->cycle_last) &
> -				clock->mask;
> -		} else {
> -			/* TSC page invalid */
> +		else
>  			*mode = VDSO_CLOCKMODE_NONE;
> -		}
>  		break;
>  	case VDSO_CLOCKMODE_TSC:
>  		*mode = VDSO_CLOCKMODE_TSC;
> -		*tsc_timestamp = read_tsc();
> -		v = (*tsc_timestamp - clock->cycle_last) &
> -			clock->mask;
> +		v = *tsc_timestamp = read_tsc();
>  		break;
>  	default:
>  		*mode = VDSO_CLOCKMODE_NONE;
> @@ -2840,8 +2834,36 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
>  
>  	if (*mode == VDSO_CLOCKMODE_NONE)
>  		*tsc_timestamp = v = 0;
> +	else
> +		v = (v - clock->cycle_last) & clock->mask;
>  
> -	return v * clock->mult;
> +	ns = clock->base_cycles;
> +
> +	/*
> +	 * When the clock source is a raw, constant TSC, do the conversion to
> +	 * nanoseconds using the paravirt clock math so that the delta in ns is
> +	 * consistent regardless of whether the delta is converted in the host
> +	 * or the guest.
> +	 *
> +	 * The base for paravirt clocks is the kernel's base time in ns, plus
> +	 * the delta since the last sync.   E.g. if a masterclock update occurs,
> +	 * KVM will shift some amount of delta from the guest to the host.
> +	 * Conversions from TSC to ns for the hosts's CLOCK_MONOTONIC_RAW and
> +	 * paravirt clocks aren't equivalent, and so shifting the delta can
> +	 * cause time to jump from the guest's view of the paravirt clock.
> +	 * This only works for a constant TSC, otherwise the calculation would
> +	 * only be valid for the current physical CPU, whereas the base of the
> +	 * clock must be valid for all vCPUs in the VM.
> +	 */
> +	if (static_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> +	    *mode == VDSO_CLOCKMODE_TSC && clock == &pvclock_gtod_data.raw_clock) {
> +		ns >>= clock->shift;
> +		ns += pvclock_scale_delta(v, host_tsc_to_system_mul, host_tsc_shift);
> +	} else {
> +		ns += v * clock->mult;
> +		ns >>= clock->shift;
> +	}
> +	return ns;
>  }
>  
>  static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> @@ -2853,9 +2875,7 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
>  
>  	do {
>  		seq = read_seqcount_begin(&gtod->seq);
> -		ns = gtod->raw_clock.base_cycles;
> -		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
> -		ns >>= gtod->raw_clock.shift;
> +		ns = vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
>  		ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  	*t = ns;
> @@ -2873,9 +2893,7 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
>  	do {
>  		seq = read_seqcount_begin(&gtod->seq);
>  		ts->tv_sec = gtod->wall_time_sec;
> -		ns = gtod->clock.base_cycles;
> -		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
> -		ns >>= gtod->clock.shift;
> +		ns = vgettsc(&gtod->clock, tsc_timestamp, &mode);
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  
>  	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
>  	if (ret != 0)
>  		return ret;
>  
> +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
> +				   &host_tsc_shift, &host_tsc_to_system_mul);

I agree with using the kvmclock math to calculate the ns elapsed when updating
the master clock.

Would you take TSC scaling into consideration, though?

While host_tsc_shift and host_tsc_to_system_mul are pre-computed from the host
frequency, what about a VM that uses a different TSC frequency?

Thank you very much!

Dongli Zhang


> +
>  	local_tsc = rdtsc();
>  	stable = !kvm_check_tsc_unstable();
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
> 
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
  
Sean Christopherson Oct. 3, 2023, 1:49 a.m. UTC | #14
On Mon, Oct 02, 2023, Dongli Zhang wrote:
> > @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
> >  	if (ret != 0)
> >  		return ret;
> >  
> > +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> > +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
> > +				   &host_tsc_shift, &host_tsc_to_system_mul);
> 
> I agree with using the kvmclock math to calculate the ns elapsed when updating
> the master clock.
> 
> Would you take TSC scaling into consideration, though?
> 
> While host_tsc_shift and host_tsc_to_system_mul are pre-computed from the host
> frequency, what about a VM that uses a different TSC frequency?

Heh, I'm pretty sure that's completely broken today.  I don't see anything in KVM
that takes hardware TSC scaling into account.

This code:

	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
				   &vcpu->hv_clock.tsc_shift,
				   &vcpu->hv_clock.tsc_to_system_mul);
		vcpu->hw_tsc_khz = tgt_tsc_khz;
		kvm_xen_update_tsc_info(v);
	}

is recomputing the multiplier+shift for the current *physical* CPU, it's not
related to the guest's TSC in any way.

__get_kvmclock() again shows that quite clearly, there's no scaling for the guest
TSC anywhere in there.
  
Dongli Zhang Oct. 3, 2023, 2:07 a.m. UTC | #15
Hi Sean,

On 10/2/23 18:49, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, Dongli Zhang wrote:
>>> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
>>>  	if (ret != 0)
>>>  		return ret;
>>>  
>>> +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
>>> +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
>>> +				   &host_tsc_shift, &host_tsc_to_system_mul);
>>
>> I agree with using the kvmclock math to calculate the ns elapsed when
>> updating the master clock.
>>
>> Would you take TSC scaling into consideration, though?
>>
>> While host_tsc_shift and host_tsc_to_system_mul are pre-computed from the
>> host frequency, what about a VM that uses a different TSC frequency?
> 
> Heh, I'm pretty sure that's completely broken today.  I don't see anything in KVM
> that takes hardware TSC scaling into account.
> 
> This code:
> 
> 	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> 		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> 				   &vcpu->hv_clock.tsc_shift,
> 				   &vcpu->hv_clock.tsc_to_system_mul);
> 		vcpu->hw_tsc_khz = tgt_tsc_khz;
> 		kvm_xen_update_tsc_info(v);
> 	}
> 
> is recomputing the multiplier+shift for the current *physical* CPU, it's not
> related to the guest's TSC in any way.

Below is the code.

line 3175: query the freq of the current *physical* CPU.

line 3211: scale the freq if TSC scaling is involved.

line 3215: compute the guest's view based on the new 'tgt_tsc_khz' after scaling.

3146 static int kvm_guest_time_update(struct kvm_vcpu *v)
3147 {
3148         unsigned long flags, tgt_tsc_khz;
3149         unsigned seq;
... ...
3173         /* Keep irq disabled to prevent changes to the clock */
3174         local_irq_save(flags);
3175         tgt_tsc_khz = get_cpu_tsc_khz();
... ...
3210         if (kvm_caps.has_tsc_control)
3211                 tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
3212                                             v->arch.l1_tsc_scaling_ratio);
3213
3214         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
3215                 kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
3216                                    &vcpu->hv_clock.tsc_shift,
3217                                    &vcpu->hv_clock.tsc_to_system_mul);
3218                 vcpu->hw_tsc_khz = tgt_tsc_khz;
3219                 kvm_xen_update_tsc_info(v);
3220         }
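
For example (hypothetical numbers): with a host TSC of 2593992 kHz and an L1
scaling ratio of one half, kvm_scale_tsc() at line 3211 returns tgt_tsc_khz =
1296996, and kvm_get_time_scale() at line 3215 then derives the guest-visible
tsc_shift/tsc_to_system_mul from that scaled frequency rather than from the
host frequency.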


Would you please let me know if the above understanding is incorrect?

Thank you very much!

Dongli Zhang

> 
> __get_kvmclock() again shows that quite clearly, there's no scaling for the guest
> TSC anywhere in there.
  
David Woodhouse Oct. 3, 2023, 5:54 a.m. UTC | #16
On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> 
> The two domains use the same "clock" (constant TSC), but different math to compute
> nanoseconds from a given TSC value.  For decently large TSC values, this results
> in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.

This is the bit I'm still confused about, and it seems to be the root
of all the other problems.

Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
number of ticks of the TSC running at a constant known frequency, to a
number of nanoseconds.

So how in the name of all that is holy do they manage to get
*different* answers?

I get that the mult/shift thing carries some imprecision, but is that
all it is? Can't we ensure that the kvmclock uses the *same* algorithm,
precisely, as CLOCK_MONOTONIC_RAW?


> E.g. IIUC, your proposed patch to use a single RDTSC[*] eliminates the drift by
> undoing the CLOCK_MONOTONIC_RAW crap using the same TSC value on both the "add"
> and the "subtract", but the underlying train wreck of mixing time domains is
> still there.

> [*] https://lore.kernel.org/all/ee446c823002dc92c8ea525f21d00a9f5d27de59.camel@infradead.org

That one's different; those clock domains are *supposed* to be
divergent, and the point of the PV wall clock information is to convey
the delta between the two. I just made it use the delta between the two
at a *given* moment, instead of calculating the difference between
*two* different times.
  
David Woodhouse Oct. 3, 2023, 9:12 a.m. UTC | #17
On 3 October 2023 01:53:11 BST, Sean Christopherson <seanjc@google.com> wrote:
>I think there is still use for synchronizing with the host's view of time, e.g.
>to deal with lost time across host suspend+resume.
>
>So I don't think we can completely sever KVM's paravirt clocks from host time,
>at least not without harming use cases that rely on the host's view to keep
>accurate time.  And honestly at that point, the right answer would be to stop
>advertising paravirt clocks entirely.
>
>But I do think we can address the issues that Dongli and David are observing
>where guest time drifts even though the host kernel's base time hasn't changed.
>If I've pieced everything together correctly, the drift can be eliminated simply
>by using the paravirt clock algorithm when converting the delta from the raw TSC
>to nanoseconds.
>
>This is *very* lightly tested, as in it compiles and doesn't explode, but that's
>about all I've tested.

Hm, I don't think I like this.

You're making get_monotonic_raw() not *actually* return the monotonic_raw
clock, but basically return the kvmclock instead? And why? So that when KVM
attempts to synchronize the kvmclock to the monotonic_raw clock, it gets
tricked into actually synchronizing the kvmclock to *itself*?

If you get this right, don't we have a fairly complex piece of code that has
precisely *no* effect?

Can't we just *refrain* from synchronizing the kvmclock to *anything*, in the
CONSTANT_TSC case? Why do we do that anyway?

(Suspend/resume, live update and live migration are different. In *those*
cases we may need to preserve both the guest TSC and kvmclock based on either
the host TSC or CLOCK_TAI. But that's different.)
  
David Woodhouse Oct. 3, 2023, 2:29 p.m. UTC | #18
On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> 
> This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> about all I've tested.

I don't think it's working, if I understand what it's supposed to be
doing.

I hacked my wallclock patch *not* to use
kvm_get_walltime_and_clockread(), but instead use
kvm_get_time_and_clockread() so it should be able to compare
monotonic_raw with kvmclock time. It looks like this.

uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
{
	/*
	 * The guest calculates current wall clock time by adding
	 * system time (updated by kvm_guest_time_update below) to the
	 * wall clock specified here.  We do the reverse here.
	 */
#ifdef CONFIG_X86_64
	struct pvclock_vcpu_time_info hv_clock;
	struct kvm_arch *ka = &kvm->arch;
	unsigned long seq, local_tsc_khz = 0;
	struct timespec64 ts;
	uint64_t host_tsc;

	do {
		seq = read_seqcount_begin(&ka->pvclock_sc);

		if (!ka->use_master_clock)
			break;

		/* It all has to happen on the same CPU */
		get_cpu();

		local_tsc_khz = get_cpu_tsc_khz();

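		/*
		 * NB: kvm_get_time_and_clockread() fills ts.tv_sec with
		 * CLOCK_MONOTONIC_RAW in nanoseconds; tv_sec is (ab)used as
		 * s64 scratch for that ns value below (see d2 and the
		 * disabled return).
		 */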
		if (local_tsc_khz &&
		    !kvm_get_time_and_clockread(&ts.tv_sec, &host_tsc))
			local_tsc_khz = 0; /* Fall back to old method */

		hv_clock.tsc_timestamp = ka->master_cycle_now;
		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;

		put_cpu();
	} while (read_seqcount_retry(&ka->pvclock_sc, seq));

	/*
	 * If the conditions were right, and obtaining the wallclock+TSC was
	 * successful, calculate the KVM clock at the corresponding time and
	 * subtract one from the other to get the epoch in nanoseconds.
	 */
	if (local_tsc_khz) {
		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * 1000LL,
				   &hv_clock.tsc_shift,
				   &hv_clock.tsc_to_system_mul);

		uint64_t res = __pvclock_read_cycles(&hv_clock, host_tsc);
		uint64_t d2 = ts.tv_sec + ka->kvmclock_offset;
		printk("Calculated %lld (%lld/%lld delta %lld, ns %lld o %lld)\n",
		       res,
		       ts.tv_sec, d2, d2-res,
		       ka->master_kernel_ns, ka->kvmclock_offset);
		if (0)
		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
			__pvclock_read_cycles(&hv_clock, host_tsc);
	}
#endif
	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
}

I also removed the kvm_make_all_cpus_request(kvm,
KVM_REQ_MASTERCLOCK_UPDATE) from kvm_xen_shared_info_init(). I don't
even know why that's there at all, let alone on *all* vCPUs. So now KVM
doesn't keep clamping the kvmclock back to monotonic_raw.

When I run xen_shinfo_test, the kvmclock still drifts from the
"monotonic_raw" I get from kvm_get_time_and_clockread(), even with your
patch.

[98513.851371] Calculated 522601895 (98513572796678/522601913 delta 18, ns 98513057857491 o -98513050194765)
[98513.851388] Calculated 522619091 (98513572813874/522619109 delta 18, ns 98513057857491 o -98513050194765)
[98513.851394] Calculated 522625423 (98513572820206/522625441 delta 18, ns 98513057857491 o -98513050194765)
[98513.851401] Calculated 522631389 (98513572826172/522631407 delta 18, ns 98513057857491 o -98513050194765)
[98513.851406] Calculated 522636947 (98513572831730/522636965 delta 18, ns 98513057857491 o -98513050194765)
[98513.851412] Calculated 522643004 (98513572837787/522643022 delta 18, ns 98513057857491 o -98513050194765)
[98513.851417] Calculated 522648344 (98513572843127/522648362 delta 18, ns 98513057857491 o -98513050194765)
[98513.851423] Calculated 522654367 (98513572849150/522654385 delta 18, ns 98513057857491 o -98513050194765)
...
[98517.386027] Calculated 4057257718 (98517107452615/4057257850 delta 132, ns 98513057857491 o -98513050194765)
[98517.386030] Calculated 4057261038 (98517107455934/4057261169 delta 131, ns 98513057857491 o -98513050194765)
[98517.386034] Calculated 4057265133 (98517107460029/4057265264 delta 131, ns 98513057857491 o -98513050194765)
[98517.386037] Calculated 4057268310 (98517107463206/4057268441 delta 131, ns 98513057857491 o -98513050194765)
[98517.386040] Calculated 4057271463 (98517107466359/4057271594 delta 131, ns 98513057857491 o -98513050194765)
[98517.386044] Calculated 4057274508 (98517107469404/4057274639 delta 131, ns 98513057857491 o -98513050194765)
[98517.386048] Calculated 4057278587 (98517107473483/4057278718 delta 131, ns 98513057857491 o -98513050194765)
[98517.386051] Calculated 4057281674 (98517107476570/4057281805 delta 131, ns 98513057857491 o -98513050194765)
[98517.386057] Calculated 4057288187 (98517107483083/4057288318 delta 131, ns 98513057857491 o -98513050194765)
[98517.386064] Calculated 4057294557 (98517107489453/4057294688 delta 131, ns 98513057857491 o -98513050194765)
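
(For scale: between the two bursts above the delta grows from 18 ns to 131 ns
across ~3.5 s of uptime, i.e. roughly 32 ns of drift per second, which
extrapolates to a few milliseconds of drift per day.)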
  
Sean Christopherson Oct. 3, 2023, 9 p.m. UTC | #19
On Mon, Oct 02, 2023, Dongli Zhang wrote:
> Hi Sean,
> 
> On 10/2/23 18:49, Sean Christopherson wrote:
> > On Mon, Oct 02, 2023, Dongli Zhang wrote:
> >>> @@ -12185,6 +12203,10 @@ int kvm_arch_hardware_enable(void)
> >>>  	if (ret != 0)
> >>>  		return ret;
> >>>  
> >>> +	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> >>> +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
> >>> +				   &host_tsc_shift, &host_tsc_to_system_mul);
> >>
> >> I agree with using the kvmclock math to calculate the ns elapsed when
> >> updating the master clock.
> >>
> >> Would you take TSC scaling into consideration, though?
> >>
> >> While host_tsc_shift and host_tsc_to_system_mul are pre-computed from the
> >> host frequency, what about a VM that uses a different TSC frequency?
> > 
> > Heh, I'm pretty sure that's completely broken today.  I don't see anything in KVM
> > that takes hardware TSC scaling into account.
> > 
> > This code:
> > 
> > 	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> > 		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > 				   &vcpu->hv_clock.tsc_shift,
> > 				   &vcpu->hv_clock.tsc_to_system_mul);
> > 		vcpu->hw_tsc_khz = tgt_tsc_khz;
> > 		kvm_xen_update_tsc_info(v);
> > 	}
> > 
> > is recomputing the multiplier+shift for the current *physical* CPU, it's not
> > related to the guest's TSC in any way.
> 
> Below is the code.
> 
> line 3175: query the freq of the current *physical* CPU.
> 
> line 3211: scale the freq if TSC scaling is involved.
> 
> line 3215: compute the guest's view based on the new 'tgt_tsc_khz' after scaling.
> 
> 3146 static int kvm_guest_time_update(struct kvm_vcpu *v)
> 3147 {
> 3148         unsigned long flags, tgt_tsc_khz;
> 3149         unsigned seq;
> ... ...
> 3173         /* Keep irq disabled to prevent changes to the clock */
> 3174         local_irq_save(flags);
> 3175         tgt_tsc_khz = get_cpu_tsc_khz();
> ... ...
> 3210         if (kvm_caps.has_tsc_control)
> 3211                 tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
> 3212                                             v->arch.l1_tsc_scaling_ratio);
> 3213
> 3214         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> 3215                 kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> 3216                                    &vcpu->hv_clock.tsc_shift,
> 3217                                    &vcpu->hv_clock.tsc_to_system_mul);
> 3218                 vcpu->hw_tsc_khz = tgt_tsc_khz;
> 3219                 kvm_xen_update_tsc_info(v);
> 3220         }
> 
> 
> Would you please let me know if the above understanding is incorrect?

Ah, yeah, you're correct.  I missed the call to kvm_scale_tsc() at 3211.
  
Sean Christopherson Oct. 4, 2023, 12:04 a.m. UTC | #20
On Tue, Oct 03, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > 
> > The two domains use the same "clock" (constant TSC), but different math to compute
> > nanoseconds from a given TSC value.  For decently large TSC values, this results
> > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> 
> This is the bit I'm still confused about, and it seems to be the root
> of all the other problems.
> 
> Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
> number of ticks of the TSC running at a constant known frequency, to a
> number of nanoseconds.
> 
> So how in the name of all that is holy do they manage to get
> *different* answers?
> 
> I get that the mult/shift thing carries some imprecision, but is that
> all it is? 

Yep, pretty sure that's it.  It's like the plot from Office Space / Superman III.
Those little rounding errors add up over time.

PV clock:

  nanoseconds = ((TSC >> shift) * mult) >> 32

or 

  nanoseconds = ((TSC << shift) * mult) >> 32

versus timekeeping (mostly)

  nanoseconds = (TSC * mult) >> shift

The more I look at the PV clock stuff, the more I agree with Peter: it's garbage.
Shifting before multiplying is guaranteed to introduce error.  Shifting right drops
data, and shifting left introduces zeros.
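
To make the rounding concrete, here is a minimal user-space sketch (an
illustration, not kernel code): the pvclock side is modeled on
kvm_get_time_scale(), the timekeeping side uses the mult=1655736523/shift=32
pair David reports further down for a 2593992 kHz TSC, and __uint128_t is
assumed to be available:

#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL

/* modeled on KVM's kvm_get_time_scale() */
static void get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
			   int8_t *pshift, uint32_t *pmult)
{
	uint64_t scaled64 = scaled_hz, tps64 = base_hz;
	uint32_t tps32;
	int32_t shift = 0;

	while (tps64 > scaled64 * 2 || tps64 & 0xffffffff00000000ULL) {
		tps64 >>= 1;
		shift--;
	}
	tps32 = (uint32_t)tps64;
	while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000ULL) {
		if (scaled64 & 0xffffffff00000000ULL || tps32 & 0xc0000000)
			scaled64 >>= 1;
		else
			tps32 <<= 1;
		shift++;
	}
	*pshift = shift;
	*pmult = (uint32_t)(((uint64_t)scaled64 << 32) / tps32);
}

/* pvclock conversion: shift the TSC first, then multiply and >> 32 */
static uint64_t pvclock_ns(uint64_t tsc, uint32_t mult, int8_t shift)
{
	if (shift < 0)
		tsc >>= -shift;
	else
		tsc <<= shift;
	return (uint64_t)(((__uint128_t)tsc * mult) >> 32);
}

/* timekeeping conversion: multiply first, then shift */
static uint64_t timekeeping_ns(uint64_t tsc, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((__uint128_t)tsc * mult) >> shift);
}

int main(void)
{
	uint64_t tsc_khz = 2593992;
	uint64_t tsc = tsc_khz * 1000 * 3600 * 40;	/* ~40h of ticks */
	uint32_t pv_mult;
	int8_t pv_shift;

	get_time_scale(NSEC_PER_SEC, tsc_khz * 1000, &pv_shift, &pv_mult);
	printf("pvclock:     %llu ns (mult=%u, shift=%d)\n",
	       (unsigned long long)pvclock_ns(tsc, pv_mult, pv_shift),
	       pv_mult, pv_shift);
	printf("timekeeping: %llu ns (mult=1655736523, shift=32)\n",
	       (unsigned long long)timekeeping_ns(tsc, 1655736523, 32));
	return 0;
}

Running it prints two different nanosecond counts for the same tick count; the
gap is the rounding divergence being discussed.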

> Can't we ensure that the kvmclock uses the *same* algorithm,
> precisely, as CLOCK_MONOTONIC_RAW?

Yes?  At least for sane hardware, after much staring, I think it's possible.

It's tricky because the two algorithms are weirdly different, the PV clock algorithm
is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
at us for suggesting that we try to shove the pvclock algorithm into the kernel.

The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.

Compile tested only, but I believe this math is correct.  And I'm guessing we'd
want some safeguards against overflow, e.g. due to a multiplier that is too big.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6573c89c35a9..ae9275c3d580 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
                                            v->arch.l1_tsc_scaling_ratio);
 
        if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
-               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
-                                  &vcpu->hv_clock.tsc_shift,
-                                  &vcpu->hv_clock.tsc_to_system_mul);
+               u32 shift, mult;
+
+               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
+
+               if (shift <= 32) {
+                       vcpu->hv_clock.tsc_shift = 0;
+                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
+               } else {
+                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
+                                          &vcpu->hv_clock.tsc_shift,
+                                          &vcpu->hv_clock.tsc_to_system_mul);
+               }
+
                vcpu->hw_tsc_khz = tgt_tsc_khz;
                kvm_xen_update_tsc_info(v);
        }
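
For the record, the happy path works because, for shift <= 32,

  (tsc * mult) >> shift == (tsc * (mult << (32 - shift))) >> 32

so tsc_shift = 0 with tsc_to_system_mul = mult << (32 - shift) makes the
pvclock conversion agree exactly with the timekeeping one, provided
mult << (32 - shift) still fits in 32 bits (the overflow safeguard mentioned
above).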
  
Sean Christopherson Oct. 4, 2023, 12:07 a.m. UTC | #21
On Tue, Oct 03, 2023, David Woodhouse wrote:
> 
> 
> On 3 October 2023 01:53:11 BST, Sean Christopherson <seanjc@google.com> wrote:
> >I think there is still use for synchronizing with the host's view of time, e.g.
> >to deal with lost time across host suspend+resume.
> >
> >So I don't think we can completely sever KVM's paravirt clocks from host time,
> >at least not without harming use cases that rely on the host's view to keep
> >accurate time.  And honestly at that point, the right answer would be to stop
> >advertising paravirt clocks entirely.
> >
> >But I do think we can address the issues that Dongli and David are observing
> >where guest time drifts even though the host kernel's base time hasn't changed.
> >If I've pieced everything together correctly, the drift can be eliminated simply
> >by using the paravirt clock algorithm when converting the delta from the raw TSC
> >to nanoseconds.
> >
> >This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> >about all I've tested.
> 
> Hm, I don't think I like this.

Yeah, I don't like it either.  I'll respond to your other mail with details, but
this is a dead end anyway.

> You're making get_monotonic_raw() not *actually* return the monotonic_raw
> clock, but basically return the kvmclock instead? And why? So that when KVM
> attempts to synchronize the kvmclock to the monotonic_raw clock, it gets
> tricked into actually synchronizing the kvmclock to *itself*?
> 
> If you get this right, don't we have a fairly complex piece of code that has
> precisely *no* effect? 
> 
> Can't we just *refrain* from synchronizing the kvmclock to *anything*, in the
> CONSTANT_TSC case? Why do we do that anyway?
> 
> (Suspend/resume, live update and live migration are different. In *those*
> cases we may need to preserve both the guest TSC and kvmclock based on either
> the host TSC or CLOCK_TAI. But that's different.)

The issue is that the timekeeping code doesn't provide a notification mechanism
to *just* get updates for things like suspend/resume.  We could maybe do something
in KVM like unregister the notifier if the TSC is constant, and manually refresh
on suspend/resume.  But that's pretty gross too, and I'd definitely be concerned
that we missed something.
  
Sean Christopherson Oct. 4, 2023, 12:10 a.m. UTC | #22
On Tue, Oct 03, 2023, David Woodhouse wrote:
> On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > 
> > This is *very* lightly tested, as in it compiles and doesn't explode, but that's
> > about all I've tested.
> 
> I don't think it's working, if I understand what it's supposed to be
> doing.

...

> When I run xen_shinfo_test, the kvmclock still drifts from the
> "monotonic_raw" I get from kvm_get_time_and_clockread(), even with your
> patch.

It "works", in that it does what I intended it to do.  What I missed, and what's
obvious in hindsight, is that the timekeeping code doesn't *just* update what KVM
uses as the "base" when there "big events" like suspend/resume, the timekeeping
code updates the base clock on every tick.

My hack only smoothed away the delta, it didn't address the time that had gotten
moved into the base.  I though the base would be stable if the host was in a
steady state, but see above...
  
David Woodhouse Oct. 4, 2023, 8:06 a.m. UTC | #23
On Tue, 2023-10-03 at 17:07 -0700, Sean Christopherson wrote:
> 
> The issue is that the timekeeping code doesn't provide a notification mechanism
> to *just* get updates for things like suspend/reume.  We could maybe do something
> in KVM like unregister the notifier if the TSC is constant, and manually refresh
> on suspend/resume.  But that's pretty gross too, and I'd definitely be concerned
> that we missed something.

Or *in* the existing notifier, just do the maths (correctly!) and don't
update KVM's raw_clock data if the only thing that's really changed is
the passage of time?

On the whole, it's better if your other patch works to avoid the drift
in the first place. I'll test it...
  
David Woodhouse Oct. 4, 2023, 10:01 a.m. UTC | #24
On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> On Tue, Oct 03, 2023, David Woodhouse wrote:
> > On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > > 
> > > The two domains use the same "clock" (constant TSC), but different math to compute
> > > nanoseconds from a given TSC value.  For decently large TSC values, this results
> > > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> > 
> > This is the bit I'm still confused about, and it seems to be the root
> > of all the other problems.
> > 
> > Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
> > number of ticks of the TSC running at a constant known frequency, to a
> > number of nanoseconds.
> > 
> > So how in the name of all that is holy do they manage to get
> > *different* answers?
> > 
> > I get that the mult/shift thing carries some imprecision, but is that
> > all it is? 
> 
> Yep, pretty sure that's it.  It's like the plot from Office Space / Superman III.
> Those little rounding errors add up over time.
> 
> PV clock:
> 
>   nanoseconds = ((TSC >> shift) * mult) >> 32
> 
> or 
> 
>   nanoseconds = ((TSC << shift) * mult) >> 32
> 
> versus timekeeping (mostly)
> 
>   nanoseconds = (TSC * mult) >> shift
> 
> The more I look at the PV clock stuff, the more I agree with Peter: it's garbage.
> Shifting before multiplying is guaranteed to introduce error.  Shifting right drops
> data, and shifting left introduces zeros.
> 
> > Can't we ensure that the kvmclock uses the *same* algorithm,
> > precisely, as CLOCK_MONOTONIC_RAW?
> 
> Yes?  At least for sane hardware, after much staring, I think it's possible.
> 
> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
> 
> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
> 
> Compile tested only, but I believe this math is correct.  And I'm guessing we'd
> want some safeguards against overflow, e.g. due to a multiplier that is too big.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..ae9275c3d580 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>                                             v->arch.l1_tsc_scaling_ratio);
>  
>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> -                                  &vcpu->hv_clock.tsc_shift,
> -                                  &vcpu->hv_clock.tsc_to_system_mul);
> +               u32 shift, mult;
> +
> +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> +
> +               if (shift <= 32) {
> +                       vcpu->hv_clock.tsc_shift = 0;
> +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> +               } else {
> +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> +                                          &vcpu->hv_clock.tsc_shift,
> +                                          &vcpu->hv_clock.tsc_to_system_mul);
> +               }
> +
>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
>                 kvm_xen_update_tsc_info(v);
>         }
> 

I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
it got mult=1655736523, shift=32 and took the 'happy' path instead of
falling back.

It still drifts about the same though, using the same test as before:
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock


I was going to facetiously suggest that perhaps the kvmclock should
have leap nanoseconds... but then realised that that's basically what
Dongli's patch is *doing*. Maybe we just need to *recognise* that, so
rather than having a user-configured period for the update, KVM could
calculate the frequency for the updates based on the rate at which the
clocks would otherwise drift, and a maximum delta? Not my favourite
option, but perhaps better than nothing?
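
A back-of-the-envelope sketch of that sizing (purely illustrative, all names
invented; a kernel version would need fixed point instead of floating point):

#include <stdint.h>

/*
 * Estimate how long KVM could go between masterclock refreshes before the
 * pvclock and timekeeping conversions diverge by max_delta_ns.
 */
static uint64_t refresh_period_ns(uint32_t pv_mult, int8_t pv_shift,
				  uint32_t tk_mult, uint32_t tk_shift,
				  uint64_t max_delta_ns)
{
	/* nanoseconds advanced per TSC tick under each conversion */
	double pv = (double)pv_mult / 4294967296.0;	/* i.e. / 2^32 */
	double tk = (double)tk_mult / (double)(1ULL << tk_shift);
	double skew;

	if (pv_shift < 0)
		pv /= (double)(1ULL << -pv_shift);
	else
		pv *= (double)(1ULL << pv_shift);

	skew = pv > tk ? pv - tk : tk - pv;	/* ns of drift per tick */
	if (skew == 0.0)
		return UINT64_MAX;		/* no refresh ever needed */

	/* ticks until max_delta_ns of skew accumulates, converted to ns */
	return (uint64_t)(max_delta_ns / skew * tk);
}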
  
Sean Christopherson Oct. 4, 2023, 6:06 p.m. UTC | #25
On Wed, Oct 04, 2023, David Woodhouse wrote:
> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> > > Can't we ensure that the kvmclock uses the *same* algorithm,
> > > precisely, as CLOCK_MONOTONIC_RAW?
> > 
> > Yes?  At least for sane hardware, after much staring, I think it's possible.
> > 
> > It's tricky because the two algorithms are weirdly different, the PV clock algorithm
> > is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> > at us for suggesting that we try to shove the pvclock algorithm into the kernel.
> > 
> > The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
> > 
> > Compile tested only, but I believe this math is correct.  And I'm guessing we'd
> > want some safeguards against overflow, e.g. due to a multiplier that is too big.
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 6573c89c35a9..ae9275c3d580 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >                                             v->arch.l1_tsc_scaling_ratio);
> >  
> >         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> > -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > -                                  &vcpu->hv_clock.tsc_shift,
> > -                                  &vcpu->hv_clock.tsc_to_system_mul);
> > +               u32 shift, mult;
> > +
> > +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> > +
> > +               if (shift <= 32) {
> > +                       vcpu->hv_clock.tsc_shift = 0;
> > +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> > +               } else {
> > +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > +                                          &vcpu->hv_clock.tsc_shift,
> > +                                          &vcpu->hv_clock.tsc_to_system_mul);
> > +               }
> > +
> >                 vcpu->hw_tsc_khz = tgt_tsc_khz;
> >                 kvm_xen_update_tsc_info(v);
> >         }
> > 
> 
> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
> it got mult=1655736523, shift=32 and took the 'happy' path instead of
> falling back.
> 
> It still drifts about the same though, using the same test as before:
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock
> 
> 
> I was going to facetiously suggest that perhaps the kvmclock should
> have leap nanoseconds... but then realised that that's basically what
> Dongli's patch is *doing*. Maybe we just need to *recognise* that,

Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's
monotonic raw clock is a fool's errand.

> so rather than having a user-configured period for the update, KVM could
> calculate the frequency for the updates based on the rate at which the clocks
> would otherwise drift, and a maximum delta? Not my favourite option, but
> perhaps better than nothing? 

Holy moly, the existing code for the periodic syncs/updates is a mess.  If I'm
reading the code correctly, commits

  0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
  7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
  332967a3eac0 ("x86: kvm: introduce periodic global clock updates")

splattered together an impressively inefficient update mechanism.

On the first vCPU creation, KVM schedules kvmclock_sync_fn() with a hardcoded
period of 300 seconds (KVMCLOCK_SYNC_PERIOD).

	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
						KVMCLOCK_SYNC_PERIOD);

That handler does two things: schedule "delayed" work kvmclock_update_fn() to
be executed immediately, and reschedule kvmclock_sync_fn() for another 300
seconds out.  kvmclock_sync_fn() then kicks *every* vCPU in the VM, i.e. KVM
kicks every vCPU to sync kvmclock once every 300 seconds.

If we're going to kick every vCPU, then we might as well do a masterclock update,
because the extra cost of synchronizing the masterclock is likely in the noise
compared to kicking every vCPU.  There's also zero reason to do the work in vCPU
context.

And because that's not enough, on pCPU migration or if the TSC is unstable,
kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with
unstable TSCs.  But if KVM is periodically doing clock updates on all vCPUs,
scheduling another update with a *longer* delay is silly.

The really, really stupid part of it all is that the periodic syncs happen even if
kvmclock isn't exposed to the guest.  *sigh*

So rather than add yet another periodic work function, I think we should clean up
the mess we have, fix the whole "leapseconds" mess with the masterclock, and then
tune the frequency (if necessary).

Something like the below is what I'm thinking.  Once the dust settles, I'd like
to dynamically enable/disable kvmclock_sync_work based on whether or not the
VM actually has vCPU's with a pvclock, but that's definitely an enhancement that
can go on top.

Does this look sane, or am I missing something?

---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/x86.c              | 53 +++++++++++----------------------
 2 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 34a64527654c..d108452fc301 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -98,7 +98,7 @@
 	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_SCAN_IOAPIC \
 	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
+/* AVAILABLE BIT!!!!			KVM_ARCH_REQ(16) */
 #define KVM_REQ_APIC_PAGE_RELOAD \
 	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
@@ -1336,7 +1336,6 @@ struct kvm_arch {
 	bool use_master_clock;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
-	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6573c89c35a9..5d35724f1963 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 	}
 
 	vcpu->arch.time = system_time;
-	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
+	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 	/* we verify if the enable bit is set... */
 	if (system_time & 1)
@@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
 
-static void kvmclock_update_fn(struct work_struct *work)
-{
-	unsigned long i;
-	struct delayed_work *dwork = to_delayed_work(work);
-	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
-					   kvmclock_update_work);
-	struct kvm *kvm = container_of(ka, struct kvm, arch);
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-		kvm_vcpu_kick(vcpu);
-	}
-}
-
-static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
-{
-	struct kvm *kvm = v->kvm;
-
-	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
-					KVMCLOCK_UPDATE_DELAY);
-}
-
 #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
 
 static void kvmclock_sync_fn(struct work_struct *work)
@@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
 					   kvmclock_sync_work);
 	struct kvm *kvm = container_of(ka, struct kvm, arch);
 
-	if (!kvmclock_periodic_sync)
-		return;
+	if (ka->use_master_clock)
+		kvm_update_masterclock(kvm);
+	else
+		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
 
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
-	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
-					KVMCLOCK_SYNC_PERIOD);
+	if (kvmclock_periodic_sync)
+		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
+				      KVMCLOCK_SYNC_PERIOD);
 }
 
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
@@ -4845,7 +4823,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * kvmclock on vcpu->cpu migration
 		 */
 		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
-			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
 		vcpu->cpu = cpu;
@@ -10520,12 +10498,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			__kvm_migrate_timers(vcpu);
 		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
 			kvm_update_masterclock(vcpu->kvm);
-		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
-			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
 				goto out;
+
+			/*
+			 * Ensure all other vCPUs synchronize "soon", e.g. so
+			 * that all vCPUs recognize NTP corrections and drift
+			 * corrections (relative to the kernel's raw clock).
+			 */
+			if (!kvmclock_periodic_sync)
+				schedule_delayed_work(&vcpu->kvm->arch.kvmclock_sync_work,
+						      KVMCLOCK_UPDATE_DELAY);
 		}
 		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
 			kvm_mmu_sync_roots(vcpu);
@@ -12345,7 +12330,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
 
-	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
 
 	kvm_apicv_init(kvm);
@@ -12387,7 +12371,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 void kvm_arch_sync_events(struct kvm *kvm)
 {
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
-	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);
 }
 

base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
--
  
Dongli Zhang Oct. 4, 2023, 7:13 p.m. UTC | #26
Hi Sean,

On 10/4/23 11:06 AM, Sean Christopherson wrote:
> On Wed, Oct 04, 2023, David Woodhouse wrote:
>> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
>>>> Can't we ensure that the kvmclock uses the *same* algorithm,
>>>> precisely, as CLOCK_MONOTONIC_RAW?
>>>
>>> Yes?  At least for sane hardware, after much staring, I think it's possible.
>>>
>>> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
>>> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
>>> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
>>>
>>> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
>>>
>>> Compile tested only, but I believe this math is correct.  And I'm guessing we'd
>>> want some safeguards against overflow, e.g. due to a multiplier that is too big.
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 6573c89c35a9..ae9275c3d580 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>>                                             v->arch.l1_tsc_scaling_ratio);
>>>  
>>>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>>> -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> -                                  &vcpu->hv_clock.tsc_shift,
>>> -                                  &vcpu->hv_clock.tsc_to_system_mul);
>>> +               u32 shift, mult;
>>> +
>>> +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
>>> +
>>> +               if (shift <= 32) {
>>> +                       vcpu->hv_clock.tsc_shift = 0;
>>> +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
>>> +               } else {
>>> +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
>>> +                                          &vcpu->hv_clock.tsc_shift,
>>> +                                          &vcpu->hv_clock.tsc_to_system_mul);
>>> +               }
>>> +
>>>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
>>>                 kvm_xen_update_tsc_info(v);
>>>         }
>>>
>>
>> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
>> it got mult=1655736523, shift=32 and took the 'happy' path instead of
>> falling back.
>>
>> It still drifts about the same though, using the same test as before:
>> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock
>>
>>
>> I was going to facetiously suggest that perhaps the kvmclock should
>> have leap nanoseconds... but then realised that that's basically what
>> Dongli's patch is *doing*. Maybe we just need to *recognise* that,
> 
> Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's
> monotonic raw clock is a fool's errand.
> 
>> so rather than having a user-configured period for the update, KVM could
>> calculate the frequency for the updates based on the rate at which the clocks
>> would otherwise drift, and a maximum delta? Not my favourite option, but
>> perhaps better than nothing? 
> 
> Holy moly, the existing code for the periodic syncs/updates is a mess.  If I'm
> reading the code correctly, commits
> 
>   0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
>   7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
>   332967a3eac0 ("x86: kvm: introduce periodic global clock updates")
> 
> splattered together an immpressively inefficient update mechanism.
> 
> On the first vCPU creation, KVM schedules kvmclock_sync_fn() with a hardcoded
> period of 300 seconds (KVMCLOCK_SYNC_PERIOD).
> 
> 	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
> 		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> 						KVMCLOCK_SYNC_PERIOD);
> 
> That handler does two things: schedule "delayed" work kvmclock_update_fn() to
> be executed immediately, and reschedule kvmclock_sync_fn() for another 300
> seconds out.  kvmclock_sync_fn() then kicks *every* vCPU in the VM, i.e. KVM
> kicks every vCPU to sync kvmclock once every 300 seconds.
> 
> If we're going to kick every vCPU, then we might as well do a masterclock update,
> because the extra cost of synchronizing the masterclock is likely in the noise
> compared to kicking every vCPU.  There's also zero reason to do the work in vCPU
> context.
> 
> And because that's not enough, on pCPU migration or if the TSC is unstable,
> kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
> kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with
> unstable TSCs.  But if KVM is periodically doing clock updates on all vCPUs,
> scheduling another update with a *longer* delay is silly.

We may need to add the above explanation as a comment at the places where
KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the patch.

That would help readers understand why KVM_REQ_CLOCK_UPDATE alone is sometimes
enough.

> 
> The really, really stupid part of it all is that the periodic syncs happen even if
> kvmclock isn't exposed to the guest.  *sigh*
> 
> So rather than add yet another periodic work function, I think we should clean up
> the mess we have, fix the whole "leapseconds" mess with the masterclock, and then
> tune the frequency (if necessary).
> 
> Something like the below is what I'm thinking.  Once the dust settles, I'd like
> to dynamically enable/disable kvmclock_sync_work based on whether or not the
> VM actually has vCPU's with a pvclock, but that's definitely an enhancement that
> can go on top.
> 
> Does this look sane, or am I missing something?
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +-
>  arch/x86/kvm/x86.c              | 53 +++++++++++----------------------
>  2 files changed, 19 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 34a64527654c..d108452fc301 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -98,7 +98,7 @@
>  	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_SCAN_IOAPIC \
>  	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> -#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
> +/* AVAILABLE BIT!!!!			KVM_ARCH_REQ(16) */
>  #define KVM_REQ_APIC_PAGE_RELOAD \
>  	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
> @@ -1336,7 +1336,6 @@ struct kvm_arch {
>  	bool use_master_clock;
>  	u64 master_kernel_ns;
>  	u64 master_cycle_now;
> -	struct delayed_work kvmclock_update_work;
>  	struct delayed_work kvmclock_sync_work;
>  
>  	struct kvm_xen_hvm_config xen_hvm_config;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..5d35724f1963 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
>  	}
>  
>  	vcpu->arch.time = system_time;
> -	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> +	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

As mentioned above, we may need a comment here to explain why
KVM_REQ_CLOCK_UPDATE on just this one vCPU is enough.

>  
>  	/* we verify if the enable bit is set... */
>  	if (system_time & 1)
> @@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  
>  #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
>  
> -static void kvmclock_update_fn(struct work_struct *work)
> -{
> -	unsigned long i;
> -	struct delayed_work *dwork = to_delayed_work(work);
> -	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> -					   kvmclock_update_work);
> -	struct kvm *kvm = container_of(ka, struct kvm, arch);
> -	struct kvm_vcpu *vcpu;
> -
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> -		kvm_vcpu_kick(vcpu);
> -	}
> -}
> -
> -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> -{
> -	struct kvm *kvm = v->kvm;
> -
> -	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> -	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> -					KVMCLOCK_UPDATE_DELAY);
> -}
> -
>  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)

While David mentioned "maximum delta", how about turning the above into a
module param with a default of 300*HZ?

BTW, 300*HZ should be enough for the vCPU hotplug case, unless people prefer
1-hour or 1-day.
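
A minimal sketch of what that could look like (the parameter name is just an
illustration):

	static unsigned int kvmclock_sync_period_secs = 300;
	module_param(kvmclock_sync_period_secs, uint, 0444);

	#define KVMCLOCK_SYNC_PERIOD (kvmclock_sync_period_secs * HZ)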

>  
>  static void kvmclock_sync_fn(struct work_struct *work)
> @@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  					   kvmclock_sync_work);
>  	struct kvm *kvm = container_of(ka, struct kvm, arch);
>  
> -	if (!kvmclock_periodic_sync)
> -		return;
> +	if (ka->use_master_clock)
> +		kvm_update_masterclock(kvm);

Based on the source code, I think it is safe to call kvm_update_masterclock() here.

We want the masterclock to be updated only once; requesting
KVM_REQ_MASTERCLOCK_UPDATE on each vCPU here would be redundant.

I just want to point out that this is a shared workqueue. The workqueue stall
detection may report a false positive (e.g., due to tsc_write_lock contention,
although that lock should not be heavily contended).
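
If that ever becomes a problem, a dedicated workqueue would keep stall reports
attributable to KVM (a sketch, with a made-up workqueue name):

	static struct workqueue_struct *kvmclock_wq;

	/* at module init */
	kvmclock_wq = alloc_workqueue("kvmclock", WQ_UNBOUND | WQ_FREEZABLE, 0);

	/* instead of schedule_delayed_work() */
	queue_delayed_work(kvmclock_wq, &kvm->arch.kvmclock_sync_work,
			   KVMCLOCK_SYNC_PERIOD);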


Thank you very much!

Dongli Zhang

> +	else
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
>  
> -	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> -	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> -					KVMCLOCK_SYNC_PERIOD);
> +	if (kvmclock_periodic_sync)
> +		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> +				      KVMCLOCK_SYNC_PERIOD);
>  }
>  
>  /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
> @@ -4845,7 +4823,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		 * kvmclock on vcpu->cpu migration
>  		 */
>  		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
> -			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  		if (vcpu->cpu != cpu)
>  			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
>  		vcpu->cpu = cpu;
> @@ -10520,12 +10498,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			__kvm_migrate_timers(vcpu);
>  		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
>  			kvm_update_masterclock(vcpu->kvm);
> -		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
> -			kvm_gen_kvmclock_update(vcpu);
>  		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
>  			r = kvm_guest_time_update(vcpu);
>  			if (unlikely(r))
>  				goto out;
> +
> +			/*
> +			 * Ensure all other vCPUs synchronize "soon", e.g. so
> +			 * that all vCPUs recognize NTP corrections and drift
> +			 * corrections (relative to the kernel's raw clock).
> +			 */
> +			if (!kvmclock_periodic_sync)
> +				schedule_delayed_work(&vcpu->kvm->arch.kvmclock_sync_work,
> +						      KVMCLOCK_UPDATE_DELAY);
>  		}
>  		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
>  			kvm_mmu_sync_roots(vcpu);
> @@ -12345,7 +12330,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	kvm->arch.hv_root_tdp = INVALID_PAGE;
>  #endif
>  
> -	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
>  	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
>  
>  	kvm_apicv_init(kvm);
> @@ -12387,7 +12371,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
>  void kvm_arch_sync_events(struct kvm *kvm)
>  {
>  	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
> -	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
>  	kvm_free_pit(kvm);
>  }
>  
> 
> base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
  
Sean Christopherson Oct. 11, 2023, 12:20 a.m. UTC | #27
On Wed, Oct 04, 2023, Dongli Zhang wrote:
> > And because that's not enough, on pCPU migration or if the TSC is unstable,
> > kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
> > kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with
> > unstable TSCs.  But if KVM is periodically doing clock updates on all vCPUs,
> > scheduling another update with a *longer* delay is silly.
> 
> We may need to add the above explanation as comments at the places where
> KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the patch?

Yeah, comments are most definitely needed, this was just intended to be a quick
sketch to get the ball rolling.

> > -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> > -{
> > -	struct kvm *kvm = v->kvm;
> > -
> > -	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> > -	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> > -					KVMCLOCK_UPDATE_DELAY);
> > -}
> > -
> >  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
> 
> While David mentioned "maximum delta", how about turning the above into a
> module param with a default of 300*HZ?
> 
> BTW, 300*HZ should be enough for the vCPU hotplug case, unless people prefer
> 1-hour or 1-day.

Hmm, I think I agree with David that it would be better if KVM can take care of
the gory details and promise a certain level of accuracy.  I'm usually a fan of
punting complexity to userspace, but requiring every userspace to figure out the
ideal sync frequency on every platform is more than a bit unfriendly.  And it
might not even be realistic unless userspace makes assumptions about how the kernel
computes CLOCK_MONOTONIC_RAW from TSC cycles.

 : so rather than having a user-configured period for the update, KVM could
 : calculate the frequency for the updates based on the rate at which the clocks
 : would otherwise drift, and a maximum delta? Not my favourite option, but
 : perhaps better than nothing?
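
To make that suggestion concrete, a sketch of the arithmetic involved (the
helper name is hypothetical; overflow handling omitted):

	/*
	 * If kvmclock diverges from the host's TSC-based CLOCK_MONOTONIC_RAW
	 * by at most drift_ppb (parts per billion), then to keep the
	 * accumulated error under max_delta_ns the refresh period must
	 * satisfy:  period_ns <= max_delta_ns * NSEC_PER_SEC / drift_ppb
	 */
	static u64 kvmclock_refresh_period_ns(u64 drift_ppb, u64 max_delta_ns)
	{
		return div64_u64(max_delta_ns * NSEC_PER_SEC, drift_ppb);
	}

E.g. the ~31 ppb divergence implied by the 4502145ns-over-40-hours measurement
in the commit message, with a 100us error budget, gives a refresh period of
roughly 3200 seconds.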
  
David Woodhouse Oct. 11, 2023, 7:18 a.m. UTC | #28
On Tue, 2023-10-10 at 17:20 -0700, Sean Christopherson wrote:
> On Wed, Oct 04, 2023, Dongli Zhang wrote:
> > > And because that's not enough, on pCPU migration or if the TSC is unstable,
> > > kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
> > > kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with
> > > unstable TSCs.  But if KVM is periodically doing clock updates on all vCPUs,
> > > scheduling another update with a *longer* delay is silly.
> > 
> > We may need to add the above explanation as comments at the places where
> > KVM_REQ_GLOBAL_CLOCK_UPDATE is replaced with KVM_REQ_CLOCK_UPDATE in the patch?
> 
> Yeah, comments are most definitely needed, this was just intended to be a quick
> sketch to get the ball rolling.
> 
> > > -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> > > -{
> > > -       struct kvm *kvm = v->kvm;
> > > -
> > > -       kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> > > -       schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> > > -                                       KVMCLOCK_UPDATE_DELAY);
> > > -}
> > > -
> > >  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
> > 
> > While David mentioned "maximum delta", how about turning the above into a
> > module param with a default of 300*HZ?
> > 
> > BTW, 300*HZ should be enough for the vCPU hotplug case, unless people prefer
> > 1-hour or 1-day.
> 
> Hmm, I think I agree with David that it would be better if KVM can take care of
> the gory details and promise a certain level of accuracy.  I'm usually a fan of
> punting complexity to userspace, but requiring every userspace to figure out the
> ideal sync frequency on every platform is more than a bit unfriendly.  And it
> might not even be realistic unless userspace makes assumptions about how the kernel
> computes CLOCK_MONOTONIC_RAW from TSC cycles.
> 

I think perhaps I would rather save up my persuasiveness on the topic
of "let's not make things too awful for userspace to cope with" for the
live update/migration mess. I think I need to dust off that attempt at
fixing our 'how to migrate with clocks intact' documentation from
https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/
The changes we're discussing here obviously have an effect on migration
too.

Where the host TSC is actually reliable, I would really prefer for the
kvmclock to just be a fixed function of the guest TSC and *not* to be
arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.

That makes the process of migrating to another machine (potentially
with a different host TSC frequency) a whole lot simpler. Userspace
just needs to know two things from the kernel:

 • the relationship between the guest's TSC and its kvmclock (which we
   helpfully advertise to the *guest* in a pvclock structure, and can
   give that to userspace too). 

 • The value of *either* the guest's TSC or its kvmclock, at a given
   value of CLOCK_TAI (not CLOCK_WALLTIME). Theoretically, this can be
   either TSC or kvmclock. But I think for best precision it would need
   to be TSC?


I am aware that I glibly said "guest TSC" as if there's only one, or
they're in sync. I think we can substitute "the TSC of guest vCPU0" in
the case of migration. We don't want a guest to be able to change its
kvmclock by writing to vCPU0's TSC though, so we may need to keep (and
migrate) an additional offset value for tracking that?




[1] Yes, I believe "back" does happen. I have test failures in my queue
to look at, where guests see the "Xen" clock going backwards.
  
Sean Christopherson Oct. 13, 2023, 6:07 p.m. UTC | #29
On Wed, Oct 11, 2023, David Woodhouse wrote:
> On Tue, 2023-10-10 at 17:20 -0700, Sean Christopherson wrote:
> > On Wed, Oct 04, 2023, Dongli Zhang wrote:
> > > > -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> > > > -{
> > > > -       struct kvm *kvm = v->kvm;
> > > > -
> > > > -       kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> > > > -       schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> > > > -                                       KVMCLOCK_UPDATE_DELAY);
> > > > -}
> > > > -
> > > >  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
> > > 
> > > While David mentioned "maximum delta", how about turning the above into a
> > > module param with a default of 300*HZ?
> > > 
> > > BTW, 300*HZ should be enough for the vCPU hotplug case, unless people prefer
> > > 1-hour or 1-day.
> > 
> > Hmm, I think I agree with David that it would be better if KVM can take care of
> > the gory details and promise a certain level of accuracy.  I'm usually a fan of
> > punting complexity to userspace, but requiring every userspace to figure out the
> > ideal sync frequency on every platform is more than a bit unfriendly.  And it
> > might not even be realistic unless userspace makes assumptions about how the kernel
> > computes CLOCK_MONOTONIC_RAW from TSC cycles.
> > 
> 
> I think perhaps I would rather save up my persuasiveness on the topic
> of "let's not make things too awful for userspace to cope with" for the
> live update/migration mess. I think I need to dust off that attempt at
> fixing our 'how to migrate with clocks intact' documentation from
> https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/
> The changes we're discussing here obviously have an effect on migration
> too.
> 
> Where the host TSC is actually reliable, I would really prefer for the
> kvmclock to just be a fixed function of the guest TSC and *not* to be
> arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.

CLOCK_MONOTONIC_RAW!  Just wanted to clarify because if kvmclock were tied to the
non-raw clock, then we'd have to somehow reconcile host NTP updates.

I generally support the idea, but I think it needs to be an opt-in from userspace.
Essentially an "I pinky swear to give all vCPUs the same TSC frequency, to not
suspend the host, and to not run software/firmware that writes IA32_TSC_ADJUST".
AFAICT, there are too many edge cases and assumptions about userspace for KVM to
safely couple kvmclock to guest TSC by default.

> [1] Yes, I believe "back" does happen. I have test failures in my queue
> to look at, where guests see the "Xen" clock going backwards.

Yeah, I assume "back" can happen based purely on the weirdness of the pvclock math.
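
For reference, the per-vCPU conversion in question is roughly the following
(ignoring the 128-bit intermediate math the kernel actually uses); the drift
comes from this mul/shift pair disagreeing slightly with the mult/shift that
clocksource_tsc uses for CLOCK_MONOTONIC_RAW:

	/* Simplified sketch of the pvclock delta scaling. */
	static u64 scale_tsc_delta(u64 delta, u32 tsc_to_system_mul, int tsc_shift)
	{
		if (tsc_shift < 0)
			delta >>= -tsc_shift;
		else
			delta <<= tsc_shift;

		return (delta * tsc_to_system_mul) >> 32;
	}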

What if we add a module param to disable KVM's TSC synchronization craziness
entirely?  If we first clean up the periodic sync mess, then it seems like it'd
be relatively straightforward to kill off all of the synchronization, including
the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.

Not intended to be a functional patch...

---
 arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5b2104bdd99f..75fc6cbaef0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 static bool __read_mostly kvmclock_periodic_sync = true;
 module_param(kvmclock_periodic_sync, bool, S_IRUGO);
 
+static bool __read_mostly enable_tsc_sync = true;
+module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
+
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
@@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool matched = false;
 	bool synchronizing = false;
 
+	if (!enable_tsc_sync) {
+		offset = kvm_compute_l1_tsc_offset(vcpu, data);
+		kvm_vcpu_write_tsc_offset(vcpu, offset);
+		return;
+	}
+
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
@@ -2967,9 +2976,12 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
 
-	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-				&& !ka->backwards_tsc_observed
-				&& !ka->boot_vcpu_runs_old_kvmclock;
+	WARN_ON_ONCE(!host_tsc_clocksource && !enable_tsc_sync);
+
+	ka->use_master_clock = host_tsc_clocksource &&
+			       (vcpus_matched || !enable_tsc_sync) &&
+			       !ka->backwards_tsc_observed &&
+			       !ka->boot_vcpu_runs_old_kvmclock;
 
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
@@ -3278,6 +3290,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
 
 void kvm_adjust_pv_clock_users(struct kvm *kvm, bool add_user)
 {
+	if (!enable_tsc_sync)
+		return;
+
 	/*
 	 * Doesn't need to be a spinlock, but can't be kvm->lock as this is
 	 * call while holding a vCPU's mutext.
@@ -5528,6 +5543,11 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		if (get_user(offset, uaddr))
 			break;
 
+		if (!enable_tsc_sync) {
+			kvm_vcpu_write_tsc_offset(vcpu, offset);
+			break;
+		}
+
 		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 
 		matched = (vcpu->arch.virtual_tsc_khz &&
@@ -12188,6 +12208,9 @@ int kvm_arch_hardware_enable(void)
 	if (ret != 0)
 		return ret;
 
+	if (!enable_tsc_sync)
+		return 0;
+
 	local_tsc = rdtsc();
 	stable = !kvm_check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {
@@ -13670,6 +13693,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
 
 static int __init kvm_x86_init(void)
 {
+	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+		enable_tsc_sync = true;
+
+	if (!enable_tsc_sync)
+		kvmclock_periodic_sync = false;
+
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
 	return 0;

base-commit: 7d2edad0beb2a6f07f6e6c2d477d5874f5417d6c
--
  
David Woodhouse Oct. 13, 2023, 6:21 p.m. UTC | #30
On Fri, 2023-10-13 at 11:07 -0700, Sean Christopherson wrote:
> On Wed, Oct 11, 2023, David Woodhouse wrote:
> > On Tue, 2023-10-10 at 17:20 -0700, Sean Christopherson wrote:
> > > On Wed, Oct 04, 2023, Dongli Zhang wrote:
> > > > > -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> > > > > -{
> > > > > -       struct kvm *kvm = v->kvm;
> > > > > -
> > > > > -       kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> > > > > -       schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> > > > > -                                       KVMCLOCK_UPDATE_DELAY);
> > > > > -}
> > > > > -
> > > > >  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
> > > > 
> > > > While David mentioned "maximum delta", how about turning the above into a
> > > > module param with a default of 300*HZ?
> > > > 
> > > > BTW, 300*HZ should be enough for the vCPU hotplug case, unless people prefer
> > > > 1-hour or 1-day.
> > > 
> > > Hmm, I think I agree with David that it would be better if KVM can take care of
> > > the gory details and promise a certain level of accuracy.  I'm usually a fan of
> > > punting complexity to userspace, but requiring every userspace to figure out the
> > > ideal sync frequency on every platform is more than a bit unfriendly.  And it
> > > might not even be realistic unless userspace makes assumptions about how the kernel
> > > computes CLOCK_MONOTONIC_RAW from TSC cycles.
> > > 
> > 
> > I think perhaps I would rather save up my persuasiveness on the topic
> > of "let's not make things too awful for userspace to cope with" for the
> > live update/migration mess. I think I need to dust off that attempt at
> > fixing our 'how to migrate with clocks intact' documentation from
> > https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/
> > The changes we're discussing here obviously have an effect on migration
> > too.
> > 
> > Where the host TSC is actually reliable, I would really prefer for the
> > kvmclock to just be a fixed function of the guest TSC and *not* to be
> > arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.
> 
> CLOCK_MONOTONIC_RAW!  Just wanted to clarify because if kvmclock were tied to the
> non-raw clock, then we'd have to somehow reconcile host NTP updates.

Sorry, yes. CLOCK_MONOTONIC_RAW. That was just a typo in email.

Of course we'd never try to use CLOCK_MONOTONIC; that would be daft.
We'd never do that. Just like we'd never do something daft like using
CLOCK_REALTIME instead of CLOCK_TAI for the KVM_CLOCK_REALTIME pairing
in __get_kvmclock()... oh.

Shit.

> I generally support the idea, but I think it needs to be an opt-in from userspace.
> Essentially an "I pinky swear to give all vCPUs the same TSC frequency, to not
> suspend the host, and to not run software/firmware that writes IA32_TSC_ADJUST".
> AFAICT, there are too many edge cases and assumptions about userspace for KVM to
> safely couple kvmclock to guest TSC by default.

I think IA32_TSC_ADJUST is OK, isn't it? There is a "real" TSC value
and if vCPUs adjust themselves forward and backwards from that, it's
just handled as a delta.

And we solved 'give all vCPUs the same TSC frequency' by making that
KVM-wide.

Maybe suspending and resuming the host can be treated like live
migration, where you know the host TSC is different so you have to make
do with a delta based on CLOCK_TAI.

But while I'm picking on the edge cases and suggesting that we *can*
cope with some of them, I do agree with your suggestion that "let
kvmclock run by itself without being clamped back to
CLOCK_MONOTONIC_RAW" should be an opt *in* feature.

> > [1] Yes, I believe "back" does happen. I have test failures in my queue
> > to look at, where guests see the "Xen" clock going backwards.
> 
> Yeah, I assume "back" can happen based purely on the weirdness of the pvclock math.
> 
> What if we add a module param to disable KVM's TSC synchronization craziness
> entirely?  If we first clean up the periodic sync mess, then it seems like it'd
> be relatively straightforward to kill off all of the synchronization, including
> the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
> 
> Not intended to be a functional patch...

Will stare harder at the actual patch when it isn't Friday night.

In the meantime, I do think a KVM cap that the VMM opts into is better
than a module param?

> ---
>  arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5b2104bdd99f..75fc6cbaef0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  static bool __read_mostly kvmclock_periodic_sync = true;
>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>  
> +static bool __read_mostly enable_tsc_sync = true;
> +module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
> +
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> @@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>         bool matched = false;
>         bool synchronizing = false;
>  
> +       if (!enable_tsc_sync) {
> +               offset = kvm_compute_l1_tsc_offset(vcpu, data);
> +               kvm_vcpu_write_tsc_offset(vcpu, offset);
> +               return;
> +       }
> +
>         raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>         offset = kvm_compute_l1_tsc_offset(vcpu, data);
>         ns = get_kvmclock_base_ns();
> @@ -2967,9 +2976,12 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>                                         &ka->master_kernel_ns,
>                                         &ka->master_cycle_now);
>  
> -       ka->use_master_clock = host_tsc_clocksource && vcpus_matched
> -                               && !ka->backwards_tsc_observed
> -                               && !ka->boot_vcpu_runs_old_kvmclock;
> +       WARN_ON_ONCE(!host_tsc_clocksource && !enable_tsc_sync);
> +
> +       ka->use_master_clock = host_tsc_clocksource &&
> +                              (vcpus_matched || !enable_tsc_sync) &&
> +                              !ka->backwards_tsc_observed &&
> +                              !ka->boot_vcpu_runs_old_kvmclock;
>  
>         if (ka->use_master_clock)
>                 atomic_set(&kvm_guest_has_master_clock, 1);
> @@ -3278,6 +3290,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  
>  void kvm_adjust_pv_clock_users(struct kvm *kvm, bool add_user)
>  {
> +       if (!enable_tsc_sync)
> +               return;
> +
>         /*
>          * Doesn't need to be a spinlock, but can't be kvm->lock as this is
>          * call while holding a vCPU's mutext.
> @@ -5528,6 +5543,11 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
>                 if (get_user(offset, uaddr))
>                         break;
>  
> +               if (!enable_tsc_sync) {
> +                       kvm_vcpu_write_tsc_offset(vcpu, offset);
> +                       break;
> +               }
> +
>                 raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>  
>                 matched = (vcpu->arch.virtual_tsc_khz &&
> @@ -12188,6 +12208,9 @@ int kvm_arch_hardware_enable(void)
>         if (ret != 0)
>                 return ret;
>  
> +       if (!enable_tsc_sync)
> +               return 0;
> +
>         local_tsc = rdtsc();
>         stable = !kvm_check_tsc_unstable();
>         list_for_each_entry(kvm, &vm_list, vm_list) {
> @@ -13670,6 +13693,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
>  
>  static int __init kvm_x86_init(void)
>  {
> +       if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +               enable_tsc_sync = true;
> +
> +       if (!enable_tsc_sync)
> +               kvmclock_periodic_sync = false;
> +
>         kvm_mmu_x86_module_init();
>         mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
>         return 0;
> 
> base-commit: 7d2edad0beb2a6f07f6e6c2d477d5874f5417d6c
  
Sean Christopherson Oct. 13, 2023, 7:02 p.m. UTC | #31
On Fri, Oct 13, 2023, David Woodhouse wrote:
> On Fri, 2023-10-13 at 11:07 -0700, Sean Christopherson wrote:
> > I generally support the idea, but I think it needs to be an opt-in from userspace.
> > Essentially an "I pinky swear to give all vCPUs the same TSC frequency, to not
> > suspend the host, and to not run software/firmware that writes IA32_TSC_ADJUST".
> > AFAICT, there are too many edge cases and assumptions about userspace for KVM to
> > safely couple kvmclock to guest TSC by default.
> 
> I think IA32_TSC_ADJUST is OK, isn't it? There is a "real" TSC value
> and if vCPUs adjust themselves forward and backwards from that, it's
> just handled as a delta.

I meant the host writing IA32_TSC_ADJUST.  E.g. if a host SMM handler mucks with
TSC offsets to try and hide the time spent in the SMM handler, then the platform
owner gets to keep the pieces.

> And we solved 'give all vCPUs the same TSC frequency' by making that
> KVM-wide.
> 
> Maybe suspending and resuming the host can be treated like live
> migration, where you know the host TSC is different so you have to make
> do with a delta based on CLOCK_TAI.
> 
> But while I'm picking on the edge cases and suggesting that we *can*
> cope with some of them, I do agree with your suggestion that "let
> kvmclock run by itself without being clamped back to
> CLOCK_MONOTONIC_RAW" should be an opt *in* feature.

Yeah, I'm of the mind that just because we can cope with some edge cases, doesn't
mean we should.  At this point, kvmclock really should be considered deprecated
on modern hardware.  I.e. needs to be supported for older VMs, but shouldn't be
advertised/used when creating entirely new VMs.

Hence my desire to go with a low effort solution for getting kvmclock to play nice
with modern hardware.

> > > [1] Yes, I believe "back" does happen. I have test failures in my queue
> > > to look at, where guests see the "Xen" clock going backwards.
> > 
> > Yeah, I assume "back" can happen based purely on the weirdness of the pvclock math.
> > 
> > What if we add a module param to disable KVM's TSC synchronization craziness
> > entirely?  If we first clean up the periodic sync mess, then it seems like it'd
> > be relatively straightforward to kill off all of the synchronization, including
> > the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
> > 
> > Not intended to be a functional patch...
> 
> Will stare harder at the actual patch when it isn't Friday night.
> 
> In the meantime, I do think a KVM cap that the VMM opts into is better
> than a module param?

Hmm, yeah, I think a capability would be cleaner overall.  Then KVM could return
-EINVAL instead of silently forcing synchronization if the platform conditions
aren't met, e.g. if the TSC isn't constant or if the host timekeeping isn't
using TSC.

The interaction with kvmclock_periodic_sync might be a bit awkward, but that's
easy enough to solve with a wrapper.
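
A rough sketch of the cap enablement in kvm_vm_ioctl_enable_cap() (the
capability name and the arch flag are hypothetical, and locals are elided):

	case KVM_CAP_X86_DISABLE_TSC_SYNC:	/* hypothetical cap */
		r = -EINVAL;
		if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ||
		    !kvm_get_time_and_clockread(&kernel_ns, &cycle_now))
			break;
		kvm->arch.tsc_sync_disabled = true;	/* hypothetical flag */
		r = 0;
		break;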
  
David Woodhouse Oct. 13, 2023, 7:12 p.m. UTC | #32
On Fri, 2023-10-13 at 12:02 -0700, Sean Christopherson wrote:
> On Fri, Oct 13, 2023, David Woodhouse wrote:
> > On Fri, 2023-10-13 at 11:07 -0700, Sean Christopherson wrote:
> > > I generally support the idea, but I think it needs to be an opt-in from userspace.
> > > Essentially an "I pinky swear to give all vCPUs the same TSC frequency, to not
> > > suspend the host, and to not run software/firmware that writes IA32_TSC_ADJUST".
> > > AFAICT, there are too many edge cases and assumptions about userspace for KVM to
> > > safely couple kvmclock to guest TSC by default.
> > 
> > I think IA32_TSC_ADJUST is OK, isn't it? There is a "real" TSC value
> > and if vCPUs adjust themselves forward and backwards from that, it's
> > just handled as a delta.
> 
> I meant the host writing IA32_TSC_ADJUST.  E.g. if a host SMM handler mucks with
> TSC offsets to try and hide the time spent in the SMM handler, then the platform
> owner gets to keep the pieces.

Oh $DEITY yes, absolutely.

> > And we solved 'give all vCPUs the same TSC frequency' by making that
> > KVM-wide.
> > 
> > Maybe suspending and resuming the host can be treated like live
> > migration, where you know the host TSC is different so you have to make
> > do with a delta based on CLOCK_TAI.
> > 
> > But while I'm picking on the edge cases and suggesting that we *can*
> > cope with some of them, I do agree with your suggestion that "let
> > kvmclock run by itself without being clamped back to
> > CLOCK_MONOTONIC_RAW" should be an opt *in* feature.
> 
> Yeah, I'm of the mind that just because we can cope with some edge cases, doesn't
> mean we should.  At this point, kvmclock really should be considered deprecated
> on modern hardware.  I.e. needs to be supported for older VMs, but shouldn't be
> advertised/used when creating entirely new VMs.
> 
> Hence my desire to go with a low effort solution for getting kvmclock to play nice
> with modern hardware.

Yeah... although the kvmclock is also the *Xen* clock (and the clock on
which Xen timers are based). So while I'm perfectly prepared to call
those Xen guests "older VMs", I do still have to launch quite a lot of
new ones all the same... :)


> > > > [1] Yes, I believe "back" does happen. I have test failures in my queue
> > > > to look at, where guests see the "Xen" clock going backwards.
> > > 
> > > Yeah, I assume "back" can happen based purely on the weirdness of the pvclock math.
> > > 
> > > What if we add a module param to disable KVM's TSC synchronization craziness
> > > entirely?  If we first clean up the periodic sync mess, then it seems like it'd
> > > be relatively straightforward to kill off all of the synchronization, including
> > > the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
> > > 
> > > Not intended to be a functional patch...
> > 
> > Will stare harder at the actual patch when it isn't Friday night.
> > 
> > In the meantime, I do think a KVM cap that the VMM opts into is better
> > than a module param?
> 
> Hmm, yeah, I think a capability would be cleaner overall.  Then KVM could return
> -EINVAL instead of silently forcing synchronization if the platform conditions
> aren't met, e.g. if the TSC isn't constant or if the host timekeeping isn't
> using TSC.

Right.

> The interaction with kvmclock_periodic_sync might be a bit awkward, but that's
> easy enough to solve with a wrapper.

At least that's all per-KVM already. We do also still need to deal with
the mess of having a single system-wide kvm_guest_has_master_clock and
different KVMs explicitly setting that to 1 or 0, don't we?
  
Sean Christopherson Oct. 13, 2023, 8:03 p.m. UTC | #33
On Fri, Oct 13, 2023, David Woodhouse wrote:
> On Fri, 2023-10-13 at 12:02 -0700, Sean Christopherson wrote:
> > > But while I'm picking on the edge cases and suggesting that we *can*
> > > cope with some of them, I do agree with your suggestion that "let
> > > kvmclock run by itself without being clamped back to
> > > CLOCK_MONOTONIC_RAW" should be an opt *in* feature.
> > 
> > Yeah, I'm of the mind that just because we can cope with some edge cases, doesn't
> > mean we should.  At this point, kvmclock really should be considered deprecated
> > on modern hardware.  I.e. needs to be supported for older VMs, but shouldn't be
> > advertised/used when creating entirely new VMs.
> > 
> > Hence my desire to go with a low effort solution for getting kvmclock to play nice
> > with modern hardware.
> 
> Yeah... although the kvmclock is also the *Xen* clock (and the clock on
> which Xen timers are based). So while I'm perfectly prepared to call
> those Xen guests "older VMs", I do still have to launch quite a lot of
> new ones all the same... :)

Heh, yeah, by "new" I meant "new shapes/classes/types of VMs", not simply "new
instances of existing VM types".

> > > > > [1] Yes, I believe "back" does happen. I have test failures in my queue
> > > > > to look at, where guests see the "Xen" clock going backwards.
> > > > 
> > > > Yeah, I assume "back" can happen based purely on the weirdness of the pvclock math.
> > > > 
> > > > What if we add a module param to disable KVM's TSC synchronization craziness
> > > > entirely?  If we first clean up the periodic sync mess, then it seems like it'd
> > > > be relatively straightforward to kill off all of the synchronization, including
> > > > the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
> > > > 
> > > > Not intended to be a functional patch...
> > > 
> > > Will stare harder at the actual patch when it isn't Friday night.
> > > 
> > > In the meantime, I do think a KVM cap that the VMM opts into is better
> > > than a module param?
> > 
> > Hmm, yeah, I think a capability would be cleaner overall.  Then KVM could return
> > -EINVAL instead of silently forcing synchronization if the platform conditions
> > aren't met, e.g. if the TSC isn't constant or if the host timekeeping isn't
> > using TSC.
> 
> Right.
> 
> > The interaction with kvmclock_periodic_sync might be a bit awkward, but that's
> > easy enough to solve with a wrapper.
> 
> At least that's all per-KVM already. We do also still need to deal with
> the mess of having a single system-wide kvm_guest_has_master_clock and
> different KVMs explicitly setting that to 1 or 0, don't we?

Hmm, I think the simplest way to handle kvm_guest_has_master_clock would be to
have KVM check that the host clocksource is TSC-based when enabling the capability,
but define the behavior to be that once kvmclock is tied to the TSC, it's *always*
tied to the TSC, even if the host switches to a different clock source.  Then VMs
for which kvmclock is tied to TSC can simply not set kvm_guest_has_master_clock
and be skipped by pvclock_gtod_update_fn().

Side topic, I have no idea why that thing is an atomic.  It's just a flag that
tracks if at least one VM is using masterclock, and its only usage is to disable
all masterclocks if the host stops using TSC as the clocksource for whatever reason.
It really should just be a simple bool that's accessed with {READ,WRITE}_ONCE().
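
I.e. something like (sketch):

	-static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
	+static bool kvm_guest_has_master_clock;

	-	atomic_set(&kvm_guest_has_master_clock, 1);
	+	WRITE_ONCE(kvm_guest_has_master_clock, true);

	-	if (atomic_read(&kvm_guest_has_master_clock) != 0)
	+	if (READ_ONCE(kvm_guest_has_master_clock))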
  
Dongli Zhang Oct. 13, 2023, 8:12 p.m. UTC | #34
Hi Sean,

On 10/13/23 11:07, Sean Christopherson wrote:
> On Wed, Oct 11, 2023, David Woodhouse wrote:
>> On Tue, 2023-10-10 at 17:20 -0700, Sean Christopherson wrote:
>>> On Wed, Oct 04, 2023, Dongli Zhang wrote:
>>>>> -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
>>>>> -{
>>>>> -       struct kvm *kvm = v->kvm;
>>>>> -
>>>>> -       kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
>>>>> -       schedule_delayed_work(&kvm->arch.kvmclock_update_work,
>>>>> -                                       KVMCLOCK_UPDATE_DELAY);
>>>>> -}
>>>>> -
>>>>>  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
>>>>
>>>> While David mentioned "maximum delta", how about turning the above into a
>>>> module param with a default of 300*HZ?
>>>>
>>>> BTW, 300*HZ should be enough for the vCPU hotplug case, unless people prefer
>>>> 1-hour or 1-day.
>>>
>>> Hmm, I think I agree with David that it would be better if KVM can take care of
>>> the gory details and promise a certain level of accuracy.  I'm usually a fan of
>>> punting complexity to userspace, but requiring every userspace to figure out the
>>> ideal sync frequency on every platform is more than a bit unfriendly.  And it
>>> might not even be realistic unless userspace makes assumptions about how the kernel
>>> computes CLOCK_MONOTONIC_RAW from TSC cycles.
>>>
>>
>> I think perhaps I would rather save up my persuasiveness on the topic
>> of "let's not make things too awful for userspace to cope with" for the
>> live update/migration mess. I think I need to dust off that attempt at
>> fixing our 'how to migrate with clocks intact' documentation from
>> https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@infradead.org/
>> The changes we're discussing here obviously have an effect on migration
>> too.
>>
>> Where the host TSC is actually reliable, I would really prefer for the
>> kvmclock to just be a fixed function of the guest TSC and *not* to be
>> arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.
> 
> CLOCK_MONOTONIC_RAW!  Just wanted to clarify because if kvmclock were tied to the
> non-raw clock, then we'd have to somehow reconcile host NTP updates.
> 
> I generally support the idea, but I think it needs to be an opt-in from userspace.
> Essentially an "I pinky swear to give all vCPUs the same TSC frequency, to not
> suspend the host, and to not run software/firmware that writes IA32_TSC_ADJUST".
> AFAICT, there are too many edge cases and assumptions about userspace for KVM to
> safely couple kvmclock to guest TSC by default.
> 
>> [1] Yes, I believe "back" does happen. I have test failures in my queue
>> to look at, where guests see the "Xen" clock going backwards.
> 
> Yeah, I assume "back" can happen based purely on the weirdness of the pvclock math.
> 
> What if we add a module param to disable KVM's TSC synchronization craziness
> entirely?  If we first clean up the periodic sync mess, then it seems like it'd
> be relatively straightforward to kill off all of the synchronization, including
> the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
> 
> Not intended to be a functional patch...
> 
> ---
>  arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5b2104bdd99f..75fc6cbaef0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  static bool __read_mostly kvmclock_periodic_sync = true;
>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>  
> +static bool __read_mostly enable_tsc_sync = true;
> +module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
> +
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> @@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	bool matched = false;
>  	bool synchronizing = false;
>  
> +	if (!enable_tsc_sync) {
> +		offset = kvm_compute_l1_tsc_offset(vcpu, data);
> +		kvm_vcpu_write_tsc_offset(vcpu, offset);
> +		return;
> +	}

TBH, I do not like this idea for two reasons.

1. Since a primary part of my work is resolving kernel issues, when debugging
any clock drift issue it is really helpful for me to see that all vCPUs have
the same vcpu->arch.tsc_offset in the coredump or vCPU debugfs.

This patch may lead to vCPUs that are added at different times having
different vcpu->arch.tsc_offset values.


2. Suppose the KVM host has been running for a long time; wouldn't the drift
between two domains accumulate and become very large? (Even if it may not
cause anything bad immediately.)


If the objective is to avoid masterclock updates, how about the change below,
which I copied from my prior diagnostic kernel to help debug this issue.

The idea is to never update the masterclock if the TSC is stable (and the
masterclock is already in use).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b443b9bf562..630f18524000 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2035,6 +2035,9 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
 	bool host_tsc_clocksource, vcpus_matched;
+	bool was_master_clock = ka->use_master_clock;
+	u64 master_kernel_ns;
+	u64 master_cycle_now;

 	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
 			atomic_read(&kvm->online_vcpus));
@@ -2044,13 +2047,18 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	 * to the guest.
 	 */
 	host_tsc_clocksource = kvm_get_time_and_clockread(
-					&ka->master_kernel_ns,
-					&ka->master_cycle_now);
+					&master_kernel_ns,
+					&master_cycle_now);

 	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
 				&& !ka->backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;

+	if (!was_master_clock && ka->use_master_clock) {
+		ka->master_kernel_ns = master_kernel_ns;
+		ka->master_cycle_now = master_cycle_now;
+	}
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);


That is, to always re-use the same values in ka->master_kernel_ns and
ka->master_cycle_now from VM creation onwards.

Thank you very much!

Dongli Zhang

> +
>  	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>  	offset = kvm_compute_l1_tsc_offset(vcpu, data);
>  	ns = get_kvmclock_base_ns();
> @@ -2967,9 +2976,12 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  					&ka->master_kernel_ns,
>  					&ka->master_cycle_now);
>  
> -	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
> -				&& !ka->backwards_tsc_observed
> -				&& !ka->boot_vcpu_runs_old_kvmclock;
> +	WARN_ON_ONCE(!host_tsc_clocksource && !enable_tsc_sync);
> +
> +	ka->use_master_clock = host_tsc_clocksource &&
> +			       (vcpus_matched || !enable_tsc_sync) &&
> +			       !ka->backwards_tsc_observed &&
> +			       !ka->boot_vcpu_runs_old_kvmclock;
>  
>  	if (ka->use_master_clock)
>  		atomic_set(&kvm_guest_has_master_clock, 1);
> @@ -3278,6 +3290,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  
>  void kvm_adjust_pv_clock_users(struct kvm *kvm, bool add_user)
>  {
> +	if (!enable_tsc_sync)
> +		return;
> +
>  	/*
>  	 * Doesn't need to be a spinlock, but can't be kvm->lock as this is
>  	 * call while holding a vCPU's mutext.
> @@ -5528,6 +5543,11 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
>  		if (get_user(offset, uaddr))
>  			break;
>  
> +		if (!enable_tsc_sync) {
> +			kvm_vcpu_write_tsc_offset(vcpu, offset);
> +			break;
> +		}
> +
>  		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>  
>  		matched = (vcpu->arch.virtual_tsc_khz &&
> @@ -12188,6 +12208,9 @@ int kvm_arch_hardware_enable(void)
>  	if (ret != 0)
>  		return ret;
>  
> +	if (!enable_tsc_sync)
> +		return 0;
> +
>  	local_tsc = rdtsc();
>  	stable = !kvm_check_tsc_unstable();
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
> @@ -13670,6 +13693,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
>  
>  static int __init kvm_x86_init(void)
>  {
> +	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +		enable_tsc_sync = true;
> +
> +	if (!enable_tsc_sync)
> +		kvmclock_periodic_sync = false;
> +
>  	kvm_mmu_x86_module_init();
>  	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
>  	return 0;
> 
> base-commit: 7d2edad0beb2a6f07f6e6c2d477d5874f5417d6c
  
Sean Christopherson Oct. 13, 2023, 11:26 p.m. UTC | #35
On Fri, Oct 13, 2023, Dongli Zhang wrote:
> On 10/13/23 11:07, Sean Christopherson wrote:
> > What if we add a module param to disable KVM's TSC synchronization craziness
> > entirely?  If we first clean up the periodic sync mess, then it seems like it'd
> > be relatively straightforward to kill off all of the synchronization, including
> > the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
> > 
> > Not intended to be a functional patch...
> > 
> > ---
> >  arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
> >  1 file changed, 32 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5b2104bdd99f..75fc6cbaef0d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> >  static bool __read_mostly kvmclock_periodic_sync = true;
> >  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
> >  
> > +static bool __read_mostly enable_tsc_sync = true;
> > +module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
> > +
> >  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
> >  static u32 __read_mostly tsc_tolerance_ppm = 250;
> >  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> > @@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> >  	bool matched = false;
> >  	bool synchronizing = false;
> >  
> > +	if (!enable_tsc_sync) {
> > +		offset = kvm_compute_l1_tsc_offset(vcpu, data);
> > +		kvm_vcpu_write_tsc_offset(vcpu, offset);
> > +		return;
> > +	}
> 
> TBH, I do not like this idea for two reasons.
> 
> 1. Since a primary part of my work is resolving kernel issues, when debugging
> any clock drift issue it is really helpful for me to see that all vCPUs have
> the same vcpu->arch.tsc_offset in the coredump or vCPU debugfs.
> 
> This patch may lead to vCPUs that are added at different times having
> different vcpu->arch.tsc_offset values.

The expectation is that *if* userspace disables TSC synchronization, then userspace
would be responsible for setting the guest TSC offset directly.  And disabling
TSC synchronization would be fully opt-in, i.e. we'd fix the existing masterclock
sync issues first.

> 2. Suppose the KVM host has been running for a long time; wouldn't the drift
> between two domains accumulate and become very large? (Even if it may not
> cause anything bad immediately.)

That already happens today, e.g. unless the host does vCPU hotplug or is using
XEN's shared info page, masterclock updates effectively never happen.  And I'm
not aware of a single bug report of someone complaining that kvmclock has drifted
from the host clock.  The only bug reports we have are when KVM triggers an update
and causes time to jump from the guest's perspective.

If the guest needs accurate timekeeping, it's all but guaranteed to be using NTP
or an equivalent.  I.e. the imprecision that's inherent in the pvclock ABI shouldn't
be problematic for any guest that actually cares.

> If the objective is to avoid masterclock updates, how about the change below,
> which I copied from my prior diagnostic kernel to help debug this issue.
> 
> The idea is to never update the masterclock if the TSC is stable (and the
> masterclock is already in use).

That's another option, but if there are no masterclock updates, then it suffers
the exact same (theoretical) problem as #2.  And there are real downsides, e.g.
defining when KVM would synchronize kvmclock with the host clock would be
significantly harder, as we'd have to capture all the possible ways that KVM could
(temporarily) disable the masterclock and start synchronizing.
  
David Woodhouse Oct. 14, 2023, 9:49 a.m. UTC | #36
On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@google.com> wrote:
>> 2. Suppose the KVM host has been running for a long time; wouldn't the drift
>> between two domains accumulate and become very large? (Even if it may not
>> cause anything bad immediately.)
>
>That already happens today, e.g. unless the host does vCPU hotplug or is using
>XEN's shared info page, masterclock updates effectively never happen.  And I'm
>not aware of a single bug report of someone complaining that kvmclock has drifted
>from the host clock.  The only bug reports we have are when KVM triggers an update
>and causes time to jump from the guest's perspective.

I've got reports about the Xen clock going backwards, and also about it drifting over time w.r.t. the guest's TSC clocksource so the watchdog in the guest declares its TSC clocksource unstable. 

I don't understand *why* we update the masterclock when we populate the Xen shared info. Or add a vCPU, for that matter. 

>> The idea is to never update the masterclock if the TSC is stable (and the
>> masterclock is already in use).
>
>That's another option, but if there are no masterclock updates, then it suffers
>the exact same (theoretical) problem as #2.  And there are real downsides, e.g.
>defining when KVM would synchronize kvmclock with the host clock would be
>significantly harder...

I thought the definition of such an approach would be that we *never* resync the kvmclock to anything. It's based purely on the TSC value when the guest started, and the TSC frequency. The pvclock we advertise to all vCPUs would be the same, and would *never* change except on migration.

(I guess that for consistency we would scale first to the *guest* TSC and from that to nanoseconds.)

If userspace does anything which makes that become invalid, userspace gets to keep both pieces. That includes userspace having to deal with host suspend like migration, etc.
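
In other words, something like this (a sketch; names are hypothetical and
128-bit/overflow handling is omitted):

	/* kvmclock as a pure function of the guest TSC, frozen at VM creation. */
	static u64 kvmclock_ns(u64 guest_tsc, u64 boot_guest_tsc, u32 guest_tsc_khz)
	{
		u64 delta = guest_tsc - boot_guest_tsc;

		/* ns = cycles * 10^6 / khz */
		return div_u64(delta * NSEC_PER_MSEC, guest_tsc_khz);
	}

with boot_guest_tsc and guest_tsc_khz captured once and only ever changed on
migration.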
  
Dongli Zhang Oct. 16, 2023, 3:47 p.m. UTC | #37
Hi David and Sean,

On 10/14/23 02:49, David Woodhouse wrote:
> 
> 
> On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@google.com> wrote:
>>> 2. Suppose the KVM host has been running for a long time; wouldn't the drift
>>> between two domains accumulate and become very large? (Even if it may not
>>> cause anything bad immediately.)
>>
>> That already happens today, e.g. unless the host does vCPU hotplug or is using
>> XEN's shared info page, masterclock updates effectively never happen.  And I'm
>> not aware of a single bug report of someone complaining that kvmclock has drifted
>> from the host clock.  The only bug reports we have are when KVM triggers an update
>> and causes time to jump from the guest's perspective.
> 
> I've got reports about the Xen clock going backwards, and also about it drifting over time w.r.t. the guest's TSC clocksource so the watchdog in the guest declares its TSC clocksource unstable. 

I assume you meant Xen on KVM (not Xen guest on Xen hypervisor). According to my
brief review of the Xen hypervisor code, it looks like it uses the same algorithm
to calculate the clock on the hypervisor side as in the Xen guest.

Fortunately, "tsc=reliable" may disable the watchdog, but I have no idea if
it impacts Xen on KVM.

> 
> I don't understand *why* we update the masterclock when we populate the Xen shared info. Or add a vCPU, for that matter. 
> 
>>> The idea is to never update the masterclock if the TSC is stable (and the
>>> masterclock is already in use).
>>
>> That's another option, but if there are no masterclock updates, then it suffers
>> the exact same (theoretical) problem as #2.  And there are real downsides, e.g.
>> defining when KVM would synchronize kvmclock with the host clock would be
>> significantly harder...
> 
> I thought the definition of such an approach would be that we *never* resync the kvmclock to anything. It's based purely on the TSC value when the guest started, and the TSC frequency. The pvclock we advertise to all vCPUs would be the same, and would *never* change except on migration.
> 
> (I guess that for consistency we would scale first to the *guest* TSC and from that to nanoseconds.)
> 
> If userspace does anything which makes that become invalid, userspace gets to keep both pieces. That includes userspace having to deal with host suspend like migration, etc.

Suppose we are discussing a non-permanent solution, I would suggest:

1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
on KVM) is not good enough in some cases, e.g., vCPU hotplug.

2. Do not rely on any userspace change, so that the solution can be easier to
apply to existing environments running old KVM versions.

That is, to limit the change within KVM.

3. The options would be to (1) stop updating masterclock in the ideal scenario
(e.g., stable tsc), or to (2) refresh periodically to minimize the drift.

Or maybe there is a better option ...


Thank you very much!

Dongli Zhang
  
David Woodhouse Oct. 16, 2023, 4:25 p.m. UTC | #38
On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
> Hi David and Sean,
> 
> On 10/14/23 02:49, David Woodhouse wrote:
> > 
> > 
> > On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@google.com> wrote:
> > > > 2. Suppose the KVM host has been running for a long time; wouldn't the drift
> > > > between two domains accumulate and become very large? (Even if it may not
> > > > cause anything bad immediately.)
> > > 
> > > That already happens today, e.g. unless the host does vCPU hotplug or is using
> > > XEN's shared info page, masterclock updates effectively never happen.  And I'm
> > > not aware of a single bug report of someone complaining that kvmclock has drifted
> > > from the host clock.  The only bug reports we have are when KVM triggers an update
> > > and causes time to jump from the guest's perspective.
> > 
> > I've got reports about the Xen clock going backwards, and also
> > about it drifting over time w.r.t. the guest's TSC clocksource so
> > the watchdog in the guest declares its TSC clocksource unstable. 
> 
> I assume you meant Xen on KVM (not Xen guest on Xen hypervisor). According to my
> brief review of the Xen hypervisor code, it looks like it uses the same algorithm
> to calculate the clock on the hypervisor side as in the Xen guest.

Right. It's *exactly* the same thing. Even the same pvclock ABI in the
way it's exposed to the guest (in the KVM case via the MSR, in the Xen
case it's in the vcpu_info or a separate vcpu_time_info set up by Xen
hypercalls).

> Fortunately, "tsc=reliable" may disable the watchdog, but I have no idea if
> it impacts Xen on KVM.

Right. I think Linux as a KVM guest automatically disables the
watchdog, or at least refuses to use the KVM clock as the watchdog for
the TSC clocksource?

Xen guests, on the other hand, aren't used to the Xen clock being as
unreliable as the KVM clock is, so they *do* use it as a watchdog for
the TSC clocksource.

> > I don't understand *why* we update the masterclock when we populate
> > the Xen shared info. Or add a vCPU, for that matter.

Still don't...

> > > > The idea is to never update the masterclock if the TSC is stable (and the
> > > > masterclock is already in use).
> > > 
> > > That's another option, but if there are no masterclock updates, then it suffers
> > > the exact same (theoretical) problem as #2.  And there are real downsides, e.g.
> > > defining when KVM would synchronize kvmclock with the host clock would be
> > > significantly harder...
> > 
> > I thought the definition of such an approach would be that we
> > *never* resync the kvmclock to anything. It's based purely on the
> > TSC value when the guest started, and the TSC frequency. The
> > pvclock we advertise to all vCPUs would be the same, and would
> > *never* change except on migration.
> > 
> > (I guess that for consistency we would scale first to the *guest*
> > TSC and from that to nanoseconds.)
> > 
> > If userspace does anything which makes that become invalid,
> > userspace gets to keep both pieces. That includes userspace having
> > to deal with host suspend like migration, etc.
> 
> Suppose we are discussing a non-permanent solution, I would suggest:
> 
> 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
> on KVM) is not good enough in some cases, e.g., vCPU hotplug.

I still don't understand the vCPU hotplug case.

In the case where the TSC is actually sane, why would we need to reset
the masterclock on vCPU hotplug? 

The new vCPU gets its TSC synchronised to the others, and its kvmclock
parameters (mul/shift/offset based on the guest TSC) can be *precisely*
the same as the other vCPUs too, can't they? Why reset anything?

> 2. Do not rely on any userspace change, so that the solution can be easier to
> apply to existing environments running old KVM versions.
> 
> That is, to limit the change within KVM.
> 
> 3. The options would be to (1) stop updating masterclock in the ideal scenario
> (e.g., stable tsc), or to (2) refresh periodically to minimize the drift.

If the host TSC is sane, just *never* update the KVM masterclock. It
"drifts" w.r.t. the host CLOCK_MONOTONIC_RAW and nobody will ever care.

The only opt-in we need from userspace for that is to promise that the
host TSC will never get mangled, isn't it?

(We probably want to be able to export the pvclock information to
userspace (in terms of the mul/shift/offset from host TSC to guest TSC
and then the mul/shift/offset to kvmclock). Userspace may want to make
things like the PIT/HPET/PMtimer run on that clock.)
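
A purely hypothetical sketch of what such an export could carry (the
struct name and layout below are invented for this sketch; no such KVM
ABI exists today): essentially two stages of mul/shift/offset.

	/* Hypothetical sketch only -- not an existing KVM ABI. */
	struct kvm_pvclock_export {
		/* stage 1: host TSC -> guest TSC */
		u64 tsc_offset;
		u32 tsc_mult;
		s8  tsc_shift;
		/* stage 2: guest TSC -> kvmclock nanoseconds */
		u32 tsc_to_system_mul;
		s8  kvmclock_shift;
		u64 kvmclock_offset;
	};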
  
Dongli Zhang Oct. 16, 2023, 5:04 p.m. UTC | #39
Hi David,

On 10/16/23 09:25, David Woodhouse wrote:
> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
>> Hi David and Sean,
>>
>> On 10/14/23 02:49, David Woodhouse wrote:
>>>
>>>
>>> On 14 October 2023 00:26:45 BST, Sean Christopherson <seanjc@google.com> wrote:
>>>>> 2. Suppose the KVM host has been running for a long time, and the drift between
>>>>> the two domains would accumulate to something super large? (Even if it may not
>>>>> introduce anything bad immediately)
>>>>
>>>> That already happens today, e.g. unless the host does vCPU hotplug or is using
>>>> XEN's shared info page, masterclock updates effectively never happen.  And I'm
>>>> not aware of a single bug report of someone complaining that kvmclock has drifted
>>>> from the host clock.  The only bug reports we have are when KVM triggers an update
>>>> and causes time to jump from the guest's perspective.
>>>
>>> I've got reports about the Xen clock going backwards, and also
>>> about it drifting over time w.r.t. the guest's TSC clocksource so
>>> the watchdog in the guest declares its TSC clocksource unstable. 
>>
>> I assume you meant Xen on KVM (not a Xen guest on the Xen hypervisor). According to my
>> brief review of the Xen hypervisor code, it looks like it uses the same algorithm to
>> calculate the clock on the hypervisor side as in the Xen guest.
> 
> Right. It's *exactly* the same thing. Even the same pvclock ABI in the
> way it's exposed to the guest (in the KVM case via the MSR, in the Xen
> case it's in the vcpu_info or a separate vcpu_time_info set up by Xen
> hypercalls).
> 
>> Fortunately, "tsc=reliable" may disable the watchdog, but I have no idea if
>> it impacts Xen on KVM.
> 
> Right. I think Linux as a KVM guest automatically disables the
> watchdog, or at least refuses to use the KVM clock as the watchdog for
> the TSC clocksource?

You may refer to the commit below, which disables the watchdog for the TSC
when it is reliable.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b50db7095fe002fa3e16605546cba66bf1b68a3e
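
From a quick read, the core of that commit is roughly the following
(paraphrased from arch/x86/kernel/tsc.c; the exact condition may differ
across kernel versions):

	/* Clear CLOCK_SOURCE_MUST_VERIFY so the watchdog skips the TSC. */
	static void __init tsc_disable_clocksource_watchdog(void)
	{
		clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
		clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
	}

	...

	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
	    nr_online_nodes <= 2)
		tsc_disable_clocksource_watchdog();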

> 
> Xen guests, on the other hand, aren't used to the Xen clock being as
> unreliable as the KVM clock is, so they *do* use it as a watchdog for
> the TSC clocksource.
> 
>>> I don't understand *why* we update the masterclock when we populate
>>> the Xen shared info. Or add a vCPU, for that matter.
> 
> Still don't...

I do not have much knowledge of Xen-on-KVM. I assume it and kvmclock are
similar things.

The question is: why update the master clock when adding a new vCPU (e.g., via QEMU)?

The update is already in the source code, and TBH, I do not know why the
source code is written that way.


Just to explain the source code, taking QEMU + KVM as an example:

1. QEMU adds new vCPU to the running guest.

2. QEMU userspace triggers KVM kvm_synchronize_tsc() via ioctl.

kvm_synchronize_tsc()-->__kvm_synchronize_tsc()-->kvm_track_tsc_matching()

The above tries to sync TSC, and finally sets KVM_REQ_MASTERCLOCK_UPDATE pending
for the new vCPU.


3. The guest side onlines the new vCPU via either a udev rule (automatically) or
sysfs (a manual echo).

4. When the vCPU is onlined, it starts running on the KVM side.

The KVM side processes KVM_REQ_MASTERCLOCK_UPDATE before entering guest mode.

5. The handler of KVM_REQ_MASTERCLOCK_UPDATE updates the master clock.
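
For reference, a simplified sketch of how the pending request is
consumed before VM-entry (paraphrased from vcpu_enter_guest() in
arch/x86/kvm/x86.c, with the many unrelated requests elided):

	static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
	{
		...
		if (kvm_request_pending(vcpu)) {
			...
			/* Step 5 above: the master clock is refreshed here. */
			if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
				kvm_update_masterclock(vcpu->kvm);
			...
		}
		...
	}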

> 
>>>>> The idea is to never update master clock, if tsc is stable (and masterclock is
>>>>> already used).
>>>>
>>>> That's another option, but if there are no masterclock updates, then it suffers
>>>> the exact same (theoretical) problem as #2.  And there are real downsides, e.g.
>>>> defining when KVM would synchronize kvmclock with the host clock would be
>>>> significantly harder...
>>>
>>> I thought the definition of such an approach would be that we
>>> *never* resync the kvmclock to anything. It's based purely on the
>>> TSC value when the guest started, and the TSC frequency. The
>>> pvclock we advertise to all vCPUs would be the same, and would
>>> *never* change except on migration.
>>>
>>> (I guess that for consistency we would scale first to the *guest*
>>> TSC and from that to nanoseconds.)
>>>
>>> If userspace does anything which makes that become invalid,
>>> userspace gets to keep both pieces. That includes userspace having
>>> to deal with host suspend like migration, etc.
>>
>> Suppose we are discussing a non-permanent solution, I would suggest:
>>
>> 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
>> on KVM) is not good enough in some cases, e.g., vCPU hotplug.
> 
> I still don't understand the vCPU hotplug case.
> 
> In the case where the TSC is actually sane, why would we need to reset
> the masterclock on vCPU hotplug? 
> 
> The new vCPU gets its TSC synchronised to the others, and its kvmclock
> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
> the same as the other vCPUs too, can't they? Why reset anything?

While I understand how the source code works, I do not know why it behaves
that way.

I shared the patch below from my prior diagnostic kernel; it avoids updating
the master clock if the master clock is already in use and the TSC is stable.

https://lore.kernel.org/kvm/cf2b22fc-78f5-dfb9-f0e6-5c4059a970a2@oracle.com/
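
The core of that diagnostic patch is roughly the following fragment (a
minimal sketch of the idea, not the patch itself; the exact check and
its placement differ):

	/*
	 * Sketch: skip the masterclock refresh when the masterclock is
	 * already in use and the host clocksource is still TSC-based,
	 * so the pvclock parameters seen by the guest never change.
	 */
	if (ka->use_master_clock &&
	    gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
		return;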

> 
>> 2. Do not rely on any userspace change, so that the solution can be easier to
>> apply to existing environments running old KVM versions.
>>
>> That is, to limit the change within KVM.
>>
>> 3. The options would be to (1) stop updating masterclock in the ideal scenario
>> (e.g., stable tsc), or to (2) refresh periodically to minimize the drift.
> 
> If the host TSC is sane, just *never* update the KVM masterclock. It
> "drifts" w.r.t. the host CLOCK_MONOTONIC_RAW and nobody will ever care.

I think it is one of the two options below, although I prefer 2 over 1.

1. Do not update master clock.

2. Refresh master clock periodically.

> 
> The only opt-in we need from userspace for that is to promise that the
> host TSC will never get mangled, isn't it?

Regarding QEMU, I assume you meant either:

(1) -cpu host,+invtsc (on the QEMU command line), or
(2) tsc=reliable (on the guest kernel command line)

> 
> (We probably want to be able to export the pvclock information to
> userspace (in terms of the mul/shift/offset from host TSC to guest TSC
> and then the mul/shift/offset to kvmclock). Userspace may want to make
> things like the PIT/HPET/PMtimer run on that clock.)
> 


Thank you very much!

Dongli Zhang
  
Sean Christopherson Oct. 16, 2023, 6:49 p.m. UTC | #40
On Mon, Oct 16, 2023, David Woodhouse wrote:
> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
> > Suppose we are discussing a non-permanent solution, I would suggest:
> > 
> > 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
> > on KVM) is not good enough in some cases, e.g., vCPU hotplug.
> 
> I still don't understand the vCPU hotplug case.
> 
> In the case where the TSC is actually sane, why would we need to reset
> the masterclock on vCPU hotplug? 
> 
> The new vCPU gets its TSC synchronised to the others, and its kvmclock
> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
> the same as the other vCPUs too, can't they? Why reset anything?

Aha!  I think I finally figured out why KVM behaves the way it does.

The unnecessary masterclock updates effectively came from:

  commit 7f187922ddf6b67f2999a76dcb71663097b75497
  Author: Marcelo Tosatti <mtosatti@redhat.com>
  Date:   Tue Nov 4 21:30:44 2014 -0200

    KVM: x86: update masterclock values on TSC writes
    
    When the guest writes to the TSC, the masterclock TSC copy must be
    updated as well along with the TSC_OFFSET update, otherwise a negative
    tsc_timestamp is calculated at kvm_guest_time_update.
    
    Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
    "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
    becomes redundant, so remove the do_request boolean and collapse
    everything into a single condition.

Before that, KVM only re-synced the masterclock if it was enabled or disabled,
i.e. KVM behaved as we want it to behave.  Note, at the time of the above commit,
VMX synchronized TSC on *guest* writes to MSR_IA32_TSC:

	case MSR_IA32_TSC:
        	kvm_write_tsc(vcpu, msr_info);
	        break;

That got changed by commit 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization
on guest writes"), but I don't think the guest angle is actually relevant to the
fix.  AFAICT, a write from the host would suffer the same problem.  But knowing
that KVM synced on guest writes is crucial to understanding the changelog.

In kvm_write_tsc(), except for KVM's wonderful "less than 1 second" hack, KVM
snapshotted the vCPU's current TSC *and* the current time in nanoseconds, where
kvm->arch.cur_tsc_nsec is the current host kernel time in nanoseconds.

	ns = get_kernel_ns();

	...

        if (usdiff < USEC_PER_SEC &&
            vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
		...
        } else {
                /*
                 * We split periods of matched TSC writes into generations.
                 * For each generation, we track the original measured
                 * nanosecond time, offset, and write, so if TSCs are in
                 * sync, we can match exact offset, and if not, we can match
                 * exact software computation in compute_guest_tsc()
                 *
                 * These values are tracked in kvm->arch.cur_xxx variables.
                 */
                kvm->arch.cur_tsc_generation++;
                kvm->arch.cur_tsc_nsec = ns;
                kvm->arch.cur_tsc_write = data;
                kvm->arch.cur_tsc_offset = offset;
                matched = false;
                pr_debug("kvm: new tsc generation %llu, clock %llu\n",
                         kvm->arch.cur_tsc_generation, data);
        }

	...

        /* Keep track of which generation this VCPU has synchronized to */
        vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
        vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
        vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

Note that the above sets matched to false!  But because kvm_track_tsc_matching()
looks for matched+1, i.e. doesn't require the first vCPU to match itself, KVM
would immediately compute vcpus_matched as true for VMs with a single vCPU.  As
a result, KVM would skip the masterclock update, even though a new TSC generation
was created.

        vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
                         atomic_read(&vcpu->kvm->online_vcpus));

        if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
                if (!ka->use_master_clock)
                        do_request = 1;

        if (!vcpus_matched && ka->use_master_clock)
                        do_request = 1;

        if (do_request)
                kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

On hardware without TSC scaling support, vcpu->tsc_catchup is set to true if the
guest TSC frequency is faster than the host TSC frequency, even if the TSC is
otherwise stable.  And for that mode, kvm_guest_time_update(), by way of
compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the kernel time at the
last TSC write, to compute the guest TSC relative to kernel time:

  static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
  {
	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
				      vcpu->arch.virtual_tsc_mult,
				      vcpu->arch.virtual_tsc_shift);
	tsc += vcpu->arch.this_tsc_write;
	return tsc;
  }

Except the @kernel_ns passed to compute_guest_tsc() isn't the current kernel time,
it's the masterclock snapshot!

        spin_lock(&ka->pvclock_gtod_sync_lock);
        use_master_clock = ka->use_master_clock;
        if (use_master_clock) {
                host_tsc = ka->master_cycle_now;
                kernel_ns = ka->master_kernel_ns;
        }
        spin_unlock(&ka->pvclock_gtod_sync_lock);

	if (vcpu->tsc_catchup) {
		u64 tsc = compute_guest_tsc(v, kernel_ns);
		if (tsc > tsc_timestamp) {
			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
			tsc_timestamp = tsc;
		}
	}

And so the "kernel_ns-vcpu->arch.this_tsc_nsec" is *guaranteed* to generate a
negative value, because this_tsc_nsec was captured after ka->master_kernel_ns.

Forcing a masterclock update essentially fudged around that problem, but in a
heavy handed way that introduced undesirable side effects, i.e. unnecessarily
forces a masterclock update when a new vCPU joins the party via hotplug.

Compile tested only, but the below should fix the vCPU hotplug case.  Then
someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces
a masterclock update.

I still think we should clean up the periodic sync code, but I don't think we
need to periodically sync the masterclock.

---
 arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c54e1133e0d3..f0a607b6fc31 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
 }
 #endif
 
-static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 {
 #ifdef CONFIG_X86_64
-	bool vcpus_matched;
 	struct kvm_arch *ka = &vcpu->kvm->arch;
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 
-	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
-			 atomic_read(&vcpu->kvm->online_vcpus));
+	/*
+	 * To use the masterclock, the host clocksource must be based on TSC
+	 * and all vCPUs must have matching TSCs.  Note, the count for matching
+	 * vCPUs doesn't include the reference vCPU, hence "+1".
+	 */
+	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
+				 atomic_read(&vcpu->kvm->online_vcpus)) &&
+				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
 	/*
-	 * Once the masterclock is enabled, always perform request in
-	 * order to update it.
-	 *
-	 * In order to enable masterclock, the host clocksource must be TSC
-	 * and the vcpus need to have matched TSCs.  When that happens,
-	 * perform request to enable masterclock.
+	 * Request a masterclock update if the masterclock needs to be toggled
+	 * on/off, or when starting a new generation and the masterclock is
+	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
+	 * taken _after_ the new generation is created).
 	 */
-	if (ka->use_master_clock ||
-	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
+	if ((ka->use_master_clock && new_generation) ||
+	    (ka->use_master_clock != use_master_clock))
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
 
-	kvm_track_tsc_matching(vcpu);
+	kvm_track_tsc_matching(vcpu, !matched);
 }
 
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)

base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12
--
  
Dongli Zhang Oct. 16, 2023, 10:04 p.m. UTC | #41
Hi Sean,

On 10/16/23 11:49, Sean Christopherson wrote:
> On Mon, Oct 16, 2023, David Woodhouse wrote:
>> On Mon, 2023-10-16 at 08:47 -0700, Dongli Zhang wrote:
>>> Suppose we are discussing a non-permanent solution, I would suggest:
>>>
>>> 1. Document something to accept that kvm-clock (or pvclock on KVM, including Xen
>>> on KVM) is not good enough in some cases, e.g., vCPU hotplug.
>>
>> I still don't understand the vCPU hotplug case.
>>
>> In the case where the TSC is actually sane, why would we need to reset
>> the masterclock on vCPU hotplug? 
>>
>> The new vCPU gets its TSC synchronised to the others, and its kvmclock
>> parameters (mul/shift/offset based on the guest TSC) can be *precisely*
>> the same as the other vCPUs too, can't they? Why reset anything?
> 
> Aha!  I think I finally figured out why KVM behaves the way it does.
> 
> The unnecessary masterclock updates effectively came from:
> 
>   commit 7f187922ddf6b67f2999a76dcb71663097b75497
>   Author: Marcelo Tosatti <mtosatti@redhat.com>
>   Date:   Tue Nov 4 21:30:44 2014 -0200
> 
>     KVM: x86: update masterclock values on TSC writes
>     
>     When the guest writes to the TSC, the masterclock TSC copy must be
>     updated as well along with the TSC_OFFSET update, otherwise a negative
>     tsc_timestamp is calculated at kvm_guest_time_update.
>     
>     Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
>     "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
>     becomes redundant, so remove the do_request boolean and collapse
>     everything into a single condition.
> 
> Before that, KVM only re-synced the masterclock if it was enabled or disabled,
> i.e. KVM behaved as we want it to behave.  Note, at the time of the above commit,
> VMX synchronized TSC on *guest* writes to MSR_IA32_TSC:
> 
> 	case MSR_IA32_TSC:
>         	kvm_write_tsc(vcpu, msr_info);
> 	        break;
> 
> That got changed by commit 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization
> on guest writes"), but I don't think the guest angle is actually relevant to the
> fix.  AFAICT, a write from the host would suffer the same problem.  But knowing
> that KVM synced on guest writes is crucial to understanding the changelog.
> 
> In kvm_write_tsc(), except for KVM's wonderful "less than 1 second" hack, KVM
> snapshotted the vCPU's current TSC *and* the current time in nanoseconds, where
> kvm->arch.cur_tsc_nsec is the current host kernel time in nanoseconds.
> 
> 	ns = get_kernel_ns();
> 
> 	...
> 
>         if (usdiff < USEC_PER_SEC &&
>             vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
> 		...
>         } else {
>                 /*
>                  * We split periods of matched TSC writes into generations.
>                  * For each generation, we track the original measured
>                  * nanosecond time, offset, and write, so if TSCs are in
>                  * sync, we can match exact offset, and if not, we can match
>                  * exact software computation in compute_guest_tsc()
>                  *
>                  * These values are tracked in kvm->arch.cur_xxx variables.
>                  */
>                 kvm->arch.cur_tsc_generation++;
>                 kvm->arch.cur_tsc_nsec = ns;
>                 kvm->arch.cur_tsc_write = data;
>                 kvm->arch.cur_tsc_offset = offset;
>                 matched = false;
>                 pr_debug("kvm: new tsc generation %llu, clock %llu\n",
>                          kvm->arch.cur_tsc_generation, data);
>         }
> 
> 	...
> 
>         /* Keep track of which generation this VCPU has synchronized to */
>         vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
>         vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
>         vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
> 
> Note that the above sets matched to false!  But because kvm_track_tsc_matching()
> looks for matched+1, i.e. doesn't require the first vCPU to match itself, KVM
> would immediately compute vcpus_matched as true for VMs with a single vCPU.  As
> a result, KVM would skip the masterclock update, even though a new TSC generation
> was created.
> 
>         vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
>                          atomic_read(&vcpu->kvm->online_vcpus));
> 
>         if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
>                 if (!ka->use_master_clock)
>                         do_request = 1;
> 
>         if (!vcpus_matched && ka->use_master_clock)
>                         do_request = 1;
> 
>         if (do_request)
>                 kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> 
> On hardware without TSC scaling support, vcpu->tsc_catchup is set to true if the
> guest TSC frequency is faster than the host TSC frequency, even if the TSC is
> otherwise stable.  And for that mode, kvm_guest_time_update(), by way of
> compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the kernel time at the
> last TSC write, to compute the guest TSC relative to kernel time:
> 
>   static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
>   {
> 	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
> 				      vcpu->arch.virtual_tsc_mult,
> 				      vcpu->arch.virtual_tsc_shift);
> 	tsc += vcpu->arch.this_tsc_write;
> 	return tsc;
>   }
> 
> Except the @kernel_ns passed to compute_guest_tsc() isn't the current kernel time,
> it's the masterclock snapshot!
> 
>         spin_lock(&ka->pvclock_gtod_sync_lock);
>         use_master_clock = ka->use_master_clock;
>         if (use_master_clock) {
>                 host_tsc = ka->master_cycle_now;
>                 kernel_ns = ka->master_kernel_ns;
>         }
>         spin_unlock(&ka->pvclock_gtod_sync_lock);
> 
> 	if (vcpu->tsc_catchup) {
> 		u64 tsc = compute_guest_tsc(v, kernel_ns);
> 		if (tsc > tsc_timestamp) {
> 			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
> 			tsc_timestamp = tsc;
> 		}
> 	}
> 
> And so the "kernel_ns-vcpu->arch.this_tsc_nsec" is *guaranteed* to generate a
> negative value, because this_tsc_nsec was captured after ka->master_kernel_ns.
> 
> Forcing a masterclock update essentially fudged around that problem, but in a
> heavy handed way that introduced undesirable side effects, i.e. unnecessarily
> forces a masterclock update when a new vCPU joins the party via hotplug.
> 
> Compile tested only, but the below should fix the vCPU hotplug case.  Then
> someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces
> a masterclock update.
> 
> I still think we should clean up the periodic sync code, but I don't think we
> need to periodically sync the masterclock.

This looks good to me. The core idea is to not update the master clock in the
synchronized cases.


How about the negative value case? I see it is still there in the Linux code?

(It is out of scope for my use case, as I do not need to run vCPUs at a
different TSC frequency than the host.)

Thank you very much!

Dongli Zhang

> 
> ---
>  arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
>  1 file changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c54e1133e0d3..f0a607b6fc31 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
>  }
>  #endif
>  
> -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
> +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
>  {
>  #ifdef CONFIG_X86_64
> -	bool vcpus_matched;
>  	struct kvm_arch *ka = &vcpu->kvm->arch;
>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>  
> -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> -			 atomic_read(&vcpu->kvm->online_vcpus));
> +	/*
> +	 * To use the masterclock, the host clocksource must be based on TSC
> +	 * and all vCPUs must have matching TSCs.  Note, the count for matching
> +	 * vCPUs doesn't include the reference vCPU, hence "+1".
> +	 */
> +	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
> +				 atomic_read(&vcpu->kvm->online_vcpus)) &&
> +				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
>  
>  	/*
> -	 * Once the masterclock is enabled, always perform request in
> -	 * order to update it.
> -	 *
> -	 * In order to enable masterclock, the host clocksource must be TSC
> -	 * and the vcpus need to have matched TSCs.  When that happens,
> -	 * perform request to enable masterclock.
> +	 * Request a masterclock update if the masterclock needs to be toggled
> +	 * on/off, or when starting a new generation and the masterclock is
> +	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
> +	 * taken _after_ the new generation is created).
>  	 */
> -	if (ka->use_master_clock ||
> -	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
> +	if ((ka->use_master_clock && new_generation) ||
> +	    (ka->use_master_clock != use_master_clock))
>  		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>  
>  	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
> @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
>  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>  
> -	kvm_track_tsc_matching(vcpu);
> +	kvm_track_tsc_matching(vcpu, !matched);
>  }
>  
>  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
> 
> base-commit: dfdc8b7884b50e3bfa635292973b530a97689f12
  
Sean Christopherson Oct. 16, 2023, 10:48 p.m. UTC | #42
On Mon, Oct 16, 2023, Dongli Zhang wrote:
> Hi Sean,
> 
> On 10/16/23 11:49, Sean Christopherson wrote:
> > Compile tested only, but the below should fix the vCPU hotplug case.  Then
> > someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces
> > a masterclock update.
> > 
> > I still think we should clean up the periodic sync code, but I don't think we
> > need to periodically sync the masterclock.
> 
> This looks good to me. The core idea is to not update the master clock in the
> synchronized cases.
> 
> 
> How about the negative value case? I see it is still there in the Linux code?

See below.  

> (It is out of scope for my use case, as I do not need to run vCPUs at a
> different TSC frequency than the host.)
> 
> Thank you very much!
> 
> Dongli Zhang
> 
> > 
> > ---
> >  arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
> >  1 file changed, 16 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index c54e1133e0d3..f0a607b6fc31 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
> >  }
> >  #endif
> >  
> > -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
> > +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
> >  {
> >  #ifdef CONFIG_X86_64
> > -	bool vcpus_matched;
> >  	struct kvm_arch *ka = &vcpu->kvm->arch;
> >  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> >  
> > -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> > -			 atomic_read(&vcpu->kvm->online_vcpus));
> > +	/*
> > +	 * To use the masterclock, the host clocksource must be based on TSC
> > +	 * and all vCPUs must have matching TSCs.  Note, the count for matching
> > +	 * vCPUs doesn't include the reference vCPU, hence "+1".
> > +	 */
> > +	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
> > +				 atomic_read(&vcpu->kvm->online_vcpus)) &&
> > +				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
> >  
> >  	/*
> > -	 * Once the masterclock is enabled, always perform request in
> > -	 * order to update it.
> > -	 *
> > -	 * In order to enable masterclock, the host clocksource must be TSC
> > -	 * and the vcpus need to have matched TSCs.  When that happens,
> > -	 * perform request to enable masterclock.
> > +	 * Request a masterclock update if the masterclock needs to be toggled
> > +	 * on/off, or when starting a new generation and the masterclock is
> > +	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
> > +	 * taken _after_ the new generation is created).
> >  	 */
> > -	if (ka->use_master_clock ||
> > -	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
> > +	if ((ka->use_master_clock && new_generation) ||
> > +	    (ka->use_master_clock != use_master_clock))
> >  		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
> >  
> >  	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
> > @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
> >  	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
> >  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
> >  
> > -	kvm_track_tsc_matching(vcpu);
> > +	kvm_track_tsc_matching(vcpu, !matched);

If my analysis of how the negative timestamp occurred is correct, the problematic
scenario was if cur_tsc_nsec/cur_tsc_write were updated without a masterclock update.
Passing !matched for @new_generation means that KVM will force a masterclock update
if cur_tsc_nsec/cur_tsc_write are changed, i.e. prevent the negative timestamp bug.
  
Dongli Zhang Oct. 17, 2023, 4:18 p.m. UTC | #43
Hi Sean,

On 10/16/23 15:48, Sean Christopherson wrote:
> On Mon, Oct 16, 2023, Dongli Zhang wrote:
>> Hi Sean,
>>
>> On 10/16/23 11:49, Sean Christopherson wrote:
>>> Compile tested only, but the below should fix the vCPU hotplug case.  Then
>>> someone (not me) just needs to figure out why kvm_xen_shared_info_init() forces
>>> a masterclock update.
>>>
>>> I still think we should clean up the periodic sync code, but I don't think we
>>> need to periodically sync the masterclock.
>>
>> This looks good to me. The core idea is to not update the master clock in the
>> synchronized cases.
>>
>>
>> How about the negative value case? I see it is still there in the Linux code?
> 
> See below.  
> 
>> (It is out of scope for my use case, as I do not need to run vCPUs at a
>> different TSC frequency than the host.)
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>>
>>> ---
>>>  arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
>>>  1 file changed, 16 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index c54e1133e0d3..f0a607b6fc31 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
>>>  }
>>>  #endif
>>>  
>>> -static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
>>> +static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
>>>  {
>>>  #ifdef CONFIG_X86_64
>>> -	bool vcpus_matched;
>>>  	struct kvm_arch *ka = &vcpu->kvm->arch;
>>>  	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
>>>  
>>> -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
>>> -			 atomic_read(&vcpu->kvm->online_vcpus));
>>> +	/*
>>> +	 * To use the masterclock, the host clocksource must be based on TSC
>>> +	 * and all vCPUs must have matching TSCs.  Note, the count for matching
>>> +	 * vCPUs doesn't include the reference vCPU, hence "+1".
>>> +	 */
>>> +	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
>>> +				 atomic_read(&vcpu->kvm->online_vcpus)) &&
>>> +				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
>>>  
>>>  	/*
>>> -	 * Once the masterclock is enabled, always perform request in
>>> -	 * order to update it.
>>> -	 *
>>> -	 * In order to enable masterclock, the host clocksource must be TSC
>>> -	 * and the vcpus need to have matched TSCs.  When that happens,
>>> -	 * perform request to enable masterclock.
>>> +	 * Request a masterclock update if the masterclock needs to be toggled
>>> +	 * on/off, or when starting a new generation and the masterclock is
>>> +	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
>>> +	 * taken _after_ the new generation is created).
>>>  	 */
>>> -	if (ka->use_master_clock ||
>>> -	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
>>> +	if ((ka->use_master_clock && new_generation) ||
>>> +	    (ka->use_master_clock != use_master_clock))
>>>  		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
>>>  
>>>  	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
>>> @@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>>>  	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
>>>  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>>>  
>>> -	kvm_track_tsc_matching(vcpu);
>>> +	kvm_track_tsc_matching(vcpu, !matched);
> 
> If my analysis of how the negative timestamp occurred is correct, the problematic
> scenario was if cur_tsc_nsec/cur_tsc_write were updated without a masterclock update.
> Passing !matched for @new_generation means that KVM will force a masterclock update
> if cur_tsc_nsec/cur_tsc_write are changed, i.e. prevent the negative timestamp bug.


Thank you very much for the explanation. Now I understand it.

Thanks to the immediate call to kvm_synchronize_tsc() during each vCPU creation ...

kvm_vm_ioctl(KVM_CREATE_VCPU)
-> kvm_vm_ioctl_create_vcpu()
   -> kvm_arch_vcpu_postcreate()
      -> kvm_synchronize_tsc()

... the local variable "bool use_master_clock" in your patch may always be true.
In that case, "(ka->use_master_clock != use_master_clock)" returns true.


As a result, KVM_REQ_MASTERCLOCK_UPDATE is still triggered during VM creation
for each vCPU.

There is still KVM_REQ_MASTERCLOCK_UPDATE for each vCPU during VM creation.
However, there will be no KVM_REQ_MASTERCLOCK_UPDATE for vCPU hot-add.
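
To spell that out against the condition in your patch (my annotation of
the same lines):

	/*
	 * During early VM creation, ka->use_master_clock may still be
	 * false while the local use_master_clock already evaluates true
	 * (TSCs match and the host clocksource is TSC), so the "!="
	 * check requests an update.  For a later hot-added vCPU whose
	 * TSC synchronizes cleanly, both are true and no new generation
	 * is created, so no update is requested.
	 */
	if ((ka->use_master_clock && new_generation) ||
	    (ka->use_master_clock != use_master_clock))
		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);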

Thank you very much!

Dongli Zhang
  

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 17715cb8731d..57409dce5d73 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1331,6 +1331,7 @@  struct kvm_arch {
 	u64 master_cycle_now;
 	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
+	struct delayed_work masterclock_sync_work;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..0b71dc3785eb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -157,6 +157,9 @@  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 static bool __read_mostly kvmclock_periodic_sync = true;
 module_param(kvmclock_periodic_sync, bool, S_IRUGO);
 
+unsigned int __read_mostly masterclock_sync_period;
+module_param(masterclock_sync_period, uint, 0444);
+
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
@@ -3298,6 +3301,31 @@  static void kvmclock_sync_fn(struct work_struct *work)
 					KVMCLOCK_SYNC_PERIOD);
 }
 
+static void masterclock_sync_fn(struct work_struct *work)
+{
+	unsigned long i;
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
+					   masterclock_sync_work);
+	struct kvm *kvm = container_of(ka, struct kvm, arch);
+	struct kvm_vcpu *vcpu;
+
+	if (!masterclock_sync_period)
+		return;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/*
+		 * There is no need to kick the vcpu: the master clock
+		 * is not expected to be updated immediately.
+		 */
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	}
+
+	schedule_delayed_work(&ka->masterclock_sync_work,
+			      masterclock_sync_period * HZ);
+}
+
+
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
 static bool is_mci_control_msr(u32 msr)
 {
@@ -11970,6 +11998,10 @@  void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
 		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
 						KVMCLOCK_SYNC_PERIOD);
+
+	if (masterclock_sync_period && vcpu->vcpu_idx == 0)
+		schedule_delayed_work(&kvm->arch.masterclock_sync_work,
+				      masterclock_sync_period * HZ);
 }
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -12344,6 +12376,7 @@  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
+	INIT_DELAYED_WORK(&kvm->arch.masterclock_sync_work, masterclock_sync_fn);
 
 	kvm_apicv_init(kvm);
 	kvm_hv_init_vm(kvm);
@@ -12383,6 +12416,7 @@  static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 
 void kvm_arch_sync_events(struct kvm *kvm)
 {
+	cancel_delayed_work_sync(&kvm->arch.masterclock_sync_work);
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);