[v3] KVM: x86/tsc: Don't sync user changes to TSC with KVM-initiated change

Message ID 20230731080758.29482-1-likexu@tencent.com
State New
Headers
Series [v3] KVM: x86/tsc: Don't sync user changes to TSC with KVM-initiated change |

Commit Message

Like Xu July 31, 2023, 8:07 a.m. UTC
  From: Like Xu <likexu@tencent.com>

Add kvm->arch.user_changed_tsc to avoid synchronizing user changes to
the TSC with the KVM-initiated change in kvm_arch_vcpu_postcreate() by
conditioning this mess on userspace having written the TSC at least
once already.

Here lies UAPI baggage: user-initiated TSC write with a small delta
(1 second) of virtual cycle time against real time is interpreted as an
attempt to synchronize the CPU. In such a scenario, the vcpu's tsc_offset
is not configured as expected, resulting in significant guest service
response latency, which is observed in our production environment.

Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Like Xu <likexu@tencent.com>
Signed-off-by: Like Xu <likexu@tencent.com>
---
V2 -> V3 Changelog:
- Use the kvm->arch.user_changed_tsc proposal; (Oliver & Paolo)
V2: https://lore.kernel.org/kvm/20230724073516.45394-1-likexu@tencent.com/

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 23 ++++++++++++++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)


base-commit: 5a7591176c47cce363c1eed704241e5d1c42c5a6
  

Comments

Oliver Upton July 31, 2023, 6:29 p.m. UTC | #1
On Mon, Jul 31, 2023 at 04:07:58PM +0800, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
> 
> Add kvm->arch.user_changed_tsc to avoid synchronizing user changes to
> the TSC with the KVM-initiated change in kvm_arch_vcpu_postcreate() by
> conditioning this mess on userspace having written the TSC at least
> once already.
> 
> Here lies UAPI baggage: user-initiated TSC write with a small delta
> (1 second) of virtual cycle time against real time is interpreted as an
> attempt to synchronize the CPU. In such a scenario, the vcpu's tsc_offset
> is not configured as expected, resulting in significant guest service
> response latency, which is observed in our production environment.

The changelog reads really weird, because it is taken out of context
when it isn't a comment over the affected code. Furthermore, 'our
production environment' is a complete black box to the rest of the
community, it would be helpful spelling out exactly what the use case
is.

Suggested changelog:

  KVM interprets writes to the TSC with values within 1 second of each
  other as an attempt to synchronize the TSC for all vCPUs in the VM,
  and uses a common offset for all vCPUs in a VM. For brevity's sake
  let's just ignore what happens on systems with an unstable TSC.

  While this may seem odd, it is imperative for VM save/restore, as VMMs
  such as QEMU have long resorted to saving the TSCs (by value) from all
  vCPUs in the VM at approximately the same time. Of course, it is
  impossible to synchronize all the vCPU ioctls to capture the exact
  instant in time, hence KVM fudges it a bit on the restore side.

  This has been useful for the 'typical' VM lifecycle, where in all
  likelihood the VM goes through save/restore a considerable amount of
  time after VM creation. Nonetheless, there are some use cases that
  need to restore a VM snapshot that was created very shortly after boot
  (<1 second). Unfortunately the TSC sync code makes no distinction
  between kernel and user-initiated writes, which leads to the target VM
  synchronizing on the TSC offset from creation instead of the
  user-intended value.

  Avoid synchronizing user-initiated changes to the guest TSC with the
  KVM initiated change in kvm_arch_vcpu_postcreate() by conditioning the
  logic on userspace having written the TSC at least once.

I'll also note that the whole value-based TSC sync scheme is in
desperate need of testing.
  
Like Xu Aug. 1, 2023, 2:26 a.m. UTC | #2
On 1/8/2023 2:29 am, Oliver Upton wrote:
> On Mon, Jul 31, 2023 at 04:07:58PM +0800, Like Xu wrote:
>> From: Like Xu <likexu@tencent.com>
>>
>> Add kvm->arch.user_changed_tsc to avoid synchronizing user changes to
>> the TSC with the KVM-initiated change in kvm_arch_vcpu_postcreate() by
>> conditioning this mess on userspace having written the TSC at least
>> once already.
>>
>> Here lies UAPI baggage: user-initiated TSC write with a small delta
>> (1 second) of virtual cycle time against real time is interpreted as an
>> attempt to synchronize the CPU. In such a scenario, the vcpu's tsc_offset
>> is not configured as expected, resulting in significant guest service
>> response latency, which is observed in our production environment.
> 
> The changelog reads really weird, because it is taken out of context
> when it isn't a comment over the affected code. Furthermore, 'our
> production environment' is a complete black box to the rest of the
> community, it would be helpful spelling out exactly what the use case
> is.
> 
> Suggested changelog:
> 
>    KVM interprets writes to the TSC with values within 1 second of each
>    other as an attempt to synchronize the TSC for all vCPUs in the VM,
>    and uses a common offset for all vCPUs in a VM. For brevity's sake
>    let's just ignore what happens on systems with an unstable TSC.
> 
>    While this may seem odd, it is imperative for VM save/restore, as VMMs
>    such as QEMU have long resorted to saving the TSCs (by value) from all
>    vCPUs in the VM at approximately the same time. Of course, it is
>    impossible to synchronize all the vCPU ioctls to capture the exact
>    instant in time, hence KVM fudges it a bit on the restore side.
> 
>    This has been useful for the 'typical' VM lifecycle, where in all
>    likelihood the VM goes through save/restore a considerable amount of
>    time after VM creation. Nonetheless, there are some use cases that
>    need to restore a VM snapshot that was created very shortly after boot
>    (<1 second). Unfortunately the TSC sync code makes no distinction
>    between kernel and user-initiated writes, which leads to the target VM
>    synchronizing on the TSC offset from creation instead of the
>    user-intended value.

Great clarification. Thanks, we're on the same page.

> 
>    Avoid synchronizing user-initiated changes to the guest TSC with the
>    KVM initiated change in kvm_arch_vcpu_postcreate() by conditioning the
>    logic on userspace having written the TSC at least once.
> 
> I'll also note that the whole value-based TSC sync scheme is in
> desperate need of testing.
>
  

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3bc146dfd38d..e8d423ef1474 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1303,6 +1303,7 @@  struct kvm_arch {
 	u64 cur_tsc_offset;
 	u64 cur_tsc_generation;
 	int nr_vcpus_matched_tsc;
+	bool user_changed_tsc;
 
 	u32 default_tsc_khz;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 278dbd37dab2..eeaf4ad9174d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2713,7 +2713,7 @@  static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, bool user_initiated)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2734,20 +2734,29 @@  static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 			 * kvm_clock stable after CPU hotplug
 			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_changed_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 						nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
-			 * Special case: TSC write with a small delta (1 second)
-			 * of virtual cycle time against real time is
-			 * interpreted as an attempt to synchronize the CPU.
+			 * Here lies UAPI baggage: user-initiated TSC write with
+			 * a small delta (1 second) of virtual cycle time
+			 * against real time is interpreted as an attempt to
+			 * synchronize the CPU.
+			 *
+			 * Don't synchronize user changes to the TSC with the
+			 * KVM-initiated change in kvm_arch_vcpu_postcreate()
+			 * by conditioning this mess on userspace having
+			 * written the TSC at least once already.
 			 */
 			synchronizing = data < tsc_exp + tsc_hz &&
 					data + tsc_hz > tsc_exp;
 		}
 	}
 
+	if (user_initiated)
+		kvm->arch.user_changed_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation.  We could let the
@@ -3776,7 +3785,7 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, true);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11950,7 +11959,7 @@  void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, false);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */