Message ID | 20230913103729.51194-1-likexu@tencent.com |
---|---|
State | New |
Headers |
From: Like Xu <like.xu.linux@gmail.com>
X-Google-Original-From: Like Xu <likexu@tencent.com>
To: Sean Christopherson <seanjc@google.com>, Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>, Oliver Upton <oliver.upton@linux.dev>, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against startup values
Date: Wed, 13 Sep 2023 18:37:29 +0800
Message-ID: <20230913103729.51194-1-likexu@tencent.com>
Series | [v6] KVM: x86/tsc: Don't sync user-written TSC against startup values |
Commit Message
Like Xu
Sept. 13, 2023, 10:37 a.m. UTC
From: Like Xu <likexu@tencent.com>

The legacy API for setting the TSC is fundamentally broken, and only
allows userspace to set a TSC "now", without any way to account for
time lost to preemption between the calculation of the value, and the
kernel eventually handling the ioctl.

To work around this we have had a hack which, if a TSC is set with a
value which is within a second's worth of a previous vCPU, assumes that
userspace actually intended them to be in sync and adjusts the newly-
written TSC value accordingly.

Thus, when a VMM restores a guest after suspend or migration using the
legacy API, the TSCs aren't necessarily *right*, but at least they're
in sync.

This trick falls down when restoring a guest which genuinely has been
running for less time than the 1 second of imprecision which we allow
for in the legacy API. On *creation* the first vCPU starts its TSC
counting from zero, and the subsequent vCPUs synchronize to that. But
then when the VMM tries to set the intended TSC value, because that's
within a second of what the last TSC synced to, it just adjusts it to
match that.

The correct answer is for the VMM not to use the legacy API of course.

But we can pile further hacks onto our existing hackish ABI, and
declare that the *first* value written by userspace (on any vCPU)
should not be subject to this 'correction' to make it sync up with
values that only from the kernel's default vCPU creation.

To that end: Add a flag in kvm->arch.user_set_tsc, protected by
kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
this KVM *has* been set by userspace. Make the 1-second slop hack only
trigger if that flag is already set.

Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Like Xu <likexu@tencent.com>
Tested-by: Yong He <alexyonghe@tencent.com>
---
V5 -> V6 Changelog:
- Refine commit message and comments to make more sense; (David)
- Set kvm->arch.user_set_tsc in the kvm_arch_tsc_set_attr(); (David)
- Continue to allow userspace to write zero to force a sync; (David)

V5: https://lore.kernel.org/kvm/20230913072113.78885-1-likexu@tencent.com/

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 24 ++++++++++++++++++++----
 2 files changed, 21 insertions(+), 4 deletions(-)

base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
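The heuristic being gated here fits in a few lines. Below is a minimal, self-contained sketch -- not the kernel code, and using invented example numbers -- of how a non-zero user-written TSC value within one second's worth of cycles of the expected value is treated as a sync attempt, and how the proposed user_set_tsc flag gates that behaviour.

/*
 * Illustrative sketch only -- not the KVM implementation.  Models the
 * "1-second slop" check from kvm_synchronize_tsc() with example numbers;
 * user_set_tsc gates the heuristic as the patch proposes.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool tsc_write_synchronizes(uint64_t data, uint64_t last_tsc_write,
				   uint64_t elapsed_cycles, uint64_t tsc_khz,
				   bool user_set_tsc)
{
	if (data == 0)
		return true;		/* explicit request to sync */
	if (!user_set_tsc)
		return false;		/* first userspace write: take it literally */

	uint64_t tsc_exp = last_tsc_write + elapsed_cycles;	/* where the TSC "should" be now */
	uint64_t tsc_hz  = tsc_khz * 1000ULL;			/* one second's worth of cycles */

	return data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
}

int main(void)
{
	/* Example: 2.5 GHz guest TSC, last vCPU synced 100 ms ago at cycle 1e9. */
	uint64_t khz = 2500000, last = 1000000000ULL, elapsed = 250000000ULL;

	/* Restore of a guest that ran ~0.4 s: within the 1-second window. */
	uint64_t restore = last + elapsed + 1000000000ULL;

	printf("flag clear: sync=%d\n",
	       tsc_write_synchronizes(restore, last, elapsed, khz, false));
	printf("flag set:   sync=%d\n",
	       tsc_write_synchronizes(restore, last, elapsed, khz, true));
	return 0;
}

With the flag clear (the first userspace write), the value is taken literally; once the flag is set, nearby values continue to be absorbed into a sync, preserving the old behaviour for subsequent writes.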
Comments
On Wed, 2023-09-13 at 18:37 +0800, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
>
> The legacy API for setting the TSC is fundamentally broken, and only
> allows userspace to set a TSC "now", without any way to account for
> time lost to preemption between the calculation of the value, and the
> kernel eventually handling the ioctl.
>
> To work around this we have had a hack which, if a TSC is set with a
> value which is within a second's worth of a previous vCPU, assumes that
> userspace actually intended them to be in sync and adjusts the newly-
> written TSC value accordingly.
>
> Thus, when a VMM restores a guest after suspend or migration using the
> legacy API, the TSCs aren't necessarily *right*, but at least they're
> in sync.
>
> This trick falls down when restoring a guest which genuinely has been
> running for less time than the 1 second of imprecision which we allow
> for in the legacy API. On *creation* the first vCPU starts its TSC
> counting from zero, and the subsequent vCPUs synchronize to that. But
> then when the VMM tries to set the intended TSC value, because that's
> within a second of what the last TSC synced to, it just adjusts it to
> match that.
>

Proofreading my own words here... "it just adjusts it to match" is using the
same pronoun for different things and is probably hard to follow. Perhaps
"KVM just adjusts it to match" is nicer.

> The correct answer is for the VMM not to use the legacy API of course.
>
> But we can pile further hacks onto our existing hackish ABI, and
> declare that the *first* value written by userspace (on any vCPU)
> should not be subject to this 'correction' to make it sync up with
> values that only from the kernel's default vCPU creation.

                 ^^ ... that only *come* from the kernel's...

>
> To that end: Add a flag in kvm->arch.user_set_tsc, protected by
> kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
> this KVM *has* been set by userspace. Make the 1-second slop hack only
> trigger if that flag is already set.
>
> Reported-by: Yong He <alexyonghe@tencent.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
> Suggested-by: Oliver Upton <oliver.upton@linux.dev>
> Original-by: Oliver Upton <oliver.upton@linux.dev>
> Original-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: David Woodhouse <dwmw2@infradead.org>
> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
> Signed-off-by: Like Xu <likexu@tencent.com>
> Tested-by: Yong He <alexyonghe@tencent.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

Please remove the 'Signed-off-by' from me. You must never ever *type* a
signed-off-by line for anyone else. You only ever cut and paste those intact
when they have provided them for *themselves*.

It's OK to remove the Co-developed-by: too. You did the actual typing of the
code here; I just heckled :)
On Wed, Sep 13, 2023, Like Xu wrote:
> I'll wait for a cooling off period to see if the maintainers need me to post v7.
You should have waiting to post v5, let alone v6. Resurrecting a thread after a
month and not waiting even 7 hours for others to respond is extremely frustrating.
On 13/9/2023 10:47 pm, Sean Christopherson wrote:
> On Wed, Sep 13, 2023, Like Xu wrote:
>> I'll wait for a cooling off period to see if the maintainers need me to post v7.
>
> You should have waiting to post v5, let alone v6. Resurrecting a thread after a
> month and not waiting even 7 hours for others to respond is extremely frustrating.

You are right. I don't seem to be keeping up with many of other issues. Sorry for that.
Wish there were 48 hours in a day.

Back to this issue: for commit message, I'd be more inclined to David's
understanding, but you have the gavel; and for proposed code diff, how about the
following changes:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..9a7dfef9d32d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1324,6 +1324,7 @@ struct kvm_arch {
 	int nr_vcpus_matched_tsc;
 
 	u32 default_tsc_khz;
+	bool user_set_tsc;
 
 	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..faaae8b3fec4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
+	u64 data = user_value ? *user_value : 0;
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
@@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
+		/*
+		 * Force synchronization when creating or hotplugging a vCPU,
+		 * i.e. when the TSC value is '0', to help keep clocks stable.
+		 * If this is NOT a hotplug/creation case, skip synchronization
+		 * on the first write from userspace so as not to misconstrue
+		 * state restoration after live migration as an attempt from
+		 * userspace to synchronize.
+		 */
 		if (data == 0) {
-			/*
-			 * detection of vcpu initialization -- need to sync
-			 * with other vCPUs. This particularly helps to keep
-			 * kvm_clock stable after CPU hotplug
-			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_set_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 				      nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
-			 * Special case: TSC write with a small delta (1 second)
-			 * of virtual cycle time against real time is
-			 * interpreted as an attempt to synchronize the CPU.
+			 * Here lies UAPI baggage: when a user-initiated TSC write has
+			 * a small delta (1 second) of virtual cycle time against the
+			 * previously set vCPU, we assume that they were intended to be
+			 * in sync and the delta was only due to the racy nature of the
+			 * legacy API.
+			 *
+			 * This trick falls down when restoring a guest which genuinely
+			 * has been running for less time than the 1 second of imprecision
+			 * which we allow for in the legacy API. In this case, the first
+			 * value written by userspace (on any vCPU) should not be subject
+			 * to this 'correction' to make it sync up with values that only
+			 * from the kernel's default vCPU creation. Make the 1-second slop
+			 * hack only trigger if the user_set_tsc flag is already set.
+			 *
+			 * The correct answer is for the VMM not to use the legacy API.
 			 */
 			synchronizing = data < tsc_exp + tsc_hz &&
 					data + tsc_hz > tsc_exp;
 		}
 	}
 
+	if (user_value)
+		kvm->arch.user_set_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation.  We could let the
@@ -3777,7 +3796,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, &data);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -5536,6 +5555,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 	tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
 	ns = get_kvmclock_base_ns();
 
+	kvm->arch.user_set_tsc = true;
 	__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
@@ -11959,7 +11979,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
On Thu, 2023-09-14 at 11:50 +0800, Like Xu wrote:
> On 13/9/2023 10:47 pm, Sean Christopherson wrote:
> > On Wed, Sep 13, 2023, Like Xu wrote:
> > > I'll wait for a cooling off period to see if the maintainers need me to post v7.
> >
> > You should have waiting to post v5, let alone v6. Resurrecting a thread after a
> > month and not waiting even 7 hours for others to respond is extremely frustrating.
>
> You are right. I don't seem to be keeping up with many of other issues. Sorry
> for that.
> Wish there were 48 hours in a day.
>
> Back to this issue: for commit message, I'd be more inclined to David's
> understanding,

The discussion that Sean and I had should probably be reflected in the
commit message too. To the end of the commit log you used for v6, after
the final 'To that end:…' paragraph, let's add:

  Note that userspace can explicitly request a *synchronization* of the
  TSC by writing zero. For the purpose of this patch, this counts as
  "setting" the TSC. If userspace then subsequently writes an explicit
  non-zero value which happens to be within 1 second of the previous
  value, it will be 'corrected'. For that case, this preserves the prior
  behaviour of KVM (which always applied the 1-second 'correction'
  regardless of user vs. kernel).

> @@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	elapsed = ns - kvm->arch.last_tsc_nsec;
>
>  	if (vcpu->arch.virtual_tsc_khz) {
> +		/*
> +		 * Force synchronization when creating or hotplugging a vCPU,
> +		 * i.e. when the TSC value is '0', to help keep clocks stable.
> +		 * If this is NOT a hotplug/creation case, skip synchronization
> +		 * on the first write from userspace so as not to misconstrue
> +		 * state restoration after live migration as an attempt from
> +		 * userspace to synchronize.
> +		 */

You cannot *misconstrue* an attempt from userspace to synchronize. If
userspace writes a zero, it's a sync attempt. If it's non-zero it's a
TSC value to be set. It's not very subtle :)

I think the 1-second slop thing is sufficiently documented in the 'else
if' clause below, so I started writing an alternative 'overall' comment
to go here and found it a bit redundant. So maybe let's just drop this
comment and add one back in the if (data == 0) case...

>  		if (data == 0) {
> -			/*
> -			 * detection of vcpu initialization -- need to sync
> -			 * with other vCPUs. This particularly helps to keep
> -			 * kvm_clock stable after CPU hotplug
> -			 */

	/*
	 * Force synchronization when creating a vCPU, or when
	 * userspace explicitly writes a zero value.
	 */

>  			synchronizing = true;
> -		} else {
> +		} else if (kvm->arch.user_set_tsc) {
>  			u64 tsc_exp = kvm->arch.last_tsc_write +
>  				      nsec_to_cycles(vcpu, elapsed);
>  			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
>  			/*
> -			 * Special case: TSC write with a small delta (1 second)
> -			 * of virtual cycle time against real time is
> -			 * interpreted as an attempt to synchronize the CPU.
> +			 * Here lies UAPI baggage: when a user-initiated TSC write has
> +			 * a small delta (1 second) of virtual cycle time against the
> +			 * previously set vCPU, we assume that they were intended to be
> +			 * in sync and the delta was only due to the racy nature of the
> +			 * legacy API.
> +			 *
> +			 * This trick falls down when restoring a guest which genuinely
> +			 * has been running for less time than the 1 second of imprecision
> +			 * which we allow for in the legacy API. In this case, the first
> +			 * value written by userspace (on any vCPU) should not be subject
> +			 * to this 'correction' to make it sync up with values that only

Missing the word 'come' here too, in '…that only *come* from…',

> +			 * from the kernel's default vCPU creation. Make the 1-second slop
> +			 * hack only trigger if the user_set_tsc flag is already set.
> +			 *
> +			 * The correct answer is for the VMM not to use the legacy API.

Maybe we should drop this line, as we don't actually have a sane API
yet that VMMs can use instead.
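For readers unfamiliar with the "legacy API" under discussion: userspace drives this path by writing MSR_IA32_TSC through KVM_SET_MSRS on a vCPU fd, which KVM treats as a host-initiated write. The helper below is a hypothetical sketch (the vcpu_fd and the helper name are assumptions, not taken from this thread or any particular VMM), showing only the zero-versus-non-zero distinction David describes: zero requests synchronization, non-zero sets an absolute value.

/*
 * Hypothetical VMM-side helper (illustration only): set or synchronize a
 * guest TSC through the legacy MSR path.  Assumes "vcpu_fd" is an
 * already-created KVM vCPU file descriptor.
 */
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#ifndef MSR_IA32_TSC
#define MSR_IA32_TSC 0x00000010
#endif

static int vmm_write_guest_tsc(int vcpu_fd, uint64_t value)
{
	struct {
		struct kvm_msrs info;
		struct kvm_msr_entry entries[1];
	} msr_data;

	memset(&msr_data, 0, sizeof(msr_data));
	msr_data.info.nmsrs = 1;
	msr_data.entries[0].index = MSR_IA32_TSC;
	/*
	 * value == 0 asks KVM to synchronize this vCPU with the others;
	 * a non-zero value sets an absolute TSC, subject to the 1-second
	 * slop heuristic (gated by user_set_tsc in the proposed patch).
	 */
	msr_data.entries[0].data = value;

	/* KVM_SET_MSRS returns the number of MSRs successfully written. */
	return ioctl(vcpu_fd, KVM_SET_MSRS, &msr_data.info) == 1 ? 0 : -1;
}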
On 14/9/2023 3:31 pm, David Woodhouse wrote:
> [...]
> Maybe we should drop this line, as we don't actually have a sane API
> yet that VMMs can use instead.

Thanks for your comments, but not sure if Sean has any more concerns to move forward:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..9a7dfef9d32d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1324,6 +1324,7 @@ struct kvm_arch {
 	int nr_vcpus_matched_tsc;
 
 	u32 default_tsc_khz;
+	bool user_set_tsc;
 
 	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..11fbd2a4a370 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
+	u64 data = user_value ? *user_value : 0;
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
@@ -2730,25 +2731,37 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	if (vcpu->arch.virtual_tsc_khz) {
 		if (data == 0) {
 			/*
-			 * detection of vcpu initialization -- need to sync
-			 * with other vCPUs. This particularly helps to keep
-			 * kvm_clock stable after CPU hotplug
+			 * Force synchronization when creating a vCPU, or when
+			 * userspace explicitly writes a zero value.
 			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_set_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 				      nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
-			 * Special case: TSC write with a small delta (1 second)
-			 * of virtual cycle time against real time is
-			 * interpreted as an attempt to synchronize the CPU.
+			 * Here lies UAPI baggage: when a user-initiated TSC write has
+			 * a small delta (1 second) of virtual cycle time against the
+			 * previously set vCPU, we assume that they were intended to be
+			 * in sync and the delta was only due to the racy nature of the
+			 * legacy API.
+			 *
+			 * This trick falls down when restoring a guest which genuinely
+			 * has been running for less time than the 1 second of imprecision
+			 * which we allow for in the legacy API. In this case, the first
+			 * value written by userspace (on any vCPU) should not be subject
+			 * to this 'correction' to make it sync up with values that only
+			 * come from the kernel's default vCPU creation. Make the 1-second
+			 * slop hack only trigger if the user_set_tsc flag is already set.
 			 */
 			synchronizing = data < tsc_exp + tsc_hz &&
 					data + tsc_hz > tsc_exp;
 		}
 	}
 
+	if (user_value)
+		kvm->arch.user_set_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation.  We could let the
@@ -3777,7 +3790,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, &data);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -5536,6 +5549,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 	tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
 	ns = get_kvmclock_base_ns();
 
+	kvm->arch.user_set_tsc = true;
 	__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
@@ -11959,7 +11973,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
Ping for further comments and confirmation from Sean.
Could we trigger a new version for this issue ? Thanks.

On 19/9/2023 7:29 pm, Like Xu wrote:
> [...]
> Thanks for your comments, but not sure if Sean has any more concerns to
> move forward:
> [...]
On 25 September 2023 09:36:47 CEST, Like Xu <like.xu.linux@gmail.com> wrote:
> Ping for further comments and confirmation from Sean.
> Could we trigger a new version for this issue ? Thanks.

I believe you just have a few tweaks to the comments; I'd resend that as v7.
On 25/9/2023 4:31 pm, David Woodhouse wrote:
> On 25 September 2023 09:36:47 CEST, Like Xu <like.xu.linux@gmail.com> wrote:
>> Ping for further comments and confirmation from Sean.
>> Could we trigger a new version for this issue ? Thanks.
>
> I believe you just have a few tweaks to the comments; I'd resend that as v7.

OK, thanks.

Since I don't seem to have seen V7 yet on the list, just let me know if anyone
has any new thoughts on this issue.
On 7 October 2023 08:29:08 BST, Like Xu <like.xu.linux@gmail.com> wrote:
> On 25/9/2023 4:31 pm, David Woodhouse wrote:
>> On 25 September 2023 09:36:47 CEST, Like Xu <like.xu.linux@gmail.com> wrote:
>>> Ping for further comments and confirmation from Sean.
>>> Could we trigger a new version for this issue ? Thanks.
>>
>> I believe you just have a few tweaks to the comments; I'd resend that as v7.
>
> OK, thanks.
>
> Since I don't seem to have seen V7 yet on the list,
> just let me know if anyone has any new thoughts on this issue.

Ah, the end of my sentence was intended to mean, "(if I were you) I (woul)d
resend that as v7." I didn't mean to suggest that I was actually going to do
so myself. So if you didn't post a v7, you won't see it on the list yet :)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..427be7ef9702 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1322,6 +1322,7 @@ struct kvm_arch {
 	u64 cur_tsc_offset;
 	u64 cur_tsc_generation;
 	int nr_vcpus_matched_tsc;
+	bool user_set_tsc;
 
 	u32 default_tsc_khz;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..354169fbc9c4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2735,20 +2735,35 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 			 * kvm_clock stable after CPU hotplug
 			 */
 			synchronizing = true;
-		} else {
+		} else if (kvm->arch.user_set_tsc) {
 			u64 tsc_exp = kvm->arch.last_tsc_write +
 				      nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
-			 * Special case: TSC write with a small delta (1 second)
-			 * of virtual cycle time against real time is
-			 * interpreted as an attempt to synchronize the CPU.
+			 * Here lies UAPI baggage: when a user-initiated TSC write has
+			 * a small delta (1 second) of virtual cycle time against the
+			 * previously set vCPU, we assume that they were intended to be
+			 * in sync and the delta was only due to the racy nature of the
+			 * legacy API.
+			 *
+			 * This trick falls down when restoring a guest which genuinely
+			 * has been running for less time than the 1 second of imprecision
+			 * which we allow for in the legacy API. In this case, the first
+			 * value written by userspace (on any vCPU) should not be subject
+			 * to this 'correction' to make it sync up with values that only
+			 * from the kernel's default vCPU creation. Make the 1-second slop
+			 * hack only trigger if the user_set_tsc flag is already set.
+			 *
+			 * The correct answer is for the VMM not to use the legacy API.
 			 */
 			synchronizing = data < tsc_exp + tsc_hz &&
 					data + tsc_hz > tsc_exp;
 		}
 	}
 
+	if (data)
+		kvm->arch.user_set_tsc = true;
+
 	/*
 	 * For a reliable TSC, we can match TSC offsets, and for an unstable
 	 * TSC, we add elapsed time in this computation.  We could let the
@@ -5536,6 +5551,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 	tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
 	ns = get_kvmclock_base_ns();
 
+	kvm->arch.user_set_tsc = true;
 	__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 