[v2] KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT
Commit Message
From: Paul Durrant <pdurrant@amazon.com>
Unless explicitly told to do so (by passing 'clocksource=tsc' and
'tsc=stable:socket', and then jumping through some hoops concerning
potential CPU hotplug) Xen will never use TSC as its clocksource.
Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set
in either the primary or secondary pvclock memory areas. This has
led to bugs in some guest kernels which only become evident if
PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support
such guests, give the VMM a new Xen HVM config flag to tell KVM to
forcibly clear the bit in the Xen pvclocks.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Documentation/virt/kvm/api.rst | 6 ++++++
arch/x86/kvm/x86.c | 28 +++++++++++++++++++++++-----
arch/x86/kvm/xen.c | 3 ++-
include/uapi/linux/kvm.h | 1 +
4 files changed, 32 insertions(+), 6 deletions(-)
Comments
On Tue, Oct 31, 2023, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@amazon.com>
>
> Unless explicitly told to do so (by passing 'clocksource=tsc' and
> 'tsc=stable:socket', and then jumping through some hoops concerning
> potential CPU hotplug) Xen will never use TSC as its clocksource.
> Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set
> in either the primary or secondary pvclock memory areas. This has
> led to bugs in some guest kernels which only become evident if
> PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support
> such guests, give the VMM a new Xen HVM config flag to tell KVM to
> forcibly clear the bit in the Xen pvclocks.
>
> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> ---
> Documentation/virt/kvm/api.rst | 6 ++++++
> arch/x86/kvm/x86.c | 28 +++++++++++++++++++++++-----
> arch/x86/kvm/xen.c | 3 ++-
> include/uapi/linux/kvm.h | 1 +
> 4 files changed, 32 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 21a7578142a1..9752a01270df 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8252,6 +8252,7 @@ PVHVM guests. Valid flags are::
> #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
> #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
> #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
> + #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7)
>
> The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
> ioctl is available, for the guest to set its hypercall page.
> @@ -8295,6 +8296,11 @@ behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
> specifically enabled (by the guest making the hypercall, causing the VMM
> to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).
>
> +The KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag indicates that KVM supports
> +clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be
> +done when the KVM_CAP_XEN_HVM ioctl sets the
> +KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag.
> +
> 8.31 KVM_CAP_PPC_MULTITCE
> -------------------------
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41cce5031126..6abad6dacf07 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3096,7 +3096,8 @@ u64 get_kvmclock_ns(struct kvm *kvm)
>
> static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
> struct gfn_to_pfn_cache *gpc,
> - unsigned int offset)
> + unsigned int offset,
> + bool force_tsc_unstable)
> {
> struct kvm_vcpu_arch *vcpu = &v->arch;
> struct pvclock_vcpu_time_info *guest_hv_clock;
> @@ -3122,6 +3123,10 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
> */
>
> guest_hv_clock->version = vcpu->hv_clock.version = (guest_hv_clock->version + 1) | 1;
> +
> + if (force_tsc_unstable)
> + guest_hv_clock->flags &= ~PVCLOCK_TSC_STABLE_BIT;
I don't see how this works. This clears the bit in the guest copy, then clobbers
all of guest_hv_clock with a memcpy().

	if (force_tsc_unstable)
		guest_hv_clock->flags &= ~PVCLOCK_TSC_STABLE_BIT;

	smp_wmb();

	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
	vcpu->hv_clock.flags |= (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);

	if (vcpu->pvclock_set_guest_stopped_request) {
		vcpu->hv_clock.flags |= PVCLOCK_GUEST_STOPPED;
		vcpu->pvclock_set_guest_stopped_request = false;
	}

	memcpy(guest_hv_clock, &vcpu->hv_clock, sizeof(*guest_hv_clock)); <= sets PVCLOCK_TSC_STABLE_BIT again, no?
	smp_wmb();

Any reason not to make this a generic "capability" instead of a Xen specific flag?
E.g. I assume these problematic guests would mishandle PVCLOCK_TSC_STABLE_BIT if
it showed up in kvmclock, but they don't use kvmclock so it's not a problem in
practice.
I doubt there's a real need or use case, but it'd require less churn and IMO is
simpler overall, e.g.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e3eb608b6692..731b201bfd5a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3225,7 +3225,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	/* If the host uses TSC clocksource, then it is stable */
 	pvclock_flags = 0;
-	if (use_master_clock)
+	if (use_master_clock && !vcpu->kvm.force_tsc_unstable)
 		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
 
 	vcpu->hv_clock.flags = pvclock_flags;

I also assume this is a "set and forget" thing? I.e. KVM can require the flag
to be set before any vCPUs are created.
On Tue, 2023-10-31 at 15:39 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Paul Durrant wrote:
> >
> > + if (force_tsc_unstable)
> > + guest_hv_clock->flags &= ~PVCLOCK_TSC_STABLE_BIT;
>
> I don't see how this works. This clears the bit in the guest copy, then clobbers
> all of guest_hv_clock with a memcpy().
Agreed, that seems wrong.
> if (force_tsc_unstable)
> guest_hv_clock->flags &= ~PVCLOCK_TSC_STABLE_BIT;
>
> smp_wmb();
>
> /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
> vcpu->hv_clock.flags |= (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
>
> if (vcpu->pvclock_set_guest_stopped_request) {
> vcpu->hv_clock.flags |= PVCLOCK_GUEST_STOPPED;
> vcpu->pvclock_set_guest_stopped_request = false;
> }
>
> memcpy(guest_hv_clock, &vcpu->hv_clock, sizeof(*guest_hv_clock)); <= sets PVCLOCK_TSC_STABLE_BIT again, no?
> smp_wmb();
>
> Any reason not to make this a generic "capability" instead of a Xen specific flag?
> E.g. I assume these problematic guests would mishandle PVCLOCK_TSC_STABLE_BIT if
> it showed up in kvmclock, but they don't use kvmclock so it's not a problem in
> practice.
No, those guests are just fine with kvmclock. It's the *Xen* page they
forgot to map to userspace for the vDSO to use. And it's Xen (true Xen)
which made you jump through hoops to use the TSC that way, such that it
would actually expose the PVCLOCK_TSC_STABLE_BIT. We don't expect, and
have never seen, such issues with native KVM guests.
> I doubt there's a real need or use case, but it'd require less churn and IMO is
> simpler overall, e.g.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e3eb608b6692..731b201bfd5a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3225,7 +3225,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>
> /* If the host uses TSC clocksource, then it is stable */
> pvclock_flags = 0;
> - if (use_master_clock)
> + if (use_master_clock && !vcpu->kvm.force_tsc_unstable)
> pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
>
> vcpu->hv_clock.flags = pvclock_flags;
>
> I also assume this is a "set and forget" thing? I.e. KVM can require the flag
> to be set before any vCPUs are created.
Hrm, not sure we have previously required that the KVM_XEN_HVM_CONFIG
setup be done before any vCPUs were created. I tend to prefer *not* to
push ordering requirements onto userspace. Does it need to be a per-
vcpu thing?
On Tue, Oct 31, 2023, David Woodhouse wrote:
> On Tue, 2023-10-31 at 15:39 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Paul Durrant wrote:
> > Any reason not to make this a generic "capability" instead of a Xen specific flag?
> > E.g. I assume these problematic guests would mishandle PVCLOCK_TSC_STABLE_BIT if
> > it showed up in kvmclock, but they don't use kvmclock so it's not a problem in
> > practice.
>
> No, those guests are just fine with kvmclock. It's the *Xen* page they
> forgot to map to userspace for the vDSO to use. And it's Xen (true Xen)
> which made you jump through hoops to use the TSC that way, such that it
> would actually expose the PVCLOCK_TSC_STABLE_BIT. We don't expect, and
> have never seen, such issues with native KVM guests.
Hmm, and I suppose theoretically the guest kernel could choose to ignore the Xen
interface for whatever reason. Mostly out of curiosity, is this flag something
that'd be set anytime Xen is advertised to the guest?
> > I doubt there's a real need or use case, but it'd require less churn and IMO is
> > simpler overall, e.g.
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e3eb608b6692..731b201bfd5a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3225,7 +3225,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >
> > /* If the host uses TSC clocksource, then it is stable */
> > pvclock_flags = 0;
> > - if (use_master_clock)
> > + if (use_master_clock && !vcpu->kvm.force_tsc_unstable)
> > pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> >
> > vcpu->hv_clock.flags = pvclock_flags;
> >
> > I also assume this is a "set and forget" thing? I.e. KVM can require the flag
> > to be set before any vCPUs are created.
>
> Hrm, not sure we have previously required that the KVM_XEN_HVM_CONFIG
> setup be done before any vCPUs were created.
Oh, I was asking in the context of adding a generic capability.
> I tend to prefer *not* to push ordering requirements onto userspace.
For per-VM flags that are consumed by vCPUs, it makes reasoning about correctness
and what is/isn't allowed much, much easier.
> Does it need to be a per-vcpu thing?
Huh? No, I was only asking (again, for a generic capability) if we could do

	mutex_lock(&kvm->lock);
	if (!kvm->created_vcpus) {
		kvm->arch.force_tsc_unstable = true;
		r = 0;
	}
	mutex_unlock(&kvm->lock);

So that it would be blatantly obvious that there's no race with checking a per-VM
flag without any lock/RCU protections.
On Tue, 2023-10-31 at 22:58 +0000, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, David Woodhouse wrote:
> > On Tue, 2023-10-31 at 15:39 -0700, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Paul Durrant wrote:
> > > Any reason not to make this a generic "capability" instead of a Xen specific flag?
> > > E.g. I assume these problematic guests would mishandle PVCLOCK_TSC_STABLE_BIT if
> > > it showed up in kvmclock, but they don't use kvmclock so it's not a problem in
> > > practice.
> >
> > No, those guests are just fine with kvmclock. It's the *Xen* page they
> > forgot to map to userspace for the vDSO to use. And it's Xen (true Xen)
> > which made you jump through hoops to use the TSC that way, such that it
> > would actually expose the PVCLOCK_TSC_STABLE_BIT. We don't expect, and
> > have never seen, such issues with native KVM guests.
>
> Hmm, and I suppose theoretically the guest kernel could choose to ignore the Xen
> interface for whatever reason. Mostly out of curiosity, is this flag something
> that'd be set anytime Xen is advertised to the guest?
Probably not in QEMU; I'll make it optional there.
Hosting providers who are migrating millions of Xen guests to KVM and
want to do so with as little customer pain as possible, and who have
already had customer failures due to this guest kernel bug... are more
likely to turn it on for all "Xen" guests.
> > > I doubt there's a real need or use case, but it'd require less churn and IMO is
> > > simpler overall, e.g.
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index e3eb608b6692..731b201bfd5a 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -3225,7 +3225,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> > >
> > > /* If the host uses TSC clocksource, then it is stable */
> > > pvclock_flags = 0;
> > > - if (use_master_clock)
> > > + if (use_master_clock && !vcpu->kvm.force_tsc_unstable)
> > > pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> > >
> > > vcpu->hv_clock.flags = pvclock_flags;
> > >
> > > I also assume this is a "set and forget" thing? I.e. KVM can require the flag
> > > to be set before any vCPUs are created.
> >
> > Hrm, not sure we have previously required that the KVM_XEN_HVM_CONFIG
> > setup be done before any vCPUs were created.
>
> Oh, I was asking in the context of adding a generic capability.
Yeah, it's saner for it to be set-and-forget. We *could* contrive some
kind of detection for the affected guest kernels and turn it off just
for them... but no, I just don't want to.
> > I tend to prefer *not* to push ordering requirements onto userspace.
>
> For per-VM flags that are consumed by vCPUs, it makes reasoning about correctness
> and what is/isn't allowed much, much easier.
>
> > Does it need to be a per-vcpu thing?
>
> Huh? No, I was only asking (again, for a generic capability) if we could do
>
> mutex_lock(&kvm->lock);
> if (!kvm->created_vcpus) {
> kvm->arch.force_tsc_unstable = true;
> r = 0;
> }
> mutex_unlock(&kvm->lock);
>
> So that it would be blatantly obvious that there's no race with checking a per-VM
> flag without any lock/RCU protections.
Makes sense. Although TBH if the VMM wants to flip this bit on and off
at runtime while the guest clocks are being updated, it deserves what
it gets. It's not a problem for KVM.
On 31/10/2023 22:48, David Woodhouse wrote:
> On Tue, 2023-10-31 at 15:39 -0700, Sean Christopherson wrote:
>> On Tue, Oct 31, 2023, Paul Durrant wrote:
>>>
>>> + if (force_tsc_unstable)
>>> + guest_hv_clock->flags &= ~PVCLOCK_TSC_STABLE_BIT;
>>
>> I don't see how this works. This clears the bit in the guest copy, then clobbers
>> all of guest_hv_clock with a memcpy().
>
> Agreed, that seems wrong.
>
It is indeed. Looks like it got moved to the wrong side of the memcpy()
when I rebased.
Paul
On 31/10/2023 23:06, David Woodhouse wrote:
[snip]
>> mutex_lock(&kvm->lock);
>> if (!kvm->created_vcpus) {
>> kvm->arch.force_tsc_unstable = true;
>> r = 0;
>> }
>> mutex_unlock(&kvm->lock);
>>
>> So that it would be blatantly obvious that there's no race with checking a per-VM
>> flag without any lock/RCU protections.
>
> Makes sense. Although TBH if the VMM wants to flip this bit on and off
> at runtime while the guest clocks are being updated, it deserves what
> it gets. It's not a problem for KVM.
>
The first version of the patch, which used an attribute, requested a clock
update when the attribute was set. I dropped that in this version but I
think it'd be best to reinstate it.
Paul
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..9752a01270df 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8252,6 +8252,7 @@ PVHVM guests. Valid flags are::
   #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL		(1 << 4)
   #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND		(1 << 5)
   #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)
+  #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE	(1 << 7)
 
 The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
 ioctl is available, for the guest to set its hypercall page.
@@ -8295,6 +8296,11 @@ behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
 specifically enabled (by the guest making the hypercall, causing the VMM
 to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).
 
+The KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag indicates that KVM supports
+clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be
+done when the KVM_CAP_XEN_HVM ioctl sets the
+KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag.
+
 8.31 KVM_CAP_PPC_MULTITCE
 -------------------------
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce5031126..6abad6dacf07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3096,7 +3096,8 @@ u64 get_kvmclock_ns(struct kvm *kvm)
 
 static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
 				    struct gfn_to_pfn_cache *gpc,
-				    unsigned int offset)
+				    unsigned int offset,
+				    bool force_tsc_unstable)
 {
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct pvclock_vcpu_time_info *guest_hv_clock;
@@ -3122,6 +3123,10 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
 	 */
 
 	guest_hv_clock->version = vcpu->hv_clock.version = (guest_hv_clock->version + 1) | 1;
+
+	if (force_tsc_unstable)
+		guest_hv_clock->flags &= ~PVCLOCK_TSC_STABLE_BIT;
+
 	smp_wmb();
 
 	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
@@ -3154,6 +3159,15 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	u8 pvclock_flags;
 	bool use_master_clock;
 
+	/*
+	 * For Xen guests we may need to override PVCLOCK_TSC_STABLE_BIT as unless
+	 * explicitly told to use TSC as its clocksource Xen will not set this bit.
+	 * This default behaviour led to bugs in some guest kernels which cause
+	 * problems if they observe PVCLOCK_TSC_STABLE_BIT in the pvclock flags.
+	 */
+	bool xen_pvclock_tsc_unstable =
+		ka->xen_hvm_config.flags & KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
+
 	kernel_ns = 0;
 	host_tsc = 0;
 
@@ -3231,12 +3245,15 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	vcpu->hv_clock.flags = pvclock_flags;
 
 	if (vcpu->pv_time.active)
-		kvm_setup_guest_pvclock(v, &vcpu->pv_time, 0);
+		kvm_setup_guest_pvclock(v, &vcpu->pv_time, 0, false);
+
 	if (vcpu->xen.vcpu_info_cache.active)
 		kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_info_cache,
-					offsetof(struct compat_vcpu_info, time));
+					offsetof(struct compat_vcpu_info, time),
+					xen_pvclock_tsc_unstable);
 	if (vcpu->xen.vcpu_time_info_cache.active)
-		kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_time_info_cache, 0);
+		kvm_setup_guest_pvclock(v, &vcpu->xen.vcpu_time_info_cache, 0,
+					xen_pvclock_tsc_unstable);
 	kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
 	return 0;
 }
@@ -4531,7 +4548,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		    KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
 		    KVM_XEN_HVM_CONFIG_SHARED_INFO |
 		    KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
-		    KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
+		    KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
+		    KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
 		if (sched_info_on())
 			r |= KVM_XEN_HVM_CONFIG_RUNSTATE |
 			     KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG;
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1113,7 +1113,8 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
 {
 	/* Only some feature flags need to be *enabled* by userspace */
 	u32 permitted_flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
-		KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
+		KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
+		KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
 
 	if (xhc->flags & ~permitted_flags)
 		return -EINVAL;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1282,6 +1282,7 @@ struct kvm_x86_mce {
 #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL	(1 << 4)
 #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND		(1 << 5)
 #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)
+#define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE	(1 << 7)
 
 struct kvm_xen_hvm_config {
 	__u32 flags;