[1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

Message ID 20240124003858.3954822-2-mizhang@google.com
State New
Series minor fix on perf_capabilities in KVM/x86

Commit Message

Mingwei Zhang Jan. 24, 2024, 12:38 a.m. UTC
Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest CPUID.
Without this, live migration breaks. In particular, when migrating a VM
without PDCM enabled, the VMM on the source can retrieve a non-zero value by
reading MSR_IA32_PERF_CAPABILITIES, but the VMM on the target is unable to
set that value. This causes confusion on the user side.

Fundamentally, the problem is that vcpu->arch.perf_capabilities, the cached
value of MSR_IA32_PERF_CAPABILITIES, is incorrect. There is nothing wrong
with kvm_get_msr_common(), which just reads vcpu->arch.perf_capabilities.

Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
early during VM setup.

Cc: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/cpuid.c | 3 +++
 1 file changed, 3 insertions(+)
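
The asymmetry described in the commit message can be sketched with a toy model. This is only an illustration, not kernel code: the mig_* names, the struct, and the constant standing in for kvm_caps.supported_perf_cap are all hypothetical, and the write-side check is an assumed simplification of whatever makes the target reject the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for kvm_caps.supported_perf_cap. */
#define MIG_SUPPORTED_PERF_CAP 0x32c1ULL

/* Toy per-vCPU state; not the kernel's struct kvm_vcpu. */
struct mig_vcpu {
	bool guest_has_pdcm;        /* models guest_cpuid_has(vcpu, X86_FEATURE_PDCM) */
	uint64_t perf_capabilities; /* models vcpu->arch.perf_capabilities */
};

/* vCPU creation seeds the cached MSR value before any CPUID is set. */
static void mig_vcpu_create(struct mig_vcpu *v)
{
	v->guest_has_pdcm = false;
	v->perf_capabilities = MIG_SUPPORTED_PERF_CAP;
}

/* Models KVM_SET_CPUID{,2}; apply_reset toggles the hunk from this patch. */
static void mig_set_cpuid(struct mig_vcpu *v, bool pdcm, bool apply_reset)
{
	v->guest_has_pdcm = pdcm;
	if (apply_reset && !v->guest_has_pdcm)
		v->perf_capabilities = 0;
}

/* Read path: returns the cached value unconditionally. */
static uint64_t mig_get_msr(const struct mig_vcpu *v)
{
	return v->perf_capabilities;
}

/* Write path (assumed): a non-zero value is rejected without PDCM. */
static int mig_set_msr(struct mig_vcpu *v, uint64_t data)
{
	if (data && !v->guest_has_pdcm)
		return 1;
	v->perf_capabilities = data;
	return 0;
}

/* Save the MSR on a PDCM-less source and restore it on a PDCM-less target. */
static bool mig_roundtrip_ok(bool apply_reset)
{
	struct mig_vcpu src, dst;

	mig_vcpu_create(&src);
	mig_set_cpuid(&src, false, apply_reset);
	uint64_t saved = mig_get_msr(&src);

	mig_vcpu_create(&dst);
	mig_set_cpuid(&dst, false, apply_reset);
	assert(!dst.guest_has_pdcm);
	return mig_set_msr(&dst, saved) == 0;
}
```

In this model, mig_roundtrip_ok(false) fails because the source hands back the seeded non-zero value, while mig_roundtrip_ok(true) succeeds because both sides see 0.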
  

Comments

Sean Christopherson Jan. 24, 2024, 3:49 p.m. UTC | #1
On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> Without this, there is an issue in live migration. In particular, to
> migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> the target is unable to set the value. This creates confusions on the user
> side.
> 
> Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> wrong on the kvm_get_msr_common() which just reads
> vcpu->arch.perf_capabilities.
> 
> Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> early in VM setup time.
> 
> Cc: Aaron Lewis <aaronlewis@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/cpuid.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index adba49afb5fe..416bee03c42a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
>  	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>  
> +	/* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> +	if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> +		vcpu->arch.perf_capabilities = 0;

No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g.
KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
actually does something like that, but KVM overwriting guest state is almost
never a good thing.

I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
already fudging around this in our internal kernels, so I don't think there's a
need to carry a hack-a-fix for the destination kernel.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 27e23714e960..fdef9d706d61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
        kvm_async_pf_hash_reset(vcpu);
 
-       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
        kvm_pmu_init(vcpu);
 
        vcpu->arch.pending_external_vector = -1;

>  	kvm_pmu_refresh(vcpu);
>  	vcpu->arch.cr4_guest_rsvd_bits =
>  	    __cr4_reserved_bits(guest_cpuid_has, vcpu);
> -- 
> 2.43.0.429.g432eaa2c6b-goog
>
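
The clobbering concern raised in this reply can be illustrated with a small toy model. It is a sketch under stated assumptions, not kernel code: the clob_* names are hypothetical, the MSR write is modelled without validation, and the ordering shown is just one possible sequence of userspace calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy per-vCPU state; not the kernel's struct kvm_vcpu. */
struct clob_vcpu {
	bool guest_has_pdcm;
	uint64_t perf_capabilities;
};

/* Models the proposed hunk, which would run on every KVM_SET_CPUID{,2}. */
static void clob_after_set_cpuid(struct clob_vcpu *v, bool pdcm)
{
	v->guest_has_pdcm = pdcm;
	if (!v->guest_has_pdcm)
		v->perf_capabilities = 0;
}

/* Models a host userspace write to the MSR (validation omitted). */
static void clob_host_write_msr(struct clob_vcpu *v, uint64_t data)
{
	v->perf_capabilities = data;
}

/*
 * Hypothetical ordering: userspace enables PDCM, writes the MSR, then
 * sets CPUID twice more, once without PDCM and once with it again.
 * Returns the MSR value that survives.
 */
static uint64_t clob_sequence(uint64_t written)
{
	struct clob_vcpu v = { .guest_has_pdcm = false, .perf_capabilities = 0 };

	clob_after_set_cpuid(&v, true);
	clob_host_write_msr(&v, written);
	clob_after_set_cpuid(&v, false); /* intermediate CPUID without PDCM */
	clob_after_set_cpuid(&v, true);  /* final CPUID with PDCM again */
	assert(v.guest_has_pdcm);
	return v.perf_capabilities;
}
```

Here the value written by userspace is gone after the second and third CPUID updates even though the final CPUID has PDCM set, which is the kind of silent overwrite of guest state the reply objects to.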
  
Mingwei Zhang Jan. 25, 2024, 12:14 a.m. UTC | #2
On Wed, Jan 24, 2024, Sean Christopherson wrote:
> On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > > On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > >
> > > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > > No, this is just papering over the underlying bug.  KVM shouldn't be stuffing
> > > > > vcpu->arch.perf_capabilities without explicit writes from host userspace.  E.g
> > > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > > > host userspace write to MSR_IA32_PERF_CAPABILITIES.  It's unlikely any userspace
> > > > > actually does something like that, but KVM overwriting guest state is almost
> > > > > never a good thing.
> > > > >
> > > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities.  I believe we are
> > > > > already fudging around this in our internal kernels, so I don't think there's a
> > > > > need to carry a hack-a-fix for the destination kernel.
> > > > >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 27e23714e960..fdef9d706d61 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > > >
> > > > >         kvm_async_pf_hash_reset(vcpu);
> > > > >
> > > > > -       vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > > > 
> > > > Yeah, that will fix the issue we are seeing.  The only thing that's
> > > > not clear to me is if userspace should expect KVM to set this or if
> > > > KVM should expect userspace to set this.  How is that generally
> > > > decided?
> > > 
> > > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> > > To be consistent with KVM's CPUID model at vCPU creation, which is completely
> > > empty (vCPU has no PMU and no PDCM support), KVM *must* zero
> > > vcpu->arch.perf_capabilities.
> > > 
> > > If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> > > PDCM and set MSR_IA32_PERF_CAPABILITIES.
> > > 
> > > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> > > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value.  That too
> > > should be excised.
> > > 
> > hmm, does that mean KVM just allows an invalid vcpu state to exist from
> > the host's point of view?
> 
> Yes.
> 
> https://lore.kernel.org/all/ZC4qF90l77m3X1Ir@google.com
> 
> > I think this causes a lot of confusion during migration, where the VMM on the
> > source believes that a non-zero value from KVM_GET_MSRS is valid and the VMM
> > on the target finds that it is not.
> 
> Yes, but seeing a non-zero value is a KVM bug that should be fixed.
> 
How about adding an entry in vmx_get_msr() for
MSR_IA32_PERF_CAPABILITIES that checks pmu_version? This would pair
with the handling of MSR_IA32_PERF_CAPABILITIES in vmx_set_msr().

Doing so allows KVM_GET_MSRS to return 0 for the MSR instead of
the initial permitted value.

The benefit is that it does not force the VMM to set the value
explicitly. In fact, several platform MSRs have initial values
that a VMM may rely on instead of setting them explicitly;
MSR_IA32_PERF_CAPABILITIES is only one of them.
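
A rough sketch of the read-side gating proposed here, written as a toy model rather than an actual vmx_get_msr() change. The rd_* names and the struct are hypothetical, and the write-side check is an assumed simplification of the existing handling in vmx_set_msr().

```c
#include <assert.h>
#include <stdint.h>

/* Toy per-vCPU state; not the kernel's struct kvm_vcpu. */
struct rd_vcpu {
	unsigned int pmu_version;   /* 0 models a guest without a vPMU */
	uint64_t perf_capabilities; /* cached MSR value, possibly pre-seeded */
};

/* Write path (assumed): a non-zero value is rejected without a vPMU. */
static int rd_set_msr(struct rd_vcpu *v, uint64_t data)
{
	if (data && !v->pmu_version)
		return 1;
	v->perf_capabilities = data;
	return 0;
}

/* Proposed read path: report 0 without a vPMU, mirroring the write check. */
static uint64_t rd_get_msr(const struct rd_vcpu *v)
{
	if (!v->pmu_version)
		return 0;
	return v->perf_capabilities;
}
```

With this pairing, whatever rd_get_msr() returns is always accepted by rd_set_msr() on a vCPU with the same PMU configuration, without the cached value itself being modified.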


> > If we follow the suggestion by removing the initial value at vCPU
> > creation time, then I think it breaks the existing VMM code, since that
> > requires VMM to explicitly set the MSR, which I am not sure we do today.
> 
> Yeah, I'm hoping we can squeak by without breaking existing setups.
> 
> I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
> 
> Frankly, if our VMM doesn't do the same, then it's wildly busted.  Relying on
> KVM to define the vCPU is irresponsible, to put it nicely.
> 
> > The code below is different. The key difference is that it
> > preserves a valid value, whereas this case is about not preserving
> > an invalid value.
> 
> But it's a completely different fix.  I referenced that commit to call out that
> the need for the commit and changelog suggests that someone (*cough* us) is relying
> on KVM to initialize MSR_PLATFORM_INFO, and has been doing so for a very long time.
> That doesn't mean it's the correct KVM behavior, just that it's much riskier to
> change.
  
Paolo Bonzini Jan. 29, 2024, 2:39 p.m. UTC | #3
On 1/24/24 23:51, Sean Christopherson wrote:
>> If we follow the suggestion by removing the initial value at vCPU
>> creation time, then I think it breaks the existing VMM code, since that
>> requires VMM to explicitly set the MSR, which I am not sure we do today.
> Yeah, I'm hoping we can squeak by without breaking existing setups.
> 
> I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
> 
> Frankly, if our VMM doesn't do the same, then it's wildly busted.  Relying on
> KVM to define the vCPU is irresponsible, to put it nicely.

Yes, I tend to agree.

What QEMU does goes from the squeaky clean to the very debatable 
depending on the parameters you give it.

With "-cpu Haswell" and similar, it will provide values for all CPUID 
and MSR bits that match as much as possible values from an actual CPU 
model.  It will complain if there are some values that do not match[1].

With "-cpu host", it will copy values from KVM_GET_SUPPORTED_CPUID and 
from the feature MSRs, but only for features that it knows about.

With "-cpu host,migratable=no", it will copy values from 
KVM_GET_SUPPORTED_CPUID and from the feature MSRs, but only for *feature 
words* (CPUID registers, or MSRs) that it knows about.  This is where it 
becomes debatable, because a CPUID bit could be added without QEMU 
knowing the corresponding MSR.  In this case, the user probably expects 
the MSR to have a nonzero value.  On one hand I agree that it would be 
irresponsible, on the other hand that's the point of "-cpu 
host,migratable=no".

If you want to proceed with the change, I don't have any problem with 
considering it a QEMU bug that it doesn't copy over to the guest any 
unknown leaves or MSRs.

Paolo

[1] Unfortunately it's not fatal because there are way way too many 
models, and also because until recently TCG lacked AVX---and therefore 
could completely emulate only some very old CPU models.  But with "-cpu 
Haswell,enforce" then everything's clean.
  
Paolo Bonzini Jan. 29, 2024, 2:40 p.m. UTC | #4
On 1/26/24 20:30, Mingwei Zhang wrote:
>> Hrm, I don't hate it as a stopgap.  But if we are the only people that are affected,
>> because again I'm pretty sure QEMU is fine, I would rather we just fix things in
>> our VMM and/or internal kernel.
> It is not just QEMU. crosvm is another open source VMM that suffers
> from this one.

Can you explain the symptoms in both Google's internal VMM and crosvm?

Thanks,

Paolo
  

Patch

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..416bee03c42a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -369,6 +369,9 @@  static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
 	vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
 
+	/* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
+	if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
+		vcpu->arch.perf_capabilities = 0;
 	kvm_pmu_refresh(vcpu);
 	vcpu->arch.cr4_guest_rsvd_bits =
 	    __cr4_reserved_bits(guest_cpuid_has, vcpu);