KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps

Message ID 20230506030435.80262-1-chao.gao@intel.com
State New
Headers
Series KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps |

Commit Message

Chao Gao May 6, 2023, 3:04 a.m. UTC
  to avoid computing the supported value at runtime every time.

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---

A new call site of kvm_get_arch_capabilities() is added by [1]. It should be
replaced with the cached value in kvm_caps.

[1] https://lore.kernel.org/all/20230504181827.130532-1-mizhang@google.com/


 arch/x86/kvm/x86.c | 5 +++--
 arch/x86/kvm/x86.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)


base-commit: b3c98052d46948a8d65d2778c7f306ff38366aac
  

Comments

Xiaoyao Li May 18, 2023, 9:32 a.m. UTC | #1
On 5/6/2023 11:04 AM, Chao Gao wrote:
> to avoid computing the supported value at runtime every time.
> 
> No functional change intended.

the value of kvm_get_arch_capabilities() can be changed due to

	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;

and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush 
module param.

We need a detailed analysis that in no real case can 
ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.

> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> 
> A new call site of kvm_get_arch_capabilities() is added by [1]. It should be
> replaced with the cached value in kvm_caps.
> 
> [1] https://lore.kernel.org/all/20230504181827.130532-1-mizhang@google.com/
> 
> 
>   arch/x86/kvm/x86.c | 5 +++--
>   arch/x86/kvm/x86.h | 1 +
>   2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 523c39a03c00..94aa70ec169c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1670,7 +1670,7 @@ static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
>   {
>   	switch (msr->index) {
>   	case MSR_IA32_ARCH_CAPABILITIES:
> -		msr->data = kvm_get_arch_capabilities();
> +		msr->data = kvm_caps.supported_arch_cap;
>   		break;
>   	case MSR_IA32_PERF_CAPABILITIES:
>   		msr->data = kvm_caps.supported_perf_cap;
> @@ -9523,6 +9523,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>   		kvm_caps.max_guest_tsc_khz = max;
>   	}
>   	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
> +	kvm_caps.supported_arch_cap = kvm_get_arch_capabilities();
>   	kvm_init_msr_lists();
>   	return 0;
>   
> @@ -11879,7 +11880,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>   	if (r)
>   		goto free_guest_fpu;
>   
> -	vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
> +	vcpu->arch.arch_capabilities = kvm_caps.supported_arch_cap;
>   	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
>   	kvm_xen_init_vcpu(vcpu);
>   	kvm_vcpu_mtrr_init(vcpu);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c544602d07a3..d3e524bcc169 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -29,6 +29,7 @@ struct kvm_caps {
>   	u64 supported_xcr0;
>   	u64 supported_xss;
>   	u64 supported_perf_cap;
> +	u64 supported_arch_cap;
>   };
>   
>   void kvm_spurious_fault(void);
> 
> base-commit: b3c98052d46948a8d65d2778c7f306ff38366aac
  
Sean Christopherson May 18, 2023, 5:33 p.m. UTC | #2
+Jim

On Thu, May 18, 2023, Xiaoyao Li wrote:
> On 5/6/2023 11:04 AM, Chao Gao wrote:
> > to avoid computing the supported value at runtime every time.
> > 
> > No functional change intended.
> 
> the value of kvm_get_arch_capabilities() can be changed due to
> 
> 	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
> 		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
> 
> and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
> param.

Nice catch!

> We need a detailed analysis that in no real case can
> ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.

No, the fact that it _can_ be modified by a writable module param is enough to
make this patch buggy.

I do like snapshotting and then updating the value, even though there's likely no
meaningful performance benefit, as that would provide a place to document that
the "supported" value is dynamic.  Though the fact that it's dynamic is arguably a bug
in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
values for ARCH_CAPABILITIES.  But fixing that is probably a fool's errand.  So
I vote to snapshot the value and toggle the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit
when l1tf_vmx_mitigation is modified.

On a somewhat related topic, what in the absolute #$#$ is going on with FB_CLEAR_DIS!?!?
I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...

  1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
     buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
     entry+exit.

  2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
     is a bug.  AIUI, the vulnerability applies to _any_ MMIO accesses.  Assigning
     a device is necessary to let the device DMA into the guest, but it's not
     necessary to let the guest access MMIO addresses, that's done purely via
     memslots.

  3. Irrespective of whether or not there is a performance benefit, toggling the
     MSR on every entry+exit is completely unnecessary if KVM won't do VERW before
     VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then the
     toggling can be done in vmx_prepare_switch_to_{guest,host}().  This probably
     isn't worth pursuing though, as #4 below is more likely, especially since
     X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.

  4. If the host will will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then
     KVM just needs to context switch the MSR between guests since the value that's
     loaded while running in the host is irrelevant.  E.g. use a percpu cache to
     track the current value.

  5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
     i.e. the host's desired value is effectively static post-boot, and barring
     a buggy configuration (running KVM as a guest), the boot CPU's value will be
     the same as every other CPU.

  6. Performance aside, KVM should not be speculating (ha!) on what the guest
     will and will not do, and should instead honor whatever behavior is presented
     to the guest.  If the guest CPU model indicates that VERW flushes buffers,
     then KVM damn well needs to let VERW flush buffers.

  7. Why on earth did Intel provide a knob that affects both the host and guest,
     since AFAICT the intent of the MSR is purely to suppress FB clearing for an
     unsuspecting (or misconfigured?) guest!?!?!

FWIW, this trainwreck is another reason why I'm not going to look at the proposed
"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
do so.  I have zero confidence that a paravirt interface defined by hardware
vendors to fiddle with mitigations will be sane, flexible, and extensible.

Anyways, can someone from Intel do some basic performance analysis to justify
doing RDMSR + WRMSRx2 on every transition?  Unless someone provides numbers that
show a clear, meaningful benefit to the aggressive toggling, I'm inclined to have
KVM do #4, e.g. end up with something like:

	/* L1D Flush includes CPU buffer clear to mitigate MDS */
	if (static_branch_unlikely(&vmx_l1d_should_flush)) {
		vmx_l1d_flush(vcpu);
	} else if (static_branch_unlikely(&mds_user_clear) ||
		   static_branch_unlikely(&mmio_stale_data_clear)) {
		mds_clear_cpu_buffers();
	} else if (static_branch_unlikely(&kvm_toggle_fb_clear) {
		bool enable_fb_clear = !!(vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR);

		if (this_cpu_read(kvm_fb_clear_enabled) != enable_fb_clear) {
			u64 mcu_opt_ctrl = host_mcu_opt_ctrl;

			if (enable_fb_clear)
				mcu_opt_ctrl &= ~FB_CLEAR_DIS;
			else
				mcu_opt_ctrl |= FB_CLEAR_DIS;
			native_wrmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
			this_cpu_write(kvm_fb_clear_enabled, enable_fb_clear);
		}
	}
  
Chao Gao May 19, 2023, 8:40 a.m. UTC | #3
+Pawan, could you share your thoughts on questions about FB_CLEAR?

On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
>+Jim
>
>On Thu, May 18, 2023, Xiaoyao Li wrote:
>> On 5/6/2023 11:04 AM, Chao Gao wrote:
>> > to avoid computing the supported value at runtime every time.
>> > 
>> > No functional change intended.
>> 
>> the value of kvm_get_arch_capabilities() can be changed due to
>> 
>> 	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
>> 		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
>> 
>> and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
>> param.

Thanks for pointing this out. I noticed l1tf_vmx_mitigation and analyzed if
it could be changed at runtime. Obviously I did a wrong analysis.

>
>Nice catch!
>
>> We need a detailed analysis that in no real case can
>> ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.
>
>No, the fact that it _can_ be modified by a writable module param is enough to
>make this patch buggy.
>
>I do like snapshotting and then updating the value, even though there's likely no
>meaningful performance benefit, as that would provide a place to document that
>the "supported" value is dynamic.  Though the fact that it's dynamic is arguably a bug
>in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
>values for ARCH_CAPABILITIES.  But fixing that is probably a fool's errand.  So

I am not sure if fixing it is fool. There would be some other problem:

KVM enables L1DLFUSH and creates a guest. Then ARCH_CAP_SKIP_VMENTRY_L1DFLUSH is
exposed to the guest. If L1DFLUSH is disabled at runtime in KVM, the guest
doesn't know this change and won't do L1DFLUSH when entering L2. Then L2 may use
L1TF to leak some secrets of L1.

>I vote to snapshot the value and toggle the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit
>when l1tf_vmx_mitigation is modified.

Sure. Will do.

>
>On a somewhat related topic, what in the absolute #$#$ is going on with FB_CLEAR_DIS!?!?
>I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
>
>  1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
>     buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
>     entry+exit.
>
>  2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
>     is a bug.  AIUI, the vulnerability applies to _any_ MMIO accesses.  Assigning
>     a device is necessary to let the device DMA into the guest, but it's not
>     necessary to let the guest access MMIO addresses, that's done purely via
>     memslots.
>
>  3. Irrespective of whether or not there is a performance benefit, toggling the
>     MSR on every entry+exit is completely unnecessary if KVM won't do VERW before
>     VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then the
>     toggling can be done in vmx_prepare_switch_to_{guest,host}().  This probably
>     isn't worth pursuing though, as #4 below is more likely, especially since
>     X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.
>
>  4. If the host will will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then
>     KVM just needs to context switch the MSR between guests since the value that's
>     loaded while running in the host is irrelevant.  E.g. use a percpu cache to
>     track the current value.

Agreed.

Looks VERW can be used in CPL3, should we restore the MSR on returning
to userspace i.e., leverage uret mechanism?

>
>  5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
>     i.e. the host's desired value is effectively static post-boot, and barring
>     a buggy configuration (running KVM as a guest), the boot CPU's value will be
>     the same as every other CPU.
>
>  6. Performance aside, KVM should not be speculating (ha!) on what the guest
>     will and will not do, and should instead honor whatever behavior is presented
>     to the guest.  If the guest CPU model indicates that VERW flushes buffers,
>     then KVM damn well needs to let VERW flush buffers.
>
>  7. Why on earth did Intel provide a knob that affects both the host and guest,
>     since AFAICT the intent of the MSR is purely to suppress FB clearing for an
>     unsuspecting (or misconfigured?) guest!?!?!

I doubt it is purely for guests. Any chance userspace application may use VERW?

And I don't think the original patch is for misconfigured guest. IIUC, it is
about migrating a guest from a vulnerable host to an invulnerable host.

>
>FWIW, this trainwreck is another reason why I'm not going to look at the proposed
>"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
>do so. I have zero confidence that a paravirt interface defined by hardware
>vendors to fiddle with mitigations will be sane, flexible, and extensible.
>
>Anyways, can someone from Intel do some basic performance analysis to justify
>doing RDMSR + WRMSRx2 on every transition?  Unless someone provides numbers that

Pawan, could you help to answer this question?

>show a clear, meaningful benefit to the aggressive toggling, I'm inclined to have
>KVM do #4, e.g. end up with something like:
>
>	/* L1D Flush includes CPU buffer clear to mitigate MDS */
>	if (static_branch_unlikely(&vmx_l1d_should_flush)) {
>		vmx_l1d_flush(vcpu);
>	} else if (static_branch_unlikely(&mds_user_clear) ||
>		   static_branch_unlikely(&mmio_stale_data_clear)) {
>		mds_clear_cpu_buffers();
>	} else if (static_branch_unlikely(&kvm_toggle_fb_clear) {
>		bool enable_fb_clear = !!(vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR);
>
>		if (this_cpu_read(kvm_fb_clear_enabled) != enable_fb_clear) {
>			u64 mcu_opt_ctrl = host_mcu_opt_ctrl;
>
>			if (enable_fb_clear)
>				mcu_opt_ctrl &= ~FB_CLEAR_DIS;
>			else
>				mcu_opt_ctrl |= FB_CLEAR_DIS;
>			native_wrmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
>			this_cpu_write(kvm_fb_clear_enabled, enable_fb_clear);
>		}
>	}
  
Sean Christopherson May 19, 2023, 3:25 p.m. UTC | #4
On Fri, May 19, 2023, Chao Gao wrote:
> +Pawan, could you share your thoughts on questions about FB_CLEAR?
> 
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> >I do like snapshotting and then updating the value, even though there's likely no
> >meaningful performance benefit, as that would provide a place to document that
> >the "supported" value is dynamic.  Though the fact that it's dynamic is arguably a bug
> >in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
> >values for ARCH_CAPABILITIES.  But fixing that is probably a fool's errand.  So
> 
> I am not sure if fixing it is fool. There would be some other problem:

Heh, "fool's errand" is an idiom that means doing something has almost no chance
of succeeding, not that doing something is foolish.  I 100% agree that there's
value in presenting a consistent model to the guest, but there are conflicting
requirements in play.  To present a consistent model, KVM essentially needs to
disallow changing the module param after VMs/vCPUs have been created, but that
would prevent userspace from toggling the param while VMs are running, e.g. in
response to a new vulnerability.

The only feasible idea I can think of is to disallow *disabling* the mitigation
while VMs/vCPUs are active.  But then that prevents turning the L1D flush mitigation
back off if some other mitigation is deployed, e.g. via livepatch, policy update,
etc.

That's why I said trying to fix the knob is probably a fool's errand.  AFAICT,
there's no straightforward solution that makes everybody happy.  :-/
  
Pawan Gupta May 20, 2023, 1:02 a.m. UTC | #5
On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> +Jim
> 
> On Thu, May 18, 2023, Xiaoyao Li wrote:
> > On 5/6/2023 11:04 AM, Chao Gao wrote:
> > > to avoid computing the supported value at runtime every time.
> > > 
> > > No functional change intended.
> > 
> > the value of kvm_get_arch_capabilities() can be changed due to
> > 
> > 	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
> > 		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
> > 
> > and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
> > param.
> 
> Nice catch!
> 
> > We need a detailed analysis that in no real case can
> > ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.
> 
> No, the fact that it _can_ be modified by a writable module param is enough to
> make this patch buggy.
> 
> I do like snapshotting and then updating the value, even though there's likely no
> meaningful performance benefit, as that would provide a place to document that
> the "supported" value is dynamic.  Though the fact that it's dynamic is arguably a bug
> in its own right, e.g. if userspace isn't careful, a VM can have vCPUs with different
> values for ARCH_CAPABILITIES.  But fixing that is probably a fool's errand.  So
> I vote to snapshot the value and toggle the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit
> when l1tf_vmx_mitigation is modified.
> 
> On a somewhat related topic, what in the absolute #$#$ is going on with FB_CLEAR_DIS!?!?
> I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> 
>   1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
>      buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
>      entry+exit.

Unnecessary VERWs in guest will have much higher impact than due to MSR
read/write at vmentry/exit. On an Icelake system it is pointless for a
guest to incur VERW penalty when the system is not affected by MDS/TAA and
guests don't need mitigation for MMIO Stale Data. MSR writes are only
done when the guest is likely to execute unnecessary VERWs(e.g. when the
guest thinks its running on an older gen CPU).

>   2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
>      is a bug.  AIUI, the vulnerability applies to _any_ MMIO accesses.

Vulnerability applies to MMIO access to devices that don't respect "byte
enable" (which indicates valid bytes in a transaction), and don't error
on incorrect read or write size.

>      Assigning a device is necessary to let the device DMA into the
>      guest, but it's not necessary to let the guest access MMIO
>      addresses, that's done purely via memslots.

I will get back on this. The guest would typically need access to an
area that doesn't fail an incorrectly sized MMIO.

>   3. Irrespective of whether or not there is a performance benefit, toggling the
>      MSR on every entry+exit is completely unnecessary if KVM won't do VERW before
>      VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then the
>      toggling can be done in vmx_prepare_switch_to_{guest,host}().  This probably
>      isn't worth pursuing though, as #4 below is more likely, especially since
>      X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.
>
>   4. If the host will will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then

Is it likely that KVM will not do VERW when affected by MMIO Stale Data?
If you mean on a hardware that is not vulnerable to MDS and MMIO Stale
Data, in that case MSR writes are unnecessary and will be skipped
because FB_CLEAR_DIS is not available on unaffected hardware.

>      KVM just needs to context switch the MSR between guests since the value that's
>      loaded while running in the host is irrelevant.  E.g. use a percpu cache to

I will be happy to avoid the MSR read/write, but its worth considering
that this MSR can receive more bits that host may want to toggle, then
percpu cache implementation would likely change.

>      track the current value.
> 
>   5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
>      i.e. the host's desired value is effectively static post-boot, and barring
>      a buggy configuration (running KVM as a guest), the boot CPU's value will be
>      the same as every other CPU.

Would the MSR value be same on every CPU, if only some guests have
enumerated FB_CLEAR and others haven't? MSR writes (to disable FB_CLEAR)
are not done when a guest enumerates FB_CLEAR. Enumeration of FB_CLEAR
in guest will depend on its configuration.

>   6. Performance aside, KVM should not be speculating (ha!) on what the guest
>      will and will not do, and should instead honor whatever behavior is presented
>      to the guest.  If the guest CPU model indicates that VERW flushes buffers,
>      then KVM damn well needs to let VERW flush buffers.

The current implementation allows guests to have VERW flush buffers when
they enumerate FB_CLEAR. It only restricts the flush behavior when the
guest is trying to mitigate against a vulnerability(like MDS) on a
hardware that is not affected. I guess its common for guests to be
running with older gen configuration on a newer hardware.

>   7. Why on earth did Intel provide a knob that affects both the host and guest,
>      since AFAICT the intent of the MSR is purely to suppress FB clearing for an
>      unsuspecting (or misconfigured?) guest!?!?!

Thats true.
  
Dave Hansen May 22, 2023, 2:23 p.m. UTC | #6
On 5/18/23 10:33, Sean Christopherson wrote:
> 
>   2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
>      is a bug.  AIUI, the vulnerability applies to _any_ MMIO accesses.  Assigning
>      a device is necessary to let the device DMA into the guest, but it's not
>      necessary to let the guest access MMIO addresses, that's done purely via
>      memslots.

Just to make sure we're all on the same page: KVM needs mitigations when
real, hardware MMIO is exposed to the guest.  None of this has anything
to do with virtio or what guests _normally_ see as devices or MMIO.  Right?

But direct device assignment does that "real hardware MMIO" for sure
because it's mapping parts of the PCI address space (which is all MMIO)
into the guest.  That's what the kvm_arch_has_assigned_device() check
was going for.

But I think you're also saying that, in the end, memory gets exposed to
the guest by KVM userspace setting up a memslot.  KVM userspace _could_
have mapped a piece of MMIO and could just pass that down to a guest
without kvm_arch_has_assigned_device() being involved.  That makes the
kvm_arch_has_assigned_device() useless.

In other words, all guests with kvm_arch_has_assigned_device() need
mitigation.  But there are potentially situations where the guest can
see real hardware MMIO and yet also be !kvm_arch_has_assigned_device().
  
Sean Christopherson May 22, 2023, 4:37 p.m. UTC | #7
On Mon, May 22, 2023, Dave Hansen wrote:
> On 5/18/23 10:33, Sean Christopherson wrote:
> > 
> >   2. I'm pretty sure conditioning mmio_stale_data_clear on kvm_arch_has_assigned_device()
> >      is a bug.  AIUI, the vulnerability applies to _any_ MMIO accesses.  Assigning
> >      a device is necessary to let the device DMA into the guest, but it's not
> >      necessary to let the guest access MMIO addresses, that's done purely via
> >      memslots.
> 
> Just to make sure we're all on the same page: KVM needs mitigations when
> real, hardware MMIO is exposed to the guest.  None of this has anything
> to do with virtio or what guests _normally_ see as devices or MMIO.  Right?

Yes.  I try to always call MMIO that is handled by a synthetic/virtual/emulated
device "emulated MMIO", specifically to differentiate between the two cases.

> But direct device assignment does that "real hardware MMIO" for sure
> because it's mapping parts of the PCI address space (which is all MMIO)
> into the guest.  That's what the kvm_arch_has_assigned_device() check
> was going for.
> 
> But I think you're also saying that, in the end, memory gets exposed to
> the guest by KVM userspace setting up a memslot.  KVM userspace _could_
> have mapped a piece of MMIO and could just pass that down to a guest
> without kvm_arch_has_assigned_device() being involved.  That makes the
> kvm_arch_has_assigned_device() useless.

Yep.

> In other words, all guests with kvm_arch_has_assigned_device() need
> mitigation.

Yes, assuming the guest wants to actually use the device :-)

> But there are potentially situations where the guest can see real hardware MMIO
> and yet also be !kvm_arch_has_assigned_device().

Yes.  There may or may not be _legitimate_ scenarios for exposing host MMIO to the
guest without an assigned device, but as far as the mitigation is concerned, being
legitimate or not doesn't matter, all that matters is that userspace can expose
host MMIO to the guest irrespective of VFIO.

FWIW, I think this would be a minimal fix without having to apply the mitigation
blindly.  My only concern is that there might be gaps in the kvm_is_mmio_pfn()
heuristic, but if that's the case then KVM likely has other issues, e.g. would
potentially map MMIO with the wrong memtype.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2865c3cb3501..ac3c535ae3b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1274,6 +1274,7 @@ struct kvm_arch {
 
        bool apic_access_memslot_enabled;
        bool apic_access_memslot_inhibited;
+       bool vm_has_passthrough_mmio;
 
        /* Protects apicv_inhibit_reasons */
        struct rw_semaphore apicv_update_lock;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cf2c6426a6fc..83d235488e56 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -189,6 +189,10 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
        if (level > PG_LEVEL_4K)
                spte |= PT_PAGE_SIZE_MASK;
 
+       if (static_branch_unlikely(&mmio_stale_data_clear) &&
+           !vcpu->kvm->arch.vm_has_passthrough_mmio && kvm_is_mmio_pfn(pfn))
+               vcpu->kvm->arch.vm_has_passthrough_mmio = true;
+
        if (shadow_memtype_mask)
                spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
                                                         kvm_is_mmio_pfn(pfn));
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..9c66ba35af92 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7159,7 +7159,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
        else if (static_branch_unlikely(&mds_user_clear))
                mds_clear_cpu_buffers();
        else if (static_branch_unlikely(&mmio_stale_data_clear) &&
-                kvm_arch_has_assigned_device(vcpu->kvm))
+                to_kvm_vmx(vcpu->kvm)->vm_has_passthrough_mmio)
                mds_clear_cpu_buffers();
 
        vmx_disable_fb_clear(vmx);
  
Sean Christopherson May 22, 2023, 5:43 p.m. UTC | #8
On Fri, May 19, 2023, Pawan Gupta wrote:
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> > 
> >   1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> >      buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> >      entry+exit.
> 
> Unnecessary VERWs in guest will have much higher impact than due to MSR
> read/write at vmentry/exit.

Can you provide numbers for something closeish to a real world workload?

> On an Icelake system it is pointless for a guest to incur VERW penalty when
> the system is not affected by MDS/TAA and guests don't need mitigation for
> MMIO Stale Data. MSR writes are only done when the guest is likely to execute
> unnecessary VERWs(e.g. when the guest thinks its running on an older gen
> CPU).
>
> >      KVM just needs to context switch the MSR between guests since the value that's
> >      loaded while running in the host is irrelevant.  E.g. use a percpu cache to
> 
> I will be happy to avoid the MSR read/write, but its worth considering
> that this MSR can receive more bits that host may want to toggle, then
> percpu cache implementation would likely change.

Change in and of itself isn't problematic, so long as whatever code we write won't
fall over if/when new bits are added, i.e. doesn't clobber unknown bits.

> >   5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> >      i.e. the host's desired value is effectively static post-boot, and barring
> >      a buggy configuration (running KVM as a guest), the boot CPU's value will be
> >      the same as every other CPU.
> 
> Would the MSR value be same on every CPU, if only some guests have
> enumerated FB_CLEAR and others haven't?

Ignore the guest, I'm talking purely about the host.  Specifically, there's no
reason to do a RDMSR to get the host value on every VM-Enter since the host's
value is effectively static post-boot.

> MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> FB_CLEAR. Enumeration of FB_CLEAR in guest will depend on its configuration.
> 
> >   6. Performance aside, KVM should not be speculating (ha!) on what the guest
> >      will and will not do, and should instead honor whatever behavior is presented
> >      to the guest.  If the guest CPU model indicates that VERW flushes buffers,
> >      then KVM damn well needs to let VERW flush buffers.
> 
> The current implementation allows guests to have VERW flush buffers when
> they enumerate FB_CLEAR. It only restricts the flush behavior when the
> guest is trying to mitigate against a vulnerability(like MDS) on a
> hardware that is not affected. I guess its common for guests to be
> running with older gen configuration on a newer hardware.

Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the guest
the guest will do things a certain way and should instead honor the "architectural"
definition, in quotes because I realize there probably is no architectural
definition for any of this.

It might be that the code does (unintentionally?) honor the "architecture", i.e.
this code might actually be accurrate with respect to when the guest can expect
VERW to flush buffers.  But the comment is so, so wrong.

	/*
	 * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
	 * at VMEntry. Skip the MSR read/write when a guest has no use case to
	 * execute VERW.
	 */
	if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
	   ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
	    (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
	    (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
	    (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
	    (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
		vmx->disable_fb_clear = false;
  
Xiaoyao Li May 22, 2023, 7:31 p.m. UTC | #9
On 5/23/2023 1:43 AM, Sean Christopherson wrote:
>>>    6. Performance aside, KVM should not be speculating (ha!) on what the guest
>>>       will and will not do, and should instead honor whatever behavior is presented
>>>       to the guest.  If the guest CPU model indicates that VERW flushes buffers,
>>>       then KVM damn well needs to let VERW flush buffers.
>> The current implementation allows guests to have VERW flush buffers when
>> they enumerate FB_CLEAR. It only restricts the flush behavior when the
>> guest is trying to mitigate against a vulnerability(like MDS) on a
>> hardware that is not affected. I guess its common for guests to be
>> running with older gen configuration on a newer hardware.
> Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the guest
> the guest will do things a certain way and should instead honor the "architectural"
> definition, in quotes because I realize there probably is no architectural
> definition for any of this.
> 
> It might be that the code does (unintentionally?) honor the "architecture", i.e.
> this code might actually be accurrate with respect to when the guest can expect
> VERW to flush buffers.  But the comment is so, so wrong.

The comment is wrong and the code is wrong in some case as well.

If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO, 
ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are 
exposed to VM, the VM is type of "affected by MDS".

And accroding to the page 
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html

if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it 
implicitly enumerates FB_CLEAR as part of their MD_CLEAR support.

However, the code will leave vmx->disable_fb_clear as 1 if hardware 
supports it, and VERW intruction doesn't clear FB in the VM, which 
conflicts "architectural" definition.

> 	/*
> 	 * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
> 	 * at VMEntry. Skip the MSR read/write when a guest has no use case to
> 	 * execute VERW.
> 	 */
> 	if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
> 	   ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
> 		vmx->disable_fb_clear = false;
  
Pawan Gupta May 22, 2023, 8:54 p.m. UTC | #10
On Mon, May 22, 2023 at 10:43:49AM -0700, Sean Christopherson wrote:
> On Fri, May 19, 2023, Pawan Gupta wrote:
> > On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> > > 
> > >   1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > >      buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > >      entry+exit.
> > 
> > Unnecessary VERWs in guest will have much higher impact than due to MSR
> > read/write at vmentry/exit.
> 
> Can you provide numbers for something closeish to a real world workload?

I am collecting the numbers, will update here soon.

> > On an Icelake system it is pointless for a guest to incur VERW penalty when
> > the system is not affected by MDS/TAA and guests don't need mitigation for
> > MMIO Stale Data. MSR writes are only done when the guest is likely to execute
> > unnecessary VERWs(e.g. when the guest thinks its running on an older gen
> > CPU).
> >
> > >      KVM just needs to context switch the MSR between guests since the value that's
> > >      loaded while running in the host is irrelevant.  E.g. use a percpu cache to
> > 
> > I will be happy to avoid the MSR read/write, but its worth considering
> > that this MSR can receive more bits that host may want to toggle, then
> > percpu cache implementation would likely change.
> 
> Change in and of itself isn't problematic, so long as whatever code we write won't
> fall over if/when new bits are added, i.e. doesn't clobber unknown bits.

Ok.

> > >   5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up,
> > >      i.e. the host's desired value is effectively static post-boot, and barring
> > >      a buggy configuration (running KVM as a guest), the boot CPU's value will be
> > >      the same as every other CPU.
> > 
> > Would the MSR value be same on every CPU, if only some guests have
> > enumerated FB_CLEAR and others haven't?
> 
> Ignore the guest, I'm talking purely about the host.  Specifically, there's no
> reason to do a RDMSR to get the host value on every VM-Enter since the host's
> value is effectively static post-boot.

That right(ignoring late microcode load adding stuff to the MSR or
msr-tools fiddling).

> > MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> > FB_CLEAR. Enumeration of FB_CLEAR in guest will depend on its configuration.
> > 
> > >   6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > >      will and will not do, and should instead honor whatever behavior is presented
> > >      to the guest.  If the guest CPU model indicates that VERW flushes buffers,
> > >      then KVM damn well needs to let VERW flush buffers.
> > 
> > The current implementation allows guests to have VERW flush buffers when
> > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > guest is trying to mitigate against a vulnerability(like MDS) on a
> > hardware that is not affected. I guess its common for guests to be
> > running with older gen configuration on a newer hardware.
> 
> Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the guest
> the guest will do things a certain way and should instead honor the "architectural"
> definition, in quotes because I realize there probably is no architectural
> definition for any of this.

Before MMIO Stale Data, processors that were not affected by MDS/TAA did
not clear CPU buffers, even if they enumerated MD_CLEAR. On such
processors guests that deployed VERW(thinking they are vulnerable to
MDS) did not clear the CPU buffers. After MMIO Stale Data was discovered
FB_CLEAR_DIS was introduced to restore this behavior.

> It might be that the code does (unintentionally?) honor the "architecture", i.e.
> this code might actually be accurrate with respect to when the guest can expect
> VERW to flush buffers.  But the comment is so, so wrong.

Agree, the comment needs to explain this well.

> 	/*
> 	 * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
> 	 * at VMEntry. Skip the MSR read/write when a guest has no use case to
> 	 * execute VERW.
> 	 */
> 	if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
> 	   ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
> 		vmx->disable_fb_clear = false;
  
Pawan Gupta May 22, 2023, 9:23 p.m. UTC | #11
On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
> On 5/23/2023 1:43 AM, Sean Christopherson wrote:
> > > >    6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > > >       will and will not do, and should instead honor whatever behavior is presented
> > > >       to the guest.  If the guest CPU model indicates that VERW flushes buffers,
> > > >       then KVM damn well needs to let VERW flush buffers.
> > > The current implementation allows guests to have VERW flush buffers when
> > > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > > guest is trying to mitigate against a vulnerability(like MDS) on a
> > > hardware that is not affected. I guess its common for guests to be
> > > running with older gen configuration on a newer hardware.
> > Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the guest
> > the guest will do things a certain way and should instead honor the "architectural"
> > definition, in quotes because I realize there probably is no architectural
> > definition for any of this.
> > 
> > It might be that the code does (unintentionally?) honor the "architecture", i.e.
> > this code might actually be accurrate with respect to when the guest can expect
> > VERW to flush buffers.  But the comment is so, so wrong.
> 
> The comment is wrong and the code is wrong in some case as well.
> 
> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
> VM, the VM is type of "affected by MDS".
> 
> And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
> 
> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
> enumerates FB_CLEAR as part of their MD_CLEAR support.

This is the excerpt from the link that you mentioned:

  "For processors that are affected by MDS and support L1D_FLUSH
  operations and MD_CLEAR operations, the VERW instruction flushes fill
  buffers."

You are missing an important information here "For the processors
_affected_ by MDS". On such processors ...

> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
> it, and VERW intruction doesn't clear FB in the VM, which conflicts
> "architectural" definition.

... Fill buffer clear is not enabled at all:

  vmx_setup_fb_clear_ctrl()
  {
  	u64 msr;
  
  	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
  	    !boot_cpu_has_bug(X86_BUG_MDS) &&
  	    !boot_cpu_has_bug(X86_BUG_TAA)) {
  		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
  		if (msr & ARCH_CAP_FB_CLEAR_CTRL)
  			vmx_fb_clear_ctrl_available = true;
  	}
  }
  
Xiaoyao Li May 23, 2023, 1 a.m. UTC | #12
On 5/23/2023 5:23 AM, Pawan Gupta wrote:
> On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
>> On 5/23/2023 1:43 AM, Sean Christopherson wrote:
>>>>>     6. Performance aside, KVM should not be speculating (ha!) on what the guest
>>>>>        will and will not do, and should instead honor whatever behavior is presented
>>>>>        to the guest.  If the guest CPU model indicates that VERW flushes buffers,
>>>>>        then KVM damn well needs to let VERW flush buffers.
>>>> The current implementation allows guests to have VERW flush buffers when
>>>> they enumerate FB_CLEAR. It only restricts the flush behavior when the
>>>> guest is trying to mitigate against a vulnerability(like MDS) on a
>>>> hardware that is not affected. I guess its common for guests to be
>>>> running with older gen configuration on a newer hardware.
>>> Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the guest
>>> the guest will do things a certain way and should instead honor the "architectural"
>>> definition, in quotes because I realize there probably is no architectural
>>> definition for any of this.
>>>
>>> It might be that the code does (unintentionally?) honor the "architecture", i.e.
>>> this code might actually be accurrate with respect to when the guest can expect
>>> VERW to flush buffers.  But the comment is so, so wrong.
>>
>> The comment is wrong and the code is wrong in some case as well.
>>
>> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
>> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
>> VM, the VM is type of "affected by MDS".
>>
>> And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
>>
>> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
>> enumerates FB_CLEAR as part of their MD_CLEAR support.
> 
> This is the excerpt from the link that you mentioned:
> 
>    "For processors that are affected by MDS and support L1D_FLUSH
>    operations and MD_CLEAR operations, the VERW instruction flushes fill
>    buffers."
> 
> You are missing an important information here "For the processors
> _affected_ by MDS". On such processors ...
> 
>> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
>> it, and VERW intruction doesn't clear FB in the VM, which conflicts
>> "architectural" definition.
> 
> ... Fill buffer clear is not enabled at all:
> 
>    vmx_setup_fb_clear_ctrl()
>    {
>    	u64 msr;
>    
>    	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
>    	    !boot_cpu_has_bug(X86_BUG_MDS) &&
>    	    !boot_cpu_has_bug(X86_BUG_TAA)) {
>    		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
>    		if (msr & ARCH_CAP_FB_CLEAR_CTRL)
>    			vmx_fb_clear_ctrl_available = true;
>    	}
>    }

This is the check of bare metal, while the check in 
vmx_update_fb_clear_dis() is of guest VM.

For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO, 
ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, 
ARCH_CAP_SBDR_SSDP_NO, ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, 
the VERW on this hardware clears Fill Buffer (if FB_CLEAR_DIS is not 
enabled in MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set 
vmx_fb_clear_ctrl_available  to true.

If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO, 
ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and 
ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave 
vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for 
guest. But in the view of guset, it expects VERW to clear Fill Buffer.
  
Pawan Gupta May 23, 2023, 3:34 a.m. UTC | #13
On Tue, May 23, 2023 at 09:00:50AM +0800, Xiaoyao Li wrote:
> On 5/23/2023 5:23 AM, Pawan Gupta wrote:
> > On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
> > > On 5/23/2023 1:43 AM, Sean Christopherson wrote:
> > > > > >     6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > > > > >        will and will not do, and should instead honor whatever behavior is presented
> > > > > >        to the guest.  If the guest CPU model indicates that VERW flushes buffers,
> > > > > >        then KVM damn well needs to let VERW flush buffers.
> > > > > The current implementation allows guests to have VERW flush buffers when
> > > > > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > > > > guest is trying to mitigate against a vulnerability(like MDS) on a
> > > > > hardware that is not affected. I guess its common for guests to be
> > > > > running with older gen configuration on a newer hardware.
> > > > Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the guest
> > > > the guest will do things a certain way and should instead honor the "architectural"
> > > > definition, in quotes because I realize there probably is no architectural
> > > > definition for any of this.
> > > > 
> > > > It might be that the code does (unintentionally?) honor the "architecture", i.e.
> > > > this code might actually be accurrate with respect to when the guest can expect
> > > > VERW to flush buffers.  But the comment is so, so wrong.
> > > 
> > > The comment is wrong and the code is wrong in some case as well.
> > > 
> > > If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
> > > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
> > > VM, the VM is type of "affected by MDS".
> > > 
> > > And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
> > > 
> > > if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
> > > enumerates FB_CLEAR as part of their MD_CLEAR support.
> > 
> > This is the excerpt from the link that you mentioned:
> > 
> >    "For processors that are affected by MDS and support L1D_FLUSH
> >    operations and MD_CLEAR operations, the VERW instruction flushes fill
> >    buffers."
> > 
> > You are missing an important information here "For the processors
> > _affected_ by MDS". On such processors ...
> > 
> > > However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
> > > it, and VERW intruction doesn't clear FB in the VM, which conflicts
> > > "architectural" definition.
> > 
> > ... Fill buffer clear is not enabled at all:
> > 
> >    vmx_setup_fb_clear_ctrl()
> >    {
> >    	u64 msr;
> >    	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
> >    	    !boot_cpu_has_bug(X86_BUG_MDS) &&
> >    	    !boot_cpu_has_bug(X86_BUG_TAA)) {
> >    		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
> >    		if (msr & ARCH_CAP_FB_CLEAR_CTRL)
> >    			vmx_fb_clear_ctrl_available = true;
> >    	}
> >    }
> 
> This is the check of bare metal, while the check in
> vmx_update_fb_clear_dis() is of guest VM.
> 
> For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO,
> ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO,
> ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, the VERW on this hardware
> clears Fill Buffer (if FB_CLEAR_DIS is not enabled in
> MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set
> vmx_fb_clear_ctrl_available  to true.
> 
> If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
> ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
> vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for guest.
> But in the view of guset, it expects VERW to clear Fill Buffer.

That is correct, but whether VERW clears the CPU buffers also depends on
if the hardware is affected or not, enumerating MD_CLEAR solely does not
guarantee that VERW will flush CPU buffers. This was true even before
MMIO Stale Data was discovered.

If host(hardware) enumerates:

	MD_CLEAR | MDS_NO |   VERW behavior
	---------|--------|-------------------
	    1	 |    0	  | Clears CPU buffers

But on an MDS mitigated hardware(MDS_NO=1) if guest enumerates:

	MD_CLEAR | MDS_NO |   VERW behavior
	---------|--------|-----------------------
	    1	 |    0	  | Not guaranteed to clear
	                        CPU buffers

After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
intact(for hardware that is not affected by MDS/TAA). If the userspace
truly wants the guest to have VERW flush behavior, it can export
FB_CLEAR.

I see your point that from a guest's perspective it is being lied about
VERW behavior. OTOH, I am not sure if it is a good enough reason for
mitigated hardware to keep the overhead of clearing micro-architectural
buffers for generations of CPUs.
  
Pawan Gupta May 23, 2023, 4:47 a.m. UTC | #14
On Mon, May 22, 2023 at 01:54:55PM -0700, Pawan Gupta wrote:
> On Mon, May 22, 2023 at 10:43:49AM -0700, Sean Christopherson wrote:
> > On Fri, May 19, 2023, Pawan Gupta wrote:
> > > On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > > > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> > > > 
> > > >   1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > > >      buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > > >      entry+exit.
> > > 
> > > Unnecessary VERWs in guest will have much higher impact than due to MSR
> > > read/write at vmentry/exit.
> > 
> > Can you provide numbers for something closeish to a real world workload?
> 
> I am collecting the numbers, will update here soon.

Looks like avoiding VERW flush behavior in guest results in 2-5%
improvement in performance.

Pybench and CPU bound sysbench test shows minor improvement when
avoiding reading/writing MSR_MCU_OPT_CTRL at VMentry/VMexit.

Baseline : v6.4-rc3
Target   : v6.4-rc3 + No read/write to MSR_MCU_OPT_CTRL at VMentry/VMexit

| Test      | Configuration | Relative |
| --------- | ------------- | -------- |
| nginx     | 200           | 0.977    |
| hackbench | 8 - Process   | 0.975    |
| sysbench  | RAM / Memory  | 0.946    |
| sysbench  | CPU           | 1.002    |
| pybench   | T.F.A.T.T     | 1.007    |

Host configuration (Icelake server):
  CPU family:                      6
  Model:                           106
  Stepping:                        6
  Vulnerability Itlb multihit:     Not affected
  Vulnerability L1tf:              Not affected
  Vulnerability Mds:               Not affected
  Vulnerability Meltdown:          Not affected
  Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
  Vulnerability Retbleed:          Not affected
  Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Vulnerability Spectre v2:        Vulnerable: eIBRS with unprivileged eBPF
  Vulnerability Srbds:             Not affected
  Vulnerability Tsx async abort:   Not affected


Guest configuration (Skylake Client):
  CPU family:			  6
  Model:			  94
  Stepping:			  3
  Vulnerabilities:
    Itlb multihit:         KVM: Mitigation: VMX unsupported
    L1tf:                  Mitigation; PTE Inversion
    Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode;
                           SMT Host state unknown
    Meltdown:              Mitigation; PTI
    Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode;
                           SMT Host state unknown
    Retbleed:              Vulnerable
    Spec store bypass:     Vulnerable
    Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer
                            sanitization
    Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PB
                           RSB-eIBRS Not affected
    Srbds:                 Unknown: Dependent on hypervisor status
    Tsx async abort:       Vulnerable: Clear CPU buffers attempted, no microcode;
                           SMT Host state unknown
  
Xiaoyao Li May 25, 2023, 3:42 p.m. UTC | #15
On 5/23/2023 11:34 AM, Pawan Gupta wrote:
> On Tue, May 23, 2023 at 09:00:50AM +0800, Xiaoyao Li wrote:
>> On 5/23/2023 5:23 AM, Pawan Gupta wrote:
>>> On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote:
>>>> On 5/23/2023 1:43 AM, Sean Christopherson wrote:
>>>>>>>      6. Performance aside, KVM should not be speculating (ha!) on what the guest
>>>>>>>         will and will not do, and should instead honor whatever behavior is presented
>>>>>>>         to the guest.  If the guest CPU model indicates that VERW flushes buffers,
>>>>>>>         then KVM damn well needs to let VERW flush buffers.
>>>>>> The current implementation allows guests to have VERW flush buffers when
>>>>>> they enumerate FB_CLEAR. It only restricts the flush behavior when the
>>>>>> guest is trying to mitigate against a vulnerability(like MDS) on a
>>>>>> hardware that is not affected. I guess its common for guests to be
>>>>>> running with older gen configuration on a newer hardware.
>>>>> Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the guest
>>>>> the guest will do things a certain way and should instead honor the "architectural"
>>>>> definition, in quotes because I realize there probably is no architectural
>>>>> definition for any of this.
>>>>>
>>>>> It might be that the code does (unintentionally?) honor the "architecture", i.e.
>>>>> this code might actually be accurrate with respect to when the guest can expect
>>>>> VERW to flush buffers.  But the comment is so, so wrong.
>>>>
>>>> The comment is wrong and the code is wrong in some case as well.
>>>>
>>>> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
>>>> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
>>>> VM, the VM is type of "affected by MDS".
>>>>
>>>> And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html
>>>>
>>>> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
>>>> enumerates FB_CLEAR as part of their MD_CLEAR support.
>>>
>>> This is the excerpt from the link that you mentioned:
>>>
>>>     "For processors that are affected by MDS and support L1D_FLUSH
>>>     operations and MD_CLEAR operations, the VERW instruction flushes fill
>>>     buffers."
>>>
>>> You are missing an important information here "For the processors
>>> _affected_ by MDS". On such processors ...
>>>
>>>> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
>>>> it, and VERW intruction doesn't clear FB in the VM, which conflicts
>>>> "architectural" definition.
>>>
>>> ... Fill buffer clear is not enabled at all:
>>>
>>>     vmx_setup_fb_clear_ctrl()
>>>     {
>>>     	u64 msr;
>>>     	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
>>>     	    !boot_cpu_has_bug(X86_BUG_MDS) &&
>>>     	    !boot_cpu_has_bug(X86_BUG_TAA)) {
>>>     		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
>>>     		if (msr & ARCH_CAP_FB_CLEAR_CTRL)
>>>     			vmx_fb_clear_ctrl_available = true;
>>>     	}
>>>     }
>>
>> This is the check of bare metal, while the check in
>> vmx_update_fb_clear_dis() is of guest VM.
>>
>> For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO,
>> ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO,
>> ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, the VERW on this hardware
>> clears Fill Buffer (if FB_CLEAR_DIS is not enabled in
>> MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set
>> vmx_fb_clear_ctrl_available  to true.
>>
>> If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
>> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
>> ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
>> vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for guest.
>> But in the view of guset, it expects VERW to clear Fill Buffer.
> 
> That is correct, but whether VERW clears the CPU buffers also depends on
> if the hardware is affected or not, enumerating MD_CLEAR solely does not
> guarantee that VERW will flush CPU buffers. This was true even before
> MMIO Stale Data was discovered.
> 
> If host(hardware) enumerates:
> 
> 	MD_CLEAR | MDS_NO |   VERW behavior
> 	---------|--------|-------------------
> 	    1	 |    0	  | Clears CPU buffers
> 
> But on an MDS mitigated hardware(MDS_NO=1) if guest enumerates:
> 
> 	MD_CLEAR | MDS_NO |   VERW behavior
> 	---------|--------|-----------------------
> 	    1	 |    0	  | Not guaranteed to clear
> 	                        CPU buffers
> 
> After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
> intact(for hardware that is not affected by MDS/TAA). 

Sorry, I don't understand it. What the behavior is?

> If the userspace
> truly wants the guest to have VERW flush behavior, it can export
> FB_CLEAR.
 >
> I see your point that from a guest's perspective it is being lied about
> VERW behavior. OTOH, I am not sure if it is a good enough reason for
> mitigated hardware to keep the overhead of clearing micro-architectural
> buffers for generations of CPUs.

User takes the responsiblity because itself requests the specific 
feature combination for its guest.
  
Pawan Gupta May 25, 2023, 8:33 p.m. UTC | #16
On Thu, May 25, 2023 at 11:42:24PM +0800, Xiaoyao Li wrote:
> On 5/23/2023 11:34 AM, Pawan Gupta wrote:
> > > If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
> > > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
> > > ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
> > > vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for guest.
> > > But in the view of guset, it expects VERW to clear Fill Buffer.
> > 
> > That is correct, but whether VERW clears the CPU buffers also depends on
> > if the hardware is affected or not, enumerating MD_CLEAR solely does not
> > guarantee that VERW will flush CPU buffers. This was true even before
> > MMIO Stale Data was discovered.
> > 
> > If host(hardware) enumerates:
> > 
> > 	MD_CLEAR | MDS_NO |   VERW behavior
> > 	---------|--------|-------------------
> > 	    1	 |    0	  | Clears CPU buffers
> > 
> > But on an MDS mitigated hardware(MDS_NO=1) if guest enumerates:
> > 
> > 	MD_CLEAR | MDS_NO |   VERW behavior
> > 	---------|--------|-----------------------
> > 	    1	 |    0	  | Not guaranteed to clear
> > 	                        CPU buffers
> > 
> > After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
> > intact(for hardware that is not affected by MDS/TAA).
> 
> Sorry, I don't understand it. What the behavior is?

That on a mitigated hardware VERW may not clear the micro-architectural
buffers.

There are many micro-architectural buffers, VERW only clears the
affected ones. This is indicated in section "Fill Buffer Clearing
Operations" of [1].

  Some processors may enumerate MD_CLEAR because they overwrite all
  buffers affected by MDS/TAA, but they do not overwrite fill buffer
  values. This is because fill buffers are not susceptible to MDS or TAA
  on those processors.

  For processors affected by FBSDP where MD_CLEAR may not overwrite fill
  buffer values, Intel has released microcode updates that enumerate
  FB_CLEAR so that VERW does overwrite fill buffer values.

> > If the userspace
> > truly wants the guest to have VERW flush behavior, it can export
> > FB_CLEAR.
> >
> > I see your point that from a guest's perspective it is being lied about
> > VERW behavior. OTOH, I am not sure if it is a good enough reason for
> > mitigated hardware to keep the overhead of clearing micro-architectural
> > buffers for generations of CPUs.
> 
> User takes the responsiblity because itself requests the specific feature
> combination for its guest.

As I understand, the MD_CLEAR enumeration on mitigated hardware is done
purely for VM migration compatibility. Software is not expected to use
VERW on mitigated hardware, below is from MDS documentation [2]:

  Future processors set the MDS_NO bit in IA32_ARCH_CAPABILITIES to
  indicate they are not affected by microarchitectural data sampling.
  Such processors will continue to enumerate the MD_CLEAR bit in CPUID.
  As none of these data buffers are vulnerable to exposure on such
  parts, no data buffer overwriting is required or expected for such
  parts, despite the MD_CLEAR indication. Software should look to the
  MDS_NO bit to determine whether buffer overwriting mitigations are
  required.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html

[2] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/processors-affected-mds.html
  
Chao Gao May 29, 2023, 3:35 a.m. UTC | #17
On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
>FWIW, this trainwreck is another reason why I'm not going to look at the proposed
>"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
>do so.  I have zero confidence that a paravirt interface defined by hardware
>vendors to fiddle with mitigations will be sane, flexible, and extensible.

Hi Sean,

Just to confirm we are on the same page:

"Intel IA32_SPEC_CTRL Virtualization" series consists of 3 parts:

1. Expose BHI_CTRL, RRSBA_CTRL to guests. They are hardware mitigations to
   disable BHI and RRSBA behaviors and can be used by both guest/host.

2. Enable IA32_SPEC_CTRL Virtualization which is a VMX feature. This is for KVM
   to effectively lock some bits of IA32_SPEC_CTRL MSR when guests are running.

3. Implement the paravirt interface (the virtual MSRs) for guests to report
   software mitigations in-use. KVM can utilize such information to enable
   hardware mitigations for guests transparently to address software mitigation
   effectiveness issues caused by CPU microarchitecture changes (RRSBA behavior,
   size of branch history table).

As per my understanding, your concerns are primarily focused on #3, the
paravirt interface, rather than the entire series. Am I correct in assuming that
you do not oppose #1 and #2?


You previously mentioned that the paravirt interface was not common [1], and
this time, you expressed an expectation for the interface to be "sane, flexible,
and extensible." To ensure clarity, I want to confirm my interpretation of
your expectations:

1. The interface should not be tied to a specific CPU vendor but instead be
   beneficial for Intel and AMD (and even ARM, and potentially others).

2. The interface should have the capability to solve other issues (e.g,
   coordinate mitigations in guest/host to address other perf/function issues),
   not limited to software mitigation effectiveness on Intel CPUs.

3. The interface should be extendable by VMMs rather than relying on hardware
   vendors rolling out new spec. This enables VMM developers to propose new
   ideas to coordinate mitigations in guest/host.

Please let me know if I missed any key points or if any of the above statements
do not align with your expectations. 

[1]: https://lore.kernel.org/all/Y6Sin1bmLN10yvMw@google.com/
  
Sean Christopherson June 6, 2023, 4:54 p.m. UTC | #18
On Mon, May 29, 2023, Chao Gao wrote:
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> >FWIW, this trainwreck is another reason why I'm not going to look at the proposed
> >"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
> >do so.  I have zero confidence that a paravirt interface defined by hardware
> >vendors to fiddle with mitigations will be sane, flexible, and extensible.
> 
> Hi Sean,
> 
> Just to confirm we are on the same page:
> 
> "Intel IA32_SPEC_CTRL Virtualization" series consists of 3 parts:
> 
> 1. Expose BHI_CTRL, RRSBA_CTRL to guests. They are hardware mitigations to
>    disable BHI and RRSBA behaviors and can be used by both guest/host.
> 
> 2. Enable IA32_SPEC_CTRL Virtualization which is a VMX feature. This is for KVM
>    to effectively lock some bits of IA32_SPEC_CTRL MSR when guests are running.
> 
> 3. Implement the paravirt interface (the virtual MSRs) for guests to report
>    software mitigations in-use. KVM can utilize such information to enable
>    hardware mitigations for guests transparently to address software mitigation
>    effectiveness issues caused by CPU microarchitecture changes (RRSBA behavior,
>    size of branch history table).
> 
> As per my understanding, your concerns are primarily focused on #3, the
> paravirt interface, rather than the entire series. Am I correct in assuming that
> you do not oppose #1 and #2?

Yes, correct.  I definitely recommend posting #1 and #2 separately from the
paravirt interface, I ignored the entire series without realizing there is real
hardware support in there.

> You previously mentioned that the paravirt interface was not common [1], and
> this time, you expressed an expectation for the interface to be "sane, flexible,
> and extensible." To ensure clarity, I want to confirm my interpretation of
> your expectations:
> 
> 1. The interface should not be tied to a specific CPU vendor but instead be
>    beneficial for Intel and AMD (and even ARM, and potentially others).
> 
> 2. The interface should have the capability to solve other issues (e.g,
>    coordinate mitigations in guest/host to address other perf/function issues),
>    not limited to software mitigation effectiveness on Intel CPUs.
> 3. The interface should be extendable by VMMs rather than relying on hardware
>    vendors rolling out new spec. This enables VMM developers to propose new
>    ideas to coordinate mitigations in guest/host.

Ya, that's more or less my opinion.  Even more than allowing VMM developers to
extend/define the interface, I want the definition of the interace to be a
community/collaborative effort.  LKML has active representatives from all of the
major (known) hypervisors, so it shouldn't be *that* hard to figure out a way to
make the interface community driven.

Note that it doesn't necessarily have to be VMM developers, e.g. many of the
people that are intimately familiar with the mitigations aren't virtualization
folks.

> Please let me know if I missed any key points or if any of the above statements
> do not align with your expectations. 
> 
> [1]: https://lore.kernel.org/all/Y6Sin1bmLN10yvMw@google.com/
  

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 523c39a03c00..94aa70ec169c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1670,7 +1670,7 @@  static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
 {
 	switch (msr->index) {
 	case MSR_IA32_ARCH_CAPABILITIES:
-		msr->data = kvm_get_arch_capabilities();
+		msr->data = kvm_caps.supported_arch_cap;
 		break;
 	case MSR_IA32_PERF_CAPABILITIES:
 		msr->data = kvm_caps.supported_perf_cap;
@@ -9523,6 +9523,7 @@  static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 		kvm_caps.max_guest_tsc_khz = max;
 	}
 	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
+	kvm_caps.supported_arch_cap = kvm_get_arch_capabilities();
 	kvm_init_msr_lists();
 	return 0;
 
@@ -11879,7 +11880,7 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (r)
 		goto free_guest_fpu;
 
-	vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
+	vcpu->arch.arch_capabilities = kvm_caps.supported_arch_cap;
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
 	kvm_xen_init_vcpu(vcpu);
 	kvm_vcpu_mtrr_init(vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..d3e524bcc169 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -29,6 +29,7 @@  struct kvm_caps {
 	u64 supported_xcr0;
 	u64 supported_xss;
 	u64 supported_perf_cap;
+	u64 supported_arch_cap;
 };
 
 void kvm_spurious_fault(void);