[v3,6/6] KVM: VMX: Move VERW closer to VMentry for MDS mitigation

Message ID 20231025-delay-verw-v3-6-52663677ee35@linux.intel.com
State New
Series: Delay VERW

Commit Message

Pawan Gupta Oct. 25, 2023, 8:53 p.m. UTC
  During VMentry VERW is executed to mitigate MDS. After VERW, any memory
access like a register push onto the stack may put host data in MDS-affected
CPU buffers. A guest can then use MDS to sample host data.

Although the likelihood of secrets surviving in registers at the current
VERW callsite is low, it can't be ruled out. Harden the MDS mitigation by
moving VERW late in the VMentry path.

Note that VERW for the MMIO Stale Data mitigation is unchanged because
per-guest conditional VERW is not easy to handle that late in asm with no
GPRs available. If the CPU is also affected by MDS, VERW is unconditionally
executed late in asm regardless of the guest having MMIO access.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
 arch/x86/kvm/vmx/vmenter.S |  3 +++
 arch/x86/kvm/vmx/vmx.c     | 10 +++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)
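
For context, the C-callable buffer-clearing helper discussed below,
mds_clear_cpu_buffers(), is roughly the following VERW wrapper (a sketch
based on arch/x86/include/asm/nospec-branch.h; the exact definition may vary
by kernel version):

static __always_inline void mds_clear_cpu_buffers(void)
{
	static const u16 ds = __KERNEL_DS;

	/*
	 * Use the memory-operand form of VERW: only that form is documented
	 * to trigger the CPU buffer flush. Any valid writable data segment
	 * selector works; __KERNEL_DS is a convenient choice.
	 * "cc" is clobbered because VERW modifies EFLAGS.ZF.
	 */
	asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}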
  

Comments

Nikolay Borisov Oct. 26, 2023, 4:14 p.m. UTC | #1
On 25.10.23 23:53, Pawan Gupta wrote:
> During VMentry VERW is executed to mitigate MDS. After VERW, any memory
> access like a register push onto the stack may put host data in MDS-affected
> CPU buffers. A guest can then use MDS to sample host data.
> 
> Although the likelihood of secrets surviving in registers at the current
> VERW callsite is low, it can't be ruled out. Harden the MDS mitigation by
> moving VERW late in the VMentry path.
> 
> Note that VERW for the MMIO Stale Data mitigation is unchanged because
> per-guest conditional VERW is not easy to handle that late in asm with no
> GPRs available. If the CPU is also affected by MDS, VERW is unconditionally
> executed late in asm regardless of the guest having MMIO access.
> 
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
>   arch/x86/kvm/vmx/vmenter.S |  3 +++
>   arch/x86/kvm/vmx/vmx.c     | 10 +++++++---
>   2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index b3b13ec04bac..139960deb736 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -161,6 +161,9 @@ SYM_FUNC_START(__vmx_vcpu_run)
>   	/* Load guest RAX.  This kills the @regs pointer! */
>   	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
>   
> +	/* Clobbers EFLAGS.ZF */
> +	CLEAR_CPU_BUFFERS
> +
>   	/* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
>   	jnc .Lvmlaunch
>   
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 24e8694b83fc..2d149589cf5b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7226,13 +7226,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>   
>   	guest_state_enter_irqoff();
>   
> -	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> +	/*
> +	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
> +	 * mitigation for MDS is done late in VMentry and is still
> +	 * executed inspite of L1D Flush. This is because an extra VERW
> +	 * should not matter much after the big hammer L1D Flush.
> +	 */
>   	if (static_branch_unlikely(&vmx_l1d_should_flush))
>   		vmx_l1d_flush(vcpu);
> -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> -		mds_clear_cpu_buffers();
>   	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
>   		 kvm_arch_has_assigned_device(vcpu->kvm))
> +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */

Mutually exclusive implies that you have one or the other but not both,
whilst I think the right formulation here is "redundant"? Because if mmio
is enabled, mds_clear_cpu_buffers() will clear the buffers here and
later they'll be cleared again, no? Alternatively you might augment
this check to only execute iff X86_FEATURE_CLEAR_CPU_BUF is not set?

>   		mds_clear_cpu_buffers();
>   
>   	vmx_disable_fb_clear(vmx);
>
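
(Concretely, Nikolay's alternative, gating the MMIO branch on
X86_FEATURE_CLEAR_CPU_BUF not being set, would amount to something like this
hypothetical tweak to the else-if chain in vmx_vcpu_enter_exit(); shown only
to illustrate the suggestion, not as the submitted patch:)

	else if (!cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF) &&
		 static_branch_unlikely(&mmio_stale_data_clear) &&
		 kvm_arch_has_assigned_device(vcpu->kvm))
		mds_clear_cpu_buffers();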
  
Pawan Gupta Oct. 26, 2023, 7:07 p.m. UTC | #2
On Thu, Oct 26, 2023 at 07:14:18PM +0300, Nikolay Borisov wrote:
> >   	if (static_branch_unlikely(&vmx_l1d_should_flush))
> >   		vmx_l1d_flush(vcpu);
> > -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> > -		mds_clear_cpu_buffers();
> >   	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
> >   		 kvm_arch_has_assigned_device(vcpu->kvm))
> > +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */
> 
> Mutually exclusive implies that you have one or the other but not both,
> whilst I think the right formulation here is "redundant"? Because if mmio
> is enabled, mds_clear_cpu_buffers() will clear the buffers here and
> later they'll be cleared again, no?

No, because when mmio_stale_data_clear is enabled,
X86_FEATURE_CLEAR_CPU_BUF will not be set, due to how the mitigation is
selected in mmio_select_mitigation():

mmio_select_mitigation()
{
...
         /*
          * Enable CPU buffer clear mitigation for host and VMM if also affected
          * by MDS or TAA. Otherwise, enable mitigation for VMM only.
          */
         if (boot_cpu_has_bug(X86_BUG_MDS) || (boot_cpu_has_bug(X86_BUG_TAA) &&
                                               boot_cpu_has(X86_FEATURE_RTM)))
                 setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
         else
                 static_branch_enable(&mmio_stale_data_clear);

> Alternatively you might augment this check to only execute iff
> X86_FEATURE_CLEAR_CPU_BUF is not set?

It already is like that due to the logic above. That is what the
comment:

	/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */

... is trying to convey. Suggestions welcome to improve the comment.
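
(One possible wording, sketched from the mmio_select_mitigation() logic
quoted above:)

	/*
	 * When mmio_stale_data_clear is enabled, X86_FEATURE_CLEAR_CPU_BUF
	 * is never set (see mmio_select_mitigation()), so the late VERW in
	 * asm is not executed on this path.
	 */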
  
Sean Christopherson Oct. 26, 2023, 7:30 p.m. UTC | #3
On Wed, Oct 25, 2023, Pawan Gupta wrote:
> During VMentry VERW is executed to mitigate MDS. After VERW, any memory
> access like a register push onto the stack may put host data in MDS-affected
> CPU buffers. A guest can then use MDS to sample host data.
> 
> Although the likelihood of secrets surviving in registers at the current
> VERW callsite is low, it can't be ruled out. Harden the MDS mitigation by
> moving VERW late in the VMentry path.
> 
> Note that VERW for the MMIO Stale Data mitigation is unchanged because
> per-guest conditional VERW is not easy to handle that late in asm with no
> GPRs available. If the CPU is also affected by MDS, VERW is unconditionally
> executed late in asm regardless of the guest having MMIO access.
> 
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/vmenter.S |  3 +++
>  arch/x86/kvm/vmx/vmx.c     | 10 +++++++---
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index b3b13ec04bac..139960deb736 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -161,6 +161,9 @@ SYM_FUNC_START(__vmx_vcpu_run)
>  	/* Load guest RAX.  This kills the @regs pointer! */
>  	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
>  
> +	/* Clobbers EFLAGS.ZF */
> +	CLEAR_CPU_BUFFERS
> +
>  	/* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
>  	jnc .Lvmlaunch
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 24e8694b83fc..2d149589cf5b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7226,13 +7226,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  
>  	guest_state_enter_irqoff();
>  
> -	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> +	/*
> +	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
> +	 * mitigation for MDS is done late in VMentry and is still
> +	 * executed inspite of L1D Flush. This is because an extra VERW

in spite

> +	 * should not matter much after the big hammer L1D Flush.
> +	 */
>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
>  		vmx_l1d_flush(vcpu);

There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
"conditional mode", and is not guaranteed to do a ucode-based flush (though I can't
tell if it's possible for the VERW magic to exist without X86_FEATURE_FLUSH_L1D).

If we care, something like the diff at the bottom is probably needed.

> -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> -		mds_clear_cpu_buffers();
>  	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
>  		 kvm_arch_has_assigned_device(vcpu->kvm))
> +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */

Please don't put comments inside an if/elif without curly braces (and I don't
want to add curly braces).  Though I think that's a moot point if we first fix
the conditional L1D flush issue.  E.g. when the dust settles we can end up with:

	/*
	 * Note, a ucode-based L1D flush also flushes CPU buffers, i.e. the
	 * manual VERW in __vmx_vcpu_run() to mitigate MDS *may* be redundant.
	 * But an L1D Flush is not guaranteed for "conditional mode", and the
	 * cost of an extra VERW after a full L1D flush is negligible.
	 */
	if (static_branch_unlikely(&vmx_l1d_should_flush))
		cpu_buffers_flushed = vmx_l1d_flush(vcpu);

	/*
	 * The MMIO stale data vulnerability is a subset of the general MDS
	 * vulnerability, i.e. this is mutually exclusive with the VERW that's
	 * done just before VM-Enter.  The vulnerability requires the attacker,
	 * i.e. the guest, to do MMIO, so this "clear" can be done earlier.
	 */
	if (static_branch_unlikely(&mmio_stale_data_clear) &&
	    !cpu_buffers_flushed && kvm_arch_has_assigned_device(vcpu->kvm))
		mds_clear_cpu_buffers();

>  		mds_clear_cpu_buffers();
>  
>  	vmx_disable_fb_clear(vmx);

LOL, nice.  IIUC, setting FB_CLEAR_DIS is mutually exclusive with doing a late
VERW, as KVM will never set FB_CLEAR_DIS if the CPU is susceptible to X86_BUG_MDS.
But the checks aren't identical, which makes this _look_ sketchy.

Can you do something like this to ensure we don't accidentally neuter the late VERW?

static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx)
{
	vmx->disable_fb_clear = (host_arch_capabilities & ARCH_CAP_FB_CLEAR_CTRL) &&
				!boot_cpu_has_bug(X86_BUG_MDS) &&
				!boot_cpu_has_bug(X86_BUG_TAA);

	if (vmx->disable_fb_clear &&
	    WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF)))
	    	vmx->disable_fb_clear = false;

	...
}

--
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6e502ba93141..cf6e06bb8310 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6606,8 +6606,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
  * is not exactly LRU. This could be sized at runtime via topology
  * information but as all relevant affected CPUs have 32KiB L1D cache size
  * there is no point in doing so.
+ *
+ * Returns %true if CPU buffers were cleared, i.e. if a microcode-based L1D
+ * flush was executed (which also clears CPU buffers).
  */
-static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
+static noinstr bool vmx_l1d_flush(struct kvm_vcpu *vcpu)
 {
        int size = PAGE_SIZE << L1D_CACHE_ORDER;
 
@@ -6634,14 +6637,14 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
                kvm_clear_cpu_l1tf_flush_l1d();
 
                if (!flush_l1d)
-                       return;
+                       return false;
        }
 
        vcpu->stat.l1d_flush++;
 
        if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
                native_wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
-               return;
+               return true;
        }
 
        asm volatile(
@@ -6665,6 +6668,8 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
                :: [flush_pages] "r" (vmx_l1d_flush_pages),
                    [size] "r" (size)
                : "eax", "ebx", "ecx", "edx");
+
+       return false;
 }
 
 static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
@@ -7222,16 +7227,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
                                        unsigned int flags)
 {
        struct vcpu_vmx *vmx = to_vmx(vcpu);
+       bool cpu_buffers_flushed = false;
 
        guest_state_enter_irqoff();
 
-       /* L1D Flush includes CPU buffer clear to mitigate MDS */
        if (static_branch_unlikely(&vmx_l1d_should_flush))
-               vmx_l1d_flush(vcpu);
-       else if (static_branch_unlikely(&mds_user_clear))
-               mds_clear_cpu_buffers();
-       else if (static_branch_unlikely(&mmio_stale_data_clear) &&
-                kvm_arch_has_assigned_device(vcpu->kvm))
+               cpu_buffers_flushed = vmx_l1d_flush(vcpu);
+
+       if ((static_branch_unlikely(&mds_user_clear) ||
+            (static_branch_unlikely(&mmio_stale_data_clear) &&
+             kvm_arch_has_assigned_device(vcpu->kvm))) &&
+           !cpu_buffers_flushed)
                mds_clear_cpu_buffers();
 
        vmx_disable_fb_clear(vmx);
  
Sean Christopherson Oct. 26, 2023, 8:17 p.m. UTC | #4
On Thu, Oct 26, 2023, Sean Christopherson wrote:
> On Wed, Oct 25, 2023, Pawan Gupta wrote:
> >  	vmx_disable_fb_clear(vmx);
> 
> LOL, nice.  IIUC, setting FB_CLEAR_DIS is mutually exclusive with doing a late
> VERW, as KVM will never set FB_CLEAR_DIS if the CPU is susceptible to X86_BUG_MDS.
> But the checks aren't identical, which makes this _look_ sketchy.
> 
> Can you do something like this to ensure we don't accidentally neuter the late VERW?
> 
> static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx)
> {
> 	vmx->disable_fb_clear = (host_arch_capabilities & ARCH_CAP_FB_CLEAR_CTRL) &&
> 				!boot_cpu_has_bug(X86_BUG_MDS) &&
> 				!boot_cpu_has_bug(X86_BUG_TAA);
> 
> 	if (vmx->disable_fb_clear &&
> 	    WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF)))
> 	    	vmx->disable_fb_clear = false;
> 
> 	...
> }

Alternatively, and maybe even preferably, this would make it more obvious that
the two are mutually exclusive and would also be a (very, very) small perf win
when the mitigation is enabled.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0936516cb93b..592103df1754 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7236,7 +7236,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
                 kvm_arch_has_assigned_device(vcpu->kvm))
                mds_clear_cpu_buffers();
 
-       vmx_disable_fb_clear(vmx);
+       if (!cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
+               vmx_disable_fb_clear(vmx);
 
        if (vcpu->arch.cr2 != native_read_cr2())
                native_write_cr2(vcpu->arch.cr2);
@@ -7249,7 +7250,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
        vmx->idt_vectoring_info = 0;
 
-       vmx_enable_fb_clear(vmx);
+       if (!cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
+               vmx_enable_fb_clear(vmx);
 
        if (unlikely(vmx->fail)) {
                vmx->exit_reason.full = 0xdead;
  
Pawan Gupta Oct. 26, 2023, 8:48 p.m. UTC | #5
On Thu, Oct 26, 2023 at 12:30:55PM -0700, Sean Christopherson wrote:
> > -	/* L1D Flush includes CPU buffer clear to mitigate MDS */
> > +	/*
> > +	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
> > +	 * mitigation for MDS is done late in VMentry and is still
> > +	 * executed inspite of L1D Flush. This is because an extra VERW
> 
> in spite

Ok.

> > +	 * should not matter much after the big hammer L1D Flush.
> > +	 */
> >  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> >  		vmx_l1d_flush(vcpu);
> 
> There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
> "conditional mode", and is not guaranteed to do a ucode-based flush

AFAICT, it is based on whether any sensitive data could have been
touched after a VMexit. If the L1TF mitigation doesn't consider certain
data sensitive and skips the L1D flush, executing VERW isn't giving any
protection, since that data can anyway be leaked from L1D using L1TF.
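
For context, the "conditional mode" in question is the vmx_l1d_flush_cond
branch of vmx_l1d_flush(), which looks roughly like this (simplified excerpt;
details may differ by kernel version):

	if (static_branch_likely(&vmx_l1d_flush_cond)) {
		bool flush_l1d;

		/* Per-vcpu flush bit, set again by "unsafe" VMEXIT handlers */
		flush_l1d = vcpu->arch.l1tf_flush_l1d;
		vcpu->arch.l1tf_flush_l1d = false;

		/* Per-cpu flush bit, set again from interrupt handlers */
		flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
		kvm_clear_cpu_l1tf_flush_l1d();

		if (!flush_l1d)
			return;		/* skip the flush for this VMentry */
	}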

> (though I can't tell if it's possible for the VERW magic to exist
> without X86_FEATURE_FLUSH_L1D).

Likely not; ucode that adds VERW should have X86_FEATURE_FLUSH_L1D, as
the L1TF mitigation predates MDS.

> If we care, something like the diff at the bottom is probably needed.
> 
> > -	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> > -		mds_clear_cpu_buffers();
> >  	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
> >  		 kvm_arch_has_assigned_device(vcpu->kvm))
> > +		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */
> 
> Please don't put comments inside an if/elif without curly braces (and I don't
> want to add curly braces).  Though I think that's a moot point if we first fix
> the conditional L1D flush issue.  E.g. when the dust settles we can end up with:

Ok.

> 	/*
> 	 * Note, a ucode-based L1D flush also flushes CPU buffers, i.e. the
> 	 * manual VERW in __vmx_vcpu_run() to mitigate MDS *may* be redundant.
> 	 * But an L1D Flush is not guaranteed for "conditional mode", and the
> 	 * cost of an extra VERW after a full L1D flush is negligible.
> 	 */
> 	if (static_branch_unlikely(&vmx_l1d_should_flush))
> 		cpu_buffers_flushed = vmx_l1d_flush(vcpu);
> 
> 	/*
> 	 * The MMIO stale data vulnerability is a subset of the general MDS
> 	 * vulnerability, i.e. this is mutually exclusive with the VERW that's
> 	 * done just before VM-Enter.  The vulnerability requires the attacker,
> 	 * i.e. the guest, to do MMIO, so this "clear" can be done earlier.
> 	 */
> 	if (static_branch_unlikely(&mmio_stale_data_clear) &&
> 	    !cpu_buffers_flushed && kvm_arch_has_assigned_device(vcpu->kvm))
> 		mds_clear_cpu_buffers();

This is certainly better, but I don't know what scenario this is helping with.

> >  		mds_clear_cpu_buffers();
> >  
> >  	vmx_disable_fb_clear(vmx);
> 
> LOL, nice.  IIUC, setting FB_CLEAR_DIS is mutually exclusive with doing a late
> VERW, as KVM will never set FB_CLEAR_DIS if the CPU is susceptible to X86_BUG_MDS.
> But the checks aren't identical, which makes this _look_ sketchy.
> 
> Can you do something like this to ensure we don't accidentally neuter the late VERW?
> 
> static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx)
> {
> 	vmx->disable_fb_clear = (host_arch_capabilities & ARCH_CAP_FB_CLEAR_CTRL) &&
> 				!boot_cpu_has_bug(X86_BUG_MDS) &&
> 				!boot_cpu_has_bug(X86_BUG_TAA);
> 
> 	if (vmx->disable_fb_clear &&
> 	    WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF)))
> 	    	vmx->disable_fb_clear = false;

Will do, this makes a lot of sense.
  
Sean Christopherson Oct. 26, 2023, 9:22 p.m. UTC | #6
On Thu, Oct 26, 2023, Pawan Gupta wrote:
> On Thu, Oct 26, 2023 at 12:30:55PM -0700, Sean Christopherson wrote:
> > >  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> > >  		vmx_l1d_flush(vcpu);
> > 
> > There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
> > "conditional mode", and is not guaranteed to do a ucode-based flush
> 
> AFAICT, it is based on whether any sensitive data could have been
> touched after a VMexit. If the L1TF mitigation doesn't consider certain
> data sensitive and skips the L1D flush, executing VERW isn't giving any
> protection, since that data can anyway be leaked from L1D using L1TF.

That assumes vcpu->arch.l1tf_flush_l1d is 100% precise and accurate, which is most
definitely not the case.  You're also preventing the admin from choosing between
being super paranoid (always flush L1D) and mostly paranoid (conditionally flush
L1D, always flush CPU buffers).

AIUI, flushing the L1D is crazy expensive compared to flushing the CPU buffers,
so it's entirely plausible for someone to want to choose the mostly paranoid
option.

Side topic, isn't the NMI path missing a call to kvm_set_cpu_l1tf_flush_l1d()?
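
(For context, that helper sets the per-CPU hint consumed by the conditional
L1D flush; roughly, per arch/x86/include/asm/hardirq.h, though the exact
definition may vary by kernel version:)

static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void)
{
	__this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
}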

> > 	/*
> > 	 * The MMIO stale data vulnerability is a subset of the general MDS
> > 	 * vulnerability, i.e. this is mutually exclusive with the VERW that's
> > 	 * done just before VM-Enter.  The vulnerability requires the attacker,
> > 	 * i.e. the guest, to do MMIO, so this "clear" can be done earlier.
> > 	 */
> > 	if (static_branch_unlikely(&mmio_stale_data_clear) &&
> > 	    !cpu_buffers_flushed && kvm_arch_has_assigned_device(vcpu->kvm))
> > 		mds_clear_cpu_buffers();
> 
> This is certainly better, but I don't know what scenario this is helping with.

Heh, that's how I feel about moving VERW to just before VM-Enter.  I have a hard
time believing there's meaningful sensitive data that's accessed in __vmx_vcpu_run().
The closest thing is probably CR2, but that's a very dubious vector since CR2 will
hold a guest value for most VM-Enters.

I'm not against moving VERW close to VM-Enter because it's relatively straightforward,
but if we're going to be super paranoid, why not go all the way and not have to
worry about what ifs?
  
Pawan Gupta Oct. 26, 2023, 9:27 p.m. UTC | #7
On Thu, Oct 26, 2023 at 01:17:47PM -0700, Sean Christopherson wrote:
> Alternatively, and maybe even preferably, this would make it more obvious that
> the two are mutually exclusive and would also be a (very, very) small perf win
> when the mitigation is enabled.

Agree.

> -       vmx_disable_fb_clear(vmx);
> +       if (!cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
> +               vmx_disable_fb_clear(vmx);
  
Pawan Gupta Oct. 26, 2023, 10:03 p.m. UTC | #8
On Thu, Oct 26, 2023 at 02:22:58PM -0700, Sean Christopherson wrote:
> On Thu, Oct 26, 2023, Pawan Gupta wrote:
> > On Thu, Oct 26, 2023 at 12:30:55PM -0700, Sean Christopherson wrote:
> > > >  	if (static_branch_unlikely(&vmx_l1d_should_flush))
> > > >  		vmx_l1d_flush(vcpu);
> > > 
> > > There's an existing bug here.  vmx_l1d_flush() is not guaranteed to do a flush in
> > > "conditional mode", and is not guaranteed to do a ucode-based flush
> > 
> > AFAICT, it is based on whether any sensitive data could have been
> > touched after a VMexit. If the L1TF mitigation doesn't consider certain
> > data sensitive and skips the L1D flush, executing VERW isn't giving any
> > protection, since that data can anyway be leaked from L1D using L1TF.
> 
> That assumes vcpu->arch.l1tf_flush_l1d is 100% precise and accurate, which is most
> definitely not the case.  You're also preventing the admin from choosing between
> being super paranoid (always flush L1D) and mostly paranoid (conditionally flush
> L1D, always flush CPU buffers).
> 
> AIUI, flushing the L1D is crazy expensive compared to flushing the CPU buffers,
> so it's entirely plausible for someone to want to choose the mostly paranoid
> option.

Sure, if it helps an admin. I was asking about the problematic scenario
out of curiosity. BTW, the changes you suggested are definitely worth
doing.

> Side topic, isn't the NMI path missing a call to kvm_set_cpu_l1tf_flush_l1d()?

Yes, it is missing. Not sure if it was omitted intentionally.

> > This is certainly better, but I don't know what scenario this is helping with.
> 
> Heh, that's how I feel about moving VERW to just before VM-Enter.  I have a hard
> time believing there's meaningful sensitive data that's accessed in __vmx_vcpu_run().
> The closest thing is probably CR2, but that's a very dubious vector since CR2 will
> hold a guest value for most VM-Enters.

Yes, the kernel->user case has a better chance of leaking anything.

> I'm not against moving VERW close to VM-Enter because it's relatively straightforward,
> but if we're going to be super paranoid, why not go all the way and not have to
> worry about what ifs?

Right. The VMentry changes are mostly done to be consistent with what is being
done for kernel->user.
  

Patch

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index b3b13ec04bac..139960deb736 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -161,6 +161,9 @@  SYM_FUNC_START(__vmx_vcpu_run)
 	/* Load guest RAX.  This kills the @regs pointer! */
 	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
 
+	/* Clobbers EFLAGS.ZF */
+	CLEAR_CPU_BUFFERS
+
 	/* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
 	jnc .Lvmlaunch
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 24e8694b83fc..2d149589cf5b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7226,13 +7226,17 @@  static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
 	guest_state_enter_irqoff();
 
-	/* L1D Flush includes CPU buffer clear to mitigate MDS */
+	/*
+	 * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
+	 * mitigation for MDS is done late in VMentry and is still
+	 * executed inspite of L1D Flush. This is because an extra VERW
+	 * should not matter much after the big hammer L1D Flush.
+	 */
 	if (static_branch_unlikely(&vmx_l1d_should_flush))
 		vmx_l1d_flush(vcpu);
-	else if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF))
-		mds_clear_cpu_buffers();
 	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
 		 kvm_arch_has_assigned_device(vcpu->kvm))
+		/* MMIO mitigation is mutually exclusive with MDS mitigation later in asm */
 		mds_clear_cpu_buffers();
 
 	vmx_disable_fb_clear(vmx);