[v4,19/19] KVM: VMX: Skip VMCLEAR logic during emergency reboots if CR4.VMXE=0

Message ID 20230721201859.2307736-20-seanjc@google.com
State New
Series x86/reboot: KVM: Clean up "emergency" virt code

Commit Message

Sean Christopherson July 21, 2023, 8:18 p.m. UTC
Bail from vmx_emergency_disable() without processing the list of loaded
VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON.  It should be
impossible for the list to have entries if VMX is already disabled, and
even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
processing the list is pointless even if it somehow isn't empty.

Assuming no existing KVM bugs, this should be a glorified nop.  The
primary motivation for the change is to avoid having code that looks like
it does VMCLEAR, but then skips VMXOFF, which is nonsensical.

Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
  

Comments

Kai Huang July 25, 2023, 3:51 a.m. UTC | #1
On Fri, 2023-07-21 at 13:18 -0700, Sean Christopherson wrote:
> Bail from vmx_emergency_disable() without processing the list of loaded
> VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON.  It should be
> impossible for the list to have entries if VMX is already disabled, and
> even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
> processing the list is pointless even if it somehow isn't empty.
> 
> Assuming no existing KVM bugs, this should be a glorified nop.  The
> primary motivation for the change is to avoid having code that looks like
> it does VMCLEAR, but then skips VMXOFF, which is nonsensical.
> 
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5d21931842a5..0ef5ede9cb7c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -773,12 +773,20 @@ static void vmx_emergency_disable(void)
>  
>  	kvm_rebooting = true;
>  
> +	/*
> +	 * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be
> +	 * set in task context.  If this races with VMX being disabled by an
> +	 * NMI, VMCLEAR and VMXOFF may #UD, but KVM will eat those faults due
> +	 * to kvm_rebooting being set.
> +	 */

I am not quite following this comment.  IIUC this code path is only called from
NMI context in case of emergency VMX disable.  How can it race with "VMX being
disabled by an NMI"?  Shouldn't it be the normal vmx_hardware_disable() that
may race with an NMI, not this one?

> +	if (!(__read_cr4() & X86_CR4_VMXE))
> +		return;
> +
>  	list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
>  			    loaded_vmcss_on_cpu_link)
>  		vmcs_clear(v->vmcs);
>  
> -	if (__read_cr4() & X86_CR4_VMXE)
> -		kvm_cpu_vmxoff();
> +	kvm_cpu_vmxoff();
>  }
>  
>  static void __loaded_vmcs_clear(void *arg)

Anyway, the actual code change LGTM:

Reviewed-by: Kai Huang <kai.huang@intel.com>
  
Sean Christopherson July 25, 2023, 6:15 p.m. UTC | #2
On Tue, Jul 25, 2023, Kai Huang wrote:
> On Fri, 2023-07-21 at 13:18 -0700, Sean Christopherson wrote:
> > Bail from vmx_emergency_disable() without processing the list of loaded
> > VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON.  It should be
> > impossible for the list to have entries if VMX is already disabled, and
> > even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
> > processing the list is pointless even if it somehow isn't empty.
> > 
> > Assuming no existing KVM bugs, this should be a glorified nop.  The
> > primary motivation for the change is to avoid having code that looks like
> > it does VMCLEAR, but then skips VMXOFF, which is nonsensical.
> > 
> > Suggested-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 5d21931842a5..0ef5ede9cb7c 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -773,12 +773,20 @@ static void vmx_emergency_disable(void)
> >  
> >  	kvm_rebooting = true;
> >  
> > +	/*
> > +	 * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be
> > +	 * set in task context.  If this races with VMX being disabled by an
> > +	 * NMI, VMCLEAR and VMXOFF may #UD, but KVM will eat those faults due
> > +	 * to kvm_rebooting being set.
> > +	 */
> 
> I am not quite following this comment.  IIUC this code path is only called from
> NMI context in case of emergency VMX disable.

The CPU that initiates the emergency reboot can invoke the callback from process
context; only responding CPUs are guaranteed to be handled via NMI shootdown.
E.g. `reboot -f` will reach this point synchronously.

> How can it race with "VMX being disabled by an NMI"?

Somewhat theoretically, a different CPU could panic() and do a shootdown of the
CPU that is handling `reboot -f`.
  
Kai Huang July 25, 2023, 10:20 p.m. UTC | #3
On Tue, 2023-07-25 at 11:15 -0700, Sean Christopherson wrote:
> On Tue, Jul 25, 2023, Kai Huang wrote:
> > On Fri, 2023-07-21 at 13:18 -0700, Sean Christopherson wrote:
> > > Bail from vmx_emergency_disable() without processing the list of loaded
> > > VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON.  It should be
> > > impossible for the list to have entries if VMX is already disabled, and
> > > even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
> > > processing the list is pointless even if it somehow isn't empty.
> > > 
> > > Assuming no existing KVM bugs, this should be a glorified nop.  The
> > > primary motivation for the change is to avoid having code that looks like
> > > it does VMCLEAR, but then skips VMXOFF, which is nonsensical.
> > > 
> > > Suggested-by: Kai Huang <kai.huang@intel.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/vmx/vmx.c | 12 ++++++++++--
> > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 5d21931842a5..0ef5ede9cb7c 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -773,12 +773,20 @@ static void vmx_emergency_disable(void)
> > >  
> > >  	kvm_rebooting = true;
> > >  
> > > +	/*
> > > +	 * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be
> > > +	 * set in task context.  If this races with VMX being disabled by an
> > > +	 * NMI, VMCLEAR and VMXOFF may #UD, but KVM will eat those faults due
> > > +	 * to kvm_rebooting being set.
> > > +	 */
> > 
> > I am not quite following this comment.  IIUC this code path is only called from
> > NMI context in case of emergency VMX disable.
> 
> The CPU that initiates the emergency reboot can invoke the callback from process
> context, only responding CPUs are guaranteed to be handled via NMI shootdown.
> E.g. `reboot -f` will reach this point synchronously.
> 
> > How can it race with "VMX being disabled by an NMI"?
> 
> Somewhat theoretically, a different CPU could panic() and do a shootdown of the
> CPU that is handling `reboot -f`.

Yeah this is the only case I can think of too.

Anyway, LGTM.  Thanks for explaining.
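
For context on the "KVM will eat those faults" remark: a VMX instruction that
faults in KVM is routed through an exception fixup that lands in
kvm_spurious_fault(), which only complains if kvm_rebooting is not set.  Below
is a heavily condensed sketch of that mechanism; the real plumbing lives in
arch/x86/kvm/vmx/vmx_ops.h and arch/x86/kvm/x86.c, and the asm here is
approximated (VM-fail handling is omitted entirely).

/* Set by vmx_emergency_disable() before it touches any VMX state. */
extern bool kvm_rebooting;

/* Exception fixup target for faulting VMX instructions. */
void kvm_spurious_fault(void)
{
	/* Faulting while not rebooting is a real bug, so be loud about it. */
	BUG_ON(!kvm_rebooting);
}

static void vmcs_clear(struct vmcs *vmcs)
{
	u64 phys_addr = __pa(vmcs);

	/*
	 * If VMCLEAR #UDs, e.g. because VMX is already off, the extable
	 * entry redirects execution to the fault label instead of an oops.
	 */
	asm goto("1: vmclear %0\n\t"
		 _ASM_EXTABLE(1b, %l[fault])
		 : : "m" (phys_addr) : "cc" : fault);
	return;
fault:
	kvm_spurious_fault();
}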
  

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5d21931842a5..0ef5ede9cb7c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -773,12 +773,20 @@ static void vmx_emergency_disable(void)
 
 	kvm_rebooting = true;
 
+	/*
+	 * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be
+	 * set in task context.  If this races with VMX being disabled by an
+	 * NMI, VMCLEAR and VMXOFF may #UD, but KVM will eat those faults due
+	 * to kvm_rebooting being set.
+	 */
+	if (!(__read_cr4() & X86_CR4_VMXE))
+		return;
+
 	list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
 			    loaded_vmcss_on_cpu_link)
 		vmcs_clear(v->vmcs);
 
-	if (__read_cr4() & X86_CR4_VMXE)
-		kvm_cpu_vmxoff();
+	kvm_cpu_vmxoff();
 }
 
 static void __loaded_vmcs_clear(void *arg)
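
For reference, vmx_emergency_disable() as it reads with this patch applied.
The local variable declarations sit above the hunk and are not part of the
diff, so they are sketched here and may differ slightly from the actual file:

static void vmx_emergency_disable(void)
{
	int cpu = raw_smp_processor_id();
	struct loaded_vmcs *v;

	kvm_rebooting = true;

	/*
	 * Note, CR4.VMXE can be _cleared_ in NMI context, but it can only be
	 * set in task context.  If this races with VMX being disabled by an
	 * NMI, VMCLEAR and VMXOFF may #UD, but KVM will eat those faults due
	 * to kvm_rebooting being set.
	 */
	if (!(__read_cr4() & X86_CR4_VMXE))
		return;

	list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
			    loaded_vmcss_on_cpu_link)
		vmcs_clear(v->vmcs);

	kvm_cpu_vmxoff();
}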