[kernel,v5,5/6] KVM: SEV: Enable data breakpoints in SEV-ES

Message ID 20230411125718.2297768-6-aik@amd.com
State New
Headers
Series KVM: SEV: Enable AMD SEV-ES DebugSwap |

Commit Message

Alexey Kardashevskiy April 11, 2023, 12:57 p.m. UTC
  Prior to SEV-ES, KVM saved/restored host debug registers upon switching
to/from a VM. Changing those registers inside a running SEV VM
triggered #VMEXIT to KVM.

SEV-ES added encrypted state (ES) which uses an encrypted page
for the VM state (VMSA). The hardware saves/restores certain registers
on VMRUN/VMEXIT according to a swap type (A, B, C), see
"Table B-3. Swap Types" in the AMD Architecture Programmer’s Manual
volume 2.

The DR6 and DR7 registers have always been swapped as Type A for SEV-ES
guests, but a new feature is available, identified via
CPUID Fn8000001F_EAX[14] "DebugSwap for SEV-ES guests", that provides
support for swapping additional debug registers. DR[0-3] and
DR[0-3]_ADDR_MASK are swapped as Type B when SEV_FEATURES[5] (DebugSwap)
is set.

Enable DebugSwap for a VMSA but only do so if CPUID Fn80000021_EAX[0]
("NoNestedDataBp", "Processor ignores nested data breakpoints") is
supported by the SOC as otherwise a malicious SEV-ES guest can set up
data breakpoints on the #DB IDT entry/stack and cause an infinite loop.
Set the features bit in sev_es_sync_vmsa() which is the last point
when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
init happens) is called not only when VCPU is initialized but also on
intrahost migration when VMSA is encrypted.

Eliminate DR7 and #DB intercepts as:
- they are not needed when DebugSwap is supported;
- #VC for these intercepts is most likely not supported anyway and
kills the VM.

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
---
Changes:
v8:
* added CPUID's DebugSwap feature
* commit log, comments updated
* redid the whole thing

v4:
* removed sev_es_is_debug_swap_enabled() helper
* made sev_es_debug_swap_enabled (module param) static
* set sev_feature early in sev_es_init_vmcb() and made intercepts
  dependend on it vs. module param
* move set_/clr_dr_intercepts to .c

v3:
* rewrote the commit log again
* rebased on tip/master to use recently defined X86_FEATURE_NO_NESTED_DATA_BP
* s/boot_cpu_has/cpu_feature_enabled/

v2:
* debug_swap moved from vcpu to module_param
* rewrote commit log

---
Tested with:
===
int x;
int main(int argc, char *argv[])
{
        x = 1;
        return 0;
}
===
gcc -g a.c
rsync a.out ruby-954vm:~/
ssh -t ruby-954vm 'gdb -ex "file a.out" -ex "watch x" -ex r'

where ruby-954vm is a VM.

With "/sys/module/kvm_amd/parameters/debug_swap = 0", gdb does not stop
on the watchpoint, with "= 1" - gdb does.
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/include/asm/svm.h         |  1 +
 arch/x86/kvm/svm/sev.c             | 36 ++++++++++++++++++--
 3 files changed, 35 insertions(+), 3 deletions(-)
  

Comments

Gupta, Pankaj May 9, 2023, 10:58 a.m. UTC | #1
> Prior to SEV-ES, KVM saved/restored host debug registers upon switching
> to/from a VM. Changing those registers inside a running SEV VM
> triggered #VMEXIT to KVM.
> 
> SEV-ES added encrypted state (ES) which uses an encrypted page
> for the VM state (VMSA). The hardware saves/restores certain registers
> on VMRUN/VMEXIT according to a swap type (A, B, C), see
> "Table B-3. Swap Types" in the AMD Architecture Programmer’s Manual
> volume 2.
> 
> The DR6 and DR7 registers have always been swapped as Type A for SEV-ES
> guests, but a new feature is available, identified via
> CPUID Fn8000001F_EAX[14] "DebugSwap for SEV-ES guests", that provides
> support for swapping additional debug registers. DR[0-3] and
> DR[0-3]_ADDR_MASK are swapped as Type B when SEV_FEATURES[5] (DebugSwap)
> is set.
> 
> Enable DebugSwap for a VMSA but only do so if CPUID Fn80000021_EAX[0]
> ("NoNestedDataBp", "Processor ignores nested data breakpoints") is
> supported by the SOC as otherwise a malicious SEV-ES guest can set up
> data breakpoints on the #DB IDT entry/stack and cause an infinite loop.

You mean #DB => #BP here?

> Set the features bit in sev_es_sync_vmsa() which is the last point
> when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
> init happens) is called not only when VCPU is initialized but also on
> intrahost migration when VMSA is encrypted.
> 
> Eliminate DR7 and #DB intercepts as:
> - they are not needed when DebugSwap is supported;
> - #VC for these intercepts is most likely not supported anyway and
> kills the VM.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---
> Changes:
> v8:
> * added CPUID's DebugSwap feature
> * commit log, comments updated
> * redid the whole thing
> 
> v4:
> * removed sev_es_is_debug_swap_enabled() helper
> * made sev_es_debug_swap_enabled (module param) static
> * set sev_feature early in sev_es_init_vmcb() and made intercepts
>    dependend on it vs. module param
> * move set_/clr_dr_intercepts to .c
> 
> v3:
> * rewrote the commit log again
> * rebased on tip/master to use recently defined X86_FEATURE_NO_NESTED_DATA_BP
> * s/boot_cpu_has/cpu_feature_enabled/
> 
> v2:
> * debug_swap moved from vcpu to module_param
> * rewrote commit log
> 
> ---
> Tested with:
> ===
> int x;
> int main(int argc, char *argv[])
> {
>          x = 1;
>          return 0;
> }
> ===
> gcc -g a.c
> rsync a.out ruby-954vm:~/
> ssh -t ruby-954vm 'gdb -ex "file a.out" -ex "watch x" -ex r'
> 
> where ruby-954vm is a VM.
> 
> With "/sys/module/kvm_amd/parameters/debug_swap = 0", gdb does not stop
> on the watchpoint, with "= 1" - gdb does.
> ---
>   arch/x86/include/asm/cpufeatures.h |  1 +
>   arch/x86/include/asm/svm.h         |  1 +
>   arch/x86/kvm/svm/sev.c             | 36 ++++++++++++++++++--
>   3 files changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index d9c190cdefa9..3a5eeb178778 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -435,6 +435,7 @@
>   #define X86_FEATURE_SEV_ES		(19*32+ 3) /* AMD Secure Encrypted Virtualization - Encrypted State */
>   #define X86_FEATURE_V_TSC_AUX		(19*32+ 9) /* "" Virtual TSC_AUX */
>   #define X86_FEATURE_SME_COHERENT	(19*32+10) /* "" AMD hardware-enforced cache coherency */
> +#define X86_FEATURE_DEBUG_SWAP		(19*32+14) /* AMD SEV-ES full debug state swap support */
>   
>   /* AMD-defined Extended Feature 2 EAX, CPUID level 0x80000021 (EAX), word 20 */
>   #define X86_FEATURE_NO_NESTED_DATA_BP	(20*32+ 0) /* "" No Nested Data Breakpoints */
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 770dcf75eaa9..3a422d213010 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -280,6 +280,7 @@ static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_
>   #define AVIC_HPA_MASK	~((0xFFFULL << 52) | 0xFFF)
>   #define VMCB_AVIC_APIC_BAR_MASK		0xFFFFFFFFFF000ULL
>   
> +#define SVM_SEV_FEAT_DEBUG_SWAP                        BIT(5)
>   
>   struct vmcb_seg {
>   	u16 selector;
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index f0885250252d..ba12e7962e94 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -22,6 +22,7 @@
>   #include <asm/pkru.h>
>   #include <asm/trapnr.h>
>   #include <asm/fpu/xcr.h>
> +#include <asm/debugreg.h>
>   
>   #include "mmu.h"
>   #include "x86.h"
> @@ -53,9 +54,14 @@ module_param_named(sev, sev_enabled, bool, 0444);
>   /* enable/disable SEV-ES support */
>   static bool sev_es_enabled = true;
>   module_param_named(sev_es, sev_es_enabled, bool, 0444);
> +
> +/* enable/disable SEV-ES DebugSwap support */
> +static bool sev_es_debug_swap_enabled = true;
> +module_param_named(debug_swap, sev_es_debug_swap_enabled, bool, 0444);
>   #else
>   #define sev_enabled false
>   #define sev_es_enabled false
> +#define sev_es_debug_swap_enabled false
>   #endif /* CONFIG_KVM_AMD_SEV */
>   
>   static u8 sev_enc_bit;
> @@ -605,6 +611,9 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
>   	save->xss  = svm->vcpu.arch.ia32_xss;
>   	save->dr6  = svm->vcpu.arch.dr6;
>   
> +	if (sev_es_debug_swap_enabled)
> +		save->sev_features |= SVM_SEV_FEAT_DEBUG_SWAP;
> +
>   	pr_debug("Virtual Machine Save Area (VMSA):\n");
>   	print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false);
>   
> @@ -2256,6 +2265,9 @@ void __init sev_hardware_setup(void)
>   out:
>   	sev_enabled = sev_supported;
>   	sev_es_enabled = sev_es_supported;
> +	if (!sev_es_enabled || !cpu_feature_enabled(X86_FEATURE_DEBUG_SWAP) ||
> +	    !cpu_feature_enabled(X86_FEATURE_NO_NESTED_DATA_BP))
> +		sev_es_debug_swap_enabled = false;
>   #endif
>   }
>   
> @@ -2976,14 +2988,20 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
>   	svm_set_intercept(svm, TRAP_CR8_WRITE);
>   
>   	/*
> +	 * Unless DebugSwap (depends on X86_FEATURE_NO_NESTED_DATA_BP) is enabled,
>   	 * DR7 access must remain intercepted for an SEV-ES guest to disallow
>   	 * the guest kernel enable debugging as otherwise a VM writing to DR7
>   	 * from the #DB handler may trigger infinite loop of #DB's.
>   	 */
>   	vmcb->control.intercepts[INTERCEPT_DR] = 0;
> -	vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ);
> -	vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE);
> -	recalc_intercepts(svm);
> +	if (sev_es_debug_swap_enabled) {
> +		clr_exception_intercept(svm, DB_VECTOR);
> +		/* clr_exception_intercept() called recalc_intercepts() */
> +	} else {
> +		vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ);
> +		vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE);
> +		recalc_intercepts(svm);
> +	}
>   
>   	/* Can't intercept XSETBV, HV can't modify XCR0 directly */
>   	svm_clr_intercept(svm, INTERCEPT_XSETBV);
> @@ -3048,6 +3066,18 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
>   
>   	/* MSR_IA32_XSS is restored on VMEXIT, save the currnet host value */
>   	hostsa->xss = host_xss;
> +
> +	/* The DebugSwap SEV feature does Type B swaps of DR[0-3] */
> +	if (sev_es_debug_swap_enabled) {
> +		hostsa->dr0 = native_get_debugreg(0);
> +		hostsa->dr1 = native_get_debugreg(1);
> +		hostsa->dr2 = native_get_debugreg(2);
> +		hostsa->dr3 = native_get_debugreg(3);
> +		hostsa->dr0_addr_mask = amd_get_dr_addr_mask(0);
> +		hostsa->dr1_addr_mask = amd_get_dr_addr_mask(1);
> +		hostsa->dr2_addr_mask = amd_get_dr_addr_mask(2);
> +		hostsa->dr3_addr_mask = amd_get_dr_addr_mask(3);
> +	}
>   }
>   
>   void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
  
Gupta, Pankaj May 10, 2023, 9:35 a.m. UTC | #2
>> Enable DebugSwap for a VMSA but only do so if CPUID Fn80000021_EAX[0]
>> ("NoNestedDataBp", "Processor ignores nested data breakpoints") is
>> supported by the SOC as otherwise a malicious SEV-ES guest can set up
>> data breakpoints on the #DB IDT entry/stack and cause an infinite loop.
> 
> You mean #DB => #BP here
Indeed its #DB. Was thinking something else.

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
  
Sean Christopherson May 22, 2023, 11:39 p.m. UTC | #3
On Tue, Apr 11, 2023, Alexey Kardashevskiy wrote:
> Prior to SEV-ES, KVM saved/restored host debug registers upon switching
> to/from a VM. Changing those registers inside a running SEV VM
> triggered #VMEXIT to KVM.

Please, please don't make it sound like some behavior is *the* one and only behavior.
*KVM* *chooses* to intercept debug register accesses.  Though I would omit this
paragraph (and largely the next) entirely, IMO it's safe to assume the reader has
a basic understanding of how KVM deals with DRs and #DBs. 

> SEV-ES added encrypted state (ES) which uses an encrypted page
> for the VM state (VMSA). The hardware saves/restores certain registers
> on VMRUN/VMEXIT according to a swap type (A, B, C), see
> "Table B-3. Swap Types" in the AMD Architecture Programmer’s Manual
> volume 2.
> 
> The DR6 and DR7 registers have always been swapped as Type A for SEV-ES

Please rewrite this to just state what the behavior is instead of "Type A" vs.
"Type B".  Outside of AMD, the "type a/b/c" stuff isn't anywhere near ubiquitous
enough to justify obfuscating super simple concepts.

Actually, this feature really has nothing to do with Type A vs. Type B, since
that's purely about host state.  I assume the switch from Type A to Type B is a
side effect, or an opportunistic optimization?

Regardless, something like this is a lot easier for a human to read.  It's easy
enough to find DebugSwap in the APM (literally took me longer to find my copy of
the PDF).

  Add support for "DebugSwap for SEV-ES guests", which provides support
  for swapping DR[0-3] and DR[0-3]_ADDR_MASK on VMRUN and VMEXIT, i.e.
  allows KVM to expose debug capabilities to SEV-ES guests.  Without
  DebugSwap support, the CPU doesn't save/load _guest_ debug registers,
  and KVM cannot manually context switch guest DRs due the VMSA being
  encrypted.

  Enable DebugSwap if and only if the CPU also supports NoNestedDataBp,
  which causes the CPU to ignore nested #DBs, i.e. #DBs that occur when
  vectoring a #DB.  Without NoNestedDataBp, a malicious guest can DoS
  the host by putting the CPU into an infinite loop of vectoring #DBs
  (see https://bugzilla.redhat.com/show_bug.cgi?id=1278496)

  Set the features bit in sev_es_sync_vmsa() which is the last point
  when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
  init happens) is called not only when VCPU is initialized but also on
  intrahost migration when VMSA is encrypted.

> guests, but a new feature is available, identified via
> CPUID Fn8000001F_EAX[14] "DebugSwap for SEV-ES guests", that provides
> support for swapping additional debug registers. DR[0-3] and
> DR[0-3]_ADDR_MASK are swapped as Type B when SEV_FEATURES[5] (DebugSwap)
> is set.
> 
> Enable DebugSwap for a VMSA but only do so if CPUID Fn80000021_EAX[0]
> ("NoNestedDataBp", "Processor ignores nested data breakpoints") is
> supported by the SOC as otherwise a malicious SEV-ES guest can set up
> data breakpoints on the #DB IDT entry/stack and cause an infinite loop.
> Set the features bit in sev_es_sync_vmsa() which is the last point
> when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
> init happens) is called not only when VCPU is initialized but also on
> intrahost migration when VMSA is encrypted.
> 
> Eliminate DR7 and #DB intercepts as:
> - they are not needed when DebugSwap is supported;

"not needed" isn't sufficient justification.  KVM doesn't strictly need to do a
lot of things, but does them anyways for a variety of reasons.  E.g. #DB intercept
is also not needed when NoNestedDataBp is supported and vcpu->guest_debug==0, i.e.
this should clarify why KVM doesn't simply disable #DB intercepts for _all_ VMs
when NoNestedDataBp is support.  Presumably the answer is "because it's simpler
than toggling #DB interception for guest_debug.

Actually, can't disabling #DB interception for DebugSwap SEV-ES guests be a
separate patch?  KVM can still inject #DBs for SEV-ES guests, no?

As for DR7, the real justification is that, as above, KVM can't modify guest DR7,
and intercepting DR7 would completely defeat the purpose of enabling DebugSwap.

> - #VC for these intercepts is most likely not supported anyway and
> kills the VM.

I don't see how this is relevant or helpful.  What the guest may or may not do in
response to a #VC on a DR7 write has no bearing on KVM's behavior. 

> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> ---

...

> @@ -3048,6 +3066,18 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
>  
>  	/* MSR_IA32_XSS is restored on VMEXIT, save the currnet host value */
>  	hostsa->xss = host_xss;
> +
> +	/* The DebugSwap SEV feature does Type B swaps of DR[0-3] */

Since dangling a carrot didn't work[*], I'm going to resort to extortion :-)
Can you fold the below somewhere before this patch, and then tweak this comment
to say something like:

	/*
	 * If DebugSwap is enabled, debug registers are loaded but NOT saved by
	 * the CPU (Type-B).  If DebugSwap is disabled/unsupported, the CPU both
	 * saves and loads debug registers (Type-A).
	 */

[*] https://lore.kernel.org/all/YzOTOzUzKPQSqURo@google.com/

--
From: Sean Christopherson <seanjc@google.com>
Date: Mon, 22 May 2023 16:29:52 -0700
Subject: [PATCH] KVM: SVM: Rewrite sev_es_prepare_switch_to_guest()'s comment
 about swap types

Rewrite the comment(s) in sev_es_prepare_switch_to_guest() to explain the
swap types employed by the CPU for SEV-ES guests, i.e. to explain why KVM
needs to save a seemingly random subset of host state, and to provide a
decoder for the APM's Type-A/B/C terminology.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/sev.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 69ae5e1b3120..1e0e9b08e491 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3017,19 +3017,24 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm)
 void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
 {
 	/*
-	 * As an SEV-ES guest, hardware will restore the host state on VMEXIT,
-	 * of which one step is to perform a VMLOAD.  KVM performs the
-	 * corresponding VMSAVE in svm_prepare_guest_switch for both
-	 * traditional and SEV-ES guests.
+	 * All host state for SEV-ES guests is categorized into three swap types
+	 * based on how it is handled by hardware during a world switch:
+	 *
+	 * A: VMRUN:   Host state saved in host save area
+	 *    VMEXIT:  Host state loaded from host save area
+	 *
+	 * B: VMRUN:   Host state _NOT_ saved in host save area
+	 *    VMEXIT:  Host state loaded from host save area
+	 *
+	 * C: VMRUN:   Host state _NOT_ saved in host save area
+	 *    VMEXIT:  Host state initialized to default(reset) values
+	 *
+	 * Manually save type-B state, i.e. state that is loaded by VMEXIT but
+	 * isn't saved by VMRUN, that isn't already saved by VMSAVE (performed
+	 * by common SVM code).
 	 */
-
-	/* XCR0 is restored on VMEXIT, save the current host value */
 	hostsa->xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
-
-	/* PKRU is restored on VMEXIT, save the current host value */
 	hostsa->pkru = read_pkru();
-
-	/* MSR_IA32_XSS is restored on VMEXIT, save the currnet host value */
 	hostsa->xss = host_xss;
 }
 

base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
--
  
Alexey Kardashevskiy May 23, 2023, 11:33 a.m. UTC | #4
On 23/5/23 09:39, Sean Christopherson wrote:
> On Tue, Apr 11, 2023, Alexey Kardashevskiy wrote:
>> Prior to SEV-ES, KVM saved/restored host debug registers upon switching
>> to/from a VM. Changing those registers inside a running SEV VM
>> triggered #VMEXIT to KVM.
> 
> Please, please don't make it sound like some behavior is *the* one and only behavior.
> *KVM* *chooses* to intercept debug register accesses.  Though I would omit this
> paragraph (and largely the next) entirely, IMO it's safe to assume the reader has
> a basic understanding of how KVM deals with DRs and #DBs.

Out of curiosity - is ARM similar in this regard, interceptions and stuff?

>> SEV-ES added encrypted state (ES) which uses an encrypted page
>> for the VM state (VMSA). The hardware saves/restores certain registers
>> on VMRUN/VMEXIT according to a swap type (A, B, C), see
>> "Table B-3. Swap Types" in the AMD Architecture Programmer’s Manual
>> volume 2.
>>
>> The DR6 and DR7 registers have always been swapped as Type A for SEV-ES
> 
> Please rewrite this to just state what the behavior is instead of "Type A" vs.
> "Type B".  Outside of AMD, the "type a/b/c" stuff isn't anywhere near ubiquitous
> enough to justify obfuscating super simple concepts.
> 
> Actually, this feature really has nothing to do with Type A vs. Type B, since
> that's purely about host state. 

Mmm. Nothing? If the feature is enabled and DR[0-3] are not saved in 
HOSTSA, then the host looses debug state because of the working feature.

> I assume the switch from Type A to Type B is a
> side effect, or an opportunistic optimization?

There is no switch. DR[67] were and are one type, and the other DRs were 
not swapped and now are, but with a different Swap Type.

And the patch saves DR[0-3] in HOSTSA but not DR[67] and this deserves 
some explaining why is that.

> Regardless, something like this is a lot easier for a human to read.  It's easy
> enough to find DebugSwap in the APM (literally took me longer to find my copy of
> the PDF).

It is also easy to find "Swap Types" in the APM...

>    Add support for "DebugSwap for SEV-ES guests", which provides support
>    for swapping DR[0-3] and DR[0-3]_ADDR_MASK on VMRUN and VMEXIT, i.e.
>    allows KVM to expose debug capabilities to SEV-ES guests.  Without
>    DebugSwap support, the CPU doesn't save/load _guest_ debug registers,

But it does save/load DR6 and DR7.

>    and KVM cannot manually context switch guest DRs due the VMSA being
>    encrypted.
> 
>    Enable DebugSwap if and only if the CPU also supports NoNestedDataBp,
>    which causes the CPU to ignore nested #DBs, i.e. #DBs that occur when
>    vectoring a #DB. 

(english question) What does "vectoring" mean here precisely? Handling? 
Processing?

> Without NoNestedDataBp, a malicious guest can DoS
>    the host by putting the CPU into an infinite loop of vectoring #DBs
>    (see https://bugzilla.redhat.com/show_bug.cgi?id=1278496)

Good one, thanks for the link.

> 
>    Set the features bit in sev_es_sync_vmsa() which is the last point
>    when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
>    init happens) is called not only when VCPU is initialized but also on
>    intrahost migration when VMSA is encrypted.
> 
>> guests, but a new feature is available, identified via
>> CPUID Fn8000001F_EAX[14] "DebugSwap for SEV-ES guests", that provides
>> support for swapping additional debug registers. DR[0-3] and
>> DR[0-3]_ADDR_MASK are swapped as Type B when SEV_FEATURES[5] (DebugSwap)
>> is set.
>>
>> Enable DebugSwap for a VMSA but only do so if CPUID Fn80000021_EAX[0]
>> ("NoNestedDataBp", "Processor ignores nested data breakpoints") is
>> supported by the SOC as otherwise a malicious SEV-ES guest can set up
>> data breakpoints on the #DB IDT entry/stack and cause an infinite loop.
>> Set the features bit in sev_es_sync_vmsa() which is the last point
>> when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
>> init happens) is called not only when VCPU is initialized but also on
>> intrahost migration when VMSA is encrypted.
>>
>> Eliminate DR7 and #DB intercepts as:
>> - they are not needed when DebugSwap is supported;
> 
> "not needed" isn't sufficient justification.  KVM doesn't strictly need to do a
> lot of things, but does them anyways for a variety of reasons.  E.g. #DB intercept
> is also not needed when NoNestedDataBp is supported and vcpu->guest_debug==0, i.e.
> this should clarify why KVM doesn't simply disable #DB intercepts for _all_ VMs
> when NoNestedDataBp is support.  Presumably the answer is "because it's simpler
> than toggling #DB interception for guest_debug.

TBH I did not have a good answer but from the above I'd say we want to 
disable #DB intercepts in a separate patch, just as you say below.

> Actually, can't disabling #DB interception for DebugSwap SEV-ES guests be a
> separate patch?  KVM can still inject #DBs for SEV-ES guests, no?

Sorry for my ignorance but what is the point of injecting #DB if there 
is no way of changing the guest's DR7?


> As for DR7, the real justification is that, as above, KVM can't modify guest DR7,
> and intercepting DR7 would completely defeat the purpose of enabling DebugSwap.
> 
>> - #VC for these intercepts is most likely not supported anyway and
>> kills the VM.
> 
> I don't see how this is relevant or helpful.  What the guest may or may not do in
> response to a #VC on a DR7 write has no bearing on KVM's behavior.

Readers of the patch may wonder of the chances of breaks guests. 
Definitely helps me to realize there is likely to be some sort of 
panic() in the guest if such intercept happens.


>> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
>> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
>> ---
> 
> ...
> 
>> @@ -3048,6 +3066,18 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
>>   
>>   	/* MSR_IA32_XSS is restored on VMEXIT, save the currnet host value */
>>   	hostsa->xss = host_xss;
>> +
>> +	/* The DebugSwap SEV feature does Type B swaps of DR[0-3] */
> 
> Since dangling a carrot didn't work[*], I'm going to resort to extortion :-)
> Can you fold the below somewhere before this patch, and then tweak this comment
> to say something like:
> 
> 	/*
> 	 * If DebugSwap is enabled, debug registers are loaded but NOT saved by
> 	 * the CPU (Type-B).  If DebugSwap is disabled/unsupported, the CPU both
> 	 * saves and loads debug registers (Type-A).
> 	 */

Sure but...

> 
> [*] https://lore.kernel.org/all/YzOTOzUzKPQSqURo@google.com/
> 
> 
> From: Sean Christopherson <seanjc@google.com>
> Date: Mon, 22 May 2023 16:29:52 -0700
> Subject: [PATCH] KVM: SVM: Rewrite sev_es_prepare_switch_to_guest()'s comment
>   about swap types


... I am missing the point here. You already wrote the patch below which 
1) you are happy about 2) you can push out right away to the upstream. 
Are you suggesting that I repost it?


> 
> Rewrite the comment(s) in sev_es_prepare_switch_to_guest() to explain the
> swap types employed by the CPU for SEV-ES guests, i.e. to explain why KVM
> needs to save a seemingly random subset of host state, and to provide a
> decoder for the APM's Type-A/B/C terminology.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/sev.c | 25 +++++++++++++++----------
>   1 file changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 69ae5e1b3120..1e0e9b08e491 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3017,19 +3017,24 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm)
>   void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
>   {
>   	/*
> -	 * As an SEV-ES guest, hardware will restore the host state on VMEXIT,
> -	 * of which one step is to perform a VMLOAD. KVM performs the
> -	 * corresponding VMSAVE in svm_prepare_guest_switch for both
> -	 * traditional and SEV-ES guests.


When I see "traditional", I assume there was one single x86 KVM before 
say 2010 which was slow, emulated a lot but worked exactly the same on 
Intel and AMD. Which does not seem to ever be the case. May be say "SVM" 
here?


> +	 * All host state for SEV-ES guests is categorized into three swap types
> +	 * based on how it is handled by hardware during a world switch:
> +	 *
> +	 * A: VMRUN:   Host state saved in host save area
> +	 *    VMEXIT:  Host state loaded from host save area
> +	 *
> +	 * B: VMRUN:   Host state _NOT_ saved in host save area
> +	 *    VMEXIT:  Host state loaded from host save area
> +	 *
> +	 * C: VMRUN:   Host state _NOT_ saved in host save area
> +	 *    VMEXIT:  Host state initialized to default(reset) values
> +	 *
> +	 * Manually save type-B state, i.e. state that is loaded by VMEXIT but
> +	 * isn't saved by VMRUN, that isn't already saved by VMSAVE (performed
> +	 * by common SVM code).

Like here - "common SVM"?

Thanks for the comments!


>   	 */
> -
> -	/* XCR0 is restored on VMEXIT, save the current host value */
>   	hostsa->xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
> -
> -	/* PKRU is restored on VMEXIT, save the current host value */
>   	hostsa->pkru = read_pkru();
> -
> -	/* MSR_IA32_XSS is restored on VMEXIT, save the currnet host value */
>   	hostsa->xss = host_xss;
>   }
>   
> 
> base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
  
Sean Christopherson May 23, 2023, 3:44 p.m. UTC | #5
On Tue, May 23, 2023, Alexey Kardashevskiy wrote:
> 
> 
> On 23/5/23 09:39, Sean Christopherson wrote:
> > On Tue, Apr 11, 2023, Alexey Kardashevskiy wrote:
> > > Prior to SEV-ES, KVM saved/restored host debug registers upon switching
> > > to/from a VM. Changing those registers inside a running SEV VM
> > > triggered #VMEXIT to KVM.
> > 
> > Please, please don't make it sound like some behavior is *the* one and only behavior.
> > *KVM* *chooses* to intercept debug register accesses.  Though I would omit this
> > paragraph (and largely the next) entirely, IMO it's safe to assume the reader has
> > a basic understanding of how KVM deals with DRs and #DBs.
> 
> Out of curiosity - is ARM similar in this regard, interceptions and stuff?

Yes.  The granularity of interceptions varies based on the architectural revision,
and presumably there are things that always trap.  But the core concept of letting
software decide what to intercept is the same.

> > > SEV-ES added encrypted state (ES) which uses an encrypted page
> > > for the VM state (VMSA). The hardware saves/restores certain registers
> > > on VMRUN/VMEXIT according to a swap type (A, B, C), see
> > > "Table B-3. Swap Types" in the AMD Architecture Programmer’s Manual
> > > volume 2.
> > > 
> > > The DR6 and DR7 registers have always been swapped as Type A for SEV-ES
> > 
> > Please rewrite this to just state what the behavior is instead of "Type A" vs.
> > "Type B".  Outside of AMD, the "type a/b/c" stuff isn't anywhere near ubiquitous
> > enough to justify obfuscating super simple concepts.
> > 
> > Actually, this feature really has nothing to do with Type A vs. Type B, since
> > that's purely about host state.
> 
> Mmm. Nothing? If the feature is enabled and DR[0-3] are not saved in HOSTSA,
> then the host looses debug state because of the working feature.
> 
> > I assume the switch from Type A to Type B is a
> > side effect, or an opportunistic optimization?
> 
> There is no switch. DR[67] were and are one type, and the other DRs were not
> swapped and now are, but with a different Swap Type.

Ah, this is what I missed.

> And the patch saves DR[0-3] in HOSTSA but not DR[67] and this deserves some
> explaining why is that.
> 
> > Regardless, something like this is a lot easier for a human to read.  It's easy
> > enough to find DebugSwap in the APM (literally took me longer to find my copy of
> > the PDF).
> 
> It is also easy to find "Swap Types" in the APM...

Redirecting readers to specs for gory details is fine.  Redirecting for basic,
simple functionality is not.  Maybe the swap types will someday be common knowledge
in KVM, and an explanation will no longer be necessary, but we are nowhere near
that point.

> >    Add support for "DebugSwap for SEV-ES guests", which provides support
> >    for swapping DR[0-3] and DR[0-3]_ADDR_MASK on VMRUN and VMEXIT, i.e.
> >    allows KVM to expose debug capabilities to SEV-ES guests.  Without
> >    DebugSwap support, the CPU doesn't save/load _guest_ debug registers,
> 
> But it does save/load DR6 and DR7.
> 
> >    and KVM cannot manually context switch guest DRs due the VMSA being
> >    encrypted.
> > 
> >    Enable DebugSwap if and only if the CPU also supports NoNestedDataBp,
> >    which causes the CPU to ignore nested #DBs, i.e. #DBs that occur when
> >    vectoring a #DB.
> 
> (english question) What does "vectoring" mean here precisely? Handling?
> Processing?

It's not really an English thing.  "Vectoring" is, or at least was, Intel terminology
for describing the flow from the initial detection of an exception to the exception's
delivery to software, e.g. covers the IDT lookup, any GDT/LDT lookups, pushing
information on the stack, fetching the software exception handler, etc.  Importantly,
it describes the period where there are no instructions retired and thus ucode doesn't
open event windows, i.e. where triggering another, non-contributory exception will
effectively "hang" the CPU (at least on CPUs without Intel's "notify" VM-Exit support).

> >    the host by putting the CPU into an infinite loop of vectoring #DBs
> >    (see https://bugzilla.redhat.com/show_bug.cgi?id=1278496)
> 
> Good one, thanks for the link.
> 
> > 
> >    Set the features bit in sev_es_sync_vmsa() which is the last point
> >    when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
> >    init happens) is called not only when VCPU is initialized but also on
> >    intrahost migration when VMSA is encrypted.
> > 
> > > guests, but a new feature is available, identified via
> > > CPUID Fn8000001F_EAX[14] "DebugSwap for SEV-ES guests", that provides
> > > support for swapping additional debug registers. DR[0-3] and
> > > DR[0-3]_ADDR_MASK are swapped as Type B when SEV_FEATURES[5] (DebugSwap)
> > > is set.
> > > 
> > > Enable DebugSwap for a VMSA but only do so if CPUID Fn80000021_EAX[0]
> > > ("NoNestedDataBp", "Processor ignores nested data breakpoints") is
> > > supported by the SOC as otherwise a malicious SEV-ES guest can set up
> > > data breakpoints on the #DB IDT entry/stack and cause an infinite loop.
> > > Set the features bit in sev_es_sync_vmsa() which is the last point
> > > when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
> > > init happens) is called not only when VCPU is initialized but also on
> > > intrahost migration when VMSA is encrypted.
> > > 
> > > Eliminate DR7 and #DB intercepts as:
> > > - they are not needed when DebugSwap is supported;
> > 
> > "not needed" isn't sufficient justification.  KVM doesn't strictly need to do a
> > lot of things, but does them anyways for a variety of reasons.  E.g. #DB intercept
> > is also not needed when NoNestedDataBp is supported and vcpu->guest_debug==0, i.e.
> > this should clarify why KVM doesn't simply disable #DB intercepts for _all_ VMs
> > when NoNestedDataBp is support.  Presumably the answer is "because it's simpler
> > than toggling #DB interception for guest_debug.
> 
> TBH I did not have a good answer but from the above I'd say we want to
> disable #DB intercepts in a separate patch, just as you say below.
> 
> > Actually, can't disabling #DB interception for DebugSwap SEV-ES guests be a
> > separate patch?  KVM can still inject #DBs for SEV-ES guests, no?
> 
> Sorry for my ignorance but what is the point of injecting #DB if there is no
> way of changing the guest's DR7?

Well, _injecting_ the #DB is necessary for correctness from the guest's perspective.
"What's the point of _intercepting_ #DB" is the real question.  And for SEV-ES guests
with DebugSwap, there is no point, which is why I agree that KVM should disable
interception in that case.  What I'm calling out is that disabling #Db interception
isn't _necessary_ for correctness (unless I'm missing something), which means that
it can and should go in a separate patch.

> > As for DR7, the real justification is that, as above, KVM can't modify guest DR7,
> > and intercepting DR7 would completely defeat the purpose of enabling DebugSwap.
> > 
> > > - #VC for these intercepts is most likely not supported anyway and
> > > kills the VM.
> > 
> > I don't see how this is relevant or helpful.  What the guest may or may not do in
> > response to a #VC on a DR7 write has no bearing on KVM's behavior.
> 
> Readers of the patch may wonder of the chances of breaks guests. Definitely
> helps me to realize there is likely to be some sort of panic() in the guest
> if such intercept happens.

But that's irrelevant.  Intercepting DR7 writes will break the guest regardless
of how the guest deals with the #VC.  If the guest eats the #VC and continues on,
the debug capabilities expected by the guest will still be missing, i.e. KVM has
broken the functionality of the guest.  I am total ok if the changelog describes
the _possible_ scenarios (within reason), e.g. that the guest will either panic
on an unexpected #VC or lose debug capabilities that were promised.  What I'm
objecting to is speculating on what the guest is _likely_ to do, and especially
using that speculation as justification for doing something in KVM.

> > > Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> > > Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> > > ---
> > 
> > ...
> > 
> > > @@ -3048,6 +3066,18 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
> > >   	/* MSR_IA32_XSS is restored on VMEXIT, save the currnet host value */
> > >   	hostsa->xss = host_xss;
> > > +
> > > +	/* The DebugSwap SEV feature does Type B swaps of DR[0-3] */
> > 
> > Since dangling a carrot didn't work[*], I'm going to resort to extortion :-)
> > Can you fold the below somewhere before this patch, and then tweak this comment
> > to say something like:
> > 
> > 	/*
> > 	 * If DebugSwap is enabled, debug registers are loaded but NOT saved by
> > 	 * the CPU (Type-B).  If DebugSwap is disabled/unsupported, the CPU both
> > 	 * saves and loads debug registers (Type-A).
> > 	 */
> 
> Sure but...
> 
> > 
> > [*] https://lore.kernel.org/all/YzOTOzUzKPQSqURo@google.com/
> > 
> > 
> > From: Sean Christopherson <seanjc@google.com>
> > Date: Mon, 22 May 2023 16:29:52 -0700
> > Subject: [PATCH] KVM: SVM: Rewrite sev_es_prepare_switch_to_guest()'s comment
> >   about swap types
> 
> 
> ... I am missing the point here. You already wrote the patch below which 1)
> you are happy about 2) you can push out right away to the upstream. Are you
> suggesting that I repost it?

I am asking you to include it in your series because the comment I suggested above
(for DebugSwap) loosely depends on the revamped comment for sev_es_prepare_switch_to_guest()
as a whole.  I would like to settle on the exact wording for all of the comments
in sev_es_prepare_switch_to_guest() in a single series instead of having disjoint
threads.

> > Rewrite the comment(s) in sev_es_prepare_switch_to_guest() to explain the
> > swap types employed by the CPU for SEV-ES guests, i.e. to explain why KVM
> > needs to save a seemingly random subset of host state, and to provide a
> > decoder for the APM's Type-A/B/C terminology.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/svm/sev.c | 25 +++++++++++++++----------
> >   1 file changed, 15 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 69ae5e1b3120..1e0e9b08e491 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -3017,19 +3017,24 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm)
> >   void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
> >   {
> >   	/*
> > -	 * As an SEV-ES guest, hardware will restore the host state on VMEXIT,
> > -	 * of which one step is to perform a VMLOAD. KVM performs the
> > -	 * corresponding VMSAVE in svm_prepare_guest_switch for both
> > -	 * traditional and SEV-ES guests.
> 
> 
> When I see "traditional", I assume there was one single x86 KVM before say
> 2010 which was slow, emulated a lot but worked exactly the same on Intel and
> AMD. Which does not seem to ever be the case. May be say "SVM" here?

This is the code being removed.  I agree that "traditional" is ambiguous, which
is why I want to delete it :-)

> > +	 * All host state for SEV-ES guests is categorized into three swap types
> > +	 * based on how it is handled by hardware during a world switch:
> > +	 *
> > +	 * A: VMRUN:   Host state saved in host save area
> > +	 *    VMEXIT:  Host state loaded from host save area
> > +	 *
> > +	 * B: VMRUN:   Host state _NOT_ saved in host save area
> > +	 *    VMEXIT:  Host state loaded from host save area
> > +	 *
> > +	 * C: VMRUN:   Host state _NOT_ saved in host save area
> > +	 *    VMEXIT:  Host state initialized to default(reset) values
> > +	 *
> > +	 * Manually save type-B state, i.e. state that is loaded by VMEXIT but
> > +	 * isn't saved by VMRUN, that isn't already saved by VMSAVE (performed
> > +	 * by common SVM code).
  
Alexey Kardashevskiy May 26, 2023, 3:16 a.m. UTC | #6
On 24/5/23 01:44, Sean Christopherson wrote:
> On Tue, May 23, 2023, Alexey Kardashevskiy wrote:
>>
>>
>> On 23/5/23 09:39, Sean Christopherson wrote:
>>> On Tue, Apr 11, 2023, Alexey Kardashevskiy wrote:
>>>> Prior to SEV-ES, KVM saved/restored host debug registers upon switching
>>>> to/from a VM. Changing those registers inside a running SEV VM
>>>> triggered #VMEXIT to KVM.
>>>
>>> Please, please don't make it sound like some behavior is *the* one and only behavior.
>>> *KVM* *chooses* to intercept debug register accesses.  Though I would omit this
>>> paragraph (and largely the next) entirely, IMO it's safe to assume the reader has
>>> a basic understanding of how KVM deals with DRs and #DBs.
>>
>> Out of curiosity - is ARM similar in this regard, interceptions and stuff?
> 
> Yes.  The granularity of interceptions varies based on the architectural revision,
> and presumably there are things that always trap.  But the core concept of letting
> software decide what to intercept is the same.
> 
>>>> SEV-ES added encrypted state (ES) which uses an encrypted page
>>>> for the VM state (VMSA). The hardware saves/restores certain registers
>>>> on VMRUN/VMEXIT according to a swap type (A, B, C), see
>>>> "Table B-3. Swap Types" in the AMD Architecture Programmer’s Manual
>>>> volume 2.
>>>>
>>>> The DR6 and DR7 registers have always been swapped as Type A for SEV-ES
>>>
>>> Please rewrite this to just state what the behavior is instead of "Type A" vs.
>>> "Type B".  Outside of AMD, the "type a/b/c" stuff isn't anywhere near ubiquitous
>>> enough to justify obfuscating super simple concepts.
>>>
>>> Actually, this feature really has nothing to do with Type A vs. Type B, since
>>> that's purely about host state.
>>
>> Mmm. Nothing? If the feature is enabled and DR[0-3] are not saved in HOSTSA,
>> then the host looses debug state because of the working feature.
>>
>>> I assume the switch from Type A to Type B is a
>>> side effect, or an opportunistic optimization?
>>
>> There is no switch. DR[67] were and are one type, and the other DRs were not
>> swapped and now are, but with a different Swap Type.
> 
> Ah, this is what I missed.
> 
>> And the patch saves DR[0-3] in HOSTSA but not DR[67] and this deserves some
>> explaining why is that.
>>
>>> Regardless, something like this is a lot easier for a human to read.  It's easy
>>> enough to find DebugSwap in the APM (literally took me longer to find my copy of
>>> the PDF).
>>
>> It is also easy to find "Swap Types" in the APM...
> 
> Redirecting readers to specs for gory details is fine.  Redirecting for basic,
> simple functionality is not.  Maybe the swap types will someday be common knowledge
> in KVM, and an explanation will no longer be necessary, but we are nowhere near
> that point.
> 
>>>     Add support for "DebugSwap for SEV-ES guests", which provides support
>>>     for swapping DR[0-3] and DR[0-3]_ADDR_MASK on VMRUN and VMEXIT, i.e.
>>>     allows KVM to expose debug capabilities to SEV-ES guests.  Without
>>>     DebugSwap support, the CPU doesn't save/load _guest_ debug registers,
>>
>> But it does save/load DR6 and DR7.
>>
>>>     and KVM cannot manually context switch guest DRs due the VMSA being
>>>     encrypted.
>>>
>>>     Enable DebugSwap if and only if the CPU also supports NoNestedDataBp,
>>>     which causes the CPU to ignore nested #DBs, i.e. #DBs that occur when
>>>     vectoring a #DB.
>>
>> (english question) What does "vectoring" mean here precisely? Handling?
>> Processing?
> 
> It's not really an English thing.  "Vectoring" is, or at least was, Intel terminology
> for describing the flow from the initial detection of an exception to the exception's
> delivery to software, e.g. covers the IDT lookup, any GDT/LDT lookups, pushing
> information on the stack, fetching the software exception handler, etc.  Importantly,
> it describes the period where there are no instructions retired and thus ucode doesn't
> open event windows, i.e. where triggering another, non-contributory exception will
> effectively "hang" the CPU (at least on CPUs without Intel's "notify" VM-Exit support).
> 
>>>     the host by putting the CPU into an infinite loop of vectoring #DBs
>>>     (see https://bugzilla.redhat.com/show_bug.cgi?id=1278496)
>>
>> Good one, thanks for the link.
>>
>>>
>>>     Set the features bit in sev_es_sync_vmsa() which is the last point
>>>     when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
>>>     init happens) is called not only when VCPU is initialized but also on
>>>     intrahost migration when VMSA is encrypted.
>>>
>>>> guests, but a new feature is available, identified via
>>>> CPUID Fn8000001F_EAX[14] "DebugSwap for SEV-ES guests", that provides
>>>> support for swapping additional debug registers. DR[0-3] and
>>>> DR[0-3]_ADDR_MASK are swapped as Type B when SEV_FEATURES[5] (DebugSwap)
>>>> is set.
>>>>
>>>> Enable DebugSwap for a VMSA but only do so if CPUID Fn80000021_EAX[0]
>>>> ("NoNestedDataBp", "Processor ignores nested data breakpoints") is
>>>> supported by the SOC as otherwise a malicious SEV-ES guest can set up
>>>> data breakpoints on the #DB IDT entry/stack and cause an infinite loop.
>>>> Set the features bit in sev_es_sync_vmsa() which is the last point
>>>> when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most
>>>> init happens) is called not only when VCPU is initialized but also on
>>>> intrahost migration when VMSA is encrypted.
>>>>
>>>> Eliminate DR7 and #DB intercepts as:
>>>> - they are not needed when DebugSwap is supported;
>>>
>>> "not needed" isn't sufficient justification.  KVM doesn't strictly need to do a
>>> lot of things, but does them anyways for a variety of reasons.  E.g. #DB intercept
>>> is also not needed when NoNestedDataBp is supported and vcpu->guest_debug==0, i.e.
>>> this should clarify why KVM doesn't simply disable #DB intercepts for _all_ VMs
>>> when NoNestedDataBp is support.  Presumably the answer is "because it's simpler
>>> than toggling #DB interception for guest_debug.
>>
>> TBH I did not have a good answer but from the above I'd say we want to
>> disable #DB intercepts in a separate patch, just as you say below.
>>
>>> Actually, can't disabling #DB interception for DebugSwap SEV-ES guests be a
>>> separate patch?  KVM can still inject #DBs for SEV-ES guests, no?
>>
>> Sorry for my ignorance but what is the point of injecting #DB if there is no
>> way of changing the guest's DR7?
> 
> Well, _injecting_ the #DB is necessary for correctness from the guest's perspective.
> "What's the point of _intercepting_ #DB" is the real question.  And for SEV-ES guests
> with DebugSwap, there is no point, which is why I agree that KVM should disable
> interception in that case.  What I'm calling out is that disabling #Db interception
> isn't _necessary_ for correctness (unless I'm missing something), which means that
> it can and should go in a separate patch.


About this. Instead of sev_es_init_vmcb(), I can toggle the #DB 
intercept when toggling guest_debug, see below. This 
kvm_x86_ops::update_exception_bitmap hook is called on vcpu reset and 
kvm_arch_vcpu_ioctl_set_guest_debug (which skips this call if 
guest_state_protected = true). Is there any downside?


diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index da28ed69bf4a..a7df5eb4ac00 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1951,9 +1951,15 @@ static void svm_update_exception_bitmap(struct 
kvm_vcpu *vcpu)

         clr_exception_intercept(svm, BP_VECTOR);

+       if (cpu_feature_enabled(X86_FEATURE_NO_NESTED_DATA_BP))
+               clr_exception_intercept(svm, DB_VECTOR);
+
         if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
                 if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
                         set_exception_intercept(svm, BP_VECTOR);
+
+               if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
+                       set_exception_intercept(svm, DB_VECTOR);
         }
  }
  
Sean Christopherson May 26, 2023, 2:39 p.m. UTC | #7
On Fri, May 26, 2023, Alexey Kardashevskiy wrote:
> 
> On 24/5/23 01:44, Sean Christopherson wrote:
> > On Tue, May 23, 2023, Alexey Kardashevskiy wrote:
> > > > Actually, can't disabling #DB interception for DebugSwap SEV-ES guests be a
> > > > separate patch?  KVM can still inject #DBs for SEV-ES guests, no?
> > > 
> > > Sorry for my ignorance but what is the point of injecting #DB if there is no
> > > way of changing the guest's DR7?
> > 
> > Well, _injecting_ the #DB is necessary for correctness from the guest's perspective.
> > "What's the point of _intercepting_ #DB" is the real question.  And for SEV-ES guests
> > with DebugSwap, there is no point, which is why I agree that KVM should disable
> > interception in that case.  What I'm calling out is that disabling #Db interception
> > isn't _necessary_ for correctness (unless I'm missing something), which means that
> > it can and should go in a separate patch.
> 
> 
> About this. Instead of sev_es_init_vmcb(), I can toggle the #DB intercept
> when toggling guest_debug, see below. This
> kvm_x86_ops::update_exception_bitmap hook is called on vcpu reset and
> kvm_arch_vcpu_ioctl_set_guest_debug (which skips this call if
> guest_state_protected = true).

KVM also intercepts #DB when single-stepping over IRET to find an NMI window, so
you'd also have to factor in nmi_singlestep, and update svm_enable_nmi_window()
and disable_nmi_singlestep() to call svm_update_exception_bitmap().

> Is there any downside?

Complexity is the main one.  The complexity is quite low, but the benefit to the
guest is likely even lower.  A #DB in the guest isn't likely to be performance
sensitive.  And on the flip side, opening an NMI window would be a tiny bit more
expensive, though I doubt that would be meaningful either.

All in all, I think it makes sense to just keep intercepting #DB for non-SEV-ES
guests.

Side topic, isn't there an existing bug regarding SEV-ES NMI windows?  KVM can't
actually single-step an SEV-ES guest, but tries to set RFLAGS.TF anyways.  Blech,
and suppressing EFER.SVME in efer_trap() is a bit gross, but I suppose since the
GHCB doesn't allow for CLGI or STGI it's "fine".

E.g. shouldn't KVM do this?

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ca32389f3c36..4e4a49031efe 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3784,6 +3784,16 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
        if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion)
                return; /* IRET will cause a vm exit */
 
+       /*
+        * KVM can't single-step SEV-ES guests and instead assumes that IRET
+        * in the guest will always succeed, i.e. clears NMI masking on the
+        * next VM-Exit.  Note, GIF is guaranteed to be '1' for SEV-ES guests
+        * as the GHCB doesn't allow for CLGI or STGI (and KVM suppresses
+        * EFER.SVME for good measure, see efer_trap()).
+        */
+       if (sev_es_guest(vcpu->kvm))
+               return;
+
        if (!gif_set(svm)) {
                if (vgif)
                        svm_set_intercept(svm, INTERCEPT_STGI);
  
Alexey Kardashevskiy May 30, 2023, 8:57 a.m. UTC | #8
On 27/5/23 00:39, Sean Christopherson wrote:
> On Fri, May 26, 2023, Alexey Kardashevskiy wrote:
>>
>> On 24/5/23 01:44, Sean Christopherson wrote:
>>> On Tue, May 23, 2023, Alexey Kardashevskiy wrote:
>>>>> Actually, can't disabling #DB interception for DebugSwap SEV-ES guests be a
>>>>> separate patch?  KVM can still inject #DBs for SEV-ES guests, no?
>>>>
>>>> Sorry for my ignorance but what is the point of injecting #DB if there is no
>>>> way of changing the guest's DR7?
>>>
>>> Well, _injecting_ the #DB is necessary for correctness from the guest's perspective.
>>> "What's the point of _intercepting_ #DB" is the real question.  And for SEV-ES guests
>>> with DebugSwap, there is no point, which is why I agree that KVM should disable
>>> interception in that case.  What I'm calling out is that disabling #Db interception
>>> isn't _necessary_ for correctness (unless I'm missing something), which means that
>>> it can and should go in a separate patch.
>>
>>
>> About this. Instead of sev_es_init_vmcb(), I can toggle the #DB intercept
>> when toggling guest_debug, see below. This
>> kvm_x86_ops::update_exception_bitmap hook is called on vcpu reset and
>> kvm_arch_vcpu_ioctl_set_guest_debug (which skips this call if
>> guest_state_protected = true).
> 
> KVM also intercepts #DB when single-stepping over IRET to find an NMI window, so
> you'd also have to factor in nmi_singlestep, and update svm_enable_nmi_window()
> and disable_nmi_singlestep() to call svm_update_exception_bitmap().

Uff. New can of worms for me :-/


>> Is there any downside?
> 
> Complexity is the main one.  The complexity is quite low, but the benefit to the
> guest is likely even lower.  A #DB in the guest isn't likely to be performance
> sensitive.  And on the flip side, opening an NMI window would be a tiny bit more
> expensive, though I doubt that would be meaningful either.
> 
> All in all, I think it makes sense to just keep intercepting #DB for non-SEV-ES
> guests.
> 
> Side topic, isn't there an existing bug regarding SEV-ES NMI windows?  KVM can't
> actually single-step an SEV-ES guest, but tries to set RFLAGS.TF anyways. 

Why is it a "bug" and what does the patch fix? Sound to me as it is 
pointless and the guest won't do single stepping and instead will run 
till it exits somehow, what do I miss?

> Blech,
> and suppressing EFER.SVME in efer_trap() is a bit gross,

Why suppressed? svm_set_efer() sets it eventually anyway.

> but I suppose since the
> GHCB doesn't allow for CLGI or STGI it's "fine".

GHCB does not mention this, instead these are always intercepted in 
init_vmcb().

> E.g. shouldn't KVM do this?

It sure can and I am happy to include this into the series, the commit 
log is what I am struggling with :)

> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ca32389f3c36..4e4a49031efe 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3784,6 +3784,16 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
>          if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion)
>                  return; /* IRET will cause a vm exit */
>   
> +       /*
> +        * KVM can't single-step SEV-ES guests and instead assumes that IRET
> +        * in the guest will always succeed, 

It relies on GHCB's NMI_COMPLETE (which SVM than handles is it was IRET):

         case SVM_VMGEXIT_NMI_COMPLETE:
                 ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
                 break;


> i.e. clears NMI masking on the
> +        * next VM-Exit.  Note, GIF is guaranteed to be '1' for SEV-ES guests
> +        * as the GHCB doesn't allow for CLGI or STGI (and KVM suppresses
> +        * EFER.SVME for good measure, see efer_trap()).

SVM KVM seems to not enforce EFER.SVME, the guest does what it wants and 
KVM is only told the new value via EFER_WRITE_TRAP. And "writes by 
SEV-ES guests to EFER.SVME are always ignored by hardware" says the APM. 
I must be missing the point here...


> +        */
> +       if (sev_es_guest(vcpu->kvm))
> +               return;
> +
>          if (!gif_set(svm)) {
>                  if (vgif)
>                          svm_set_intercept(svm, INTERCEPT_STGI);
  
Alexey Kardashevskiy June 1, 2023, 11:31 p.m. UTC | #9
Sean, ping?

I wonder if this sev-es-not-singlestepping is a showstopper or it is 
alright to repost this patchset without it? Thanks,


On 30/5/23 18:57, Alexey Kardashevskiy wrote:
> 
> 
> On 27/5/23 00:39, Sean Christopherson wrote:
>> On Fri, May 26, 2023, Alexey Kardashevskiy wrote:
>>>
>>> On 24/5/23 01:44, Sean Christopherson wrote:
>>>> On Tue, May 23, 2023, Alexey Kardashevskiy wrote:
>>>>>> Actually, can't disabling #DB interception for DebugSwap SEV-ES 
>>>>>> guests be a
>>>>>> separate patch?  KVM can still inject #DBs for SEV-ES guests, no?
>>>>>
>>>>> Sorry for my ignorance but what is the point of injecting #DB if 
>>>>> there is no
>>>>> way of changing the guest's DR7?
>>>>
>>>> Well, _injecting_ the #DB is necessary for correctness from the 
>>>> guest's perspective.
>>>> "What's the point of _intercepting_ #DB" is the real question.  And 
>>>> for SEV-ES guests
>>>> with DebugSwap, there is no point, which is why I agree that KVM 
>>>> should disable
>>>> interception in that case.  What I'm calling out is that disabling 
>>>> #Db interception
>>>> isn't _necessary_ for correctness (unless I'm missing something), 
>>>> which means that
>>>> it can and should go in a separate patch.
>>>
>>>
>>> About this. Instead of sev_es_init_vmcb(), I can toggle the #DB 
>>> intercept
>>> when toggling guest_debug, see below. This
>>> kvm_x86_ops::update_exception_bitmap hook is called on vcpu reset and
>>> kvm_arch_vcpu_ioctl_set_guest_debug (which skips this call if
>>> guest_state_protected = true).
>>
>> KVM also intercepts #DB when single-stepping over IRET to find an NMI 
>> window, so
>> you'd also have to factor in nmi_singlestep, and update 
>> svm_enable_nmi_window()
>> and disable_nmi_singlestep() to call svm_update_exception_bitmap().
> 
> Uff. New can of worms for me :-/
> 
> 
>>> Is there any downside?
>>
>> Complexity is the main one.  The complexity is quite low, but the 
>> benefit to the
>> guest is likely even lower.  A #DB in the guest isn't likely to be 
>> performance
>> sensitive.  And on the flip side, opening an NMI window would be a 
>> tiny bit more
>> expensive, though I doubt that would be meaningful either.
>>
>> All in all, I think it makes sense to just keep intercepting #DB for 
>> non-SEV-ES
>> guests.
>>
>> Side topic, isn't there an existing bug regarding SEV-ES NMI windows?  
>> KVM can't
>> actually single-step an SEV-ES guest, but tries to set RFLAGS.TF anyways. 
> 
> Why is it a "bug" and what does the patch fix? Sound to me as it is 
> pointless and the guest won't do single stepping and instead will run 
> till it exits somehow, what do I miss?
> 
>> Blech,
>> and suppressing EFER.SVME in efer_trap() is a bit gross,
> 
> Why suppressed? svm_set_efer() sets it eventually anyway.
> 
>> but I suppose since the
>> GHCB doesn't allow for CLGI or STGI it's "fine".
> 
> GHCB does not mention this, instead these are always intercepted in 
> init_vmcb().
> 
>> E.g. shouldn't KVM do this?
> 
> It sure can and I am happy to include this into the series, the commit 
> log is what I am struggling with :)
> 
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index ca32389f3c36..4e4a49031efe 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -3784,6 +3784,16 @@ static void svm_enable_nmi_window(struct 
>> kvm_vcpu *vcpu)
>>          if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion)
>>                  return; /* IRET will cause a vm exit */
>> +       /*
>> +        * KVM can't single-step SEV-ES guests and instead assumes 
>> that IRET
>> +        * in the guest will always succeed, 
> 
> It relies on GHCB's NMI_COMPLETE (which SVM than handles is it was IRET):
> 
>          case SVM_VMGEXIT_NMI_COMPLETE:
>                  ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
>                  break;
> 
> 
>> i.e. clears NMI masking on the
>> +        * next VM-Exit.  Note, GIF is guaranteed to be '1' for SEV-ES 
>> guests
>> +        * as the GHCB doesn't allow for CLGI or STGI (and KVM suppresses
>> +        * EFER.SVME for good measure, see efer_trap()).
> 
> SVM KVM seems to not enforce EFER.SVME, the guest does what it wants and 
> KVM is only told the new value via EFER_WRITE_TRAP. And "writes by 
> SEV-ES guests to EFER.SVME are always ignored by hardware" says the APM. 
> I must be missing the point here...
> 
> 
>> +        */
>> +       if (sev_es_guest(vcpu->kvm))
>> +               return;
>> +
>>          if (!gif_set(svm)) {
>>                  if (vgif)
>>                          svm_set_intercept(svm, INTERCEPT_STGI);
>
  
Sean Christopherson June 13, 2023, 11:19 p.m. UTC | #10
On Fri, Jun 02, 2023, Alexey Kardashevskiy wrote:
> Sean, ping?
>=20
> I wonder if this sev-es-not-singlestepping is a showstopper or it is alri=
ght
> to repost this patchset without it? Thanks,

Ah, shoot, I completely lost this in my inbox.  Sorry :-/

> > > Side topic, isn't there an existing bug regarding SEV-ES NMI windows?
> > > KVM can't actually single-step an SEV-ES guest, but tries to set
> > > RFLAGS.TF anyways.
> >=20
> > Why is it a "bug" and what does the patch fix? Sound to me as it is
> > pointless and the guest won't do single stepping and instead will run
> > till it exits somehow, what do I miss?

The bug is benign in the end, but it's still a bug.  I'm not worried about =
fixing
any behavior, but I dislike having dead, misleading code, especially for so=
mething
like this where both NMI virtualization and SEV-ES are already crazy comple=
x and
subtle.  I think it's safe to say that I've spent more time digging through=
 SEV-ES
and NMI virtualization than most KVM developers, and as evidenced by the nu=
mber of
things I got wrong below, I'm still struggling to keep track of the bigger =
picture.
Developers that are new to all of this need as much help as they can get.

> > > Blech, and suppressing EFER.SVME in efer_trap() is a bit gross,
> >=20
> > Why suppressed? svm_set_efer() sets it eventually anyway.

svm_set_efer() sets SVME in hardware, but KVM's view of the guest's value t=
hat's
stored in vcpu->arch.efer doesn't have SVME set.  E.g. from the guest's per=
spective,
EFER.SVME will have "Reserved Read As Zero" semantics.

> > > but I suppose since the GHCB doesn't allow for CLGI or STGI it's "fin=
e".
> >=20
> > GHCB does not mention this, instead these are always intercepted in
> > init_vmcb().

Right, I'm calling out that the absense of protocol support for requesting =
CLGI
or STGI emulation means dropping the guest's EFER.SVME is ok (though gross =
:-) ).

> > > E.g. shouldn't KVM do this?
> >=20
> > It sure can and I am happy to include this into the series, the commit
> > log is what I am struggling with :)
> >=20
> > >=20
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index ca32389f3c36..4e4a49031efe 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -3784,6 +3784,16 @@ static void svm_enable_nmi_window(struct
> > > kvm_vcpu *vcpu)
> > > =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
=BD if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion)
> > > =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=
 return; /* IRET will cause a vm exit */
> > > +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD /*
> > > +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * KV=
M can't single-step SEV-ES guests and instead assumes
> > > that IRET
> > > +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * in=
 the guest will always succeed,
> >=20
> > It relies on GHCB's NMI_COMPLETE (which SVM than handles is it was IRET=
):
> >=20
> >  =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD case S=
VM_VMGEXIT_NMI_COMPLETE:
> >  =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD ret =3D =
svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
> >  =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD break;

Ah, right, better to say that the guest is responsible for signaling that i=
t's
ready to accept NMIs, which KVM handles by "emulating" IRET.

> > > i.e. clears NMI masking on the
> > > +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * ne=
xt VM-Exit.=EF=BF=BD Note, GIF is guaranteed to be '1' for
> > > SEV-ES guests
> > > +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * as=
 the GHCB doesn't allow for CLGI or STGI (and KVM suppresses
> > > +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * EF=
ER.SVME for good measure, see efer_trap()).
> >=20
> > SVM KVM seems to not enforce EFER.SVME, the guest does what it wants an=
d
> > KVM is only told the new value via EFER_WRITE_TRAP. And "writes by
> > SEV-ES guests to EFER.SVME are always ignored by hardware" says the APM=
.

Ahhh, that blurb in the APM is what I'm missing.

Actually, there's a real bug here.  KVM doesn't immediately unmask NMIs in =
response
to NMI_COMPLETE, and instead goes through the whole awaiting_iret_completio=
n =3D>
svm_complete_interrupts(), which means that KVM doesn't unmask NMIs until t=
he
*next* VM-Exit.  Theoretically, that could be never, e.g. if the host is ti=
ckless
and the guest is configured to busy wait idle CPUs.

Attached patches are compile tested only.
  
Alexey Kardashevskiy June 14, 2023, 3:58 a.m. UTC | #11
On 14/6/23 09:19, Sean Christopherson wrote:
> On Fri, Jun 02, 2023, Alexey Kardashevskiy wrote:
>> Sean, ping?
>> =20
>> I wonder if this sev-es-not-singlestepping is a showstopper or it is alri=
> ght
>> to repost this patchset without it? Thanks,
> 
> Ah, shoot, I completely lost this in my inbox.  Sorry :-/

I saw the "OOO" message the other day and relaxed :)


>>>> Side topic, isn't there an existing bug regarding SEV-ES NMI windows?
>>>> KVM can't actually single-step an SEV-ES guest, but tries to set
>>>> RFLAGS.TF anyways.
>>> =20
>>> Why is it a "bug" and what does the patch fix? Sound to me as it is
>>> pointless and the guest won't do single stepping and instead will run
>>> till it exits somehow, what do I miss?
> 
> The bug is benign in the end, but it's still a bug.  I'm not worried about =


(unrelated) Your response's encoding broke somehow and I wonder if this 
is something I did or you did. Lore got it too:

https://lore.kernel.org/all/ZIj5ms+DohcLyXHE@google.com/


> fixing
> any behavior, but I dislike having dead, misleading code, especially for so=
> mething
> like this where both NMI virtualization and SEV-ES are already crazy comple=
> x and
> subtle.  I think it's safe to say that I've spent more time digging through=
>   SEV-ES
> and NMI virtualization than most KVM developers, and as evidenced by the nu=
> mber of
> things I got wrong below, I'm still struggling to keep track of the bigger =
> picture.
> Developers that are new to all of this need as much help as they can get.
> 
>>>> Blech, and suppressing EFER.SVME in efer_trap() is a bit gross,
>>> =20
>>> Why suppressed? svm_set_efer() sets it eventually anyway.
> 
> svm_set_efer() sets SVME in hardware, but KVM's view of the guest's value t=
> hat's
> stored in vcpu->arch.efer doesn't have SVME set.  E.g. from the guest's per=
> spective,
> EFER.SVME will have "Reserved Read As Zero" semantics.

It is not zero, why? From inside the guest, rdmsrl(MSR_EFER, efer) reads 
0x1d01 from that msr where 0x1000==(1<<_EFER_SVME),  _EFER_SVME==12.


> 
>>>> but I suppose since the GHCB doesn't allow for CLGI or STGI it's "fin=
> e".
>>> =20
>>> GHCB does not mention this, instead these are always intercepted in
>>> init_vmcb().
> 
> Right, I'm calling out that the absense of protocol support for requesting =
> CLGI
> or STGI emulation means dropping the guest's EFER.SVME is ok (though gross =
> :-) ).
> 
>>>> E.g. shouldn't KVM do this?
>>> =20
>>> It sure can and I am happy to include this into the series, the commit
>>> log is what I am struggling with :)
>>> =20
>>>> =20
>>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>>> index ca32389f3c36..4e4a49031efe 100644
>>>> --- a/arch/x86/kvm/svm/svm.c
>>>> +++ b/arch/x86/kvm/svm/svm.c
>>>> @@ -3784,6 +3784,16 @@ static void svm_enable_nmi_window(struct
>>>> kvm_vcpu *vcpu)
>>>> =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
> =BD if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion)
>>>> =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
> =BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=
>   return; /* IRET will cause a vm exit */
>>>> +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD /*
>>>> +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * KV=
> M can't single-step SEV-ES guests and instead assumes
>>>> that IRET
>>>> +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * in=
>   the guest will always succeed,
>>> =20
>>> It relies on GHCB's NMI_COMPLETE (which SVM than handles is it was IRET=
> ):
>>> =20
>>>   =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD case S=
> VM_VMGEXIT_NMI_COMPLETE:
>>>   =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
> =BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD ret =3D =
> svm_invoke_exit_handler(vcpu, SVM_EXIT_IRET);
>>>   =EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=
> =BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD break;
> 
> Ah, right, better to say that the guest is responsible for signaling that i=
> t's
> ready to accept NMIs, which KVM handles by "emulating" IRET.
> 
>>>> i.e. clears NMI masking on the
>>>> +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * ne=
> xt VM-Exit.=EF=BF=BD Note, GIF is guaranteed to be '1' for
>>>> SEV-ES guests
>>>> +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * as=
>   the GHCB doesn't allow for CLGI or STGI (and KVM suppresses
>>>> +=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD=EF=BF=BD * EF=
> ER.SVME for good measure, see efer_trap()).
>>> =20
>>> SVM KVM seems to not enforce EFER.SVME, the guest does what it wants an=
> d
>>> KVM is only told the new value via EFER_WRITE_TRAP. And "writes by
>>> SEV-ES guests to EFER.SVME are always ignored by hardware" says the APM=
> .
> 
> Ahhh, that blurb in the APM is what I'm missing.
> 
> Actually, there's a real bug here.  KVM doesn't immediately unmask NMIs in =
> response
> to NMI_COMPLETE, and instead goes through the whole awaiting_iret_completio=
> n =3D>
> svm_complete_interrupts(), which means that KVM doesn't unmask NMIs until t=
> he
> *next* VM-Exit.  Theoretically, that could be never, e.g. if the host is ti=
> ckless
> and the guest is configured to busy wait idle CPUs.
> 
> Attached patches are compile tested only.

Well, NMIs still get injected from QEMU so I guess it is a pass? Thanks,
  
Sean Christopherson June 14, 2023, 9:27 p.m. UTC | #12
On Wed, Jun 14, 2023, Alexey Kardashevskiy wrote:
> On 14/6/23 09:19, Sean Christopherson wrote:
> > On Fri, Jun 02, 2023, Alexey Kardashevskiy wrote:
> > > > > Side topic, isn't there an existing bug regarding SEV-ES NMI windows?
> > > > > KVM can't actually single-step an SEV-ES guest, but tries to set
> > > > > RFLAGS.TF anyways.
> > > > =20
> > > > Why is it a "bug" and what does the patch fix? Sound to me as it is
> > > > pointless and the guest won't do single stepping and instead will run
> > > > till it exits somehow, what do I miss?
> > 
> > The bug is benign in the end, but it's still a bug.  I'm not worried about =
> 
> 
> (unrelated) Your response's encoding broke somehow and I wonder if this is
> something I did or you did. Lore got it too:
> 
> https://lore.kernel.org/all/ZIj5ms+DohcLyXHE@google.com/

Huh.  Guessing something I did, but I've no idea what caused it.

> > fixing
> > any behavior, but I dislike having dead, misleading code, especially for so=
> > mething
> > like this where both NMI virtualization and SEV-ES are already crazy comple=
> > x and
> > subtle.  I think it's safe to say that I've spent more time digging through=
> >   SEV-ES
> > and NMI virtualization than most KVM developers, and as evidenced by the nu=
> > mber of
> > things I got wrong below, I'm still struggling to keep track of the bigger =
> > picture.
> > Developers that are new to all of this need as much help as they can get.
> > 
> > > > > Blech, and suppressing EFER.SVME in efer_trap() is a bit gross,
> > > > =20
> > > > Why suppressed? svm_set_efer() sets it eventually anyway.
> > 
> > svm_set_efer() sets SVME in hardware, but KVM's view of the guest's value t=
> > hat's
> > stored in vcpu->arch.efer doesn't have SVME set.  E.g. from the guest's per=
> > spective,
> > EFER.SVME will have "Reserved Read As Zero" semantics.
> 
> It is not zero, why? From inside the guest, rdmsrl(MSR_EFER, efer) reads
> 0x1d01 from that msr where 0x1000==(1<<_EFER_SVME),  _EFER_SVME==12.

Oh, lame.  So the guest gets to see the raw value in the VMSA.  So it really comes
down to the GHCB not providing support for STGI/CLGI.
  

Patch

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d9c190cdefa9..3a5eeb178778 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -435,6 +435,7 @@ 
 #define X86_FEATURE_SEV_ES		(19*32+ 3) /* AMD Secure Encrypted Virtualization - Encrypted State */
 #define X86_FEATURE_V_TSC_AUX		(19*32+ 9) /* "" Virtual TSC_AUX */
 #define X86_FEATURE_SME_COHERENT	(19*32+10) /* "" AMD hardware-enforced cache coherency */
+#define X86_FEATURE_DEBUG_SWAP		(19*32+14) /* AMD SEV-ES full debug state swap support */
 
 /* AMD-defined Extended Feature 2 EAX, CPUID level 0x80000021 (EAX), word 20 */
 #define X86_FEATURE_NO_NESTED_DATA_BP	(20*32+ 0) /* "" No Nested Data Breakpoints */
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 770dcf75eaa9..3a422d213010 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -280,6 +280,7 @@  static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_
 #define AVIC_HPA_MASK	~((0xFFFULL << 52) | 0xFFF)
 #define VMCB_AVIC_APIC_BAR_MASK		0xFFFFFFFFFF000ULL
 
+#define SVM_SEV_FEAT_DEBUG_SWAP                        BIT(5)
 
 struct vmcb_seg {
 	u16 selector;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index f0885250252d..ba12e7962e94 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -22,6 +22,7 @@ 
 #include <asm/pkru.h>
 #include <asm/trapnr.h>
 #include <asm/fpu/xcr.h>
+#include <asm/debugreg.h>
 
 #include "mmu.h"
 #include "x86.h"
@@ -53,9 +54,14 @@  module_param_named(sev, sev_enabled, bool, 0444);
 /* enable/disable SEV-ES support */
 static bool sev_es_enabled = true;
 module_param_named(sev_es, sev_es_enabled, bool, 0444);
+
+/* enable/disable SEV-ES DebugSwap support */
+static bool sev_es_debug_swap_enabled = true;
+module_param_named(debug_swap, sev_es_debug_swap_enabled, bool, 0444);
 #else
 #define sev_enabled false
 #define sev_es_enabled false
+#define sev_es_debug_swap_enabled false
 #endif /* CONFIG_KVM_AMD_SEV */
 
 static u8 sev_enc_bit;
@@ -605,6 +611,9 @@  static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 	save->xss  = svm->vcpu.arch.ia32_xss;
 	save->dr6  = svm->vcpu.arch.dr6;
 
+	if (sev_es_debug_swap_enabled)
+		save->sev_features |= SVM_SEV_FEAT_DEBUG_SWAP;
+
 	pr_debug("Virtual Machine Save Area (VMSA):\n");
 	print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false);
 
@@ -2256,6 +2265,9 @@  void __init sev_hardware_setup(void)
 out:
 	sev_enabled = sev_supported;
 	sev_es_enabled = sev_es_supported;
+	if (!sev_es_enabled || !cpu_feature_enabled(X86_FEATURE_DEBUG_SWAP) ||
+	    !cpu_feature_enabled(X86_FEATURE_NO_NESTED_DATA_BP))
+		sev_es_debug_swap_enabled = false;
 #endif
 }
 
@@ -2976,14 +2988,20 @@  static void sev_es_init_vmcb(struct vcpu_svm *svm)
 	svm_set_intercept(svm, TRAP_CR8_WRITE);
 
 	/*
+	 * Unless DebugSwap (depends on X86_FEATURE_NO_NESTED_DATA_BP) is enabled,
 	 * DR7 access must remain intercepted for an SEV-ES guest to disallow
 	 * the guest kernel enable debugging as otherwise a VM writing to DR7
 	 * from the #DB handler may trigger infinite loop of #DB's.
 	 */
 	vmcb->control.intercepts[INTERCEPT_DR] = 0;
-	vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ);
-	vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE);
-	recalc_intercepts(svm);
+	if (sev_es_debug_swap_enabled) {
+		clr_exception_intercept(svm, DB_VECTOR);
+		/* clr_exception_intercept() called recalc_intercepts() */
+	} else {
+		vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ);
+		vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE);
+		recalc_intercepts(svm);
+	}
 
 	/* Can't intercept XSETBV, HV can't modify XCR0 directly */
 	svm_clr_intercept(svm, INTERCEPT_XSETBV);
@@ -3048,6 +3066,18 @@  void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa)
 
 	/* MSR_IA32_XSS is restored on VMEXIT, save the currnet host value */
 	hostsa->xss = host_xss;
+
+	/* The DebugSwap SEV feature does Type B swaps of DR[0-3] */
+	if (sev_es_debug_swap_enabled) {
+		hostsa->dr0 = native_get_debugreg(0);
+		hostsa->dr1 = native_get_debugreg(1);
+		hostsa->dr2 = native_get_debugreg(2);
+		hostsa->dr3 = native_get_debugreg(3);
+		hostsa->dr0_addr_mask = amd_get_dr_addr_mask(0);
+		hostsa->dr1_addr_mask = amd_get_dr_addr_mask(1);
+		hostsa->dr2_addr_mask = amd_get_dr_addr_mask(2);
+		hostsa->dr3_addr_mask = amd_get_dr_addr_mask(3);
+	}
 }
 
 void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)