KVM: SVM: Disable TDP MMU when running on Hyper-V

Message ID 20230227171751.1211786-1-jpiotrowski@linux.microsoft.com
State New
Series KVM: SVM: Disable TDP MMU when running on Hyper-V

Commit Message

Jeremi Piotrowski Feb. 27, 2023, 5:17 p.m. UTC
  TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
The issue was first introduced by two commits:

- bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
  flush to caller when freeing TDP MMU shadow pages"
- efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
  roots via asynchronous worker"

The root cause is that since then there are missing TLB flushes which
are required by HV_X64_NESTED_ENLIGHTENED_TLB. The failure manifests
as L2 guest VMs being unable to complete boot due to memory
inconsistencies between L1 and L2 guests which lead to various
assertion/emulation failures.

The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
Hyper-V on AMD and is always used by Linux. The TLB flush required by
HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
boot performance on AMD is reproducibly slower compared to when TDP MMU is
disabled.

Disable TDP MMU when using SVM Hyper-V for the time being while we
search for a better fix.

Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/t/#u
Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
---
Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
<=v6.2. This would be needed in stable too, and I don't know about putting
fixes tags.

Jeremi

 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu/mmu.c          |  5 +++--
 arch/x86/kvm/svm/svm.c          |  6 +++++-
 arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
 arch/x86/kvm/vmx/vmx.c          |  3 ++-
 5 files changed, 22 insertions(+), 5 deletions(-)
  

Comments

Vitaly Kuznetsov March 6, 2023, 5:52 p.m. UTC | #1
Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:

> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
> The issue was first introduced by two commits:
>
> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>   flush to caller when freeing TDP MMU shadow pages"
> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>   roots via asynchronous worker"
>
> The root cause is that since then there are missing TLB flushes which
> are required by HV_X64_NESTED_ENLIGHTENED_TLB.

Please share more details on what's actually missing as you get them,
I'd like to understand which flushes can be legally avoided on bare
hardware and Hyper-V/VMX but not on Hyper-V/SVM.

>  The failure manifests
> as L2 guest VMs being unable to complete boot due to memory
> inconsistencies between L1 and L2 guests which lead to various
> assertion/emulation failures.
>
> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
> Hyper-V on AMD and is always used by Linux. The TLB flush required by
> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
> boot performance on AMD is reproducibly slower compared to when TDP MMU is
> disabled.
>
> Disable TDP MMU when using SVM Hyper-V for the time being while we
> search for a better fix.

I'd suggest we go the other way around: disable
HV_X64_NESTED_ENLIGHTENED_TLB on SVM:

diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
index 6981c1e9a809..be98da5a4277 100644
--- a/arch/x86/kvm/svm/svm_onhyperv.h
+++ b/arch/x86/kvm/svm/svm_onhyperv.h
@@ -32,7 +32,8 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
 
 static inline void svm_hv_hardware_setup(void)
 {
-       if (npt_enabled &&
+       /* A comment about missing TLB flushes */
+       if (!tdp_mmu_enabled && npt_enabled &&
            ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
                pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
                svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;

this way we won't have a not-obvious-at-all MMU change on Hyper-V. I
understand this may have some performance implications but MMU switch
has some as well.

>
> Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/t/#u
> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
> ---
> Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
> <=v6.2. This would be needed in stable too, and I don't know about putting
> fixes tags.

Cc: stable@vger.kernel.org # 5.17.0 

should do)

>
> Jeremi
>
>  arch/x86/include/asm/kvm_host.h |  3 ++-
>  arch/x86/kvm/mmu/mmu.c          |  5 +++--
>  arch/x86/kvm/svm/svm.c          |  6 +++++-
>  arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>  5 files changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4d2bc08794e4..a0868ae3688d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2031,7 +2031,8 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
>  void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>  
>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> -		       int tdp_max_root_level, int tdp_huge_page_level);
> +		       int tdp_max_root_level, int tdp_huge_page_level,
> +		       bool enable_tdp_mmu);
>  
>  static inline u16 kvm_read_ldt(void)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c91ee2927dd7..5c0e28a7a3bc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>  }
>  
>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> -		       int tdp_max_root_level, int tdp_huge_page_level)
> +		       int tdp_max_root_level, int tdp_huge_page_level,
> +		       bool enable_tdp_mmu)
>  {
>  	tdp_enabled = enable_tdp;
>  	tdp_root_level = tdp_forced_root_level;
>  	max_tdp_level = tdp_max_root_level;
>  
>  #ifdef CONFIG_X86_64
> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>  #endif
>  	/*
>  	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index d13cf53e7390..070c3f7f8c9f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4925,6 +4925,7 @@ static __init int svm_hardware_setup(void)
>  	struct page *iopm_pages;
>  	void *iopm_va;
>  	int r;
> +	bool enable_tdp_mmu;
>  	unsigned int order = get_order(IOPM_SIZE);
>  
>  	/*
> @@ -4991,9 +4992,12 @@ static __init int svm_hardware_setup(void)
>  	if (!boot_cpu_has(X86_FEATURE_NPT))
>  		npt_enabled = false;
>  
> +	enable_tdp_mmu = svm_hv_enable_tdp_mmu();
> +
>  	/* Force VM NPT level equal to the host's paging level */
>  	kvm_configure_mmu(npt_enabled, get_npt_level(),
> -			  get_npt_level(), PG_LEVEL_1G);
> +			  get_npt_level(), PG_LEVEL_1G,
> +			  enable_tdp_mmu);
>  	pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
>  
>  	/* Setup shadow_me_value and shadow_me_mask */
> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
> index 6981c1e9a809..aa49ac5d66bc 100644
> --- a/arch/x86/kvm/svm/svm_onhyperv.h
> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
> @@ -30,6 +30,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>  		hve->hv_enlightenments_control.msr_bitmap = 1;
>  }
>  
> +static inline bool svm_hv_enable_tdp_mmu(void)
> +{
> +	return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
> +}
> +
>  static inline void svm_hv_hardware_setup(void)
>  {
>  	if (npt_enabled &&
> @@ -84,6 +89,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>  {
>  }
>  
> +static inline bool svm_hv_enable_tdp_mmu(void)
> +{
> +	return true;
> +}
> +
>  static inline void svm_hv_hardware_setup(void)
>  {
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c788aa382611..4d3808755d39 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8442,7 +8442,8 @@ static __init int hardware_setup(void)
>  	vmx_setup_me_spte_mask();
>  
>  	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
> -			  ept_caps_to_lpage_level(vmx_capability.ept));
> +			  ept_caps_to_lpage_level(vmx_capability.ept),
> +			  true);
>  
>  	/*
>  	 * Only enable PML when hardware supports PML feature, and both EPT
  
Jeremi Piotrowski March 6, 2023, 6:31 p.m. UTC | #2
On 06/03/2023 18:52, Vitaly Kuznetsov wrote:
> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
> 
>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
>> The issue was first introduced by two commits:
>>
>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>>   flush to caller when freeing TDP MMU shadow pages"
>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>>   roots via asynchronous worker"
>>
>> The root cause is that since then there are missing TLB flushes which
>> are required by HV_X64_NESTED_ENLIGHTENED_TLB.
> 
> Please share more details on what's actually missing as you get them,
> I'd like to understand which flushes can be legally avoided on bare
> hardware and Hyper-V/VMX but not on Hyper-V/SVM.
> 

See the linked thread here
https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@linux.microsoft.com/#t
for all the details/analyses but the summary was that either of these 2
options would work, with a) issuing fewer flushes (though fewer flushes are not necessarily
better):

a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()

These are only needed on Hyper-V/SVM because of how the enlightenment works (needs an explicit
flush to rebuild L0 shadow page tables). Hyper-V/VMX does not need any changes and currently
works. Let me know if you need more information on something here, I'll try to get it.

>>  The failure manifests
>> as L2 guest VMs being unable to complete boot due to memory
>> inconsistencies between L1 and L2 guests which lead to various
>> assertion/emulation failures.
>>
>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
>> Hyper-V on AMD and is always used by Linux. The TLB flush required by
>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
>> boot performance on AMD is reproducibly slower compared to when TDP MMU is
>> disabled.
>>
>> Disable TDP MMU when using SVM Hyper-V for the time being while we
>> search for a better fix.
> 
> I'd suggest we go the other way around: disable
> HV_X64_NESTED_ENLIGHTENED_TLB on SVM:

Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is used, and
I prefer that option too. The enlightenment does offer a nice performance advantage
with non-TDP_MMU, and I did not see TDP_MMU perform any better compared to that.
Afaik the code to use the enlightenment on Hyper-V/SVM was written/tested before
TDP_MMU became the default.

If you have a specific scenario in mind, we could test and see what the implications
are there.

> 
> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
> index 6981c1e9a809..be98da5a4277 100644
> --- a/arch/x86/kvm/svm/svm_onhyperv.h
> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
> @@ -32,7 +32,8 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>  
>  static inline void svm_hv_hardware_setup(void)
>  {
> -       if (npt_enabled &&
> +       /* A comment about missing TLB flushes */
> +       if (!tdp_mmu_enabled && npt_enabled &&
>             ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
>                 pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
>                 svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
> 
> this way we won't have a not-obvious-at-all MMU change on Hyper-V. I
> understand this may have some performance implications but MMU switch
> has some as well.
> 
>>
>> Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/t/#u
>> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
>> ---
>> Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
>> <=v6.2. This would be needed in stable too, and I don't know about putting
>> fixes tags.
> 
> Cc: stable@vger.kernel.org # 5.17.0 
> 
> should do)
> 
>>
>> Jeremi
>>
>>  arch/x86/include/asm/kvm_host.h |  3 ++-
>>  arch/x86/kvm/mmu/mmu.c          |  5 +++--
>>  arch/x86/kvm/svm/svm.c          |  6 +++++-
>>  arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
>>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>>  5 files changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 4d2bc08794e4..a0868ae3688d 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -2031,7 +2031,8 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
>>  void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>>  
>>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>> -		       int tdp_max_root_level, int tdp_huge_page_level);
>> +		       int tdp_max_root_level, int tdp_huge_page_level,
>> +		       bool enable_tdp_mmu);
>>  
>>  static inline u16 kvm_read_ldt(void)
>>  {
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index c91ee2927dd7..5c0e28a7a3bc 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>>  }
>>  
>>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>> -		       int tdp_max_root_level, int tdp_huge_page_level)
>> +		       int tdp_max_root_level, int tdp_huge_page_level,
>> +		       bool enable_tdp_mmu)
>>  {
>>  	tdp_enabled = enable_tdp;
>>  	tdp_root_level = tdp_forced_root_level;
>>  	max_tdp_level = tdp_max_root_level;
>>  
>>  #ifdef CONFIG_X86_64
>> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
>> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>>  #endif
>>  	/*
>>  	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index d13cf53e7390..070c3f7f8c9f 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -4925,6 +4925,7 @@ static __init int svm_hardware_setup(void)
>>  	struct page *iopm_pages;
>>  	void *iopm_va;
>>  	int r;
>> +	bool enable_tdp_mmu;
>>  	unsigned int order = get_order(IOPM_SIZE);
>>  
>>  	/*
>> @@ -4991,9 +4992,12 @@ static __init int svm_hardware_setup(void)
>>  	if (!boot_cpu_has(X86_FEATURE_NPT))
>>  		npt_enabled = false;
>>  
>> +	enable_tdp_mmu = svm_hv_enable_tdp_mmu();
>> +
>>  	/* Force VM NPT level equal to the host's paging level */
>>  	kvm_configure_mmu(npt_enabled, get_npt_level(),
>> -			  get_npt_level(), PG_LEVEL_1G);
>> +			  get_npt_level(), PG_LEVEL_1G,
>> +			  enable_tdp_mmu);
>>  	pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
>>  
>>  	/* Setup shadow_me_value and shadow_me_mask */
>> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
>> index 6981c1e9a809..aa49ac5d66bc 100644
>> --- a/arch/x86/kvm/svm/svm_onhyperv.h
>> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
>> @@ -30,6 +30,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>  		hve->hv_enlightenments_control.msr_bitmap = 1;
>>  }
>>  
>> +static inline bool svm_hv_enable_tdp_mmu(void)
>> +{
>> +	return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
>> +}
>> +
>>  static inline void svm_hv_hardware_setup(void)
>>  {
>>  	if (npt_enabled &&
>> @@ -84,6 +89,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>  {
>>  }
>>  
>> +static inline bool svm_hv_enable_tdp_mmu(void)
>> +{
>> +	return true;
>> +}
>> +
>>  static inline void svm_hv_hardware_setup(void)
>>  {
>>  }
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index c788aa382611..4d3808755d39 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -8442,7 +8442,8 @@ static __init int hardware_setup(void)
>>  	vmx_setup_me_spte_mask();
>>  
>>  	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
>> -			  ept_caps_to_lpage_level(vmx_capability.ept));
>> +			  ept_caps_to_lpage_level(vmx_capability.ept),
>> +			  true);
>>  
>>  	/*
>>  	 * Only enable PML when hardware supports PML feature, and both EPT
>
  
Vitaly Kuznetsov March 7, 2023, 10:07 a.m. UTC | #3
Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:

> On 06/03/2023 18:52, Vitaly Kuznetsov wrote:
>> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>> 
>>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
>>> The issue was first introduced by two commits:
>>>
>>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>>>   flush to caller when freeing TDP MMU shadow pages"
>>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>>>   roots via asynchronous worker"
>>>
>>> The root cause is that since then there are missing TLB flushes which
>>> are required by HV_X64_NESTED_ENLIGHTENED_TLB.
>> 
>> Please share more details on what's actually missing as you get them,
>> I'd like to understand which flushes can be legally avoided on bare
>> hardware and Hyper-V/VMX but not on Hyper-V/SVM.
>> 
>
> See the linked thread here
> https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@linux.microsoft.com/#t
> for all the details/analyses but the summary was that either of these 2
> options would work, with a) issuing fewer flushes (though fewer flushes are not necessarily
> better):
>
> a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>
> These are only needed on Hyper-V/SVM because of how the enlightenment works (needs an explicit
> flush to rebuild L0 shadow page tables). Hyper-V/VMX does not need any changes and currently
> works. Let me know if you need more information on something here, I'll try to get it.
>

Ah, I missed the whole party! Thanks for the pointers!

>>>  The failure manifests
>>> as L2 guest VMs being unable to complete boot due to memory
>>> inconsistencies between L1 and L2 guests which lead to various
>>> assertion/emulation failures.

Which levels are we talking about here, *real* L1 and L2 or L1 and L2
from KVM's perspective (real L2 and L3)?

>>>
>>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
>>> Hyper-V on AMD and is always used by Linux. The TLB flush required by
>>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
>>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
>>> boot performance on AMD is reproducibly slower compared to when TDP MMU is
>>> disabled.
>>>
>>> Disable TDP MMU when using SVM Hyper-V for the time being while we
>>> search for a better fix.
>> 
>> I'd suggest we go the other way around: disable
>> HV_X64_NESTED_ENLIGHTENED_TLB on SVM:
>
> Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is used, and
> I prefer that option too. The enlightenment does offer a nice performance advantage
> with non-TDP_MMU, and I did not see TDP_MMU perform any better compared to that.
> Afaik the code to use the enlightenment on Hyper-V/SVM was written/tested before
> TDP_MMU became the default.
>
> If you have a specific scenario in mind, we could test and see what the implications
> are there.

I don't have a strong opinion here, I've suggested a smaller change so
it's easier to backport it to stable kernels and easier to revert when a
proper fix comes to mainline. For performance implication, I'd only
consider non-nested scenarios from KVM's perspective (i.e. real L2 from
Hyper-V's PoV), as running L3 is unlikely a common use-case and, if I
understood correctly, is broken anyway.

>
>> 
>> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
>> index 6981c1e9a809..be98da5a4277 100644
>> --- a/arch/x86/kvm/svm/svm_onhyperv.h
>> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
>> @@ -32,7 +32,8 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>  
>>  static inline void svm_hv_hardware_setup(void)
>>  {
>> -       if (npt_enabled &&
>> +       /* A comment about missing TLB flushes */
>> +       if (!tdp_mmu_enabled && npt_enabled &&
>>             ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
>>                 pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
>>                 svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
>> 
>> this way we won't have a not-obvious-at-all MMU change on Hyper-V. I
>> understand this may have some performance implications but MMU switch
>> has some as well.
>> 
>>>
>>> Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/t/#u
>>> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
>>> ---
>>> Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
>>> <=v6.2. This would be needed in stable too, and I don't know about putting
>>> fixes tags.
>> 
>> Cc: stable@vger.kernel.org # 5.17.0 
>> 
>> should do)
>> 
>>>
>>> Jeremi
>>>
>>>  arch/x86/include/asm/kvm_host.h |  3 ++-
>>>  arch/x86/kvm/mmu/mmu.c          |  5 +++--
>>>  arch/x86/kvm/svm/svm.c          |  6 +++++-
>>>  arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
>>>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>>>  5 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 4d2bc08794e4..a0868ae3688d 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -2031,7 +2031,8 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
>>>  void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>>>  
>>>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>> -		       int tdp_max_root_level, int tdp_huge_page_level);
>>> +		       int tdp_max_root_level, int tdp_huge_page_level,
>>> +		       bool enable_tdp_mmu);
>>>  
>>>  static inline u16 kvm_read_ldt(void)
>>>  {
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index c91ee2927dd7..5c0e28a7a3bc 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>>>  }
>>>  
>>>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>> -		       int tdp_max_root_level, int tdp_huge_page_level)
>>> +		       int tdp_max_root_level, int tdp_huge_page_level,
>>> +		       bool enable_tdp_mmu)
>>>  {
>>>  	tdp_enabled = enable_tdp;
>>>  	tdp_root_level = tdp_forced_root_level;
>>>  	max_tdp_level = tdp_max_root_level;
>>>  
>>>  #ifdef CONFIG_X86_64
>>> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
>>> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>>>  #endif
>>>  	/*
>>>  	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index d13cf53e7390..070c3f7f8c9f 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -4925,6 +4925,7 @@ static __init int svm_hardware_setup(void)
>>>  	struct page *iopm_pages;
>>>  	void *iopm_va;
>>>  	int r;
>>> +	bool enable_tdp_mmu;
>>>  	unsigned int order = get_order(IOPM_SIZE);
>>>  
>>>  	/*
>>> @@ -4991,9 +4992,12 @@ static __init int svm_hardware_setup(void)
>>>  	if (!boot_cpu_has(X86_FEATURE_NPT))
>>>  		npt_enabled = false;
>>>  
>>> +	enable_tdp_mmu = svm_hv_enable_tdp_mmu();
>>> +
>>>  	/* Force VM NPT level equal to the host's paging level */
>>>  	kvm_configure_mmu(npt_enabled, get_npt_level(),
>>> -			  get_npt_level(), PG_LEVEL_1G);
>>> +			  get_npt_level(), PG_LEVEL_1G,
>>> +			  enable_tdp_mmu);
>>>  	pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
>>>  
>>>  	/* Setup shadow_me_value and shadow_me_mask */
>>> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
>>> index 6981c1e9a809..aa49ac5d66bc 100644
>>> --- a/arch/x86/kvm/svm/svm_onhyperv.h
>>> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
>>> @@ -30,6 +30,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>>  		hve->hv_enlightenments_control.msr_bitmap = 1;
>>>  }
>>>  
>>> +static inline bool svm_hv_enable_tdp_mmu(void)
>>> +{
>>> +	return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
>>> +}
>>> +
>>>  static inline void svm_hv_hardware_setup(void)
>>>  {
>>>  	if (npt_enabled &&
>>> @@ -84,6 +89,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>>  {
>>>  }
>>>  
>>> +static inline bool svm_hv_enable_tdp_mmu(void)
>>> +{
>>> +	return true;
>>> +}
>>> +
>>>  static inline void svm_hv_hardware_setup(void)
>>>  {
>>>  }
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index c788aa382611..4d3808755d39 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -8442,7 +8442,8 @@ static __init int hardware_setup(void)
>>>  	vmx_setup_me_spte_mask();
>>>  
>>>  	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
>>> -			  ept_caps_to_lpage_level(vmx_capability.ept));
>>> +			  ept_caps_to_lpage_level(vmx_capability.ept),
>>> +			  true);
>>>  
>>>  	/*
>>>  	 * Only enable PML when hardware supports PML feature, and both EPT
>> 
>
  
Sean Christopherson March 7, 2023, 5:36 p.m. UTC | #4
On Mon, Feb 27, 2023, Jeremi Piotrowski wrote:
> Disable TDP MMU when using SVM Hyper-V for the time being while we
> search for a better fix.

...

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c91ee2927dd7..5c0e28a7a3bc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>  }
>  
>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> -		       int tdp_max_root_level, int tdp_huge_page_level)
> +		       int tdp_max_root_level, int tdp_huge_page_level,
> +		       bool enable_tdp_mmu)
>  {
>  	tdp_enabled = enable_tdp;
>  	tdp_root_level = tdp_forced_root_level;
>  	max_tdp_level = tdp_max_root_level;
>  
>  #ifdef CONFIG_X86_64
> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>  #endif

Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
happens to get lucky and not run afoul of the underlying bugs.  The revert appears
to be reasonably straightforward (see bottom).

And _if_ we want to hack-a-fix it, then I would strongly prefer a very isolated,
obviously hacky fix, e.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36e4561554ca..a9ba4ae14fda 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5779,8 +5779,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
        tdp_root_level = tdp_forced_root_level;
        max_tdp_level = tdp_max_root_level;
 
+       /*
+        * FIXME: Remove the enlightened TLB restriction when KVM properly
+        * handles TLB flushes for said enlightenment.
+        */
 #ifdef CONFIG_X86_64
-       tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
+       tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled &&
+                         !(ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
 #endif
        /*
         * max_huge_page_level reflects KVM's MMU capabilities irrespective




The revert...

---
 arch/x86/kvm/svm/svm.c          |  3 ---
 arch/x86/kvm/svm/svm_onhyperv.h | 27 ---------------------------
 2 files changed, 30 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 11068e8eb969..292650dc85a0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1320,7 +1320,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 	if (sev_guest(vcpu->kvm))
 		sev_init_vmcb(svm);
 
-	svm_hv_init_vmcb(vmcb);
 	init_vmcb_after_set_cpuid(vcpu);
 
 	vmcb_mark_all_dirty(vmcb);
@@ -4075,8 +4074,6 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 		svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
 		vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
 
-		hv_track_root_tdp(vcpu, root_hpa);
-
 		cr3 = vcpu->arch.cr3;
 	} else if (root_level >= PT64_ROOT_4LEVEL) {
 		cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
index 6981c1e9a809..5118fd273e73 100644
--- a/arch/x86/kvm/svm/svm_onhyperv.h
+++ b/arch/x86/kvm/svm/svm_onhyperv.h
@@ -15,31 +15,8 @@ static struct kvm_x86_ops svm_x86_ops;
 
 int svm_hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu);
 
-static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
-{
-	struct hv_vmcb_enlightenments *hve = &vmcb->control.hv_enlightenments;
-
-	BUILD_BUG_ON(sizeof(vmcb->control.hv_enlightenments) !=
-		     sizeof(vmcb->control.reserved_sw));
-
-	if (npt_enabled &&
-	    ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB)
-		hve->hv_enlightenments_control.enlightened_npt_tlb = 1;
-
-	if (ms_hyperv.nested_features & HV_X64_NESTED_MSR_BITMAP)
-		hve->hv_enlightenments_control.msr_bitmap = 1;
-}
-
 static inline void svm_hv_hardware_setup(void)
 {
-	if (npt_enabled &&
-	    ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
-		pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
-		svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
-		svm_x86_ops.tlb_remote_flush_with_range =
-				hv_remote_flush_tlb_with_range;
-	}
-
 	if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH) {
 		int cpu;
 
@@ -80,10 +57,6 @@ static inline void svm_hv_update_vp_id(struct vmcb *vmcb, struct kvm_vcpu *vcpu)
 }
 #else
 
-static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
-{
-}
-
 static inline void svm_hv_hardware_setup(void)
 {
 }

base-commit: cb8748a781fe983e451f616ce4861a1c49ce79dd
--
  
Paolo Bonzini March 8, 2023, midnight UTC | #5
On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> happens to get lucky and not run afoul of the underlying bugs.

I don't think it's about luck---the legacy MMU's zapping/invalidation
seems to invoke the flush hypercall correctly:

Jeremi, did you ever track the call stack where
hyperv_nested_flush_guest_mapping is triggered?

Paolo

> The revert appears
> to be reasonably straightforward (see bottom).
>
> And _if_ we want to hack-a-fix it, then I would strongly prefer a very isolated,
> obviously hacky fix, e.g.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 36e4561554ca..a9ba4ae14fda 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5779,8 +5779,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>         tdp_root_level = tdp_forced_root_level;
>         max_tdp_level = tdp_max_root_level;
>
> +       /*
> +        * FIXME: Remove the enlightened TLB restriction when KVM properly
> +        * handles TLB flushes for said enlightenment.
> +        */
>  #ifdef CONFIG_X86_64
> -       tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
> +       tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled &&
> +                         !(ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
>  #endif
>         /*
>          * max_huge_page_level reflects KVM's MMU capabilities irrespective
>
>
>
>
> The revert...
>
> ---
>  arch/x86/kvm/svm/svm.c          |  3 ---
>  arch/x86/kvm/svm/svm_onhyperv.h | 27 ---------------------------
>  2 files changed, 30 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 11068e8eb969..292650dc85a0 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1320,7 +1320,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>         if (sev_guest(vcpu->kvm))
>                 sev_init_vmcb(svm);
>
> -       svm_hv_init_vmcb(vmcb);
>         init_vmcb_after_set_cpuid(vcpu);
>
>         vmcb_mark_all_dirty(vmcb);
> @@ -4075,8 +4074,6 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>                 svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
>                 vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
>
> -               hv_track_root_tdp(vcpu, root_hpa);
> -
>                 cr3 = vcpu->arch.cr3;
>         } else if (root_level >= PT64_ROOT_4LEVEL) {
>                 cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
> index 6981c1e9a809..5118fd273e73 100644
> --- a/arch/x86/kvm/svm/svm_onhyperv.h
> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
> @@ -15,31 +15,8 @@ static struct kvm_x86_ops svm_x86_ops;
>
>  int svm_hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu);
>
> -static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
> -{
> -       struct hv_vmcb_enlightenments *hve = &vmcb->control.hv_enlightenments;
> -
> -       BUILD_BUG_ON(sizeof(vmcb->control.hv_enlightenments) !=
> -                    sizeof(vmcb->control.reserved_sw));
> -
> -       if (npt_enabled &&
> -           ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB)
> -               hve->hv_enlightenments_control.enlightened_npt_tlb = 1;
> -
> -       if (ms_hyperv.nested_features & HV_X64_NESTED_MSR_BITMAP)
> -               hve->hv_enlightenments_control.msr_bitmap = 1;
> -}
> -
>  static inline void svm_hv_hardware_setup(void)
>  {
> -       if (npt_enabled &&
> -           ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
> -               pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
> -               svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
> -               svm_x86_ops.tlb_remote_flush_with_range =
> -                               hv_remote_flush_tlb_with_range;
> -       }
> -
>         if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH) {
>                 int cpu;
>
> @@ -80,10 +57,6 @@ static inline void svm_hv_update_vp_id(struct vmcb *vmcb, struct kvm_vcpu *vcpu)
>  }
>  #else
>
> -static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
> -{
> -}
> -
>  static inline void svm_hv_hardware_setup(void)
>  {
>  }
>
> base-commit: cb8748a781fe983e451f616ce4861a1c49ce79dd
> --
>
  
Sean Christopherson March 8, 2023, 12:39 a.m. UTC | #6
On Wed, Mar 08, 2023, Paolo Bonzini wrote:
> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
> > Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> > hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> > doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> > happens to get lucky and not run afoul of the underlying bugs.
> 
> I don't think it's about luck---the legacy MMU's zapping/invalidation
> seems to invoke the flush hypercall correctly:

...for the paths that Jeremi has exercised, and for which a stale TLB entry is
fatal to L2.  E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
in its path and fully relies on the buggy kvm_flush_remote_tlbs().

In other words, KVM is getting lucky :-)

> Jeremi, did you ever track the call stack where
> hyperv_nested_flush_guest_mapping is triggered?

I don't think it matters.  As above, it only takes one path where KVM is fully
relying on kvm_flush_remote_tlbs() for the whole thing to fall apart.
  
Jeremi Piotrowski March 8, 2023, 3:42 p.m. UTC | #7
On 07/03/2023 11:07, Vitaly Kuznetsov wrote:
> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
> 
>> On 06/03/2023 18:52, Vitaly Kuznetsov wrote:
>>> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>>>
>>>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
>>>> The issue was first introduced by two commits:
>>>>
>>>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>>>>   flush to caller when freeing TDP MMU shadow pages"
>>>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>>>>   roots via asynchronous worker"
>>>>
>>>> The root cause is that since then there are missing TLB flushes which
>>>> are required by HV_X64_NESTED_ENLIGHTENED_TLB.
>>>
>>> Please share more details on what's actually missing as you get them,
>>> I'd like to understand which flushes can be legally avoided on bare
>>> hardware and Hyper-V/VMX but not on Hyper-V/SVM.
>>>
>>
>> See the linked thread here
>> https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@linux.microsoft.com/#t
>> for all the details/analyses but the summary was that either of these 2
>> options would work, with a) having fewer flushes (footnote: fewer flushes are not necessarily
>> better):
>>
>> a) adding a hyperv_flush_guest_mapping(__pa(root->spt)) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
>> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>>
>> These are only needed on Hyper-V/SVM because of how the enlightenment works (needs an explicit
>> flush to rebuild L0 shadow page tables). Hyper-V/VMX does not need any changes and currently
>> works. Let me know if you need more information on something here, I'll try to get it.
>>
> 
> Ah, I missed the whole party! Thanks for the pointers!
> 
>>>>  The failure manifests
>>>> as L2 guest VMs being unable to complete boot due to memory
>>>> inconsistencies between L1 and L2 guests which lead to various
>>>> assertion/emulation failures.
> 
> Which levels are we talking about here, *real* L1 and L2 or L1 and L2
> from KVM's perspective (real L2 and L3)?

Real L1 and L2. In this whole discussion L0 is Hyper-V, L1 is KVM and L2 is a Linux VM.

> 
>>>>
>>>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
>>>> Hyper-V on AMD and is always used by Linux. The TLB flush required by
>>>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
>>>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
>>>> boot performance on AMD is reproducibly slower compared to when TDP MMU is
>>>> disabled.
>>>>
>>>> Disable TDP MMU when using SVM Hyper-V for the time being while we
>>>> search for a better fix.
>>>
>>> I'd suggest we go the other way around: disable
>>> HV_X64_NESTED_ENLIGHTENED_TLB on SVM:
>>
>> Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is used, and
>> I prefer that option too. The enlightenment does offer a nice performance advantage
>> with non-TDP_MMU, and I did not see TDP_MMU perform any better compared to that.
>> Afaik the code to use the enlightenment on Hyper-V/SVM was written/tested before
>> TDP_MMU became the default.
>>
>> If you have a specific scenario in mind, we could test and see what the implications
>> are there.
> 
> I don't have a strong opinion here, I've suggested a smaller change so
> it's easier to backport it to stable kernels and easier to revert when a
> proper fix comes to mainline.

Noted. My concern here is about changing a default in a way that lowers performance,
because the proper fix that comes later might end up not being suitable for stable.

> For performance implication, I'd only
> consider non-nested scenarios from KVM's perspective (i.e. real L2 from
> Hyper-V's PoV), as running L3 is unlikely a common use-case and, if I
> understood correctly, is broken anyway.

I agree with that. Right now L2 is broken, I've never even attempted L3 to
see if it would work.

Jeremi
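The flush semantics described in this exchange — L0 caches shadow page tables per NPT root and rebuilds them only on an explicit HvCallFlushGuestPhysicalAddressSpace hypercall — can be sketched as a toy user-space model (illustrative only; this is not kernel code and all names are invented):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy model of HV_X64_NESTED_ENLIGHTENED_TLB (illustrative only).
 * L0 shadows L1's NPT per root and rebuilds the shadow only when the
 * guest explicitly asks for it, so a missing flush leaves L2 running
 * on stale translations.
 */
#define NROOTS 4
#define NGFNS  8

static uint64_t l1_npt[NROOTS][NGFNS];     /* L1's nested page tables */
static uint64_t l0_shadow[NROOTS][NGFNS];  /* L0's cached shadow tables */
static int l0_shadow_valid[NROOTS];

/* L1 updates its NPT; L0's shadow is NOT synchronized automatically. */
void l1_set_mapping(int root, int gfn, uint64_t pfn)
{
    l1_npt[root][gfn] = pfn;
}

/* Stand-in for HvCallFlushGuestPhysicalAddressSpace: drop the shadow. */
void flush_guest_mapping(int root)
{
    l0_shadow_valid[root] = 0;
}

/* L2 translation via L0: serves the (possibly stale) shadow. */
uint64_t l0_translate(int root, int gfn)
{
    if (!l0_shadow_valid[root]) {
        memcpy(l0_shadow[root], l1_npt[root], sizeof(l0_shadow[root]));
        l0_shadow_valid[root] = 1;
    }
    return l0_shadow[root][gfn];
}
```

In this model, both proposed fixes amount to making sure flush_guest_mapping() runs after any point where L1's tables and L0's shadow can diverge.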
  
Jeremi Piotrowski March 8, 2023, 3:48 p.m. UTC | #8
On 08/03/2023 01:00, Paolo Bonzini wrote:
> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
>> happens to get lucky and not run afoul of the underlying bugs.
> 
> I don't think it's about luck---the legacy MMU's zapping/invalidation
> seems to invoke the flush hypercall correctly:
> 
> Jeremi, did you ever track the call stack where
> hyperv_nested_flush_guest_mapping is triggered?
> 
> Paolo
> 

Yes, here's all the stacktraces I get for hyperv_nested_flush_guest_mapping with legacy MMU:

1
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vcpu_ioctl
     kvm_arch_vcpu_ioctl_run
      svm_handle_exit
       svm_invoke_exit_handler
        npf_interception
         kvm_mmu_page_fault
          kvm_mmu_do_page_fault
           kvm_tdp_page_fault
            direct_page_fault
             kvm_faultin_pfn
              __gfn_to_pfn_memslot
               hva_to_pfn
                get_user_pages_unlocked
                 __gup_longterm_locked
                  __get_user_pages
                   handle_mm_fault
                    __handle_mm_fault
                     do_huge_pmd_wp_page
                      __split_huge_pmd
                       __mmu_notifier_invalidate_range_start
                        kvm_mmu_notifier_invalidate_range_start
                         kvm_flush_remote_tlbs
                          hv_remote_flush_tlb
                           hv_remote_flush_tlb_with_range
                            hyperv_flush_guest_mapping
1
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   syscall_exit_to_user_mode
    exit_to_user_mode_prepare
     arch_do_signal_or_restart
      get_signal
       do_group_exit
        do_exit
         mmput
          __mmput
           exit_mmap
            __mmu_notifier_release
             kvm_mmu_notifier_release
              kvm_arch_flush_shadow_all
               kvm_mmu_zap_all
                kvm_mmu_commit_zap_page.part.0
                 kvm_flush_remote_tlbs
                  hv_remote_flush_tlb
                   hv_remote_flush_tlb_with_range
                    hyperv_flush_guest_mapping
1
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   syscall_exit_to_user_mode
    exit_to_user_mode_prepare
     arch_do_signal_or_restart
      get_signal
       do_group_exit
        do_exit
         task_work_run
          ____fput
           __fput
            kvm_vm_release
             kvm_put_kvm
              kvm_destroy_vm
               kvm_arch_destroy_vm
                kvm_mmu_unload
                 kvm_mmu_free_roots
                  kvm_mmu_commit_zap_page.part.0
                   kvm_flush_remote_tlbs
                    hv_remote_flush_tlb
                     hv_remote_flush_tlb_with_range
                      hyperv_flush_guest_mapping
6
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vcpu_ioctl
     kvm_arch_vcpu_ioctl_run
      svm_handle_exit
       svm_invoke_exit_handler
        npf_interception
         kvm_mmu_page_fault
          kvm_mmu_do_page_fault
           kvm_tdp_page_fault
            direct_page_fault
             kvm_faultin_pfn
              __gfn_to_pfn_memslot
               hva_to_pfn
                get_user_pages_unlocked
                 __gup_longterm_locked
                  __get_user_pages
                   handle_mm_fault
                    __handle_mm_fault
                     do_wp_page
                      __mmu_notifier_invalidate_range_start
                       kvm_mmu_notifier_invalidate_range_start
                        kvm_flush_remote_tlbs
                         hv_remote_flush_tlb
                          hv_remote_flush_tlb_with_range
                           hyperv_flush_guest_mapping
8
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vcpu_ioctl
     kvm_arch_vcpu_ioctl_run
      kvm_apic_accept_events
       kvm_vcpu_reset
        kvm_mmu_reset_context
         kvm_mmu_free_roots
          kvm_mmu_commit_zap_page.part.0
           kvm_flush_remote_tlbs
            hv_remote_flush_tlb
             hv_remote_flush_tlb_with_range
              hyperv_flush_guest_mapping
20
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vcpu_ioctl
     kvm_arch_vcpu_ioctl_run
      svm_handle_exit
       svm_invoke_exit_handler
        npf_interception
         kvm_mmu_page_fault
          kvm_mmu_do_page_fault
           kvm_tdp_page_fault
            direct_page_fault
             mmu_set_spte
              hv_remote_flush_tlb_with_range
               hyperv_flush_guest_mapping_range
406
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vm_ioctl
     __kvm_set_memory_region
      kvm_set_memslot
       kvm_arch_flush_shadow_memslot
        kvm_page_track_flush_slot
         kvm_mmu_invalidate_zap_pages_in_memslot
          kvm_mmu_zap_all_fast
           kvm_mmu_commit_zap_page.part.0
            kvm_flush_remote_tlbs
             hv_remote_flush_tlb
              hv_remote_flush_tlb_with_range
               hyperv_flush_guest_mapping
406
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vcpu_ioctl
     kvm_arch_vcpu_ioctl_run
      kvm_mmu_free_obsolete_roots
       __kvm_mmu_free_obsolete_roots
        kvm_mmu_free_roots
         kvm_mmu_commit_zap_page.part.0
          kvm_flush_remote_tlbs
           hv_remote_flush_tlb
            hv_remote_flush_tlb_with_range
             hyperv_flush_guest_mapping


and here's the stacks for TDP MMU:

1
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vcpu_ioctl
     kvm_arch_vcpu_ioctl_run
      svm_handle_exit
       svm_invoke_exit_handler
        npf_interception
         kvm_mmu_page_fault
          kvm_mmu_do_page_fault
           kvm_tdp_page_fault
            kvm_faultin_pfn
             __gfn_to_pfn_memslot
              hva_to_pfn
               get_user_pages_unlocked
                __gup_longterm_locked
                 __get_user_pages
                  handle_mm_fault
                   __handle_mm_fault
                    do_huge_pmd_wp_page
                     __split_huge_pmd
                      __mmu_notifier_invalidate_range_start
                       kvm_mmu_notifier_invalidate_range_start
                        kvm_flush_remote_tlbs
                         hv_remote_flush_tlb
                          hv_remote_flush_tlb_with_range
                           hyperv_flush_guest_mapping
6
 entry_SYSCALL_64_after_hwframe
  do_syscall_64
   __x64_sys_ioctl
    kvm_vcpu_ioctl
     kvm_arch_vcpu_ioctl_run
      svm_handle_exit
       svm_invoke_exit_handler
        npf_interception
         kvm_mmu_page_fault
          kvm_mmu_do_page_fault
           kvm_tdp_page_fault
            kvm_faultin_pfn
             __gfn_to_pfn_memslot
              hva_to_pfn
               get_user_pages_unlocked
                __gup_longterm_locked
                 __get_user_pages
                  handle_mm_fault
                   __handle_mm_fault
                    do_wp_page
                     __mmu_notifier_invalidate_range_start
                      kvm_mmu_notifier_invalidate_range_start
                       kvm_flush_remote_tlbs
                        hv_remote_flush_tlb
                         hv_remote_flush_tlb_with_range
                          hyperv_flush_guest_mapping

>> [...]
  
Jeremi Piotrowski March 8, 2023, 3:55 p.m. UTC | #9
On 08/03/2023 01:39, Sean Christopherson wrote:
> On Wed, Mar 08, 2023, Paolo Bonzini wrote:
>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway.  KVM
>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
>>> happens to get lucky and not run afoul of the underlying bugs.
>>
>> I don't think it's about luck---the legacy MMU's zapping/invalidation
>> seems to invoke the flush hypercall correctly:
> 
> ...for the paths that Jeremi has exercised, and for which a stale TLB entry is
> fatal to L2.  E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
> in its path and fully relies on the buggy kvm_flush_remote_tlbs().
>

Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the hypercall
that is needed, I don't see how this might be an issue of a missing "range-based TLB flush".

kvm_unmap_gfn_range is called from kvm_mmu_notifier_invalidate_range_start and 'flush_on_ret=true'
is set, so it is followed by kvm_flush_remote_tlbs which calls hv_remote_flush_tlb.

> In other words, KVM is getting lucky :-)
> 
>> Jeremi, did you ever track the call stack where
>> hyperv_nested_flush_guest_mapping is triggered?
> 
> I don't think it matters.  As above, it only takes one path where KVM is fully
> relying on kvm_flush_remote_tlbs() for the whole thing to fall apart
  
Jeremi Piotrowski March 8, 2023, 5:22 p.m. UTC | #10
On 08/03/2023 16:55, Jeremi Piotrowski wrote:
> 
> 
> On 08/03/2023 01:39, Sean Christopherson wrote:
>> On Wed, Mar 08, 2023, Paolo Bonzini wrote:
>>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
>>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
>>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
>>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
>>>> happens to get lucky and not run afoul of the underlying bugs.
>>>
>>> I don't think it's about luck---the legacy MMU's zapping/invalidation
>>> seems to invoke the flush hypercall correctly:
>>
>> ...for the paths that Jeremi has exercised, and for which a stale TLB entry is
>> fatal to L2.  E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
>> in its path and fully relies on the buggy kvm_flush_remote_tlbs().
>>
> 
> Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the hypercall
> that is needed, I don't see how this might be an issue of a missing "range-based TLB flush".
> 
> kvm_unmap_gfn_range is called from kvm_mmu_notifier_invalidate_range_start and 'flush_on_ret=true'
> is set, so it is followed by kvm_flush_remote_tlbs which calls hv_remote_flush_tlb.
> 
>> In other words, KVM is getting lucky :-)
>>
>>> Jeremi, did you ever track the call stack where
>>> hyperv_nested_flush_guest_mapping is triggered?
>>
>> I don't think it matters.  As above, it only takes one path where KVM is fully
>> relying on kvm_flush_remote_tlbs() for the whole thing to fall apart

Slowly I'm starting to understand what we've been talking about, thank you :)

Paolo/Sean, what do you think about something like the following, except I would make
it SVM only, and I'd need to think about what to do with the return.
I believe this accurately reflects what the optimization is about. hv_track_root_tdp
is called from kvm_mmu_load_pgd, which covers both kvm_mmu_load and kvm_mmu_new_pgd
(which requests KVM_REQ_LOAD_MMU_PGD).

diff --git a/arch/x86/kvm/kvm_onhyperv.c b/arch/x86/kvm/kvm_onhyperv.c
index 482d6639ef88..6a5bd3cbace8 100644
--- a/arch/x86/kvm/kvm_onhyperv.c
+++ b/arch/x86/kvm/kvm_onhyperv.c
@@ -29,6 +29,18 @@ static inline int hv_remote_flush_root_tdp(hpa_t root_tdp,
 		return hyperv_flush_guest_mapping(root_tdp);
 }
 
+static int hv_vcpu_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	struct kvm_arch *kvm_arch = &vcpu->kvm->arch;
+	hpa_t root_tdp = vcpu->arch.hv_root_tdp;
+	int ret;
+
+	ret = hyperv_flush_guest_mapping(root_tdp);
+	if (!ret)
+		kvm_arch->hv_root_tdp = root_tdp;
+	return ret;
+}
+
 int hv_remote_flush_tlb_with_range(struct kvm *kvm,
 		struct kvm_tlb_range *range)
 {
@@ -101,8 +113,10 @@ void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
 	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb) {
 		spin_lock(&kvm_arch->hv_root_tdp_lock);
 		vcpu->arch.hv_root_tdp = root_tdp;
-		if (root_tdp != kvm_arch->hv_root_tdp)
+		if (root_tdp != kvm_arch->hv_root_tdp) {
 			kvm_arch->hv_root_tdp = INVALID_PAGE;
+			hv_vcpu_flush_tlb_current(vcpu);
+		}
 		spin_unlock(&kvm_arch->hv_root_tdp_lock);
 	}
 }
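The semantics of the diff above can be sketched as a toy user-space model (illustrative only; names are simplified and the real code is per-vCPU and takes hv_root_tdp_lock). Note that this flushes on every observed root switch, including reuse of a still-valid previously used root:

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_PAGE ((uint64_t)-1)

/* VM-wide "all vCPUs share this root" tracker (kvm->arch.hv_root_tdp). */
static uint64_t vm_hv_root_tdp = INVALID_PAGE;
static int flush_calls;

/* Stand-in for hyperv_flush_guest_mapping(); always succeeds here. */
static int hv_flush_root(uint64_t root)
{
    (void)root;
    flush_calls++;
    return 0;
}

/* Toy version of the proposed hv_track_root_tdp() change. */
void track_root_tdp(uint64_t root_tdp)
{
    if (root_tdp != vm_hv_root_tdp) {
        vm_hv_root_tdp = INVALID_PAGE;
        if (!hv_flush_root(root_tdp))
            vm_hv_root_tdp = root_tdp; /* re-establish the common root */
    }
}
```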
  
Sean Christopherson March 8, 2023, 7:11 p.m. UTC | #11
On Wed, Mar 08, 2023, Jeremi Piotrowski wrote:
> On 08/03/2023 01:39, Sean Christopherson wrote:
> > On Wed, Mar 08, 2023, Paolo Bonzini wrote:
> >> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
> >>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> >>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> >>> happens to get lucky and not run afoul of the underlying bugs.
> >>
> >> I don't think it's about luck---the legacy MMU's zapping/invalidation
> >> seems to invoke the flush hypercall correctly:
> > 
> > ...for the paths that Jeremi has exercised, and for which a stale TLB entry is
> > fatal to L2.  E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
> > in its path and fully relies on the buggy kvm_flush_remote_tlbs().
> >
> 
> Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the
> hypercall that is needed, I don't see how this might be an issue of a missing
> "range-based TLB flush".

Doh, I forgot that the arch hook in kvm_flush_remote_tlbs() leads to the Hyper-V
hook.

svm_flush_tlb_current is very much broken, but in practice it doesn't matter outside
of the direct call from kvm_mmu_load(), because in all other paths KVM will flow
through a Hyper-V flush if KVM actually modifies its MMU in any ways.  E.g. the
request from kvm_mmu_new_pgd() when force_flush_and_sync_on_reuse=true is neutered,
but that exists only as a safeguard against MMU bugs.  And for things like
kvm_invalidate_pcid() and kvm_post_set_cr4(), my understanding is that Hyper-V
will still flush the bare metal TLB, it's only Hyper-V's shadow page tables that
are stale.

Depending on how Hyper-V handles ASIDs, pre_svm_run() may also be broken.  If
Hyper-V tracks and rebuilds only the current ASID, then bumping the ASID won't
rebuild potentially stale page tables.  But I'm guessing Hyper-V ignores the ASID
since the hypercall takes only the root PA.

The truly problematic case is kvm_mmu_load(), where KVM relies on the flush to
force Hyper-V to rebuild shadow page tables for an old, possibly stale nCR3.  This
affects only the TDP MMU because of an explicit optimization in the TDP MMU.  So
in practice we could squeak by with something like this:

	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb)
		hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
	else
		static_call(kvm_x86_flush_tlb_current)(vcpu);

but I'm not convinced that avoiding a hypercall in svm_flush_tlb_current() just
to avoid overhead when running an L3 (nested VM from L1 KVM's perspective) is
worthwhile.  The real problem there is that KVM nested SVM TLB/ASID support is an
unoptimized mess, and I can't imagine someone running an L3 is going to be super
concerned with performance.

I also think we should have a sanity check in the flush_tlb_all() path, i.e. WARN
if kvm_flush_remote_tlbs() falls back.

Something like this (probably doesn't compile, likely needs #ifdefs or helpers):

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7ef4f9e3b35a..38afc5cac1c4 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3770,7 +3770,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
        svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 }
 
-static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
+static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
 {
        struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -3794,6 +3794,23 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
                svm->current_vmcb->asid_generation--;
 }
 
+static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+       if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb &&
+           VALID_PAGE(vcpu->arch.mmu->root.hpa))
+               hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
+
+       svm_flush_tlb_asid(vcpu);
+}
+
+static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+       if (WARN_ON_ONCE(kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb))
+               hv_remote_flush_tlb(vcpu->kvm);
+
+       svm_flush_tlb_asid(vcpu);
+}
+
 static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 {
        struct vcpu_svm *svm = to_svm(vcpu);
@@ -4786,10 +4803,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
        .set_rflags = svm_set_rflags,
        .get_if_flag = svm_get_if_flag,
 
-       .flush_tlb_all = svm_flush_tlb_current,
+       .flush_tlb_all = svm_flush_tlb_all,
        .flush_tlb_current = svm_flush_tlb_current,
        .flush_tlb_gva = svm_flush_tlb_gva,
-       .flush_tlb_guest = svm_flush_tlb_current,
+       .flush_tlb_guest = svm_flush_tlb_asid,
 
        .vcpu_pre_run = svm_vcpu_pre_run,
        .vcpu_run = svm_vcpu_run,
  
Sean Christopherson March 8, 2023, 7:20 p.m. UTC | #12
On Wed, Mar 08, 2023, Jeremi Piotrowski wrote:
> On 08/03/2023 16:55, Jeremi Piotrowski wrote:
> > 
> > 
> > On 08/03/2023 01:39, Sean Christopherson wrote:
> >> On Wed, Mar 08, 2023, Paolo Bonzini wrote:
> >>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
> >>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> >>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> >>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> >>>> happens to get lucky and not run afoul of the underlying bugs.
> >>>
> >>> I don't think it's about luck---the legacy MMU's zapping/invalidation
> >>> seems to invoke the flush hypercall correctly:
> >>
> >> ...for the paths that Jeremi has exercised, and for which a stale TLB entry is
> >> fatal to L2.  E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
> >> in its path and fully relies on the buggy kvm_flush_remote_tlbs().
> >>
> > 
> > Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the hypercall
> > that is needed, I don't see how this might be an issue of a missing "range-based TLB flush".
> > 
> > kvm_unmap_gfn_range is called from kvm_mmu_notifier_invalidate_range_start and 'flush_on_ret=true'
> > is set, so it is followed by kvm_flush_remote_tlbs which calls hv_remote_flush_tlb.
> > 
> >> In other words, KVM is getting lucky :-)
> >>
> >>> Jeremi, did you ever track the call stack where
> >>> hyperv_nested_flush_guest_mapping is triggered?
> >>
> >> I don't think it matters.  As above, it only takes one path where KVM is fully
> >> relying on kvm_flush_remote_tlbs() for the whole thing to fall apart
> 
> Slowly I'm starting to understand what we've been talking about, thank you :)
> 
> Paolo/Sean, what do you think about something like the following, except I would make
> it SVM only, and I'd need to think about what to do with the return.
> I believe this accurately reflects what the optimization is about. hv_track_root_tdp
> is called from kvm_mmu_load_pgd, which covers both kvm_mmu_load and kvm_mmu_new_pgd
> (which requests KVM_REQ_LOAD_MMU_PGD).

It's close, but KVM doesn't *always* need to flush when loading a root.  KVM needs
to flush when loading a brand spanking new root, which is the kvm_mmu_load() path.
But when KVM loads a root via KVM_REQ_LOAD_MMU_PGD/kvm_mmu_new_pgd(), a flush may
or may not be necessary, e.g. if KVM reuses an old, but still valid, root (each
vCPU has a 3-entry root cache) and a TLB flush isn't architecturally required, then
there is no need to flush.

And as mentioned in the other tendril of this thread, I'd really like to fix
svm_flush_tlb_current() since it's technically broken, even though it's highly
unlikely (maybe even impossible?) to cause issues in practice.

> diff --git a/arch/x86/kvm/kvm_onhyperv.c b/arch/x86/kvm/kvm_onhyperv.c
> index 482d6639ef88..6a5bd3cbace8 100644
> --- a/arch/x86/kvm/kvm_onhyperv.c
> +++ b/arch/x86/kvm/kvm_onhyperv.c
> @@ -29,6 +29,18 @@ static inline int hv_remote_flush_root_tdp(hpa_t root_tdp,
>  		return hyperv_flush_guest_mapping(root_tdp);
>  }
>  
> +static int hv_vcpu_flush_tlb_current(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_arch *kvm_arch = &vcpu->kvm->arch;
> +	hpa_t root_tdp = vcpu->arch.hv_root_tdp;
> +	int ret;
> +
> +	ret = hyperv_flush_guest_mapping(root_tdp);
> +	if (!ret)
> +		kvm_arch->hv_root_tdp = root_tdp;
> +	return ret;
> +}
> +
>  int hv_remote_flush_tlb_with_range(struct kvm *kvm,
>  		struct kvm_tlb_range *range)
>  {
> @@ -101,8 +113,10 @@ void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
>  	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb) {
>  		spin_lock(&kvm_arch->hv_root_tdp_lock);
>  		vcpu->arch.hv_root_tdp = root_tdp;
> -		if (root_tdp != kvm_arch->hv_root_tdp)
> +		if (root_tdp != kvm_arch->hv_root_tdp) {
>  			kvm_arch->hv_root_tdp = INVALID_PAGE;
> +			hv_vcpu_flush_tlb_current(vcpu);
> +		}
>  		spin_unlock(&kvm_arch->hv_root_tdp_lock);
>  	}
>  }
>
  
Jeremi Piotrowski March 9, 2023, 5:58 p.m. UTC | #13
@Alex,
do you know the answer to Sean's question below about ASID handling in Hyper-V?
Any other comments about how we're intending to fix things are also welcome.

Jeremi

On 08/03/2023 20:11, Sean Christopherson wrote:
> On Wed, Mar 08, 2023, Jeremi Piotrowski wrote:
>> On 08/03/2023 01:39, Sean Christopherson wrote:
>>> On Wed, Mar 08, 2023, Paolo Bonzini wrote:
>>>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
>>>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> >>>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
>>>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
>>>>> happens to get lucky and not run afoul of the underlying bugs.
>>>>
>>>> I don't think it's about luck---the legacy MMU's zapping/invalidation
>>>> seems to invoke the flush hypercall correctly:
>>>
>>> ...for the paths that Jeremi has exercised, and for which a stale TLB entry is
>>> fatal to L2.  E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
>>> in its path and fully relies on the buggy kvm_flush_remote_tlbs().
>>>
>>
>> Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the
>> hypercall that is needed, I don't see how this might be an issue of a missing
>> "range-based TLB flush".
> 
> Doh, I forgot that the arch hook in kvm_flush_remote_tlbs() leads to the Hyper-V
> hook.
> 
> svm_flush_tlb_current is very much broken, but in practice it doesn't matter outside
> of the direct call from kvm_mmu_load(), because in all other paths KVM will flow
> through a Hyper-V flush if KVM actually modifies its MMU in any way.  E.g. the
> request from kvm_mmu_new_pgd() when force_flush_and_sync_on_reuse=true is neutered,
> but that exists only as a safeguard against MMU bugs.  And for things like
> kvm_invalidate_pcid() and kvm_post_set_cr4(), my understanding is that Hyper-V
> will still flush the bare metal TLB, it's only Hyper-V's shadow page tables that
> are stale.
> 
> Depending on how Hyper-V handles ASIDs, pre_svm_run() may also be broken.  If
> Hyper-V tracks and rebuilds only the current ASID, then bumping the ASID won't
> rebuild potentially stale page tables.  But I'm guessing Hyper-V ignores the ASID
> since the hypercall takes only the root PA.
> 
> The truly problematic case is kvm_mmu_load(), where KVM relies on the flush to
> force Hyper-V to rebuild shadow page tables for an old, possibly stale nCR3.  This
> affects only the TDP MMU because of an explicit optimization in the TDP MMU.  So
> in practice we could squeak by with something like this:
> 
> 	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb)
> 		hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
> 	else
> 		static_call(kvm_x86_flush_tlb_current)(vcpu);
> 
> but I'm not convinced that avoiding a hypercall in svm_flush_tlb_current() just
> to avoid overhead when running an L3 (nested VM from L1 KVM's perspective) is
> worthwhile.  The real problem there is that KVM nested SVM TLB/ASID support is an
> unoptimized mess, and I can't imagine someone running an L3 is going to be super
> concerned with performance.
> 
> I also think we should have a sanity check in the flush_tlb_all() path, i.e. WARN
> if kvm_flush_remote_tlbs() falls back.
> 
> Something like this (probably doesn't compile, likely needs #ifdefs or helpers):
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7ef4f9e3b35a..38afc5cac1c4 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3770,7 +3770,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
>         svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
>  }
>  
> -static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
> +static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
>  
> @@ -3794,6 +3794,23 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
>                 svm->current_vmcb->asid_generation--;
>  }
>  
> +static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
> +{
> +       if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb &&
> +           VALID_PAGE(vcpu->arch.mmu->root.hpa))
> +               hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
> +
> +       svm_flush_tlb_asid(vcpu);
> +}
> +
> +static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
> +{
> +       if (WARN_ON_ONCE(kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb))
> +               hv_remote_flush_tlb(vcpu->kvm);
> +
> +       svm_flush_tlb_asid(vcpu);
> +}
> +
>  static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4786,10 +4803,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>         .set_rflags = svm_set_rflags,
>         .get_if_flag = svm_get_if_flag,
>  
> -       .flush_tlb_all = svm_flush_tlb_current,
> +       .flush_tlb_all = svm_flush_tlb_all,
>         .flush_tlb_current = svm_flush_tlb_current,
>         .flush_tlb_gva = svm_flush_tlb_gva,
> -       .flush_tlb_guest = svm_flush_tlb_current,
> +       .flush_tlb_guest = svm_flush_tlb_asid,
>  
>         .vcpu_pre_run = svm_vcpu_pre_run,
>         .vcpu_run = svm_vcpu_run,
  
Alexander Grest March 12, 2023, 5:42 p.m. UTC | #14
Yes, I agree that bumping the ASID doesn't flush Hyper-V's shadow page tables (if the guest opts into "enlightened TLB"). 

Here is the relevant section from the TLFS:

"The nested hypervisor can optionally opt into an "enlightened TLB" by setting EnlightenedNptTlb to "1" in HV_SVM_ENLIGHTENED_VMCB_FIELDS. If the nested hypervisor opts into the enlightenment, ASID invalidations just flush TLB entries derived from first level address translation (i.e. the virtual address space). To flush TLB entries derived from the nested page table (NPT) and force the L0 hypervisor to rebuild shadow page tables, the HvCallFlushGuestPhysicalAddressSpace or HvCallFlushGuestPhysicalAddressList hypercalls must be used."

(see https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/nested-virtualization)

Thanks,
Alex

Jeremi Piotrowski April 5, 2023, 4:43 p.m. UTC | #15
On 3/7/2023 6:36 PM, Sean Christopherson wrote:
> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> happens to get lucky and not run afoul of the underlying bugs.  The revert appears
> to be reasonably straightforward (see bottom).

Hi Sean,

I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has
landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my
AMD laptop.

There is some seriously weird interaction going on between TDP MMU and Hyper-V, with
or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs.
I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering).
I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU
and 4GB of RAM.

If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds.

If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots
(loading grub, decompressing kernel, during kernel boot), the boot output feels like
it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes,
I have also seen it take 4 minutes, sometimes even not finish at all. Same everything,
the only difference is the value of `kvm.tdp_mmu`.

So I would like to revisit disabling tdp_mmu on hyperv altogether for the time being but it
should probably be with the following condition:

  tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && !hypervisor_is_type(X86_HYPER_MS_HYPERV)

Do you have an environment where you would be able to reproduce this? A Windows server perhaps
or an AMD laptop?

Jeremi

> 
> And _if_ we want to hack-a-fix it, then I would strongly prefer a very isolated,
> obviously hacky fix, e.g.
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 36e4561554ca..a9ba4ae14fda 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5779,8 +5779,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>         tdp_root_level = tdp_forced_root_level;
>         max_tdp_level = tdp_max_root_level;
>  
> +       /*
> +        * FIXME: Remove the enlightened TLB restriction when KVM properly
> +        * handles TLB flushes for said enlightenment.
> +        */
>  #ifdef CONFIG_X86_64
> -       tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
> +       tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled &&
> +                         !(ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
>  #endif
>         /*
>          * max_huge_page_level reflects KVM's MMU capabilities irrespective
> 
>
  
Sean Christopherson April 10, 2023, 11:25 p.m. UTC | #16
On Wed, Apr 05, 2023, Jeremi Piotrowski wrote:
> On 3/7/2023 6:36 PM, Sean Christopherson wrote:
> > Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> > hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> > doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> > happens to get lucky and not run afoul of the underlying bugs.  The revert appears
> > to be reasonably straightforward (see bottom).
> 
> Hi Sean,
> 
> I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has
> landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my
> AMD laptop.
> 
> There is some seriously weird interaction going on between TDP MMU and Hyper-V, with
> or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs.
> I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering).
> I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU
> and 4GB of RAM.
> 
> If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds.
> 
> If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots
> (loading grub, decompressing kernel, during kernel boot), the boot output feels like
> it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes,
> I have also seen it take 4 minutes, sometimes even not finish at all. Same everything,
> the only difference is the value of `kvm.tdp_mmu`.

When a stall occurs, can you tell where the time is lost?  E.g. is the CPU stuck
in L0, L1, or L2?  L2 being a single vCPU rules out quite a few scenarios, e.g.
lock contention and whatnot.

If you can run perf in WSL, that might be the easiest way to suss out what's going
on.

> So I would like to revisit disabling tdp_mmu on hyperv altogether for the time being but it
> should probably be with the following condition:
> 
>   tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && !hypervisor_is_type(X86_HYPER_MS_HYPERV)
> 
> Do you have an environment where you would be able to reproduce this? A Windows server perhaps
> or an AMD laptop?

Hrm, not easily, no.  Can you try two things?

  1. Linus' tree on Intel hardware
  2. kvm-x86/next[*] on Intel hardware

Don't bother with #2 if #1 (Linus' tree) does NOT suffer the same stalls as AMD.
#2 is interesting iff Intel is also affected as kvm-x86/next has an optimization
> for CR0.WP toggling, which was the Achilles' heel of the TDP MMU.  If Intel isn't
affected, then something other than CR0.WP is to blame.

I fully expect both experiments to show the same behavior as AMD, but if for some
reason they don't, the results should help narrow the search.

[*] https://github.com/kvm-x86/linux/tree/next
  
Jeremi Piotrowski April 11, 2023, 2:22 p.m. UTC | #17
On 4/11/2023 1:25 AM, Sean Christopherson wrote:
> On Wed, Apr 05, 2023, Jeremi Piotrowski wrote:
>> On 3/7/2023 6:36 PM, Sean Christopherson wrote:
>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
>>> happens to get lucky and not run afoul of the underlying bugs.  The revert appears
>>> to be reasonably straightforward (see bottom).
>>
>> Hi Sean,
>>
>> I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has
>> landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my
>> AMD laptop.
>>
>> There is some seriously weird interaction going on between TDP MMU and Hyper-V, with
>> or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs.
>> I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering).
>> I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU
>> and 4GB of RAM.
>>
>> If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds.
>>
>> If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots
>> (loading grub, decompressing kernel, during kernel boot), the boot output feels like
>> it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes,
>> I have also seen it take 4 minutes, sometimes even not finish at all. Same everything,
>> the only difference is the value of `kvm.tdp_mmu`.
> 
> When a stall occurs, can you tell where the time is lost?  E.g. is the CPU stuck
> in L0, L1, or L2?  L2 being a single vCPU rules out quite a few scenarios, e.g.
> lock contention and whatnot.

It shows up as around 90% L2 time, 10% L1 time. I don't have great visibility into L0 time
right now, I'm trying to find someone who might be able to help with that.

> 
> If you can run perf in WSL, that might be the easiest way to suss out what's going
> on.

I can run perf, what trace would help?

> 
>> So I would like to revisit disabling tdp_mmu on hyperv altogether for the time being but it
>> should probably be with the following condition:
>>
>>   tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && !hypervisor_is_type(X86_HYPER_MS_HYPERV)
>>
>> Do you have an environment where you would be able to reproduce this? A Windows server perhaps
>> or an AMD laptop?
> 
> Hrm, not easily, no.  Can you try two things?
> 
>   1. Linus' tree on Intel hardware
This shows the same problematic behavior (tested commit 0d3eb744aed40ffce820cded61d7eac515199165).

>   2. kvm-x86/next[*] on Intel hardware

Same here (tested kvm-x86-next-2023.04.10)

> 
> Don't bother with #2 if #1 (Linus' tree) does NOT suffer the same stalls as AMD.
> #2 is interesting iff Intel is also affected as kvm-x86/next has an optimization
> > for CR0.WP toggling, which was the Achilles' heel of the TDP MMU.  If Intel isn't
> affected, then something other than CR0.WP is to blame.
> 
> I fully expect both experiments to show the same behavior as AMD, but if for some
> reason they don't, the results should help narrow the search.

The results are the same for both branches, and it does look like this affects AMD and
Intel equally.

So seeing as this will likely take a while to figure out (and I know I won't be able to
spend too many cycles on this in the next few weeks), what do you think of a patch to
disable tdp_mmu in this configuration (for the time being?).

Something else I've been wondering: in a KVM-on-KVM setup, is tdp_mmu used in both L0
and L1 hypervisors right now?

> 
> [*] https://github.com/kvm-x86/linux/tree/next
  
Sean Christopherson April 11, 2023, 4:02 p.m. UTC | #18
On Tue, Apr 11, 2023, Jeremi Piotrowski wrote:
> On 4/11/2023 1:25 AM, Sean Christopherson wrote:
> > On Wed, Apr 05, 2023, Jeremi Piotrowski wrote:
> >> On 3/7/2023 6:36 PM, Sean Christopherson wrote:
> >>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> > >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> >>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> >>> happens to get lucky and not run afoul of the underlying bugs.  The revert appears
> >>> to be reasonably straightforward (see bottom).
> >>
> >> Hi Sean,
> >>
> >> I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has
> >> landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my
> >> AMD laptop.
> >>
> >> There is some seriously weird interaction going on between TDP MMU and Hyper-V, with
> >> or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs.
> >> I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering).
> >> I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU
> >> and 4GB of RAM.
> >>
> >> If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds.
> >>
> >> If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots
> >> (loading grub, decompressing kernel, during kernel boot), the boot output feels like
> >> it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes,
> >> I have also seen it take 4 minutes, sometimes even not finish at all. Same everything,
> >> the only difference is the value of `kvm.tdp_mmu`.
> > 
> > When a stall occurs, can you tell where the time is lost?  E.g. is the CPU stuck
> > in L0, L1, or L2?  L2 being a single vCPU rules out quite a few scenarios, e.g.
> > lock contention and whatnot.
> 
> It shows up as around 90% L2 time, 10% L1 time.

Are those numbers coming from /proc/<pid>/stat?  Something else?

> I don't have great visibility into L0 time right now, I'm trying to find
> someone who might be able to help with that.
> 
> > 
> > If you can run perf in WSL, that might be the easiest way to suss out what's going
> > on.
> 
> I can run perf, what trace would help?

Good question.  I'm not exactly a perf expert and almost never do anything beyond
`perf top`.  That's probably sufficient for now, I really just want to confirm that
L1 doesn't appear to be stuck, e.g. in KVM's page fault handler.

> The results are the same for both branches, and it does look like this affects AMD and
> Intel equally.

Nice.  I have Intel hardware at home that I'll try to repro on, though it will
be several weeks until I can dive into this.

> So seeing as this will likely take a while to figure out (and I know I won't be able to
> spend too many cycles on this in the next few weeks), what do you think of a patch to
> disable tdp_mmu in this configuration (for the time being?).

I don't particularly love the idea of disabling the TDP MMU without having the
slightest clue what's going wrong, but I'm not totally opposed to it.

Paolo, any thoughts?  You have far more experience with supporting downstream
consumers of KVM.

> Something else I've been wondering: in a KVM-on-KVM setup, is tdp_mmu used in both L0
> and L1 hypervisors right now?

By default, yes.  I double checked that L2 has similar boot times for KVM-on-KVM
with and without the TDP MMU.  Certainly nothing remotely close to 2 minutes.
  
Jeremi Piotrowski April 13, 2023, 9:53 a.m. UTC | #19
On 4/11/2023 6:02 PM, Sean Christopherson wrote:
> On Tue, Apr 11, 2023, Jeremi Piotrowski wrote:
>> On 4/11/2023 1:25 AM, Sean Christopherson wrote:
>>> On Wed, Apr 05, 2023, Jeremi Piotrowski wrote:
>>>> On 3/7/2023 6:36 PM, Sean Christopherson wrote:
>>>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> >>>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
>>>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
>>>>> happens to get lucky and not run afoul of the underlying bugs.  The revert appears
>>>>> to be reasonably straightforward (see bottom).
>>>>
>>>> Hi Sean,
>>>>
>>>> I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has
>>>> landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my
>>>> AMD laptop.
>>>>
>>>> There is some seriously weird interaction going on between TDP MMU and Hyper-V, with
>>>> or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs.
>>>> I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering).
>>>> I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU
>>>> and 4GB of RAM.
>>>>
>>>> If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds.
>>>>
>>>> If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots
>>>> (loading grub, decompressing kernel, during kernel boot), the boot output feels like
>>>> it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes,
>>>> I have also seen it take 4 minutes, sometimes even not finish at all. Same everything,
>>>> the only difference is the value of `kvm.tdp_mmu`.
>>>
>>> When a stall occurs, can you tell where the time is lost?  E.g. is the CPU stuck
>>> in L0, L1, or L2?  L2 being a single vCPU rules out quite a few scenarios, e.g.
>>> lock contention and whatnot.
>>
>> It shows up as around 90% L2 time, 10% L1 time.
> 
> Are those numbers coming from /proc/<pid>/stat?  Something else?

Yes, /proc/<pid>/stat shows that kind of ratio for the qemu process.

> 
>> I don't have great visibility into L0 time right now, I'm trying to find
>> someone who might be able to help with that.
>>
>>>
>>> If you can run perf in WSL, that might be the easiest way to suss out what's going
>>> on.
>>
>> I can run perf, what trace would help?
> 
> Good question.  I'm not exactly a perf expert and almost never do anything beyond
> `perf top`.  That's probably sufficient for now, I really just want to confirm that
> L1 doesn't appear to be stuck, e.g. in KVM's page fault handler.

Perf does not really show much of anything in L1; the 10% looks like it's vmx_flush_tlb_current.

> 
>> The results are the same for both branches, and it does look like this affects AMD and
>> Intel equally.
> 
> Nice.  I have Intel hardware at home that I'll try to repro on, though it will
> be several weeks until I can dive into this.

Same here.

> 
>> So seeing as this will likely take a while to figure out (and I know I won't be able to
>> spend too many cycles on this in the next few weeks), what do you think of a patch to
>> disable tdp_mmu in this configuration (for the time being?).
> 
> I don't particularly love the idea of disabling the TDP MMU without having the
> slightest clue what's going wrong, but I'm not totally opposed to it.
> 
> Paolo, any thoughts?  You have far more experience with supporting downstream
> consumers of KVM.
> 
>> Something else I've been wondering: in a KVM-on-KVM setup, is tdp_mmu used in both L0
>> and L1 hypervisors right now?
> 
> By default, yes.  I double checked that L2 has similar boot times for KVM-on-KVM
> with and without the TDP MMU.  Certainly nothing remotely close to 2 minutes.

Something I just noticed by tracing hv_track_root_tdp is that the VM appears to go through
some ~10000 unique roots in the period before kernel init starts (so grub + kernel decompression).
That part seems to take a long time. Is this kind of churn of roots by design?

The ftrace output for when the root changes looks something like this, kvm goes through smm emulation
during the exit.

 qemu-system-x86-18971   [015] d.... 95922.997039: kvm_exit: vcpu 0 reason EXCEPTION_NMI rip 0xfd0bd info1 0x0000000000000000 info2 0x0000000000000413 intr_info 0x80000306 error_code 0x00000000
 qemu-system-x86-18971   [015] ..... 95922.997052: p_hv_track_root_tdp_0: (hv_track_root_tdp+0x0/0x70 [kvm]) si=0x18b082000
 qemu-system-x86-18971   [015] d.... 95922.997133: kvm_entry: vcpu 0, rip 0xf7d6b

There are also root changes after IO_INSTRUCTION exits. When I look at the non-TDP MMU it seems to cycle
between two roots during that phase, while the TDP MMU allocates new ones instead.

You can see this for yourself with this script; my test machine runs Ubuntu 22.10, QEMU 7.0.0, and
a kernel from the kvm-x86/next branch.

#!/bin/bash

set -xe

fetch_flatcar () 
{ 
    sudo apt-get install -y qemu-system-x86 lbzip2
    local channel=$1;
    local version=${2:-current};
    local machine=$(uname -m);
    local arch=;
    if [[ ${machine} = "aarch64" ]]; then
        arch=arm64-usr;
    else
        if [[ ${machine} = "x86_64" ]]; then
            arch=amd64-usr;
        fi;
    fi;
    local base=https://$channel.release.flatcar-linux.net/${arch}/${version};
    wget $base/flatcar_production_qemu.sh;
    wget $base/flatcar_production_qemu_image.img.bz2;
    lbunzip2 -kvf flatcar_production_qemu_image.img.bz2
    chmod +x flatcar_production_qemu.sh
    sed -i -e 's/VM_NCPUS=.*/VM_NCPUS="1"/' flatcar_production_qemu.sh
}

[ -f flatcar_production_qemu_image.img ] || fetch_flatcar stable

cat <<EOF >config.json
{
  "ignition": {
    "version": "3.3.0"
  },
  "storage": {
    "files": [
      {
        "group": {
          "id": 500
        },
        "overwrite": true,
        "path": "/home/core/.bashrc",
        "user": {
          "id": 500
        },
        "contents": {
          "compression": "",
          "source": "data:,sudo%20shutdown%20-h%20now%0A"
        },
        "mode": 420
      }
    ]
  }
}
EOF

# first run is meant to take a bit longer due to processing provisioning
# data, subsequent runs should be stable.
time sudo ./flatcar_production_qemu.sh -i config.json -nographic -m 4096
  
Sean Christopherson April 13, 2023, 5:24 p.m. UTC | #20
On Thu, Apr 13, 2023, Jeremi Piotrowski wrote:
> On 4/11/2023 6:02 PM, Sean Christopherson wrote:
> > By default, yes.  I double checked that L2 has similar boot times for KVM-on-KVM
> > with and without the TDP MMU.  Certainly nothing remotely close to 2 minutes.
> 
> Something I just noticed by tracing hv_track_root_tdp is that the VM appears to go through
> some ~10000 unique roots in the period before kernel init starts (so grub + kernel decompression).
> That part seems to take a long time. Is this kind of churn of roots by design?
> 
> The ftrace output for when the root changes looks something like this, kvm goes through smm emulation
> during the exit.
> 
>  qemu-system-x86-18971   [015] d.... 95922.997039: kvm_exit: vcpu 0 reason EXCEPTION_NMI rip 0xfd0bd info1 0x0000000000000000 info2 0x0000000000000413 intr_info 0x80000306 error_code 0x00000000
>  qemu-system-x86-18971   [015] ..... 95922.997052: p_hv_track_root_tdp_0: (hv_track_root_tdp+0x0/0x70 [kvm]) si=0x18b082000
>  qemu-system-x86-18971   [015] d.... 95922.997133: kvm_entry: vcpu 0, rip 0xf7d6b
> 
> There are also root changes after IO_INSTRUCTION exits. When I look at non-tdp-mmu it seems to cycle between two
> roots in that phase time, and tdp-mmu allocates new ones instead.

#$&*#$*& SMM.  I know _exactly_ what's going on.

When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI and RSM,
KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU cache of previous
roots).  Unloading roots is a simple way to ensure KVM flushes and synchronizes
all roots for the vCPU, as KVM flushes and syncs when allocating a "new" root
(from the vCPU's perspective).

In the shadow MMU, KVM keeps track of all shadow pages, roots included, in a per-VM
hash table.  Unloading a "shadow" root just wipes it from the per-vCPU cache; the
root is still tracked in the per-VM hash table.  When KVM loads a "new" root for the
vCPU, KVM will find the old, unloaded root in the per-VM hash table.

But unloading roots is anathema for the TDP MMU.  Unlike the shadow MMU, the TDP MMU
doesn't track _inactive_ roots in a per-VM structure, where "active" in this case
means a root is either in-use or cached as a previous root by at least one vCPU.
When a TDP MMU root becomes inactive, i.e. the last vCPU reference to the root is
put, KVM immediately frees the root (asterisk on "immediately" as the actual freeing
may be done by a worker, but for all intents and purposes the root is gone).

The TDP MMU behavior is especially problematic for 1-vCPU setups, as unloading all roots effectively
frees all roots, whereas in a multi-vCPU setup a different vCPU usually holds a reference to an
unloaded root and thus keeps the root alive, allowing the vCPU to reuse its old root after unloading
(with a flush+sync).

What's happening in your case is that legacy BIOS does some truly evil crud with
SMM, and can transition to/from SMM thousands of times during boot.  On *every*
transition, KVM unloads its roots, i.e. KVM has to teardown, reallocate, and rebuild
a new root every time the vCPU enters SMM, and every time the vCPU exits SMM.

This exact problem was reported by the grsecurity folks when the guest toggles CR0.WP.
We duct taped a solution together for CR0.WP[1], and now finally have a more complete
fix lined up for 6.4[2], but the underlying flaw of the TDP MMU not preserving inactive
roots still exists.

Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
inactive roots on a per-VM basis is just monumentally stupid.  Ugh, and that's not
even the worst of our stupidity.  The truly awful side of all this is that we
spent an absurd amount of time getting kvm_tdp_mmu_put_root() to play nice with
putting the last reference to a valid root while holding mmu_lock for read.

Give me a few hours to whip together and test a patch, I think I see a way to fix
this without a massive amount of churn, and with fairly simple rules for how things
work.

[1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
[2] https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
  
Sean Christopherson April 13, 2023, 6:49 p.m. UTC | #21
On Thu, Apr 13, 2023, Sean Christopherson wrote:
> Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> inactive roots on a per-VM basis is just monumentally stupid.  Ugh, and that's not
> even the worst of our stupidity.  The truly awful side of all this is that we
> spent an absurd amount of time getting kvm_tdp_mmu_put_root() to play nice with
> putting the last reference to a valid root while holding mmu_lock for read.
> 
> Give me a few hours to whip together and test a patch, I think I see a way to fix
> this without a massive amount of churn, and with fairly simple rules for how things
> work.

Can you test the below patch?  I need to do more testing before posting, but it
holds up to basic testing.  With this, I can do kvm_mmu_reset_context() on every
non-fastpash VM-Exit (that's "non-fastpath") with pretty much zero performance degradation.

Trying to do the same without this patch just hangs at the reset vector because
KVM can't make forward progress on EPT violations.  Skipping EPT violations still
hangs because of a KVM oddity/flaw where debug register exits require making
forward progress before the next VM-Exit (KVM really should emulate and skip the
first exit).  Skipping DR exits boots, but with amusing performance degradation:
9s versus ~2s to get to PID 1, and 30s versus ~4s to console.

I verified forcing kvm_mmu_reset_context() does trigger a "new" root allocation,
e.g. 15 vs. 100k "allocations", so unless I guessed wrong about SMM-induced
kvm_mmu_reset_context() calls being the problem, this should do the trick.

FWIW, my best guess as to why you observe multiple minute boot times is that there
is an "asynchronous" source of SMIs, whereas my hack-a-test is largely limited to
synchronous exits.

---
 arch/x86/kvm/mmu/tdp_mmu.c | 80 +++++++++++++-------------------------
 1 file changed, 28 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b2fca11b91ff..343deccab511 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -40,7 +40,17 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
 
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 {
-	/* Also waits for any queued work items.  */
+	/*
+	 * Invalidate all roots, which besides the obvious, schedules all roots
+	 * for zapping and thus puts the TDP MMU's reference to each root, i.e.
+	 * ultimately frees all roots.
+	 */
+	kvm_tdp_mmu_invalidate_all_roots(kvm);
+
+	/*
+	 * Destroying a workqueue also first flushes the workqueue, i.e. no
+	 * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
+	 */
 	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
 
 	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -116,16 +126,6 @@ static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root
 	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
 }
 
-static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
-{
-	union kvm_mmu_page_role role = page->role;
-	role.invalid = true;
-
-	/* No need to use cmpxchg, only the invalid bit can change.  */
-	role.word = xchg(&page->role.word, role.word);
-	return role.invalid;
-}
-
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared)
 {
@@ -134,45 +134,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
 		return;
 
-	WARN_ON(!is_tdp_mmu_page(root));
-
 	/*
-	 * The root now has refcount=0.  It is valid, but readers already
-	 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
-	 * rejects it.  This remains true for the rest of the execution
-	 * of this function, because readers visit valid roots only
-	 * (except for tdp_mmu_zap_root_work(), which however
-	 * does not acquire any reference itself).
-	 *
-	 * Even though there are flows that need to visit all roots for
-	 * correctness, they all take mmu_lock for write, so they cannot yet
-	 * run concurrently. The same is true after kvm_tdp_root_mark_invalid,
-	 * since the root still has refcount=0.
-	 *
-	 * However, tdp_mmu_zap_root can yield, and writers do not expect to
-	 * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
-	 * So the root temporarily gets an extra reference, going to refcount=1
-	 * while staying invalid.  Readers still cannot acquire any reference;
-	 * but writers are now allowed to run if tdp_mmu_zap_root yields and
-	 * they might take an extra reference if they themselves yield.
-	 * Therefore, when the reference is given back by the worker,
-	 * there is no guarantee that the refcount is still 1.  If not, whoever
-	 * puts the last reference will free the page, but they will not have to
-	 * zap the root because a root cannot go from invalid to valid.
+	 * The TDP MMU itself holds a reference to each root until the root is
+	 * explicitly invalidated, i.e. the final reference should never be
+	 * put for a valid root.
 	 */
-	if (!kvm_tdp_root_mark_invalid(root)) {
-		refcount_set(&root->tdp_mmu_root_count, 1);
-
-		/*
-		 * Zapping the root in a worker is not just "nice to have";
-		 * it is required because kvm_tdp_mmu_invalidate_all_roots()
-		 * skips already-invalid roots.  If kvm_tdp_mmu_put_root() did
-		 * not add the root to the workqueue, kvm_tdp_mmu_zap_all_fast()
-		 * might return with some roots not zapped yet.
-		 */
-		tdp_mmu_schedule_zap_root(kvm, root);
-		return;
-	}
+	KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_del_rcu(&root->link);
@@ -320,7 +287,14 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	root = tdp_mmu_alloc_sp(vcpu);
 	tdp_mmu_init_sp(root, NULL, 0, role);
 
-	refcount_set(&root->tdp_mmu_root_count, 1);
+	/*
+	 * TDP MMU roots are kept until they are explicitly invalidated, either
+	 * by a memslot update or by the destruction of the VM.  Initialize the
+	 * refcount to two; one reference for the vCPU, and one reference for
+	 * the TDP MMU itself, which is held until the root is invalidated and
+	 * is ultimately put by tdp_mmu_zap_root_work().
+	 */
+	refcount_set(&root->tdp_mmu_root_count, 2);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
@@ -964,10 +938,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 {
 	struct kvm_mmu_page *root;
 
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	/* No need to hold mmu_lock when the VM is being destroyed. */
+	if (refcount_read(&kvm->users_count))
+		lockdep_assert_held_write(&kvm->mmu_lock);
+
 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
-		if (!root->role.invalid &&
-		    !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
+		if (!root->role.invalid) {
 			root->role.invalid = true;
 			tdp_mmu_schedule_zap_root(kvm, root);
 		}

base-commit: 62cf1e941a1169a5e8016fd8683d4d888ab51e01
--
  
Sean Christopherson April 13, 2023, 7:09 p.m. UTC | #22
On Thu, Apr 13, 2023, Sean Christopherson wrote:
> Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> inactive roots on a per-VM basis is just monumentally stupid.

One correction: there are 6 possible roots:

  1. 4-level !SMM !guest_mode (i.e. not nested)
  2. 4-level SMM !guest_mode
  3. 5-level !SMM !guest_mode
  4. 5-level SMM !guest_mode
  5. 4-level !SMM guest_mode
  6. 5-level !SMM guest_mode

I forgot that KVM still uses the TDP MMU when running L2 if L1 doesn't enable
EPT/TDP, i.e. if L1 is using shadow paging for L2.  But that really doesn't change
anything as each vCPU can already track 4 roots, i.e. userspace can saturate all
6 roots anyways.  And in practice, no sane VMM will create a VM with both 4-level
and 5-level roots (KVM keys off of guest.MAXPHYADDR for the TDP root level).
  
David Matlack April 13, 2023, 8:21 p.m. UTC | #23
On Thu, Apr 13, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 13, 2023, Sean Christopherson wrote:
> > Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> > 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> > inactive roots on a per-VM basis is just monumentally stupid.
>
> One correction: there are 6 possible roots:
>
>   1. 4-level !SMM !guest_mode (i.e. not nested)
>   2. 4-level SMM !guest_mode
>   3. 5-level !SMM !guest_mode
>   4. 5-level SMM !guest_mode
>   5. 4-level !SMM guest_mode
>   6. 5-level !SMM guest_mode
>
> I forgot that KVM still uses the TDP MMU when running L2 if L1 doesn't enable
> EPT/TDP, i.e. if L1 is using shadow paging for L2.  But that really doesn't change
> anything as each vCPU can already track 4 roots, i.e. userspace can saturate all
> 6 roots anyways.  And in practice, no sane VMM will create a VM with both 4-level
> and 5-level roots (KVM keys off of guest.MAXPHYADDR for the TDP root level).

Why do we create a new root for guest_mode=1 if L1 disables EPT/NPT?
  
Sean Christopherson April 13, 2023, 8:58 p.m. UTC | #24
On Thu, Apr 13, 2023, David Matlack wrote:
> On Thu, Apr 13, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 13, 2023, Sean Christopherson wrote:
> > > Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> > > 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> > > inactive roots on a per-VM basis is just monumentally stupid.
> >
> > One correction: there are 6 possible roots:
> >
> >   1. 4-level !SMM !guest_mode (i.e. not nested)
> >   2. 4-level SMM !guest_mode
> >   3. 5-level !SMM !guest_mode
> >   4. 5-level SMM !guest_mode
> >   5. 4-level !SMM guest_mode
> >   6. 5-level !SMM guest_mode
> >
> > I forgot that KVM still uses the TDP MMU when running L2 if L1 doesn't enable
> > EPT/TDP, i.e. if L1 is using shadow paging for L2.  But that really doesn't change
> > anything as each vCPU can already track 4 roots, i.e. userspace can saturate all
> > 6 roots anyways.  And in practice, no sane VMM will create a VM with both 4-level
> > and 5-level roots (KVM keys off of guest.MAXPHYADDR for the TDP root level).
> 
> Why do we create a new root for guest_mode=1 if L1 disables EPT/NPT?

Because "private", a.k.a. KVM-internal, memslots are visible to L1 but not L2.
Which for TDP means the APIC-access page.  From commit 3a2936dedd20:

    kvm: mmu: Don't expose private memslots to L2
    
    These private pages have special purposes in the virtualization of L1,
    but not in the virtualization of L2. In particular, L1's APIC access
    page should never be entered into L2's page tables, because this
    causes a great deal of confusion when the APIC virtualization hardware
    is being used to accelerate L2's accesses to its own APIC.

FWIW, I _think_ KVM could actually let L2 access the APIC-access page when L1 is
running without any APIC virtualization, i.e. when L1 is passing its APIC through
to L2.  E.g. something like the below, but I ain't touching that with a 10 foot pole
unless someone explicitly asks for it :-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 039fb16560a0..8aa12f5f2c30 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4370,10 +4370,13 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
        if (!kvm_is_visible_memslot(slot)) {
                /* Don't expose private memslots to L2. */
                if (is_guest_mode(vcpu)) {
-                       fault->slot = NULL;
-                       fault->pfn = KVM_PFN_NOSLOT;
-                       fault->map_writable = false;
-                       return RET_PF_CONTINUE;
+                       if (!slot || slot->id != APIC_ACCESS_PAGE_PRIVATE_MEMSLOT ||
+                           nested_cpu_has_virtual_apic(vcpu)) {
+                               fault->slot = NULL;
+                               fault->pfn = KVM_PFN_NOSLOT;
+                               fault->map_writable = false;
+                               return RET_PF_CONTINUE;
+                           }
                }
                /*
                 * If the APIC access page exists but is disabled, go directly
  

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4d2bc08794e4..a0868ae3688d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2031,7 +2031,8 @@  void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
 void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
 
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
-		       int tdp_max_root_level, int tdp_huge_page_level);
+		       int tdp_max_root_level, int tdp_huge_page_level,
+		       bool enable_tdp_mmu);
 
 static inline u16 kvm_read_ldt(void)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c91ee2927dd7..5c0e28a7a3bc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5787,14 +5787,15 @@  void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
 }
 
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
-		       int tdp_max_root_level, int tdp_huge_page_level)
+		       int tdp_max_root_level, int tdp_huge_page_level,
+		       bool enable_tdp_mmu)
 {
 	tdp_enabled = enable_tdp;
 	tdp_root_level = tdp_forced_root_level;
 	max_tdp_level = tdp_max_root_level;
 
 #ifdef CONFIG_X86_64
-	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
+	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
 #endif
 	/*
 	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d13cf53e7390..070c3f7f8c9f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4925,6 +4925,7 @@  static __init int svm_hardware_setup(void)
 	struct page *iopm_pages;
 	void *iopm_va;
 	int r;
+	bool enable_tdp_mmu;
 	unsigned int order = get_order(IOPM_SIZE);
 
 	/*
@@ -4991,9 +4992,12 @@  static __init int svm_hardware_setup(void)
 	if (!boot_cpu_has(X86_FEATURE_NPT))
 		npt_enabled = false;
 
+	enable_tdp_mmu = svm_hv_enable_tdp_mmu();
+
 	/* Force VM NPT level equal to the host's paging level */
 	kvm_configure_mmu(npt_enabled, get_npt_level(),
-			  get_npt_level(), PG_LEVEL_1G);
+			  get_npt_level(), PG_LEVEL_1G,
+			  enable_tdp_mmu);
 	pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
 
 	/* Setup shadow_me_value and shadow_me_mask */
diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
index 6981c1e9a809..aa49ac5d66bc 100644
--- a/arch/x86/kvm/svm/svm_onhyperv.h
+++ b/arch/x86/kvm/svm/svm_onhyperv.h
@@ -30,6 +30,11 @@  static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
 		hve->hv_enlightenments_control.msr_bitmap = 1;
 }
 
+static inline bool svm_hv_enable_tdp_mmu(void)
+{
+	return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
+}
+
 static inline void svm_hv_hardware_setup(void)
 {
 	if (npt_enabled &&
@@ -84,6 +89,11 @@  static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
 {
 }
 
+static inline bool svm_hv_enable_tdp_mmu(void)
+{
+	return true;
+}
+
 static inline void svm_hv_hardware_setup(void)
 {
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c788aa382611..4d3808755d39 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8442,7 +8442,8 @@  static __init int hardware_setup(void)
 	vmx_setup_me_spte_mask();
 
 	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
-			  ept_caps_to_lpage_level(vmx_capability.ept));
+			  ept_caps_to_lpage_level(vmx_capability.ept),
+			  true);
 
 	/*
 	 * Only enable PML when hardware supports PML feature, and both EPT