Message ID: 20230227171751.1211786-1-jpiotrowski@linux.microsoft.com
State: New
From: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
To: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org, Vitaly Kuznetsov <vkuznets@redhat.com>, Sean Christopherson <seanjc@google.com>, Paolo Bonzini <pbonzini@redhat.com>, Tianyu Lan <ltykernel@gmail.com>, Michael Kelley <mikelley@microsoft.com>
Subject: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V
Date: Mon, 27 Feb 2023 17:17:51 +0000
Message-Id: <20230227171751.1211786-1-jpiotrowski@linux.microsoft.com>
Series: KVM: SVM: Disable TDP MMU when running on Hyper-V
Commit Message
Jeremi Piotrowski
Feb. 27, 2023, 5:17 p.m. UTC
TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
The issue was first introduced by two commits:
- bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
flush to caller when freeing TDP MMU shadow pages"
- efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
roots via asynchronous worker"
The root cause is that since then there are missing TLB flushes which
are required by HV_X64_NESTED_ENLIGHTENED_TLB. The failure manifests
as L2 guest VMs being unable to complete boot due to memory
inconsistencies between L1 and L2 guests which lead to various
assertion/emulation failures.
The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
Hyper-V on AMD and is always used by Linux. The TLB flush required by
HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
boot performance on AMD is reproducibly slower compared to when TDP MMU is
disabled.
Disable TDP MMU when using SVM Hyper-V for the time being while we
search for a better fix.
Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/t/#u
Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
---
Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
<=v6.2. This would be needed in stable too, and I don't know about putting
fixes tags.
Jeremi
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu/mmu.c          |  5 +++--
 arch/x86/kvm/svm/svm.c          |  6 +++++-
 arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
 arch/x86/kvm/vmx/vmx.c          |  3 ++-
 5 files changed, 22 insertions(+), 5 deletions(-)
Comments
Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:

> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
> The issue was first introduced by two commits:
>
> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>   flush to caller when freeing TDP MMU shadow pages"
> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>   roots via asynchronous worker"
>
> The root cause is that since then there are missing TLB flushes which
> are required by HV_X64_NESTED_ENLIGHTENED_TLB.

Please share more details on what's actually missing as you get them,
I'd like to understand which flushes can be legally avoided on bare
hardware and Hyper-V/VMX but not on Hyper-V/SVM.

> The failure manifests
> as L2 guest VMs being unable to complete boot due to memory
> inconsistencies between L1 and L2 guests which lead to various
> assertion/emulation failures.
>
> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
> Hyper-V on AMD and is always used by Linux. The TLB flush required by
> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
> boot performance on AMD is reproducibly slower compared to when TDP MMU is
> disabled.
>
> Disable TDP MMU when using SVM Hyper-V for the time being while we
> search for a better fix.
I'd suggest we go the other way around: disable
HV_X64_NESTED_ENLIGHTENED_TLB on SVM:

diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
index 6981c1e9a809..be98da5a4277 100644
--- a/arch/x86/kvm/svm/svm_onhyperv.h
+++ b/arch/x86/kvm/svm/svm_onhyperv.h
@@ -32,7 +32,8 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
 
 static inline void svm_hv_hardware_setup(void)
 {
-	if (npt_enabled &&
+	/* A comment about missing TLB flushes */
+	if (!tdp_mmu_enabled && npt_enabled &&
 	    ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
 		pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
 		svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;

this way we won't have a not-obvious-at-all MMU change on Hyper-V. I
understand this may have some performance implications but MMU switch
has some as well.

> Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/t/#u
> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
> ---
> Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
> <=v6.2. This would be needed in stable too, and I don't know about putting
> fixes tags.
Cc: stable@vger.kernel.org # 5.17.0

should do)

>
> Jeremi
>
>  arch/x86/include/asm/kvm_host.h |  3 ++-
>  arch/x86/kvm/mmu/mmu.c          |  5 +++--
>  arch/x86/kvm/svm/svm.c          |  6 +++++-
>  arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>  5 files changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4d2bc08794e4..a0868ae3688d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2031,7 +2031,8 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
>  void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>  
>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> -		       int tdp_max_root_level, int tdp_huge_page_level);
> +		       int tdp_max_root_level, int tdp_huge_page_level,
> +		       bool enable_tdp_mmu);
>  
>  static inline u16 kvm_read_ldt(void)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c91ee2927dd7..5c0e28a7a3bc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>  }
>  
>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> -		       int tdp_max_root_level, int tdp_huge_page_level)
> +		       int tdp_max_root_level, int tdp_huge_page_level,
> +		       bool enable_tdp_mmu)
>  {
>  	tdp_enabled = enable_tdp;
>  	tdp_root_level = tdp_forced_root_level;
>  	max_tdp_level = tdp_max_root_level;
>  
>  #ifdef CONFIG_X86_64
> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>  #endif
>  	/*
>  	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index d13cf53e7390..070c3f7f8c9f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4925,6 +4925,7 @@ static __init int svm_hardware_setup(void)
>  	struct page *iopm_pages;
>  	void *iopm_va;
>  	int r;
> +	bool enable_tdp_mmu;
>  	unsigned int order = get_order(IOPM_SIZE);
>  
>  	/*
> @@ -4991,9 +4992,12 @@ static __init int svm_hardware_setup(void)
>  	if (!boot_cpu_has(X86_FEATURE_NPT))
>  		npt_enabled = false;
>  
> +	enable_tdp_mmu = svm_hv_enable_tdp_mmu();
> +
>  	/* Force VM NPT level equal to the host's paging level */
>  	kvm_configure_mmu(npt_enabled, get_npt_level(),
> -			  get_npt_level(), PG_LEVEL_1G);
> +			  get_npt_level(), PG_LEVEL_1G,
> +			  enable_tdp_mmu);
>  	pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
>  
>  	/* Setup shadow_me_value and shadow_me_mask */
> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
> index 6981c1e9a809..aa49ac5d66bc 100644
> --- a/arch/x86/kvm/svm/svm_onhyperv.h
> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
> @@ -30,6 +30,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>  		hve->hv_enlightenments_control.msr_bitmap = 1;
>  }
>  
> +static inline bool svm_hv_enable_tdp_mmu(void)
> +{
> +	return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
> +}
> +
>  static inline void svm_hv_hardware_setup(void)
>  {
>  	if (npt_enabled &&
> @@ -84,6 +89,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>  {
>  }
>  
> +static inline bool svm_hv_enable_tdp_mmu(void)
> +{
> +	return true;
> +}
> +
>  static inline void svm_hv_hardware_setup(void)
>  {
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c788aa382611..4d3808755d39 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8442,7 +8442,8 @@ static __init int hardware_setup(void)
>  	vmx_setup_me_spte_mask();
>  
>  	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
> -			  ept_caps_to_lpage_level(vmx_capability.ept));
> +			  ept_caps_to_lpage_level(vmx_capability.ept),
> +			  true);
>  
>  	/*
>  	 * Only enable PML when hardware supports PML feature, and both EPT
On 06/03/2023 18:52, Vitaly Kuznetsov wrote:
> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>
>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
>> The issue was first introduced by two commits:
>>
>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>>   flush to caller when freeing TDP MMU shadow pages"
>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>>   roots via asynchronous worker"
>>
>> The root cause is that since then there are missing TLB flushes which
>> are required by HV_X64_NESTED_ENLIGHTENED_TLB.
>
> Please share more details on what's actually missing as you get them,
> I'd like to understand which flushes can be legally avoided on bare
> hardware and Hyper-V/VMX but not on Hyper-V/SVM.

See the linked thread here
https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@linux.microsoft.com/#t
for all the details/analyses, but the summary was that either of these 2
options would work, with a) having fewer flushes (footnote: fewer flushes
is not necessarily better):

a) adding a hyperv_flush_guest_mapping(__pa(root->spt)) after
   kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to
   svm_flush_tlb_current()

These are only needed on Hyper-V/SVM because of how the enlightenment
works (needs an explicit flush to rebuild L0 shadow page tables).
Hyper-V/VMX does not need any changes and currently works. Let me know
if you need more information on something here, I'll try to get it.

>> The failure manifests
>> as L2 guest VMs being unable to complete boot due to memory
>> inconsistencies between L1 and L2 guests which lead to various
>> assertion/emulation failures.
>>
>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
>> Hyper-V on AMD and is always used by Linux. The TLB flush required by
>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
>> boot performance on AMD is reproducibly slower compared to when TDP MMU is
>> disabled.
>>
>> Disable TDP MMU when using SVM Hyper-V for the time being while we
>> search for a better fix.
>
> I'd suggest we go the other way around: disable
> HV_X64_NESTED_ENLIGHTENED_TLB on SVM:

Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is
used, and I prefer that option too. The enlightenment does offer a nice
performance advantage with non-TDP_MMU, and I did not see TDP_MMU perform
any better compared to that. Afaik the code to use the enlightenment on
Hyper-V/SVM was written/tested before TDP_MMU became the default.

If you have a specific scenario in mind, we could test and see what the
implications are there.

> [Vitaly's proposed diff and the re-quoted patch trimmed; see above]
Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:

> On 06/03/2023 18:52, Vitaly Kuznetsov wrote:
>> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>>
>>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
>>> The issue was first introduced by two commits:
>>>
>>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>>>   flush to caller when freeing TDP MMU shadow pages"
>>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>>>   roots via asynchronous worker"
>>>
>>> The root cause is that since then there are missing TLB flushes which
>>> are required by HV_X64_NESTED_ENLIGHTENED_TLB.
>>
>> Please share more details on what's actually missing as you get them,
>> I'd like to understand which flushes can be legally avoided on bare
>> hardware and Hyper-V/VMX but not on Hyper-V/SVM.
>
> See the linked thread here
> https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@linux.microsoft.com/#t
> for all the details/analyses, but the summary was that either of these 2
> options would work, with a) having fewer flushes (footnote: fewer flushes
> is not necessarily better):
>
> a) adding a hyperv_flush_guest_mapping(__pa(root->spt)) after
>    kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to
>    svm_flush_tlb_current()
>
> These are only needed on Hyper-V/SVM because of how the enlightenment
> works (needs an explicit flush to rebuild L0 shadow page tables).
> Hyper-V/VMX does not need any changes and currently works. Let me know
> if you need more information on something here, I'll try to get it.

Ah, I missed the whole party! Thanks for the pointers!

>>> The failure manifests
>>> as L2 guest VMs being unable to complete boot due to memory
>>> inconsistencies between L1 and L2 guests which lead to various
>>> assertion/emulation failures.

Which levels are we talking about here, *real* L1 and L2 or L1 and L2
from KVM's perspective (real L2 and L3)?

>>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
>>> Hyper-V on AMD and is always used by Linux. The TLB flush required by
>>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
>>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
>>> boot performance on AMD is reproducibly slower compared to when TDP MMU is
>>> disabled.
>>>
>>> Disable TDP MMU when using SVM Hyper-V for the time being while we
>>> search for a better fix.
>>
>> I'd suggest we go the other way around: disable
>> HV_X64_NESTED_ENLIGHTENED_TLB on SVM:
>
> Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is
> used, and I prefer that option too. The enlightenment does offer a nice
> performance advantage with non-TDP_MMU, and I did not see TDP_MMU perform
> any better compared to that. Afaik the code to use the enlightenment on
> Hyper-V/SVM was written/tested before TDP_MMU became the default.
>
> If you have a specific scenario in mind, we could test and see what the
> implications are there.

I don't have a strong opinion here, I've suggested a smaller change so
it's easier to backport it to stable kernels and easier to revert when a
proper fix comes to mainline.

For performance implication, I'd only consider non-nested scenarios from
KVM's perspective (i.e. real L2 from Hyper-V's PoV), as running L3 is
unlikely a common use-case and, if I understood correctly, is broken
anyway.

>> [proposed diff and the re-quoted patch trimmed; see above]
On Mon, Feb 27, 2023, Jeremi Piotrowski wrote:
> Disable TDP MMU when using SVM Hyper-V for the time being while we
> search for a better fix.

...

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c91ee2927dd7..5c0e28a7a3bc 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>  }
>  
>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> -		       int tdp_max_root_level, int tdp_huge_page_level)
> +		       int tdp_max_root_level, int tdp_huge_page_level,
> +		       bool enable_tdp_mmu)
>  {
>  	tdp_enabled = enable_tdp;
>  	tdp_root_level = tdp_forced_root_level;
>  	max_tdp_level = tdp_max_root_level;
>  
>  #ifdef CONFIG_X86_64
> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>  #endif

Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM:
SVM: hyper-v: Remote TLB flush for SVM") or fix the thing properly
straightaway.  KVM doesn't magically handle the flushes correctly for the
shadow/legacy MMU, KVM just happens to get lucky and not run afoul of the
underlying bugs.  The revert appears to be reasonably straightforward (see
bottom).

And _if_ we want to hack-a-fix it, then I would strongly prefer a very
isolated, obviously hacky fix, e.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36e4561554ca..a9ba4ae14fda 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5779,8 +5779,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 	tdp_root_level = tdp_forced_root_level;
 	max_tdp_level = tdp_max_root_level;
 
+	/*
+	 * FIXME: Remove the enlightened TLB restriction when KVM properly
+	 * handles TLB flushes for said enlightenment.
+	 */
 #ifdef CONFIG_X86_64
-	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
+	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled &&
+			  !(ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
 #endif
 	/*
 	 * max_huge_page_level reflects KVM's MMU capabilities irrespective

The revert...

---
 arch/x86/kvm/svm/svm.c          |  3 ---
 arch/x86/kvm/svm/svm_onhyperv.h | 27 ---------------------------
 2 files changed, 30 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 11068e8eb969..292650dc85a0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1320,7 +1320,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 	if (sev_guest(vcpu->kvm))
 		sev_init_vmcb(svm);
 
-	svm_hv_init_vmcb(vmcb);
 	init_vmcb_after_set_cpuid(vcpu);
 
 	vmcb_mark_all_dirty(vmcb);
@@ -4075,8 +4074,6 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 		svm->vmcb->control.nested_cr3 = __sme_set(root_hpa);
 		vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
 
-		hv_track_root_tdp(vcpu, root_hpa);
-
 		cr3 = vcpu->arch.cr3;
 	} else if (root_level >= PT64_ROOT_4LEVEL) {
 		cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
index 6981c1e9a809..5118fd273e73 100644
--- a/arch/x86/kvm/svm/svm_onhyperv.h
+++ b/arch/x86/kvm/svm/svm_onhyperv.h
@@ -15,31 +15,8 @@ static struct kvm_x86_ops svm_x86_ops;
 
 int svm_hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu);
 
-static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
-{
-	struct hv_vmcb_enlightenments *hve = &vmcb->control.hv_enlightenments;
-
-	BUILD_BUG_ON(sizeof(vmcb->control.hv_enlightenments) !=
-		     sizeof(vmcb->control.reserved_sw));
-
-	if (npt_enabled &&
-	    ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB)
-		hve->hv_enlightenments_control.enlightened_npt_tlb = 1;
-
-	if (ms_hyperv.nested_features & HV_X64_NESTED_MSR_BITMAP)
-		hve->hv_enlightenments_control.msr_bitmap = 1;
-}
-
 static inline void svm_hv_hardware_setup(void)
 {
-	if (npt_enabled &&
-	    ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
-		pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
-		svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
-		svm_x86_ops.tlb_remote_flush_with_range =
-			hv_remote_flush_tlb_with_range;
-	}
-
 	if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH) {
 		int cpu;
@@ -80,10 +57,6 @@ static inline void svm_hv_update_vp_id(struct vmcb *vmcb, struct kvm_vcpu *vcpu)
 }
 
 #else
-static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
-{
-}
-
 static inline void svm_hv_hardware_setup(void)
 {
 }

base-commit: cb8748a781fe983e451f616ce4861a1c49ce79dd
--
On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote: > Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: > hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM > doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just > happens to get lucky and not run afoul of the underlying bugs. I don't think it's about luck---the legacy MMU's zapping/invalidation seems to invoke the flush hypercall correctly: Jeremi, did you ever track the call stack where hyperv_nested_flush_guest_mapping is triggered? Paolo > The revert appears > to be reasonably straightforward (see bottom). > > And _if_ we want to hack-a-fix it, then I would strongly prefer a very isolated, > obviously hacky fix, e.g. > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 36e4561554ca..a9ba4ae14fda 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5779,8 +5779,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, > tdp_root_level = tdp_forced_root_level; > max_tdp_level = tdp_max_root_level; > > + /* > + * FIXME: Remove the enlightened TLB restriction when KVM properly > + * handles TLB flushes for said enlightenment. > + */. > #ifdef CONFIG_X86_64 > - tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled; > + tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && > + !(ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB); > #endif > /* > * max_huge_page_level reflects KVM's MMU capabilities irrespective > > > > > The revert... 
> [... revert diff snipped; quoted in full in Sean's mail above ...]
On Wed, Mar 08, 2023, Paolo Bonzini wrote: > On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote: > > Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: > > hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM > > doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just > > happens to get lucky and not run afoul of the underlying bugs. > > I don't think it's about luck---the legacy MMU's zapping/invalidation > seems to invoke the flush hypercall correctly: ...for the paths that Jeremi has exercised, and for which a stale TLB entry is fatal to L2. E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush in its path and fully relies on the buggy kvm_flush_remote_tlbs(). In other words, KVM is getting lucky :-) > Jeremi, did you ever track the call stack where > hyperv_nested_flush_guest_mapping is triggered? I don't think it matters. As above, it only takes one path where KVM is fully relying on kvm_flush_remote_tlbs() for the whole thing to fall apart.
On 07/03/2023 11:07, Vitaly Kuznetsov wrote: > Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: > >> On 06/03/2023 18:52, Vitaly Kuznetsov wrote: >>> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes: >>> >>>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17. >>>> The issue was first introduced by two commits: >>>> >>>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB >>>> flush to caller when freeing TDP MMU shadow pages" >>>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct >>>> roots via asynchronous worker" >>>> >>>> The root cause is that since then there are missing TLB flushes which >>>> are required by HV_X64_NESTED_ENLIGHTENED_TLB. >>> >>> Please share more details on what's actually missing as you get them, >>> I'd like to understand which flushes can be legally avoided on bare >>> hardware and Hyper-V/VMX but not on Hyper-V/SVM. >>> >> >> See the linked thread here >> https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@linux.microsoft.com/#t >> for all the details/analyses but the summary was that either of these 2 >> options would work, with a) having fewer flushes (footnote: fewer flushes is not necessarily >> better): >> >> a) adding a hyperv_flush_guest_mapping(__pa(root->spt)) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp() >> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current() >> >> These are only needed on Hyper-V/SVM because of how the enlightenment works (needs an explicit >> flush to rebuild L0 shadow page tables). Hyper-V/VMX does not need any changes and currently >> works. Let me know if you need more information on something here, I'll try to get it. >> > > Ah, I missed the whole party! Thanks for the pointers!
> >>>> The failure manifests >>>> as L2 guest VMs being unable to complete boot due to memory >>>> inconsistencies between L1 and L2 guests which lead to various >>>> assertion/emulation failures. > > Which levels are we talking about here, *real* L1 and L2 or L1 and L2 > from KVM's perspective (real L2 and L3)? Real L1 and L2. In this whole discussion L0 is Hyper-V, L1 is KVM and L2 is a Linux VM. > >>>> >>>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by >>>> Hyper-V on AMD and is always used by Linux. The TLB flush required by >>>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush >>>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest >>>> boot performance on AMD is reproducibly slower compared to when TDP MMU is >>>> disabled. >>>> >>>> Disable TDP MMU when using SVM Hyper-V for the time being while we >>>> search for a better fix. >>> >>> I'd suggest we go the other way around: disable >>> HV_X64_NESTED_ENLIGHTENED_TLB on SVM: >> >> Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is used, and >> I prefer that option too. The enlightenment does offer a nice performance advantage >> with non-TDP_MMU, and I did not see TDP_MMU perform any better compared to that. >> Afaik the code to use the enlightenment on Hyper-V/SVM was written/tested before >> TDP_MMU became the default. >> >> If you have a specific scenario in mind, we could test and see what the implications >> are there. > > I don't have a strong opinion here, I've suggested a smaller change so > it's easier to backport it to stable kernels and easier to revert when a > proper fix comes to mainline. Noted. My concern here is about changing a default in a way that lowers performance, because the proper fix that comes later might end up not being suitable for stable. > For performance implication, I'd only > consider non-nested scenarios from KVM's perspective (i.e.
real L2 from > Hyper-V's PoV), as running L3 is unlikely a common use-case and, if I > understood correctly, is broken anyway. I agree with that. Right now L2 is broken, I've never even attempted L3 to see if it would work. Jeremi
On 08/03/2023 01:00, Paolo Bonzini wrote: > On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote: >> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: >> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM >> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just >> happens to get lucky and not run afoul of the underlying bugs. > > I don't think it's about luck---the legacy MMU's zapping/invalidation > seems to invoke the flush hypercall correctly: > > Jeremi, did you ever track the call stack where > hyperv_nested_flush_guest_mapping is triggered? > > Paolo > Yes, here's all the stacktraces I get for hyperv_nested_flush_guest_mapping with legacy MMU: 1 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vcpu_ioctl kvm_arch_vcpu_ioctl_run svm_handle_exit svm_invoke_exit_handler npf_interception kvm_mmu_page_fault kvm_mmu_do_page_fault kvm_tdp_page_fault direct_page_fault kvm_faultin_pfn __gfn_to_pfn_memslot hva_to_pfn get_user_pages_unlocked __gup_longterm_locked __get_user_pages handle_mm_fault __handle_mm_fault do_huge_pmd_wp_page __split_huge_pmd __mmu_notifier_invalidate_range_start kvm_mmu_notifier_invalidate_range_start kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping 1 entry_SYSCALL_64_after_hwframe do_syscall_64 syscall_exit_to_user_mode exit_to_user_mode_prepare arch_do_signal_or_restart get_signal do_group_exit do_exit mmput __mmput exit_mmap __mmu_notifier_release kvm_mmu_notifier_release kvm_arch_flush_shadow_all kvm_mmu_zap_all kvm_mmu_commit_zap_page.part.0 kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping 1 entry_SYSCALL_64_after_hwframe do_syscall_64 syscall_exit_to_user_mode exit_to_user_mode_prepare arch_do_signal_or_restart get_signal do_group_exit do_exit task_work_run ____fput __fput kvm_vm_release kvm_put_kvm kvm_destroy_vm 
kvm_arch_destroy_vm kvm_mmu_unload kvm_mmu_free_roots kvm_mmu_commit_zap_page.part.0 kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping 6 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vcpu_ioctl kvm_arch_vcpu_ioctl_run svm_handle_exit svm_invoke_exit_handler npf_interception kvm_mmu_page_fault kvm_mmu_do_page_fault kvm_tdp_page_fault direct_page_fault kvm_faultin_pfn __gfn_to_pfn_memslot hva_to_pfn get_user_pages_unlocked __gup_longterm_locked __get_user_pages handle_mm_fault __handle_mm_fault do_wp_page __mmu_notifier_invalidate_range_start kvm_mmu_notifier_invalidate_range_start kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping 8 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vcpu_ioctl kvm_arch_vcpu_ioctl_run kvm_apic_accept_events kvm_vcpu_reset kvm_mmu_reset_context kvm_mmu_free_roots kvm_mmu_commit_zap_page.part.0 kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping 20 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vcpu_ioctl kvm_arch_vcpu_ioctl_run svm_handle_exit svm_invoke_exit_handler npf_interception kvm_mmu_page_fault kvm_mmu_do_page_fault kvm_tdp_page_fault direct_page_fault mmu_set_spte hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping_range 406 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vm_ioctl __kvm_set_memory_region kvm_set_memslot kvm_arch_flush_shadow_memslot kvm_page_track_flush_slot kvm_mmu_invalidate_zap_pages_in_memslot kvm_mmu_zap_all_fast kvm_mmu_commit_zap_page.part.0 kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping 406 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vcpu_ioctl kvm_arch_vcpu_ioctl_run kvm_mmu_free_obsolete_roots __kvm_mmu_free_obsolete_roots kvm_mmu_free_roots kvm_mmu_commit_zap_page.part.0 kvm_flush_remote_tlbs 
hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping and here's the stacks for TDP MMU: 1 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vcpu_ioctl kvm_arch_vcpu_ioctl_run svm_handle_exit svm_invoke_exit_handler npf_interception kvm_mmu_page_fault kvm_mmu_do_page_fault kvm_tdp_page_fault kvm_faultin_pfn __gfn_to_pfn_memslot hva_to_pfn get_user_pages_unlocked __gup_longterm_locked __get_user_pages handle_mm_fault __handle_mm_fault do_huge_pmd_wp_page __split_huge_pmd __mmu_notifier_invalidate_range_start kvm_mmu_notifier_invalidate_range_start kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping 6 entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_ioctl kvm_vcpu_ioctl kvm_arch_vcpu_ioctl_run svm_handle_exit svm_invoke_exit_handler npf_interception kvm_mmu_page_fault kvm_mmu_do_page_fault kvm_tdp_page_fault kvm_faultin_pfn __gfn_to_pfn_memslot hva_to_pfn get_user_pages_unlocked __gup_longterm_locked __get_user_pages handle_mm_fault __handle_mm_fault do_wp_page __mmu_notifier_invalidate_range_start kvm_mmu_notifier_invalidate_range_start kvm_flush_remote_tlbs hv_remote_flush_tlb hv_remote_flush_tlb_with_range hyperv_flush_guest_mapping >> The revert appears >> to be reasonably straightforward (see bottom). >> >> And _if_ we want to hack-a-fix it, then I would strongly prefer a very isolated, >> obviously hacky fix, e.g. >> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c >> index 36e4561554ca..a9ba4ae14fda 100644 >> --- a/arch/x86/kvm/mmu/mmu.c >> +++ b/arch/x86/kvm/mmu/mmu.c >> @@ -5779,8 +5779,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, >> tdp_root_level = tdp_forced_root_level; >> max_tdp_level = tdp_max_root_level; >> >> + /* >> + * FIXME: Remove the enlightened TLB restriction when KVM properly >> + * handles TLB flushes for said enlightenment. >> + */. 
>> [... remainder of the hack-a-fix and revert diffs snipped; quoted in full in Sean's mail above ...]
On 08/03/2023 01:39, Sean Christopherson wrote: > On Wed, Mar 08, 2023, Paolo Bonzini wrote: >> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote: >>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM >>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just >>> happens to get lucky and not run afoul of the underlying bugs. >> >> I don't think it's about luck---the legacy MMU's zapping/invalidation >> seems to invoke the flush hypercall correctly: > > ...for the paths that Jeremi has exercised, and for which a stale TLB entry is > fatal to L2. E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush > in its path and fully relies on the buggy kvm_flush_remote_tlbs(). > Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the hypercall that is needed, I don't see how this might be an issue of a missing "range-based TLB flush". kvm_unmap_gfn_range is called from kvm_mmu_notifier_invalidate_range_start and 'flush_on_ret=true' is set, so it is followed by kvm_flush_remote_tlbs which calls hv_remote_flush_tlb. > In other words, KVM is getting lucky :-) > >> Jeremi, did you ever track the call stack where >> hyperv_nested_flush_guest_mapping is triggered? > > I don't think it matters. As above, it only takes one path where KVM is fully > relying on kvm_flush_remote_tlbs() for the whole thing to fall apart
On 08/03/2023 16:55, Jeremi Piotrowski wrote: > > > On 08/03/2023 01:39, Sean Christopherson wrote: >> On Wed, Mar 08, 2023, Paolo Bonzini wrote: >>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote: >>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: >>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM >>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just >>>> happens to get lucky and not run afoul of the underlying bugs. >>> >>> I don't think it's about luck---the legacy MMU's zapping/invalidation >>> seems to invoke the flush hypercall correctly: >> >> ...for the paths that Jeremi has exercised, and for which a stale TLB entry is >> fatal to L2. E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush >> in its path and fully relies on the buggy kvm_flush_remote_tlbs(). >> > > Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the hypercall > that is needed, I don't see how this might be an issue of a missing "range-based TLB flush". > > kvm_unmap_gfn_range is called from kvm_mmu_notifier_invalidate_range_start and 'flush_on_ret=true' > is set, so it is followed by kvm_flush_remote_tlbs which calls hv_remote_flush_tlb. > >> In other words, KVM is getting lucky :-) >> >>> Jeremi, did you ever track the call stack where >>> hyperv_nested_flush_guest_mapping is triggered? >> >> I don't think it matters. As above, it only takes one path where KVM is fully >> relying on kvm_flush_remote_tlbs() for the whole thing to fall apart Slowly I'm starting to understand what we've been talking about, thank you :) Paolo/Sean, what do you think about smth like the following, except I would make it SVM only, and I'd need to think about what to do with the return. I believe this accurately reflects what the optimization is about. 
hv_track_root_tdp is called from kvm_mmu_load_pgd, which covers both kvm_mmu_load and kvm_mmu_new_pgd (which requests KVM_REQ_LOAD_MMU_PGD). diff --git a/arch/x86/kvm/kvm_onhyperv.c b/arch/x86/kvm/kvm_onhyperv.c index 482d6639ef88..6a5bd3cbace8 100644 --- a/arch/x86/kvm/kvm_onhyperv.c +++ b/arch/x86/kvm/kvm_onhyperv.c @@ -29,6 +29,18 @@ static inline int hv_remote_flush_root_tdp(hpa_t root_tdp, return hyperv_flush_guest_mapping(root_tdp); } +static int hv_vcpu_flush_tlb_current(struct kvm_vcpu *vcpu) +{ + struct kvm_arch *kvm_arch = &vcpu->kvm->arch; + hpa_t root_tdp = vcpu->arch.hv_root_tdp; + int ret; + + ret = hyperv_flush_guest_mapping(root_tdp); + if (!ret) + kvm_arch->hv_root_tdp = root_tdp; + return ret; +} + int hv_remote_flush_tlb_with_range(struct kvm *kvm, struct kvm_tlb_range *range) { @@ -101,8 +113,10 @@ void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp) if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb) { spin_lock(&kvm_arch->hv_root_tdp_lock); vcpu->arch.hv_root_tdp = root_tdp; - if (root_tdp != kvm_arch->hv_root_tdp) + if (root_tdp != kvm_arch->hv_root_tdp) { kvm_arch->hv_root_tdp = INVALID_PAGE; + hv_vcpu_flush_tlb_current(vcpu); + } spin_unlock(&kvm_arch->hv_root_tdp_lock); } }
On Wed, Mar 08, 2023, Jeremi Piotrowski wrote: > On 08/03/2023 01:39, Sean Christopherson wrote: > > On Wed, Mar 08, 2023, Paolo Bonzini wrote: > >> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote: > >>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: > >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM > >>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just > >>> happens to get lucky and not run afoul of the underlying bugs. > >> > >> I don't think it's about luck---the legacy MMU's zapping/invalidation > >> seems to invoke the flush hypercall correctly: > > > > ...for the paths that Jeremi has exercised, and for which a stale TLB entry is > > fatal to L2. E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush > > in its path and fully relies on the buggy kvm_flush_remote_tlbs(). > > > > Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the > hypercall that is needed, I don't see how this might be an issue of a missing > "range-based TLB flush". Doh, I forgot that the arch hook in kvm_flush_remote_tlbs() leads to the Hyper-V hook. svm_flush_tlb_current is very much broken, but in practice it doesn't matter outside of the direct call from kvm_mmu_load(), because in all other paths KVM will flow through a Hyper-V flush if KVM actually modifies its MMU in any ways. E.g. the request from kvm_mmu_new_pgd() when force_flush_and_sync_on_reuse=true is neutered, but that exists only as a safeguard against MMU bugs. And for things like kvm_invalidate_pcid() and kvm_post_set_cr4(), my understanding is that Hyper-V will still flush the bare metal TLB, it's only Hyper-V's shadow page tables that are stale. Depending on how Hyper-V handles ASIDs, pre_svm_run() may also be broken. If Hyper-V tracks and rebuilds only the current ASID, then bumping the ASID won't rebuild potentially stale page tables. 
But I'm guessing Hyper-V ignores the ASID since the hypercall takes only the root PA. The truly problematic case is kvm_mmu_load(), where KVM relies on the flush to force Hyper-V to rebuild shadow page tables for an old, possibly stale nCR3. This affects only the TDP MMU because of an explicit optimization in the TDP MMU. So in practice we could squeak by with something like this: if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb) hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa); else static_call(kvm_x86_flush_tlb_current)(vcpu); but I'm not convinced that avoiding a hypercall in svm_flush_tlb_current() just to avoid overhead when running an L3 (nested VM from L1 KVM's perspective) is worthwhile. The real problem there is that KVM nested SVM TLB/ASID support is an unoptimized mess, and I can't imagine someone running an L3 is going to be super concerned with performance. I also think we should have a sanity check in the flush_tlb_all() path, i.e. WARN if kvm_flush_remote_tlbs() falls back. 
Something like this (probably doesn't compile, likely needs #ifdefs or helpers): diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7ef4f9e3b35a..38afc5cac1c4 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3770,7 +3770,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu) svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); } -static void svm_flush_tlb_current(struct kvm_vcpu *vcpu) +static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3794,6 +3794,23 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu) svm->current_vmcb->asid_generation--; } +static void svm_flush_tlb_current(struct kvm_vcpu *vcpu) +{ + if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb && + VALID_PAGE(vcpu->arch.mmu->root.hpa)) + hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa); + + svm_flush_tlb_asid(vcpu); +} + +static void svm_flush_tlb_all(struct kvm_vcpu *vcpu) +{ + if (WARN_ON_ONCE(kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb)) + hv_remote_flush_tlb(vcpu->kvm); + + svm_flush_tlb_asid(vcpu); +} + static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva) { struct vcpu_svm *svm = to_svm(vcpu); @@ -4786,10 +4803,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .set_rflags = svm_set_rflags, .get_if_flag = svm_get_if_flag, - .flush_tlb_all = svm_flush_tlb_current, + .flush_tlb_all = svm_flush_tlb_all, .flush_tlb_current = svm_flush_tlb_current, .flush_tlb_gva = svm_flush_tlb_gva, - .flush_tlb_guest = svm_flush_tlb_current, + .flush_tlb_guest = svm_flush_tlb_asid, .vcpu_pre_run = svm_vcpu_pre_run, .vcpu_run = svm_vcpu_run,
On Wed, Mar 08, 2023, Jeremi Piotrowski wrote: > On 08/03/2023 16:55, Jeremi Piotrowski wrote: > > > > > > On 08/03/2023 01:39, Sean Christopherson wrote: > >> On Wed, Mar 08, 2023, Paolo Bonzini wrote: > >>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote: > >>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: > >>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM > >>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just > >>>> happens to get lucky and not run afoul of the underlying bugs. > >>> > >>> I don't think it's about luck---the legacy MMU's zapping/invalidation > >>> seems to invoke the flush hypercall correctly: > >> > >> ...for the paths that Jeremi has exercised, and for which a stale TLB entry is > >> fatal to L2. E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush > >> in its path and fully relies on the buggy kvm_flush_remote_tlbs(). > >> > > > > Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the hypercall > > that is needed, I don't see how this might be an issue of a missing "range-based TLB flush". > > > > kvm_unmap_gfn_range is called from kvm_mmu_notifier_invalidate_range_start and 'flush_on_ret=true' > > is set, so it is followed by kvm_flush_remote_tlbs which calls hv_remote_flush_tlb. > > > >> In other words, KVM is getting lucky :-) > >> > >>> Jeremi, did you ever track the call stack where > >>> hyperv_nested_flush_guest_mapping is triggered? > >> > >> I don't think it matters. As above, it only takes one path where KVM is fully > >> relying on kvm_flush_remote_tlbs() for the whole thing to fall apart > > Slowly I'm starting to understand what we've been talking about, thank you :) > > Paolo/Sean, what do you think about smth like the following, except I would make > it SVM only, and I'd need to think about what to do with the return. 
> I believe this accurately reflects what the optimization is about. hv_track_root_tdp > is called from kvm_mmu_load_pgd, which covers both kvm_mmu_load and kvm_mmu_new_pgd > (which requests KVM_REQ_LOAD_MMU_PGD). It's close, but KVM doesn't *always* need to flush when loading a root. KVM needs to flush when loading a brand spanking new root, which is the kvm_mmu_load() path. But when KVM loads a root via KVM_REQ_LOAD_MMU_PGD/kvm_mmu_new_pgd(), a flush may or may not be necessary, e.g. if KVM reuses an old, but still valid, root (each vCPU has a 3-entry root cache) and a TLB flush isn't architecturally required, then there is no need to flush. And as mentioned in the other tendril of this thread, I'd really like to fix svm_flush_tlb_current() since it's technically broken, even though it's highly unlikely (maybe even impossible?) to cause issues in practice. > [... Jeremi's kvm_onhyperv.c proposal snipped; quoted in full above ...]
@Alex, do you know the answer to Sean's question below about ASID handling in
Hyper-V? Any other comments about how we're intending to fix things are also
welcome.

Jeremi

On 08/03/2023 20:11, Sean Christopherson wrote:
> On Wed, Mar 08, 2023, Jeremi Piotrowski wrote:
>> On 08/03/2023 01:39, Sean Christopherson wrote:
>>> On Wed, Mar 08, 2023, Paolo Bonzini wrote:
>>>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@google.com> wrote:
>>>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
>>>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway. KVM
>>>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
>>>>> happens to get lucky and not run afoul of the underlying bugs.
>>>>
>>>> I don't think it's about luck---the legacy MMU's zapping/invalidation
>>>> seems to invoke the flush hypercall correctly:
>>>
>>> ...for the paths that Jeremi has exercised, and for which a stale TLB entry is
>>> fatal to L2. E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
>>> in its path and fully relies on the buggy kvm_flush_remote_tlbs().
>>>
>>
>> Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the
>> hypercall that is needed, I don't see how this might be an issue of a missing
>> "range-based TLB flush".
>
> Doh, I forgot that the arch hook in kvm_flush_remote_tlbs() leads to the Hyper-V
> hook.
>
> svm_flush_tlb_current is very much broken, but in practice it doesn't matter outside
> of the direct call from kvm_mmu_load(), because in all other paths KVM will flow
> through a Hyper-V flush if KVM actually modifies its MMU in any way. E.g. the
> request from kvm_mmu_new_pgd() when force_flush_and_sync_on_reuse=true is neutered,
> but that exists only as a safeguard against MMU bugs. And for things like
> kvm_invalidate_pcid() and kvm_post_set_cr4(), my understanding is that Hyper-V
> will still flush the bare metal TLB, it's only Hyper-V's shadow page tables that
> are stale.
>
> Depending on how Hyper-V handles ASIDs, pre_svm_run() may also be broken. If
> Hyper-V tracks and rebuilds only the current ASID, then bumping the ASID won't
> rebuild potentially stale page tables. But I'm guessing Hyper-V ignores the ASID
> since the hypercall takes only the root PA.
>
> The truly problematic case is kvm_mmu_load(), where KVM relies on the flush to
> force Hyper-V to rebuild shadow page tables for an old, possibly stale nCR3. This
> affects only the TDP MMU because of an explicit optimization in the TDP MMU. So
> in practice we could squeak by with something like this:
>
> 	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb)
> 		hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
> 	else
> 		static_call(kvm_x86_flush_tlb_current)(vcpu);
>
> but I'm not convinced that avoiding a hypercall in svm_flush_tlb_current() just
> to avoid overhead when running an L3 (nested VM from L1 KVM's perspective) is
> worthwhile. The real problem there is that KVM nested SVM TLB/ASID support is an
> unoptimized mess, and I can't imagine someone running an L3 is going to be super
> concerned with performance.
>
> I also think we should have a sanity check in the flush_tlb_all() path, i.e. WARN
> if kvm_flush_remote_tlbs() falls back.
>
> Something like this (probably doesn't compile, likely needs #ifdefs or helpers):
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7ef4f9e3b35a..38afc5cac1c4 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3770,7 +3770,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
>  	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
>  }
>  
> -static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
> +static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> @@ -3794,6 +3794,23 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
>  		svm->current_vmcb->asid_generation--;
>  }
>  
> +static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
> +{
> +	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb &&
> +	    VALID_PAGE(vcpu->arch.mmu->root.hpa))
> +		hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
> +
> +	svm_flush_tlb_asid(vcpu);
> +}
> +
> +static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
> +{
> +	if (WARN_ON_ONCE(kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb))
> +		hv_remote_flush_tlb(vcpu->kvm);
> +
> +	svm_flush_tlb_asid(vcpu);
> +}
> +
>  static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4786,10 +4803,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.set_rflags = svm_set_rflags,
>  	.get_if_flag = svm_get_if_flag,
>  
> -	.flush_tlb_all = svm_flush_tlb_current,
> +	.flush_tlb_all = svm_flush_tlb_all,
>  	.flush_tlb_current = svm_flush_tlb_current,
>  	.flush_tlb_gva = svm_flush_tlb_gva,
> -	.flush_tlb_guest = svm_flush_tlb_current,
> +	.flush_tlb_guest = svm_flush_tlb_asid,
>  
>  	.vcpu_pre_run = svm_vcpu_pre_run,
>  	.vcpu_run = svm_vcpu_run,
Yes, I agree that bumping the ASID doesn't flush Hyper-V's shadow page tables (if
the guest opts into "enlightened TLB"). Here is the relevant section from the TLFS:

"The nested hypervisor can optionally opt into an "enlightened TLB" by setting
EnlightenedNptTlb to "1" in HV_SVM_ENLIGHTENED_VMCB_FIELDS. If the nested
hypervisor opts into the enlightenment, ASID invalidations just flush TLB entries
derived from first level address translation (i.e. the virtual address space). To
flush TLB entries derived from the nested page table (NPT) and force the L0
hypervisor to rebuild shadow page tables, the HvCallFlushGuestPhysicalAddressSpace
or HvCallFlushGuestPhysicalAddressList hypercalls must be used."

(see https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/nested-virtualization)

Thanks,
Alex

-----Original Message-----
From: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Sent: Thursday, March 9, 2023 9:59 AM
To: Alexander Grest <Alexander.Grest@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>; linux-kernel@vger.kernel.org; kvm@vger.kernel.org; Vitaly Kuznetsov <vkuznets@redhat.com>; Tianyu Lan <ltykernel@gmail.com>; Michael Kelley (LINUX) <mikelley@microsoft.com>; Sean Christopherson <seanjc@google.com>
Subject: Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V
On 3/7/2023 6:36 PM, Sean Christopherson wrote: > Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: > hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM > doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just > happens to get lucky and not run afoul of the underlying bugs. The revert appears > to be reasonably straightforward (see bottom). Hi Sean, I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my AMD laptop. There is some seriously weird interaction going on between TDP MMU and Hyper-V, with or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs. I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering). I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU and 4GB of RAM. If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds. If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots (loading grub, decompressing kernel, during kernel boot), the boot output feels like it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes, I have also seen it take 4 minutes, sometimes even not finish at all. Same everything, the only difference is the value of `kvm.tdp_mmu`. So I would like to revisit disabling tdp_mmu on hyperv altogether for the time being but it should probably be with the following condition: tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && !hypervisor_is_type(X86_HYPER_MS_HYPERV) Do you have an environment where you would be able to reproduce this? A Windows server perhaps or an AMD laptop? Jeremi > > And _if_ we want to hack-a-fix it, then I would strongly prefer a very isolated, > obviously hacky fix, e.g. 
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 36e4561554ca..a9ba4ae14fda 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5779,8 +5779,13 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>  	tdp_root_level = tdp_forced_root_level;
>  	max_tdp_level = tdp_max_root_level;
>  
> +	/*
> +	 * FIXME: Remove the enlightened TLB restriction when KVM properly
> +	 * handles TLB flushes for said enlightenment.
> +	 */
>  #ifdef CONFIG_X86_64
> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled &&
> +			  !(ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
>  #endif
>  	/*
>  	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
>
On Wed, Apr 05, 2023, Jeremi Piotrowski wrote: > On 3/7/2023 6:36 PM, Sean Christopherson wrote: > > Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: > > hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM > > doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just > > happens to get lucky and not run afoul of the underlying bugs. The revert appears > > to be reasonably straightforward (see bottom). > > Hi Sean, > > I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has > landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my > AMD laptop. > > There is some seriously weird interaction going on between TDP MMU and Hyper-V, with > or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs. > I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering). > I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU > and 4GB of RAM. > > If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds. > > If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots > (loading grub, decompressing kernel, during kernel boot), the boot output feels like > it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes, > I have also seen it take 4 minutes, sometimes even not finish at all. Same everything, > the only difference is the value of `kvm.tdp_mmu`. When a stall occurs, can you tell where the time is lost? E.g. is the CPU stuck in L0, L1, or L2? L2 being a single vCPU rules out quite a few scenarios, e.g. lock contention and whatnot. If you can run perf in WSL, that might be the easiest way to suss out what's going on. 
> So I would like to revisit disabling tdp_mmu on hyperv altogether for the time being but it > should probably be with the following condition: > > tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && !hypervisor_is_type(X86_HYPER_MS_HYPERV) > > Do you have an environment where you would be able to reproduce this? A Windows server perhaps > or an AMD laptop? Hrm, not easily, no. Can you try two things? 1. Linus' tree on Intel hardware 2. kvm-x86/next[*] on Intel hardware Don't bother with #2 if #1 (Linus' tree) does NOT suffer the same stalls as AMD. #2 is interesting iff Intel is also affected as kvm-x86/next has an optimization for CR0.WP toggling, which was the achilles heel of the TDP MMU. If Intel isn't affected, then something other than CR0.WP is to blame. I fully expect both experiments to show the same behavior as AMD, but if for some reason they don't, the results should help narrow the search. [*] https://github.com/kvm-x86/linux/tree/next
On 4/11/2023 1:25 AM, Sean Christopherson wrote: > On Wed, Apr 05, 2023, Jeremi Piotrowski wrote: >> On 3/7/2023 6:36 PM, Sean Christopherson wrote: >>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM >>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just >>> happens to get lucky and not run afoul of the underlying bugs. The revert appears >>> to be reasonably straightforward (see bottom). >> >> Hi Sean, >> >> I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has >> landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my >> AMD laptop. >> >> There is some seriously weird interaction going on between TDP MMU and Hyper-V, with >> or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs. >> I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering). >> I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU >> and 4GB of RAM. >> >> If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds. >> >> If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots >> (loading grub, decompressing kernel, during kernel boot), the boot output feels like >> it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes, >> I have also seen it take 4 minutes, sometimes even not finish at all. Same everything, >> the only difference is the value of `kvm.tdp_mmu`. > > When a stall occurs, can you tell where the time is lost? E.g. is the CPU stuck > in L0, L1, or L2? L2 being a single vCPU rules out quite a few scenarios, e.g. > lock contention and whatnot. It shows up as around 90% L2 time, 10% L1 time. I don't have great visibility into L0 time right now, I'm trying to find someone who might be able to help with that. 
> > If you can run perf in WSL, that might be the easiest way to suss out what's going > on. I can run perf, what trace would help? > >> So I would like to revisit disabling tdp_mmu on hyperv altogether for the time being but it >> should probably be with the following condition: >> >> tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && !hypervisor_is_type(X86_HYPER_MS_HYPERV) >> >> Do you have an environment where you would be able to reproduce this? A Windows server perhaps >> or an AMD laptop? > > Hrm, not easily, no. Can you try two things? > > 1. Linus' tree on Intel hardware This shows the same problematic behavior (tested commit 0d3eb744aed40ffce820cded61d7eac515199165). > 2. kvm-x86/next[*] on Intel hardware Same here (tested kvm-x86-next-2023.04.10) > > Don't bother with #2 if #1 (Linus' tree) does NOT suffer the same stalls as AMD. > #2 is interesting iff Intel is also affected as kvm-x86/next has an optimization > for CR0.WP toggling, which was the achilles heel of the TDP MMU. If Intel isn't > affected, then something other than CR0.WP is to blame. > > I fully expect both experiments to show the same behavior as AMD, but if for some > reason they don't, the results should help narrow the search. The results are the same for both branches, and it does look like this affects AMD and Intel equally. So seeing as this will likely take a while to figure out (and I know I won't be able to spend too many cycles on this in the next few weeks), what do you think of a patch to disable tdp_mmu in this configuration (for the time being?). Something else I've been wondering: in a KVM-on-KVM setup, is tdp_mmu used in both L0 and L1 hypervisors right now? > > [*] https://github.com/kvm-x86/linux/tree/next
On Tue, Apr 11, 2023, Jeremi Piotrowski wrote: > On 4/11/2023 1:25 AM, Sean Christopherson wrote: > > On Wed, Apr 05, 2023, Jeremi Piotrowski wrote: > >> On 3/7/2023 6:36 PM, Sean Christopherson wrote: > >>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: > >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM > >>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just > >>> happens to get lucky and not run afoul of the underlying bugs. The revert appears > >>> to be reasonably straightforward (see bottom). > >> > >> Hi Sean, > >> > >> I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has > >> landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my > >> AMD laptop. > >> > >> There is some seriously weird interaction going on between TDP MMU and Hyper-V, with > >> or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs. > >> I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering). > >> I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU > >> and 4GB of RAM. > >> > >> If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds. > >> > >> If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots > >> (loading grub, decompressing kernel, during kernel boot), the boot output feels like > >> it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes, > >> I have also seen it take 4 minutes, sometimes even not finish at all. Same everything, > >> the only difference is the value of `kvm.tdp_mmu`. > > > > When a stall occurs, can you tell where the time is lost? E.g. is the CPU stuck > > in L0, L1, or L2? L2 being a single vCPU rules out quite a few scenarios, e.g. > > lock contention and whatnot. > > It shows up as around 90% L2 time, 10% L1 time. 
Are those numbers coming from /proc/<pid>/stat? Something else? > I don't have great visibility into L0 time right now, I'm trying to find > someone who might be able to help with that. > > > > > If you can run perf in WSL, that might be the easiest way to suss out what's going > > on. > > I can run perf, what trace would help? Good question. I'm not exactly a perf expert and almost never do anything beyond `perf top`. That's probably sufficient for now, I really just want to confirm that L1 doesn't appear to be stuck, e.g. in KVM's page fault handler. > The results are the same for both branches, and it does look like this affects AMD and > Intel equally. Nice. I have Intel hardware at home that I'll try to repro on, though it will be several weeks until I can dive into this. > So seeing as this will likely take a while to figure out (and I know I won't be able to > spend too many cycles on this in the next few weeks), what do you think of a patch to > disable tdp_mmu in this configuration (for the time being?). I don't particularly love the idea of disabling the TDP MMU without having the slightest clue what's going wrong, but I'm not totally opposed to it. Paolo, any thoughts? You have far more experience with supporting downstream consumers of KVM. > Something else I've been wondering: in a KVM-on-KVM setup, is tdp_mmu used in both L0 > and L1 hypervisors right now? By default, yes. I double checked that L2 has similar boot times for KVM-on-KVM with and without the TDP MMU. Certainly nothing remotely close to 2 minutes.
On 4/11/2023 6:02 PM, Sean Christopherson wrote: > On Tue, Apr 11, 2023, Jeremi Piotrowski wrote: >> On 4/11/2023 1:25 AM, Sean Christopherson wrote: >>> On Wed, Apr 05, 2023, Jeremi Piotrowski wrote: >>>> On 3/7/2023 6:36 PM, Sean Christopherson wrote: >>>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM: >>>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straitaway. KVM >>>>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just >>>>> happens to get lucky and not run afoul of the underlying bugs. The revert appears >>>>> to be reasonably straightforward (see bottom). >>>> >>>> Hi Sean, >>>> >>>> I'm back, and I don't have good news. The fix for the missing hyperv TLB flushes has >>>> landed in Linus' tree and I now had the chance to test things outside Azure, in WSL on my >>>> AMD laptop. >>>> >>>> There is some seriously weird interaction going on between TDP MMU and Hyper-V, with >>>> or without enlightened TLB. My laptop has 16 vCPUs, so the WSL VM also has 16 vCPUs. >>>> I have hardcoded the kernel to disable enlightened TLB (so we know that is not interfering). >>>> I'm running a Flatcar Linux VM inside the WSL VM using legacy BIOS, a single CPU >>>> and 4GB of RAM. >>>> >>>> If I run with `kvm.tdp_mmu=0`, I can boot and shutdown my VM consistently in 20 seconds. >>>> >>>> If I run with TDP MMU, the VM boot stalls for seconds at a time in various spots >>>> (loading grub, decompressing kernel, during kernel boot), the boot output feels like >>>> it's happening in slow motion. The fastest I see it finish the same cycle is 2 minutes, >>>> I have also seen it take 4 minutes, sometimes even not finish at all. Same everything, >>>> the only difference is the value of `kvm.tdp_mmu`. >>> >>> When a stall occurs, can you tell where the time is lost? E.g. is the CPU stuck >>> in L0, L1, or L2? L2 being a single vCPU rules out quite a few scenarios, e.g. >>> lock contention and whatnot. 
>> >> It shows up as around 90% L2 time, 10% L1 time. > > Are those numbers coming from /proc/<pid>/stat? Something else? Yes, /proc/<pid>/stat shows that kind of ratio for the qemu process. > >> I don't have great visibility into L0 time right now, I'm trying to find >> someone who might be able to help with that. >> >>> >>> If you can run perf in WSL, that might be the easiest way to suss out what's going >>> on. >> >> I can run perf, what trace would help? > > Good question. I'm not exactly a perf expert and almost never do anything beyond > `perf top`. That's probably sufficient for now, I really just want to confirm that > L1 doesn't appear to be stuck, e.g. in KVM's page fault handler. Perf does not really show much anything in the L1, the 10% looks like it's vmx_flush_tlb_current. > >> The results are the same for both branches, and it does look like this affects AMD and >> Intel equally. > > Nice. I have Intel hardware at home that I'll try to repro on, though it will > be several weeks until I can dive into this. Same here. > >> So seeing as this will likely take a while to figure out (and I know I won't be able to >> spend too many cycles on this in the next few weeks), what do you think of a patch to >> disable tdp_mmu in this configuration (for the time being?). > > I don't particularly love the idea of disabling the TDP MMU without having the > slightest clue what's going wrong, but I'm not totally opposed to it. > > Paolo, any thoughts? You have far more experience with supporting downstream > consumers of KVM. > >> Something else I've been wondering: in a KVM-on-KVM setup, is tdp_mmu used in both L0 >> and L1 hypervisors right now? > > By default, yes. I double checked that L2 has similar boot times for KVM-on-KVM > with and without the TDP MMU. Certainly nothing remotely close to 2 minutes. 
Something I just noticed by tracing hv_track_root_tdp is that the VM appears to go through
some ~10000 unique roots in the period before kernel init starts (so grub + kernel
decompression). That part seems to take a long time. Is this kind of churn of roots
by design?

The ftrace output for when the root changes looks something like this; kvm goes
through smm emulation during the exit.

 qemu-system-x86-18971 [015] d.... 95922.997039: kvm_exit: vcpu 0 reason EXCEPTION_NMI rip 0xfd0bd info1 0x0000000000000000 info2 0x0000000000000413 intr_info 0x80000306 error_code 0x00000000
 qemu-system-x86-18971 [015] ..... 95922.997052: p_hv_track_root_tdp_0: (hv_track_root_tdp+0x0/0x70 [kvm]) si=0x18b082000
 qemu-system-x86-18971 [015] d.... 95922.997133: kvm_entry: vcpu 0, rip 0xf7d6b

There are also root changes after IO_INSTRUCTION exits. When I look at non-tdp-mmu,
it seems to cycle between two roots in that phase, whereas tdp-mmu allocates new
ones instead.

You can see this for yourself with this script; my test machine is Ubuntu 22.10,
QEMU 7.0.0, and a kernel from the kvm-x86/next branch.
#!/bin/bash
set -xe

fetch_flatcar () {
    sudo apt-get install -y qemu-system-x86 lbzip2
    local channel=$1;
    local version=${2:-current};
    local machine=$(uname -m);
    local arch=;
    if [[ ${machine} = "aarch64" ]]; then
        arch=arm64-usr;
    else
        if [[ ${machine} = "x86_64" ]]; then
            arch=amd64-usr;
        fi;
    fi;
    local base=https://$channel.release.flatcar-linux.net/${arch}/${version};
    wget $base/flatcar_production_qemu.sh;
    wget $base/flatcar_production_qemu_image.img.bz2;
    lbunzip2 -kvf flatcar_production_qemu_image.img.bz2
    chmod +x flatcar_production_qemu.sh
    sed -i -e 's/VM_NCPUS=.*/VM_NCPUS="1"/' flatcar_production_qemu.sh
}

[ -f flatcar_production_qemu_image.img ] || fetch_flatcar stable

cat <<EOF >config.json
{
  "ignition": { "version": "3.3.0" },
  "storage": {
    "files": [
      {
        "group": { "id": 500 },
        "overwrite": true,
        "path": "/home/core/.bashrc",
        "user": { "id": 500 },
        "contents": {
          "compression": "",
          "source": "data:,sudo%20shutdown%20-h%20now%0A"
        },
        "mode": 420
      }
    ]
  }
}
EOF

# first run is meant to take a bit longer due to processing provisioning
# data, subsequent runs should be stable.
time sudo ./flatcar_production_qemu.sh -i config.json -nographic -m 4096
On Thu, Apr 13, 2023, Jeremi Piotrowski wrote: > On 4/11/2023 6:02 PM, Sean Christopherson wrote: > > By default, yes. I double checked that L2 has similar boot times for KVM-on-KVM > > with and without the TDP MMU. Certainly nothing remotely close to 2 minutes. > > Something I just noticed by tracing hv_track_root_tdp is that the VM appears to go through > some ~10000 unique roots in the period before kernel init starts (so grub + kernel decompression). > That part seems to take a long time. Is this kind of churn of roots by design? > > The ftrace output for when the root changes looks something like this, kvm goes through smm emulation > during the exit. > > qemu-system-x86-18971 [015] d.... 95922.997039: kvm_exit: vcpu 0 reason EXCEPTION_NMI rip 0xfd0bd info1 0x0000000000000000 info2 0x0000000000000413 intr_info 0x80000306 error_code 0x00000000 > qemu-system-x86-18971 [015] ..... 95922.997052: p_hv_track_root_tdp_0: (hv_track_root_tdp+0x0/0x70 [kvm]) si=0x18b082000 > qemu-system-x86-18971 [015] d.... 95922.997133: kvm_entry: vcpu 0, rip 0xf7d6b > > There are also root changes after IO_INSTRUCTION exits. When I look at non-tdp-mmu it seems to cycle between two > roots in that phase time, and tdp-mmu allocates new ones instead. #$&*#$*& SMM. I know _exactly_ what's going on. When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU cache of previous roots). Unloading roots is a simple way to ensure KVM flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs when allocating a "new" root (from the vCPU's perspective). In the shadow MMU, KVM keeps track of all shadow pages, roots included, in a per-VM hash table. Unloading a "shadow" root just wipes it from the per-vCPU cache; the root is still tracked in the per-VM hash table. When KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root in the per-VM hash table. 
But unloading roots is anathema for the TDP MMU.  Unlike the shadow MMU, the
TDP MMU doesn't track _inactive_ roots in a per-VM structure, where "active" in
this case means a root is either in-use or cached as a previous root by at
least one vCPU.  When a TDP MMU root becomes inactive, i.e. the last vCPU
reference to the root is put, KVM immediately frees the root (asterisk on
"immediately" as the actual freeing may be done by a worker, but for all
intents and purposes the root is gone).

The TDP MMU behavior is especially problematic for 1-vCPU setups, as unloading
all roots effectively frees all roots, whereas in a multi-vCPU setup, a
different vCPU usually holds a reference to an unloaded root and thus keeps the
root alive, allowing the vCPU to reuse its old root after unloading (with a
flush+sync).

What's happening in your case is that legacy BIOS does some truly evil crud
with SMM, and can transition to/from SMM thousands of times during boot.  On
*every* transition, KVM unloads its roots, i.e. KVM has to tear down,
reallocate, and rebuild a new root every time the vCPU enters SMM, and every
time the vCPU exits SMM.

This exact problem was reported by the grsecurity folks when the guest toggles
CR0.WP.  We duct taped a solution together for CR0.WP[1], and now finally have
a more complete fix lined up for 6.4[2], but the underlying flaw of the TDP MMU
not preserving inactive roots still exists.

Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not
keeping inactive roots on a per-VM basis is just monumentally stupid.

Ugh, and that's not even the worst of our stupidity.  The truly awful side of
all this is that we spent an absurd amount of time getting
kvm_tdp_mmu_put_root() to play nice with putting the last reference to a valid
root while holding mmu_lock for read.

Give me a few hours to whip together and test a patch, I think I see a way to
fix this without a massive amount of churn, and with fairly simple rules for
how things work.

[1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
[2] https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
On Thu, Apr 13, 2023, Sean Christopherson wrote:
> Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> inactive roots on a per-VM basis is just monumentally stupid.  Ugh, and that's not
> even the worst of our stupidity.  The truly awful side of all this is that we
> spent an absurd amount of time getting kvm_tdp_mmu_put_root() to play nice with
> putting the last reference to a valid root while holding mmu_lock for read.
>
> Give me a few hours to whip together and test a patch, I think I see a way to fix
> this without a massive amount of churn, and with fairly simple rules for how things
> work.

Can you test the below patch?  I need to do more testing before posting, but it
holds up to basic testing.

With this, I can do kvm_mmu_reset_context() on every non-fastpath VM-Exit with
pretty much zero performance degradation.  Trying to do the same without this
patch just hangs at the reset vector because KVM can't make forward progress on
EPT violations.  Skipping EPT violations still hangs because of a KVM
oddity/flaw where debug register exits require making forward progress before
the next VM-Exit (KVM really should emulate and skip the first exit).  Skipping
DR exits boots, but with amusing performance degradation: 9s versus ~2s to get
to PID 1, and 30s versus ~4s to console.

I verified forcing kvm_mmu_reset_context() does trigger a "new" root
allocation, e.g. 15 vs. 100k "allocations", so unless I guessed wrong about
SMM-induced kvm_mmu_reset_context() calls being the problem, this should do
the trick.

FWIW, my best guess as to why you observe multiple minute boot times is that
there is an "asynchronous" source of SMIs, whereas my hack-a-test is largely
limited to synchronous exits.
---
 arch/x86/kvm/mmu/tdp_mmu.c | 80 +++++++++++++-------------------------
 1 file changed, 28 insertions(+), 52 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b2fca11b91ff..343deccab511 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -40,7 +40,17 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
 
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 {
-	/* Also waits for any queued work items. */
+	/*
+	 * Invalidate all roots, which besides the obvious, schedules all roots
+	 * for zapping and thus puts the TDP MMU's reference to each root, i.e.
+	 * ultimately frees all roots.
+	 */
+	kvm_tdp_mmu_invalidate_all_roots(kvm);
+
+	/*
+	 * Destroying a workqueue also first flushes the workqueue, i.e. no
+	 * need to invoke kvm_tdp_mmu_zap_invalidated_roots().
+	 */
 	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
 
 	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -116,16 +126,6 @@ static void tdp_mmu_schedule_zap_root(struct kvm *kvm, struct kvm_mmu_page *root
 	queue_work(kvm->arch.tdp_mmu_zap_wq, &root->tdp_mmu_async_work);
 }
 
-static inline bool kvm_tdp_root_mark_invalid(struct kvm_mmu_page *page)
-{
-	union kvm_mmu_page_role role = page->role;
-	role.invalid = true;
-
-	/* No need to use cmpxchg, only the invalid bit can change. */
-	role.word = xchg(&page->role.word, role.word);
-	return role.invalid;
-}
-
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared)
 {
@@ -134,45 +134,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
 		return;
 
-	WARN_ON(!is_tdp_mmu_page(root));
-
 	/*
-	 * The root now has refcount=0.  It is valid, but readers already
-	 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
-	 * rejects it.  This remains true for the rest of the execution
-	 * of this function, because readers visit valid roots only
-	 * (except for tdp_mmu_zap_root_work(), which however
-	 * does not acquire any reference itself).
-	 *
-	 * Even though there are flows that need to visit all roots for
-	 * correctness, they all take mmu_lock for write, so they cannot yet
-	 * run concurrently.  The same is true after kvm_tdp_root_mark_invalid,
-	 * since the root still has refcount=0.
-	 *
-	 * However, tdp_mmu_zap_root can yield, and writers do not expect to
-	 * see refcount=0 (see for example kvm_tdp_mmu_invalidate_all_roots()).
-	 * So the root temporarily gets an extra reference, going to refcount=1
-	 * while staying invalid.  Readers still cannot acquire any reference;
-	 * but writers are now allowed to run if tdp_mmu_zap_root yields and
-	 * they might take an extra reference if they themselves yield.
-	 * Therefore, when the reference is given back by the worker,
-	 * there is no guarantee that the refcount is still 1.  If not, whoever
-	 * puts the last reference will free the page, but they will not have to
-	 * zap the root because a root cannot go from invalid to valid.
+	 * The TDP MMU itself holds a reference to each root until the root is
+	 * explicitly invalidated, i.e. the final reference should never be
+	 * put for a valid root.
 	 */
-	if (!kvm_tdp_root_mark_invalid(root)) {
-		refcount_set(&root->tdp_mmu_root_count, 1);
-
-		/*
-		 * Zapping the root in a worker is not just "nice to have";
-		 * it is required because kvm_tdp_mmu_invalidate_all_roots()
-		 * skips already-invalid roots.  If kvm_tdp_mmu_put_root() did
-		 * not add the root to the workqueue, kvm_tdp_mmu_zap_all_fast()
-		 * might return with some roots not zapped yet.
-		 */
-		tdp_mmu_schedule_zap_root(kvm, root);
-		return;
-	}
+	KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_del_rcu(&root->link);
@@ -320,7 +287,14 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	root = tdp_mmu_alloc_sp(vcpu);
 	tdp_mmu_init_sp(root, NULL, 0, role);
 
-	refcount_set(&root->tdp_mmu_root_count, 1);
+	/*
+	 * TDP MMU roots are kept until they are explicitly invalidated, either
+	 * by a memslot update or by the destruction of the VM.  Initialize the
+	 * refcount to two; one reference for the vCPU, and one reference for
+	 * the TDP MMU itself, which is held until the root is invalidated and
+	 * is ultimately put by tdp_mmu_zap_root_work().
+	 */
+	refcount_set(&root->tdp_mmu_root_count, 2);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
@@ -964,10 +938,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 {
 	struct kvm_mmu_page *root;
 
-	lockdep_assert_held_write(&kvm->mmu_lock);
+	/* No need to hold mmu_lock when the VM is being destroyed. */
+	if (refcount_read(&kvm->users_count))
+		lockdep_assert_held_write(&kvm->mmu_lock);
+
 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
-		if (!root->role.invalid &&
-		    !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
+		if (!root->role.invalid) {
 			root->role.invalid = true;
 			tdp_mmu_schedule_zap_root(kvm, root);
 		}

base-commit: 62cf1e941a1169a5e8016fd8683d4d888ab51e01
--
On Thu, Apr 13, 2023, Sean Christopherson wrote:
> Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> inactive roots on a per-VM basis is just monumentally stupid.

One correction: there are 6 possible roots:

  1. 4-level !SMM !guest_mode (i.e. not nested)
  2. 4-level SMM !guest_mode
  3. 5-level !SMM !guest_mode
  4. 5-level SMM !guest_mode
  5. 4-level !SMM guest_mode
  6. 5-level !SMM guest_mode

I forgot that KVM still uses the TDP MMU when running L2 if L1 doesn't enable
EPT/TDP, i.e. if L1 is using shadow paging for L2.  But that really doesn't
change anything as each vCPU can already track 4 roots, i.e. userspace can
saturate all 6 roots anyways.  And in practice, no sane VMM will create a VM
with both 4-level and 5-level roots (KVM keys off of guest.MAXPHYADDR for the
TDP root level).
On Thu, Apr 13, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Apr 13, 2023, Sean Christopherson wrote:
> > Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> > 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> > inactive roots on a per-VM basis is just monumentally stupid.
>
> One correction: there are 6 possible roots:
>
>   1. 4-level !SMM !guest_mode (i.e. not nested)
>   2. 4-level SMM !guest_mode
>   3. 5-level !SMM !guest_mode
>   4. 5-level SMM !guest_mode
>   5. 4-level !SMM guest_mode
>   6. 5-level !SMM guest_mode
>
> I forgot that KVM still uses the TDP MMU when running L2 if L1 doesn't enable
> EPT/TDP, i.e. if L1 is using shadow paging for L2.  But that really doesn't change
> anything as each vCPU can already track 4 roots, i.e. userspace can saturate all
> 6 roots anyways.  And in practice, no sane VMM will create a VM with both 4-level
> and 5-level roots (KVM keys off of guest.MAXPHYADDR for the TDP root level).

Why do we create a new root for guest_mode=1 if L1 disables EPT/NPT?
On Thu, Apr 13, 2023, David Matlack wrote:
> On Thu, Apr 13, 2023 at 12:10 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 13, 2023, Sean Christopherson wrote:
> > > Aha!  Idea.  There are _at most_ 4 possible roots the TDP MMU can encounter.
> > > 4-level non-SMM, 4-level SMM, 5-level non-SMM, and 5-level SMM.  I.e. not keeping
> > > inactive roots on a per-VM basis is just monumentally stupid.
> >
> > One correction: there are 6 possible roots:
> >
> >   1. 4-level !SMM !guest_mode (i.e. not nested)
> >   2. 4-level SMM !guest_mode
> >   3. 5-level !SMM !guest_mode
> >   4. 5-level SMM !guest_mode
> >   5. 4-level !SMM guest_mode
> >   6. 5-level !SMM guest_mode
> >
> > I forgot that KVM still uses the TDP MMU when running L2 if L1 doesn't enable
> > EPT/TDP, i.e. if L1 is using shadow paging for L2.  But that really doesn't change
> > anything as each vCPU can already track 4 roots, i.e. userspace can saturate all
> > 6 roots anyways.  And in practice, no sane VMM will create a VM with both 4-level
> > and 5-level roots (KVM keys off of guest.MAXPHYADDR for the TDP root level).
>
> Why do we create a new root for guest_mode=1 if L1 disables EPT/NPT?

Because "private", a.k.a. KVM-internal, memslots are visible to L1 but not L2.
Which for TDP means the APIC-access page.  From commit 3a2936dedd20:

    kvm: mmu: Don't expose private memslots to L2

    These private pages have special purposes in the virtualization of L1,
    but not in the virtualization of L2.  In particular, L1's APIC access
    page should never be entered into L2's page tables, because this causes
    a great deal of confusion when the APIC virtualization hardware is
    being used to accelerate L2's accesses to its own APIC.

FWIW, I _think_ KVM could actually let L2 access the APIC-access page when L1
is running without any APIC virtualization, i.e. when L1 is passing its APIC
through to L2.  E.g.
something like the below, but I ain't touching that with a 10 foot pole unless
someone explicitly asks for it :-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 039fb16560a0..8aa12f5f2c30 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4370,10 +4370,13 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (!kvm_is_visible_memslot(slot)) {
 		/* Don't expose private memslots to L2. */
 		if (is_guest_mode(vcpu)) {
-			fault->slot = NULL;
-			fault->pfn = KVM_PFN_NOSLOT;
-			fault->map_writable = false;
-			return RET_PF_CONTINUE;
+			if (!slot || slot->id != APIC_ACCESS_PAGE_PRIVATE_MEMSLOT ||
+			    nested_cpu_has_virtual_apic(vcpu)) {
+				fault->slot = NULL;
+				fault->pfn = KVM_PFN_NOSLOT;
+				fault->map_writable = false;
+				return RET_PF_CONTINUE;
+			}
 		}
 
 		/*
 		 * If the APIC access page exists but is disabled, go directly
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4d2bc08794e4..a0868ae3688d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2031,7 +2031,8 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
 void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
 
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
-		       int tdp_max_root_level, int tdp_huge_page_level);
+		       int tdp_max_root_level, int tdp_huge_page_level,
+		       bool enable_tdp_mmu);
 
 static inline u16 kvm_read_ldt(void)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c91ee2927dd7..5c0e28a7a3bc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
 }
 
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
-		       int tdp_max_root_level, int tdp_huge_page_level)
+		       int tdp_max_root_level, int tdp_huge_page_level,
+		       bool enable_tdp_mmu)
 {
 	tdp_enabled = enable_tdp;
 	tdp_root_level = tdp_forced_root_level;
 	max_tdp_level = tdp_max_root_level;
 
 #ifdef CONFIG_X86_64
-	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
+	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
 #endif
 	/*
 	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d13cf53e7390..070c3f7f8c9f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4925,6 +4925,7 @@ static __init int svm_hardware_setup(void)
 	struct page *iopm_pages;
 	void *iopm_va;
 	int r;
+	bool enable_tdp_mmu;
 	unsigned int order = get_order(IOPM_SIZE);
 
 	/*
@@ -4991,9 +4992,12 @@ static __init int svm_hardware_setup(void)
 	if (!boot_cpu_has(X86_FEATURE_NPT))
 		npt_enabled = false;
 
+	enable_tdp_mmu = svm_hv_enable_tdp_mmu();
+
 	/* Force VM NPT level equal to the host's paging level */
 	kvm_configure_mmu(npt_enabled, get_npt_level(),
-			  get_npt_level(), PG_LEVEL_1G);
+			  get_npt_level(), PG_LEVEL_1G,
+			  enable_tdp_mmu);
 	pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
 
 	/* Setup shadow_me_value and shadow_me_mask */
diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
index 6981c1e9a809..aa49ac5d66bc 100644
--- a/arch/x86/kvm/svm/svm_onhyperv.h
+++ b/arch/x86/kvm/svm/svm_onhyperv.h
@@ -30,6 +30,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
 		hve->hv_enlightenments_control.msr_bitmap = 1;
 }
 
+static inline bool svm_hv_enable_tdp_mmu(void)
+{
+	return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
+}
+
 static inline void svm_hv_hardware_setup(void)
 {
 	if (npt_enabled &&
@@ -84,6 +89,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
 {
 }
 
+static inline bool svm_hv_enable_tdp_mmu(void)
+{
+	return true;
+}
+
 static inline void svm_hv_hardware_setup(void)
 {
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c788aa382611..4d3808755d39 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8442,7 +8442,8 @@ static __init int hardware_setup(void)
 	vmx_setup_me_spte_mask();
 
 	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
-			  ept_caps_to_lpage_level(vmx_capability.ept));
+			  ept_caps_to_lpage_level(vmx_capability.ept),
+			  true);
 
 	/*
 	 * Only enable PML when hardware supports PML feature, and both EPT