Commit Message
Isaku Yamahata
Oct. 30, 2022, 6:22 a.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

For private GPAs, the CPU refers to a private page table whose contents are
encrypted.  Dedicated APIs must be used to operate on it (e.g. to update or
read a PTE entry), and they are expensive.

When KVM resolves a KVM page fault, it walks the page tables.  To reuse the
existing KVM MMU code and mitigate the heavy cost of directly walking the
protected (encrypted) page table, allocate one more page that copies the
protected page table so the KVM MMU code can walk it directly.  Resolve the
KVM page fault with the existing code, and do the additional operations
necessary for the protected page table.  To distinguish such cases, the
existing KVM page table is called a shared page table (i.e. not associated
with a protected page table), and a page table associated with a protected
page table is called a private page table.  The relationship is depicted
below.

Add a private pointer to struct kvm_mmu_page for the protected page table
and add helper functions to allocate/initialize/free a protected page table
page.

              KVM page fault                   |
                     |                         |
                     V                         |
        -------------+----------               |
        |                      |               |
        V                      V               |
    shared GPA            private GPA          |
        |                      |               |
        V                      V               |
  shared PT root        private PT root        |  protected PT root
        |                      |               |          |
        V                      V               |          V
    shared PT             private PT ----propagate---->  protected PT
        |                      |               |          |
        |                      \---------------+------\   |
        |                                      |      |   |
        V                                      |      V   V
  shared guest page                            |  private guest page
                                               |
      non-encrypted memory                     |   encrypted memory
                                               |

PT: page table
- Shared PT is visible to KVM and is used by the CPU.
- Protected PT is used by the CPU but is invisible to KVM.
- Private PT is visible to KVM but not used by the CPU.  It is used to
  propagate PT changes to the actual protected PT, which is used by the CPU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  7 +++
 arch/x86/kvm/mmu/mmu.c          |  8 +++
 arch/x86/kvm/mmu/mmu_internal.h | 90 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
 4 files changed, 102 insertions(+), 4 deletions(-)
Comments
On Sat, 2022-10-29 at 23:22 -0700, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> For private GPA, CPU refers a private page table whose contents are
> encrypted.  The dedicated APIs to operate on it (e.g. updating/reading its
> PTE entry) are used and their cost is expensive.
> 
> When KVM resolves KVM page fault, it walks the page tables.  To reuse the
> existing KVM MMU code and mitigate the heavy cost to directly walk
> protected (encrypted) page table, allocate one more page to copy the
> protected page table for KVM MMU code to directly walk.  Resolve KVM page
> fault with the existing code, and do additional operations necessary for
> the protected page table.  To distinguish such cases, the existing KVM page
> table is called a shared page table (i.e. not associated with protected
> page table), and the page table with protected page table is called a
> private page table.  The relationship is depicted below.
> 
> Add a private pointer to struct kvm_mmu_page for protected page table and
> add helper functions to allocate/initialize/free a protected page table
> page.
> 
>               KVM page fault                   |
>                      |                         |
>                      V                         |
>         -------------+----------               |
>         |                      |               |
>         V                      V               |
>     shared GPA            private GPA          |
>         |                      |               |
>         V                      V               |
>   shared PT root        private PT root        |  protected PT root
>         |                      |               |          |
>         V                      V               |          V
>     shared PT             private PT ----propagate---->  protected PT
>         |                      |               |          |
>         |                      \---------------+------\   |
>         |                                      |      |   |
>         V                                      |      V   V
>   shared guest page                            |  private guest page
>                                                |
>       non-encrypted memory                     |   encrypted memory
>                                                |
> PT: page table
> - Shared PT is visible to KVM and it is used by CPU.
> - Protected PT is used by CPU but it is invisible to KVM.
> - Private PT is visible to KVM but not used by CPU.  It is used to
>   propagate PT change to the actual protected PT which is used by CPU.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  7 +++
>  arch/x86/kvm/mmu/mmu.c          |  8 +++
>  arch/x86/kvm/mmu/mmu_internal.h | 90 +++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
>  4 files changed, 102 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ee01add57a6b..381df2c8136d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -754,6 +754,13 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>  	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
> +	/*
> +	 * This cache is to allocate private page table. E.g. Secure-EPT used
> +	 * by the TDX module. Because the TDX module doesn't trust VMM and
> +	 * initializes the pages itself, KVM doesn't initialize them. Allocate
> +	 * pages with garbage and give them to the TDX module.
> +	 */

Perhaps put "Because ..." part to kvm_mmu_create() so people can see why
there's no initialization for mmu_private_spt_cache?

> +	struct kvm_mmu_memory_cache mmu_private_spt_cache;

Nit: In changelog you mentioned the page table used by CPU is actually
"protected PT".  IIUC the sp->private_spt you allocated from this cache is the
one that is actually used by CPU.  So looks there's some inconsistency here.

[...]

> +
> +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu,
> +					     struct kvm_mmu_memory_cache *private_spt_cache,
> +					     struct kvm_mmu_page *sp)

This function is very weird in the context of this patch.  _Only_ a new vcpu-
scope 'mmu_private_spte_cache' is added in this patch, but here you allow
caller to pass an additional argument of private_spt_cache.  So there must be
another cache which is not introduced in this patch?

> +{
> +	/*
> +	 * vcpu == NULL means non-root SPT:
> +	 * vcpu == NULL is used to split a large SPT into smaller SPT.  Root SPT
> +	 * is not a large SPT.

I am guessing this "vcpu == NULL" case is for "Eager Splitting"?

If so, why not adding a global MMU cache for private_spt allocation, and make
vcpu->arch.mmu_private_spt_cache point to the global cache?  In this case, in
the context where you only have 'kvm', you can use the global cache directly.
And in the context where you have a 'vcpu', you just use vcpu's cache.

> +	 */
> +	bool is_root = vcpu &&
> +		vcpu->arch.root_mmu.root_role.level == sp->role.level;
> +
> +	if (vcpu)
> +		private_spt_cache = &vcpu->arch.mmu_private_spt_cache;
> +	KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
> +	if (is_root)
> +		/*
> +		 * Because TDX module assigns root Secure-EPT page and set it to
> +		 * Secure-EPTP when TD vcpu is created, secure page table for
> +		 * root isn't needed.
> +		 */
> +		sp->private_spt = NULL;
> +	else {
> +		sp->private_spt = kvm_mmu_memory_cache_alloc(private_spt_cache);
> +		/*
> +		 * Because mmu_private_spt_cache is topped up before staring kvm
> +		 * page fault resolving, the allocation above shouldn't fail.
> +		 */
> +		WARN_ON_ONCE(!sp->private_spt);
> +	}
> +}
> +
On Wed, 2022-11-16 at 10:32 +0000, Huang, Kai wrote:
> > +
> > +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu,
> > +					     struct kvm_mmu_memory_cache *private_spt_cache,
> > +					     struct kvm_mmu_page *sp)
> 
> This function is very weird in the context of this patch.  _Only_ a new vcpu-
> scope 'mmu_private_spte_cache' is added in this patch, but here you allow
> caller to pass an additional argument of private_spt_cache.  So there must be
> another cache which is not introduced in this patch?
> 
> > +{
> > +	/*
> > +	 * vcpu == NULL means non-root SPT:
> > +	 * vcpu == NULL is used to split a large SPT into smaller SPT.  Root SPT
> > +	 * is not a large SPT.
> 
> I am guessing this "vcpu == NULL" case is for "Eager Splitting"?
> 
> If so, why not adding a global MMU cache for private_spt allocation, and make
> vcpu->arch.mmu_private_spt_cache point to the global cache?  In this case, in
> the context where you only have 'kvm', you can use the global cache directly.
> And in the context where you have a 'vcpu', you just use vcpu's cache.

So I went through all MMU related patches in this series, but I cannot find a
place where this function is called with 'vcpu == NULL' and a valid cache is
passed in, if I am reading correctly.

Also checked that "Eager Splitting" uses a kvm-scope cache for legacy MMU, but
just uses __get_free_page() for TDP MMU.  And in later patch "KVM: x86/tdp_mmu:
Support TDX private mapping for TDP MMU", __get_free_page() is also used to
allocate the private_spt (which is consistent with existing eager splitting
code).

So IIUC only legacy MMU code will call this function with 'vcpu == NULL' and a
valid cache.  In this case, please remove the 'private_spt_cache' argument for
now, and make the function always allocate from the
vcpu->arch.mmu_private_spt_cache.

You can add the additional argument when TDX gets legacy MMU support.

Also, I think you need to move eager splitting support part (whether that
handling is correct is another story) from the later patch to this patch.
Otherwise this patch is not complete.

> > +	 */
> > +	bool is_root = vcpu &&
> > +		vcpu->arch.root_mmu.root_role.level == sp->role.level;
> > +
> > +	if (vcpu)
> > +		private_spt_cache = &vcpu->arch.mmu_private_spt_cache;
> > +	KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
> > +	if (is_root)
> > +		/*
> > +		 * Because TDX module assigns root Secure-EPT page and set it to
> > +		 * Secure-EPTP when TD vcpu is created, secure page table for
> > +		 * root isn't needed.
> > +		 */
> > +		sp->private_spt = NULL;
> > +	else {
> > +		sp->private_spt = kvm_mmu_memory_cache_alloc(private_spt_cache);
> > +		/*
> > +		 * Because mmu_private_spt_cache is topped up before staring kvm
> > +		 * page fault resolving, the allocation above shouldn't fail.
> > +		 */
> > +		WARN_ON_ONCE(!sp->private_spt);
> > +	}
> > +}
> > +
On Wed, Nov 16, 2022 at 11:53:47AM +0000, "Huang, Kai" <kai.huang@intel.com> wrote:
> On Wed, 2022-11-16 at 10:32 +0000, Huang, Kai wrote:
> > > +
> > > +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu,
> > > +					     struct kvm_mmu_memory_cache *private_spt_cache,
> > > +					     struct kvm_mmu_page *sp)
> > 
> > This function is very weird in the context of this patch.  _Only_ a new vcpu-
> > scope 'mmu_private_spte_cache' is added in this patch, but here you allow
> > caller to pass an additional argument of private_spt_cache.  So there must be
> > another cache which is not introduced in this patch?
> > 
> > > +{
> > > +	/*
> > > +	 * vcpu == NULL means non-root SPT:
> > > +	 * vcpu == NULL is used to split a large SPT into smaller SPT.  Root SPT
> > > +	 * is not a large SPT.
> > 
> > I am guessing this "vcpu == NULL" case is for "Eager Splitting"?
> > 
> > If so, why not adding a global MMU cache for private_spt allocation, and make
> > vcpu->arch.mmu_private_spt_cache point to the global cache?  In this case, in
> > the context where you only have 'kvm', you can use the global cache directly.
> > And in the context where you have a 'vcpu', you just use vcpu's cache.
> 
> So I went through all MMU related patches in this series, but I cannot find a
> place where this function is called with 'vcpu == NULL' and a valid cache is
> passed in, if I am reading correctly.
> 
> Also checked that "Eager Splitting" uses a kvm-scope cache for legacy MMU, but
> just uses __get_free_page() for TDP MMU.  And in later patch "KVM: x86/tdp_mmu:
> Support TDX private mapping for TDP MMU", __get_free_page() is also used to
> allocate the private_spt (which is consistent with existing eager splitting
> code).
> 
> So IIUC only legacy MMU code will call this function with 'vcpu == NULL' and a
> valid cache.  In this case, please remove the 'private_spt_cache' argument for
> now, and make the function always allocate from the
> vcpu->arch.mmu_private_spt_cache.
> 
> You can add the additional argument when TDX gets legacy MMU support.
> 
> Also, I think you need to move eager splitting support part (whether that
> handling is correct is another story) from the later patch to this patch.
> Otherwise this patch is not complete.

Yes, you're right.  Somehow the legacy MMU part crept in.  I'll remove it from
this patch series.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ee01add57a6b..381df2c8136d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -754,6 +754,13 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	/*
+	 * This cache is to allocate private page table. E.g. Secure-EPT used
+	 * by the TDX module. Because the TDX module doesn't trust VMM and
+	 * initializes the pages itself, KVM doesn't initialize them. Allocate
+	 * pages with garbage and give them to the TDX module.
+	 */
+	struct kvm_mmu_memory_cache mmu_private_spt_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0001e921154e..faf69774c7ce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -653,6 +653,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
 	int start, end, i, r;
 
+	if (kvm_gfn_shared_mask(vcpu->kvm)) {
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
+					       PT64_ROOT_MAX_LEVEL);
+		if (r)
+			return r;
+	}
+
 	start = kvm_mmu_memory_cache_nr_free_objects(mc);
 	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
 
@@ -702,6 +709,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b0c4a78404b2..4c013124534b 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -87,7 +87,23 @@ struct kvm_mmu_page {
 		int root_count;
 		refcount_t tdp_mmu_root_count;
 	};
-	unsigned int unsync_children;
+	union {
+		struct {
+			unsigned int unsync_children;
+			/*
+			 * Number of writes since the last time traversal
+			 * visited this page.
+			 */
+			atomic_t write_flooding_count;
+		};
+#ifdef CONFIG_KVM_MMU_PRIVATE
+		/*
+		 * Associated private shadow page table, e.g. Secure-EPT page
+		 * passed to the TDX module.
+		 */
+		void *private_spt;
+#endif
+	};
 	union {
 		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 		tdp_ptep_t ptep;
@@ -109,9 +125,6 @@ struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
-	/* Number of writes since the last time traversal visited this page. */
-	atomic_t write_flooding_count;
-
 #ifdef CONFIG_X86_64
 	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
 	struct rcu_head rcu_head;
@@ -153,6 +166,75 @@ static inline bool is_private_sptep(u64 *sptep)
 	return is_private_sp(sptep_to_sp(sptep));
 }
 
+#ifdef CONFIG_KVM_MMU_PRIVATE
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+	return sp->private_spt;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+	sp->private_spt = private_spt;
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu,
+					     struct kvm_mmu_memory_cache *private_spt_cache,
+					     struct kvm_mmu_page *sp)
+{
+	/*
+	 * vcpu == NULL means non-root SPT:
+	 * vcpu == NULL is used to split a large SPT into smaller SPT.  Root SPT
+	 * is not a large SPT.
+	 */
+	bool is_root = vcpu &&
+		vcpu->arch.root_mmu.root_role.level == sp->role.level;
+
+	if (vcpu)
+		private_spt_cache = &vcpu->arch.mmu_private_spt_cache;
+	KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
+	if (is_root)
+		/*
+		 * Because TDX module assigns root Secure-EPT page and set it to
+		 * Secure-EPTP when TD vcpu is created, secure page table for
+		 * root isn't needed.
+		 */
+		sp->private_spt = NULL;
+	else {
+		sp->private_spt = kvm_mmu_memory_cache_alloc(private_spt_cache);
+		/*
+		 * Because mmu_private_spt_cache is topped up before staring kvm
+		 * page fault resolving, the allocation above shouldn't fail.
+		 */
+		WARN_ON_ONCE(!sp->private_spt);
+	}
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+	if (sp->private_spt)
+		free_page((unsigned long)sp->private_spt);
+}
+#else
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+	return NULL;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu,
+					     struct kvm_mmu_memory_cache *private_spt_cache,
+					     struct kvm_mmu_page *sp)
+{
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+}
+#endif
+
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1bf58288ea79..b2f56110d62d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -71,6 +71,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
+	kvm_mmu_free_private_spt(sp);
 	free_page((unsigned long)sp->spt);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }