[3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP

Message ID 20230117204556.16217-4-minipli@grsecurity.net
State New
Series KVM: MMU: performance tweaks for heavy CR0.WP users

Commit Message

Mathias Krause Jan. 17, 2023, 8:45 p.m. UTC
  There is no need to unload the MMU roots when only CR0.WP has changed --
the paging structures are still valid, only the permission bitmap needs
to be updated.
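
As a simplified illustration (not KVM's actual code), CR0.WP only feeds the
write-permission decision the MMU's permission bitmap encodes, i.e. whether a
supervisor-mode write may bypass a read-only PTE:

static bool write_permitted(bool cr0_wp, bool user_fault, bool pte_writable)
{
	if (pte_writable)
		return true;

	/* Supervisor writes bypass read-only PTEs while CR0.WP is clear. */
	return !user_fault && !cr0_wp;
}

(U/S, SMAP, PKU etc. are left out; they don't change the CR0.WP point.)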

Change kvm_mmu_reset_context() to get passed whether the MMU roots need to
be unloaded and explicitly avoid the unload if only CR0.WP was toggled on a
VMEXIT caused by a CR0 write.

This change brings a huge performance gain as the following micro-
benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
shows (runtime in seconds, lower is better):

                      legacy MMU   TDP MMU
kvm.git/queue             11.55s    13.91s
kvm.git/queue+patch        7.44s     7.94s

For legacy MMU this is ~35% faster, for TDP MMU ~43% faster.

[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
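
For context, a hedged sketch (not part of this patch; the helper names are
made up, preemption/IRQ handling omitted) of the guest-side pattern a
grsecurity kernel exercises heavily -- briefly clearing CR0.WP around a write
to a read-only mapping. While CR0.WP is intercepted, both CR0 writes below
exit to KVM, and each exit currently unloads the MMU roots:

static inline unsigned long guest_read_cr0(void)
{
	unsigned long cr0;

	asm volatile("mov %%cr0, %0" : "=r" (cr0));
	return cr0;
}

static inline void guest_write_cr0(unsigned long cr0)
{
	asm volatile("mov %0, %%cr0" : : "r" (cr0) : "memory");
}

static void write_ro_data(unsigned long *ro_addr, unsigned long val)
{
	unsigned long cr0 = guest_read_cr0();

	guest_write_cr0(cr0 & ~X86_CR0_WP);	/* VMEXIT #1 */
	*ro_addr = val;				/* write bypasses the R/O PTE */
	guest_write_cr0(cr0);			/* VMEXIT #2, restores CR0.WP */
}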

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  7 ++++---
 arch/x86/kvm/smm.c              |  4 ++--
 arch/x86/kvm/vmx/nested.c       |  2 +-
 arch/x86/kvm/x86.c              | 28 +++++++++++++++++++---------
 5 files changed, 27 insertions(+), 16 deletions(-)
  

Comments

Sean Christopherson Jan. 17, 2023, 9:29 p.m. UTC | #1
On Tue, Jan 17, 2023, Mathias Krause wrote:
> There is no need to unload the MMU roots when only CR0.WP has changed --
> the paging structures are still valid, only the permission bitmap needs
> to be updated.

This doesn't hold true when KVM is using shadow paging, in which case CR0.WP
affects the shadow page tables.  I believe that also holds true for nNPT :-(

nEPT doesn't consume CR0.WP so we could expedite that case as well, though
identifying that case might be annoying.

> Change kvm_mmu_reset_context() to get passed whether the MMU roots need to
> be unloaded and explicitly avoid the unload if only CR0.WP was toggled on a
> VMEXIT caused by a CR0 write.

One thing we should explore on top of this is not intercepting CR0.WP (on Intel)
when TDP is enabled.  It could even trigger after toggling CR0.WP N times, e.g.
to optimize the grsecurity use case without negatively impacting workloads with
a static CR0.WP, as walking guest memory would require an "extra" VMREAD to get
CR0.WP in that case.

Unfortunately, AMD doesn't provide per-bit controls.
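
E.g., a rough sketch of what that could look like, reusing names from vmx.c
(the helper and the "after N toggles" trigger policy are hypothetical, not
part of this series): drop CR0.WP from the CR0 guest/host mask so toggling it
no longer exits, and from then on read CR0.WP via kvm_read_cr0_bits() instead
of the cached value:

static void vmx_let_guest_own_cr0_wp(struct kvm_vcpu *vcpu)
{
	/* Shadow paging consumes CR0.WP, so keep intercepting it there. */
	if (!vcpu->arch.mmu->root_role.direct)
		return;

	vcpu->arch.cr0_guest_owned_bits |= X86_CR0_WP;
	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
}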

> This change brings a huge performance gain as the following micro-
> benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
> shows (runtime in seconds, lower is better):
> 
>                       legacy MMU   TDP MMU
> kvm.git/queue             11.55s    13.91s
> kvm.git/queue+patch        7.44s     7.94s
> 
> For legacy MMU this is ~35% faster, for TDP MMU ~43% faster.
> 
> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
> 
> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
> ---

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 508074e47bc0..d7c326ab94de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -902,7 +902,9 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>  
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> -	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
> +	unsigned long cr0_change = cr0 ^ old_cr0;
> +
> +	if (cr0_change & X86_CR0_PG) {
>  		kvm_clear_async_pf_completion_queue(vcpu);
>  		kvm_async_pf_hash_reset(vcpu);
>  
> @@ -914,10 +916,18 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>  	}
>  
> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> -		kvm_mmu_reset_context(vcpu);
> +	if (cr0_change & KVM_MMU_CR0_ROLE_BITS) {
> +		bool unload_mmu =
> +			cr0_change & (KVM_MMU_CR0_ROLE_BITS & ~X86_CR0_WP);

As above, this needs to be guarded with a check that the MMU is direct.  And rather
than add a flag to kvm_mmu_reset_context(), just call kvm_init_mmu() directly.
E.g. I think this would work?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d07563d0e204..8f9fac6d81d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -927,6 +927,11 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
 
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
+       if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
+               kvm_init_mmu(vcpu);
+               return;
+       }
+
        if ((cr0 ^ old_cr0) & X86_CR0_PG) {
                kvm_clear_async_pf_completion_queue(vcpu);
                kvm_async_pf_hash_reset(vcpu);
  
Mathias Krause Jan. 18, 2023, 10:17 a.m. UTC | #2
On 17.01.23 22:29, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Mathias Krause wrote:
>> There is no need to unload the MMU roots when only CR0.WP has changed --
>> the paging structures are still valid, only the permission bitmap needs
>> to be updated.
> 
> This doesn't hold true when KVM is using shadow paging, in which case CR0.WP
> affects the shadow page tables.  I believe that also holds true for nNPT :-(

Oh, I knew there would be a case I missed. Thank you for pointing it out!

> nEPT doesn't consume CR0.WP so we could expedite that case as well, though
> identifying that case might be annoying.

I'm fine with starting by optimizing L1 only, as the performance gain for
this common case is huge already. But sure, if more is possible, I'm all
for it. It's just that I lack the knowledge about KVM internals to figure
it out all by myself.

>> Change kvm_mmu_reset_context() to get passed whether the MMU roots need to
>> be unloaded and explicitly avoid the unload if only CR0.WP was toggled on a
>> VMEXIT caused by a CR0 write.
> 
> One thing we should explore on top of this is not intercepting CR0.WP (on Intel)
> when TDP is enabled.  It could even trigger after toggling CR0.WP N times, e.g.
> to optimize the grsecurity use case without negatively impacting workloads with
> a static CR0.WP, as walking guest memory would require an "extra" VMREAD to get
> CR0.WP in that case.

That would be even better, agreed. I'll look into it and will try to
come up with something.

> Unfortunately, AMD doesn't provide per-bit controls.
> 
>> This change brings a huge performance gain as the following micro-
>> benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
>> shows (runtime in seconds, lower is better):
>>
>>                       legacy MMU   TDP MMU
>> kvm.git/queue             11.55s    13.91s
>> kvm.git/queue+patch        7.44s     7.94s
>>
>> For legacy MMU this is ~35% faster, for TDP MMU ~43% faster.
>>
>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>
>> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
>> ---
> 
> ...
> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 508074e47bc0..d7c326ab94de 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -902,7 +902,9 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>>  
>>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>>  {
>> -	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>> +	unsigned long cr0_change = cr0 ^ old_cr0;
>> +
>> +	if (cr0_change & X86_CR0_PG) {
>>  		kvm_clear_async_pf_completion_queue(vcpu);
>>  		kvm_async_pf_hash_reset(vcpu);
>>  
>> @@ -914,10 +916,18 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>>  			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>>  	}
>>  
>> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>> -		kvm_mmu_reset_context(vcpu);
>> +	if (cr0_change & KVM_MMU_CR0_ROLE_BITS) {
>> +		bool unload_mmu =
>> +			cr0_change & (KVM_MMU_CR0_ROLE_BITS & ~X86_CR0_WP);
> 
> As above, this needs to be guarded with a check that the MMU is direct.  And rather
> than add a flag to kvm_mmu_reset_context(), just call kvm_init_mmu() directly.
> E.g. I think this would work?
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d07563d0e204..8f9fac6d81d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -927,6 +927,11 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>  
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> +       if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
> +               kvm_init_mmu(vcpu);
> +               return;
> +       }
> +
>         if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>                 kvm_clear_async_pf_completion_queue(vcpu);
>                 kvm_async_pf_hash_reset(vcpu);

Looks much simpler and more direct. Nice. :)

I'll re-test and send a v2 later today.

Thanks,
Mathias
  
Mathias Krause Jan. 27, 2023, 4:15 p.m. UTC | #3
On 18.01.23 11:17, Mathias Krause wrote:
> On 17.01.23 22:29, Sean Christopherson wrote:
>> On Tue, Jan 17, 2023, Mathias Krause wrote:
>>> [...] 
>>> Change kvm_mmu_reset_context() to get passed whether the MMU roots need to
>>> be unloaded and explicitly avoid the unload if only CR0.WP was toggled on a
>>> VMEXIT caused by a CR0 write.
>>
>> One thing we should explore on top of this is not intercepting CR0.WP (on Intel)
>> when TDP is enabled.  It could even trigger after toggling CR0.WP N times, e.g.
>> to optimize the grsecurity use case without negatively impacting workloads with
>> a static CR0.WP, as walking guest memory would require an "extra" VMREAD to get
>> CR0.WP in that case.
> 
> That would be even better, agreed. I'll look into it and will try to
> come up with something.

I looked into it and we can gain quite a few more cycles from this, e.g.
the runtime of the 'ssdd 10 50000' test with the TDP MMU drops further,
from 7.31s down to 4.89s. That's overall 2.8 times faster than the 13.91s
we started with. :)

I'll cook up a patch next week and send a v3 series with some more
cleanups I collected in the meantime.

>> Unfortunately, AMD doesn't provide per-bit controls.

Meanwhile I got my hands on an AMD system and it gains from this series
as well, not as much as my Intel system, though. We go down from 5.8s to
4.12s for the 'ssdd 10 50000' test with TDP MMU enabled -- a nearly 30%
runtime reduction.

>>> This change brings a huge performance gain as the following micro-
>>> benchmark running 'ssdd 10 50000' from rt-tests[1] on a grsecurity L1 VM
>>> shows (runtime in seconds, lower is better):
>>>
>>>                       legacy MMU   TDP MMU
>>> kvm.git/queue             11.55s    13.91s
>>> kvm.git/queue+patch        7.44s     7.94s
>>>
>>> For legacy MMU this is ~35% faster, for TDP MMU ~43% faster.
>>>
>>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>>
>>> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
>>> ---
>>
  

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4d2bc08794e4..e7851315ffa6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1812,7 +1812,7 @@  int kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
-void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
+void kvm_mmu_reset_context(struct kvm_vcpu *vcpu, bool unload_mmu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 505768631614..4022394d3a25 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5384,7 +5384,7 @@  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.root_mmu.cpu_role.ext.valid = 0;
 	vcpu->arch.guest_mmu.cpu_role.ext.valid = 0;
 	vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 
 	/*
 	 * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
@@ -5393,9 +5393,10 @@  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm);
 }
 
-void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
+void kvm_mmu_reset_context(struct kvm_vcpu *vcpu, bool unload_mmu)
 {
-	kvm_mmu_unload(vcpu);
+	if (unload_mmu)
+		kvm_mmu_unload(vcpu);
 	kvm_init_mmu(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index cc43638d48a3..09f47048eb1b 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -131,7 +131,7 @@  void kvm_smm_changed(struct kvm_vcpu *vcpu, bool entering_smm)
 		vcpu->arch.pdptrs_from_userspace = false;
 	}
 
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 }
 
 void process_smi(struct kvm_vcpu *vcpu)
@@ -369,7 +369,7 @@  void enter_smm(struct kvm_vcpu *vcpu)
 #endif
 
 	kvm_update_cpuid_runtime(vcpu);
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 	return;
 error:
 	kvm_vm_dead(vcpu->kvm);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 557b9c468734..14815fd6dcb1 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4648,7 +4648,7 @@  static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	if (enable_ept && is_pae_paging(vcpu))
 		ept_save_pdptrs(vcpu);
 
-	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_reset_context(vcpu, true);
 
 	/*
 	 * This nasty bit of open coding is a compromise between blindly
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 508074e47bc0..d7c326ab94de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -902,7 +902,9 @@  EXPORT_SYMBOL_GPL(load_pdptrs);
 
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
-	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
+	unsigned long cr0_change = cr0 ^ old_cr0;
+
+	if (cr0_change & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
 
@@ -914,10 +916,18 @@  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}
 
-	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+	if (cr0_change & KVM_MMU_CR0_ROLE_BITS) {
+		bool unload_mmu =
+			cr0_change & (KVM_MMU_CR0_ROLE_BITS & ~X86_CR0_WP);
 
-	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
+		/*
+		 * Toggling just CR0.WP doesn't invalidate page tables per se,
+		 * only the permission bits.
+		 */
+		kvm_mmu_reset_context(vcpu, unload_mmu);
+	}
+
+	if ((cr0_change & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
@@ -1117,7 +1127,7 @@  static bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
 	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 
 	/*
 	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
@@ -1740,7 +1750,7 @@  static int set_efer(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	}
 
 	if ((efer ^ old_efer) & KVM_MMU_EFER_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 
 	return 0;
 }
@@ -11410,7 +11420,7 @@  static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		return ret;
 
 	if (mmu_reset_needed)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 
 	max_bits = KVM_NR_INTERRUPTS;
 	pending_vec = find_first_bit(
@@ -11452,7 +11462,7 @@  static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
 		vcpu->arch.pdptrs_from_userspace = true;
 	}
 	if (mmu_reset_needed)
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 	return 0;
 }
 
@@ -11970,7 +11980,7 @@  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 */
 	if (old_cr0 & X86_CR0_PG) {
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
-		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_reset_context(vcpu, true);
 	}
 
 	/*