[v3,3/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP

Message ID 20230201194604.11135-4-minipli@grsecurity.net
State: New
Series: KVM: MMU: performance tweaks for heavy CR0.WP users

Commit Message

Mathias Krause Feb. 1, 2023, 7:46 p.m. UTC
There is no need to unload the MMU roots for a direct MMU role when only
CR0.WP has changed -- the paging structures are still valid, only the
permission bitmap needs to be updated.

One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
implement kernel W^X.
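
As context for why these toggles are so frequent, here is a minimal,
illustrative sketch of the guest-side pattern -- hypothetical helper
names, not grsecurity's actual code, and note that mainline's
native_write_cr0() deliberately pins CR0.WP, so a kernel doing this
needs its own raw CR0 write:

	static __always_inline void kernel_wp_disable(void)
	{
		/* Clear CR0.WP so the kernel may write read-only mappings. */
		unsigned long cr0 = native_read_cr0() & ~X86_CR0_WP;

		asm volatile("mov %0,%%cr0" : : "r" (cr0) : "memory");
	}

	static __always_inline void kernel_wp_enable(void)
	{
		/* Restore CR0.WP once the write is done. */
		unsigned long cr0 = native_read_cr0() | X86_CR0_WP;

		asm volatile("mov %0,%%cr0" : : "r" (cr0) : "memory");
	}

In a VM each such CR0 write can trap to KVM, and before this patch
every trap unloaded the MMU roots.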

The optimization brings a huge performance gain for this case as the
following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
grsecurity L1 VM shows (runtime in seconds, lower is better):

                       legacy     TDP    shadow
kvm.git/queue          11.55s   13.91s    75.2s
kvm.git/queue+patch     7.32s    7.31s    74.6s

For the legacy MMU this is ~36% faster, for the TDP MMU even ~47%
faster. Also, TDP and legacy MMU now both have around the same runtime,
which eliminates the need to disable the TDP MMU for grsecurity.

Shadow MMU sees no measurable difference and is still slow, as expected.

[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
the direct MMU role -- Sean

I re-ran the benchmark and it's even faster than with my original
patch, as the critical path is now handled first and inline. Thanks a
lot for the suggestion, Sean!

 arch/x86/kvm/x86.c | 9 +++++++++
 1 file changed, 9 insertions(+)
  

Comments

Zhi Wang Feb. 7, 2023, 1:36 p.m. UTC | #1
On Wed,  1 Feb 2023 20:46:01 +0100
Mathias Krause <minipli@grsecurity.net> wrote:

> There is no need to unload the MMU roots for a direct MMU role when only
> CR0.WP has changed -- the paging structures are still valid, only the
> permission bitmap needs to be updated.
> 
> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
> implement kernel W^X.
> 

Wouldn't it be better to factor out update_permission_bitmask and
update_pkru_bitmask into a common function and call it from here? That
way we could also skip bunches of if..else.. branches and the
recalculation of the rsvd and shadow_zero_bits masks.
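
A minimal sketch of what such a helper might look like -- hypothetical
name, and it assumes update_permission_bitmask() and
update_pkru_bitmask() (currently static in arch/x86/kvm/mmu/mmu.c)
would be made reachable from x86.c:

	static void kvm_mmu_refresh_permission_bitmaps(struct kvm_vcpu *vcpu)
	{
		struct kvm_mmu *mmu = vcpu->arch.mmu;

		/* Recompute only the permission/PKRU bitmaps; keep the
		 * roots and all other role-derived state untouched. */
		update_permission_bitmask(mmu, false);
		update_pkru_bitmask(mmu);
	}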

I suppose this is a critical path according to the patch comments,
while kvm_init_mmu() is a non-critical path. Is it better to separate
them now to save maintenance effort in the future? E.g. something
heavier might be introduced into the kvm_init_mmu() path and slow down
this path.

> The optimization brings a huge performance gain for this case as the
> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
> grsecurity L1 VM shows (runtime in seconds, lower is better):
> 
>                        legacy     TDP    shadow
> kvm.git/queue          11.55s   13.91s    75.2s
> kvm.git/queue+patch     7.32s    7.31s    74.6s
> 
> For the legacy MMU this is ~36% faster, for the TDP MMU even ~47%
> faster. Also, TDP and legacy MMU now both have around the same runtime,
> which eliminates the need to disable the TDP MMU for grsecurity.
> 
> Shadow MMU sees no measurable difference and is still slow, as expected.
> 
> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
> 
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
> ---
> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
> the direct MMU role -- Sean
> 
> I re-ran the benchmark and it's even faster than with my original
> patch, as the critical path is now handled first and inline. Thanks a
> lot for the suggestion, Sean!
> 
>  arch/x86/kvm/x86.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 508074e47bc0..f09bfc0a3cc1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>  
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> +	/*
> +	 * Toggling just CR0.WP doesn't invalidate page tables per se, only the
> +	 * permission bits.
> +	 */
> +	if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
> +		kvm_init_mmu(vcpu);
> +		return;
> +	}
> +
>  	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>  		kvm_clear_async_pf_completion_queue(vcpu);
>  		kvm_async_pf_hash_reset(vcpu);
  
Mathias Krause Feb. 8, 2023, 9:52 a.m. UTC | #2
On 07.02.23 14:36, Zhi Wang wrote:
> On Wed,  1 Feb 2023 20:46:01 +0100
> Mathias Krause <minipli@grsecurity.net> wrote:
> 
>> There is no need to unload the MMU roots for a direct MMU role when only
>> CR0.WP has changed -- the paging structures are still valid, only the
>> permission bitmap needs to be updated.
>>
>> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
>> implement kernel W^X.
>>
> 
> Wouldn't it be better to factor out update_permission_bitmask and
> update_pkru_bitmask into a common function and call it from here? That
> way we could also skip bunches of if..else.. branches and the
> recalculation of the rsvd and shadow_zero_bits masks.

Probably, yes. But I dislike that this would require knowing the
implementation details of kvm_init_mmu() and, moreover,
init_kvm_tdp_mmu(), and I'd rather avoid that, so future code changes
in either of them don't introduce bugs or regressions. By calling out
to kvm_init_mmu() we avoid that implicitly: any future change must
check all callers and would find this location. If we instead simply
extracted the (as of now) required bits, such a change might go
unnoticed.

That said, I agree that there's still room for improvement.

> I suppose this is a critical path according to the patch comments,
> while kvm_init_mmu() is a non-critical path. Is it better to separate
> them now to save maintenance effort in the future? E.g. something
> heavier might be introduced into the kvm_init_mmu() path and slow down
> this path.

I'll look into what can be done about it. But this change is a first
step that can be further optimized via follow-up changes.

As you can see from the numbers below, it's already way faster than
what we have right now, so I'd rather land this (imperfect) change
sooner than later and gradually improve on it. Follow-up changes will,
however, likely only bring minor speedups compared to this one, so
they're less important, IMHO.

The question is really what's better from a maintenance point of view:
keeping the call to the commonly used kvm_init_mmu() function, or
special-casing even further? I fear the latter might regress more
easily, but YMMV, of course.

> 
>> The optimization brings a huge performance gain for this case as the
>> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
>> grsecurity L1 VM shows (runtime in seconds, lower is better):
>>
>>                        legacy     TDP    shadow
>> kvm.git/queue          11.55s   13.91s    75.2s
>> kvm.git/queue+patch     7.32s    7.31s    74.6s
>>
>> For the legacy MMU this is ~36% faster, for the TDP MMU even ~47%
>> faster. Also, TDP and legacy MMU now both have around the same runtime,
>> which eliminates the need to disable the TDP MMU for grsecurity.
>>
>> Shadow MMU sees no measurable difference and is still slow, as expected.
>>
>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>
>> Co-developed-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
>> ---
>> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
>> the direct MMU role -- Sean
>>
>> I re-ran the benchmark and it's even faster than with my original
>> patch, as the critical path is now handled first and inline. Thanks a
>> lot for the suggestion, Sean!
>>
>>  arch/x86/kvm/x86.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 508074e47bc0..f09bfc0a3cc1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>>  
>>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>>  {
>> +	/*
>> +	 * Toggling just CR0.WP doesn't invalidate page tables per se, only the
>> +	 * permission bits.
>> +	 */
>> +	if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
>> +		kvm_init_mmu(vcpu);
>> +		return;
>> +	}
>> +
>>  	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>>  		kvm_clear_async_pf_completion_queue(vcpu);
>>  		kvm_async_pf_hash_reset(vcpu);
>
  
Sean Christopherson March 15, 2023, 9:22 p.m. UTC | #3
On Wed, Feb 08, 2023, Mathias Krause wrote:
> On 07.02.23 14:36, Zhi Wang wrote:
> > On Wed,  1 Feb 2023 20:46:01 +0100
> > Mathias Krause <minipli@grsecurity.net> wrote:
> > I suppose this is a critical path according to the patch comments,
> > while kvm_init_mmu() is a non-critical path. Is it better to separate
> > them now to save maintenance effort in the future? E.g. something
> > heavier might be introduced into the kvm_init_mmu() path and slow down
> > this path.
> 
> I'll look into what can be done about it. But this change is a first
> step that can be further optimized via follow-up changes.
> 
> As you can see from the numbers below, it's already way faster than
> what we have right now, so I'd rather land this (imperfect) change
> sooner than later and gradually improve on it. Follow-up changes will,
> however, likely only bring minor speedups compared to this one, so
> they're less important, IMHO.
> 
> The question is really what's better from a maintenance point of view:
> keeping the call to the commonly used kvm_init_mmu() function, or
> special-casing even further? I fear the latter might regress more
> easily, but YMMV, of course.

Agreed.  Unless the performance benefits of getting super precise are significant,
I would much rather keep things simpler and reduce the risk of introducing bugs.
Bugs in this area in particular have a nasty habit of being really good at hiding.
  
Sean Christopherson March 15, 2023, 10:11 p.m. UTC | #4
Can you tweak the shortlog to something like this?  I want to make it very clear
that this applies only to the TDP case (more below).  I did a double take when I
first read the subject :-)

  KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled

On Wed, Feb 01, 2023, Mathias Krause wrote:
> There is no need to unload the MMU roots for a direct MMU role when only
> CR0.WP has changed -- the paging structures are still valid, only the
> permission bitmap needs to be updated.
> 
> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
> implement kernel W^X.
> 
> The optimization brings a huge performance gain for this case as the
> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
> grsecurity L1 VM shows (runtime in seconds, lower is better):
> 
>                        legacy     TDP    shadow
> kvm.git/queue          11.55s   13.91s    75.2s
> kvm.git/queue+patch     7.32s    7.31s    74.6s
> 
> For the legacy MMU this is ~36% faster, for the TDP MMU even ~47%
> faster. Also, TDP and legacy MMU now both have around the same runtime,
> which eliminates the need to disable the TDP MMU for grsecurity.
> 
> Shadow MMU sees no measurable difference and is still slow, as expected.
> 
> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
> 
> Co-developed-by: Sean Christopherson <seanjc@google.com>

No need for this, I just threw a snippet at you as part of code review.  And IMO,
if someone goes through the pain of running benchmarks, they get full credit no
matter what ;-)

> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
> ---
> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
> the direct MMU role -- Sean
> 
> I re-ran the benchmark and it's even faster than with my original
> patch, as the critical path is now handled first and inline. Thanks a
> lot for the suggestion, Sean!
> 
>  arch/x86/kvm/x86.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 508074e47bc0..f09bfc0a3cc1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>  
>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>  {
> +	/*
> +	 * Toggling just CR0.WP doesn't invalidate page tables per se, only the
> +	 * permission bits.
> +	 */
> +	if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {

Past me was wrong, which is a very good thing in this case.  Per the APM,

  The host hCR0.WP bit is ignored under nested paging.

which means that CR0.WP doesn't need to be incorporated into the role.  Ha!  And
really-past me even wrote a very nice comment to call that out in commit 31e96bc63655
("KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU state").

Double ha!  That's all moot, because if this code is reached for a nested MMU,
it means L2 is active and the CR0 being changed is gCR0, not L1's hCR0.

So more simply, this can be

	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP)

or if we want to exempt non-paging mode for the shadow MMU as well...

	if ((cr0 ^ old_cr0) == X86_CR0_WP && (tdp_enabled || !(cr0 & X86_CR0_PG)))

Actually, if we bother to check CR0.PG, then we might as well get greedy and
skip _all_ updates when paging is disabled.  E.g. end up with this over two
patches?  First one exempts the tdp_enabled case, second one completely exempts
paging disabled.

	/*
	 * CR0.WP is incorporated into the MMU role, but only for non-nested,
	 * indirect shadow MMUs.  If paging is disabled, no updates are needed
	 * as there are no permission bits to emulate.  If TDP is enabled, the
	 * MMU's metadata needs to be updated, e.g. so that emulating guest
	 * translations does the right thing, but there's no need to unload the
	 * root as CR0.WP doesn't affect SPTEs when TDP is enabled.
	 */
	if ((cr0 ^ old_cr0) == X86_CR0_WP) {
		if (!(cr0 & X86_CR0_PG))
			return;

		if (tdp_enabled) {
			kvm_init_mmu(vcpu);
			return;
		}
	}
  
Mathias Krause March 20, 2023, 9:13 p.m. UTC | #5
On 15.03.23 23:11, Sean Christopherson wrote:
> Can you tweak the shortlog to something like this?  I want to make it very clear
> that this applies only to the TDP case (more below).  I did a double take when I
> first read the subject :-)
> 
>   KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled

Sure, will do!

> 
> On Wed, Feb 01, 2023, Mathias Krause wrote:
>> There is no need to unload the MMU roots for a direct MMU role when only
>> CR0.WP has changed -- the paging structures are still valid, only the
>> permission bitmap needs to be updated.
>>
>> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
>> implement kernel W^X.
>>
>> The optimization brings a huge performance gain for this case as the
>> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
>> grsecurity L1 VM shows (runtime in seconds, lower is better):
>>
>>                        legacy     TDP    shadow
>> kvm.git/queue          11.55s   13.91s    75.2s
>> kvm.git/queue+patch     7.32s    7.31s    74.6s
>>
>> For the legacy MMU this is ~36% faster, for the TDP MMU even ~47%
>> faster. Also, TDP and legacy MMU now both have around the same runtime,
>> which eliminates the need to disable the TDP MMU for grsecurity.
>>
>> Shadow MMU sees no measurable difference and is still slow, as expected.
>>
>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>
>> Co-developed-by: Sean Christopherson <seanjc@google.com>
> 
> No need for this, I just threw a snippet at you as part of code review.  And IMO,
> if someone goes through the pain of running benchmarks, they get full credit no
> matter what ;-)

Reviewers (and in your case maintainers) get far too little credit, so
I'd rather keep that tag, if you don't mind too much ;)

> 
>> Signed-off-by: Mathias Krause <minipli@grsecurity.net>
>> ---
>> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
>> the direct MMU role -- Sean
>>
>> I re-ran the benchmark and it's even faster than with my original
>> patch, as the critical path is now handled first and inline. Thanks a
>> lot for the suggestion, Sean!
>>
>>  arch/x86/kvm/x86.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 508074e47bc0..f09bfc0a3cc1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>>  
>>  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>>  {
>> +	/*
>> +	 * Toggling just CR0.WP doesn't invalidate page tables per se, only the
>> +	 * permission bits.
>> +	 */
>> +	if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
> 
> Past me was wrong, which is a very good thing in this case.  Per the APM,
> 
>   The host hCR0.WP bit is ignored under nested paging.
> 

See what you did? You even went so far as to re-read the manual.
That definitely deserves credit!

> which means that CR0.WP doesn't need to be incorporated into the role.  Ha!  And
> really-past me even wrote a very nice comment to call that out in commit 31e96bc63655
> ("KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU state").
> 
> Double ha!  That's all moot, because if this code is reached for a nested MMU,
> it means L2 is active and the CR0 being changed is gCR0, not L1's hCR0.
> 
> So more simply, this can be
> 
> 	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP)

Looks much simpler, indeed. But it might deserve a little comment
itself why it's fine to test 'tdp_enabled' only...

> 
> or if we want to exempt non-paging mode for the shadow MMU as well...
> 
> 	if ((cr0 ^ old_cr0) == X86_CR0_WP && (tdp_enabled || !(cr0 & X86_CR0_PG)))
> 
> Actually, if we bother to check CR0.PG, then we might as well get greedy and
> skip _all_ updates when paging is disabled.  E.g. end up with this over two
> patches?  First one exempts the tdp_enabled case, second one completely exempts
> paging disabled.
> 

...and there it is already! :D

> 	/*
> 	 * CR0.WP is incorporated into the MMU role, but only for non-nested,
> 	 * indirect shadow MMUs.  If paging is disabled, no updates are needed
> 	 * as there are no permission bits to emulate.  If TDP is enabled, the
> 	 * MMU's metadata needs to be updated, e.g. so that emulating guest
> 	 * translations does the right thing, but there's no need to unload the
> 	 * root as CR0.WP doesn't affect SPTEs when TDP is enabled.
> 	 */
> 	if ((cr0 ^ old_cr0) == X86_CR0_WP) {
> 		if (!(cr0 & X86_CR0_PG))
> 			return;
> 
> 		if (tdp_enabled) {
> 			kvm_init_mmu(vcpu);
> 			return;
> 		}
> 	}

Thanks, Sean! This all sounds very good to me. I'll cook up these
commits, do some more tests and send a v4 tomorrow, if time allows.
  

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 508074e47bc0..f09bfc0a3cc1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
 
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
+	/*
+	 * Toggling just CR0.WP doesn't invalidate page tables per se, only the
+	 * permission bits.
+	 */
+	if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
+		kvm_init_mmu(vcpu);
+		return;
+	}
+
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
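
A side note on the '(cr0 ^ old_cr0) == X86_CR0_WP' test above: XOR
isolates the changed bits, and comparing for equality against the
single WP bit (rather than masking) takes the fast path only when WP --
and nothing else -- changed. A standalone userspace illustration, not
kernel code:

	#include <stdio.h>

	#define X86_CR0_WP (1UL << 16)
	#define X86_CR0_PG (1UL << 31)

	int main(void)
	{
		unsigned long old_cr0 = 0x80010033UL; /* PG|WP|NE|ET|MP|PE */
		unsigned long wp_only = old_cr0 ^ X86_CR0_WP;   /* only WP flipped */
		unsigned long wp_and_pg = wp_only ^ X86_CR0_PG; /* WP and PG flipped */

		printf("%d\n", (wp_only ^ old_cr0) == X86_CR0_WP);   /* 1: fast path */
		printf("%d\n", (wp_and_pg ^ old_cr0) == X86_CR0_WP); /* 0: slow path */
		return 0;
	}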