KVM: x86/mmu: Remove KVM MMU write lock when accessing indirect_shadow_pages

Message ID 20230605004334.1930091-1-mizhang@google.com
State New
Series KVM: x86/mmu: Remove KVM MMU write lock when accessing indirect_shadow_pages

Commit Message

Mingwei Zhang June 5, 2023, 12:43 a.m. UTC
  Remove the KVM MMU write lock when accessing the indirect_shadow_pages counter
when the page role is direct, because this counter is only used as a
coarse-grained heuristic to check whether a nested guest is active. Racing
with this heuristic without holding the mmu lock is harmless because the
corresponding indirect shadow SPTEs for the GPA will either be zapped by this
thread or by some other thread that has previously zapped all indirect shadow
pages and brought the count to 0.

Because of that, remove the KVM MMU write lock pair to potentially reduce
lock contention and improve the performance of nested VMs. In addition,
opportunistically change the 'direct mmu' comment to make the description
consistent with other places.

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)


base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7
  

Comments

Jim Mattson June 5, 2023, 4:55 p.m. UTC | #1
On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> page role is direct because this counter value is used as a coarse-grained
> heuristics to check if there is nested guest active. Racing with this
> heuristics without mmu lock will be harmless because the corresponding
> indirect shadow sptes for the GPA will either be zapped by this thread or
> some other thread who has previously zapped all indirect shadow pages and
> makes the value to 0.
>
> Because of that, remove the KVM MMU write lock pair to potentially reduce
> the lock contension and improve the performance of nested VM. In addition
> opportunistically change the comment of 'direct mmu' to make the
> description consistent with other places.
>
> Reported-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5ad55ef71433..97cfa5a00ff2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
>         kvm_release_pfn_clean(pfn);
>
> -       /* The instructions are well-emulated on direct mmu. */
> +       /* The instructions are well-emulated on Direct MMUs. */
>         if (vcpu->arch.mmu->root_role.direct) {
> -               unsigned int indirect_shadow_pages;
> -
> -               write_lock(&vcpu->kvm->mmu_lock);
> -               indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> -               write_unlock(&vcpu->kvm->mmu_lock);
> -
> -               if (indirect_shadow_pages)
> +               if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))

I don't understand the need for READ_ONCE() here. That implies that
there is something tricky going on, and I don't think that's the case.

>                         kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>
>                 return true;
>
> base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7
> --
> 2.41.0.rc0.172.g3f132b7071-goog
>
  
Ben Gardon June 5, 2023, 5:17 p.m. UTC | #2
On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> > page role is direct because this counter value is used as a coarse-grained
> > heuristics to check if there is nested guest active. Racing with this
> > heuristics without mmu lock will be harmless because the corresponding
> > indirect shadow sptes for the GPA will either be zapped by this thread or
> > some other thread who has previously zapped all indirect shadow pages and
> > makes the value to 0.
> >
> > Because of that, remove the KVM MMU write lock pair to potentially reduce
> > the lock contension and improve the performance of nested VM. In addition
> > opportunistically change the comment of 'direct mmu' to make the
> > description consistent with other places.
> >
> > Reported-by: Jim Mattson <jmattson@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5ad55ef71433..97cfa5a00ff2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> >         kvm_release_pfn_clean(pfn);
> >
> > -       /* The instructions are well-emulated on direct mmu. */
> > +       /* The instructions are well-emulated on Direct MMUs. */

Nit: Internally within Google, on older kernels, we have the "Direct
MMU" which was the precursor to the TDP MMU we all know and love. This
comment however does not refer to the Direct MMU. Direct here just
refers to the direct role bit being set. Since it's just descriptive,
direct should not be capitalized in this comment, so no reason to
change this line.

> >         if (vcpu->arch.mmu->root_role.direct) {
> > -               unsigned int indirect_shadow_pages;
> > -
> > -               write_lock(&vcpu->kvm->mmu_lock);
> > -               indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> > -               write_unlock(&vcpu->kvm->mmu_lock);
> > -
> > -               if (indirect_shadow_pages)
> > +               if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
>
> I don't understand the need for READ_ONCE() here. That implies that
> there is something tricky going on, and I don't think that's the case.

Agree this doesn't need a READ_ONCE. Just a read is fine.
kvm_mmu_unprotect_page starts by acquiring the MMU lock, so there's
not much room to reorder anything anyway.
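
(For reference, kvm_mmu_unprotect_page() at the time looked roughly like the
sketch below; this is abridged and paraphrased rather than copied verbatim,
so details may differ by kernel version. The relevant point is that it takes
mmu_lock for write before touching anything.)

int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);
	int r = 0;

	write_lock(&kvm->mmu_lock);
	/* Zap every valid indirect shadow page that maps this gfn. */
	for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
		r = 1;
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	write_unlock(&kvm->mmu_lock);

	return r;
}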

Thanks for sending a patch to fix this. The critical section of the
MMU lock here is small, but any lock acquisition in write mode can
mess up performance of otherwise happy read-mode uses.

>
> >                         kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> >
> >                 return true;
> >
> > base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7
> > --
> > 2.41.0.rc0.172.g3f132b7071-goog
> >
  
Mingwei Zhang June 5, 2023, 5:42 p.m. UTC | #3
On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> > page role is direct because this counter value is used as a coarse-grained
> > heuristics to check if there is nested guest active. Racing with this
> > heuristics without mmu lock will be harmless because the corresponding
> > indirect shadow sptes for the GPA will either be zapped by this thread or
> > some other thread who has previously zapped all indirect shadow pages and
> > makes the value to 0.
> >
> > Because of that, remove the KVM MMU write lock pair to potentially reduce
> > the lock contension and improve the performance of nested VM. In addition
> > opportunistically change the comment of 'direct mmu' to make the
> > description consistent with other places.
> >
> > Reported-by: Jim Mattson <jmattson@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5ad55ef71433..97cfa5a00ff2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> >         kvm_release_pfn_clean(pfn);
> >
> > -       /* The instructions are well-emulated on direct mmu. */
> > +       /* The instructions are well-emulated on Direct MMUs. */
> >         if (vcpu->arch.mmu->root_role.direct) {
> > -               unsigned int indirect_shadow_pages;
> > -
> > -               write_lock(&vcpu->kvm->mmu_lock);
> > -               indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> > -               write_unlock(&vcpu->kvm->mmu_lock);
> > -
> > -               if (indirect_shadow_pages)
> > +               if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
>
> I don't understand the need for READ_ONCE() here. That implies that
> there is something tricky going on, and I don't think that's the case.

READ_ONCE() is just telling the compiler not to remove the read. Since
this is reading a global variable,  the compiler might just read a
previous copy if the value has already been read into a local
variable. But that is not the case here...

Note I see there is another READ_ONCE for
kvm->arch.indirect_shadow_pages, so I am reusing the same thing.

I did check the reordering issue but it should be fine because when
'we' see indirect_shadow_pages as 0, the shadow pages must have
already been zapped. Not only because of the locking, but also the
program order in __kvm_mmu_prepare_zap_page() shows that it will zap
shadow pages first before updating the stats.
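
(To make that program-order argument concrete, a heavily abridged sketch of
__kvm_mmu_prepare_zap_page(), not a verbatim copy, looks like this:)

static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
				       struct list_head *invalid_list,
				       int *nr_zapped)
{
	/* ... */
	/* Drop the SPTEs pointing to/from this shadow page first ... */
	*nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
	*nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list);
	kvm_mmu_unlink_parents(sp);
	/* ... */
	/* ... and only then decrement indirect_shadow_pages. */
	if (!sp->role.invalid && sp_has_gptes(sp))
		unaccount_shadowed(kvm, sp);
	/* ... */
}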
  
Mingwei Zhang June 5, 2023, 5:53 p.m. UTC | #4
On Mon, Jun 5, 2023 at 10:17 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote:
> >
> > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
> > >
> > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> > > page role is direct because this counter value is used as a coarse-grained
> > > heuristics to check if there is nested guest active. Racing with this
> > > heuristics without mmu lock will be harmless because the corresponding
> > > indirect shadow sptes for the GPA will either be zapped by this thread or
> > > some other thread who has previously zapped all indirect shadow pages and
> > > makes the value to 0.
> > >
> > > Because of that, remove the KVM MMU write lock pair to potentially reduce
> > > the lock contension and improve the performance of nested VM. In addition
> > > opportunistically change the comment of 'direct mmu' to make the
> > > description consistent with other places.
> > >
> > > Reported-by: Jim Mattson <jmattson@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 10 ++--------
> > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5ad55ef71433..97cfa5a00ff2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >
> > >         kvm_release_pfn_clean(pfn);
> > >
> > > -       /* The instructions are well-emulated on direct mmu. */
> > > +       /* The instructions are well-emulated on Direct MMUs. */
>
> Nit: Internally within Google, on older kernels, we have the "Direct
> MMU" which was the precursor to the TDP MMU we all know and love. This
> comment however does not refer to the Direct MMU. Direct here just
> refers to the direct role bit being set. Since it's just descriptive,
> direct should not be capitalized in this comment, so no reason to
> change this line.

You are right, it is incorrect to uppercase 'direct', since that generates
confusion with our internal MMU implementation. So, I will just uppercase
the 'mmu' here in the next version.
  
Jim Mattson June 5, 2023, 6:11 p.m. UTC | #5
On Mon, Jun 5, 2023 at 10:42 AM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote:
> >
> > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
> > >
> > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> > > page role is direct because this counter value is used as a coarse-grained
> > > heuristics to check if there is nested guest active. Racing with this
> > > heuristics without mmu lock will be harmless because the corresponding
> > > indirect shadow sptes for the GPA will either be zapped by this thread or
> > > some other thread who has previously zapped all indirect shadow pages and
> > > makes the value to 0.
> > >
> > > Because of that, remove the KVM MMU write lock pair to potentially reduce
> > > the lock contension and improve the performance of nested VM. In addition
> > > opportunistically change the comment of 'direct mmu' to make the
> > > description consistent with other places.
> > >
> > > Reported-by: Jim Mattson <jmattson@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 10 ++--------
> > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5ad55ef71433..97cfa5a00ff2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >
> > >         kvm_release_pfn_clean(pfn);
> > >
> > > -       /* The instructions are well-emulated on direct mmu. */
> > > +       /* The instructions are well-emulated on Direct MMUs. */
> > >         if (vcpu->arch.mmu->root_role.direct) {
> > > -               unsigned int indirect_shadow_pages;
> > > -
> > > -               write_lock(&vcpu->kvm->mmu_lock);
> > > -               indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> > > -               write_unlock(&vcpu->kvm->mmu_lock);
> > > -
> > > -               if (indirect_shadow_pages)
> > > +               if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> >
> > I don't understand the need for READ_ONCE() here. That implies that
> > there is something tricky going on, and I don't think that's the case.
>
> READ_ONCE() is just telling the compiler not to remove the read. Since
> this is reading a global variable,  the compiler might just read a
> previous copy if the value has already been read into a local
> variable. But that is not the case here...

Not a global variable, actually, but that's not relevant. What would
be wrong with using a previously read copy?

We don't always wrap reads in READ_ONCE(). It's actually pretty rare.
So, there should be an explicit and meaningful reason.

> Note I see there is another READ_ONCE for
> kvm->arch.indirect_shadow_pages, so I am reusing the same thing.

That's not a good reason. "If all of your friends jumped off a cliff,
would you?"

> I did check the reordering issue but it should be fine because when
> 'we' see indirect_shadow_pages as 0, the shadow pages must have
> already been zapped. Not only because of the locking, but also the
> program order in __kvm_mmu_prepare_zap_page() shows that it will zap
> shadow pages first before updating the stats.
  
Mingwei Zhang June 5, 2023, 6:23 p.m. UTC | #6
On Mon, Jun 5, 2023 at 11:12 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Mon, Jun 5, 2023 at 10:42 AM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote:
> > >
> > > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > >
> > > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> > > > page role is direct because this counter value is used as a coarse-grained
> > > > heuristics to check if there is nested guest active. Racing with this
> > > > heuristics without mmu lock will be harmless because the corresponding
> > > > indirect shadow sptes for the GPA will either be zapped by this thread or
> > > > some other thread who has previously zapped all indirect shadow pages and
> > > > makes the value to 0.
> > > >
> > > > Because of that, remove the KVM MMU write lock pair to potentially reduce
> > > > the lock contension and improve the performance of nested VM. In addition
> > > > opportunistically change the comment of 'direct mmu' to make the
> > > > description consistent with other places.
> > > >
> > > > Reported-by: Jim Mattson <jmattson@google.com>
> > > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > > ---
> > > >  arch/x86/kvm/x86.c | 10 ++--------
> > > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 5ad55ef71433..97cfa5a00ff2 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > >
> > > >         kvm_release_pfn_clean(pfn);
> > > >
> > > > -       /* The instructions are well-emulated on direct mmu. */
> > > > +       /* The instructions are well-emulated on Direct MMUs. */
> > > >         if (vcpu->arch.mmu->root_role.direct) {
> > > > -               unsigned int indirect_shadow_pages;
> > > > -
> > > > -               write_lock(&vcpu->kvm->mmu_lock);
> > > > -               indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> > > > -               write_unlock(&vcpu->kvm->mmu_lock);
> > > > -
> > > > -               if (indirect_shadow_pages)
> > > > +               if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> > >
> > > I don't understand the need for READ_ONCE() here. That implies that
> > > there is something tricky going on, and I don't think that's the case.
> >
> > READ_ONCE() is just telling the compiler not to remove the read. Since
> > this is reading a global variable,  the compiler might just read a
> > previous copy if the value has already been read into a local
> > variable. But that is not the case here...
>
> Not a global variable, actually, but that's not relevant. What would
> be wrong with using a previously read copy?

Nothing will be wrong I think since this is already just a heuristic.

>
> We don't always wrap reads in READ_ONCE(). It's actually pretty rare.
> So, there should be an explicit and meaningful reason.
>
> > Note I see there is another READ_ONCE for
> > kvm->arch.indirect_shadow_pages, so I am reusing the same thing.
>
> That's not a good reason. "If all of your friends jumped off a cliff,
> would you?"

:)

>
> > I did check the reordering issue but it should be fine because when
> > 'we' see indirect_shadow_pages as 0, the shadow pages must have
> > already been zapped. Not only because of the locking, but also the
> > program order in __kvm_mmu_prepare_zap_page() shows that it will zap
> > shadow pages first before updating the stats.

yeah, I forgot to mention that removing READ_ONCE() is ok for me.
  
Sean Christopherson June 5, 2023, 6:25 p.m. UTC | #7
On Mon, Jun 05, 2023, Mingwei Zhang wrote:
> On Mon, Jun 5, 2023 at 9:55 AM Jim Mattson <jmattson@google.com> wrote:
> >
> > On Sun, Jun 4, 2023 at 5:43 PM Mingwei Zhang <mizhang@google.com> wrote:
> > >
> > > Remove KVM MMU write lock when accessing indirect_shadow_pages counter when
> > > page role is direct because this counter value is used as a coarse-grained
> > > heuristics to check if there is nested guest active. Racing with this
> > > heuristics without mmu lock will be harmless because the corresponding
> > > indirect shadow sptes for the GPA will either be zapped by this thread or
> > > some other thread who has previously zapped all indirect shadow pages and
> > > makes the value to 0.
> > >
> > > Because of that, remove the KVM MMU write lock pair to potentially reduce
> > > the lock contension and improve the performance of nested VM. In addition
> > > opportunistically change the comment of 'direct mmu' to make the
> > > description consistent with other places.
> > >
> > > Reported-by: Jim Mattson <jmattson@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 10 ++--------
> > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5ad55ef71433..97cfa5a00ff2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -8585,15 +8585,9 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > >
> > >         kvm_release_pfn_clean(pfn);
> > >
> > > -       /* The instructions are well-emulated on direct mmu. */
> > > +       /* The instructions are well-emulated on Direct MMUs. */
> > >         if (vcpu->arch.mmu->root_role.direct) {
> > > -               unsigned int indirect_shadow_pages;
> > > -
> > > -               write_lock(&vcpu->kvm->mmu_lock);
> > > -               indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> > > -               write_unlock(&vcpu->kvm->mmu_lock);
> > > -
> > > -               if (indirect_shadow_pages)
> > > +               if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> >
> > I don't understand the need for READ_ONCE() here. That implies that
> > there is something tricky going on, and I don't think that's the case.
> 
> READ_ONCE() is just telling the compiler not to remove the read. Since
> this is reading a global variable,  the compiler might just read a
> previous copy if the value has already been read into a local
> variable. But that is not the case here...
> 
> Note I see there is another READ_ONCE for
> kvm->arch.indirect_shadow_pages, so I am reusing the same thing.

I agree with Jim, using READ_ONCE() doesn't make any sense.  I suspect it may have
been a misguided attempt to force the memory read to be as close to the write_lock()
as possible, e.g. to minimize the chance of a false negative.

> I did check the reordering issue but it should be fine because when
> 'we' see indirect_shadow_pages as 0, the shadow pages must have
> already been zapped. Not only because of the locking, but also the
> program order in __kvm_mmu_prepare_zap_page() shows that it will zap
> shadow pages first before updating the stats.

I don't think zapping, i.e. the 1=>0 transition, is a concern.  KVM is dropping
the SPTE, so racing with kvm_mmu_pte_write() is a non-issue because the guest
will either see the old value, or will fault after the SPTE is zapped, i.e. KVM
won't run with a stale SPTE even if kvm_mmu_pte_write() sees '0' before TLBs are
flushed.

I believe the 0=>1 transition on the other hand does have a *very* theoretical
bug.  KVM needs to ensure that either kvm_mmu_pte_write() sees an elevated count,
or that a page fault task sees the updated guest PTE, i.e. the emulated write.
The READ_ONCE() likely serves this purpose in practice, though technically it's
insufficient.

So I think this?

---
 arch/x86/kvm/mmu.h     | 14 ++++++++++++++
 arch/x86/kvm/mmu/mmu.c | 13 ++++++++++++-
 arch/x86/kvm/x86.c     |  8 +-------
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..9cd105ccb1d4 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -264,6 +264,20 @@ static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 	return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
 }
 
+static inline bool kvm_mmu_has_indirect_shadow_pages(struct kvm *kvm)
+{
+	/*
+	 * When emulating guest writes, ensure the written value is visible to
+	 * any task that is handling page faults before checking whether or not
+	 * KVM is shadowing a guest PTE.  This ensures either KVM will create
+	 * the correct SPTE in the page fault handler, or this task will see
+	 * a non-zero indirect_shadow_pages.  Pairs with the smp_mb() in
+	 * account_shadowed() and unaccount_shadowed().
+	 */
+	smp_mb();
+	return kvm->arch.indirect_shadow_pages;
+}
+
 static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 {
 	/* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..1735bee3f653 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -830,6 +830,17 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	gfn_t gfn;
 
 	kvm->arch.indirect_shadow_pages++;
+
+	/*
+	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
+	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
+	 * emulated writes are visible before re-reading guest PTEs, or that
+	 * an emulated write will see the elevated count and acquire mmu_lock
+	 * to update SPTEs.  Pairs with the smp_mb() in
+	 * kvm_mmu_has_indirect_shadow_pages().
+	 */
+	smp_mb();
+
 	gfn = sp->gfn;
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
 	slot = __gfn_to_memslot(slots, gfn);
@@ -5692,7 +5703,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	 * If we don't have indirect shadow pages, it means no page is
 	 * write-protected, so we can exit simply.
 	 */
-	if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+	if (!kvm_mmu_has_indirect_shadow_pages(vcpu->kvm))
 		return;
 
 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index abfba3cae0ba..22c226f5f4f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8588,13 +8588,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	/* The instructions are well-emulated on direct mmu. */
 	if (vcpu->arch.mmu->root_role.direct) {
-		unsigned int indirect_shadow_pages;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		write_unlock(&vcpu->kvm->mmu_lock);
-
-		if (indirect_shadow_pages)
+		if (kvm_mmu_has_indirect_shadow_pages(vcpu->kvm))
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 		return true;

base-commit: 69b4e5b82fec7195c79c939ce25789b16a133f3a
--
  
Paolo Bonzini June 5, 2023, 6:27 p.m. UTC | #8
On 6/5/23 19:17, Ben Gardon wrote:
>>> -               if (indirect_shadow_pages)
>>> +               if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
>> I don't understand the need for READ_ONCE() here. That implies that
>> there is something tricky going on, and I don't think that's the case.
> Agree this doesn't need a READ_ONCE. Just a read is fine.
> kvm_mmu_unprotect_page starts by acquiring the MMU lock, so there's
> not much room to reorder anything anyway.

I agree that "there's not much room to reorder anything", but it's not a 
particularly formal/reassuring statement :) and READ_ONCE/WRITE_ONCE 
have documentation value too.

Or maybe it's because I've become used to the C11 memory model.  Either 
way, I like the idea to use READ_ONCE/WRITE_ONCE more explicitly 
whenever there can be concurrent accesses (which would be data races in 
the C11 model), and I think Linux is moving in that direction too.
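
(As a purely illustrative sketch of that documentation argument, hypothetical
rather than something this patch proposes, annotating both sides of the
lockless access would look something like:

	/* writer, e.g. account_shadowed(), still under mmu_lock */
	WRITE_ONCE(kvm->arch.indirect_shadow_pages,
		   kvm->arch.indirect_shadow_pages + 1);

	/* lockless reader, e.g. reexecute_instruction() */
	if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));

i.e. the markings document, at both ends, that the plain int is accessed
concurrently without the lock.)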

Paolo
  
Mingwei Zhang June 6, 2023, 10:46 p.m. UTC | #9
> > >
> > > I don't understand the need for READ_ONCE() here. That implies that
> > > there is something tricky going on, and I don't think that's the case.
> >
> > READ_ONCE() is just telling the compiler not to remove the read. Since
> > this is reading a global variable,  the compiler might just read a
> > previous copy if the value has already been read into a local
> > variable. But that is not the case here...
> >
> > Note I see there is another READ_ONCE for
> > kvm->arch.indirect_shadow_pages, so I am reusing the same thing.
>
> I agree with Jim, using READ_ONCE() doesn't make any sense.  I suspect it may have
> been a misguided attempt to force the memory read to be as close to the write_lock()
> as possible, e.g. to minimize the chance of a false negative.

Sean :) Your suggestion is the opposite of Jim's. He is suggesting doing
nothing, but your suggestion is doing way more than READ_ONCE().

>
> > I did check the reordering issue but it should be fine because when
> > 'we' see indirect_shadow_pages as 0, the shadow pages must have
> > already been zapped. Not only because of the locking, but also the
> > program order in __kvm_mmu_prepare_zap_page() shows that it will zap
> > shadow pages first before updating the stats.
>
> I don't think zapping, i.e. the 1=>0 transition, is a concern.  KVM is dropping
> the SPTE, so racing with kvm_mmu_pte_write() is a non-issue because the guest
> will either see the old value, or will fault after the SPTE is zapped, i.e. KVM
> won't run with a stale even if kvm_mmu_pte_write() sees '0' before TLBs are
> flushed.

Agree.
>
> I believe the 0=>1 transition on the other hand doesn't have a *very* theoretical
> bug.  KVM needs to ensure that either kvm_mmu_pte_write() sees an elevated count,
> or that a page fault task sees the updated guest PTE, i.e. the emulated write.
> The READ_ONCE() likely serves this purpose in practice, though technically it's
> insufficient.

Agree.

>
> So I think this?

Hmm. I agree with both points above, but below, the change seems too
heavyweight. smp_wb() is a mfence(), i.e., serializing all
loads/stores before the instruction. Doing that for every shadow page
creation and destruction seems a lot.

In fact, the only case that matters is the '0->1' transition, which may
potentially confuse kvm_mmu_pte_write() when it reads
'indirect_shadow_pages', but the majority of the cases are 'X => X + 1'
where X != 0, so those cases do not matter. If we want to add barriers, we
only need them for 0->1. Maybe creating a new variable and not blocking
account_shadowed() and unaccount_shadowed() is a better idea?

Regardless, the above problem is related to interactions among
account_shadowed(), unaccount_shadowed() and kvm_mmu_pte_write(). It has
nothing to do with reexecute_instruction(), which is what this patch is
about. So, I think having a READ_ONCE() for reexecute_instruction() should
be good enough. What do you think?


>
> ---
>  arch/x86/kvm/mmu.h     | 14 ++++++++++++++
>  arch/x86/kvm/mmu/mmu.c | 13 ++++++++++++-
>  arch/x86/kvm/x86.c     |  8 +-------
>  3 files changed, 27 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 92d5a1924fc1..9cd105ccb1d4 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -264,6 +264,20 @@ static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>         return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
>  }
>
> +static inline bool kvm_mmu_has_indirect_shadow_pages(struct kvm *kvm)
> +{
> +       /*
> +        * When emulating guest writes, ensure the written value is visible to
> +        * any task that is handling page faults before checking whether or not
> +        * KVM is shadowing a guest PTE.  This ensures either KVM will create
> +        * the correct SPTE in the page fault handler, or this task will see
> +        * a non-zero indirect_shadow_pages.  Pairs with the smp_mb() in
> +        * account_shadowed() and unaccount_shadowed().
> +        */
> +       smp_mb();
> +       return kvm->arch.indirect_shadow_pages;
> +}
> +
>  static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  {
>         /* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..1735bee3f653 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -830,6 +830,17 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>         gfn_t gfn;
>
>         kvm->arch.indirect_shadow_pages++;
> +
> +       /*
> +        * Ensure indirect_shadow_pages is elevated prior to re-reading guest
> +        * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
> +        * emulated writes are visible before re-reading guest PTEs, or that
> +        * an emulated write will see the elevated count and acquire mmu_lock
> +        * to update SPTEs.  Pairs with the smp_mb() in
> +        * kvm_mmu_has_indirect_shadow_pages().
> +        */
> +       smp_mb();
> +
>         gfn = sp->gfn;
>         slots = kvm_memslots_for_spte_role(kvm, sp->role);
>         slot = __gfn_to_memslot(slots, gfn);
> @@ -5692,7 +5703,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>          * If we don't have indirect shadow pages, it means no page is
>          * write-protected, so we can exit simply.
>          */
> -       if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> +       if (!kvm_mmu_has_indirect_shadow_pages(vcpu->kvm))
>                 return;
>
>         pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index abfba3cae0ba..22c226f5f4f8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8588,13 +8588,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
>         /* The instructions are well-emulated on direct mmu. */
>         if (vcpu->arch.mmu->root_role.direct) {
> -               unsigned int indirect_shadow_pages;
> -
> -               write_lock(&vcpu->kvm->mmu_lock);
> -               indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
> -               write_unlock(&vcpu->kvm->mmu_lock);
> -
> -               if (indirect_shadow_pages)
> +               if (kvm_mmu_has_indirect_shadow_pages(vcpu->kvm))
>                         kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>
>                 return true;
>
> base-commit: 69b4e5b82fec7195c79c939ce25789b16a133f3a
> --
>
  
Mingwei Zhang June 6, 2023, 10:48 p.m. UTC | #10
> >
> > So I think this?
>
> Hmm. I agree with both points above, but below, the change seems too
> heavyweight. smp_wb() is a mfence(), i.e., serializing all
> loads/stores before the instruction. Doing that for every shadow page
> creation and destruction seems a lot.
>
typo: smp_wb() => smp_mb().
  
Sean Christopherson June 6, 2023, 11:07 p.m. UTC | #11
On Tue, Jun 06, 2023, Mingwei Zhang wrote:
> > > >
> > > > I don't understand the need for READ_ONCE() here. That implies that
> > > > there is something tricky going on, and I don't think that's the case.
> > >
> > > READ_ONCE() is just telling the compiler not to remove the read. Since
> > > this is reading a global variable,  the compiler might just read a
> > > previous copy if the value has already been read into a local
> > > variable. But that is not the case here...
> > >
> > > Note I see there is another READ_ONCE for
> > > kvm->arch.indirect_shadow_pages, so I am reusing the same thing.
> >
> > I agree with Jim, using READ_ONCE() doesn't make any sense.  I suspect it may have
> > been a misguided attempt to force the memory read to be as close to the write_lock()
> > as possible, e.g. to minimize the chance of a false negative.
> 
> Sean :) Your suggestion is the opposite with Jim. He is suggesting
> doing nothing, but your suggestion is doing way more than READ_ONCE().

Not really.  Jim is asserting that the READ_ONCE() is pointless, and I completely
agree.  I am also saying that I think there is a real memory ordering issue here,
and that it was being papered over by the READ_ONCE() in kvm_mmu_pte_write().

> > So I think this?
> 
> Hmm. I agree with both points above, but below, the change seems too
> heavyweight. smp_wb() is a mfence(), i.e., serializing all
> loads/stores before the instruction. Doing that for every shadow page
> creation and destruction seems a lot.

No, the smp_*b() variants are just compiler barriers on x86.

> In fact, the case that only matters is '0->1' which may potentially
> confuse kvm_mmu_pte_write() when it reads 'indirect_shadow_count', but
> the majority of the cases are 'X => X + 1' where X != 0. So, those
> cases do not matter. So, if we want to add barriers, we only need it
> for 0->1. Maybe creating a new variable and not blocking
> account_shadow() and unaccount_shadow() is a better idea?
> 
> Regardless, the above problem is related to interactions among
> account_shadow(), unaccount_shadow() and kvm_mmu_pte_write(). It has
> nothing to do with the 'reexecute_instruction()', which is what this
> patch is about. So, I think having a READ_ONCE() for
> reexecute_instruction() should be good enough. What do you think.

The reexecute_instruction() case should be fine without any fanciness, it's
nothing more than a heuristic, i.e. neither a false positive nor a false negative
will impact functional correctness, and nothing changes regardless of how many
times the compiler reads the variable outside of mmu_lock.

I was thinking that it would be better to have a single helper to locklessly
access indirect_shadow_pages, but I agree that applying the barriers to
reexecute_instruction() introduces a different kind of confusion.

Want to post a v2 of yours without a READ_ONCE(), and I'll post a separate fix
for the theoretical kvm_mmu_pte_write() race?  And then Paolo can tell me that
there's no race and school me on lockless programming once more ;-)
  
Mingwei Zhang June 7, 2023, 12:23 a.m. UTC | #12
> > Hmm. I agree with both points above, but below, the change seems too
> > heavyweight. smp_wb() is a mfence(), i.e., serializing all
> > loads/stores before the instruction. Doing that for every shadow page
> > creation and destruction seems a lot.
>
> No, the smp_*b() variants are just compiler barriers on x86.

hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c
("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")

So this means smp_mb() is not a free lunch and we need to be a little
bit careful.
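
(Roughly, the x86 definitions after that commit look like the following; this
is paraphrased from arch/x86/include/asm/barrier.h rather than quoted exactly.
Only the full barrier has a runtime cost; the read/write variants reduce to
compiler barriers because x86 does not reorder loads with loads or stores
with stores:)

/* paraphrased, not a verbatim excerpt */
#define __smp_mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
#define __smp_rmb()	barrier()
#define __smp_wmb()	barrier()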

>
> > In fact, the case that only matters is '0->1' which may potentially
> > confuse kvm_mmu_pte_write() when it reads 'indirect_shadow_count', but
> > the majority of the cases are 'X => X + 1' where X != 0. So, those
> > cases do not matter. So, if we want to add barriers, we only need it
> > for 0->1. Maybe creating a new variable and not blocking
> > account_shadow() and unaccount_shadow() is a better idea?
> >
> > Regardless, the above problem is related to interactions among
> > account_shadow(), unaccount_shadow() and kvm_mmu_pte_write(). It has
> > nothing to do with the 'reexecute_instruction()', which is what this
> > patch is about. So, I think having a READ_ONCE() for
> > reexecute_instruction() should be good enough. What do you think.
>
> The reexecute_instruction() case should be fine without any fanciness, it's
> nothing more than a heuristic, i.e. neither a false positive nor a false negative
> will impact functional correctness, and nothing changes regardless of how many
> times the compiler reads the variable outside of mmu_lock.
>
> I was thinking that it would be better to have a single helper to locklessly
> access indirect_shadow_pages, but I agree that applying the barriers to
> reexecute_instruction() introduces a different kind of confusion.
>
> Want to post a v2 of yours without a READ_ONCE(), and I'll post a separate fix
> for the theoretical kvm_mmu_pte_write() race?  And then Paolo can tell me that
> there's no race and school me on lockless programming once more ;-)

yeah, that works for me.
  
Sean Christopherson June 7, 2023, 12:28 a.m. UTC | #13
On Tue, Jun 06, 2023, Mingwei Zhang wrote:
> > > Hmm. I agree with both points above, but below, the change seems too
> > > heavyweight. smp_wb() is a mfence(), i.e., serializing all
> > > loads/stores before the instruction. Doing that for every shadow page
> > > creation and destruction seems a lot.
> >
> > No, the smp_*b() variants are just compiler barriers on x86.
> 
> hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c
> ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")
> 
> So this means smp_mb() is not a free lunch and we need to be a little
> bit careful.

Oh, those sneaky macros.  x86 #defines __smp_mb(), not the outer helper.  I'll
take a closer look before posting to see if there's a way to avoid the runtime
barrier.
  
Mingwei Zhang June 15, 2023, 11:57 p.m. UTC | #14
On Tue, Jun 6, 2023 at 5:28 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jun 06, 2023, Mingwei Zhang wrote:
> > > > Hmm. I agree with both points above, but below, the change seems too
> > > > heavyweight. smp_wb() is a mfence(), i.e., serializing all
> > > > loads/stores before the instruction. Doing that for every shadow page
> > > > creation and destruction seems a lot.
> > >
> > > No, the smp_*b() variants are just compiler barriers on x86.
> >
> > hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c
> > ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")
> >
> > So this means smp_mb() is not a free lunch and we need to be a little
> > bit careful.
>
> Oh, those sneaky macros.  x86 #defines __smp_mb(), not the outer helper.  I'll
> take a closer look before posting to see if there's a way to avoid the runtime
> barrier.

Checked again, I think using smp_wmb() and smp_rmb() should be fine as
those are just compiler barriers. We don't need a full barrier here.

Thanks.
-Mingwei
  
Jim Mattson June 26, 2023, 5:38 p.m. UTC | #15
On Thu, Jun 15, 2023 at 4:58 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> On Tue, Jun 6, 2023 at 5:28 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Jun 06, 2023, Mingwei Zhang wrote:
> > > > > Hmm. I agree with both points above, but below, the change seems too
> > > > > heavyweight. smp_wb() is a mfence(), i.e., serializing all
> > > > > loads/stores before the instruction. Doing that for every shadow page
> > > > > creation and destruction seems a lot.
> > > >
> > > > No, the smp_*b() variants are just compiler barriers on x86.
> > >
> > > hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c
> > > ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")
> > >
> > > So this means smp_mb() is not a free lunch and we need to be a little
> > > bit careful.
> >
> > Oh, those sneaky macros.  x86 #defines __smp_mb(), not the outer helper.  I'll
> > take a closer look before posting to see if there's a way to avoid the runtime
> > barrier.
>
> Checked again, I think using smp_wmb() and smp_rmb() should be fine as
> those are just compiler barriers. We don't need a full barrier here.

That seems adequate.

> Thanks.
> -Mingwei
  
Sean Christopherson June 26, 2023, 8:42 p.m. UTC | #16
On Mon, Jun 26, 2023, Jim Mattson wrote:
> On Thu, Jun 15, 2023 at 4:58 PM Mingwei Zhang <mizhang@google.com> wrote:
> >
> > On Tue, Jun 6, 2023 at 5:28 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Jun 06, 2023, Mingwei Zhang wrote:
> > > > > > Hmm. I agree with both points above, but below, the change seems too
> > > > > > heavyweight. smp_wb() is a mfence(), i.e., serializing all
> > > > > > loads/stores before the instruction. Doing that for every shadow page
> > > > > > creation and destruction seems a lot.
> > > > >
> > > > > No, the smp_*b() variants are just compiler barriers on x86.
> > > >
> > > > hmm, it is a "lock addl" now for smp_mb(). Check this: 450cbdd0125c
> > > > ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")
> > > >
> > > > So this means smp_mb() is not a free lunch and we need to be a little
> > > > bit careful.
> > >
> > > Oh, those sneaky macros.  x86 #defines __smp_mb(), not the outer helper.  I'll
> > > take a closer look before posting to see if there's a way to avoid the runtime
> > > barrier.
> >
> > Checked again, I think using smp_wmb() and smp_rmb() should be fine as
> > those are just compiler barriers. We don't need a full barrier here.
> 
> That seems adequate.

Strictly speaking, no, because neither FNAME(fetch) nor kvm_mmu_pte_write() is a
pure reader or writer.  FNAME(fetch) reads guest memory (guest PTEs) and writes
indirect_shadow_pages.   kvm_mmu_pte_write() writes guest memory (guest PTEs) and
reads indirect_shadow_pages (it later writes indirect_shadow_pages too, but that
write isn't relevant to the ordering we care about here).
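
(In memory-model terms this is the classic store-buffering pattern; the sketch
below is simplified pseudocode, not actual kernel code:

	FNAME(fetch)                           kvm_mmu_pte_write()
	------------                           -------------------
	indirect_shadow_pages++;  /* store */  write guest PTE;             /* store */
	<barrier>                              <barrier>
	re-read guest PTEs;       /* load */   read indirect_shadow_pages;  /* load */

smp_wmb() and smp_rmb() only order store->store and load->load, so with those
barriers both loads can still see the other side's pre-store value. Ordering
each side's store before its own subsequent load needs a full smp_mb(), or an
equivalent such as a locked read-modify-write, on both sides.)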
  

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5ad55ef71433..97cfa5a00ff2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8585,15 +8585,9 @@  static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	kvm_release_pfn_clean(pfn);
 
-	/* The instructions are well-emulated on direct mmu. */
+	/* The instructions are well-emulated on Direct MMUs. */
 	if (vcpu->arch.mmu->root_role.direct) {
-		unsigned int indirect_shadow_pages;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
-		write_unlock(&vcpu->kvm->mmu_lock);
-
-		if (indirect_shadow_pages)
+		if (READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
 			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 		return true;