[v2,4/4] KVM: x86/mmu: Fix a *very* theoretical race in kvm_mmu_track_write()

Message ID 20240203002343.383056-5-seanjc@google.com
State New
Headers
Series KVM: x86/mmu: Clean up indirect_shadow_pages usage |

Commit Message

Sean Christopherson Feb. 3, 2024, 12:23 a.m. UTC
  Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
to plug a (very, very theoretical) race where kvm_mmu_track_write() could
miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
*stale* SPTEs.

Without the barriers, because modern x86 CPUs allow (per the SDM):

  Reads may be reordered with older writes to different locations but not
  with older writes to the same location.

it's (again, super theoretically) possible that the following could happen
(terms of values being visible/resolved):

 CPU0                          CPU1
 read memory[gfn] (=Y)
                               memory[gfn] Y=>X
                               read indirect_shadow_pages (=0)
 indirect_shadow_pages 0=>1

or conversely:

 CPU0                          CPU1
 indirect_shadow_pages 0=>1
                               read indirect_shadow_pages (=0)
 read memory[gfn] (=Y)
                               memory[gfn] Y=>X

In practice, this bug is likely benign as both the 0=>1 transition and
reordering of this scope are extremely rare occurrences.

Note, if the cost of the barrier (which is simply a locked ADD, see commit
450cbdd0125c ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")),
is problematic, KVM could avoid the barrier by bailing earlier if checking
kvm_memslots_have_rmaps() is false.  But the odds of the barrier being
problematic is extremely low, *and* the odds of the extra checks being
meaningfully faster overall is also low.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)
  

Comments

Paolo Bonzini Feb. 23, 2024, 8:09 a.m. UTC | #1
On 2/3/24 01:23, Sean Christopherson wrote:
> Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
> to plug a (very, very theoretical) race where kvm_mmu_track_write() could
> miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
> *stale* SPTEs.

Ok, so we have

emulator_write_phys
   overwrite PTE
   kvm_page_track_write
     kvm_mmu_track_write
       // memory barrier missing here
       if (indirect_shadow_pages)
         zap();

and on the other side

   FNAME(page_fault)
     FNAME(fetch)
       kvm_mmu_get_child_sp
         kvm_mmu_get_shadow_page
           __kvm_mmu_get_shadow_page
             kvm_mmu_alloc_shadow_page
               account_shadowed
                 indirect shadow pages++
                 // memory barrier missing here
       if (FNAME(gpte_changed)) // reads PTE
         goto out

If you can weave something like this in the commit message the sequence 
would be a bit clearer.

> In practice, this bug is likely benign as both the 0=>1 transition and
> reordering of this scope are extremely rare occurrences.

I wouldn't call it benign, it's more that it's unobservable in practice 
but the race is real.  However...
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c193b096b45..86b85060534d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>   	struct kvm_memory_slot *slot;
>   	gfn_t gfn;
>   
> +	/*
> +	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
> +	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
> +	 * emulated writes are visible before re-reading guest PTEs, or that
> +	 * an emulated write will see the elevated count and acquire mmu_lock
> +	 * to update SPTEs.  Pairs with the smp_mb() in kvm_mmu_track_write().
> +	 */
> +	smp_mb();

.. this memory barrier needs to be after the increment (the desired 
ordering is store-before-read).

Paolo

>   	kvm->arch.indirect_shadow_pages++;
>   	gfn = sp->gfn;
>   	slots = kvm_memslots_for_spte_role(kvm, sp->role);
> @@ -5747,10 +5755,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
>   	bool flush = false;
>   
>   	/*
> -	 * If we don't have indirect shadow pages, it means no page is
> -	 * write-protected, so we can exit simply.
> +	 * When emulating guest writes, ensure the written value is visible to
> +	 * any task that is handling page faults before checking whether or not
> +	 * KVM is shadowing a guest PTE.  This ensures either KVM will create
> +	 * the correct SPTE in the page fault handler, or this task will see
> +	 * a non-zero indirect_shadow_pages.  Pairs with the smp_mb() in
> +	 * account_shadowed().
>   	 */
> -	if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> +	smp_mb();
> +	if (!vcpu->kvm->arch.indirect_shadow_pages)
>   		return;
>   
>   	write_lock(&vcpu->kvm->mmu_lock);
  
Sean Christopherson Feb. 23, 2024, 6:12 p.m. UTC | #2
On Fri, Feb 23, 2024, Paolo Bonzini wrote:
> On 2/3/24 01:23, Sean Christopherson wrote:
> > Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
> > to plug a (very, very theoretical) race where kvm_mmu_track_write() could
> > miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
> > *stale* SPTEs.
> 
> Ok, so we have
> 
> emulator_write_phys
>   overwrite PTE
>   kvm_page_track_write
>     kvm_mmu_track_write
>       // memory barrier missing here
>       if (indirect_shadow_pages)
>         zap();
> 
> and on the other side
> 
>   FNAME(page_fault)
>     FNAME(fetch)
>       kvm_mmu_get_child_sp
>         kvm_mmu_get_shadow_page
>           __kvm_mmu_get_shadow_page
>             kvm_mmu_alloc_shadow_page
>               account_shadowed
>                 indirect shadow pages++
>                 // memory barrier missing here
>       if (FNAME(gpte_changed)) // reads PTE
>         goto out
> 
> If you can weave something like this in the commit message the sequence
> would be a bit clearer.

Roger that.

> > In practice, this bug is likely benign as both the 0=>1 transition and
> > reordering of this scope are extremely rare occurrences.
> 
> I wouldn't call it benign, it's more that it's unobservable in practice but
> the race is real.  However...
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c193b096b45..86b85060534d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> >   	struct kvm_memory_slot *slot;
> >   	gfn_t gfn;
> > +	/*
> > +	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
> > +	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
> > +	 * emulated writes are visible before re-reading guest PTEs, or that
> > +	 * an emulated write will see the elevated count and acquire mmu_lock
> > +	 * to update SPTEs.  Pairs with the smp_mb() in kvm_mmu_track_write().
> > +	 */
> > +	smp_mb();
> 
> ... this memory barrier needs to be after the increment (the desired
> ordering is store-before-read).

Doh.  I'll post a fixed version as a one-off v3.

Thanks!
  

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c193b096b45..86b85060534d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -830,6 +830,14 @@  static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	struct kvm_memory_slot *slot;
 	gfn_t gfn;
 
+	/*
+	 * Ensure indirect_shadow_pages is elevated prior to re-reading guest
+	 * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
+	 * emulated writes are visible before re-reading guest PTEs, or that
+	 * an emulated write will see the elevated count and acquire mmu_lock
+	 * to update SPTEs.  Pairs with the smp_mb() in kvm_mmu_track_write().
+	 */
+	smp_mb();
 	kvm->arch.indirect_shadow_pages++;
 	gfn = sp->gfn;
 	slots = kvm_memslots_for_spte_role(kvm, sp->role);
@@ -5747,10 +5755,15 @@  void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 	bool flush = false;
 
 	/*
-	 * If we don't have indirect shadow pages, it means no page is
-	 * write-protected, so we can exit simply.
+	 * When emulating guest writes, ensure the written value is visible to
+	 * any task that is handling page faults before checking whether or not
+	 * KVM is shadowing a guest PTE.  This ensures either KVM will create
+	 * the correct SPTE in the page fault handler, or this task will see
+	 * a non-zero indirect_shadow_pages.  Pairs with the smp_mb() in
+	 * account_shadowed().
 	 */
-	if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+	smp_mb();
+	if (!vcpu->kvm->arch.indirect_shadow_pages)
 		return;
 
 	write_lock(&vcpu->kvm->mmu_lock);