KVM: move memslot invalidation later than possible failures

Message ID 20221108084416.11447-1-yan.y.zhao@intel.com
State New
Headers
Series KVM: move memslot invalidation later than possible failures |

Commit Message

Yan Zhao Nov. 8, 2022, 8:44 a.m. UTC
  For memslot delete and move, kvm_invalidate_memslot() is required before
the real changes committed.
Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
arch x86.
And according to the definition in kvm_page_track_notifier_node, users can
drop write-protection for the pages in the memory slot on receiving
.track_flush_slot.

However, if kvm_prepare_memory_region() fails, the later
kvm_activate_memslot() will only swap back the original slot, leaving
previous write protection not recovered.

This may not be a problem for kvm itself as a page tracker user, but may
cause problem to other page tracker users, e.g. kvmgt, whose
write-protected pages are removed from the write-protected list and not
added back.

So call kvm_prepare_memory_region first for meta data preparation before
the slot invalidation so as to avoid failure and recovery.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/kvm_main.c | 40 +++++++++++++++-------------------------
 1 file changed, 15 insertions(+), 25 deletions(-)
  

Comments

Sean Christopherson Nov. 8, 2022, 5:32 p.m. UTC | #1
On Tue, Nov 08, 2022, Yan Zhao wrote:
> For memslot delete and move, kvm_invalidate_memslot() is required before
> the real changes committed.
> Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> arch x86.
> And according to the definition in kvm_page_track_notifier_node, users can
> drop write-protection for the pages in the memory slot on receiving
> .track_flush_slot.

Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT is
forced to grab its own references to KVM and also needs to manually acquire/release
mmu_lock in some flows but not others.

Anyways, this is a flaw in the page track API that should be fixed.  Flushing a
slot should not be overloaded to imply "this slot is gone", it should be a flush
command, no more, no less.

AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
of adding another hook that's invoked when the memory region change is committed.

That would allow KVMGT to fix another bug.  If a memory region is moved and the
new region partially overlaps the old region, KVMGT technically probably wants to
retain its write-protection scheme.  Though that's probably not worth supporting,
might be better to say "don't move memory regions if KVMGT is enabled", because
AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
broken for years and no one noticed).

Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
region is moved to, maybe we can get away with rejecting MOVE if an external
page tracker cares about the slot in question.

> However, if kvm_prepare_memory_region() fails, the later
> kvm_activate_memslot() will only swap back the original slot, leaving
> previous write protection not recovered.
> 
> This may not be a problem for kvm itself as a page tracker user, but may
> cause problem to other page tracker users, e.g. kvmgt, whose
> write-protected pages are removed from the write-protected list and not
> added back.
> 
> So call kvm_prepare_memory_region first for meta data preparation before
> the slot invalidation so as to avoid failure and recovery.

IIUC, this is purely a theoretical bug and practically speaking can't be a problem
in practice, at least not without completely breaking KVMGT.

On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
protected VM case is the only scenario where DELETE can fail on any architecture).
The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
kvm_arch_prepare_memory_region().

And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
the old memslot was not, and so KVM needs to allocate new metadata due to the new
memslot needed larger arrays.

As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
that supports KVMGT and moves relevant memory regions, let alove does something
that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.

Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
the existing dirty bitmap but doesn't shift the dirty bits.  This is likely
another combination that KVM can simply reject.

Assuming this is indeed purely theoretical, that should be called out in the
changelog.  Or as above, simply disallow MOVE in this case, which probably has
my vote.

> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  virt/kvm/kvm_main.c | 40 +++++++++++++++-------------------------
>  1 file changed, 15 insertions(+), 25 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 25d7872b29c1..5f29011f432d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1827,45 +1827,35 @@ static int kvm_set_memslot(struct kvm *kvm,
>  	 */
>  	mutex_lock(&kvm->slots_arch_lock);
>  
> -	/*
> -	 * Invalidate the old slot if it's being deleted or moved.  This is
> -	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> -	 * continue running by ensuring there are no mappings or shadow pages
> -	 * for the memslot when it is deleted/moved.  Without pre-invalidation
> -	 * (and without a lock), a window would exist between effecting the
> -	 * delete/move and committing the changes in arch code where KVM or a
> -	 * guest could access a non-existent memslot.
> -	 *
> -	 * Modifications are done on a temporary, unreachable slot.  The old
> -	 * slot needs to be preserved in case a later step fails and the
> -	 * invalidation needs to be reverted.
> -	 */
>  	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>  		invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
>  		if (!invalid_slot) {
>  			mutex_unlock(&kvm->slots_arch_lock);
>  			return -ENOMEM;
>  		}
> -		kvm_invalidate_memslot(kvm, old, invalid_slot);
>  	}
>  
>  	r = kvm_prepare_memory_region(kvm, old, new, change);
>  	if (r) {
> -		/*
> -		 * For DELETE/MOVE, revert the above INVALID change.  No
> -		 * modifications required since the original slot was preserved
> -		 * in the inactive slots.  Changing the active memslots also
> -		 * release slots_arch_lock.
> -		 */
> -		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> -			kvm_activate_memslot(kvm, invalid_slot, old);
> +		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
>  			kfree(invalid_slot);
> -		} else {
> -			mutex_unlock(&kvm->slots_arch_lock);
> -		}
> +
> +		mutex_unlock(&kvm->slots_arch_lock);
>  		return r;
>  	}
>  
> +	/*
> +	 * Invalidate the old slot if it's being deleted or moved.  This is
> +	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> +	 * continue running by ensuring there are no mappings or shadow pages
> +	 * for the memslot when it is deleted/moved.  Without pre-invalidation
> +	 * (and without a lock), a window would exist between effecting the
> +	 * delete/move and committing the changes in arch code where KVM or a
> +	 * guest could access a non-existent memslot.
> +	 */
> +	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> +		kvm_invalidate_memslot(kvm, old, invalid_slot);

I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
correct.  It might be, but there are so many subtleties in this code that I don't
want to change this ordering unless absolutely necessary, or at least not without
an in-depth analysis and a pile of tests.  E.g. it's possible one or more
architectures relies on unmapping, flushing, and invalidating the old region
prior to preparing the new region, e.g. to reuse arch memslot data.

> +
>  	/*
>  	 * For DELETE and MOVE, the working slot is now active as the INVALID
>  	 * version of the old slot.  MOVE is particularly special as it reuses
> -- 
> 2.17.1
>
  
Yan Zhao Nov. 9, 2022, 6:47 a.m. UTC | #2
On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> On Tue, Nov 08, 2022, Yan Zhao wrote:
> > For memslot delete and move, kvm_invalidate_memslot() is required before
> > the real changes committed.
> > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > arch x86.
> > And according to the definition in kvm_page_track_notifier_node, users can
> > drop write-protection for the pages in the memory slot on receiving
> > .track_flush_slot.
> 
> Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT is
> forced to grab its own references to KVM and also needs to manually acquire/release
> mmu_lock in some flows but not others.
> 
> Anyways, this is a flaw in the page track API that should be fixed.  Flushing a
> slot should not be overloaded to imply "this slot is gone", it should be a flush
> command, no more, no less.
hmm, but KVM also registers to the page track
"node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
write protection. But KVM is ok because the dropped write-protection can be
rebuilt during rebuilding the shadow page table.
So, for .track_flush_slot, it's expected that "users can drop write-protection
for the pages in the memory slot", right?


> AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> of adding another hook that's invoked when the memory region change is committed.
>
Do you mean adding a new hook in page track, e.g. .track_slot_change?
Then right before committing slot changes, call this interface to notify slot
DELETE/MOVE?

Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
support below usecase:
1. KVM pre-populated a memslot
2. memslot MOVE happened
3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
So also add a new interface for the MOVE failure?

> That would allow KVMGT to fix another bug.  If a memory region is moved and the
> new region partially overlaps the old region, KVMGT technically probably wants to
> retain its write-protection scheme.  Though that's probably not worth supporting,
> might be better to say "don't move memory regions if KVMGT is enabled", because
> AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> broken for years and no one noticed).
>
So, could we disable MOVE in KVM at all?

> Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> region is moved to, maybe we can get away with rejecting MOVE if an external
> page tracker cares about the slot in question.
> 
> > However, if kvm_prepare_memory_region() fails, the later
> > kvm_activate_memslot() will only swap back the original slot, leaving
> > previous write protection not recovered.
> > 
> > This may not be a problem for kvm itself as a page tracker user, but may
> > cause problem to other page tracker users, e.g. kvmgt, whose
> > write-protected pages are removed from the write-protected list and not
> > added back.
> > 
> > So call kvm_prepare_memory_region first for meta data preparation before
> > the slot invalidation so as to avoid failure and recovery.
> 
> IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> in practice, at least not without completely breaking KVMGT.
> 
> On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> protected VM case is the only scenario where DELETE can fail on any architecture).
> The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> kvm_arch_prepare_memory_region().
But as long as with current code sequence, we can't relying on that
kvm_prepare_memory_region() will never fail for DELETE.
Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
possible broken.

> 
> And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> the old memslot was not, and so KVM needs to allocate new metadata due to the new
> memslot needed larger arrays.
kvm_prepare_memory_region() can also fail for MOVE during live migration when
memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
kvm_alloc_memslot_metadata() fails due to -ENOMEM. 
BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
and so KVM needs to allocate new metadata due to the new memslot needed
larger arrays". 
> 
> As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> that supports KVMGT and moves relevant memory regions, let alove does something
> that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> 
> Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> the existing dirty bitmap but doesn't shift the dirty bits.  This is likely
Do you mean, for the new slot in MOVE, the new dirty bitmap should be
cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
are not changed and dirty_bitmap is indexed using rel_gfn 
(rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
size to BITS_PER_LONG.

> another combination that KVM can simply reject.
> 
> Assuming this is indeed purely theoretical, that should be called out in the
> changelog.  Or as above, simply disallow MOVE in this case, which probably has
> my vote.
>
Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
in upstream QEMU.

<...>

> I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> correct.  It might be, but there are so many subtleties in this code that I don't
> want to change this ordering unless absolutely necessary, or at least not without
> an in-depth analysis and a pile of tests.  E.g. it's possible one or more
> architectures relies on unmapping, flushing, and invalidating the old region
> prior to preparing the new region, e.g. to reuse arch memslot data.
yes. what about just moving kvm_arch_flush_shadow_memslot() and
kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25d7872b29c1..f9d93be2ead2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1716,6 +1716,19 @@ static void kvm_copy_memslot(struct kvm_memory_slot *dest,
 }

 static void kvm_invalidate_memslot(struct kvm *kvm,
+                                  struct kvm_memory_slot *old)
+{
+       /*
+        * From this point no new shadow pages pointing to a deleted, or moved,
+        * memslot will be created.  Validation of sp->gfn happens in:
+        *      - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
+        *      - kvm_is_visible_gfn (mmu_check_root)
+        */
+       kvm_arch_flush_shadow_memslot(kvm, old);
+       kvm_arch_guest_memory_reclaimed(kvm);
+}
+
+static void kvm_deactive_memslot(struct kvm *kvm,
                                   struct kvm_memory_slot *old,
                                   struct kvm_memory_slot *invalid_slot)
 {
@@ -1735,15 +1748,6 @@ static void kvm_invalidate_memslot(struct kvm *kvm,
         */
        kvm_swap_active_memslots(kvm, old->as_id);

-       /*
-        * From this point no new shadow pages pointing to a deleted, or moved,
-        * memslot will be created.  Validation of sp->gfn happens in:
-        *      - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
-        *      - kvm_is_visible_gfn (mmu_check_root)
-        */
-       kvm_arch_flush_shadow_memslot(kvm, old);
-       kvm_arch_guest_memory_reclaimed(kvm);
-
        /* Was released by kvm_swap_active_memslots, reacquire. */
        mutex_lock(&kvm->slots_arch_lock);

@@ -1846,7 +1850,7 @@ static int kvm_set_memslot(struct kvm *kvm,
                        mutex_unlock(&kvm->slots_arch_lock);
                        return -ENOMEM;
                }
-               kvm_invalidate_memslot(kvm, old, invalid_slot);
+               kvm_deactive_memslot(kvm, old, invalid_slot);
        }

        r = kvm_prepare_memory_region(kvm, old, new, change);
@@ -1866,6 +1870,9 @@ static int kvm_set_memslot(struct kvm *kvm,
                return r;
        }

+       if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+               kvm_invalidate_memslot(kvm, old);
+
        /*
         * For DELETE and MOVE, the working slot is now active as the INVALID
         * version of the old slot.  MOVE is particularly special as it reuses


Thanks
Yan
  
Sean Christopherson Nov. 10, 2022, 12:33 a.m. UTC | #3
On Wed, Nov 09, 2022, Yan Zhao wrote:
> On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> > On Tue, Nov 08, 2022, Yan Zhao wrote:
> > > For memslot delete and move, kvm_invalidate_memslot() is required before
> > > the real changes committed.
> > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > > arch x86.
> > > And according to the definition in kvm_page_track_notifier_node, users can
> > > drop write-protection for the pages in the memory slot on receiving
> > > .track_flush_slot.
> > 
> > Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT is
> > forced to grab its own references to KVM and also needs to manually acquire/release
> > mmu_lock in some flows but not others.
> > 
> > Anyways, this is a flaw in the page track API that should be fixed.  Flushing a
> > slot should not be overloaded to imply "this slot is gone", it should be a flush
> > command, no more, no less.
> hmm, but KVM also registers to the page track
> "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
> And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
> page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
> write protection. But KVM is ok because the dropped write-protection can be
> rebuilt during rebuilding the shadow page table.
> So, for .track_flush_slot, it's expected that "users can drop write-protection
> for the pages in the memory slot", right?

No.  KVM isn't actually dropping write-protection, because for the internal KVM
case, KVM obliterates all of its page tables.

> > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> > of adding another hook that's invoked when the memory region change is committed.
> >
> Do you mean adding a new hook in page track, e.g. .track_slot_change?
> Then right before committing slot changes, call this interface to notify slot
> DELETE/MOVE?
> 
> Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
> support below usecase:
> 1. KVM pre-populated a memslot
> 2. memslot MOVE happened
> 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
> So also add a new interface for the MOVE failure?
> 
> > That would allow KVMGT to fix another bug.  If a memory region is moved and the
> > new region partially overlaps the old region, KVMGT technically probably wants to
> > retain its write-protection scheme.  Though that's probably not worth supporting,
> > might be better to say "don't move memory regions if KVMGT is enabled", because
> > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> > broken for years and no one noticed).
> >
> So, could we disable MOVE in KVM at all?

Ideally, yes.  Practically?  Dunno.  It's difficult to know whether or not there
are users out there.

> > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> > region is moved to, maybe we can get away with rejecting MOVE if an external
> > page tracker cares about the slot in question.
> > 
> > > However, if kvm_prepare_memory_region() fails, the later
> > > kvm_activate_memslot() will only swap back the original slot, leaving
> > > previous write protection not recovered.
> > > 
> > > This may not be a problem for kvm itself as a page tracker user, but may
> > > cause problem to other page tracker users, e.g. kvmgt, whose
> > > write-protected pages are removed from the write-protected list and not
> > > added back.
> > > 
> > > So call kvm_prepare_memory_region first for meta data preparation before
> > > the slot invalidation so as to avoid failure and recovery.
> > 
> > IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> > in practice, at least not without completely breaking KVMGT.
> > 
> > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> > protected VM case is the only scenario where DELETE can fail on any architecture).
> > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> > kvm_arch_prepare_memory_region().
> But as long as with current code sequence, we can't relying on that
> kvm_prepare_memory_region() will never fail for DELETE.
> Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
> possible broken.

Agreed, I just don't want to touch the common memslots code unless it's necessary.

> > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> > the old memslot was not, and so KVM needs to allocate new metadata due to the new
> > memslot needed larger arrays.
> kvm_prepare_memory_region() can also fail for MOVE during live migration when
> memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
> and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
> kvm_alloc_memslot_metadata() fails due to -ENOMEM. 
> BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
> and so KVM needs to allocate new metadata due to the new memslot needed
> larger arrays". 
> > 
> > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> > that supports KVMGT and moves relevant memory regions, let alove does something
> > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> > 
> > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> > the existing dirty bitmap but doesn't shift the dirty bits.  This is likely
> Do you mean, for the new slot in MOVE, the new dirty bitmap should be
> cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
> are not changed and dirty_bitmap is indexed using rel_gfn 
> (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
> size to BITS_PER_LONG.

Oh, you're right.  I forgot that userspace would also be doing a gfn-relative
calculation in the ridiculously theoretically scenario that a memslot is moved
while it's being dirty-logged.

> > another combination that KVM can simply reject.
> > 
> > Assuming this is indeed purely theoretical, that should be called out in the
> > changelog.  Or as above, simply disallow MOVE in this case, which probably has
> > my vote.
> >
> Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
> in upstream QEMU.
> 
> <...>
> 
> > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> > correct.  It might be, but there are so many subtleties in this code that I don't
> > want to change this ordering unless absolutely necessary, or at least not without
> > an in-depth analysis and a pile of tests.  E.g. it's possible one or more
> > architectures relies on unmapping, flushing, and invalidating the old region
> > prior to preparing the new region, e.g. to reuse arch memslot data.
> yes. what about just moving kvm_arch_flush_shadow_memslot() and
> kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?

I'm not dead set against tweaking the memslots flow, but I don't want to do so to
fix this extremely theoretical bug.  In other words, I want to fix this by giving
KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around
a poor KVM x86 API.

And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely
after cleaning up clean up another pile of ugly: KVM always registers a page-track
notifier because it relies on the track_flush_slot() call to invoke
kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything.
I'll send patches for this; if/when both land, track_flush_slot() can be deleted
on top.
  
Yan Zhao Nov. 10, 2022, 12:50 a.m. UTC | #4
On Thu, Nov 10, 2022 at 12:33:25AM +0000, Sean Christopherson wrote:
> On Wed, Nov 09, 2022, Yan Zhao wrote:
> > On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> > > On Tue, Nov 08, 2022, Yan Zhao wrote:
> > > > For memslot delete and move, kvm_invalidate_memslot() is required before
> > > > the real changes committed.
> > > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > > > arch x86.
> > > > And according to the definition in kvm_page_track_notifier_node, users can
> > > > drop write-protection for the pages in the memory slot on receiving
> > > > .track_flush_slot.
> > > 
> > > Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT is
> > > forced to grab its own references to KVM and also needs to manually acquire/release
> > > mmu_lock in some flows but not others.
> > > 
> > > Anyways, this is a flaw in the page track API that should be fixed.  Flushing a
> > > slot should not be overloaded to imply "this slot is gone", it should be a flush
> > > command, no more, no less.
> > hmm, but KVM also registers to the page track
> > "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
> > And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
> > page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
> > write protection. But KVM is ok because the dropped write-protection can be
> > rebuilt during rebuilding the shadow page table.
> > So, for .track_flush_slot, it's expected that "users can drop write-protection
> > for the pages in the memory slot", right?
> 
> No.  KVM isn't actually dropping write-protection, because for the internal KVM
> case, KVM obliterates all of its page tables.
> 
> > > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> > > of adding another hook that's invoked when the memory region change is committed.
> > >
> > Do you mean adding a new hook in page track, e.g. .track_slot_change?
> > Then right before committing slot changes, call this interface to notify slot
> > DELETE/MOVE?
> > 
> > Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
> > support below usecase:
> > 1. KVM pre-populated a memslot
> > 2. memslot MOVE happened
> > 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
> > So also add a new interface for the MOVE failure?
> > 
> > > That would allow KVMGT to fix another bug.  If a memory region is moved and the
> > > new region partially overlaps the old region, KVMGT technically probably wants to
> > > retain its write-protection scheme.  Though that's probably not worth supporting,
> > > might be better to say "don't move memory regions if KVMGT is enabled", because
> > > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> > > broken for years and no one noticed).
> > >
> > So, could we disable MOVE in KVM at all?
> 
> Ideally, yes.  Practically?  Dunno.  It's difficult to know whether or not there
> are users out there.
> 
> > > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> > > region is moved to, maybe we can get away with rejecting MOVE if an external
> > > page tracker cares about the slot in question.
> > > 
> > > > However, if kvm_prepare_memory_region() fails, the later
> > > > kvm_activate_memslot() will only swap back the original slot, leaving
> > > > previous write protection not recovered.
> > > > 
> > > > This may not be a problem for kvm itself as a page tracker user, but may
> > > > cause problem to other page tracker users, e.g. kvmgt, whose
> > > > write-protected pages are removed from the write-protected list and not
> > > > added back.
> > > > 
> > > > So call kvm_prepare_memory_region first for meta data preparation before
> > > > the slot invalidation so as to avoid failure and recovery.
> > > 
> > > IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> > > in practice, at least not without completely breaking KVMGT.
> > > 
> > > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> > > protected VM case is the only scenario where DELETE can fail on any architecture).
> > > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> > > kvm_arch_prepare_memory_region().
> > But as long as with current code sequence, we can't relying on that
> > kvm_prepare_memory_region() will never fail for DELETE.
> > Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
> > possible broken.
> 
> Agreed, I just don't want to touch the common memslots code unless it's necessary.
Ok. Let me prepare a patch for it.

> 
> > > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> > > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> > > the old memslot was not, and so KVM needs to allocate new metadata due to the new
> > > memslot needed larger arrays.
> > kvm_prepare_memory_region() can also fail for MOVE during live migration when
> > memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
> > and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
> > kvm_alloc_memslot_metadata() fails due to -ENOMEM. 
> > BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
> > and so KVM needs to allocate new metadata due to the new memslot needed
> > larger arrays". 
> > > 
> > > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> > > that supports KVMGT and moves relevant memory regions, let alove does something
> > > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> > > 
> > > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> > > the existing dirty bitmap but doesn't shift the dirty bits.  This is likely
> > Do you mean, for the new slot in MOVE, the new dirty bitmap should be
> > cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
> > are not changed and dirty_bitmap is indexed using rel_gfn 
> > (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
> > size to BITS_PER_LONG.
> 
> Oh, you're right.  I forgot that userspace would also be doing a gfn-relative
> calculation in the ridiculously theoretically scenario that a memslot is moved
> while it's being dirty-logged.
> 
> > > another combination that KVM can simply reject.
> > > 
> > > Assuming this is indeed purely theoretical, that should be called out in the
> > > changelog.  Or as above, simply disallow MOVE in this case, which probably has
> > > my vote.
> > >
> > Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
> > in upstream QEMU.
> > 
> > <...>
> > 
> > > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> > > correct.  It might be, but there are so many subtleties in this code that I don't
> > > want to change this ordering unless absolutely necessary, or at least not without
> > > an in-depth analysis and a pile of tests.  E.g. it's possible one or more
> > > architectures relies on unmapping, flushing, and invalidating the old region
> > > prior to preparing the new region, e.g. to reuse arch memslot data.
> > yes. what about just moving kvm_arch_flush_shadow_memslot() and
> > kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?
> 
> I'm not dead set against tweaking the memslots flow, but I don't want to do so to
> fix this extremely theoretical bug.  In other words, I want to fix this by giving
> KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around
> a poor KVM x86 API.
> 
> And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely
> after cleaning up clean up another pile of ugly: KVM always registers a page-track
> notifier because it relies on the track_flush_slot() call to invoke
> kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything.
> I'll send patches for this; if/when both land, track_flush_slot() can be deleted
> on top.
Ok. thanks. I'll watch and decide what to do based on your changes.

Thanks
Yan
  

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25d7872b29c1..5f29011f432d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1827,45 +1827,35 @@  static int kvm_set_memslot(struct kvm *kvm,
 	 */
 	mutex_lock(&kvm->slots_arch_lock);
 
-	/*
-	 * Invalidate the old slot if it's being deleted or moved.  This is
-	 * done prior to actually deleting/moving the memslot to allow vCPUs to
-	 * continue running by ensuring there are no mappings or shadow pages
-	 * for the memslot when it is deleted/moved.  Without pre-invalidation
-	 * (and without a lock), a window would exist between effecting the
-	 * delete/move and committing the changes in arch code where KVM or a
-	 * guest could access a non-existent memslot.
-	 *
-	 * Modifications are done on a temporary, unreachable slot.  The old
-	 * slot needs to be preserved in case a later step fails and the
-	 * invalidation needs to be reverted.
-	 */
 	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
 		invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
 		if (!invalid_slot) {
 			mutex_unlock(&kvm->slots_arch_lock);
 			return -ENOMEM;
 		}
-		kvm_invalidate_memslot(kvm, old, invalid_slot);
 	}
 
 	r = kvm_prepare_memory_region(kvm, old, new, change);
 	if (r) {
-		/*
-		 * For DELETE/MOVE, revert the above INVALID change.  No
-		 * modifications required since the original slot was preserved
-		 * in the inactive slots.  Changing the active memslots also
-		 * release slots_arch_lock.
-		 */
-		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
-			kvm_activate_memslot(kvm, invalid_slot, old);
+		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
 			kfree(invalid_slot);
-		} else {
-			mutex_unlock(&kvm->slots_arch_lock);
-		}
+
+		mutex_unlock(&kvm->slots_arch_lock);
 		return r;
 	}
 
+	/*
+	 * Invalidate the old slot if it's being deleted or moved.  This is
+	 * done prior to actually deleting/moving the memslot to allow vCPUs to
+	 * continue running by ensuring there are no mappings or shadow pages
+	 * for the memslot when it is deleted/moved.  Without pre-invalidation
+	 * (and without a lock), a window would exist between effecting the
+	 * delete/move and committing the changes in arch code where KVM or a
+	 * guest could access a non-existent memslot.
+	 */
+	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+		kvm_invalidate_memslot(kvm, old, invalid_slot);
+
 	/*
 	 * For DELETE and MOVE, the working slot is now active as the INVALID
 	 * version of the old slot.  MOVE is particularly special as it reuses