[v2,1/3] KVM: x86: add a new page track hook track_remove_slot

Message ID 20221111103350.22326-1-yan.y.zhao@intel.com
State New
Headers
Series add track_remove_slot and remove track_flush_slot |

Commit Message

Yan Zhao Nov. 11, 2022, 10:33 a.m. UTC
  Page track hook track_remove_slot is used to notify users that a slot
has been removed and is called when a slot DELETE/MOVE is about to be
completed.

Users of this hook can drop write protections in the removed slot.

Note:
Since KVM_MR_MOVE currently never actually happens in KVM/QEMU, and has
never been properly supported in the external page track user, we just
use the hook track_remove_slot to notify users of the old slot being
removed.

Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_page_track.h | 11 +++++++++++
 arch/x86/kvm/mmu/page_track.c         | 26 ++++++++++++++++++++++++++
 arch/x86/kvm/x86.c                    |  3 +++
 3 files changed, 40 insertions(+)
  

Comments

Sean Christopherson Nov. 11, 2022, 6:19 p.m. UTC | #1
TL;DR: I'm going to try to add more aggressive patches for this into my series to
clean up the KVM side of things, along with many more patches to clean up the page
track APIs.

I'll post patches next week if things go well (fingers crossed), and if not I'll
give an update 

On Fri, Nov 11, 2022, Yan Zhao wrote:
> Page track hook track_remove_slot is used to notify users that a slot
> has been removed and is called when a slot DELETE/MOVE is about to be
> completed.

Phrase this as a command, and explain _why_ the new hook is being added, e.g.

  Add a new page track hook, track_remove_slot(), that is called when a
  memslot DELETE/MOVE operation is about to be committed.  The "remove"
  hook will be used by KVMGT and will effectively replace the existing
  track_flush_slot() altogether now that KVM itself doesn't rely on the
  "flush" hook either.

  The "flush" hook is flawed as it's invoked before the memslot operation
  is guaranteed, i.e. KVM might ultimately keep the existing memslot without
  notifying external page track users, a.k.a. KVMGT.

> Users of this hook can drop write protections in the removed slot.

Hmm, actually, on second thought, after thinking about what KVGT is doing in
response to the memslot update, I think we should be more aggressive and actively
prevent MOVE if there are external page trackers, i.e. if KVMGT is attached.

Dropping write protections when a memslot is being deleted is a waste of cycles.
The memslot and thus gfn_track is literally being deleted immediately after invoking
the hook, updating gfn_track from KVMGT is completely unecessary.

I.e. if we kill off the MOVE path, then KVMGT just needs to delete its hash table
entry.

Oooh!  Looking at this code again made me realize that the larger page track cleanup
that I want to do might actually work.  Long story short, I want to stop forcing
KVMGT to poke into KVM internals, but I thought there was a lock inversion problem.

But AFAICT, there is no such problem.  And the cleanup I want to do will actually
fix an existing KVMGT bug: kvmgt_page_track_write() invokes kvmgt_gfn_is_write_protected()
without holding mmu_lock, and thus could consume garbage when walking the hash
table.

  static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
		const u8 *val, int len,
		struct kvm_page_track_notifier_node *node)
  {
	struct intel_vgpu *info =
		container_of(node, struct intel_vgpu, track_node);

	if (kvmgt_gfn_is_write_protected(info, gpa_to_gfn(gpa)))
		intel_vgpu_page_track_handler(info, gpa,
						     (void *)val, len);
  }

Acquiring mmu_lock isn't an option as intel_vgpu_page_track_handler() might sleep,
e.g. when acquiring vgpu_lock.

Let me see if the clean up I have in mind will actually work.  If it does, I think
the end result will be quite nice for both KVM and KVMGT.
  
Yan Zhao Nov. 12, 2022, 12:03 a.m. UTC | #2
On Fri, Nov 11, 2022 at 06:19:15PM +0000, Sean Christopherson wrote:
> TL;DR: I'm going to try to add more aggressive patches for this into my series to
> clean up the KVM side of things, along with many more patches to clean up the page
> track APIs.
> 
> I'll post patches next week if things go well (fingers crossed), and if not I'll
> give an update 
> 
> On Fri, Nov 11, 2022, Yan Zhao wrote:
> > Page track hook track_remove_slot is used to notify users that a slot
> > has been removed and is called when a slot DELETE/MOVE is about to be
> > completed.
> 
> Phrase this as a command, and explain _why_ the new hook is being added, e.g.
> 
>   Add a new page track hook, track_remove_slot(), that is called when a
>   memslot DELETE/MOVE operation is about to be committed.  The "remove"
>   hook will be used by KVMGT and will effectively replace the existing
>   track_flush_slot() altogether now that KVM itself doesn't rely on the
>   "flush" hook either.
> 
>   The "flush" hook is flawed as it's invoked before the memslot operation
>   is guaranteed, i.e. KVM might ultimately keep the existing memslot without
>   notifying external page track users, a.k.a. KVMGT.
> 
> > Users of this hook can drop write protections in the removed slot.
> 
> Hmm, actually, on second thought, after thinking about what KVGT is doing in
> response to the memslot update, I think we should be more aggressive and actively
> prevent MOVE if there are external page trackers, i.e. if KVMGT is attached.
> 
> Dropping write protections when a memslot is being deleted is a waste of cycles.
> The memslot and thus gfn_track is literally being deleted immediately after invoking
> the hook, updating gfn_track from KVMGT is completely unecessary.

> I.e. if we kill off the MOVE path, then KVMGT just needs to delete its hash table
> entry.
> 
> Oooh!  Looking at this code again made me realize that the larger page track cleanup
> that I want to do might actually work.  Long story short, I want to stop forcing
> KVMGT to poke into KVM internals, but I thought there was a lock inversion problem.
> 
> But AFAICT, there is no such problem.  And the cleanup I want to do will actually
> fix an existing KVMGT bug: kvmgt_page_track_write() invokes kvmgt_gfn_is_write_protected()
> without holding mmu_lock, and thus could consume garbage when walking the hash
> table.
> 
>   static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> 		const u8 *val, int len,
> 		struct kvm_page_track_notifier_node *node)
>   {
> 	struct intel_vgpu *info =
> 		container_of(node, struct intel_vgpu, track_node);
> 
> 	if (kvmgt_gfn_is_write_protected(info, gpa_to_gfn(gpa)))
> 		intel_vgpu_page_track_handler(info, gpa,
> 						     (void *)val, len);
>   }
> 
> Acquiring mmu_lock isn't an option as intel_vgpu_page_track_handler() might sleep,
> e.g. when acquiring vgpu_lock.
>
I totally agree with you and actually had the same feeling as you when
examined the code yesterday. But I thought I'd better send this series
first and do the cleanup later :)
And I'm also not sure if a slots_arch_lock is required for
kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
(Though it doesn't matter for a removing slot and we actually needn't to
call kvm_slot_page_track_remove_page() in response to a slot removal,
the two interfaces are still required elsewhere.)


> Let me see if the clean up I have in mind will actually work.  If it does, I think
> the end result will be quite nice for both KVM and KVMGT.
Yes, it would be great.

Thanks
Yan
  
Sean Christopherson Nov. 12, 2022, 12:43 a.m. UTC | #3
On Sat, Nov 12, 2022, Yan Zhao wrote:
> And I'm also not sure if a slots_arch_lock is required for
> kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().

It's not required.  slots_arch_lock protects interaction between memslot updates
mmu_first_shadow_root_alloc().  When CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, then
the mmu_first_shadow_root_alloc() doesn't touch the memslots because everything
is pre-allocated:

bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
{
	return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
	       !tdp_enabled || kvm_shadow_root_allocated(kvm);
}

int kvm_page_track_create_memslot(struct kvm *kvm,
				  struct kvm_memory_slot *slot,
				  unsigned long npages)
{
	if (!kvm_page_track_write_tracking_enabled(kvm)) <== always true
		return 0;

	return __kvm_page_track_write_tracking_alloc(slot, npages);
}

Though now that you point it out, it's tempting to #ifdef out some of those hooks
so that's basically impossible for mmu_first_shadow_root_alloc() to cause problems.
Not sure the extra #ideffery would be worth while though.

slots_arch_lock also protects shadow_root_allocated, but that's a KVM-internal
detail that isn't relevant to the page-tracking machinery when
CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y.
  
Yan Zhao Nov. 14, 2022, 1:05 a.m. UTC | #4
On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> On Sat, Nov 12, 2022, Yan Zhao wrote:
> > And I'm also not sure if a slots_arch_lock is required for
> > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> 
> It's not required.  slots_arch_lock protects interaction between memslot updates
In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
do you know which lock is used to protect it?

Thanks
Yan

> mmu_first_shadow_root_alloc().  When CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, then
> the mmu_first_shadow_root_alloc() doesn't touch the memslots because everything
> is pre-allocated:
> 
> bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
> {
> 	return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
> 	       !tdp_enabled || kvm_shadow_root_allocated(kvm);
> }
> 
> int kvm_page_track_create_memslot(struct kvm *kvm,
> 				  struct kvm_memory_slot *slot,
> 				  unsigned long npages)
> {
> 	if (!kvm_page_track_write_tracking_enabled(kvm)) <== always true
> 		return 0;
> 
> 	return __kvm_page_track_write_tracking_alloc(slot, npages);
> }
> 
> Though now that you point it out, it's tempting to #ifdef out some of those hooks
> so that's basically impossible for mmu_first_shadow_root_alloc() to cause problems.
> Not sure the extra #ideffery would be worth while though.
> 
> slots_arch_lock also protects shadow_root_allocated, but that's a KVM-internal
> detail that isn't relevant to the page-tracking machinery when
> CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y.
  
Sean Christopherson Nov. 14, 2022, 4:32 p.m. UTC | #5
On Mon, Nov 14, 2022, Yan Zhao wrote:
> On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > And I'm also not sure if a slots_arch_lock is required for
> > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> > 
> > It's not required.  slots_arch_lock protects interaction between memslot updates
> In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> do you know which lock is used to protect it?

mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated
protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM
allocates gfn_track for all memslots when shadow paging is activated.

The cleanup series I'm prepping adds lockdep assertions for the relevant paths, e.g. 

$ git grep -B 8 -E "update_gfn_write_track.*;"
arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
arch/x86/kvm/mmu/page_track.c-                         gfn_t gfn)
arch/x86/kvm/mmu/page_track.c-{
arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
arch/x86/kvm/mmu/page_track.c-
arch/x86/kvm/mmu/page_track.c-  if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
arch/x86/kvm/mmu/page_track.c-          return;
arch/x86/kvm/mmu/page_track.c-
arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, 1);
--
arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm,
arch/x86/kvm/mmu/page_track.c-                            struct kvm_memory_slot *slot, gfn_t gfn)
arch/x86/kvm/mmu/page_track.c-{
arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
arch/x86/kvm/mmu/page_track.c-
arch/x86/kvm/mmu/page_track.c-  if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
arch/x86/kvm/mmu/page_track.c-          return;
arch/x86/kvm/mmu/page_track.c-
arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, -1);
  
Yan Zhao Nov. 14, 2022, 10:42 p.m. UTC | #6
On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote:
> On Mon, Nov 14, 2022, Yan Zhao wrote:
> > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > And I'm also not sure if a slots_arch_lock is required for
> > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> > > 
> > > It's not required.  slots_arch_lock protects interaction between memslot updates
> > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > do you know which lock is used to protect it?
> 
> mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated
> protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM
> allocates gfn_track for all memslots when shadow paging is activated.
Hmm, thanks for the reply.
but in direct_page_fault(),
if (page_fault_handle_page_track(vcpu, fault))
	return RET_PF_EMULATE;

slot->arch.gfn_track is read without any mmu_lock is held.
> 
> The cleanup series I'm prepping adds lockdep assertions for the relevant paths, e.g. 
> 
> $ git grep -B 8 -E "update_gfn_write_track.*;"
> arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> arch/x86/kvm/mmu/page_track.c-                         gfn_t gfn)
> arch/x86/kvm/mmu/page_track.c-{
> arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c-  if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
> arch/x86/kvm/mmu/page_track.c-          return;
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, 1);
> --
> arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm,
> arch/x86/kvm/mmu/page_track.c-                            struct kvm_memory_slot *slot, gfn_t gfn)
> arch/x86/kvm/mmu/page_track.c-{
> arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c-  if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
> arch/x86/kvm/mmu/page_track.c-          return;
> arch/x86/kvm/mmu/page_track.c-
> arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, -1);
yes, it will be helpful.

Besides, will WRITE_ONCE or atomic_add in update_gfn_write_track() to
update slot->arch.gfn_track be better?

Thanks
Yan
  
Yan Zhao Nov. 14, 2022, 11:22 p.m. UTC | #7
On Mon, Nov 14, 2022 at 11:24:16PM +0000, Sean Christopherson wrote:
> On Tue, Nov 15, 2022, Yan Zhao wrote:
> > On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote:
> > > On Mon, Nov 14, 2022, Yan Zhao wrote:
> > > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> > > > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > > > And I'm also not sure if a slots_arch_lock is required for
> > > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> > > > > 
> > > > > It's not required.  slots_arch_lock protects interaction between memslot updates
> > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > > > do you know which lock is used to protect it?
> > > 
> > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated
> > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM
> > > allocates gfn_track for all memslots when shadow paging is activated.
> > Hmm, thanks for the reply.
> > but in direct_page_fault(),
> > if (page_fault_handle_page_track(vcpu, fault))
> > 	return RET_PF_EMULATE;
> > 
> > slot->arch.gfn_track is read without any mmu_lock is held.
> 
> That's a fast path that deliberately reads out of mmu_lock.  A false positive
> only results in unnecessary emulation, and any false positive is inherently prone
> to races anyways, e.g. fault racing with zap.
what about false negative?
If the fast path read 0 count, no page track write callback will be called and write
protection will be removed in the slow path.


Thanks
Yan
> 
> > > arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm,
> > > arch/x86/kvm/mmu/page_track.c-                            struct kvm_memory_slot *slot, gfn_t gfn)
> > > arch/x86/kvm/mmu/page_track.c-{
> > > arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
> > > arch/x86/kvm/mmu/page_track.c-
> > > arch/x86/kvm/mmu/page_track.c-  if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
> > > arch/x86/kvm/mmu/page_track.c-          return;
> > > arch/x86/kvm/mmu/page_track.c-
> > > arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, -1);
> > yes, it will be helpful.
> > 
> > Besides, will WRITE_ONCE or atomic_add in update_gfn_write_track() to
> > update slot->arch.gfn_track be better?
> 
> WRITE_ONCE() won't suffice, it needs to be atomic.  Switching to atomic_inc/dec
> isn't worth it so long as KVM's shadow MMU takes mmu_lock for write, i.e. while
> the accounting is mutually exclusive for other reasons in both KVM and KVMGT.
  
Sean Christopherson Nov. 14, 2022, 11:24 p.m. UTC | #8
On Tue, Nov 15, 2022, Yan Zhao wrote:
> On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote:
> > On Mon, Nov 14, 2022, Yan Zhao wrote:
> > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> > > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > > And I'm also not sure if a slots_arch_lock is required for
> > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> > > > 
> > > > It's not required.  slots_arch_lock protects interaction between memslot updates
> > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > > do you know which lock is used to protect it?
> > 
> > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated
> > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM
> > allocates gfn_track for all memslots when shadow paging is activated.
> Hmm, thanks for the reply.
> but in direct_page_fault(),
> if (page_fault_handle_page_track(vcpu, fault))
> 	return RET_PF_EMULATE;
> 
> slot->arch.gfn_track is read without any mmu_lock is held.

That's a fast path that deliberately reads out of mmu_lock.  A false positive
only results in unnecessary emulation, and any false positive is inherently prone
to races anyways, e.g. fault racing with zap.

> > arch/x86/kvm/mmu/page_track.c-void __kvm_write_track_remove_gfn(struct kvm *kvm,
> > arch/x86/kvm/mmu/page_track.c-                            struct kvm_memory_slot *slot, gfn_t gfn)
> > arch/x86/kvm/mmu/page_track.c-{
> > arch/x86/kvm/mmu/page_track.c-  lockdep_assert_held_write(&kvm->mmu_lock);
> > arch/x86/kvm/mmu/page_track.c-
> > arch/x86/kvm/mmu/page_track.c-  if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
> > arch/x86/kvm/mmu/page_track.c-          return;
> > arch/x86/kvm/mmu/page_track.c-
> > arch/x86/kvm/mmu/page_track.c:  update_gfn_write_track(slot, gfn, -1);
> yes, it will be helpful.
> 
> Besides, will WRITE_ONCE or atomic_add in update_gfn_write_track() to
> update slot->arch.gfn_track be better?

WRITE_ONCE() won't suffice, it needs to be atomic.  Switching to atomic_inc/dec
isn't worth it so long as KVM's shadow MMU takes mmu_lock for write, i.e. while
the accounting is mutually exclusive for other reasons in both KVM and KVMGT.
  
Sean Christopherson Nov. 15, 2022, 12:55 a.m. UTC | #9
On Tue, Nov 15, 2022, Yan Zhao wrote:
> On Mon, Nov 14, 2022 at 11:24:16PM +0000, Sean Christopherson wrote:
> > On Tue, Nov 15, 2022, Yan Zhao wrote:
> > > On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote:
> > > > On Mon, Nov 14, 2022, Yan Zhao wrote:
> > > > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> > > > > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > > > > And I'm also not sure if a slots_arch_lock is required for
> > > > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> > > > > > 
> > > > > > It's not required.  slots_arch_lock protects interaction between memslot updates
> > > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> > > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > > > > do you know which lock is used to protect it?
> > > > 
> > > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated
> > > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM
> > > > allocates gfn_track for all memslots when shadow paging is activated.
> > > Hmm, thanks for the reply.
> > > but in direct_page_fault(),
> > > if (page_fault_handle_page_track(vcpu, fault))
> > > 	return RET_PF_EMULATE;
> > > 
> > > slot->arch.gfn_track is read without any mmu_lock is held.
> > 
> > That's a fast path that deliberately reads out of mmu_lock.  A false positive
> > only results in unnecessary emulation, and any false positive is inherently prone
> > to races anyways, e.g. fault racing with zap.
> what about false negative?
> If the fast path read 0 count, no page track write callback will be called and write
> protection will be removed in the slow path.

No.  For a false negative to occur, a different task would have to create a SPTE
and write-protect the GFN _while holding mmu_lock_.  And then after acquiring
mmu_lock, the vCPU that got the false negative would call make_spte(), which would
detect that making the SPTE writable is disallowed due to the GFN being write-protected.

	if (pte_access & ACC_WRITE_MASK) {
		spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask;

		/*
		 * Optimization: for pte sync, if spte was writable the hash
		 * lookup is unnecessary (and expensive). Write protection
		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
		 * Same reasoning can be applied to dirty page accounting.
		 */
		if (is_writable_pte(old_spte))
			goto out;

		/*
		 * Unsync shadow pages that are reachable by the new, writable
		 * SPTE.  Write-protect the SPTE if the page can't be unsync'd,
		 * e.g. it's write-tracked (upper-level SPs) or has one or more
		 * shadow pages and unsync'ing pages is not allowed.
		 */
		if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
			pgprintk("%s: found shadow page for %llx, marking ro\n",
				 __func__, gfn);
			wrprot = true;
			pte_access &= ~ACC_WRITE_MASK;
			spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
		}
	}



int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
			    gfn_t gfn, bool can_unsync, bool prefetch)
{
	struct kvm_mmu_page *sp;
	bool locked = false;

	/*
	 * Force write-protection if the page is being tracked.  Note, the page
	 * track machinery is used to write-protect upper-level shadow pages,
	 * i.e. this guards the role.level == 4K assertion below!
	 */
	if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
		return -EPERM;

	...
}
  
Yan Zhao Nov. 15, 2022, 1:08 a.m. UTC | #10
On Tue, Nov 15, 2022 at 12:55:42AM +0000, Sean Christopherson wrote:
> On Tue, Nov 15, 2022, Yan Zhao wrote:
> > On Mon, Nov 14, 2022 at 11:24:16PM +0000, Sean Christopherson wrote:
> > > On Tue, Nov 15, 2022, Yan Zhao wrote:
> > > > On Mon, Nov 14, 2022 at 04:32:34PM +0000, Sean Christopherson wrote:
> > > > > On Mon, Nov 14, 2022, Yan Zhao wrote:
> > > > > > On Sat, Nov 12, 2022 at 12:43:07AM +0000, Sean Christopherson wrote:
> > > > > > > On Sat, Nov 12, 2022, Yan Zhao wrote:
> > > > > > > > And I'm also not sure if a slots_arch_lock is required for
> > > > > > > > kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page().
> > > > > > > 
> > > > > > > It's not required.  slots_arch_lock protects interaction between memslot updates
> > > > > > In kvm_slot_page_track_add_page() and kvm_slot_page_track_remove_page(),
> > > > > > slot->arch.gfn_track[mode][index] is updated in update_gfn_track(),
> > > > > > do you know which lock is used to protect it?
> > > > > 
> > > > > mmu_lock protects the count, kvm->srcu protects the slot, and shadow_root_allocated
> > > > > protects that validity of gfn_track, i.e. shadow_root_allocated ensures that KVM
> > > > > allocates gfn_track for all memslots when shadow paging is activated.
> > > > Hmm, thanks for the reply.
> > > > but in direct_page_fault(),
> > > > if (page_fault_handle_page_track(vcpu, fault))
> > > > 	return RET_PF_EMULATE;
> > > > 
> > > > slot->arch.gfn_track is read without any mmu_lock is held.
> > > 
> > > That's a fast path that deliberately reads out of mmu_lock.  A false positive
> > > only results in unnecessary emulation, and any false positive is inherently prone
> > > to races anyways, e.g. fault racing with zap.
> > what about false negative?
> > If the fast path read 0 count, no page track write callback will be called and write
> > protection will be removed in the slow path.
> 
> No.  For a false negative to occur, a different task would have to create a SPTE
> and write-protect the GFN _while holding mmu_lock_.  And then after acquiring
> mmu_lock, the vCPU that got the false negative would call make_spte(), which would
> detect that making the SPTE writable is disallowed due to the GFN being write-protected.
> 
> 	if (pte_access & ACC_WRITE_MASK) {
> 		spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask;
> 
> 		/*
> 		 * Optimization: for pte sync, if spte was writable the hash
> 		 * lookup is unnecessary (and expensive). Write protection
> 		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
> 		 * Same reasoning can be applied to dirty page accounting.
> 		 */
> 		if (is_writable_pte(old_spte))
> 			goto out;
> 
> 		/*
> 		 * Unsync shadow pages that are reachable by the new, writable
> 		 * SPTE.  Write-protect the SPTE if the page can't be unsync'd,
> 		 * e.g. it's write-tracked (upper-level SPs) or has one or more
> 		 * shadow pages and unsync'ing pages is not allowed.
> 		 */
> 		if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
> 			pgprintk("%s: found shadow page for %llx, marking ro\n",
> 				 __func__, gfn);
> 			wrprot = true;
> 			pte_access &= ~ACC_WRITE_MASK;
> 			spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
> 		}
> 	}
> 
> 
> 
> int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
> 			    gfn_t gfn, bool can_unsync, bool prefetch)
> {
> 	struct kvm_mmu_page *sp;
> 	bool locked = false;
> 
> 	/*
> 	 * Force write-protection if the page is being tracked.  Note, the page
> 	 * track machinery is used to write-protect upper-level shadow pages,
> 	 * i.e. this guards the role.level == 4K assertion below!
> 	 */
> 	if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
> 		return -EPERM;
> 
> 	...
> }

Oh, you are right! I thought mmu_try_to_unsync_pages() is only for the
shadow mmu, and overlooked that TDP MMU will also go into it.

Thanks for the detailed explanation.

Thanks
Yan
  

Patch

diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..046b024d1813 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -44,6 +44,16 @@  struct kvm_page_track_notifier_node {
 	 */
 	void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
 			    struct kvm_page_track_notifier_node *node);
+	/*
+	 * It is called when memory slot is moved or removed
+	 * users can drop write-protection for the pages in that memory slot
+	 *
+	 * @kvm: the kvm where memory slot being moved or removed
+	 * @slot: the memory slot being moved or removed
+	 * @node: this node
+	 */
+	void (*track_remove_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
+				  struct kvm_page_track_notifier_node *node);
 };
 
 int kvm_page_track_init(struct kvm *kvm);
@@ -76,4 +86,5 @@  kvm_page_track_unregister_notifier(struct kvm *kvm,
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
 			  int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f..4d6bab1d61c9 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -300,3 +300,29 @@  void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
 			n->track_flush_slot(kvm, slot, n);
 	srcu_read_unlock(&head->track_srcu, idx);
 }
+
+/*
+ * Notify the node that memory slot is removed or moved so that it can
+ * drop write-protection for the pages in the memory slot.
+ *
+ * The node should figure out it has any write-protected pages in this slot
+ * by itself.
+ */
+void kvm_page_track_remove_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	struct kvm_page_track_notifier_head *head;
+	struct kvm_page_track_notifier_node *n;
+	int idx;
+
+	head = &kvm->arch.track_notifier_head;
+
+	if (hlist_empty(&head->track_notifier_list))
+		return;
+
+	idx = srcu_read_lock(&head->track_srcu);
+	hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+				srcu_read_lock_held(&head->track_srcu))
+		if (n->track_remove_slot)
+			n->track_remove_slot(kvm, slot, n);
+	srcu_read_unlock(&head->track_srcu, idx);
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 916ebbc81e52..a24a4a2ad1a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12844,6 +12844,9 @@  void kvm_arch_commit_memory_region(struct kvm *kvm,
 				const struct kvm_memory_slot *new,
 				enum kvm_mr_change change)
 {
+	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
+		kvm_page_track_remove_slot(kvm, old);
+
 	if (!kvm->arch.n_requested_mmu_pages &&
 	    (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
 		unsigned long nr_mmu_pages;