KVM: x86/mmu: Make optimized __handle_changed_spte() for clear dirty log

Message ID 20230125213857.824959-1-vipinsh@google.com
State New
Series: KVM: x86/mmu: Make optimized __handle_changed_spte() for clear dirty log

Commit Message

Vipin Sharma Jan. 25, 2023, 9:38 p.m. UTC
Use a toned-down version of __handle_changed_spte() when clearing the dirty
log. Remove checks which are not needed when dirty logs are cleared.

This change shows a ~13% improvement in clear dirty log calls in
dirty_log_perf_test.

Before the toned-down version:
Clear dirty log over 3 iterations took 10.006764203s. (Avg 3.335588067s/iteration)

After the toned-down version:
Clear dirty log over 3 iterations took 8.686433554s. (Avg 2.895477851s/iteration)

Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)
  

Comments

David Matlack Jan. 25, 2023, 10 p.m. UTC | #1
On Wed, Jan 25, 2023 at 01:38:57PM -0800, Vipin Sharma wrote:
> Use a tone down version of __handle_changed_spte() when clearing dirty
> log. Remove checks which will not be needed when dirty logs are cleared.
> 
> This change shows ~13% improvement in clear dirty log calls in
> dirty_log_perf_test
> 
> Before tone down version:
> Clear dirty log over 3 iterations took 10.006764203s. (Avg 3.335588067s/iteration)
> 
> After tone down version:
> Clear dirty log over 3 iterations took 8.686433554s. (Avg 2.895477851s/iteration)
> 
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bba33aea0fb0..ca21b33c4386 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -504,6 +504,19 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
>  
> +static void handle_changed_spte_clear_dirty_log(int as_id, gfn_t gfn,
> +						u64 old_spte, u64 new_spte,
> +						int level)
> +{
> +	if (old_spte == new_spte)
> +		return;
> +
> +	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> +
> +	if (is_dirty_spte(old_spte) &&  !is_dirty_spte(new_spte))
> +		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> +}
> +
>  /**
>   * __handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
> @@ -736,7 +749,12 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  
>  	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
>  
> -	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> +	if (record_dirty_log)
> +		__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte,
> +				      level, false);
> +	else
> +		handle_changed_spte_clear_dirty_log(as_id, gfn, old_spte,
> +						    new_spte, level);

I find it very non-intuitive to tie this behavior to !record_dirty_log,
even though it happens to work. It's also risky if any future calls are
added that pass record_dirty_log=false but change other bits in the
SPTE.

I wonder if we could get the same effect by optimizing
__handle_changed_spte() to check for a cleared D-bit *first* and if
that's the only diff between the old and new SPTE, bail immediately
rather than doing all the other checks.
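
For illustration, such an early-out at the top of __handle_changed_spte() might
look roughly like this (a sketch only, assuming A/D bits are enabled for the
SPTE so the dirty state lives in the D-bit):

	if (spte_ad_enabled(old_spte) && is_dirty_spte(old_spte) &&
	    (old_spte ^ new_spte) == shadow_dirty_mask) {
		/* Only the D-bit was cleared; do the minimal bookkeeping. */
		trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
		return;
	}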
  
Ben Gardon Jan. 25, 2023, 10:25 p.m. UTC | #2
On Wed, Jan 25, 2023 at 2:00 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Jan 25, 2023 at 01:38:57PM -0800, Vipin Sharma wrote:
> > Use a tone down version of __handle_changed_spte() when clearing dirty
> > log. Remove checks which will not be needed when dirty logs are cleared.
> >
> > This change shows ~13% improvement in clear dirty log calls in
> > dirty_log_perf_test
> >
> > Before tone down version:
> > Clear dirty log over 3 iterations took 10.006764203s. (Avg 3.335588067s/iteration)
> >
> > After tone down version:
> > Clear dirty log over 3 iterations took 8.686433554s. (Avg 2.895477851s/iteration)
> >
> > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 20 +++++++++++++++++++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index bba33aea0fb0..ca21b33c4386 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -504,6 +504,19 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> >       call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> >  }
> >
> > +static void handle_changed_spte_clear_dirty_log(int as_id, gfn_t gfn,
> > +                                             u64 old_spte, u64 new_spte,
> > +                                             int level)
> > +{
> > +     if (old_spte == new_spte)
> > +             return;
> > +
> > +     trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> > +
> > +     if (is_dirty_spte(old_spte) &&  !is_dirty_spte(new_spte))
> > +             kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > +}
> > +
> >  /**
> >   * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> >   * @kvm: kvm instance
> > @@ -736,7 +749,12 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> >
> >       old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> >
> > -     __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > +     if (record_dirty_log)
> > +             __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte,
> > +                                   level, false);
> > +     else
> > +             handle_changed_spte_clear_dirty_log(as_id, gfn, old_spte,
> > +                                                 new_spte, level);
>
> I find it very non-intuitive to tie this behavior to !record_dirty_log,
> even though it happens to work. It's also risky if any future calls are
> added that pass record_dirty_log=false but change other bits in the
> SPTE.
>
> I wonder if we could get the same effect by optimizing
> __handle_changed_spte() to check for a cleared D-bit *first* and if
> that's the only diff between the old and new SPTE, bail immediately
> rather than doing all the other checks.

I would also prefer that course. One big lesson I took when building
the TDP MMU was the value of having all the changed PTE handling go
through one function. That's not always the right choice but the
Shadow MMU has crazy spaghetti code, with different functions handling
different parts of responding to a PTE change, and it makes the code very
hard to read. I agree this path is worth optimizing, but the
more we can keep the code together, the better.
  
Vipin Sharma Jan. 26, 2023, 12:40 a.m. UTC | #3
On Wed, Jan 25, 2023 at 2:25 PM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Jan 25, 2023 at 2:00 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, Jan 25, 2023 at 01:38:57PM -0800, Vipin Sharma wrote:
> > > @@ -736,7 +749,12 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > >
> > >       old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > >
> > > -     __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > > +     if (record_dirty_log)
> > > +             __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte,
> > > +                                   level, false);
> > > +     else
> > > +             handle_changed_spte_clear_dirty_log(as_id, gfn, old_spte,
> > > +                                                 new_spte, level);
> >
> > I find it very non-intuitive to tie this behavior to !record_dirty_log,
> > even though it happens to work. It's also risky if any future calls are
> > added that pass record_dirty_log=false but change other bits in the
> > SPTE.
> >
> > I wonder if we could get the same effect by optimizing
> > __handle_changed_spte() to check for a cleared D-bit *first* and if
> > that's the only diff between the old and new SPTE, bail immediately
> > rather than doing all the other checks.
>
> I would also prefer that course. One big lesson I took when building
> the TDP MMU was the value of having all the changed PTE handling go
> through one function. That's not always the right choice but the
> Shadow MMU has a crazy spaghetti code of different functions which
> handle different parts of responding to a PTE change and it makes the
> code very hard to read. I agree this path is worth optimizing, but the
> more we can keep the code together, the better.

Let me see if I am able to get the same improvement in __handle_changed_spte().
  
Sean Christopherson Jan. 26, 2023, 1:49 a.m. UTC | #4
On Wed, Jan 25, 2023, Ben Gardon wrote:
> On Wed, Jan 25, 2023 at 2:00 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Wed, Jan 25, 2023 at 01:38:57PM -0800, Vipin Sharma wrote:
> > > Use a tone down version of __handle_changed_spte() when clearing dirty
> > > log. Remove checks which will not be needed when dirty logs are cleared.
> > >
> > > This change shows ~13% improvement in clear dirty log calls in
> > > dirty_log_perf_test
> > >
> > > Before tone down version:
> > > Clear dirty log over 3 iterations took 10.006764203s. (Avg 3.335588067s/iteration)
> > >
> > > After tone down version:
> > > Clear dirty log over 3 iterations took 8.686433554s. (Avg 2.895477851s/iteration)
> > >
> > > Signed-off-by: Vipin Sharma <vipinsh@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 20 +++++++++++++++++++-
> > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index bba33aea0fb0..ca21b33c4386 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -504,6 +504,19 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> > >       call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> > >  }
> > >
> > > +static void handle_changed_spte_clear_dirty_log(int as_id, gfn_t gfn,
> > > +                                             u64 old_spte, u64 new_spte,
> > > +                                             int level)
> > > +{
> > > +     if (old_spte == new_spte)
> > > +             return;
> > > +
> > > +     trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> > > +
> > > +     if (is_dirty_spte(old_spte) &&  !is_dirty_spte(new_spte))
> > > +             kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > > +}
> > > +
> > >  /**
> > >   * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> > >   * @kvm: kvm instance
> > > @@ -736,7 +749,12 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > >
> > >       old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > >
> > > -     __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > > +     if (record_dirty_log)
> > > +             __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte,
> > > +                                   level, false);
> > > +     else
> > > +             handle_changed_spte_clear_dirty_log(as_id, gfn, old_spte,
> > > +                                                 new_spte, level);
> >
> > I find it very non-intuitive to tie this behavior to !record_dirty_log,
> > even though it happens to work. It's also risky if any future calls are
> > added that pass record_dirty_log=false but change other bits in the
> > SPTE.
> >
> > I wonder if we could get the same effect by optimizing
> > __handle_changed_spte() to check for a cleared D-bit *first* and if
> > that's the only diff between the old and new SPTE, bail immediately
> > rather than doing all the other checks.
> 
> I would also prefer that course. One big lesson I took when building
> the TDP MMU was the value of having all the changed PTE handling go
> through one function. That's not always the right choice but the
> Shadow MMU has a crazy spaghetti code of different functions which
> handle different parts of responding to a PTE change and it makes the
> code very hard to read. I agree this path is worth optimizing, but the
> more we can keep the code together, the better.

Hrm, not sure I agree on that last point.  I find record_dirty_log and record_acc_track
to be eye sores and terribly confusing (I always forget when they're true/false).
I think we would end up with cleaner code overall if we special case clearing A/D
bits (or their write-protected/access-protect counterparts).

handle_changed_spte_dirty_log() takes effect if and only if a SPTE becomes newly
writable, and that should never happen when aging gfns (record_acc_track=%false),
i.e. neither the "age gfns" nor the "clear dirty" paths need to call
handle_changed_spte_dirty_log(), which means @record_dirty_log is superfluous.

Similarly, clearing dirty bits should never clear the accessed bit, nor should it
toggle PRESENT or change the PFN, i.e. neither path needs to call
handle_changed_spte_acc_track(), which means @record_acc_track is superfluous too.

If we jettison those, then AFAICT the only remaining heuristic is that
tdp_mmu_set_spte_atomic() doesn't update the dirty bitmaps (the comment about
that behavior is unhelpful and doesn't explain _why_).  That heuristic is easy to
handle by looking at @shared.

Looking through all the other stuff in __handle_changed_spte()...

I'm 100% comfortable skipping these sanity checks:

	if (was_leaf && is_leaf && pfn_changed)
		BUG();

	if (is_leaf)
		check_spte_writable_invariants(new_spte);

	if (!was_present && !is_present)
		WARN_ON(!MMIO && !REMOVED);

And none of these are relevant (again assuming we don't have an egregious bug)
except for the kvm_set_pfn_dirty() case, which is trivial to handle.

	if (is_leaf != was_leaf)
		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);

	if (was_leaf && is_dirty_spte(old_spte) &&
	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
		kvm_set_pfn_dirty(spte_to_pfn(old_spte));

	/*
	 * Recursively handle child PTs if the change removed a subtree from
	 * the paging structure.  Note the WARN on the PFN changing without the
	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
	 * pages are kernel allocations and should never be migrated.
	 */
	if (was_present && !was_leaf &&
	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);


So unless I'm missing something, I think we can end up with the below.  Compile
tested only...

---
 arch/x86/kvm/mmu/tdp_mmu.c | 92 ++++++++++----------------------------
 1 file changed, 24 insertions(+), 68 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bba33aea0fb0..2f78ca43a276 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -505,7 +505,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 }
 
 /**
- * __handle_changed_spte - handle bookkeeping associated with an SPTE change
+ * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
  * @as_id: the address space of the paging structure the SPTE was a part of
  * @gfn: the base GFN that was mapped by the SPTE
@@ -519,9 +519,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
  * Handle bookkeeping that might result from the modification of a SPTE.
  * This function must be called for all TDP SPTE modifications.
  */
-static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				  u64 old_spte, u64 new_spte, int level,
-				  bool shared)
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+				u64 old_spte, u64 new_spte, int level,
+				bool shared)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
@@ -605,23 +605,18 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (was_present && !was_leaf &&
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-}
 
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared)
-{
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
-			      shared);
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
-	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
-				      new_spte, level);
+
+	/* COMMENT GOES HERE. */
+	if (!shared)
+		handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
+					      new_spte, level);
 }
 
 /*
  * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
- * and handle the associated bookkeeping.  Do not mark the page dirty
- * in KVM's dirty bitmaps.
+ * and handle the associated bookkeeping.
  *
  * If setting the SPTE fails because it has changed, iter->old_spte will be
  * refreshed to the current value of the spte.
@@ -658,9 +653,8 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
 		return -EBUSY;
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, true);
-	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
+	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
+			    new_spte, iter->level, true);
 
 	return 0;
 }
@@ -705,23 +699,12 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
  * @new_spte:	      The new value that will be set for the SPTE
  * @gfn:	      The base GFN that was (or will be) mapped by the SPTE
  * @level:	      The level _containing_ the SPTE (its parent PT's level)
- * @record_acc_track: Notify the MM subsystem of changes to the accessed state
- *		      of the page. Should be set unless handling an MMU
- *		      notifier for access tracking. Leaving record_acc_track
- *		      unset in that case prevents page accesses from being
- *		      double counted.
- * @record_dirty_log: Record the page as dirty in the dirty bitmap if
- *		      appropriate for the change being made. Should be set
- *		      unless performing certain dirty logging operations.
- *		      Leaving record_dirty_log unset in that case prevents page
- *		      writes from being double counted.
  *
  * Returns the old SPTE value, which _may_ be different than @old_spte if the
  * SPTE had voldatile bits.
  */
 static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
-			      u64 old_spte, u64 new_spte, gfn_t gfn, int level,
-			      bool record_acc_track, bool record_dirty_log)
+			      u64 old_spte, u64 new_spte, gfn_t gfn, int level)
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
@@ -736,46 +719,18 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
-
-	if (record_acc_track)
-		handle_changed_spte_acc_track(old_spte, new_spte, level);
-	if (record_dirty_log)
-		handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
-					      new_spte, level);
+	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
 	return old_spte;
 }
 
-static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
-				     u64 new_spte, bool record_acc_track,
-				     bool record_dirty_log)
+static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+				     u64 new_spte)
 {
 	WARN_ON_ONCE(iter->yielded);
 
 	iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
 					    iter->old_spte, new_spte,
-					    iter->gfn, iter->level,
-					    record_acc_track, record_dirty_log);
-}
-
-static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
-				    u64 new_spte)
-{
-	_tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
-}
-
-static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
-						 struct tdp_iter *iter,
-						 u64 new_spte)
-{
-	_tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
-}
-
-static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
-						 struct tdp_iter *iter,
-						 u64 new_spte)
-{
-	_tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
+					    iter->gfn, iter->level);
 }
 
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
@@ -925,7 +880,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 		return false;
 
 	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
-			   sp->gfn, sp->role.level + 1, true, true);
+			   sp->gfn, sp->role.level + 1);
 
 	return true;
 }
@@ -1289,8 +1244,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 		new_spte = mark_spte_for_access_track(new_spte);
 	}
 
-	tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
-
+	kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);
 	return true;
 }
 
@@ -1326,7 +1280,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	 * Note, when changing a read-only SPTE, it's not strictly necessary to
 	 * zero the SPTE before setting the new PFN, but doing so preserves the
 	 * invariant that the PFN of a present * leaf SPTE can never change.
-	 * See __handle_changed_spte().
+	 * See handle_changed_spte().
 	 */
 	tdp_mmu_set_spte(kvm, iter, 0);
 
@@ -1351,7 +1305,7 @@ bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	/*
 	 * No need to handle the remote TLB flush under RCU protection, the
 	 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
-	 * shadow page.  See the WARN on pfn_changed in __handle_changed_spte().
+	 * shadow page.  See the WARN on pfn_changed in handle_changed_spte().
 	 */
 	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
 }
@@ -1703,9 +1657,11 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 				new_spte = iter.old_spte & ~shadow_dirty_mask;
 			else
 				continue;
+
+			kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
 		}
 
-		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+		kvm_tdp_mmu_write_spte(iter.sptep, iter.old_spte, new_spte, iter.level);
 	}
 
 	rcu_read_unlock();

base-commit: f15a87c006901e02727bf8ac75b0251cdf8e0ecc
--
  
Vipin Sharma Jan. 28, 2023, 5:20 p.m. UTC | #5
On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jan 25, 2023, Ben Gardon wrote:
> > On Wed, Jan 25, 2023 at 2:00 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Wed, Jan 25, 2023 at 01:38:57PM -0800, Vipin Sharma wrote:
> > > > @@ -736,7 +749,12 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > >
> > > >       old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > > >
> > > > -     __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > > > +     if (record_dirty_log)
> > > > +             __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte,
> > > > +                                   level, false);
> > > > +     else
> > > > +             handle_changed_spte_clear_dirty_log(as_id, gfn, old_spte,
> > > > +                                                 new_spte, level);
> > >
> > > I find it very non-intuitive to tie this behavior to !record_dirty_log,
> > > even though it happens to work. It's also risky if any future calls are
> > > added that pass record_dirty_log=false but change other bits in the
> > > SPTE.
> > >
> > > I wonder if we could get the same effect by optimizing
> > > __handle_changed_spte() to check for a cleared D-bit *first* and if
> > > that's the only diff between the old and new SPTE, bail immediately
> > > rather than doing all the other checks.
> >
> > I would also prefer that course. One big lesson I took when building
> > the TDP MMU was the value of having all the changed PTE handling go
> > through one function. That's not always the right choice but the
> > Shadow MMU has a crazy spaghetti code of different functions which
> > handle different parts of responding to a PTE change and it makes the
> > code very hard to read. I agree this path is worth optimizing, but the
> > more we can keep the code together, the better.
>
> Hrm, not sure I agree on that last point.  I find record_dirty_log and record_acc_track
> to be eye sores and terribly confusing (I always forget when they're true/false).
> I think we would end up with cleaner code overall if we special case clearing A/D
> bits (or their write-protected/access-protect counterparts).
>
> handle_changed_spte_dirty_log() takes effect if and only a SPTE becomes newly
> writable, and that should never happen when aging gfns (record_acc_track=%false),
> i.e. neither the "age gfns" nor the "clear dirty" paths need to call
> handle_changed_spte_dirty_log(), which means @record_dirty_log is superfluous.
>
> Similarly, clearing dirty bits should never clear the accessed bit, nor should it
> toggle PRESENT or change the PFN, i.e. neither path needs to call
> handle_changed_spte_acc_track(), which means @record_acc_track is superfluous too.
>
> If we jettison those, then AFAICT the only remaining heuristic is that
> tdp_mmu_set_spte_atomic() doesn't update the dirty bitmaps (the comment about
> that behavior is unhelpful and doesn't explain _why_).  That heuristic is easy to
> handled by looking at @shared.

Looking at the git history and current usage,
tdp_mmu_set_spte_atomic() is used for
1. clear_dirty_gfn_range() for enabling dirty log in PML
2. wrprot_gfn_range() for enabling dirty logging for wrprot and PML
3. zapping pages
4. splitting hugepages

None of the above will make a page dirty.

>
> Looking through all the other stuff in __handle_changed_spte()...
>
> I'm 100% comfortable skipping these sanity checks:
>
>         if (was_leaf && is_leaf && pfn_changed)
>                 BUG();
>
>         if (is_leaf)
>                 check_spte_writable_invariants(new_spte);
>
>         if (!was_present && !is_present)
>                 WARN_ON(!MMIO && !REMOVED);
>
> And none of these are relevant (again assuming we don't have an egregious bug)
> except for the kvm_set_pfn_dirty() case, which is trivial to handle.
>
>         if (is_leaf != was_leaf)
>                 kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
>
>         if (was_leaf && is_dirty_spte(old_spte) &&
>             (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
>                 kvm_set_pfn_dirty(spte_to_pfn(old_spte));
>
>         /*
>          * Recursively handle child PTs if the change removed a subtree from
>          * the paging structure.  Note the WARN on the PFN changing without the
>          * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
>          * pages are kernel allocations and should never be migrated.
>          */
>         if (was_present && !was_leaf &&
>             (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
>                 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
>
>
> So unless I'm missing something, I think we can end up with the below.  Compile
> tested only...
>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 92 ++++++++++----------------------------
>  1 file changed, 24 insertions(+), 68 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bba33aea0fb0..2f78ca43a276 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -505,7 +505,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  }
>
>  /**
> - * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> + * handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
>   * @as_id: the address space of the paging structure the SPTE was a part of
>   * @gfn: the base GFN that was mapped by the SPTE
> @@ -519,9 +519,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>   * Handle bookkeeping that might result from the modification of a SPTE.
>   * This function must be called for all TDP SPTE modifications.
>   */
> -static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                                 u64 old_spte, u64 new_spte, int level,
> -                                 bool shared)
> +static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> +                               u64 old_spte, u64 new_spte, int level,
> +                               bool shared)
>  {
>         bool was_present = is_shadow_present_pte(old_spte);
>         bool is_present = is_shadow_present_pte(new_spte);
> @@ -605,23 +605,18 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>         if (was_present && !was_leaf &&
>             (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
>                 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> -}
>
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                               u64 old_spte, u64 new_spte, int level,
> -                               bool shared)
> -{
> -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> -                             shared);
>         handle_changed_spte_acc_track(old_spte, new_spte, level);
> -       handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> -                                     new_spte, level);
> +
> +       /* COMMENT GOES HERE. */

Current "shared" callers are not making a page dirty. If a new
"shared" caller makes a page dirty then make sure
handle_changed_spte_dirty_log is called.

How is this?

> +       if (!shared)
> +               handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> +                                             new_spte, level);
>  }
>
>  /*
>   * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
> - * and handle the associated bookkeeping.  Do not mark the page dirty
> - * in KVM's dirty bitmaps.
> + * and handle the associated bookkeeping.
>   *
>   * If setting the SPTE fails because it has changed, iter->old_spte will be
>   * refreshed to the current value of the spte.
> @@ -658,9 +653,8 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
>         if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
>                 return -EBUSY;
>
> -       __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -                             new_spte, iter->level, true);
> -       handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
> +       handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> +                           new_spte, iter->level, true);
>
>         return 0;
>  }
> @@ -705,23 +699,12 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>   * @new_spte:        The new value that will be set for the SPTE
>   * @gfn:             The base GFN that was (or will be) mapped by the SPTE
>   * @level:           The level _containing_ the SPTE (its parent PT's level)
> - * @record_acc_track: Notify the MM subsystem of changes to the accessed state
> - *                   of the page. Should be set unless handling an MMU
> - *                   notifier for access tracking. Leaving record_acc_track
> - *                   unset in that case prevents page accesses from being
> - *                   double counted.
> - * @record_dirty_log: Record the page as dirty in the dirty bitmap if
> - *                   appropriate for the change being made. Should be set
> - *                   unless performing certain dirty logging operations.
> - *                   Leaving record_dirty_log unset in that case prevents page
> - *                   writes from being double counted.
>   *
>   * Returns the old SPTE value, which _may_ be different than @old_spte if the
>   * SPTE had voldatile bits.
>   */
>  static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> -                             u64 old_spte, u64 new_spte, gfn_t gfn, int level,
> -                             bool record_acc_track, bool record_dirty_log)
> +                             u64 old_spte, u64 new_spte, gfn_t gfn, int level)
>  {
>         lockdep_assert_held_write(&kvm->mmu_lock);
>
> @@ -736,46 +719,18 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>
>         old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
>
> -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> -
> -       if (record_acc_track)
> -               handle_changed_spte_acc_track(old_spte, new_spte, level);
> -       if (record_dirty_log)
> -               handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> -                                             new_spte, level);
> +       handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
>         return old_spte;
>  }
>
> -static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> -                                    u64 new_spte, bool record_acc_track,
> -                                    bool record_dirty_log)
> +static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> +                                    u64 new_spte)
>  {
>         WARN_ON_ONCE(iter->yielded);
>
>         iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
>                                             iter->old_spte, new_spte,
> -                                           iter->gfn, iter->level,
> -                                           record_acc_track, record_dirty_log);
> -}
> -
> -static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> -                                   u64 new_spte)
> -{
> -       _tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
> -}
> -
> -static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
> -                                                struct tdp_iter *iter,
> -                                                u64 new_spte)
> -{
> -       _tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
> -}
> -
> -static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
> -                                                struct tdp_iter *iter,
> -                                                u64 new_spte)
> -{
> -       _tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
> +                                           iter->gfn, iter->level);
>  }
>
>  #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
> @@ -925,7 +880,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>                 return false;
>
>         __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> -                          sp->gfn, sp->role.level + 1, true, true);
> +                          sp->gfn, sp->role.level + 1);
>
>         return true;
>  }
> @@ -1289,8 +1244,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
>                 new_spte = mark_spte_for_access_track(new_spte);
>         }
>
> -       tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
> -
> +       kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);
>         return true;
>  }
>
> @@ -1326,7 +1280,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>          * Note, when changing a read-only SPTE, it's not strictly necessary to
>          * zero the SPTE before setting the new PFN, but doing so preserves the
>          * invariant that the PFN of a present * leaf SPTE can never change.
> -        * See __handle_changed_spte().
> +        * See handle_changed_spte().
>          */
>         tdp_mmu_set_spte(kvm, iter, 0);
>
> @@ -1351,7 +1305,7 @@ bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>         /*
>          * No need to handle the remote TLB flush under RCU protection, the
>          * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
> -        * shadow page.  See the WARN on pfn_changed in __handle_changed_spte().
> +        * shadow page.  See the WARN on pfn_changed in handle_changed_spte().
>          */
>         return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
>  }
> @@ -1703,9 +1657,11 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
>                                 new_spte = iter.old_spte & ~shadow_dirty_mask;
>                         else
>                                 continue;
> +
> +                       kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
>                 }
>

Shouldn't we handle spte_ad_need_write_protect(iter.old_spte) separately, and
if it returns true, call kvm_set_pfn_dirty() when clearing PT_WRITABLE_MASK?
My understanding is that spte_ad_need_write_protect() will return true for
nested VM SPTEs when PML mode is enabled.
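
In other words, something along these lines in clear_dirty_pt_masked(), where
the write-protect branch also marks the pfn dirty (a sketch of the question,
not the posted diff):

	if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
		if (!is_writable_pte(iter.old_spte))
			continue;
		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
	} else {
		if (!(iter.old_spte & shadow_dirty_mask))
			continue;
		new_spte = iter.old_spte & ~shadow_dirty_mask;
		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
	}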

> -               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
> +               kvm_tdp_mmu_write_spte(iter.sptep, iter.old_spte, new_spte, iter.level);
>         }
>
>         rcu_read_unlock();
>
> base-commit: f15a87c006901e02727bf8ac75b0251cdf8e0ecc
> --
>
  
Sean Christopherson Jan. 30, 2023, 6:09 p.m. UTC | #6
On Sat, Jan 28, 2023, Vipin Sharma wrote:
> On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
> > -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > -                               u64 old_spte, u64 new_spte, int level,
> > -                               bool shared)
> > -{
> > -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > -                             shared);
> >         handle_changed_spte_acc_track(old_spte, new_spte, level);
> > -       handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> > -                                     new_spte, level);
> > +
> > +       /* COMMENT GOES HERE. */
> 
> Current "shared" callers are not making a page dirty. If a new
> "shared" caller makes a page dirty then make sure
> handle_changed_spte_dirty_log is called.
> 
> How is this?

I was hoping for a more definitive "rule" than "KVM doesn't currently do XYZ".

> > +       if (!shared)
> > +               handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> > +                                             new_spte, level);
> >  }
> >
> >  /*
> >   * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
> > - * and handle the associated bookkeeping.  Do not mark the page dirty
> > - * in KVM's dirty bitmaps.
> > + * and handle the associated bookkeeping.
> >   *
> >   * If setting the SPTE fails because it has changed, iter->old_spte will be
> >   * refreshed to the current value of the spte.

...

> > @@ -1703,9 +1657,11 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
> >                                 new_spte = iter.old_spte & ~shadow_dirty_mask;
> >                         else
> >                                 continue;
> > +
> > +                       kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
> >                 }
> >
> 
> Shouldn't we handle spte_ad_need_write_protect(iter.old_spte)
> separately and if this function returns true then on clearing
> PT_WRITABLE_MASK, kvm_set_pfn_dirty be called?
> My understanding is that the spte_ad_need_write_protect() will return
> true for nested VM sptes when PML mode is enabled.

Ah rats.  I missed that is_dirty_spte() checks WRITABLE in that case.  So yeah,
kvm_set_pfn_dirty() should be called in both paths.  I was thinking KVM would mark
the page dirty when faulting the PFN for write, but I have my flows all mixed up.
  
Vipin Sharma Jan. 30, 2023, 8:17 p.m. UTC | #7
On Mon, Jan 30, 2023 at 10:09 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sat, Jan 28, 2023, Vipin Sharma wrote:
> > On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
> > > -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > > -                               u64 old_spte, u64 new_spte, int level,
> > > -                               bool shared)
> > > -{
> > > -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > > -                             shared);
> > >         handle_changed_spte_acc_track(old_spte, new_spte, level);
> > > -       handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> > > -                                     new_spte, level);
> > > +
> > > +       /* COMMENT GOES HERE. */
> >
> > Current "shared" callers are not making a page dirty. If a new
> > "shared" caller makes a page dirty then make sure
> > handle_changed_spte_dirty_log is called.
> >
> > How is this?
>
> I was hoping for a more definitive "rule" than "KVM doesn't currently do XYZ".
>
> > > +       if (!shared)
> > > +               handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> > > +                                             new_spte, level);
> > >  }
> > >

What if the implementation is changed a little more? I can think of two options:

Option 1:
Remove handle_changed_spte_dirty_log() and let the callers handle
mark_page_dirty_in_slot(), similar to how fast page faults do it. This
gets rid of the "shared" variable and of having to define its rules for
the shared and non-shared flows.

Option 2:
Change the meaning of this variable from "shared" to something like
"handle_dirty_log", so callers know whether they want the dirty log to be
handled or not.

I prefer option 1.
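
For option 1, the callers that actually make a SPTE writable would mark the
slot dirty themselves, roughly the way the fast page fault path does (a sketch,
assuming the caller has the fault's slot and gfn at hand):

	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);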
  
David Matlack Jan. 30, 2023, 10:12 p.m. UTC | #8
On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
[...]
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 92 ++++++++++----------------------------
>  1 file changed, 24 insertions(+), 68 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bba33aea0fb0..2f78ca43a276 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
[...]
> @@ -1289,8 +1244,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
>                 new_spte = mark_spte_for_access_track(new_spte);
>         }
>
> -       tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
> -
> +       kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);

This can race with fast_page_fault() setting the W-bit and the CPU
setting the D-bit. i.e. This call to kvm_tdp_mmu_write_spte() could
clear the W-bit or D-bit.
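
To spell out the window (an illustrative interleaving, not code from the thread):

	/*
	 * aging / clear-dirty path          vCPU or hardware
	 * ------------------------          ----------------
	 * old_spte = READ_ONCE(*sptep);
	 *                                   fast_page_fault() sets the W-bit,
	 *                                   or the CPU sets the D-bit
	 * write new_spte computed from
	 * the stale old_spte                <- the just-set W/D bit is lost
	 */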

> @@ -1703,9 +1657,11 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
>                                 new_spte = iter.old_spte & ~shadow_dirty_mask;
>                         else
>                                 continue;
> +
> +                       kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
>                 }
>
> -               tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
> +               kvm_tdp_mmu_write_spte(iter.sptep, iter.old_spte, new_spte, iter.level);

Similar issue here. This call to kvm_tdp_mmu_write_spte() could clear
the A-bit or mark the SPTE for access-tracking.

> --
>
  
Sean Christopherson Jan. 30, 2023, 10:49 p.m. UTC | #9
On Mon, Jan 30, 2023, Vipin Sharma wrote:
> On Mon, Jan 30, 2023 at 10:09 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Sat, Jan 28, 2023, Vipin Sharma wrote:
> > > On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > > > -                               u64 old_spte, u64 new_spte, int level,
> > > > -                               bool shared)
> > > > -{
> > > > -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > > > -                             shared);
> > > >         handle_changed_spte_acc_track(old_spte, new_spte, level);
> > > > -       handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> > > > -                                     new_spte, level);
> > > > +
> > > > +       /* COMMENT GOES HERE. */
> > >
> > > Current "shared" callers are not making a page dirty. If a new
> > > "shared" caller makes a page dirty then make sure
> > > handle_changed_spte_dirty_log is called.
> > >
> > > How is this?
> >
> > I was hoping for a more definitive "rule" than "KVM doesn't currently do XYZ".
> >
> > > > +       if (!shared)
> > > > +               handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> > > > +                                             new_spte, level);
> > > >  }
> > > >
> 
> What if implementation is changed a little more? I can think of two options:
> 
> Option 1:
> Remove handle_changed_spte_dirty_log() and let the callers handle
> mark_page_dirty_in_slot(). Similar to how fast page faults do this.
> This will get rid of the "shared" variable and defining its rules for
> the shared and nonshared flow.
> 
> Option 2:
> Changing meaning of this variable from "shared" to something like
> "handle_dirty_log"
> Callers will know if they want dirty log to be handled or not.
> 
> I am preferring option 1.

Sorry, what I meant by "definitive rule" was an explanation of why KVM doesn't
need to do dirty log tracking for tdp_mmu_set_spte_atomic().

Figured things out after a bit o' archaeology.  Commit bcc4f2bc5026 ("KVM: MMU:
mark page dirty in make_spte") shifted the dirtying of the memslot to make_spte(),
and then commit 6ccf44388206 ("KVM: MMU: unify tdp_mmu_map_set_spte_atomic and
tdp_mmu_set_spte_atomic_no_dirty_log") covered up the crime.

Egad!  I believe that means handle_changed_spte_dirty_log() is dead code for all
intents and purposes, as there is no path that creates a WRITABLE 4KiB SPTE without
bouncing through make_spte().  set_spte_gfn() => kvm_mmu_changed_pte_notifier_make_spte()
only creates !WRITABLE SPTEs, ignoring for the moment that that's currently dead
code too (see commit c13fda237f08 ("KVM: Assert that notifier count is elevated in
.change_pte()")).

So I _think_ we can do option #1 simply by deleting code.
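
Concretely, deleting the call would leave the tail of handle_changed_spte()
looking something like this (a sketch, resting on the assumption above that
make_spte() is the only creator of writable 4KiB SPTEs):

	handle_changed_spte_acc_track(old_spte, new_spte, level);

	/*
	 * No handle_changed_spte_dirty_log() here: make_spte() already calls
	 * mark_page_dirty_in_slot() when it installs a writable SPTE (see
	 * commit bcc4f2bc5026), so there is nothing left to record.
	 */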
  
Sean Christopherson Jan. 30, 2023, 11:49 p.m. UTC | #10
On Mon, Jan 30, 2023, David Matlack wrote:
> On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
> [...]
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 92 ++++++++++----------------------------
> >  1 file changed, 24 insertions(+), 68 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index bba33aea0fb0..2f78ca43a276 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> [...]
> > @@ -1289,8 +1244,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
> >                 new_spte = mark_spte_for_access_track(new_spte);
> >         }
> >
> > -       tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
> > -
> > +       kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);
> 
> This can race with fast_page_fault() setting the W-bit and the CPU
> setting the D-bit. i.e. This call to kvm_tdp_mmu_write_spte() could
> clear the W-bit or D-bit.

Ugh, right.  Hrm.  Duh, I just didn't go far enough.  A straight XCHG is silly.
Except for the access-tracking mess, KVM wants to clear a single bit.  Writing
the entire thing and potentially clobbering bits is wasteful and unnecessarily
dangerous.  And the access-tracking code already needs special handling.

We can just simplify this all by adding a helper to clear a single bit (and
maybe even use clear_bit() and __clear_bit() if we save the bit number for the
W/A/D bits and not just the mask).  And if it weren't for EPT (different A/D
locations), we could even use static asserts to restrict the usage to the W/A/D
bits :-/  Oh well.

E.g. this

static inline void kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask)
{
	if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level))
		atomic64_and(~mask, (atomic64_t *)rcu_dereference(iter->sptep));
	else
		__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask);
}

yields another untested patch:

---
 arch/x86/kvm/mmu/tdp_iter.h | 24 +++++++++++++++++++-----
 arch/x86/kvm/mmu/tdp_mmu.c  | 33 +++++++++++++--------------------
 2 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index f0af385c56e0..09c8f2640ccc 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -29,11 +29,10 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 	WRITE_ONCE(*rcu_dereference(sptep), new_spte);
 }
 
-static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
-					 u64 new_spte, int level)
+static inline bool kvm_tdp_mmu_spte_has_volatile_bits(u64 old_spte, int level)
 {
 	/*
-	 * Atomically write the SPTE if it is a shadow-present, leaf SPTE with
+	 * Atomically write SPTEs if it is a shadow-present, leaf SPTE with
 	 * volatile bits, i.e. has bits that can be set outside of mmu_lock.
 	 * The Writable bit can be set by KVM's fast page fault handler, and
 	 * Accessed and Dirty bits can be set by the CPU.
@@ -44,8 +43,15 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
 	 * logic needs to be reassessed if KVM were to use non-leaf Accessed
 	 * bits, e.g. to skip stepping down into child SPTEs when aging SPTEs.
 	 */
-	if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
-	    spte_has_volatile_bits(old_spte))
+	return is_shadow_present_pte(old_spte) &&
+	       is_last_spte(old_spte, level) &&
+	       spte_has_volatile_bits(old_spte);
+}
+
+static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
+					 u64 new_spte, int level)
+{
+	if (kvm_tdp_mmu_spte_has_volatile_bits(old_spte, level))
 		return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);
 
 	__kvm_tdp_mmu_write_spte(sptep, new_spte);
@@ -115,4 +121,12 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
 
+static inline void kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask)
+{
+	if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level))
+		atomic64_and(~mask, (atomic64_t *)rcu_dereference(iter->sptep));
+	else
+		__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask);
+}
+
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2f78ca43a276..32a7209a522d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1223,28 +1223,26 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 			  struct kvm_gfn_range *range)
 {
-	u64 new_spte = 0;
+	u64 new_spte;
 
 	/* If we have a non-accessed entry we don't need to change the pte. */
 	if (!is_accessed_spte(iter->old_spte))
 		return false;
 
-	new_spte = iter->old_spte;
-
-	if (spte_ad_enabled(new_spte)) {
-		new_spte &= ~shadow_accessed_mask;
+	if (spte_ad_enabled(iter->old_spte)) {
+		kvm_tdp_mmu_clear_spte_bit(iter, shadow_accessed_mask);
 	} else {
+		new_spte = mark_spte_for_access_track(iter->old_spte);
+		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte,
+							new_spte, iter->level);
 		/*
 		 * Capture the dirty status of the page, so that it doesn't get
 		 * lost when the SPTE is marked for access tracking.
 		 */
-		if (is_writable_pte(new_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(new_spte));
-
-		new_spte = mark_spte_for_access_track(new_spte);
+		if (is_writable_pte(iter->old_spte))
+			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
 	}
 
-	kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);
 	return true;
 }
 
@@ -1632,7 +1630,6 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 				  gfn_t gfn, unsigned long mask, bool wrprot)
 {
 	struct tdp_iter iter;
-	u64 new_spte;
 
 	rcu_read_lock();
 
@@ -1648,20 +1645,16 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		mask &= ~(1UL << (iter.gfn - gfn));
 
 		if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
-			if (is_writable_pte(iter.old_spte))
-				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-			else
+			if (!is_writable_pte(iter.old_spte))
 				continue;
+
+			kvm_tdp_mmu_clear_spte_bit(&iter, PT_WRITABLE_MASK);
 		} else {
-			if (iter.old_spte & shadow_dirty_mask)
-				new_spte = iter.old_spte & ~shadow_dirty_mask;
-			else
+			if (!(iter.old_spte & shadow_dirty_mask))
 				continue;
 
-			kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+			kvm_tdp_mmu_clear_spte_bit(&iter, shadow_dirty_mask);
 		}
-
-		kvm_tdp_mmu_write_spte(iter.sptep, iter.old_spte, new_spte, iter.level);
 	}
 
 	rcu_read_unlock();

base-commit: 7bb67c88a2d77cc95524912f3b1ca51daf5c0224
--
  
David Matlack Jan. 31, 2023, 5:41 p.m. UTC | #11
On Mon, Jan 30, 2023 at 3:49 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jan 30, 2023, David Matlack wrote:
> > On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
> > [...]
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 92 ++++++++++----------------------------
> > >  1 file changed, 24 insertions(+), 68 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index bba33aea0fb0..2f78ca43a276 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > [...]
> > > @@ -1289,8 +1244,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
> > >                 new_spte = mark_spte_for_access_track(new_spte);
> > >         }
> > >
> > > -       tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
> > > -
> > > +       kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);
> >
> > This can race with fast_page_fault() setting the W-bit and the CPU
> > setting the D-bit. i.e. This call to kvm_tdp_mmu_write_spte() could
> > clear the W-bit or D-bit.
>
> Ugh, right.  Hrm.  Duh, I just didn't go far enough.  A straight XCHG is silly.
> Except for the access-tracking mess, KVM wants to clear a single bit.  Writing
> the entire thing and potentially clobbering bits is wasteful and unnecessarily
> dangerous.  And the access-tracking code already needs special handling.
>
> We can just simplify this all by adding a helper to clear a single bit (and
> maybe even use clear_bit() and __clear_bit() if we save the bit number for the
> W/A/D bits and not just the mask).  And if it weren't for EPT (different A/D
> locations), we could even use static asserts to restrict the usage to the W/A/D
> bits :-/  Oh well.
>
> E.g. this

This patch looks good. Vipin, can you incorporate it in your next version?
  
Vipin Sharma Jan. 31, 2023, 6:01 p.m. UTC | #12
On Tue, Jan 31, 2023 at 9:41 AM David Matlack <dmatlack@google.com> wrote:
>
> On Mon, Jan 30, 2023 at 3:49 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Jan 30, 2023, David Matlack wrote:
> > > On Wed, Jan 25, 2023 at 5:49 PM Sean Christopherson <seanjc@google.com> wrote:
> > > [...]
> > > > ---
> > > >  arch/x86/kvm/mmu/tdp_mmu.c | 92 ++++++++++----------------------------
> > > >  1 file changed, 24 insertions(+), 68 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > index bba33aea0fb0..2f78ca43a276 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > [...]
> > > > @@ -1289,8 +1244,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
> > > >                 new_spte = mark_spte_for_access_track(new_spte);
> > > >         }
> > > >
> > > > -       tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
> > > > -
> > > > +       kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte, new_spte, iter->level);
> > >
> > > This can race with fast_page_fault() setting the W-bit and the CPU
> > > setting the D-bit. i.e. This call to kvm_tdp_mmu_write_spte() could
> > > clear the W-bit or D-bit.
> >
> > Ugh, right.  Hrm.  Duh, I just didn't go far enough.  A straight XCHG is silly.
> > Except for the access-tracking mess, KVM wants to clear a single bit.  Writing
> > the entire thing and potentially clobbering bits is wasteful and unnecessarily
> > dangerous.  And the access-tracking code already needs special handling.
> >
> > We can just simplify this all by adding a helper to clear a single bit (and
> > maybe even use clear_bit() and __clear_bit() if we save the bit number for the
> > W/A/D bits and not just the mask).  And if it weren't for EPT (different A/D
> > locations), we could even use static asserts to restrict the usage to the W/A/D
> > bits :-/  Oh well.
> >
> > E.g. this
>
> This patch looks good. Vipin can you incorporate this in your next version?

On it.
  

Patch

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bba33aea0fb0..ca21b33c4386 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -504,6 +504,19 @@  static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
+static void handle_changed_spte_clear_dirty_log(int as_id, gfn_t gfn,
+						u64 old_spte, u64 new_spte,
+						int level)
+{
+	if (old_spte == new_spte)
+		return;
+
+	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+
+	if (is_dirty_spte(old_spte) &&  !is_dirty_spte(new_spte))
+		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+}
+
 /**
  * __handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -736,7 +749,12 @@  static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+	if (record_dirty_log)
+		__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte,
+				      level, false);
+	else
+		handle_changed_spte_clear_dirty_log(as_id, gfn, old_spte,
+						    new_spte, level);
 
 	if (record_acc_track)
 		handle_changed_spte_acc_track(old_spte, new_spte, level);