[mm-unstable,v1,3/5] kvm/arm64: add kvm_arch_test_clear_young()

Message ID 20230217041230.2417228-4-yuzhao@google.com
State New
Series mm/kvm: lockless accessed bit harvest

Commit Message

Yu Zhao Feb. 17, 2023, 4:12 a.m. UTC
  This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not pKVM and run on hardware that sets the accessed bit
in KVM page tables.

It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/arm64/include/asm/kvm_host.h       |  7 +++
 arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
 arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
 arch/arm64/kvm/arm.c                    |  1 +
 arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
 arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
 6 files changed, 141 insertions(+), 46 deletions(-)
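
In essence, the lockless harvest boils down to the following pattern. This is a simplified sketch with illustrative names (AF_BIT, test_clear_young_one), not the series' real identifiers:

/*
 * Simplified sketch of the lockless accessed-bit harvest. AF_BIT and
 * test_clear_young_one() are illustrative stand-ins for the real code.
 */
static bool test_clear_young_one(kvm_pte_t *ptep)
{
	kvm_pte_t old;
	bool young = false;

	rcu_read_lock();		/* keeps the page-table pages alive */

	old = READ_ONCE(*ptep);
	if (kvm_pte_valid(old) && (old & AF_BIT)) {
		young = true;
		/*
		 * Atomic against the hardware A-bit setter and other
		 * software walkers; losing the race is harmless -- the
		 * page is simply reported young again on the next scan.
		 */
		cmpxchg64(ptep, old, old & ~AF_BIT);
	}

	rcu_read_unlock();

	return young;
}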
  

Comments

Yu Zhao Feb. 17, 2023, 4:21 a.m. UTC | #1
On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <yuzhao@google.com> wrote:
>
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not pKVM and run on hardware that sets the accessed bit
> in KVM page tables.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h       |  7 +++
>  arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
>  arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
>  arch/arm64/kvm/arm.c                    |  1 +
>  arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
>  arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
>  6 files changed, 141 insertions(+), 46 deletions(-)

Adding Marc and Will.

Can you please add other interested parties that I've missed?

Thanks.
  
Marc Zyngier Feb. 17, 2023, 9 a.m. UTC | #2
On Fri, 17 Feb 2023 04:21:28 +0000,
Yu Zhao <yuzhao@google.com> wrote:
> 
> On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > VMs that are not pKVM and run on hardware that sets the accessed bit
> > in KVM page tables.

I'm really interested in how you can back this statement. 90% of the
HW I have access to is not FEAT_HWAFDB capable, either because it
predates the feature or because the feature is too buggy to be useful.

Do you have numbers?

> >
> > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > the accessed bit without taking the MMU lock. The former protects KVM
> > page tables from being freed while the latter clears the accessed bit
> > atomically against both the hardware and other software page table
> > walkers.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h       |  7 +++
> >  arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
> >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> >  arch/arm64/kvm/arm.c                    |  1 +
> >  arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
> >  arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
> >  6 files changed, 141 insertions(+), 46 deletions(-)
> 
> Adding Marc and Will.
> 
> Can you please add other interested parties that I've missed?

The MAINTAINERS file has it all:

KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
M:      Marc Zyngier <maz@kernel.org>
M:      Oliver Upton <oliver.upton@linux.dev>
R:      James Morse <james.morse@arm.com>
R:      Suzuki K Poulose <suzuki.poulose@arm.com>
R:      Zenghui Yu <yuzenghui@huawei.com>
L:      kvmarm@lists.linux.dev

May I suggest that you repost your patch and Cc the interested
parties yourself? I guess most folks will want to see this in context,
and not as a random, isolated change with no rationale.

	M.
  
Oliver Upton Feb. 17, 2023, 9:09 a.m. UTC | #3
Hi Yu,

scripts/get_maintainers.pl is your friend for getting the right set of
emails for a series :) Don't know about others, but generally I would
prefer to be Cc'ed on an entire series (to gather context) than just an
individual patch.

On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not pKVM and run on hardware that sets the accessed bit
> in KVM page tables.
> 
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h       |  7 +++
>  arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
>  arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
>  arch/arm64/kvm/arm.c                    |  1 +
>  arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
>  arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
>  6 files changed, 141 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..572bcd321586 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
>  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
>  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
>  
> +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> +	return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> +}

Why does the lack of FEAT_HAFDBS preclude the use of the test-and-clear
notifier?

On implementations without FEAT_HAFDBS, hardware will generate a data
abort for software to set the access flag. Regardless of whether
software or hardware is responsible for updating the PTE, that
information is available in the page tables.
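
(For reference, cpu_has_hw_af() -- the gate used in the hunk above -- essentially tests the HAFDBS field of ID_AA64MMFR1_EL1. A rough sketch follows; the exact field macro name varies across kernel versions:)

/* Approximation of the helper in arch/arm64/include/asm/cpufeature.h. */
static inline bool cpu_has_hw_af(void)
{
	u64 mmfr1;

	if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
		return false;

	mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
	return cpuid_feature_extract_unsigned_field(mmfr1,
						    ID_AA64MMFR1_HADBS_SHIFT);
}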

Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
My expectation is that we should provide an implementation that returns
false if !CONFIG_KVM, avoiding the need to repeat that bit in every
single implementation of the function.

> +
>  #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 63f81b27a4e3..8c9a04388c88 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
>   * @put_page:			Decrement the refcount on a page. When the
>   *				refcount reaches 0 the page is automatically
>   *				freed.
> + * @put_page_rcu:		RCU variant of put_page().
>   * @page_count:			Return the refcount of a page.
>   * @phys_to_virt:		Convert a physical address into a virtual
>   *				address	mapped in the current context.
> @@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
>  	void		(*free_removed_table)(void *addr, u32 level);
>  	void		(*get_page)(void *addr);
>  	void		(*put_page)(void *addr);
> +	void		(*put_page_rcu)(void *addr);

Why do we need this? We already defer dropping the last reference count
on a page to an RCU callback. Have you observed issues with the existing
implementation?

>  	int		(*page_count)(void *addr);
>  	void*		(*phys_to_virt)(phys_addr_t phys);
>  	phys_addr_t	(*virt_to_phys)(void *addr);
> @@ -188,6 +190,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
>   *					children.
>   * @KVM_PGTABLE_WALK_SHARED:		Indicates the page-tables may be shared
>   *					with other software walkers.
> + *
> + * kvm_arch_test_clear_young() is a special case. It relies on two
> + * techniques, RCU and cmpxchg, to safely test and clear the accessed
> + * bit without taking the MMU lock. The former protects KVM page tables
> + * from being freed while the latter clears the accessed bit atomically
> + * against both the hardware and other software page table walkers.
>   */
>  enum kvm_pgtable_walk_flags {
>  	KVM_PGTABLE_WALK_LEAF			= BIT(0),
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index c8dca8ae359c..350437661d4b 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -30,4 +30,47 @@
>   */
>  #define kvm_mmu_cache_min_pages(kvm)	(kvm_stage2_levels(kvm) - 1)
>  
> +#define KVM_PTE_TYPE			BIT(1)
> +#define KVM_PTE_TYPE_BLOCK		0
> +#define KVM_PTE_TYPE_PAGE		1
> +#define KVM_PTE_TYPE_TABLE		1
> +
> +#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)
> +
> +#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	3
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	1
> +#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
> +#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)
> +
> +#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
> +#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)
> +
> +#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 51)
> +
> +#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
> +
> +#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
> +
> +#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
> +
> +#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
> +					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
> +					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
> +
> +#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
> +#define KVM_MAX_OWNER_ID		1
> +
> +/*
> + * Used to indicate a pte for which a 'break-before-make' sequence is in
> + * progress.
> + */
> +#define KVM_INVALID_PTE_LOCKED		BIT(10)
> +

If there is a need to do this sort of move, please do it in a separate
patch. It pollutes the context of the functional change you're making.

>  #endif	/* __ARM64_S2_PGTABLE_H_ */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..6770bc47f5c9 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>   */
>  void kvm_arch_destroy_vm(struct kvm *kvm)
>  {
> +	kvm_free_stage2_pgd(&kvm->arch.mmu);
>
>  	bitmap_free(kvm->arch.pmu_filter);
>  	free_cpumask_var(kvm->arch.supported_cpus);
>  

[...]

> +struct test_clear_young_arg {
> +	struct kvm_gfn_range *range;
> +	gfn_t lsb_gfn;
> +	unsigned long *bitmap;
> +};
> +
> +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> +				   enum kvm_pgtable_walk_flags flags)
> +{
> +	struct test_clear_young_arg *arg = ctx->arg;
> +	gfn_t gfn = ctx->addr / PAGE_SIZE;
> +	kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> +
> +	VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
> +	VM_WARN_ON_ONCE(gfn < arg->range->start || gfn >= arg->range->end);

Do we really need to be _this_ pedantic about sanity checking?

> +	if (!kvm_pte_valid(new))
> +		return 0;
> +
> +	if (new == ctx->old)
> +		return 0;
> +
> +	/* see the comments on the generic kvm_arch_has_test_clear_young() */
> +	if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
> +		cmpxchg64(ctx->ptep, ctx->old, new);

Why not use stage2_try_set_pte()? Not only is it idiomatic with the rest
of the stage-2 code, it also will 'do the right thing' according to the
locking scheme of the caller if we decide to change it at some point.
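
(stage2_try_set_pte() lives in arch/arm64/kvm/hyp/pgtable.c; it is roughly the sketch below -- a plain store for exclusive walkers, a cmpxchg for shared ones. Details may differ across kernel versions:)

/* Rough sketch of stage2_try_set_pte(); see pgtable.c for the real thing. */
static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx,
			       kvm_pte_t new)
{
	/* Exclusive walkers (MMU lock held for write) can store directly. */
	if (!kvm_pgtable_walk_shared(ctx)) {
		WRITE_ONCE(*ctx->ptep, new);
		return true;
	}

	/* Shared walkers must not clobber a concurrent update. */
	return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
}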

> +	return 0;
> +}
> +
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
> +			       gfn_t lsb_gfn, unsigned long *bitmap)
> +{
> +	u64 start = range->start * PAGE_SIZE;
> +	u64 end = range->end * PAGE_SIZE;
> +	struct test_clear_young_arg arg = {
> +		.range		= range,
> +		.lsb_gfn	= lsb_gfn,
> +		.bitmap		= bitmap,
> +	};
> +	struct kvm_pgtable_walker walker = {
> +		.cb		= stage2_test_clear_young,
> +		.arg		= &arg,
> +		.flags		= KVM_PGTABLE_WALK_LEAF,
> +	};
> +
> +	BUILD_BUG_ON(is_hyp_code());

See prior comment about sanity checking.

> +	if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
> +		return false;

Same here...

> +	/* see the comments on kvm_pgtable_walk_flags */
> +	rcu_read_lock();
> +
> +	kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
> +
> +	rcu_read_unlock();

The rcu_read_{lock,unlock}() is entirely superfluous.

Of course, it is somewhat hidden by the fact that we must use
abstractions to support host and EL2 use of the page table code, but we
already make use of RCU to protect the stage-2 of a 'regular' VM.

> +	return true;
> +}
> +
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	if (!kvm->arch.mmu.pgt)
> @@ -1848,7 +1924,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> -	kvm_free_stage2_pgd(&kvm->arch.mmu);
>  }

Doesn't this become a blatant correctness issue? This entirely fails to
uphold the exact expectations of the call.
  
Sean Christopherson Feb. 17, 2023, 4 p.m. UTC | #4
On Fri, Feb 17, 2023, Oliver Upton wrote:
> Hi Yu,
> 
> scripts/get_maintainers.pl is your friend for getting the right set of
> emails for a series :) Don't know about others, but generally I would
> prefer to be Cc'ed on an entire series (to gather context) than just an
> individual patch.

+1

> 
> On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > VMs that are not pKVM and run on hardware that sets the accessed bit
> > in KVM page tables.

At least for the x86 changes, please read Documentation/process/maintainer-tip.rst
and rewrite the changelogs.

> > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > the accessed bit without taking the MMU lock. The former protects KVM
> > page tables from being freed while the latter clears the accessed bit
> > atomically against both the hardware and other software page table
> > walkers.
> > 
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h       |  7 +++
> >  arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
> >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> >  arch/arm64/kvm/arm.c                    |  1 +
> >  arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
> >  arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
> >  6 files changed, 141 insertions(+), 46 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..572bcd321586 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> >  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> >  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> >  
> > +/* see the comments on the generic kvm_arch_has_test_clear_young() */

Please eliminate all of these "see the comments on blah"; in every case they do
nothing more than redirect the reader to something they're likely already aware of.

> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > +	return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> > +}

...

> Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> My expectation is that we should provide an implementation that returns
> false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> single implementation of the function.

mm/vmscan.c uses kvm_arch_has_test_clear_young().  I have opinions on that, but
I'll hold off on expressing them until there's actual justification presented
somewhere.

Yu, this series and each patch needs a big pile of "why".  I get that the goal
is to optimize memory oversubscribe, but there needs to be justification for
why this is KVM only, why nested VMs and !A/D hardware are out of scope, why yet
another mmu_notifier hook is being added, etc.
  
Yu Zhao Feb. 23, 2023, 3:58 a.m. UTC | #5
On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 17 Feb 2023 04:21:28 +0000,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > in KVM page tables.
>
> I'm really interested in how you can back this statement. 90% of the
> HW I have access to is not FEAT_HWAFDB capable, either because it
> predates the feature or because the feature is too buggy to be useful.

This is my experience too -- most devices are pre-v8.2.

> Do you have numbers?

Let's do a quick market survey by segment. The following only applies
to ARM CPUs:

1. Phones: none of the major Android phone vendors sell phones running
VMs; no other major Linux phone vendors.
2. Laptops: only a very limited number of Chromebooks run VMs, namely
ACRVM. No other major Linux laptop vendors.
3. Desktops: no major Linux desktop vendors.
4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
can be a VM guest on QNX host).
5. Cloud: this is where the vast majority of VMs come from. Among the
vendors available to the general public, Ampere is the biggest player.
Here [1] is a list of its customers. The A-bit works well even on its
EVT products (Neoverse cores).

[1] https://en.wikipedia.org/wiki/Ampere_Computing

> > > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > > the accessed bit without taking the MMU lock. The former protects KVM
> > > page tables from being freed while the latter clears the accessed bit
> > > atomically against both the hardware and other software page table
> > > walkers.
> > >
> > > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h       |  7 +++
> > >  arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
> > >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> > >  arch/arm64/kvm/arm.c                    |  1 +
> > >  arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
> > >  arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
> > >  6 files changed, 141 insertions(+), 46 deletions(-)
> >
> > Adding Marc and Will.
> >
> > Can you please add other interested parties that I've missed?
>
> The MAINTAINERS file has it all:
>
> KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
> M:      Marc Zyngier <maz@kernel.org>
> M:      Oliver Upton <oliver.upton@linux.dev>
> R:      James Morse <james.morse@arm.com>
> R:      Suzuki K Poulose <suzuki.poulose@arm.com>
> R:      Zenghui Yu <yuzenghui@huawei.com>
> L:      kvmarm@lists.linux.dev
>
> May I suggest that you repost your patch and Cc the interested
> parties yourself? I guess most folks will want to see this in context,
> and not as a random, isolated change with no rationale.

This clarified it. Thanks. (I was hesitant to spam people with the
entire series containing changes to other architectures.)
  
Yu Zhao Feb. 23, 2023, 4:43 a.m. UTC | #6
On Fri, Feb 17, 2023 at 2:09 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi Yu,
>
> scripts/get_maintainers.pl is your friend for getting the right set of
> emails for a series :) Don't know about others, but generally I would
> prefer to be Cc'ed on an entire series (to gather context) than just an
> individual patch.

Will do. Thank you.

> On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > VMs that are not pKVM and run on hardware that sets the accessed bit
> > in KVM page tables.
> >
> > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > the accessed bit without taking the MMU lock. The former protects KVM
> > page tables from being freed while the latter clears the accessed bit
> > atomically against both the hardware and other software page table
> > walkers.
> >
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h       |  7 +++
> >  arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
> >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> >  arch/arm64/kvm/arm.c                    |  1 +
> >  arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
> >  arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
> >  6 files changed, 141 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..572bcd321586 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> >  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> >  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> >
> > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > +     return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> > +}
>
> Why does the lack of FEAT_HAFDBS preclude the use of the test-and-clear
> notifier?

This all comes down to the return on investment. We could
theoretically make it work, but the complexity and the poor performance
would outweigh the benefits -- VM memory overcommit mostly happens in
the Cloud, and none of the major Cloud vendors use pre-v8.2 hardware [1].

[1] https://lore.kernel.org/linux-mm/CAOUHufbbs2gG+DPvSOw_N_Kx7FWdZvpdJUvLzko-BDQ8vfd6Xg@mail.gmail.com/

> On implementations without FEAT_HAFDBS, hardware will generate a data
> abort for software to set the access flag. Regardless of whether
> software or hardware is responsible for updating the PTE, that
> information is available in the page tables.

Agreed, the s/w emulation of the A-bit has poor performance. Looking
forward, businesses that wish to overcommit host memory will eventually
all move onto v8.2 and later. This is another reason not to worry about
pre-v8.2 (or 32-bit for that matter).

> Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> My expectation is that we should provide an implementation that returns
> false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> single implementation of the function.

We do have that default implementation. But we still need to disable
this implementation when !CONFIG_KVM (it isn't protected by #ifdef
CONFIG_KVM).
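
(The default implementation in question is presumably the usual #ifndef-override pattern in the generic headers of this series; a sketch of what that plausibly looks like:)

/*
 * Sketch of the assumed generic fallback: architectures opt in by
 * defining both the macro and their own kvm_arch_has_test_clear_young();
 * everyone else gets "false".
 */
#ifndef kvm_arch_has_test_clear_young
static inline bool kvm_arch_has_test_clear_young(void)
{
	return false;
}
#endif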

> > +
> >  #endif /* __ARM64_KVM_HOST_H__ */
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 63f81b27a4e3..8c9a04388c88 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> >   * @put_page:                        Decrement the refcount on a page. When the
> >   *                           refcount reaches 0 the page is automatically
> >   *                           freed.
> > + * @put_page_rcu:            RCU variant of put_page().
> >   * @page_count:                      Return the refcount of a page.
> >   * @phys_to_virt:            Convert a physical address into a virtual
> >   *                           address mapped in the current context.
> > @@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
> >       void            (*free_removed_table)(void *addr, u32 level);
> >       void            (*get_page)(void *addr);
> >       void            (*put_page)(void *addr);
> > +     void            (*put_page_rcu)(void *addr);
>
> Why do we need this? We already defer dropping the last reference count
> on a page to an RCU callback. Have you observed issues with the existing
> implementation?

That's on the reader path, i.e., collapsing PTEs into a PMD, which
RCU-frees the PTE table.

On the writer path, unmapping wasn't protected by RCU before this
patch, and put_page_rcu() makes it so.
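
(Concretely, the patch routes the final reference drop through call_rcu(); in outline, copied from the diff below:)

/* Outline of the patch's kvm_s2_put_page_rcu(); full version in the diff. */
static void kvm_s2_rcu_put_page(struct rcu_head *head)
{
	put_page(container_of(head, struct page, rcu_head));
}

static void kvm_s2_put_page_rcu(void *addr)
{
	struct page *page = virt_to_page(addr);

	/* Unaccount now; the page itself is freed after a grace period. */
	if (kvm_host_page_count(addr) == 1)
		kvm_account_pgtable_pages(addr, -1);

	call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
}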

> >       int             (*page_count)(void *addr);
> >       void*           (*phys_to_virt)(phys_addr_t phys);
> >       phys_addr_t     (*virt_to_phys)(void *addr);
> > @@ -188,6 +190,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
> >   *                                   children.
> >   * @KVM_PGTABLE_WALK_SHARED:         Indicates the page-tables may be shared
> >   *                                   with other software walkers.
> > + *
> > + * kvm_arch_test_clear_young() is a special case. It relies on two
> > + * techniques, RCU and cmpxchg, to safely test and clear the accessed
> > + * bit without taking the MMU lock. The former protects KVM page tables
> > + * from being freed while the latter clears the accessed bit atomically
> > + * against both the hardware and other software page table walkers.
> >   */
> >  enum kvm_pgtable_walk_flags {
> >       KVM_PGTABLE_WALK_LEAF                   = BIT(0),
> > diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> > index c8dca8ae359c..350437661d4b 100644
> > --- a/arch/arm64/include/asm/stage2_pgtable.h
> > +++ b/arch/arm64/include/asm/stage2_pgtable.h
> > @@ -30,4 +30,47 @@
> >   */
> >  #define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)
> >
> > +#define KVM_PTE_TYPE                 BIT(1)
> > +#define KVM_PTE_TYPE_BLOCK           0
> > +#define KVM_PTE_TYPE_PAGE            1
> > +#define KVM_PTE_TYPE_TABLE           1
> > +
> > +#define KVM_PTE_LEAF_ATTR_LO         GENMASK(11, 2)
> > +
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX      GENMASK(4, 2)
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AP   GENMASK(7, 6)
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO        3
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW        1
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_SH   GENMASK(9, 8)
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS        3
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AF   BIT(10)
> > +
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR      GENMASK(5, 2)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R       BIT(6)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W       BIT(7)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_SH   GENMASK(9, 8)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS        3
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_AF   BIT(10)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI         GENMASK(63, 51)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI_SW              GENMASK(58, 55)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI_S1_XN   BIT(54)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI_S2_XN   BIT(54)
> > +
> > +#define KVM_PTE_LEAF_ATTR_S2_PERMS   (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
> > +                                      KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
> > +                                      KVM_PTE_LEAF_ATTR_HI_S2_XN)
> > +
> > +#define KVM_INVALID_PTE_OWNER_MASK   GENMASK(9, 2)
> > +#define KVM_MAX_OWNER_ID             1
> > +
> > +/*
> > + * Used to indicate a pte for which a 'break-before-make' sequence is in
> > + * progress.
> > + */
> > +#define KVM_INVALID_PTE_LOCKED               BIT(10)
> > +
>
> If there is a need to do this sort of move, please do it in a separate
> patch. It pollutes the context of the functional change you're making.
>
> >  #endif       /* __ARM64_S2_PGTABLE_H_ */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 9c5573bc4614..6770bc47f5c9 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> >   */
> >  void kvm_arch_destroy_vm(struct kvm *kvm)
> >  {
> > +     kvm_free_stage2_pgd(&kvm->arch.mmu);
> >
> >       bitmap_free(kvm->arch.pmu_filter);
> >       free_cpumask_var(kvm->arch.supported_cpus);
> >
>
> [...]
>
> > +struct test_clear_young_arg {
> > +     struct kvm_gfn_range *range;
> > +     gfn_t lsb_gfn;
> > +     unsigned long *bitmap;
> > +};
> > +
> > +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> > +                                enum kvm_pgtable_walk_flags flags)
> > +{
> > +     struct test_clear_young_arg *arg = ctx->arg;
> > +     gfn_t gfn = ctx->addr / PAGE_SIZE;
> > +     kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> > +
> > +     VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
> > +     VM_WARN_ON_ONCE(gfn < arg->range->start || gfn >= arg->range->end);
>
> Do we really need to be _this_ pedantic about sanity checking?

Will remove them. (My experience with the world's large fleets is that
Murphy's law is always true.)

> > +     if (!kvm_pte_valid(new))
> > +             return 0;
> > +
> > +     if (new == ctx->old)
> > +             return 0;
> > +
> > +     /* see the comments on the generic kvm_arch_has_test_clear_young() */
> > +     if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
> > +             cmpxchg64(ctx->ptep, ctx->old, new);
>
> Why not use stage2_try_set_pte()? Not only is it idiomatic with the rest
> of the stage-2 code, it also will 'do the right thing' according to the
> locking scheme of the caller if we decide to change it at some point.

It's not exported. Do you prefer it to be exported?

> > +     return 0;
> > +}
> > +
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
> > +                            gfn_t lsb_gfn, unsigned long *bitmap)
> > +{
> > +     u64 start = range->start * PAGE_SIZE;
> > +     u64 end = range->end * PAGE_SIZE;
> > +     struct test_clear_young_arg arg = {
> > +             .range          = range,
> > +             .lsb_gfn        = lsb_gfn,
> > +             .bitmap         = bitmap,
> > +     };
> > +     struct kvm_pgtable_walker walker = {
> > +             .cb             = stage2_test_clear_young,
> > +             .arg            = &arg,
> > +             .flags          = KVM_PGTABLE_WALK_LEAF,
> > +     };
> > +
> > +     BUILD_BUG_ON(is_hyp_code());
>
> See prior comment about sanity checking.
>
> > +     if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
> > +             return false;
>
> Same here...
>
> > +     /* see the comments on kvm_pgtable_walk_flags */
> > +     rcu_read_lock();
> > +
> > +     kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
> > +
> > +     rcu_read_unlock();
>
> The rcu_read_{lock,unlock}() is entirely superfluous.

Not really. I didn't use the KVM_PGTABLE_WALK_SHARED flag above. Yes,
it would be more consistent with the rest of the ARM code. My POV is
that it would be less consistent with other archs, which I fully
expect you to disagree with :)

I could add it and remove rcu_read_{lock,unlock}() if you prefer that way.
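
(That variant would look roughly like this; with KVM_PGTABLE_WALK_SHARED set, kvm_pgtable_walk_begin()/_end() take and drop the RCU read lock internally, so the explicit pair goes away:)

/* Sketch of the KVM_PGTABLE_WALK_SHARED alternative being discussed. */
struct kvm_pgtable_walker walker = {
	.cb	= stage2_test_clear_young,
	.arg	= &arg,
	/* SHARED tells the walker infrastructure to handle RCU itself. */
	.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
};

kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);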

> Of course, it is somewhat hidden by the fact that we must use
> abstractions to support host and EL2 use of the page table code, but we
> already make use of RCU to protect the stage-2 of a 'regular' VM.
>
> > +     return true;
> > +}
> > +
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >  {
> >       if (!kvm->arch.mmu.pgt)
> > @@ -1848,7 +1924,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
> >
> >  void kvm_arch_flush_shadow_all(struct kvm *kvm)
> >  {
> > -     kvm_free_stage2_pgd(&kvm->arch.mmu);
> >  }
>
> Doesn't this become a blatant correctness issue? This entirely fails to
> uphold the exact expectations of the call.

I moved kvm_free_stage2_pgd() into kvm_arch_destroy_vm() above so
that the mmu notifier SRCU will protect the pgd at destruction. Don't
worry about this for now. I'll refactor this change and put_page_rcu()
into a separate patch to make it clearer -- without them, RCU page table
walkers won't be safe against VM destruction and unmap.
  
Yu Zhao Feb. 23, 2023, 5:25 a.m. UTC | #7
On Fri, Feb 17, 2023 at 9:00 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Feb 17, 2023, Oliver Upton wrote:
> > Hi Yu,
> >
> > scripts/get_maintainers.pl is your friend for getting the right set of
> > emails for a series :) Don't know about others, but generally I would
> > prefer to be Cc'ed on an entire series (to gather context) than just an
> > individual patch.
>
> +1
>
> >
> > On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > in KVM page tables.
>
> At least for the x86 changes, please read Documentation/process/maintainer-tip.rst
> and rewrite the changelogs.

I see -- will remove "this patch".

> > > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > > the accessed bit without taking the MMU lock. The former protects KVM
> > > page tables from being freed while the latter clears the accessed bit
> > > atomically against both the hardware and other software page table
> > > walkers.
> > >
> > > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h       |  7 +++
> > >  arch/arm64/include/asm/kvm_pgtable.h    |  8 +++
> > >  arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> > >  arch/arm64/kvm/arm.c                    |  1 +
> > >  arch/arm64/kvm/hyp/pgtable.c            | 51 ++--------------
> > >  arch/arm64/kvm/mmu.c                    | 77 ++++++++++++++++++++++++-
> > >  6 files changed, 141 insertions(+), 46 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..572bcd321586 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> > >  void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> > >  bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> > >
> > > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
>
> Please eliminate all of these "see the comments on blah"; in every case they do
> nothing more than redirect the reader to something they're likely already aware of.
>
> > > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > > +static inline bool kvm_arch_has_test_clear_young(void)
> > > +{
> > > +   return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> > > +}
>
> ...
>
> > Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> > My expectation is that we should provide an implementation that returns
> > false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> > single implementation of the function.
>
> mm/vmscan.c uses kvm_arch_has_test_clear_young().  I have opinions on that, but
> I'll hold off on expressing them until there's actual justification presented
> somewhere.
>
> Yu, this series and each patch needs a big pile of "why".  I get that the goal
> is to optimize memory oversubscribe, but there needs to be justification for
> why this is KVM only, why nested VMs and !A/D hardware are out of scope, why yet
> another mmu_notifier hook is being added, etc.

I totally agree.

This is an optimization, not a bug fix. It can't be justified without
performance numbers from some common use cases. That burden of proof
clearly rests on me -- I will follow up on that.

For now, I want to make sure the methodical part is clear:
1. We only have limited resources and we need to prioritize major use cases.
2. We can only improve one thing at a time and we can't cover
everything at the same time.
3. We need to focus on the return on investment and the future.

I hope everyone by now agrees with my "the vast majority of VMs ..."
assertion. If not, I'm happy to revisit that [1]. If so, the next step
would be whether we want to focus on the vast majority first. I think
this naturally answers why the nested VMs and !AD h/w are out of
scope, at the moment (I didn't spell this out; probably I should in
v2). After we have taken the first step, we can probably decide
whether there are enough resources and demand to cover the
low-return-on-investment part (higher complexity but less common use
cases).

[1] https://lore.kernel.org/linux-mm/20230217041230.2417228-1-yuzhao@google.com/
  
Marc Zyngier Feb. 23, 2023, 9:03 a.m. UTC | #8
On Thu, 23 Feb 2023 03:58:47 +0000,
Yu Zhao <yuzhao@google.com> wrote:
> 
> On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 17 Feb 2023 04:21:28 +0000,
> > Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > > in KVM page tables.
> >
> > I'm really interested in how you can back this statement. 90% of the
> > HW I have access to is not FEAT_HWAFDB capable, either because it
> > predates the feature or because the feature is too buggy to be useful.
> 
> This is my expericen too -- most devices are pre v8.2.

And yet you have no issue writing the above. Puzzling.

> 
> > Do you have numbers?
> 
> Let's do a quick market survey by segment. The following only applies
> to ARM CPUs:
> 
> 1. Phones: none of the major Android phone vendors sell phones running
> VMs; no other major Linux phone vendors.

Maybe you should have a reality check and look at what your own
employer is shipping.

> 2. Laptops: only a very limited number of Chromebooks run VMs, namely
> ACRVM. No other major Linux laptop vendors.

Again, your employer disagrees.

> 3. Desktops: no major Linux desktop vendors.

My desktop disagrees (I'm sending this from my arm64 desktop VM).

> 4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
> can be a VM guest on QNX host).

This email is brought to you via a router VM on an arm64 box.

> > 5. Cloud: this is where the vast majority of VMs come from. Among the
> vendors available to the general public, Ampere is the biggest player.
> Here [1] is a list of its customers. The A-bit works well even on its
> EVT products (Neoverse cores).

Just the phone stuff dwarfs the number of cloud hosts.

Hopefully your patches are better than your market analysis...

	M.
  
Yu Zhao Feb. 23, 2023, 9:18 a.m. UTC | #9
On Thu, Feb 23, 2023 at 2:03 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 23 Feb 2023 03:58:47 +0000,
> Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Fri, 17 Feb 2023 04:21:28 +0000,
> > > Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > > > in KVM page tables.
> > >
> > > I'm really interested in how you can back this statement. 90% of the
> > > HW I have access to is not FEAT_HWAFDB capable, either because it
> > > predates the feature or because the feature is too buggy to be useful.
> >
> > This is my experience too -- most devices are pre-v8.2.
>
> And yet you have no issue writing the above. Puzzling.

That's to the best of my knowledge. Mind enlightening me?

> > > Do you have numbers?
> >
> > Let's do a quick market survey by segment. The following only applies
> > to ARM CPUs:
> >
> > 1. Phones: none of the major Android phone vendors sell phones running
> > VMs; no other major Linux phone vendors.
>
> Maybe you should have a reality check and look at what your own
> employer is shipping.

Which model? I'll look it up and see how/why I missed it.

> > 2. Laptops: only a very limited number of Chromebooks run VMs, namely
> > ACRVM. No other major Linux laptop vendors.
>
> Again, your employer disagrees.

What do you mean? Sorry, I'm a little surprised here... I do know *a
lot* about Chromebooks.

> > 3. Desktops: no major Linux desktop vendors.
>
> My desktop disagrees (I'm sending this from my arm64 desktop VM).

A model number please?

> > 4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
> > can be a VM guest on QNX host).
>
> This email is brought to you via a router VM on an arm64 box.

More details?

> > 5. Cloud: this is where the vast majority of VMs come from. Among the
> > vendors available to the general public, Ampere is the biggest player.
> > Here [1] is a list of its customers. The A-bit works well even on its
> > EVT products (Neoverse cores).
>
> Just the phone stuff dwarfs the number of cloud hosts.

Please point me to something that I can work on so that I won't
sound so ignorant next time.
  

Patch

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..572bcd321586 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1031,4 +1031,11 @@  static inline void kvm_hyp_reserve(void) { }
 void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
 bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+	return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
+}
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 63f81b27a4e3..8c9a04388c88 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -105,6 +105,7 @@  static inline bool kvm_level_supports_block_mapping(u32 level)
  * @put_page:			Decrement the refcount on a page. When the
  *				refcount reaches 0 the page is automatically
  *				freed.
+ * @put_page_rcu:		RCU variant of put_page().
  * @page_count:			Return the refcount of a page.
  * @phys_to_virt:		Convert a physical address into a virtual
  *				address	mapped in the current context.
@@ -122,6 +123,7 @@  struct kvm_pgtable_mm_ops {
 	void		(*free_removed_table)(void *addr, u32 level);
 	void		(*get_page)(void *addr);
 	void		(*put_page)(void *addr);
+	void		(*put_page_rcu)(void *addr);
 	int		(*page_count)(void *addr);
 	void*		(*phys_to_virt)(phys_addr_t phys);
 	phys_addr_t	(*virt_to_phys)(void *addr);
@@ -188,6 +190,12 @@  typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  *					children.
  * @KVM_PGTABLE_WALK_SHARED:		Indicates the page-tables may be shared
  *					with other software walkers.
+ *
+ * kvm_arch_test_clear_young() is a special case. It relies on two
+ * techniques, RCU and cmpxchg, to safely test and clear the accessed
+ * bit without taking the MMU lock. The former protects KVM page tables
+ * from being freed while the latter clears the accessed bit atomically
+ * against both the hardware and other software page table walkers.
  */
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index c8dca8ae359c..350437661d4b 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -30,4 +30,47 @@ 
  */
 #define kvm_mmu_cache_min_pages(kvm)	(kvm_stage2_levels(kvm) - 1)
 
+#define KVM_PTE_TYPE			BIT(1)
+#define KVM_PTE_TYPE_BLOCK		0
+#define KVM_PTE_TYPE_PAGE		1
+#define KVM_PTE_TYPE_TABLE		1
+
+#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID		1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED		BIT(10)
+
 #endif	/* __ARM64_S2_PGTABLE_H_ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9c5573bc4614..6770bc47f5c9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@  vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	kvm_free_stage2_pgd(&kvm->arch.mmu);
 	bitmap_free(kvm->arch.pmu_filter);
 	free_cpumask_var(kvm->arch.supported_cpus);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b11cf2c618a6..8d65ee4767f1 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -12,49 +12,6 @@ 
 #include <asm/stage2_pgtable.h>
 
 
-#define KVM_PTE_TYPE			BIT(1)
-#define KVM_PTE_TYPE_BLOCK		0
-#define KVM_PTE_TYPE_PAGE		1
-#define KVM_PTE_TYPE_TABLE		1
-
-#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	1
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 51)
-
-#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
-					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
-					 KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID		1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED		BIT(10)
-
 struct kvm_pgtable_walk_data {
 	struct kvm_pgtable_walker	*walker;
 
@@ -994,8 +951,12 @@  static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
 					       kvm_granule_size(ctx->level));
 
-	if (childp)
-		mm_ops->put_page(childp);
+	if (childp) {
+		if (mm_ops->put_page_rcu)
+			mm_ops->put_page_rcu(childp);
+		else
+			mm_ops->put_page(childp);
+	}
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a3ee3b605c9b..761fffc788f5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -171,6 +171,21 @@  static int kvm_host_page_count(void *addr)
 	return page_count(virt_to_page(addr));
 }
 
+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+	put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+	struct page *page = virt_to_page(addr);
+
+	if (kvm_host_page_count(addr) == 1)
+		kvm_account_pgtable_pages(addr, -1);
+
+	call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
 static phys_addr_t kvm_host_pa(void *addr)
 {
 	return __pa(addr);
@@ -684,6 +699,7 @@  static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.free_removed_table	= stage2_free_removed_table,
 	.get_page		= kvm_host_get_page,
 	.put_page		= kvm_s2_put_page,
+	.put_page_rcu		= kvm_s2_put_page_rcu,
 	.page_count		= kvm_host_page_count,
 	.phys_to_virt		= kvm_host_va,
 	.virt_to_phys		= kvm_host_pa,
@@ -1624,6 +1640,66 @@  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	return pte_valid(pte) && pte_young(pte);
 }
 
+struct test_clear_young_arg {
+	struct kvm_gfn_range *range;
+	gfn_t lsb_gfn;
+	unsigned long *bitmap;
+};
+
+static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
+				   enum kvm_pgtable_walk_flags flags)
+{
+	struct test_clear_young_arg *arg = ctx->arg;
+	gfn_t gfn = ctx->addr / PAGE_SIZE;
+	kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+
+	VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
+	VM_WARN_ON_ONCE(gfn < arg->range->start || gfn >= arg->range->end);
+
+	if (!kvm_pte_valid(new))
+		return 0;
+
+	if (new == ctx->old)
+		return 0;
+
+	/* see the comments on the generic kvm_arch_has_test_clear_young() */
+	if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
+		cmpxchg64(ctx->ptep, ctx->old, new);
+
+	return 0;
+}
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+			       gfn_t lsb_gfn, unsigned long *bitmap)
+{
+	u64 start = range->start * PAGE_SIZE;
+	u64 end = range->end * PAGE_SIZE;
+	struct test_clear_young_arg arg = {
+		.range		= range,
+		.lsb_gfn	= lsb_gfn,
+		.bitmap		= bitmap,
+	};
+	struct kvm_pgtable_walker walker = {
+		.cb		= stage2_test_clear_young,
+		.arg		= &arg,
+		.flags		= KVM_PGTABLE_WALK_LEAF,
+	};
+
+	BUILD_BUG_ON(is_hyp_code());
+
+	if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+		return false;
+
+	/* see the comments on kvm_pgtable_walk_flags */
+	rcu_read_lock();
+
+	kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
+
+	rcu_read_unlock();
+
+	return true;
+}
+
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	if (!kvm->arch.mmu.pgt)
@@ -1848,7 +1924,6 @@  void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_free_stage2_pgd(&kvm->arch.mmu);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,