[1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress

Message ID 20230602011518.787006-2-seanjc@google.com
State New
Series KVM: x86: Use "standard" mmu_notifier hook for APIC page

Commit Message

Sean Christopherson June 2, 2023, 1:15 a.m. UTC
Re-request an APIC-access page reload if there is a relevant mmu_notifier
invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
vCPUs until the backing pfn for the APIC-access page is "officially"
stable.  Relying on the primary MMU to not make changes after invoking
->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
would also trigger an ->invalidate_range(), but using ->invalidate_range()
to fudge around KVM not honoring past and in-progress invalidations is a
bit hacky.

Honoring invalidations will allow using KVM's standard mmu_notifier hooks
to detect APIC-access page reloads, which will in turn allow removing
KVM's implementation of ->invalidate_range() (the APIC-access page case is
a true one-off).

Opportunistically add a comment to explain why doing nothing if a memslot
isn't found is functionally correct.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 50 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 5 deletions(-)
  

Comments

Alistair Popple June 6, 2023, 2:11 a.m. UTC | #1
Thanks for doing this. I'm not overly familiar with the KVM implementation
but am familiar with mmu notifiers, so I read through the KVM usage. Looks
like KVM is sort of doing something similar to what mmu_interval_notifiers
do, and I wonder if some of that could be shared. Anyway, I didn't spot
anything wrong here, so feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Sean Christopherson <seanjc@google.com> writes:

> Re-request an APIC-access page reload if there is a relevant mmu_notifier
> invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
> vCPUs until the backing pfn for the APIC-access page is "officially"
> stable.  Relying on the primary MMU to not make changes after invoking
> ->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
> would also trigger an ->invalidate_range(), but using ->invalidate_range()
> to fudge around KVM not honoring past and in-progress invalidations is a
> bit hacky.
>
> Honoring invalidations will allow using KVM's standard mmu_notifier hooks
> to detect APIC-access page reloads, which will in turn allow removing
> KVM's implementation of ->invalidate_range() (the APIC-access page case is
> a true one-off).
>
> Opportunistically add a comment to explain why doing nothing if a memslot
> isn't found is functionally correct.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 50 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 45 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 44fb619803b8..59195f0dc7a5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6708,7 +6708,12 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
>  
>  static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  {
> -	struct page *page;
> +	const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_memslots *slots = kvm_memslots(kvm);
> +	struct kvm_memory_slot *slot;
> +	unsigned long mmu_seq;
> +	kvm_pfn_t pfn;
>  
>  	/* Defer reload until vmcs01 is the current VMCS. */
>  	if (is_guest_mode(vcpu)) {
> @@ -6720,18 +6725,53 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  	    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
>  		return;
>  
> -	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> -	if (is_error_page(page))
> +	/*
> +	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
> +	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
> +	 * on the pfn lookup's validation of the memslot to ensure a valid hva
> +	 * is used for the retry check.
> +	 */
> +	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
> +	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
>  		return;
>  
> -	vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +	/*
> +	 * Ensure that the mmu_notifier sequence count is read before KVM
> +	 * retrieves the pfn from the primary MMU.  Note, the memslot is
> +	 * protected by SRCU, not the mmu_notifier.  Pairs with the smp_wmb()
> +	 * in kvm_mmu_invalidate_end().
> +	 */
> +	mmu_seq = kvm->mmu_invalidate_seq;
> +	smp_rmb();
> +
> +	/*
> +	 * No need to retry if the memslot does not exist or is invalid.  KVM
> +	 * controls the APIC-access page memslot, and only deletes the memslot
> +	 * if APICv is permanently inhibited, i.e. the memslot won't reappear.
> +	 */
> +	pfn = gfn_to_pfn_memslot(slot, gfn);
> +	if (is_error_noslot_pfn(pfn))
> +		return;
> +
> +	read_lock(&vcpu->kvm->mmu_lock);
> +	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> +				     gfn_to_hva_memslot(slot, gfn))) {
> +		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
> +		read_unlock(&vcpu->kvm->mmu_lock);
> +		goto out;
> +	}
> +
> +	vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn));
> +	read_unlock(&vcpu->kvm->mmu_lock);
> +
>  	vmx_flush_tlb_current(vcpu);
>  
> +out:
>  	/*
>  	 * Do not pin apic access page in memory, the MMU notifier
>  	 * will call us again if it is migrated or swapped out.
>  	 */
> -	put_page(page);
> +	kvm_release_pfn_clean(pfn);
>  }
>  
>  static void vmx_hwapic_isr_update(int max_isr)
  
Sean Christopherson June 6, 2023, 5:22 p.m. UTC | #2
On Tue, Jun 06, 2023, Alistair Popple wrote:
> 
> Thanks for doing this. I'm not overly familiar with the KVM implementation
> but am familiar with mmu notifiers, so I read through the KVM usage. Looks
> like KVM is sort of doing something similar to what mmu_interval_notifiers
> do, and I wonder if some of that could be shared.

They're very, very similar.  At a glance, KVM likely could be moved over to the
common interval tree implementation, but it would require a substantial amount of
work in KVM, and various extensions to the mmu notifiers too.  I definitely wouldn't
be opposed to converging KVM if someone wants to put in the effort, but at the same
time I wouldn't recommend that anyone actually do the work as the ROI is likely
very low.
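
For anyone unfamiliar, the interval-notifier flavor of the pattern looks
roughly like this (a condensed sketch of the usage example in
Documentation/mm/hmm.rst; the driver lock, ops struct, and the fault step
are placeholders):

	struct mmu_interval_notifier mni;
	unsigned long seq;

	/* Subscribe to invalidations of [start, start + length) in this mm. */
	mmu_interval_notifier_insert(&mni, mm, start, length, &mni_ops);

again:
	seq = mmu_interval_read_begin(&mni);

	/* Fault in / translate the pages, e.g. via hmm_range_fault(). */

	spin_lock(&driver_lock);
	if (mmu_interval_read_retry(&mni, seq)) {
		/* Raced with an invalidation, redo the translation. */
		spin_unlock(&driver_lock);
		goto again;
	}
	/* ... install the device/secondary mapping under the lock ... */
	spin_unlock(&driver_lock);

KVM's mmu_invalidate_seq + mmu_invalidate_retry_hva() is the same
begin/retry dance, just open-coded over a single VM-wide sequence count
instead of a per-range subscription.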
  
Jason Gunthorpe June 6, 2023, 5:39 p.m. UTC | #3
On Tue, Jun 06, 2023 at 10:22:14AM -0700, Sean Christopherson wrote:
> On Tue, Jun 06, 2023, Alistair Popple wrote:
> > 
> > Thanks for doing this. I'm not overly familiar with the KVM implementation
> > but am familiar with mmu notifiers, so I read through the KVM usage. Looks
> > like KVM is sort of doing something similar to what mmu_interval_notifiers
> > do, and I wonder if some of that could be shared.
> 
> They're very, very similar.  At a glance, KVM likely could be moved over to the
> common interval tree implementation, but it would require a substantial amount of
> work in KVM, and various extensions to the mmu notifiers too.  I definitely wouldn't
> be opposed to converging KVM if someone wants to put in the effort, but at the same
> time I wouldn't recommend that anyone actually do the work as the ROI is likely
> very low.

I looked at it lightly a long time ago and came to sort of the same
conclusion; it might even be a negative for KVM as it isn't quite as
tightly inlined and lock-free.

Jason
  
Yu Zhang June 7, 2023, 7:40 a.m. UTC | #4
On Thu, Jun 01, 2023 at 06:15:16PM -0700, Sean Christopherson wrote:
> Re-request an APIC-access page reload if there is a relevant mmu_notifier
> invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
> vCPUs until the backing pfn for the APIC-access page is "officially"
> stable.  Relying on the primary MMU to not make changes after invoking
> ->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
> would also trigger an ->invalidate_range(), but using ->invalidate_range()
> to fudge around KVM not honoring past and in-progress invalidations is a
> bit hacky.
> 
> Honoring invalidations will allow using KVM's standard mmu_notifier hooks
> to detect APIC-access page reloads, which will in turn allow removing
> KVM's implementation of ->invalidate_range() (the APIC-access page case is
> a true one-off).
> 
> Opportunistically add a comment to explain why doing nothing if a memslot
> isn't found is functionally correct.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 50 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 44fb619803b8..59195f0dc7a5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6708,7 +6708,12 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
>  
>  static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  {
> -	struct page *page;
> +	const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_memslots *slots = kvm_memslots(kvm);
> +	struct kvm_memory_slot *slot;
> +	unsigned long mmu_seq;
> +	kvm_pfn_t pfn;
>  
>  	/* Defer reload until vmcs01 is the current VMCS. */
>  	if (is_guest_mode(vcpu)) {
> @@ -6720,18 +6725,53 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  	    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
>  		return;
>  
> -	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> -	if (is_error_page(page))
> +	/*
> +	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
> +	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
> +	 * on the pfn lookup's validation of the memslot to ensure a valid hva
> +	 * is used for the retry check.
> +	 */
> +	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
> +	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
>  		return;
>  
> -	vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +	/*
> +	 * Ensure that the mmu_notifier sequence count is read before KVM
> +	 * retrieves the pfn from the primary MMU.  Note, the memslot is
> +	 * protected by SRCU, not the mmu_notifier.  Pairs with the smp_wmb()
> +	 * in kvm_mmu_invalidate_end().
> +	 */
> +	mmu_seq = kvm->mmu_invalidate_seq;
> +	smp_rmb();
> +
> +	/*
> +	 * No need to retry if the memslot does not exist or is invalid.  KVM
> +	 * controls the APIC-access page memslot, and only deletes the memslot
> +	 * if APICv is permanently inhibited, i.e. the memslot won't reappear.
> +	 */
> +	pfn = gfn_to_pfn_memslot(slot, gfn);
> +	if (is_error_noslot_pfn(pfn))
> +		return;
> +
> +	read_lock(&vcpu->kvm->mmu_lock);
> +	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> +				     gfn_to_hva_memslot(slot, gfn))) {
> +		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

Are the mmu_invalidate_retry_hva() and the following request meant to stall
the vCPU if there's an ongoing invalidation?

If yes, may I ask how the vCPU would be stalled?

Below are my understandings and confusions about this process. I must have
missed something, waiting to be educated... :) 

When the backing page of the APIC access page is to be reclaimed:
1> kvm_mmu_notifier_invalidate_range_start() -> __kvm_handle_hva_range() will
increment kvm->mmu_invalidate_in_progress and account the start/end of this
page in kvm_mmu_invalidate_begin().
2> And then kvm_unmap_gfn_range() will zap the TDP, and send the request,
KVM_REQ_APIC_PAGE_RELOAD, to all vCPUs.
3> While a vCPU tries to reload the APIC access page before entering the guest,
kvm->mmu_invalidate_in_progress will be checked in mmu_invalidate_retry_hva(),
but it is possible (or is it?) that kvm->mmu_invalidate_in_progress is still
positive, so KVM_REQ_APIC_PAGE_RELOAD is set again. No reload, and no TLB flush.
4> So what if the vCPU resumes with KVM_REQ_APIC_PAGE_RELOAD & KVM_REQ_TLB_FLUSH
flags being set, yet the APIC access page is not reloaded and the TLB is not
flushed? Or will this happen at all?

One more dumb question - why does KVM not just pin the APIC access page? 

> +		read_unlock(&vcpu->kvm->mmu_lock);
> +		goto out;
> +	}
> +
> +	vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn));
> +	read_unlock(&vcpu->kvm->mmu_lock);
> +
>  	vmx_flush_tlb_current(vcpu);
>  
> +out:
>  	/*
>  	 * Do not pin apic access page in memory, the MMU notifier
>  	 * will call us again if it is migrated or swapped out.
>  	 */
> -	put_page(page);
> +	kvm_release_pfn_clean(pfn);
>  }
>  
>  static void vmx_hwapic_isr_update(int max_isr)
> -- 
> 2.41.0.rc2.161.g9c6817b8e7-goog
> 

B.R.
Yu
  
Sean Christopherson June 7, 2023, 2:30 p.m. UTC | #5
On Wed, Jun 07, 2023, yu.c.zhang@linux.intel.com wrote:
> On Thu, Jun 01, 2023 at 06:15:16PM -0700, Sean Christopherson wrote:
> > +	pfn = gfn_to_pfn_memslot(slot, gfn);
> > +	if (is_error_noslot_pfn(pfn))
> > +		return;
> > +
> > +	read_lock(&vcpu->kvm->mmu_lock);
> > +	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> > +				     gfn_to_hva_memslot(slot, gfn))) {
> > +		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
> 
> Are the mmu_invalidate_retry_hva() and the following request meant to stall
> the vCPU if there's an ongoing invalidation?

Yep.
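
The check itself boils down to roughly the following (a sketch of
mmu_invalidate_retry_hva() from include/linux/kvm_host.h; paraphrased, the
exact code may differ slightly):

	static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
						   unsigned long mmu_seq,
						   unsigned long hva)
	{
		lockdep_assert_held(&kvm->mmu_lock);

		/*
		 * An in-progress invalidation that covers the hva means the
		 * pfn just retrieved from the primary MMU can't be trusted.
		 */
		if (unlikely(kvm->mmu_invalidate_in_progress) &&
		    hva >= kvm->mmu_invalidate_range_start &&
		    hva < kvm->mmu_invalidate_range_end)
			return 1;

		/* Any invalidation that completed bumped the sequence count. */
		if (kvm->mmu_invalidate_seq != mmu_seq)
			return 1;

		return 0;
	}

Returning 1 is what triggers the KVM_REQ_APIC_PAGE_RELOAD re-request.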

> If yes, may I ask how the vCPU would be stalled?
> 
> Below are my understandings and confusions about this process. I must have
> missed something, waiting to be educated... :) 
> 
> When the backing page of the APIC access page is to be reclaimed:
> 1> kvm_mmu_notifier_invalidate_range_start() -> __kvm_handle_hva_range() will
> increment kvm->mmu_invalidate_in_progress and account the start/end of this
> page in kvm_mmu_invalidate_begin().
> 2> And then kvm_unmap_gfn_range() will zap the TDP, and send the request,
> KVM_REQ_APIC_PAGE_RELOAD, to all vCPUs.
> 3> While a vCPU tries to reload the APIC access page before entering the guest,
> kvm->mmu_invalidate_in_progress will be checked in mmu_invalidate_retry_hva(),
> but it is possible (or is it?) that kvm->mmu_invalidate_in_progress is still
> positive, so KVM_REQ_APIC_PAGE_RELOAD is set again. No reload, and no TLB flush.
> 4> So what if the vCPU resumes with KVM_REQ_APIC_PAGE_RELOAD & KVM_REQ_TLB_FLUSH
> flags being set, yet the APIC access page is not reloaded and the TLB is not
> flushed? Or will this happen at all?

Pending requests block KVM from actually entering the guest.  If a request comes
in after vcpu_enter_guest()'s initial handling of requests, KVM will bail before
VM-Enter and go back through the entire "outer" run loop.

This isn't necessarily the most efficient way to handle the stall, e.g. KVM does
a fair bit of prep for VM-Enter before detecting the pending request.  The
alternative would be to have kvm_vcpu_reload_apic_access_page() return value
instructing vcpu_enter_guest() whether to bail immediately or continue on.  I
elected for the re-request approach because (a) it didn't require redefining the
kvm_x86_ops vendor hook, (b) this should be a rare situation and not performance
critical overall, and (c) there's no guarantee that bailing immediately would
actually yield better performance from the guest's perspective, e.g. if there are
other pending requests/work, then KVM can handle those items while the vCPU
is stalled instead of waiting until the invalidation completes to proceed.

> One more dumb question - why does KVM not just pin the APIC access page?

Definitely not a dumb question, I asked myself the same thing multiple times when
looking at this :-)  Pinning the page would be easier, and KVM actually did that
in the original implementation.  The issue is in how KVM allocates the backing
page.  It's not a traditional kernel allocation, but is instead anonymous memory
allocated by way of vm_mmap(), i.e. for all intents and purposes it's a user
allocation.  That means the kernel expects it to be a regular movable page, e.g.
it's entirely possible the page (if it were pinned) could be the only page in a
2MiB chunk preventing the kernel from migrating/compacting and creating a hugepage.
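
Concretely, the backing memslot/page is set up along these lines (a
condensed sketch of kvm_alloc_apic_access_page() and
__x86_set_memory_region(); error handling elided, details from memory):

	/* Install KVM's internal memslot covering the APIC base GPA... */
	__x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
				APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);

	/* ...whose backing hva is plain anonymous memory, i.e. movable pages. */
	hva = vm_mmap(NULL, 0, size, PROT_READ | PROT_WRITE,
		      MAP_SHARED | MAP_ANONYMOUS, 0);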

In hindsight, I'm not entirely convinced that unpinning the page was the right
choice, as it resulted in a handful of nasty bugs.  But, now that we've fixed all
those bugs (knock wood), there's no good argument for undoing all of that work.
Because while the code is subtle and requires hooks in a few paths, it's not *that*
complex and for the most part doesn't require active maintenance.

static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
	if (kvm_request_pending(vcpu)) {  <= check if any request is pending

		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
			kvm_vcpu_reload_apic_access_page(vcpu); <= re-requests APIC_PAGE_RELOAD

                ...
	}

	...

	preempt_disable();

	static_call(kvm_x86_prepare_switch_to_guest)(vcpu);

        <host => guest bookkeeping>

	if (kvm_vcpu_exit_request(vcpu)) {  <= detects the new pending request
		vcpu->mode = OUTSIDE_GUEST_MODE;
		smp_wmb();
		local_irq_enable();
		preempt_enable();
		kvm_vcpu_srcu_read_lock(vcpu);
		r = 1;
		goto cancel_injection;  <= bails from actually entering the guest
	}

	if (req_immediate_exit) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		static_call(kvm_x86_request_immediate_exit)(vcpu);
	}

	for (;;) {
		<inner run / VM-Enter loop>
	}

	<VM-Exit path>

	r = static_call(kvm_x86_handle_exit)(vcpu, exit_fastpath);
	return r;

cancel_injection:
	if (req_immediate_exit)
		kvm_make_request(KVM_REQ_EVENT, vcpu);
	static_call(kvm_x86_cancel_injection)(vcpu);
	if (unlikely(vcpu->arch.apic_attention))
		kvm_lapic_sync_from_vapic(vcpu);
out:
	return r;
}
  
Yu Zhang June 7, 2023, 5:23 p.m. UTC | #6
> Pending requests block KVM from actually entering the guest.  If a request comes
> in after vcpu_enter_guest()'s initial handling of requests, KVM will bail before
> VM-Enter and go back through the entire "outer" run loop.
> 
> This isn't necessarily the most efficient way to handle the stall, e.g. KVM does
> a fair bit of prep for VM-Enter before detecting the pending request.  The
> alternative would be to have kvm_vcpu_reload_apic_access_page() return a value
> instructing vcpu_enter_guest() whether to bail immediately or continue on.  I
> elected for the re-request approach because (a) it didn't require redefining the
> kvm_x86_ops vendor hook, (b) this should be a rare situation and not performance
> critical overall, and (c) there's no guarantee that bailing immediately would
> actually yield better performance from the guest's perspective, e.g. if there are
> other pending requests/work, then KVM can handle those items while the vCPU
> is stalled instead of waiting until the invalidation completes to proceed.
> 

Wah! Thank you so much! Especially for the code snippets below! :)

> > One more dumb question - why does KVM not just pin the APIC access page?
> 
> Definitely not a dumb question, I asked myself the same thing multiple times when
> looking at this :-)  Pinning the page would be easier, and KVM actually did that
> in the original implementation.  The issue is in how KVM allocates the backing
> page.  It's not a traditional kernel allocation, but is instead anonymous memory
> allocated by way of vm_mmap(), i.e. for all intents and purposes it's a user
> allocation.  That means the kernel expects it to be a regular movable page, e.g.
> it's entirely possible the page (if it were pinned) could be the only page in a
> 2MiB chunk preventing the kernel from migrating/compacting and creating a hugepage.
> 
> In hindsight, I'm not entirely convinced that unpinning the page was the right
> choice, as it resulted in a handful of nasty bugs.  But, now that we've fixed all
> those bugs (knock wood), there's no good argument for undoing all of that work.
> Because while the code is subtle and requires hooks in a few paths, it's not *that*
> complex and for the most part doesn't require active maintenance.
> 

Thanks again! One more thing that bothers me when reading the mmu notifier
code is the TLB flush request. After the APIC access page is reloaded, the
TLB will be flushed (a single-context EPT invalidation on not-so-outdated
CPUs) in vmx_set_apic_access_page_addr(). But the mmu notifier will send
KVM_REQ_TLB_FLUSH as well, via kvm_mmu_notifier_invalidate_range_start() ->
__kvm_handle_hva_range(), therefore causing the vCPU to trigger another TLB
flush - normally a global EPT invalidation, I guess.

But, is this necessary?

Could we try to return false in kvm_unmap_gfn_range() to indicate no more
flush is needed, if the range to be unmapped falls within the guest APIC base,
and leave the TLB invalidation work to vmx_set_apic_access_page_addr()?

But there are multiple places in vmx_set_apic_access_page_addr() that return
early (e.g., if xAPIC mode is disabled for this vCPU) with no TLB flush being
triggered, so I am not sure if doing so would cause more problems... Any comment?
Thanks!

B.R.
Yu
  
Sean Christopherson June 7, 2023, 5:53 p.m. UTC | #7
On Thu, Jun 08, 2023, Yu Zhang wrote:
> Thanks again! One more thing that bothers me when reading the mmu notifier
> code is the TLB flush request. After the APIC access page is reloaded, the
> TLB will be flushed (a single-context EPT invalidation on not-so-outdated
> CPUs) in vmx_set_apic_access_page_addr(). But the mmu notifier will send
> KVM_REQ_TLB_FLUSH as well, via kvm_mmu_notifier_invalidate_range_start() ->
> __kvm_handle_hva_range(), therefore causing the vCPU to trigger another TLB
> flush - normally a global EPT invalidation, I guess.

Yes.

> But, is this necessary?

Flushing when KVM zaps SPTEs is definitely necessary.  But the flush in
vmx_set_apic_access_page_addr() *should* be redundant.

> Could we try to return false in kvm_unmap_gfn_range() to indicate no more
> flush is needed, if the range to be unmapped falls within the guest APIC base,
> and leave the TLB invalidation work to vmx_set_apic_access_page_addr()?

No, because vmx_flush_tlb_current(), a.k.a. KVM_REQ_TLB_FLUSH_CURRENT, flushes
only the current root, i.e. on the current EP4TA.  kvm_unmap_gfn_range() isn't
tied to a single vCPU and so needs to flush all roots.  We could in theory more
precisely track which roots needs to be flushed, but in practice it's highly
unlikely to matter as there is typically only one "main" root when TDP (EPT) is
in use.  In other words, KVM could avoid unnecessarily flushing entries for other
roots, but it would incur non-trivial complexity, and the probability of the
precise flushing having a measurable impact on guest performance is quite low, at
least outside of nested scenarios.
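
For reference, vmx_flush_tlb_current() is roughly the following (a sketch,
with the nested/VPID details elided):

	u64 root_hpa = vcpu->arch.mmu->root.hpa;

	/* No flush required if the current context is invalid. */
	if (!VALID_PAGE(root_hpa))
		return;

	if (enable_ept)		/* single-context invalidation, this EP4TA only */
		ept_sync_context(construct_eptp(vcpu, root_hpa,
						vcpu->arch.mmu->root_role.level));
	else
		vpid_sync_context(vmx_get_current_vpid(vcpu));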

But as above, flushing in vmx_set_apic_access_page_addr() shouldn't be necessary.
If there were SPTEs, then KVM would already have zapped and flushed.  If there
weren't SPTEs, then it should have been impossible for the guest to have valid
TLB entries.  KVM needs to flush when VIRTUALIZE_APIC_ACCESSES is toggled on, as
the CPU could have non-vAPIC TLB entries, but that's handled by vmx_set_virtual_apic_mode().

I'll send a follow-up patch to drop the flush from vmx_set_apic_access_page_addr(),
I don't *think* I'm missing an edge case...
  
Yu Zhang June 8, 2023, 7 a.m. UTC | #8
> 
> Flushing when KVM zaps SPTEs is definitely necessary.  But the flush in
> vmx_set_apic_access_page_addr() *should* be redundant.
> 
> > Could we try to return false in kvm_unmap_gfn_range() to indicate no more
> > flush is needed, if the range to be unmapped falls within the guest APIC base,
> > and leave the TLB invalidation work to vmx_set_apic_access_page_addr()?
> 
> No, because vmx_flush_tlb_current(), a.k.a. KVM_REQ_TLB_FLUSH_CURRENT, flushes
> only the current root, i.e. on the current EP4TA.  kvm_unmap_gfn_range() isn't
> tied to a single vCPU and so needs to flush all roots.  We could in theory more
> precisely track which roots need to be flushed, but in practice it's highly
> unlikely to matter as there is typically only one "main" root when TDP (EPT) is
> in use.  In other words, KVM could avoid unnecessarily flushing entries for other
> roots, but it would incur non-trivial complexity, and the probability of the
> precise flushing having a measurable impact on guest performance is quite low, at
> least outside of nested scenarios.

Well, I can understand that the invalidation shall be performed for both the
current EP4TA and the nested EP4TA (EPT02) when the host tries to reclaim a
normal page, because L1 may assign this page to L2. But for the APIC base
address, will L1 map this address into L2?

Also, what if "virtualize APIC accesses" is to be supported in L2, and the backing
page is being reclaimed in L0? I saw nested_get_vmcs12_pages() will check vmcs12
and set the APIC access address in VMCS02, but not sure if this routine will be
triggered by the mmu notifier...

B.R.
Yu

> 
> But as above, flushing in vmx_set_apic_access_page_addr() shouldn't be necessary.
> If there were SPTEs, then KVM would already have zapped and flushed.  If there
> weren't SPTEs, then it should have been impossible for the guest to have valid
> TLB entries.  KVM needs to flush when VIRTUALIZE_APIC_ACCESSES is toggled on, as
> the CPU could have non-vAPIC TLB entries, but that's handled by vmx_set_virtual_apic_mode().
> 
> I'll send a follow-up patch to drop the flush from vmx_set_apic_access_page_addr(),
> I don't *think* I'm missing an edge case...
  
Sean Christopherson June 13, 2023, 7:07 p.m. UTC | #9
On Thu, Jun 08, 2023, Yu Zhang wrote:
> > 
> > Flushing when KVM zaps SPTEs is definitely necessary.  But the flush in
> > vmx_set_apic_access_page_addr() *should* be redundant.
> > 
> > > Could we try to return false in kvm_unmap_gfn_range() to indicate no more
> > > flush is needed, if the range to be unmapped falls within the guest APIC base,
> > > and leave the TLB invalidation work to vmx_set_apic_access_page_addr()?
> > 
> > No, because vmx_flush_tlb_current(), a.k.a. KVM_REQ_TLB_FLUSH_CURRENT, flushes
> > only the current root, i.e. on the current EP4TA.  kvm_unmap_gfn_range() isn't
> > tied to a single vCPU and so needs to flush all roots.  We could in theory more
> > precisely track which roots need to be flushed, but in practice it's highly
> > unlikely to matter as there is typically only one "main" root when TDP (EPT) is
> > in use.  In other words, KVM could avoid unnecessarily flushing entries for other
> > roots, but it would incur non-trivial complexity, and the probability of the
> > precise flushing having a measurable impact on guest performance is quite low, at
> > least outside of nested scenarios.
> 
> Well, I can understand that the invalidation shall be performed for both the
> current EP4TA and the nested EP4TA (EPT02) when the host tries to reclaim a
> normal page, because L1 may assign this page to L2. But for the APIC base
> address, will L1 map this address into L2?

L1 can do whatever it wants.  E.g. L1 could pass its APIC through to L2, in which
case, yes, L1 will map its APIC base into L2.  KVM (as L0) however doesn't support
mapping the APIC-access page into L2.  KVM *could* support utilizing APICv to
accelerate L2 when L1 has done a full APIC passthrough, but AFAIK no one has
requested such support.  Functionally, an APIC passthrough setup for L1=>L2 will
work, but KVM will trap and emulate APIC accesses from L2 instead of utilizing
hardware acceleration.

More commonly, L1 will use APICv for L2 and thus have an APIC-access page for L2,
and KVM will map _that_ page into L2.

> Also, what if "virtualize APIC accesses" is to be supported in L2,

As above, KVM never maps the APIC-access page that KVM (as L0) manages into L2.

> and the backing page is being reclaimed in L0? I saw
> nested_get_vmcs12_pages() will check vmcs12 and set the APIC access address
> in VMCS02, but not sure if this routine will be triggered by the mmu
> notifier...

Pages from vmcs12 that are referenced by physical address in the VMCS are pinned
(where "pinned" means KVM holds a reference to the page) by kvm_vcpu_map().  I.e.
the page will not be migrated, and if userspace unmaps the page, userspace might
break its VM, but that's true for any guest memory that userspace unexpectedly
unmaps, and there won't be any use-after-free issues.
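
For reference, the relevant nested_get_vmcs12_pages() logic is roughly (a
sketch per commit fe1911aa443e; field names from memory):

	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
		struct kvm_host_map *map = &vmx->nested.apic_access_page_map;

		/* kvm_vcpu_map() grabs a page reference, i.e. pins the page. */
		if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->apic_access_addr), map))
			vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(map->pfn));
		/* On failure, KVM clears the control and emulates instead. */
	}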
  
Yu Zhang June 17, 2023, 3:45 a.m. UTC | #10
> 
> > and the backing page is being reclaimed in L0? I saw
> > nested_get_vmcs12_pages() will check vmcs12 and set the APIC access address
> > in VMCS02, but not sure if this routine will be triggered by the mmu
> > notifier...
> 
> Pages from vmcs12 that are referenced by physical address in the VMCS are pinned
> (where "pinned" means KVM holds a reference to the page) by kvm_vcpu_map().  I.e.
> the page will not be migrated, and if userspace unmaps the page, userspace might
> break its VM, but that's true for any guest memory that userspace unexpectedly
> unmaps, and there won't be any use-after-free issues.
> 
Thanks, Sean. 

About the kvm_vcpu_map(), is it necessary for the APIC access address? L0 only
needs to get its pfn, and does not care about the hva or struct page. Could
we just use gfn_to_pfn() to retrieve the pfn, and kvm_release_pfn_clean() to
unpin it later? 

B.R.
Yu
  
Sean Christopherson June 22, 2023, 11:02 p.m. UTC | #11
On Sat, Jun 17, 2023, Yu Zhang wrote:
> > 
> > > and the backing page is being reclaimed in L0? I saw
> > > nested_get_vmcs12_pages() will check vmcs12 and set the APIC access address
> > > in VMCS02, but not sure if this routine will be triggered by the mmu
> > > notifier...
> > 
> > Pages from vmcs12 that are referenced by physical address in the VMCS are pinned
> > (where "pinned" means KVM holds a reference to the page) by kvm_vcpu_map().  I.e.
> > the page will not be migrated, and if userspace unmaps the page, userspace might
> > break its VM, but that's true for any guest memory that userspace unexpectedly
> > unmaps, and there won't be any use-after-free issues.
> > 
> Thanks, Sean. 
> 
> About the kvm_vcpu_map(), is it necessary for the APIC access address?

No, not strictly necessary.

> L0 only needs to get its pfn, and does not care about the hva or struct page. Could
> we just use gfn_to_pfn() to retrieve the pfn, and kvm_release_pfn_clean() to
> unpin it later? 

Yep, that would suffice.  The main reason I converted the APIC access page to use
kvm_vcpu_map()[1] was that it was the easiest way to play nice with struct page memory.

I don't think this is worth doing right now, as the practical downside is really
just that setups that hide memory from the kernel do an unnecessary memremap().
I'd much prefer to "fix" this wart when we (hopefully) overhaul all these APIs[2].

[1] fe1911aa443e ("KVM: nVMX: Use kvm_vcpu_map() to get/pin vmcs12's APIC-access page")
[2] https://lore.kernel.org/all/ZGvUsf7lMkrNDHuE@google.com
  

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..59195f0dc7a5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6708,7 +6708,12 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 
 static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 {
-	struct page *page;
+	const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *slot;
+	unsigned long mmu_seq;
+	kvm_pfn_t pfn;
 
 	/* Defer reload until vmcs01 is the current VMCS. */
 	if (is_guest_mode(vcpu)) {
@@ -6720,18 +6725,53 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 	    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 		return;
 
-	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
-	if (is_error_page(page))
+	/*
+	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
+	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
+	 * on the pfn lookup's validation of the memslot to ensure a valid hva
+	 * is used for the retry check.
+	 */
+	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
+	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return;
 
-	vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+	/*
+	 * Ensure that the mmu_notifier sequence count is read before KVM
+	 * retrieves the pfn from the primary MMU.  Note, the memslot is
+	 * protected by SRCU, not the mmu_notifier.  Pairs with the smp_wmb()
+	 * in kvm_mmu_invalidate_end().
+	 */
+	mmu_seq = kvm->mmu_invalidate_seq;
+	smp_rmb();
+
+	/*
+	 * No need to retry if the memslot does not exist or is invalid.  KVM
+	 * controls the APIC-access page memslot, and only deletes the memslot
+	 * if APICv is permanently inhibited, i.e. the memslot won't reappear.
+	 */
+	pfn = gfn_to_pfn_memslot(slot, gfn);
+	if (is_error_noslot_pfn(pfn))
+		return;
+
+	read_lock(&vcpu->kvm->mmu_lock);
+	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
+				     gfn_to_hva_memslot(slot, gfn))) {
+		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
+		read_unlock(&vcpu->kvm->mmu_lock);
+		goto out;
+	}
+
+	vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn));
+	read_unlock(&vcpu->kvm->mmu_lock);
+
 	vmx_flush_tlb_current(vcpu);
 
+out:
 	/*
 	 * Do not pin apic access page in memory, the MMU notifier
 	 * will call us again if it is migrated or swapped out.
 	 */
-	put_page(page);
+	kvm_release_pfn_clean(pfn);
 }
 
 static void vmx_hwapic_isr_update(int max_isr)