[v7,11/11] KVM: xen: allow vcpu_info content to be 'safely' copied

Message ID 20231002095740.1472907-12-paul@xen.org
State New
Headers
Series KVM: xen: update shared_info and vcpu_info handling |

Commit Message

Paul Durrant Oct. 2, 2023, 9:57 a.m. UTC
  From: Paul Durrant <pdurrant@amazon.com>

If the guest sets an explicit vcpu_info GPA then, for any of the first 32
vCPUs, the content of the default vcpu_info in the shared_info page must be
copied into the new location. Because this copy may race with event
delivery (which updates the 'evtchn_pending_sel' field in vcpu_info) we
need a way to defer that until the copy is complete.
Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen
that is used in atomic context if the vcpu_info PFN cache has been
invalidated so that the update of vcpu_info can be deferred until the
cache can be refreshed (on vCPU thread's the way back into guest context).
So let's also use this shadow if the vcpu_info cache has been
*deactivated*, so that the VMM can safely copy the vcpu_info content and
then re-activate the cache with the new GPA. To do this, all we need to do
is stop considering an inactive vcpu_info cache as a hard error in
kvm_xen_set_evtchn_fast().

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org

v6:
 - New in this version.
---
 arch/x86/kvm/xen.c | 3 ---
 1 file changed, 3 deletions(-)
  

Comments

Sean Christopherson Oct. 31, 2023, 11:58 p.m. UTC | #1
On Mon, Oct 02, 2023, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@amazon.com>
> 
> If the guest sets an explicit vcpu_info GPA then, for any of the first 32
> vCPUs, the content of the default vcpu_info in the shared_info page must be
> copied into the new location. Because this copy may race with event
> delivery (which updates the 'evtchn_pending_sel' field in vcpu_info) we
> need a way to defer that until the copy is complete.

Nit, add a blank link between paragraphs.

> Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen
> that is used in atomic context if the vcpu_info PFN cache has been
> invalidated so that the update of vcpu_info can be deferred until the
> cache can be refreshed (on vCPU thread's the way back into guest context).
> So let's also use this shadow if the vcpu_info cache has been
> *deactivated*, so that the VMM can safely copy the vcpu_info content and
> then re-activate the cache with the new GPA. To do this, all we need to do
> is stop considering an inactive vcpu_info cache as a hard error in
> kvm_xen_set_evtchn_fast().

Please, please try to write changelogs that adhere to the preferred style.  I
get that the preferred style likely doesn't align with what you're used to, but
the preferred style really doesn't help me get through reviews quicker.

> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index aafc794940e4..e645066217bb 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -1606,9 +1606,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
>  		WRITE_ONCE(xe->vcpu_idx, vcpu->vcpu_idx);
>  	}
>  
> -	if (!vcpu->arch.xen.vcpu_info_cache.active)
> -		return -EINVAL;
> -

Hmm, maybe move this check after the "hard" error checks and explicitly do:

		return -EWOULDBLOCK

That way it's much more obvious that this patch is safe.  Alternatively, briefly
explain what happens if the cache is invalid in the changelog.


>  	if (xe->port >= max_evtchn_port(kvm))
>  		return -EINVAL;
>  
> -- 
> 2.39.2
>
  
David Woodhouse Nov. 8, 2023, 3:15 p.m. UTC | #2
On Tue, 2023-10-31 at 16:58 -0700, Sean Christopherson wrote:
> 
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index aafc794940e4..e645066217bb 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -1606,9 +1606,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
> >                 WRITE_ONCE(xe->vcpu_idx, vcpu->vcpu_idx);
> >         }
> >   
> > -       if (!vcpu->arch.xen.vcpu_info_cache.active)
> > -               return -EINVAL;
> > -
> 
> Hmm, maybe move this check after the "hard" error checks and explicitly do:
> 
>                 return -EWOULDBLOCK
> 
> That way it's much more obvious that this patch is safe.  Alternatively, briefly
> explain what happens if the cache is invalid in the changelog.


Hm, I thought we dropped that? It doesn't want to return -EWOULDBLOCK.
That would cause a "fast" irqfd delivery to defer to the slow
(workqueue) path, but then the same thing would happen on that path
*too*.

I actually think this check needs to be dropped completely. The
evtchn_pending_sel bit will get set in the *shadow* copy in vcpu-
>arch.xen.evtchn_pending_sel, and will subsequently be set in the real
vcpu_info when that is configured again. We do already handle that case
correctly, I believe:

		if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
			/*
			 * Could not access the vcpu_info. Set the bit in-kernel
			 * and prod the vCPU to deliver it for itself.
			 */
			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
				kick_vcpu = true;
			goto out_rcu;
		}

If we leave this check in place as it is, I think we're going to lose
incoming events (and IPIs) directed at a vCPU while it has its
vcpu_info temporarily removed.

(We do want to "temporarily" remove the vcpu_info so that we can adjust
memslots, because there's no way to atomically break apart a large
memslot and overlay a single page in the middle of it. And also when a
guest is paused for migration, it's nice to be sure that no changes
will happen to the guest memory and trigger sanity checks that the
memory on the destination precisely matches the source.)
  
Sean Christopherson Nov. 8, 2023, 8:55 p.m. UTC | #3
On Wed, Nov 08, 2023, David Woodhouse wrote:
> On Tue, 2023-10-31 at 16:58 -0700, Sean Christopherson wrote:
> > 
> > > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > > index aafc794940e4..e645066217bb 100644
> > > --- a/arch/x86/kvm/xen.c
> > > +++ b/arch/x86/kvm/xen.c
> > > @@ -1606,9 +1606,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
> > >                 WRITE_ONCE(xe->vcpu_idx, vcpu->vcpu_idx);
> > >         }
> > >   
> > > -       if (!vcpu->arch.xen.vcpu_info_cache.active)
> > > -               return -EINVAL;
> > > -
> > 
> > Hmm, maybe move this check after the "hard" error checks and explicitly do:
> > 
> >                 return -EWOULDBLOCK
> > 
> > That way it's much more obvious that this patch is safe.  Alternatively, briefly
> > explain what happens if the cache is invalid in the changelog.
> 
> 
> Hm, I thought we dropped that? It doesn't want to return -EWOULDBLOCK.
> That would cause a "fast" irqfd delivery to defer to the slow
> (workqueue) path, but then the same thing would happen on that path
> *too*.
> 
> I actually think this check needs to be dropped completely. The
> evtchn_pending_sel bit will get set in the *shadow* copy in vcpu-
> >arch.xen.evtchn_pending_sel, and will subsequently be set in the real
> vcpu_info when that is configured again. We do already handle that case
> correctly, I believe:
> 
> 		if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
> 			/*
> 			 * Could not access the vcpu_info. Set the bit in-kernel
> 			 * and prod the vCPU to deliver it for itself.
> 			 */
> 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
> 				kick_vcpu = true;
> 			goto out_rcu;
> 		}
> 
> If we leave this check in place as it is, I think we're going to lose
> incoming events (and IPIs) directed at a vCPU while it has its
> vcpu_info temporarily removed.

Oh, dagnabbit, there are two different gfn_to_pfn_cache objects.  I assumed "gpc"
was the same thing as vcpu_info_cache.active, i.e. thought the explicit active
check here was just to pre-check the validity before acquiring the lock.

	if (!vcpu->arch.xen.vcpu_info_cache.active)
		return -EINVAL;

	rc = -EWOULDBLOCK;

	idx = srcu_read_lock(&kvm->srcu);

	read_lock_irqsave(&gpc->lock, flags);
	if (!kvm_gpc_check(gpc, PAGE_SIZE))  <== I thought this was the same check
		goto out_rcu;

That's why my comment was nonsensical, and also why I was super confused by your
response for a few minutes :-)

Good gravy, this even conditionally switches gpc and makes the out path drop
different locks.  That's just mean.

Any objection to splitting out the vcpu_info chunk to a separate function?  That'd
eliminate the subtle gpc+lock switch, and provide a convenient and noticeable
location to explain why it's ok for setting the vcpu_info to fail.

E.g.

---
 arch/x86/kvm/xen.c | 102 ++++++++++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index d65e89e062e4..cd2021ba0ae3 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1621,6 +1621,57 @@ static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
 	}
 }
 
+static int kvm_xen_needs_a_name(struct kvm_vcpu *vcpu, int port_word_bit)
+{
+	struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
+	bool kick_vcpu = false;
+	unsigned long flags;
+	int r = 0;
+
+	read_lock_irqsave(&gpc->lock, flags);
+	if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
+		/*
+		 * Could not access the vcpu_info. Set the bit in-kernel and
+		 * prod the vCPU to deliver it for itself.
+		 */
+		if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
+			kick_vcpu = true;
+
+		r = -EWOULDBLOCK;
+		goto out;
+	}
+
+	if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode) {
+		struct vcpu_info *vcpu_info = gpc->khva;
+		if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
+			WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
+			kick_vcpu = true;
+		}
+	} else {
+		struct compat_vcpu_info *vcpu_info = gpc->khva;
+		if (!test_and_set_bit(port_word_bit,
+					(unsigned long *)&vcpu_info->evtchn_pending_sel)) {
+			WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
+			kick_vcpu = true;
+		}
+	}
+
+	/* For the per-vCPU lapic vector, deliver it as MSI. */
+	if (kick_vcpu && vcpu->arch.xen.upcall_vector) {
+		kvm_xen_inject_vcpu_vector(vcpu);
+		kick_vcpu = false;
+	}
+
+out:
+	read_unlock_irqrestore(&gpc->lock, flags);
+
+	if (kick_vcpu) {
+		kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
+
+	return r;
+}
 /*
  * The return value from this function is propagated to kvm_set_irq() API,
  * so it returns:
@@ -1638,7 +1689,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 	unsigned long *pending_bits, *mask_bits;
 	unsigned long flags;
 	int port_word_bit;
-	bool kick_vcpu = false;
 	int vcpu_idx, idx, rc;
 
 	vcpu_idx = READ_ONCE(xe->vcpu_idx);
@@ -1660,7 +1710,7 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 
 	read_lock_irqsave(&gpc->lock, flags);
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
-		goto out_rcu;
+		goto out_unlock;
 
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
@@ -1688,52 +1738,16 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 		kvm_xen_check_poller(vcpu, xe->port);
 	} else {
 		rc = 1; /* Delivered to the bitmap in shared_info. */
-		/* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
-		read_unlock_irqrestore(&gpc->lock, flags);
-		gpc = &vcpu->arch.xen.vcpu_info_cache;
-
-		read_lock_irqsave(&gpc->lock, flags);
-		if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-			/*
-			 * Could not access the vcpu_info. Set the bit in-kernel
-			 * and prod the vCPU to deliver it for itself.
-			 */
-			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
-				kick_vcpu = true;
-			goto out_rcu;
-		}
-
-		if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
-			struct vcpu_info *vcpu_info = gpc->khva;
-			if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
-				WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
-				kick_vcpu = true;
-			}
-		} else {
-			struct compat_vcpu_info *vcpu_info = gpc->khva;
-			if (!test_and_set_bit(port_word_bit,
-					      (unsigned long *)&vcpu_info->evtchn_pending_sel)) {
-				WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
-				kick_vcpu = true;
-			}
-		}
-
-		/* For the per-vCPU lapic vector, deliver it as MSI. */
-		if (kick_vcpu && vcpu->arch.xen.upcall_vector) {
-			kvm_xen_inject_vcpu_vector(vcpu);
-			kick_vcpu = false;
-		}
 	}
 
- out_rcu:
+ out_unlock:
 	read_unlock_irqrestore(&gpc->lock, flags);
+
+	/* Paul to add comment explaining why it's ok kvm_xen_needs_a_name() fails */
+	if (rc == 1)
+		rc = kvm_xen_needs_a_name(vcpu, port_word_bit);
+
 	srcu_read_unlock(&kvm->srcu, idx);
-
-	if (kick_vcpu) {
-		kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
-		kvm_vcpu_kick(vcpu);
-	}
-
 	return rc;
 }
 

base-commit: 0f34262cf6fab8163456c7740c7ee1888a8af93c
--
  
David Woodhouse Nov. 8, 2023, 9:56 p.m. UTC | #4
On Wed, 2023-11-08 at 12:55 -0800, Sean Christopherson wrote:
> 
> Any objection to splitting out the vcpu_info chunk to a separate function?  That'd
> eliminate the subtle gpc+lock switch, and provide a convenient and noticeable
> location to explain why it's ok for setting the vcpu_info to fail.

No objection, that looks entirely sane to me.

> E.g.
> 
> ---
>  arch/x86/kvm/xen.c | 102 ++++++++++++++++++++++++++-------------------
>  1 file changed, 58 insertions(+), 44 deletions(-)
> 
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index d65e89e062e4..cd2021ba0ae3 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -1621,6 +1621,57 @@ static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
>         }
>  }
>  
> +static int kvm_xen_needs_a_name(struct kvm_vcpu *vcpu, int port_word_bit)
> 

kvm_xen_deliver_vcpu_evtchn_pending_sel()  ?


...

> - out_rcu:
> + out_unlock:
>         read_unlock_irqrestore(&gpc->lock, flags);
> +
> +       /* Paul to add comment explaining why it's ok kvm_xen_needs_a_name() fails */

Basically the point here is that it *doesn't* fail. Either it delivers
directly to the vcpu_info at the appropriate GPA via the GPC, or it
will set the corresponding bit in the 'shadow'
vcpu->arch.xen.evtchn_pending_sel to be delivered later by
kvm_xen_inject_pending_events(). Either way, it's 'success'.

I do want to vet the code path where userspace forgets to reinstate the
vcpu_info before running the vCPU again. Looking at
kvm_xen_inject_pending_events(), I think the kvm_gpc_check() for the
vcpu_info will fail, so it'll just return, and KVM will enter the vCPU
with vcpu->arch.xen.evtchn_pending_sel still unchanged? The bits won't
get delivered until the next time the vCPU is entered after the
vcpu_info is set.

Which I *think* is fine. Userspace has to stop running the vCPU in
order to actually set the vcpu_info again, when it finally decides to
do so. And if it never does, that works out as well as can be
expected. 

Probably does want adding to the selftest.
  

Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index aafc794940e4..e645066217bb 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1606,9 +1606,6 @@  int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 		WRITE_ONCE(xe->vcpu_idx, vcpu->vcpu_idx);
 	}
 
-	if (!vcpu->arch.xen.vcpu_info_cache.active)
-		return -EINVAL;
-
 	if (xe->port >= max_evtchn_port(kvm))
 		return -EINVAL;