[RFC] KVM: Allow polling vCPUs for events

Message ID 20231001111313.77586-1-nsaenz@amazon.com
State New
Headers
Series [RFC] KVM: Allow polling vCPUs for events |

Commit Message

Nicolas Saenz Julienne Oct. 1, 2023, 11:13 a.m. UTC
  This is a follow up RFC to David Woodhouse's proposal[1] to allow
user-space exits on halt/mwait.

A number of use cases have surfaced where it'd be beneficial to have a
vCPU stop its execution in user-space, as opposed to having it sleep
in-kernel. Be it in order to make better use of the pCPU's time while
the vCPU is halted, or to implement security features like Hyper-V's
VSM.

A problem with this approach is that user-space has no way of knowing
whether the vCPU has pending events (interrupts, timers, etc...), so we
need a new interface to query if they are. poll() turned out to be a
very good fit.

vCPUs being polled are now switched into a new mode, POLLING_FOR_EVENTS.
This mode behaves similar to how OUTSIDE_GUEST_MODE does, except in
kvm_vcpu_kick(), which now wakes up the polling vCPU thread to signal
attention is needed. On wake up the polling thread checks if the pending
requests are relevant for the vCPU (the vCPU might be halted, or it
might be a quiesced VTL vCPU, with different wakeup needs), and if so,
exits back to user-space.

This vCPU mode switch also serves as a synchronization point vs
asynchronous sources of interruptions and it's a benefit versus other
approaches to this problem (for ex. using ioeventfd), which requires
extra synchronization to be viable.

Ultimately, it's up to the code triggering the user-space exit to set
the poll request mask. This allows different exits reasons to be woken
up by different type of events. The request mask is reset upon
re-entering KVM_RUN.

This was tested alongside a Hyper-V VSM PoC that implements Virtual
Trust Level (VTL) handling in user-space by using a distinct vCPU per
VTL. Hence the out-of-tree code in 'hyperv.c'. Note that our approach
requires HVCALL_VTL_RETURN to quiesce the vCPU in user-space, until a
HVCALL_VTL_CALL is performed from a lower VTL, or an interrupt is
targeted at that VTL.

[1] https://lore.kernel.org/lkml/1b52b557beb6606007f7ec5672eab0adf1606a34.camel@infradead.org/

Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
---
 arch/x86/kvm/hyperv.c    |  5 +++++
 include/linux/kvm_host.h |  3 +++
 virt/kvm/kvm_main.c      | 43 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletion(-)
  

Comments

Sean Christopherson Oct. 4, 2023, 11:47 p.m. UTC | #1
On Sun, Oct 01, 2023, Nicolas Saenz Julienne wrote:
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 9436dca9903b..7c12d44486e1 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -63,6 +63,10 @@
>   */
>  #define HV_EXT_CALL_MAX (HV_EXT_CALL_QUERY_CAPABILITIES + 64)
>  
> +#define HV_VTL_RETURN_POLL_MASK                                 \
> +	(BIT_ULL(KVM_REQ_UNBLOCK) | BIT_ULL(KVM_REQ_HV_STIMER) | \
> +		BIT_ULL(KVM_REQ_EVENT))
> +
>  void kvm_tdp_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu, union kvm_mmu_page_role *role)
>  {
>  	//role->vtl = to_kvm_hv(vcpu->kvm)->hv_enable_vsm ? get_active_vtl(vcpu) : 0;
> @@ -3504,6 +3508,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>  		goto hypercall_userspace_exit;
>  	case HVCALL_VTL_RETURN:
>  		vcpu->dump_state_on_run = true;
> +		vcpu->poll_mask = HV_VTL_RETURN_POLL_MASK;
>  		goto hypercall_userspace_exit;
>  	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
>  		if (unlikely(hc.rep_cnt)) {

...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index db106f2e16d8..2985e462ef56 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -238,7 +238,7 @@ static bool kvm_request_needs_ipi(struct kvm_vcpu *vcpu, unsigned req)
>  	 * READING_SHADOW_PAGE_TABLES mode.
>  	 */
>  	if (req & KVM_REQUEST_WAIT)
> -		return mode != OUTSIDE_GUEST_MODE;
> +		return !(mode == OUTSIDE_GUEST_MODE || mode == POLLING_FOR_EVENTS);

This won't work if the vCPU makes a self-request, because kvm_make_vcpu_request()
won't bother sending an IPI if the current pCPU is running the vCPU.  Piggybacking
the IPI logic is unnecessarily convoluted and silly.  More below.

> @@ -3996,6 +4002,39 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +static __poll_t kvm_vcpu_poll(struct file *file, poll_table *wait)
> +{
> +	struct kvm_vcpu *vcpu = file->private_data;
> +
> +	if (!vcpu->poll_mask)
> +		return EPOLLERR;
> +
> +	switch (READ_ONCE(vcpu->mode)) {
> +	case OUTSIDE_GUEST_MODE:
> +		/*
> +		 * Make sure writes to vcpu->request are visible before the
> +		 * mode changes.
> +		 */

Huh?  There are no writes to vcpu->request anywhere in here.

> +		smp_store_mb(vcpu->mode, POLLING_FOR_EVENTS);
> +		break;
> +	case POLLING_FOR_EVENTS:
> +		break;
> +	default:
> +		WARN_ONCE(true, "Trying to poll vCPU %d in mode %d\n",
> +			  vcpu->vcpu_id, vcpu->mode);

This is definitely a user-triggerable WARN.

> +		return EPOLLERR;
> +	}
> +
> +	poll_wait(file, &vcpu->wqh, wait);
> +
> +	if (READ_ONCE(vcpu->requests) & vcpu->poll_mask) {

This effectively makes requests ABI.  The simple mask also means that this will
get false positives on unrelated requests.

In short, whatever mechanism controls the polling needs to be formal uAPI.

> +		WRITE_ONCE(vcpu->mode, OUTSIDE_GUEST_MODE);

This does not look remotely safe on multiple fronts.  For starters, I don't see
anything in the .poll() infrastructure that provides serialization, e.g. if there
are multiple tasks polling then this will be "interesting".

And there is zero chance this is race-free, e.g. nothing prevents the vCPU task
itself from changing vcpu->mode from POLLING_FOR_EVENTS to something else.

Why on earth is this mucking with vcpu->mode?  Ignoring for the moment that using
vcpu->requests as the poll source is never going to happen, there's zero reason
to write vcpu->mode.  From a correctness perspective, AFAICT there's no need for
any shenanigans at all, i.e. kvm_make_vcpu_request() could blindly and unconditionally
call wake_up_interruptible().

I suspect what you want is a fast way to track if there *may* be pollers.  Keying
off and *writing* vcpu->mode makes no sense to me.

I think what you want is something like this, where kvm_vcpu_poll() could use
atomic_fetch_or() and atomic_fetch_andnot() to manipulate vcpu->poll_mask.
Or if we only want to support a single poller at a time, it could be a vanilla
u64.  I suspect getting the poll_mask manipulation correct for multiple pollers
would be tricky, e.g. to avoid false negatives and leave a poller hanging.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..5a260fb3b248 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -259,6 +259,14 @@ static inline bool kvm_kick_many_cpus(struct cpumask *cpus, bool wait)
        return true;
 }
 
+static inline bool kvm_request_is_being_polled(struct kvm_vcpu *vcpu,
+                                              unsigned int req)
+{
+       u32 poll_mask = kvm_request_to_poll_mask(req);
+
+       return (atomic_read(vcpu->poll_mask) & poll_mask)
+}
+
 static void kvm_make_vcpu_request(struct kvm_vcpu *vcpu, unsigned int req,
                                  struct cpumask *tmp, int current_cpu)
 {
@@ -285,6 +293,9 @@ static void kvm_make_vcpu_request(struct kvm_vcpu *vcpu, unsigned int req,
                if (cpu != -1 && cpu != current_cpu)
                        __cpumask_set_cpu(cpu, tmp);
        }
+
+       if (kvm_request_is_being_polled(vcpu, req))
+               wake_up_interruptible(...);
 }
 
 bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
  
Nicolas Saenz Julienne Oct. 5, 2023, 4:27 p.m. UTC | #2
Hi Sean,
Thanks for taking the time to look at this.

On Wed Oct 4, 2023 at 11:47 PM UTC, Sean Christopherson wrote:
[...]
> > @@ -3996,6 +4002,39 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
> >       return 0;
> >  }
> >
> > +static __poll_t kvm_vcpu_poll(struct file *file, poll_table *wait)
> > +{
> > +     struct kvm_vcpu *vcpu = file->private_data;
> > +
> > +     if (!vcpu->poll_mask)
> > +             return EPOLLERR;
> > +
> > +     switch (READ_ONCE(vcpu->mode)) {
> > +     case OUTSIDE_GUEST_MODE:
> > +             /*
> > +              * Make sure writes to vcpu->request are visible before the
> > +              * mode changes.
> > +              */
>
> Huh?  There are no writes to vcpu->request anywhere in here.

My thinking was the vcpu->requests load below could've been speculated
ahead of vcpu->mode's store, this will miss events when first entering
poll().

Since you pointed this out, I thought about it further. There is still
room for a race with the code as is, we need read vcpu->requests only
after poll_wait() returns, so as to make sure concurrent
kvm_make_request()/kvm_vcpu_kick() either wake up poll, or are visible
through the vcpu->requests check that precedes sleeping.

[...]

> > +             WRITE_ONCE(vcpu->mode, OUTSIDE_GUEST_MODE);
>
> This does not look remotely safe on multiple fronts.  For starters, I don't see
> anything in the .poll() infrastructure that provides serialization, e.g. if there
> are multiple tasks polling then this will be "interesting".

Would allowing only one poller be acceptable?

> And there is zero chance this is race-free, e.g. nothing prevents the vCPU task
> itself from changing vcpu->mode from POLLING_FOR_EVENTS to something else.
>
> Why on earth is this mucking with vcpu->mode?  Ignoring for the moment that using
> vcpu->requests as the poll source is never going to happen, there's zero reason

IIUC accessing vcpu->requests in the kvm_vcpu_poll() is out of the
question? Aren't we're forced to do so in order to avoid the race I
mention above.

> to write vcpu->mode.  From a correctness perspective, AFAICT there's no need for
> any shenanigans at all, i.e. kvm_make_vcpu_request() could blindly and unconditionally
> call wake_up_interruptible().

I was fixated with the halt/vtl_return use-cases, where we're either
running the vCPU or polling, and it seemed a decent way to policy
whether calling wake_up_interruptible() is needed. Clearly not the case,
I'll get rid of all the vcpu->mode mucking. :)

> I suspect what you want is a fast way to track if there *may* be pollers.  Keying
> off and *writing* vcpu->mode makes no sense to me.
>
> I think what you want is something like this, where kvm_vcpu_poll() could use
> atomic_fetch_or() and atomic_fetch_andnot() to manipulate vcpu->poll_mask.
> Or if we only want to support a single poller at a time, it could be a vanilla
> u64.  I suspect getting the poll_mask manipulation correct for multiple pollers
> would be tricky, e.g. to avoid false negatives and leave a poller hanging.

I'll have a go at the multiple poller approach.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 486800a7024b..5a260fb3b248 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -259,6 +259,14 @@ static inline bool kvm_kick_many_cpus(struct cpumask *cpus, bool wait)
>         return true;
>  }
>
> +static inline bool kvm_request_is_being_polled(struct kvm_vcpu *vcpu,
> +                                              unsigned int req)
> +{
> +       u32 poll_mask = kvm_request_to_poll_mask(req);
> +
> +       return (atomic_read(vcpu->poll_mask) & poll_mask)
> +}
> +
>  static void kvm_make_vcpu_request(struct kvm_vcpu *vcpu, unsigned int req,
>                                   struct cpumask *tmp, int current_cpu)
>  {
> @@ -285,6 +293,9 @@ static void kvm_make_vcpu_request(struct kvm_vcpu *vcpu, unsigned int req,
>                 if (cpu != -1 && cpu != current_cpu)
>                         __cpumask_set_cpu(cpu, tmp);
>         }
> +
> +       if (kvm_request_is_being_polled(vcpu, req))
> +               wake_up_interruptible(...);
>  }
>
>  bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,

I'll use this approach.

So since we have to provide a proper uAPI, do you have anything against
having user-space set the polling mask through an ioctl? Also any
suggestions on how kvm_request_to_poll_mask() should look like. For ex.
VSM mostly cares for regular interrupts/timers, so mapping

  KVM_REQ_UNBLOCK, KVM_REQ_HV_STIMER, KVM_REQ_EVENT, KVM_REQ_SMI,
  KVM_REQ_NMI

to a KVM_POLL_INTERRUPTS_FLAG would work. We can then have ad-hoc flags
for async-pf, kvmclock updates, dirty logging, etc...

Nicolas
  
Sean Christopherson Oct. 6, 2023, 1:16 a.m. UTC | #3
On Thu, Oct 05, 2023, Nicolas Saenz Julienne wrote:
> Hi Sean,
> Thanks for taking the time to look at this.
> 
> On Wed Oct 4, 2023 at 11:47 PM UTC, Sean Christopherson wrote:
> > This does not look remotely safe on multiple fronts.  For starters, I don't see
> > anything in the .poll() infrastructure that provides serialization, e.g. if there
> > are multiple tasks polling then this will be "interesting".
> 
> Would allowing only one poller be acceptable?

As a last resort, but I think we should first try to support multiple pollers.

> > And there is zero chance this is race-free, e.g. nothing prevents the vCPU task
> > itself from changing vcpu->mode from POLLING_FOR_EVENTS to something else.
> >
> > Why on earth is this mucking with vcpu->mode?  Ignoring for the moment that using
> > vcpu->requests as the poll source is never going to happen, there's zero reason
> 
> IIUC accessing vcpu->requests in the kvm_vcpu_poll() is out of the
> question? Aren't we're forced to do so in order to avoid the race I
> mention above.

Reading vcpu->requests is fine, though I suspect it will be easier to use a
dedicated field.  Requests aren't single bit values and most of them are arch
specific, which will make it annoying to key off of requests directly.  I'm
guessing it will be impossible to completely avoid arch specific polling logic,
but I'd rather not jump straight to that.

> > @@ -285,6 +293,9 @@ static void kvm_make_vcpu_request(struct kvm_vcpu *vcpu, unsigned int req,
> >                 if (cpu != -1 && cpu != current_cpu)
> >                         __cpumask_set_cpu(cpu, tmp);
> >         }
> > +
> > +       if (kvm_request_is_being_polled(vcpu, req))
> > +               wake_up_interruptible(...);
> >  }
> >
> >  bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
> 
> I'll use this approach.
> 
> So since we have to provide a proper uAPI, do you have anything against
> having user-space set the polling mask through an ioctl?

What exactly do you mean?  The mask of what poll "events" userspace cares about?
I doubt that will work well if KVM supports more than one poller, as preventing
them from stomping over one another would be all but impossible.

> Also any suggestions on how kvm_request_to_poll_mask() should look like. For
> ex.  VSM mostly cares for regular interrupts/timers, so mapping
> 
>   KVM_REQ_UNBLOCK, KVM_REQ_HV_STIMER, KVM_REQ_EVENT, KVM_REQ_SMI,
>   KVM_REQ_NMI
> 
> to a KVM_POLL_INTERRUPTS_FLAG would work. We can then have ad-hoc flags
> for async-pf, kvmclock updates, dirty logging, etc...

What all does your use case need/want to poll on?  Mapping out exactly what all
you want/need to poll is probably the best way to answer this question.
  

Patch

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 9436dca9903b..7c12d44486e1 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -63,6 +63,10 @@ 
  */
 #define HV_EXT_CALL_MAX (HV_EXT_CALL_QUERY_CAPABILITIES + 64)
 
+#define HV_VTL_RETURN_POLL_MASK                                 \
+	(BIT_ULL(KVM_REQ_UNBLOCK) | BIT_ULL(KVM_REQ_HV_STIMER) | \
+		BIT_ULL(KVM_REQ_EVENT))
+
 void kvm_tdp_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu, union kvm_mmu_page_role *role)
 {
 	//role->vtl = to_kvm_hv(vcpu->kvm)->hv_enable_vsm ? get_active_vtl(vcpu) : 0;
@@ -3504,6 +3508,7 @@  int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 		goto hypercall_userspace_exit;
 	case HVCALL_VTL_RETURN:
 		vcpu->dump_state_on_run = true;
+		vcpu->poll_mask = HV_VTL_RETURN_POLL_MASK;
 		goto hypercall_userspace_exit;
 	case HVCALL_TRANSLATE_VIRTUAL_ADDRESS:
 		if (unlikely(hc.rep_cnt)) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed6b2039b599..975186f03c01 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -284,6 +284,7 @@  enum {
 	IN_GUEST_MODE,
 	EXITING_GUEST_MODE,
 	READING_SHADOW_PAGE_TABLES,
+	POLLING_FOR_EVENTS,
 };
 
 #define KVM_UNMAPPED_PAGE	((void *) 0x500 + POISON_POINTER_DELTA)
@@ -341,6 +342,7 @@  struct kvm_vcpu {
 #endif
 	int mode;
 	u64 requests;
+	u64 poll_mask;
 	unsigned long guest_debug;
 
 	struct mutex mutex;
@@ -401,6 +403,7 @@  struct kvm_vcpu {
 	struct kvm_memory_slot *last_used_slot;
 	u64 last_used_slot_gen;
 	bool dump_state_on_run;
+	wait_queue_head_t wqh;
 };
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index db106f2e16d8..2985e462ef56 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -238,7 +238,7 @@  static bool kvm_request_needs_ipi(struct kvm_vcpu *vcpu, unsigned req)
 	 * READING_SHADOW_PAGE_TABLES mode.
 	 */
 	if (req & KVM_REQUEST_WAIT)
-		return mode != OUTSIDE_GUEST_MODE;
+		return !(mode == OUTSIDE_GUEST_MODE || mode == POLLING_FOR_EVENTS);
 
 	/*
 	 * Need to kick a running VCPU, but otherwise there is nothing to do.
@@ -479,6 +479,7 @@  static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	snprintf(vcpu->stats_id, sizeof(vcpu->stats_id), "kvm-%d/vcpu-%d",
 		 task_pid_nr(current), id);
 	vcpu->dump_state_on_run = true;
+	init_waitqueue_head(&vcpu->wqh);
 }
 
 static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -3791,7 +3792,12 @@  void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 		cpu = READ_ONCE(vcpu->cpu);
 		if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
 			smp_send_reschedule(cpu);
+		goto out;
 	}
+
+	if (READ_ONCE(vcpu->mode) == POLLING_FOR_EVENTS)
+		wake_up_interruptible(&vcpu->wqh);
+
 out:
 	put_cpu();
 }
@@ -3996,6 +4002,39 @@  static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static __poll_t kvm_vcpu_poll(struct file *file, poll_table *wait)
+{
+	struct kvm_vcpu *vcpu = file->private_data;
+
+	if (!vcpu->poll_mask)
+		return EPOLLERR;
+
+	switch (READ_ONCE(vcpu->mode)) {
+	case OUTSIDE_GUEST_MODE:
+		/*
+		 * Make sure writes to vcpu->request are visible before the
+		 * mode changes.
+		 */
+		smp_store_mb(vcpu->mode, POLLING_FOR_EVENTS);
+		break;
+	case POLLING_FOR_EVENTS:
+		break;
+	default:
+		WARN_ONCE(true, "Trying to poll vCPU %d in mode %d\n",
+			  vcpu->vcpu_id, vcpu->mode);
+		return EPOLLERR;
+	}
+
+	poll_wait(file, &vcpu->wqh, wait);
+
+	if (READ_ONCE(vcpu->requests) & vcpu->poll_mask) {
+		WRITE_ONCE(vcpu->mode, OUTSIDE_GUEST_MODE);
+		return EPOLLIN;
+	}
+
+	return 0;
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
@@ -4008,6 +4047,7 @@  static const struct file_operations kvm_vcpu_fops = {
 	.release        = kvm_vcpu_release,
 	.unlocked_ioctl = kvm_vcpu_ioctl,
 	.mmap           = kvm_vcpu_mmap,
+	.poll		= kvm_vcpu_poll,
 	.llseek		= noop_llseek,
 	KVM_COMPAT(kvm_vcpu_compat_ioctl),
 };
@@ -4275,6 +4315,7 @@  static long kvm_vcpu_ioctl(struct file *filp,
 				synchronize_rcu();
 			put_pid(oldpid);
 		}
+		vcpu->poll_mask = 0;
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;