[v5,08/24] sched: Introduce per memory space current virtual cpu id

Message ID 20221103200359.328736-9-mathieu.desnoyers@efficios.com
State New
Series RSEQ node id and virtual cpu id extensions

Commit Message

Mathieu Desnoyers Nov. 3, 2022, 8:03 p.m. UTC
  This feature allows the scheduler to expose a current virtual cpu id
to user-space. This virtual cpu id is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory space. If a memory space has fewer threads than
cores, or is restricted to running on only a few cores concurrently
through sched affinity or cgroup cpusets, the virtual cpu ids will be
values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.

The vcpu_ids are NUMA-aware. On NUMA systems, when a vcpu_id is observed
by user-space to be associated with a NUMA node, it is guaranteed to
never change NUMA node unless a kernel-level NUMA configuration change
happens.

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with a
  limited cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with a
  restricted cgroup cpuset per container,
- Processes using memory from many NUMA nodes.
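
To illustrate the intended user-space usage, here is a minimal sketch of
how an allocator could index per-vcpu data with such an id. The
current_vm_vcpu_id() accessor, the MAX_NR_VCPUS bound and the way a
thread locates its registered struct rseq are assumptions made for the
example, not part of this series:

#include <stdint.h>

#define MAX_NR_VCPUS	512	/* assumption: bounded by the possible cpus */

struct percpu_slab {
	void *freelist;
	/* ... other per-vcpu allocator state ... */
} __attribute__((aligned(64)));	/* avoid false sharing between slots */

static struct percpu_slab slabs[MAX_NR_VCPUS];

/* Hypothetical accessor returning the current thread's rseq vcpu id. */
extern int32_t current_vm_vcpu_id(void);

static struct percpu_slab *this_vcpu_slab(void)
{
	int32_t vcpu = current_vm_vcpu_id();

	if (vcpu < 0)
		vcpu = 0;	/* fallback if the feature is unavailable */

	/* vcpu ids are dense and close to 0, so this array stays small. */
	return &slabs[vcpu];
}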

One of the key concerns raised by scheduler maintainers is the overhead
associated with additional spin locks or atomic operations in the
scheduler fast path. This is why the following optimization is
implemented.

On context switch between threads belonging to the same memory space,
transfer the mm_vcpu_id from prev to next without any atomic ops. This
takes care of use-cases involving frequent context switch between
threads belonging to the same memory space.
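
The fast path boils down to the following sketch, reconstructed from the
switch_mm_vcpu() hunks quoted later in this thread. The exact surrounding
conditionals are an assumption here, and the locked mm_vcpu_get()/
mm_vcpu_put() slow paths appear in the patch below:

static inline void switch_mm_vcpu(struct task_struct *prev,
				  struct task_struct *next)
{
	if (prev->mm_vcpu_active) {
		if (next->mm_vcpu_active && next->mm == prev->mm) {
			/*
			 * Context switch between threads in the same mm:
			 * hand over the mm_vcpu from prev to next, without
			 * atomics and without taking the vcpu_lock.
			 */
			next->mm_vcpu = prev->mm_vcpu;
			prev->mm_vcpu = -1;
			return;
		}
		/* Different mm: release prev's vcpu id under mm->vcpu_lock. */
		mm_vcpu_put(prev->mm, prev->mm_vcpu);
		prev->mm_vcpu = -1;
	}
	if (next->mm_vcpu_active)
		next->mm_vcpu = mm_vcpu_get(next->mm);
}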

Additional optimizations can be done if the spin locks added when
context switching between threads belonging to different processes end
up being a performance bottleneck. Those are left out of this patch
though. A performance impact would have to be clearly demonstrated to
justify the added complexity.

The credit goes to Paul Turner (Google) for the vcpu_id idea. This
feature is implemented based on discussions with Paul Turner and
Peter Oskolkov (Google), but I took the liberty of implementing the
scheduler fast-path optimizations and my own NUMA-awareness scheme.
Rumor has it that Google has been running an rseq vcpu_id extension
internally in production for a year. The tcmalloc source code indeed
has comments hinting at a vcpu_id prototype extension to the rseq
system call [1].

The following benchmarks do not show any significant overhead added to
the scheduler context switch by this feature:

* perf bench sched messaging (process)

Baseline:                    86.5±0.3 ms
With mm_vcpu_id:             86.7±2.6 ms

* perf bench sched messaging (threaded)

Baseline:                    84.3±3.0 ms
With mm_vcpu_id:             84.7±2.6 ms

* hackbench (process)

Baseline:                    82.9±2.7 ms
With mm_vcpu_id:             82.9±2.9 ms

* hackbench (threaded)

Baseline:                    85.2±2.6 ms
With mm_vcpu_id:             84.4±2.9 ms

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
Changes since v3:
- Remove per-runqueue vcpu id cache optimization.
- Remove single-threaded process optimization.
- Introduce spinlock to protect vcpu id bitmaps.

Changes since v4:
- Disable interrupts around mm_vcpu_get/mm_vcpu_put. The spin locks are
  used from within the scheduler context switch with interrupts off, so
  all uses of these spin locks need to have interrupts off.
- Initialize tsk->mm_vcpu to -1 in dup_task_struct to be consistent with
  other states where mm_vcpu_active is 0.
- Use cpumask andnot/notandnot APIs.
---
 fs/exec.c                |   6 ++
 include/linux/mm.h       |  25 ++++++
 include/linux/mm_types.h | 110 ++++++++++++++++++++++++-
 include/linux/sched.h    |   5 ++
 init/Kconfig             |   4 +
 kernel/fork.c            |  11 ++-
 kernel/sched/core.c      |  52 ++++++++++++
 kernel/sched/sched.h     | 168 +++++++++++++++++++++++++++++++++++++++
 kernel/signal.c          |   2 +
 9 files changed, 381 insertions(+), 2 deletions(-)
  

Comments

Peter Zijlstra Nov. 8, 2022, 1 p.m. UTC | #1
On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:

> diff --git a/fs/exec.c b/fs/exec.c
> index 349a5da91efe..93eb88f4053b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1013,6 +1013,9 @@ static int exec_mmap(struct mm_struct *mm)
>  	tsk->active_mm = mm;
>  	tsk->mm = mm;
>  	lru_gen_add_mm(mm);
> +	mm_init_vcpu_lock(mm);
> +	mm_init_vcpumask(mm);
> +	mm_init_node_vcpumask(mm);
>  	/*
>  	 * This prevents preemption while active_mm is being loaded and
>  	 * it and mm are being updated, which could cause problems for

> @@ -1150,6 +1154,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  
>  	mm->user_ns = get_user_ns(user_ns);
>  	lru_gen_init_mm(mm);
> +	mm_init_vcpu_lock(mm);
> +	mm_init_vcpumask(mm);
> +	mm_init_node_vcpumask(mm);
>  	return mm;
>  
>  fail_nocontext:

Why isn't all that a single mm_init_vcpu(mm) or something ?
  
Peter Zijlstra Nov. 8, 2022, 1:04 p.m. UTC | #2
On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:

> The credit goes to Paul Turner (Google) for the vcpu_id idea. This
> feature is implemented based on the discussions with Paul Turner and
> Peter Oskolkov (Google), but I took the liberty to implement scheduler
> fast-path optimizations and my own NUMA-awareness scheme. The rumor has
> it that Google have been running a rseq vcpu_id extension internally at
> Google in production for a year. The tcmalloc source code indeed has
> comments hinting at a vcpu_id prototype extension to the rseq system
> call [1].

Re NUMA thing -- that means that on a 512 node system a single threaded
task can still observe 512 separate vcpu-ids, right?

Also, said space won't be dense.

The main selling point of the whole vcpu-id scheme was that the id space
is dense and not larger than min(nr_cpus, nr_threads), which then gives
useful properties.

But I'm not at all seeing how the NUMA thing preserves that.

Also; given the utter mind-bendiness of the NUMA thing; should it go
into its own patch; introduce the regular plain old vcpu first, and
then add things to it -- that also allows pushing those weird cpumask
ops you've created later into the series.
  
Mathieu Desnoyers Nov. 8, 2022, 7:45 p.m. UTC | #3
On 2022-11-08 08:00, Peter Zijlstra wrote:
> On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:
> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 349a5da91efe..93eb88f4053b 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1013,6 +1013,9 @@ static int exec_mmap(struct mm_struct *mm)
>>   	tsk->active_mm = mm;
>>   	tsk->mm = mm;
>>   	lru_gen_add_mm(mm);
>> +	mm_init_vcpu_lock(mm);
>> +	mm_init_vcpumask(mm);
>> +	mm_init_node_vcpumask(mm);
>>   	/*
>>   	 * This prevents preemption while active_mm is being loaded and
>>   	 * it and mm are being updated, which could cause problems for
> 
>> @@ -1150,6 +1154,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>   
>>   	mm->user_ns = get_user_ns(user_ns);
>>   	lru_gen_init_mm(mm);
>> +	mm_init_vcpu_lock(mm);
>> +	mm_init_vcpumask(mm);
>> +	mm_init_node_vcpumask(mm);
>>   	return mm;
>>   
>>   fail_nocontext:
> 
> Why isn't all that a single mm_init_vcpu(mm) or something ?

Good point, I'll update that.

Thanks,

Mathieu
  
Mathieu Desnoyers Nov. 8, 2022, 8:07 p.m. UTC | #4
On 2022-11-08 08:04, Peter Zijlstra wrote:
> On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:
> 
>> The credit goes to Paul Turner (Google) for the vcpu_id idea. This
>> feature is implemented based on the discussions with Paul Turner and
>> Peter Oskolkov (Google), but I took the liberty to implement scheduler
>> fast-path optimizations and my own NUMA-awareness scheme. The rumor has
>> it that Google have been running a rseq vcpu_id extension internally at
>> Google in production for a year. The tcmalloc source code indeed has
>> comments hinting at a vcpu_id prototype extension to the rseq system
>> call [1].
> 
> Re NUMA thing -- that means that on a 512 node system a single threaded
> task can still observe 512 separate vcpu-ids, right?

Yes, that's correct.

> 
> Also, said space won't be dense.

Indeed, this can be inefficient if the data structure within the 
single-threaded task is not NUMA-aware *and* that task is free to bounce 
all over the 512 numa nodes.

> 
> The main selling point of the whole vcpu-id scheme was that the id space
> is dense and not larger than min(nr_cpus, nr_threads), which then gives
> useful properties.
> 
> But I'm not at all seeing how the NUMA thing preserves that.

If a userspace per-vcpu data structure is implemented with NUMA-local 
allocations, then it becomes really interesting to guarantee that the 
per-vcpu-id accesses are always numa-local for performance reasons.

If a userspace per-vcpu data structure is not numa-aware, then we have 
two scenarios:

A) The cpuset/sched affinity under which it runs pins it to a set of 
cores belonging to a specific NUMA node. In this case, even with 
numa-aware vcpu id allocation, the ids will stay as close to 0 as if not 
numa-aware.

B) No specific cpuset/sched affinity set, which means the task is free 
to bounce all over. In this case I agree that having the indexing 
numa-aware, but the per-vcpu data structure not numa-aware, is inefficient.

I wonder whether scenarios with 512-node systems, with containers using 
few cores but without using cpusets/sched affinity to pin the workload 
to specific numa nodes, are a workload we should optimize for ? It looks 
like the lack of numa locality due to the absence of an allowed-cores 
restriction is a userspace configuration issue.

We also must keep in mind that we can expect a single task to load a mix 
of executable/shared libraries where some pieces may be numa-aware, and 
others may not. This means we should ideally support a numa-aware 
vcpu-id allocation scheme and a non-numa-aware vcpu-id allocation scheme 
within the same task.

This could be achieved by exposing two struct rseq fields rather than 
one, e.g.:

vm_vcpu_id -> flat indexing, not numa-aware.
vm_numa_vcpu_id -> numa-aware vcpu id indexing.

This would allow data structures that are inherently numa-aware to 
benefit from numa-locality, without hurting non-numa-aware data structures.

> 
> Also; given the utter mind-bendiness of the NUMA thing; should it go
> into its own patch; introduce the regular plain old vcpu first, and
> then add things to it -- that also allows pushing those weird cpumask
> ops you've created later into the series.

Good idea. I can do that once we agree on the way forward for flat vs 
numa-aware vcpu-id rseq fields.

Thanks,

Mathieu
  
Peter Zijlstra Nov. 9, 2022, 9:28 a.m. UTC | #5
On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 08969f5aa38d..6a2323266942 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1047,6 +1047,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	tsk->reported_split_lock = 0;
>  #endif
>  
> +#ifdef CONFIG_SCHED_MM_VCPU
> +	tsk->mm_vcpu = -1;
> +	tsk->mm_vcpu_active = 0;
> +#endif
>  	return tsk;
>  
>  free_stack:

Note how the above hunk does exactly the same as the below hunk, and I
think they're even on the same code-path.

How about moving all of this to __sched_fork() or something?

> @@ -1579,6 +1586,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>  
>  	tsk->mm = mm;
>  	tsk->active_mm = mm;
> +	sched_vcpu_fork(tsk);
>  	return 0;
>  }

> +void sched_vcpu_fork(struct task_struct *t)
> +{
> +	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
> +	t->mm_vcpu = -1;
> +	t->mm_vcpu_active = 1;
> +}
  
Peter Zijlstra Nov. 9, 2022, 9:42 a.m. UTC | #6
On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:

> +void sched_vcpu_exit_signals(struct task_struct *t)
> +{
> +	struct mm_struct *mm = t->mm;
> +	unsigned long flags;
> +
> +	if (!mm)
> +		return;
> +	local_irq_save(flags);
> +	mm_vcpu_put(mm, t->mm_vcpu);
> +	t->mm_vcpu = -1;
> +	t->mm_vcpu_active = 0;
> +	local_irq_restore(flags);
> +}
> +
> +void sched_vcpu_before_execve(struct task_struct *t)
> +{
> +	struct mm_struct *mm = t->mm;
> +	unsigned long flags;
> +
> +	if (!mm)
> +		return;
> +	local_irq_save(flags);
> +	mm_vcpu_put(mm, t->mm_vcpu);
> +	t->mm_vcpu = -1;
> +	t->mm_vcpu_active = 0;
> +	local_irq_restore(flags);
> +}
> +
> +void sched_vcpu_after_execve(struct task_struct *t)
> +{
> +	struct mm_struct *mm = t->mm;
> +	unsigned long flags;
> +
> +	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
> +
> +	local_irq_save(flags);
> +	t->mm_vcpu = mm_vcpu_get(mm);
> +	t->mm_vcpu_active = 1;
> +	local_irq_restore(flags);
> +	rseq_set_notify_resume(t);
> +}

> +static inline void mm_vcpu_put(struct mm_struct *mm, int vcpu)
> +{
> +	lockdep_assert_irqs_disabled();
> +	if (vcpu < 0)
> +		return;
> +	spin_lock(&mm->vcpu_lock);
> +	__cpumask_clear_cpu(vcpu, mm_vcpumask(mm));
> +	spin_unlock(&mm->vcpu_lock);
> +}
> +
> +static inline int mm_vcpu_get(struct mm_struct *mm)
> +{
> +	int ret;
> +
> +	lockdep_assert_irqs_disabled();
> +	spin_lock(&mm->vcpu_lock);
> +	ret = __mm_vcpu_get(mm);
> +	spin_unlock(&mm->vcpu_lock);
> +	return ret;
> +}


This:

	local_irq_disable()
	spin_lock()

thing is a PREEMPT_RT anti-pattern.

At the very very least this should then be raw_spin_lock(), not in the
least because you're calling this from under rq->lock, which itself is a
raw_spin_lock_t.
  
Peter Zijlstra Nov. 9, 2022, 10:19 a.m. UTC | #7
On Tue, Nov 08, 2022 at 03:07:42PM -0500, Mathieu Desnoyers wrote:
> On 2022-11-08 08:04, Peter Zijlstra wrote:
> > On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:
> > 
> > > The credit goes to Paul Turner (Google) for the vcpu_id idea. This
> > > feature is implemented based on the discussions with Paul Turner and
> > > Peter Oskolkov (Google), but I took the liberty to implement scheduler
> > > fast-path optimizations and my own NUMA-awareness scheme. The rumor has
> > > it that Google have been running a rseq vcpu_id extension internally at
> > > Google in production for a year. The tcmalloc source code indeed has
> > > comments hinting at a vcpu_id prototype extension to the rseq system
> > > call [1].
> > 
> > Re NUMA thing -- that means that on a 512 node system a single threaded
> > task can still observe 512 separate vcpu-ids, right?
> 
> Yes, that's correct.

So,.. I've been thinking. How about we expose _two_ vcpu-ids :-)

One is the original, system wide min(nr_thread, nr_cpus) and the other
is a per-node vcpu-id. Not like you did, but min(nr_threads,
nr_cpus_per_node) like.

That way userspace can either use:

 - rseq::cpu_id -- as it does today

 - rseq::vcpu_id -- (when -1, see rseq::cpu_id) the original vcpu_id
                    proposal, which is typically good enough when your
		    application is not NUMA aware and relies on NUMA
		    balancing (most every application out there)

 - rseq::node_id *and*
   rseq::vcpu_node_id -- Combined it gives both node locality
                         and a *dense* space within the node.

This then lets the application pick whatever is best.

Note that using node-affine memory allocations by default is a *very*
bad idea since it wrecks NUMA balancing; you really should only use this
if your application is fully NUMA aware and knows what it is doing.
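
To make those options concrete, here is a sketch of how user-space could
pick an index under this proposal. The field names follow the diff below;
none of this is a released ABI, and a struct rseq layout containing these
fields is an assumption of this thread:

#include <stdbool.h>
#include <stdint.h>

static inline uint32_t pick_percpu_index(const volatile struct rseq *rs,
					 bool numa_aware)
{
	int32_t vcpu = (int32_t)rs->vm_vcpu_id;

	if (vcpu < 0)
		return rs->cpu_id;	/* option 1: plain cpu id */

	if (!numa_aware)
		return (uint32_t)vcpu;	/* option 2: dense, mm-wide vcpu id */

	/*
	 * Option 3: dense id within the current node; the caller combines
	 * it with rs->node_id to index NUMA-local per-node arrays.
	 */
	return rs->vm_vcpu_node_id;
}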


---
 include/linux/mm.h        |   15 ++--
 include/linux/mm_types.h  |   23 +-----
 include/linux/sched.h     |    1 
 include/uapi/linux/rseq.h |    1 
 kernel/fork.c             |    1 
 kernel/rseq.c             |    9 ++
 kernel/sched/core.c       |   18 ++---
 kernel/sched/sched.h      |  157 +++++++++-------------------------------------
 8 files changed, 66 insertions(+), 159 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3484,6 +3484,10 @@ static inline int task_mm_vcpu_id(struct
 {
 	return t->mm_vcpu;
 }
+static inline int task_mm_vcpu_node_id(struct task_struct *t)
+{
+	return t->mm_vcpu_node;
+}
 #else
 static inline void sched_vcpu_before_execve(struct task_struct *t) { }
 static inline void sched_vcpu_after_execve(struct task_struct *t) { }
@@ -3491,12 +3495,11 @@ static inline void sched_vcpu_fork(struc
 static inline void sched_vcpu_exit_signals(struct task_struct *t) { }
 static inline int task_mm_vcpu_id(struct task_struct *t)
 {
-	/*
-	 * Use the processor id as a fall-back when the mm vcpu feature is
-	 * disabled. This provides functional per-cpu data structure accesses
-	 * in user-space, although it won't provide the memory usage benefits.
-	 */
-	return raw_smp_processor_id();
+	return -1; /* userspace should use cpu_id */
+}
+static inline int task_mm_vcpu_node_id(struct task_struct *t)
+{
+	return -1;
 }
 #endif
 
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -568,7 +568,7 @@ struct mm_struct {
 		 * bitmap words as they were being concurrently updated by the
 		 * updaters.
 		 */
-		spinlock_t vcpu_lock;
+		raw_spinlock_t vcpu_lock;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
@@ -875,28 +875,15 @@ static inline unsigned int mm_vcpumask_s
 
 #if defined(CONFIG_SCHED_MM_VCPU) && defined(CONFIG_NUMA)
 /*
- * Layout of node vcpumasks:
- * - node_alloc vcpumask:        cpumask tracking which vcpu_id were
- *                               allocated (across nodes) in this
- *                               memory space.
  * - node vcpumask[nr_node_ids]: per-node cpumask tracking which vcpu_id
  *                               were allocated in this memory space.
  */
-static inline cpumask_t *mm_node_alloc_vcpumask(struct mm_struct *mm)
+static inline cpumask_t *mm_node_vcpumask(struct mm_struct *mm, unsigned int node)
 {
 	unsigned long vcpu_bitmap = (unsigned long)mm_vcpumask(mm);
 
 	/* Skip mm_vcpumask */
 	vcpu_bitmap += cpumask_size();
-	return (struct cpumask *)vcpu_bitmap;
-}
-
-static inline cpumask_t *mm_node_vcpumask(struct mm_struct *mm, unsigned int node)
-{
-	unsigned long vcpu_bitmap = (unsigned long)mm_node_alloc_vcpumask(mm);
-
-	/* Skip node alloc vcpumask */
-	vcpu_bitmap += cpumask_size();
 	vcpu_bitmap += node * cpumask_size();
 	return (struct cpumask *)vcpu_bitmap;
 }
@@ -907,21 +894,21 @@ static inline void mm_init_node_vcpumask
 
 	if (num_possible_nodes() == 1)
 		return;
-	cpumask_clear(mm_node_alloc_vcpumask(mm));
+
 	for (node = 0; node < nr_node_ids; node++)
 		cpumask_clear(mm_node_vcpumask(mm, node));
 }
 
 static inline void mm_init_vcpu_lock(struct mm_struct *mm)
 {
-	spin_lock_init(&mm->vcpu_lock);
+	raw_spin_lock_init(&mm->vcpu_lock);
 }
 
 static inline unsigned int mm_node_vcpumask_size(void)
 {
 	if (num_possible_nodes() == 1)
 		return 0;
-	return (nr_node_ids + 1) * cpumask_size();
+	return nr_node_ids * cpumask_size();
 }
 #else
 static inline void mm_init_node_vcpumask(struct mm_struct *mm) { }
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1316,6 +1316,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_MM_VCPU
 	int				mm_vcpu;	/* Current vcpu in mm */
+	int				mm_node_vcpu;	/* Current vcpu for this node */
 	int				mm_vcpu_active;	/* Whether vcpu bitmap is active */
 #endif
 
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -147,6 +147,7 @@ struct rseq {
 	 * (allocated uniquely within a memory space).
 	 */
 	__u32 vm_vcpu_id;
+	__u32 vm_vcpu_node_id;
 
 	/*
 	 * Flexible array member at end of structure, after last feature field.
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1049,6 +1049,7 @@ static struct task_struct *dup_task_stru
 
 #ifdef CONFIG_SCHED_MM_VCPU
 	tsk->mm_vcpu = -1;
+	tsk->mm_vcpu_node = -1;
 	tsk->mm_vcpu_active = 0;
 #endif
 	return tsk;
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struc
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
 	u32 vm_vcpu_id = task_mm_vcpu_id(t);
+	u32 vm_vcpu_node_id = task_mm_vcpu_node_id(t);
 
 	WARN_ON_ONCE((int) vm_vcpu_id < 0);
 	if (!user_write_access_begin(rseq, t->rseq_len))
@@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struc
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
 	unsafe_put_user(node_id, &rseq->node_id, efault_end);
 	unsafe_put_user(vm_vcpu_id, &rseq->vm_vcpu_id, efault_end);
+	unsafe_put_user(vm_vcpu_node_id, &rseq->vm_vcpu_node_id, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -116,8 +118,8 @@ static int rseq_update_cpu_node_id(struc
 
 static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
-	    vm_vcpu_id = 0;
+	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+	u32 node_id = 0, vm_vcpu_id = -1, vm_vcpu_node_id = -1;
 
 	/*
 	 * Reset cpu_id_start to its initial state (0).
@@ -141,6 +143,9 @@ static int rseq_reset_rseq_cpu_node_id(s
 	 */
 	if (put_user(vm_vcpu_id, &t->rseq->vm_vcpu_id))
 		return -EFAULT;
+
+	if (put_user(vm_vcpu_node_id, &t->rseq->vm_vcpu_node_id))
+		return -EFAULT;
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally reset only if
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11278,15 +11278,13 @@ void call_trace_sched_update_nr_running(
 #ifdef CONFIG_SCHED_MM_VCPU
 void sched_vcpu_exit_signals(struct task_struct *t)
 {
-	struct mm_struct *mm = t->mm;
 	unsigned long flags;
 
-	if (!mm)
+	if (!t->mm)
 		return;
+
 	local_irq_save(flags);
-	mm_vcpu_put(mm, t->mm_vcpu);
-	t->mm_vcpu = -1;
-	t->mm_vcpu_active = 0;
+	mm_vcpu_put(t);
 	local_irq_restore(flags);
 }
 
@@ -11297,10 +11295,9 @@ void sched_vcpu_before_execve(struct tas
 
 	if (!mm)
 		return;
+
 	local_irq_save(flags);
-	mm_vcpu_put(mm, t->mm_vcpu);
-	t->mm_vcpu = -1;
-	t->mm_vcpu_active = 0;
+	mm_vcpu_put(t);
 	local_irq_restore(flags);
 }
 
@@ -11312,9 +11309,9 @@ void sched_vcpu_after_execve(struct task
 	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
 
 	local_irq_save(flags);
-	t->mm_vcpu = mm_vcpu_get(mm);
-	t->mm_vcpu_active = 1;
+	mm_vcpu_get(t);
 	local_irq_restore(flags);
+
 	rseq_set_notify_resume(t);
 }
 
@@ -11322,6 +11319,7 @@ void sched_vcpu_fork(struct task_struct
 {
 	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
 	t->mm_vcpu = -1;
+	t->mm_vcpu_node = -1;
 	t->mm_vcpu_active = 1;
 }
 #endif
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3262,147 +3262,57 @@ static inline void update_current_exec_r
 }
 
 #ifdef CONFIG_SCHED_MM_VCPU
-static inline int __mm_vcpu_get_single_node(struct mm_struct *mm)
+static inline int __mm_vcpu_get(struct cpumask *cpumask)
 {
-	struct cpumask *cpumask;
 	int vcpu;
 
-	cpumask = mm_vcpumask(mm);
 	vcpu = cpumask_first_zero(cpumask);
-	if (vcpu >= nr_cpu_ids)
+	if (WARN_ON_ONCE(vcpu >= nr_cpu_ids))
 		return -1;
+
 	__cpumask_set_cpu(vcpu, cpumask);
 	return vcpu;
 }
 
-#ifdef CONFIG_NUMA
-static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
-{
-	if (num_possible_nodes() == 1)
-		return true;
-	return cpumask_test_cpu(vcpu_id, mm_node_vcpumask(mm, numa_node_id()));
-}
 
-static inline int __mm_vcpu_get(struct mm_struct *mm)
+static inline void __mm_vcpu_put(struct task_struct *p, bool clear_active)
 {
-	struct cpumask *cpumask = mm_vcpumask(mm),
-		       *node_cpumask = mm_node_vcpumask(mm, numa_node_id()),
-		       *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
-	unsigned int node;
-	int vcpu;
-
-	if (num_possible_nodes() == 1)
-		return __mm_vcpu_get_single_node(mm);
-
-	/*
-	 * Try to reserve lowest available vcpu number within those already
-	 * reserved for this NUMA node.
-	 */
-	vcpu = cpumask_first_andnot(node_cpumask, cpumask);
-	if (vcpu >= nr_cpu_ids)
-		goto alloc_numa;
-	__cpumask_set_cpu(vcpu, cpumask);
-	goto end;
+	lockdep_assert_irqs_disabled();
+	WARN_ON_ONCE(p->mm_vcpu < 0);
 
-alloc_numa:
-	/*
-	 * Try to reserve lowest available vcpu number within those not already
-	 * allocated for numa nodes.
-	 */
-	vcpu = cpumask_first_notandnot(node_alloc_cpumask, cpumask);
-	if (vcpu >= nr_cpu_ids)
-		goto numa_update;
-	__cpumask_set_cpu(vcpu, cpumask);
-	__cpumask_set_cpu(vcpu, node_cpumask);
-	__cpumask_set_cpu(vcpu, node_alloc_cpumask);
-	goto end;
-
-numa_update:
-	/*
-	 * NUMA node id configuration changed for at least one CPU in the system.
-	 * We need to steal a currently unused vcpu_id from an overprovisioned
-	 * node for our current node. Userspace must handle the fact that the
-	 * node id associated with this vcpu_id may change due to node ID
-	 * reconfiguration.
-	 *
-	 * Count how many possible cpus are attached to each (other) node id,
-	 * and compare this with the per-mm node vcpumask cpu count. Find one
-	 * which has too many cpus in its mask to steal from.
-	 */
-	for (node = 0; node < nr_node_ids; node++) {
-		struct cpumask *iter_cpumask;
-
-		if (node == numa_node_id())
-			continue;
-		iter_cpumask = mm_node_vcpumask(mm, node);
-		if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
-			/* Try to steal from this node. */
-			vcpu = cpumask_first_andnot(iter_cpumask, cpumask);
-			if (vcpu >= nr_cpu_ids)
-				goto steal_fail;
-			__cpumask_set_cpu(vcpu, cpumask);
-			__cpumask_clear_cpu(vcpu, iter_cpumask);
-			__cpumask_set_cpu(vcpu, node_cpumask);
-			goto end;
-		}
-	}
+	raw_spin_lock(&mm->vcpu_lock);
 
-steal_fail:
-	/*
-	 * Our attempt at gracefully stealing a vcpu_id from another
-	 * overprovisioned NUMA node failed. Fallback to grabbing the first
-	 * available vcpu_id.
-	 */
-	vcpu = cpumask_first_zero(cpumask);
-	if (vcpu >= nr_cpu_ids)
-		return -1;
-	__cpumask_set_cpu(vcpu, cpumask);
-	/* Steal vcpu from its numa node mask. */
-	for (node = 0; node < nr_node_ids; node++) {
-		struct cpumask *iter_cpumask;
-
-		if (node == numa_node_id())
-			continue;
-		iter_cpumask = mm_node_vcpumask(mm, node);
-		if (cpumask_test_cpu(vcpu, iter_cpumask)) {
-			__cpumask_clear_cpu(vcpu, iter_cpumask);
-			break;
-		}
-	}
-	__cpumask_set_cpu(vcpu, node_cpumask);
-end:
-	return vcpu;
-}
+	__cpumask_clear_cpu(p->mm_vcpu, mm_vcpumask(mm));
+	p->mm_vcpu = -1;
+#ifdef CONFIG_NUMA
+	__cpumask_clear_cpu(p->mm_vcpu_node, mm_node_vcpumask(mm, task_node(p)));
+	p->mm_vcpu_node = -1;
+#endif
+	if (clear_active)
+		p->mm_vcpu_active = 0;
 
-#else
-static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
-{
-	return true;
-}
-static inline int __mm_vcpu_get(struct mm_struct *mm)
-{
-	return __mm_vcpu_get_single_node(mm);
+	raw_spin_unlock(&mm->vcpu_lock);
 }
-#endif
 
-static inline void mm_vcpu_put(struct mm_struct *mm, int vcpu)
+static inline void mm_vcpu_put(struct task_struct *p, bool clear_active)
 {
-	lockdep_assert_irqs_disabled();
-	if (vcpu < 0)
-		return;
-	spin_lock(&mm->vcpu_lock);
-	__cpumask_clear_cpu(vcpu, mm_vcpumask(mm));
-	spin_unlock(&mm->vcpu_lock);
+	__mm_vcpu_put(p, true);
 }
 
-static inline int mm_vcpu_get(struct mm_struct *mm)
+static inline void mm_vcpu_get(struct task_struct *p)
 {
 	int ret;
 
 	lockdep_assert_irqs_disabled();
-	spin_lock(&mm->vcpu_lock);
-	ret = __mm_vcpu_get(mm);
-	spin_unlock(&mm->vcpu_lock);
+	WARN_ON_ONCE(p->mm_vcpu > 0);
+
+	raw_spin_lock(&mm->vcpu_lock);
+	p->mm_vcpu = __mm_vcpu_get(mm_vcpumask(m));
+#ifdef CONFIG_NUMA
+	p->mm_vcpu_node = __mm_vcpu_get(mm_node_vcpumask(mm, task_node(p)));
+#endif
+	p->mm_vcpu_active = 1;
+	raw_spin_unlock(&mm->vcpu_lock);
 	return ret;
 }
 
@@ -3414,15 +3324,16 @@ static inline void switch_mm_vcpu(struct
 			 * Context switch between threads in same mm, hand over
 			 * the mm_vcpu from prev to next.
 			 */
-			next->mm_vcpu = prev->mm_vcpu;
-			prev->mm_vcpu = -1;
+			swap(next->mm_vcpu, prev->mm_vcpu);
+#ifdef CONFIG_NUMA
+			swap(next->mm_vcpu_node, prev->mm_vcpu_node);
+#endif
 			return;
 		}
-		mm_vcpu_put(prev->mm, prev->mm_vcpu);
-		prev->mm_vcpu = -1;
+		__mm_vcpu_put(prev, false);
 	}
 	if (next->mm_vcpu_active)
-		next->mm_vcpu = mm_vcpu_get(next->mm);
+		mm_vcpu_get(next);
 }
 
 #else
  
Mathieu Desnoyers Nov. 9, 2022, 3:04 p.m. UTC | #8
On 2022-11-09 04:28, Peter Zijlstra wrote:
> On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:
> 
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 08969f5aa38d..6a2323266942 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1047,6 +1047,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>>   	tsk->reported_split_lock = 0;
>>   #endif
>>   
>> +#ifdef CONFIG_SCHED_MM_VCPU
>> +	tsk->mm_vcpu = -1;
>> +	tsk->mm_vcpu_active = 0;
>> +#endif
>>   	return tsk;
>>   
>>   free_stack:
> 
> Note how the above hunk does exactly the same as the below hunk, and I
> think they're even on the same code-path.
> 
> How about moving all of this to __sched_fork() or something?
> 
>> @@ -1579,6 +1586,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>>   
>>   	tsk->mm = mm;
>>   	tsk->active_mm = mm;
>> +	sched_vcpu_fork(tsk);
>>   	return 0;
>>   }
> 
>> +void sched_vcpu_fork(struct task_struct *t)
>> +{
>> +	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
>> +	t->mm_vcpu = -1;
>> +	t->mm_vcpu_active = 1;
>> +}

Let's look at how things are brought up in copy_process():

         p = dup_task_struct(current, node);
           tsk->mm_vcpu = -1;
           tsk->mm_vcpu_active = 0;
[...]

         /* Perform scheduler related setup. Assign this task to a CPU. */
         retval = sched_fork(clone_flags, p);

-> I presume that from this point the task is observable by the scheduler. However, tsk->mm does not point to the new mm yet.
The "mm_vcpu_active" flag == 0 prevents the scheduler from trying to poke into the wrong mm vcpu_id bitmaps across early mm-struct
lifetime (clone/fork), late in the mm-struct lifetime (exit), and across reset of the mm-struct (exec).

[...]

         retval = copy_mm(clone_flags, p);
             new_mm = dup_mm(old_mm)
               mm_init(new_mm)
                  mm_init_vcpu(new_mm)
-> At this point it becomes OK for the scheduler to poke into the tsk->mm vcpu_id bitmaps. Therefore, sched_vcpu_fork() sets
mm_vcpu_active flag = 1.
              sched_vcpu_fork(tsk)
-> From this point the scheduler can poke into tsk->mm's vcpu_id bitmaps.

So what I think we should do here is to remove the extra "t->mm_vcpu = -1;" assignment from sched_vcpu_fork(), because
it has already been set by dup_task_struct. We could actually turn that into a "WARN_ON_ONCE(t->mm_vcpu != -1)".

However, if my understanding is correct, keeping "tsk->mm_vcpu_active = 0;" early in dup_task_struct and "tsk->mm_vcpu_active = 1"
in sched_vcpu_fork() after the new mm is initialized is really important. Or am I missing something ?
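
In other words, the fork path would become something like this (a sketch
of the change proposed above; dup_task_struct() keeps setting mm_vcpu = -1
and mm_vcpu_active = 0 as it does today):

void sched_vcpu_fork(struct task_struct *t)
{
	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
	/* mm_vcpu was already set to -1 by dup_task_struct(); sanity-check it. */
	WARN_ON_ONCE(t->mm_vcpu != -1);
	t->mm_vcpu_active = 1;
}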

Thanks,

Mathieu
  
Mathieu Desnoyers Nov. 9, 2022, 3:09 p.m. UTC | #9
On 2022-11-09 04:42, Peter Zijlstra wrote:
> On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:
> 
>> +void sched_vcpu_exit_signals(struct task_struct *t)
>> +{
>> +	struct mm_struct *mm = t->mm;
>> +	unsigned long flags;
>> +
>> +	if (!mm)
>> +		return;
>> +	local_irq_save(flags);
>> +	mm_vcpu_put(mm, t->mm_vcpu);
>> +	t->mm_vcpu = -1;
>> +	t->mm_vcpu_active = 0;
>> +	local_irq_restore(flags);
>> +}
>> +
>> +void sched_vcpu_before_execve(struct task_struct *t)
>> +{
>> +	struct mm_struct *mm = t->mm;
>> +	unsigned long flags;
>> +
>> +	if (!mm)
>> +		return;
>> +	local_irq_save(flags);
>> +	mm_vcpu_put(mm, t->mm_vcpu);
>> +	t->mm_vcpu = -1;
>> +	t->mm_vcpu_active = 0;
>> +	local_irq_restore(flags);
>> +}
>> +
>> +void sched_vcpu_after_execve(struct task_struct *t)
>> +{
>> +	struct mm_struct *mm = t->mm;
>> +	unsigned long flags;
>> +
>> +	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
>> +
>> +	local_irq_save(flags);
>> +	t->mm_vcpu = mm_vcpu_get(mm);
>> +	t->mm_vcpu_active = 1;
>> +	local_irq_restore(flags);
>> +	rseq_set_notify_resume(t);
>> +}
> 
>> +static inline void mm_vcpu_put(struct mm_struct *mm, int vcpu)
>> +{
>> +	lockdep_assert_irqs_disabled();
>> +	if (vcpu < 0)
>> +		return;
>> +	spin_lock(&mm->vcpu_lock);
>> +	__cpumask_clear_cpu(vcpu, mm_vcpumask(mm));
>> +	spin_unlock(&mm->vcpu_lock);
>> +}
>> +
>> +static inline int mm_vcpu_get(struct mm_struct *mm)
>> +{
>> +	int ret;
>> +
>> +	lockdep_assert_irqs_disabled();
>> +	spin_lock(&mm->vcpu_lock);
>> +	ret = __mm_vcpu_get(mm);
>> +	spin_unlock(&mm->vcpu_lock);
>> +	return ret;
>> +}
> 
> 
> This:
> 
> 	local_irq_disable()
> 	spin_lock()
> 
> thing is a PREEMPT_RT anti-pattern.
> 
> At the very very least this should then be raw_spin_lock(), not in the
> least because you're calling this from under rq->lock, which itself is a
> raw_spin_lock_t.

Very good point, will fix using raw_spinlock_t.

Thanks,

Mathieu

>
  
Andy Lutomirski Nov. 11, 2022, 4:41 a.m. UTC | #10
On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This feature allows the scheduler to expose a current virtual cpu id
> to user-space. This virtual cpu id is within the possible cpus range,
> and is temporarily (and uniquely) assigned while threads are actively
> running within a memory space. If a memory space has fewer threads than
> cores, or is limited to run on few cores concurrently through sched
> affinity or cgroup cpusets, the virtual cpu ids will be values close
> to 0, thus allowing efficient use of user-space memory for per-cpu
> data structures.
>

Just to check, is a "memory space" an mm?  I've heard these called
"mms" or sometimes (mostly accurately) "processes" but never memory
spaces.  Although I guess the clone(2) manpage says "memory space".

Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
"compacted cpu" or something?  It's a strange sort of concept.
  
Mathieu Desnoyers Nov. 11, 2022, 2:18 p.m. UTC | #11
On 2022-11-10 23:41, Andy Lutomirski wrote:
> On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This feature allows the scheduler to expose a current virtual cpu id
>> to user-space. This virtual cpu id is within the possible cpus range,
>> and is temporarily (and uniquely) assigned while threads are actively
>> running within a memory space. If a memory space has fewer threads than
>> cores, or is limited to run on few cores concurrently through sched
>> affinity or cgroup cpusets, the virtual cpu ids will be values close
>> to 0, thus allowing efficient use of user-space memory for per-cpu
>> data structures.
>>
> 
> Just to check, is a "memory space" an mm?  I've heard these called
> "mms" or sometimes (mostly accurately) "processes" but never memory
> spaces.  Although I guess the clone(2) manpage says "memory space".

Yes, exactly.

I've had a hard time finding the right word there to describe the 
concept of a struct mm from a user-space point of view, and ended up 
finding that the clone(2) man page expresses the result of a clone 
system call with CLONE_VM set as sharing a "memory space", aka a mm_struct.

 From an internal kernel implementation perspective it is usually 
referred to as a "mm", but it's not a notion that appears to be exposed 
to user-space.

And unfortunately "process" can mean so many things other than a struct 
mm: is it a thread group ? Or just a group of processes sharing a file 
descriptor table ? Or sharing signal handlers ? How would you call a 
thread group (clone with CLONE_THREAD) that does not have CLONE_VM set ?

> 
> Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
> "compacted cpu" or something?  It's a strange sort of concept.

I've kept the same wording that has been introduced in 2011 by Paul 
Turner and used internally at Google since then, although it may be 
confusing if people expect kvm-vCPU and rseq-vcpu to mean the same 
thing. Both really end up providing the semantic of a virtually assigned 
cpu id (in opposition to the logical cpu id on the system), but this is 
much more involved in the case of KVM.

Thanks,

Mathieu
  
Sean Christopherson Nov. 14, 2022, 8:49 p.m. UTC | #12
On Fri, Nov 11, 2022, Mathieu Desnoyers wrote:
> On 2022-11-10 23:41, Andy Lutomirski wrote:
> > On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
> > "compacted cpu" or something?  It's a strange sort of concept.
> 
> I've kept the same wording that has been introduced in 2011 by Paul Turner
> and used internally at Google since then, although it may be confusing if
> people expect kvm-vCPU and rseq-vcpu to mean the same thing. Both really end
> up providing the semantic of a virtually assigned cpu id (in opposition to
> the logical cpu id on the system), but this is much more involved in the
> case of KVM.

I had the same reaction as Andy.  The rseq concepts don't worry me so much as the
existence of "vcpu" in mm_struct/task_struct, e.g. switch_mm_vcpu() when switching
between KVM vCPU tasks is going to be super confusing.  Ditto for mm_vcpu_get()
and mm_vcpu_put() in the few cases where KVM currently does mmget()/mmput().
  
Mathieu Desnoyers Nov. 17, 2022, 5:19 p.m. UTC | #13
On 2022-11-14 15:49, Sean Christopherson wrote:
> On Fri, Nov 11, 2022, Mathieu Desnoyers wrote:
>> On 2022-11-10 23:41, Andy Lutomirski wrote:
>>> On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>> Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
>>> "compacted cpu" or something?  It's a strange sort of concept.
>>
>> I've kept the same wording that has been introduced in 2011 by Paul Turner
>> and used internally at Google since then, although it may be confusing if
>> people expect kvm-vCPU and rseq-vcpu to mean the same thing. Both really end
>> up providing the semantic of a virtually assigned cpu id (in opposition to
>> the logical cpu id on the system), but this is much more involved in the
>> case of KVM.
> 
> I had the same reaction as Andy.  The rseq concepts don't worry me so much as the
> existence of "vcpu" in mm_struct/task_struct, e.g. switch_mm_vcpu() when switching
> between KVM vCPU tasks is going to be super confusing.  Ditto for mm_vcpu_get()
> and mm_vcpu_put() in the few cases where KVM currently does mmget()/mmput().

I'm fine with changing the wording if it helps make things less confusing.

Should we go for "compact-cpu-id" ? "packed-cpu-id" ? Other ideas ?

Thanks,

Mathieu
  
Sean Christopherson Nov. 17, 2022, 7:10 p.m. UTC | #14
On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
> On 2022-11-14 15:49, Sean Christopherson wrote:
> > On Fri, Nov 11, 2022, Mathieu Desnoyers wrote:
> > > On 2022-11-10 23:41, Andy Lutomirski wrote:
> > > > On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
> > > > <mathieu.desnoyers@efficios.com> wrote:
> > > > Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
> > > > "compacted cpu" or something?  It's a strange sort of concept.
> > > 
> > > I've kept the same wording that has been introduced in 2011 by Paul Turner
> > > and used internally at Google since then, although it may be confusing if
> > > people expect kvm-vCPU and rseq-vcpu to mean the same thing. Both really end
> > > up providing the semantic of a virtually assigned cpu id (in opposition to
> > > the logical cpu id on the system), but this is much more involved in the
> > > case of KVM.
> > 
> > I had the same reaction as Andy.  The rseq concepts don't worry me so much as the
> > existence of "vcpu" in mm_struct/task_struct, e.g. switch_mm_vcpu() when switching
> > between KVM vCPU tasks is going to be super confusing.  Ditto for mm_vcpu_get()
> > and mm_vcpu_put() in the few cases where KVM currently does mmget()/mmput().
> 
> I'm fine with changing the wording if it helps make things less confusing.
> 
> Should we go for "compact-cpu-id" ? "packed-cpu-id" ? Other ideas ?

What about something like "process-local-cpu-id" to capture that the ID has meaning
only within the associated address space / process?
  
Mathieu Desnoyers Nov. 17, 2022, 7:42 p.m. UTC | #15
On 2022-11-17 14:10, Sean Christopherson wrote:
> On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
>> On 2022-11-14 15:49, Sean Christopherson wrote:
>>> On Fri, Nov 11, 2022, Mathieu Desnoyers wrote:
>>>> On 2022-11-10 23:41, Andy Lutomirski wrote:
>>>>> On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
>>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>> Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
>>>>> "compacted cpu" or something?  It's a strange sort of concept.
>>>>
>>>> I've kept the same wording that has been introduced in 2011 by Paul Turner
>>>> and used internally at Google since then, although it may be confusing if
>>>> people expect kvm-vCPU and rseq-vcpu to mean the same thing. Both really end
>>>> up providing the semantic of a virtually assigned cpu id (in opposition to
>>>> the logical cpu id on the system), but this is much more involved in the
>>>> case of KVM.
>>>
>>> I had the same reaction as Andy.  The rseq concepts don't worry me so much as the
>>> existence of "vcpu" in mm_struct/task_struct, e.g. switch_mm_vcpu() when switching
>>> between KVM vCPU tasks is going to be super confusing.  Ditto for mm_vcpu_get()
>>> and mm_vcpu_put() in the few cases where KVM currently does mmget()/mmput().
>>
>> I'm fine with changing the wording if it helps make things less confusing.
>>
>> Should we go for "compact-cpu-id" ? "packed-cpu-id" ? Other ideas ?
> 
> What about something like "process-local-cpu-id" to capture that the ID has meaning
> only within the associated address space / process?

Considering that the shorthand for "memory space" is "VM" in e.g. 
"CLONE_VM" clone(2) flags, perhaps "vm-cpu-id", "vm-local-cpu-id" or 
"per-vm-cpu-id" ?

Thanks,

Mathieu
  
Sean Christopherson Nov. 17, 2022, 9:15 p.m. UTC | #16
On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
> On 2022-11-17 14:10, Sean Christopherson wrote:
> > On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
> > > On 2022-11-14 15:49, Sean Christopherson wrote:
> > > > On Fri, Nov 11, 2022, Mathieu Desnoyers wrote:
> > > > > On 2022-11-10 23:41, Andy Lutomirski wrote:
> > > > > > On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
> > > > > > <mathieu.desnoyers@efficios.com> wrote:
> > > > > > Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
> > > > > > "compacted cpu" or something?  It's a strange sort of concept.
> > > > > 
> > > > > I've kept the same wording that has been introduced in 2011 by Paul Turner
> > > > > and used internally at Google since then, although it may be confusing if
> > > > > people expect kvm-vCPU and rseq-vcpu to mean the same thing. Both really end
> > > > > up providing the semantic of a virtually assigned cpu id (in opposition to
> > > > > the logical cpu id on the system), but this is much more involved in the
> > > > > case of KVM.
> > > > 
> > > > I had the same reaction as Andy.  The rseq concepts don't worry me so much as the
> > > > existence of "vcpu" in mm_struct/task_struct, e.g. switch_mm_vcpu() when switching
> > > > between KVM vCPU tasks is going to be super confusing.  Ditto for mm_vcpu_get()
> > > > and mm_vcpu_put() in the few cases where KVM currently does mmget()/mmput().
> > > 
> > > I'm fine with changing the wording if it helps make things less confusing.
> > > 
> > > Should we go for "compact-cpu-id" ? "packed-cpu-id" ? Other ideas ?
> > 
> > What about something like "process-local-cpu-id" to capture that the ID has meaning
> > only within the associated address space / process?
> 
> Considering that the shorthand for "memory space" is "VM" in e.g. "CLONE_VM"

No objection from me for "vm", I've already had to untrain myself and remember
that "vm" doesn't always mean "virtual machine" :-)

> clone(2) flags, perhaps "vm-cpu-id", "vm-local-cpu-id" or "per-vm-cpu-id" ?

I have a slight preference for vm-local-cpu-id, but any of 'em work for me.
  
Mathieu Desnoyers Nov. 21, 2022, 7 p.m. UTC | #17
On 2022-11-17 16:15, Sean Christopherson wrote:
> On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
>> On 2022-11-17 14:10, Sean Christopherson wrote:
>>> On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
>>>> On 2022-11-14 15:49, Sean Christopherson wrote:
>>>>> On Fri, Nov 11, 2022, Mathieu Desnoyers wrote:
>>>>>> On 2022-11-10 23:41, Andy Lutomirski wrote:
>>>>>>> On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
>>>>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>>> Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
>>>>>>> "compacted cpu" or something?  It's a strange sort of concept.
>>>>>>
>>>>>> I've kept the same wording that has been introduced in 2011 by Paul Turner
>>>>>> and used internally at Google since then, although it may be confusing if
>>>>>> people expect kvm-vCPU and rseq-vcpu to mean the same thing. Both really end
>>>>>> up providing the semantic of a virtually assigned cpu id (in opposition to
>>>>>> the logical cpu id on the system), but this is much more involved in the
>>>>>> case of KVM.
>>>>>
>>>>> I had the same reaction as Andy.  The rseq concepts don't worry me so much as the
>>>>> existence of "vcpu" in mm_struct/task_struct, e.g. switch_mm_vcpu() when switching
>>>>> between KVM vCPU tasks is going to be super confusing.  Ditto for mm_vcpu_get()
>>>>> and mm_vcpu_put() in the few cases where KVM currently does mmget()/mmput().
>>>>
>>>> I'm fine with changing the wording if it helps make things less confusing.
>>>>
>>>> Should we go for "compact-cpu-id" ? "packed-cpu-id" ? Other ideas ?
>>>
>>> What about something like "process-local-cpu-id" to capture that the ID has meaning
>>> only within the associated address space / process?
>>
>> Considering that the shorthand for "memory space" is "VM" in e.g. "CLONE_VM"
> 
> No objection from me for "vm", I've already had to untrain myself and remember
> that "vm" doesn't always mean "virtual machine" :-)
> 
>> clone(2) flags, perhaps "vm-cpu-id", "vm-local-cpu-id" or "per-vm-cpu-id" ?
> 
> I have a slight preference for vm-local-cpu-id, but any of 'em work for me.

Taking a step back wrt naming (because if I do a name change across the 
series, I want it to be the last time I do it):

- VM (kvm) vs vm_ (rseq) is confusing.
- vCPU (kvm) vs vcpu (rseq) is confusing.

I propose "Address Space Concurrency ID". This indicates that those IDs 
are really just tags assigned uniquely within an address space for each 
thread running concurrently (and only while they are running).

Then the question that arises is whether the abbreviation presented to 
user-space should be "mm_cid" (as would be expected from an internal 
implementation perspective) or "as_cid" (which would match the name 
exposed to user-space) ?

Thanks,

Mathieu
  
Mathieu Desnoyers Nov. 21, 2022, 7:52 p.m. UTC | #18
On 2022-11-21 14:00, Mathieu Desnoyers wrote:
> On 2022-11-17 16:15, Sean Christopherson wrote:
>> On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
>>> On 2022-11-17 14:10, Sean Christopherson wrote:
>>>> On Thu, Nov 17, 2022, Mathieu Desnoyers wrote:
>>>>> On 2022-11-14 15:49, Sean Christopherson wrote:
>>>>>> On Fri, Nov 11, 2022, Mathieu Desnoyers wrote:
>>>>>>> On 2022-11-10 23:41, Andy Lutomirski wrote:
>>>>>>>> On Thu, Nov 3, 2022 at 1:05 PM Mathieu Desnoyers
>>>>>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>>>> Also, in my mind "virtual cpu" is vCPU, which this isn't.  Maybe
>>>>>>>> "compacted cpu" or something?  It's a strange sort of concept.
>>>>>>>
>>>>>>> I've kept the same wording that has been introduced in 2011 by 
>>>>>>> Paul Turner
>>>>>>> and used internally at Google since then, although it may be 
>>>>>>> confusing if
>>>>>>> people expect kvm-vCPU and rseq-vcpu to mean the same thing. Both 
>>>>>>> really end
>>>>>>> up providing the semantic of a virtually assigned cpu id (in 
>>>>>>> opposition to
>>>>>>> the logical cpu id on the system), but this is much more involved 
>>>>>>> in the
>>>>>>> case of KVM.
>>>>>>
>>>>>> I had the same reaction as Andy.  The rseq concepts don't worry me 
>>>>>> so much as the
>>>>>> existence of "vcpu" in mm_struct/task_struct, e.g. 
>>>>>> switch_mm_vcpu() when switching
>>>>>> between KVM vCPU tasks is going to be super confusing.  Ditto for 
>>>>>> mm_vcpu_get()
>>>>>> and mm_vcpu_put() in the few cases where KVM currently does 
>>>>>> mmget()/mmput().
>>>>>
>>>>> I'm fine with changing the wording if it helps make things less 
>>>>> confusing.
>>>>>
>>>>> Should we go for "compact-cpu-id" ? "packed-cpu-id" ? Other ideas ?
>>>>
>>>> What about something like "process-local-cpu-id" to capture that the 
>>>> ID has meaning
>>>> only within the associated address space / process?
>>>
>>> Considering that the shorthand for "memory space" is "VM" in e.g. 
>>> "CLONE_VM"
>>
>> No objection from me for "vm", I've already had to untrain myself and 
>> remember
>> that "vm" doesn't always mean "virtual machine" :-)
>>
>>> clone(2) flags, perhaps "vm-cpu-id", "vm-local-cpu-id" or 
>>> "per-vm-cpu-id" ?
>>
>> I have a slight preference for vm-local-cpu-id, but any of 'em work 
>> for me.
> 
> Taking a step back wrt naming (because if I do a name change across the 
> series, I want it to be the last time I do it):
> 
> - VM (kvm) vs vm_ (rseq) is confusing.
> - vCPU (kvm) vs vcpu (rseq) is confusing.
> 
> I propose "Address Space Concurrency ID". This indicates that those IDs 
> are really just tags assigned uniquely within an address space for each 
> thread running concurrently (and only while they are running).
> 
> Then the question that arises is whether the abbreviation presented to 
> user-space should be "mm_cid" (as would be expected from an internal 
> implementation perspective) or "as_cid" (which would match the name 
> exposed to user-space) ?

Or it could be "Memory Map Concurrency ID" (mm_cid) to have matching 
abbreviation and naming. The notion of a "memory map" seems to be seen 
in a few places in man pages, and there are even tools to explore 
process memory maps (pmap(1)).

Thanks,

Mathieu



> 
> Thanks,
> 
> Mathieu
>
  

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 349a5da91efe..93eb88f4053b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1013,6 +1013,9 @@  static int exec_mmap(struct mm_struct *mm)
 	tsk->active_mm = mm;
 	tsk->mm = mm;
 	lru_gen_add_mm(mm);
+	mm_init_vcpu_lock(mm);
+	mm_init_vcpumask(mm);
+	mm_init_node_vcpumask(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1808,6 +1811,7 @@  static int bprm_execve(struct linux_binprm *bprm,
 
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
+	sched_vcpu_before_execve(current);
 
 	file = do_open_execat(fd, filename, flags);
 	retval = PTR_ERR(file);
@@ -1838,6 +1842,7 @@  static int bprm_execve(struct linux_binprm *bprm,
 	if (retval < 0)
 		goto out;
 
+	sched_vcpu_after_execve(current);
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
@@ -1857,6 +1862,7 @@  static int bprm_execve(struct linux_binprm *bprm,
 		force_fatal_sig(SIGSEGV);
 
 out_unmark:
+	sched_vcpu_after_execve(current);
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..eb6075d55339 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3475,4 +3475,29 @@  madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  */
 #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
 
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_before_execve(struct task_struct *t);
+void sched_vcpu_after_execve(struct task_struct *t);
+void sched_vcpu_fork(struct task_struct *t);
+void sched_vcpu_exit_signals(struct task_struct *t);
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	return t->mm_vcpu;
+}
+#else
+static inline void sched_vcpu_before_execve(struct task_struct *t) { }
+static inline void sched_vcpu_after_execve(struct task_struct *t) { }
+static inline void sched_vcpu_fork(struct task_struct *t) { }
+static inline void sched_vcpu_exit_signals(struct task_struct *t) { }
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	/*
+	 * Use the processor id as a fall-back when the mm vcpu feature is
+	 * disabled. This provides functional per-cpu data structure accesses
+	 * in user-space, although it won't provide the memory usage benefits.
+	 */
+	return raw_smp_processor_id();
+}
+#endif
+
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..9b4c7ec68057 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,6 +18,7 @@ 
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
 
 #include <asm/mmu.h>
 
@@ -556,7 +557,19 @@  struct mm_struct {
 		 * &struct mm_struct is freed.
 		 */
 		atomic_t mm_count;
-
+#ifdef CONFIG_SCHED_MM_VCPU
+		/**
+		 * @vcpu_lock: Protect vcpu_id bitmap updates vs lookups.
+		 *
+		 * Prevent situations where updates to the vcpu_id bitmap
+		 * happen concurrently with lookups. Those can lead to
+		 * situations where a lookup cannot find a free bit simply
+		 * because it was unlucky enough to load, non-atomically,
+		 * bitmap words as they were being concurrently updated by the
+		 * updaters.
+		 */
+		spinlock_t vcpu_lock;
+#endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -824,6 +837,101 @@  static inline void vma_iter_init(struct vma_iterator *vmi,
 	vmi->mas.node = MAS_START;
 }
 
+#ifdef CONFIG_SCHED_MM_VCPU
+/* Future-safe accessor for struct mm_struct's vcpu_mask. */
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+	unsigned long vcpu_bitmap = (unsigned long)mm;
+
+	vcpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	/* Skip cpu_bitmap */
+	vcpu_bitmap += cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_vcpumask(struct mm_struct *mm)
+{
+	cpumask_clear(mm_vcpumask(mm));
+}
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+	return cpumask_size();
+}
+
+#else
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+	return NULL;
+}
+
+static inline void mm_init_vcpumask(struct mm_struct *mm) { }
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_SCHED_MM_VCPU) && defined(CONFIG_NUMA)
+/*
+ * Layout of node vcpumasks:
+ * - node_alloc vcpumask:        cpumask tracking which vcpu_id were
+ *                               allocated (across nodes) in this
+ *                               memory space.
+ * - node vcpumask[nr_node_ids]: per-node cpumask tracking which vcpu_id
+ *                               were allocated in this memory space.
+ */
+static inline cpumask_t *mm_node_alloc_vcpumask(struct mm_struct *mm)
+{
+	unsigned long vcpu_bitmap = (unsigned long)mm_vcpumask(mm);
+
+	/* Skip mm_vcpumask */
+	vcpu_bitmap += cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline cpumask_t *mm_node_vcpumask(struct mm_struct *mm, unsigned int node)
+{
+	unsigned long vcpu_bitmap = (unsigned long)mm_node_alloc_vcpumask(mm);
+
+	/* Skip node alloc vcpumask */
+	vcpu_bitmap += cpumask_size();
+	vcpu_bitmap += node * cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_node_vcpumask(struct mm_struct *mm)
+{
+	unsigned int node;
+
+	if (num_possible_nodes() == 1)
+		return;
+	cpumask_clear(mm_node_alloc_vcpumask(mm));
+	for (node = 0; node < nr_node_ids; node++)
+		cpumask_clear(mm_node_vcpumask(mm, node));
+}
+
+static inline void mm_init_vcpu_lock(struct mm_struct *mm)
+{
+	spin_lock_init(&mm->vcpu_lock);
+}
+
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+	if (num_possible_nodes() == 1)
+		return 0;
+	return (nr_node_ids + 1) * cpumask_size();
+}
+#else
+static inline void mm_init_node_vcpumask(struct mm_struct *mm) { }
+static inline void mm_init_vcpu_lock(struct mm_struct *mm) { }
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+	return 0;
+}
+#endif
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2a9e14e3e668..8e68335d122e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1314,6 +1314,11 @@  struct task_struct {
 	unsigned long rseq_event_mask;
 #endif
 
+#ifdef CONFIG_SCHED_MM_VCPU
+	int				mm_vcpu;	/* Current vcpu in mm */
+	int				mm_vcpu_active;	/* Whether vcpu bitmap is active */
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	union {
diff --git a/init/Kconfig b/init/Kconfig
index abf65098f1b6..cab2d2bcf62d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1039,6 +1039,10 @@  config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config SCHED_MM_VCPU
+	def_bool y
+	depends on SMP && RSEQ
+
 config UCLAMP_TASK_GROUP
 	bool "Utilization clamping per group of tasks"
 	depends on CGROUP_SCHED
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..6a2323266942 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1047,6 +1047,10 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->reported_split_lock = 0;
 #endif
 
+#ifdef CONFIG_SCHED_MM_VCPU
+	tsk->mm_vcpu = -1;
+	tsk->mm_vcpu_active = 0;
+#endif
 	return tsk;
 
 free_stack:
@@ -1150,6 +1154,9 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
+	mm_init_vcpu_lock(mm);
+	mm_init_vcpumask(mm);
+	mm_init_node_vcpumask(mm);
 	return mm;
 
 fail_nocontext:
@@ -1579,6 +1586,7 @@  static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+	sched_vcpu_fork(tsk);
 	return 0;
 }
 
@@ -3041,7 +3049,8 @@  void __init proc_caches_init(void)
 	 * dynamically sized based on the maximum CPU number this system
 	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
-	mm_size = sizeof(struct mm_struct) + cpumask_size();
+	mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_vcpumask_size() +
+		  mm_node_vcpumask_size();
 
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
 			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07ac08caf019..2f356169047e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5019,6 +5019,7 @@  prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
+	switch_mm_vcpu(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -11273,3 +11274,54 @@  void call_trace_sched_update_nr_running(struct rq *rq, int count)
 {
         trace_sched_update_nr_running_tp(rq, count);
 }
+
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_exit_signals(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	unsigned long flags;
+
+	if (!mm)
+		return;
+	local_irq_save(flags);
+	mm_vcpu_put(mm, t->mm_vcpu);
+	t->mm_vcpu = -1;
+	t->mm_vcpu_active = 0;
+	local_irq_restore(flags);
+}
+
+void sched_vcpu_before_execve(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	unsigned long flags;
+
+	if (!mm)
+		return;
+	local_irq_save(flags);
+	mm_vcpu_put(mm, t->mm_vcpu);
+	t->mm_vcpu = -1;
+	t->mm_vcpu_active = 0;
+	local_irq_restore(flags);
+}
+
+void sched_vcpu_after_execve(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	unsigned long flags;
+
+	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
+
+	local_irq_save(flags);
+	t->mm_vcpu = mm_vcpu_get(mm);
+	t->mm_vcpu_active = 1;
+	local_irq_restore(flags);
+	rseq_set_notify_resume(t);
+}
+
+void sched_vcpu_fork(struct task_struct *t)
+{
+	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
+	t->mm_vcpu = -1;
+	t->mm_vcpu_active = 1;
+}
+#endif
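
The four hooks above define when a task holds a vcpu_id: a freshly forked task is marked active but owns no id until it is first scheduled in, execve drops the id and re-acquires one once the new memory space is set up, and exit releases it for good. A minimal userspace sketch of that lifecycle, where vcpu_get()/vcpu_put(), the 64-bit bitmap and toy_task are inventions of the example standing in for mm_vcpu_get()/mm_vcpu_put(), the per-mm vcpumask and the task_struct fields:

#include <assert.h>
#include <stdio.h>
#include <stdint.h>

/* Toy per-mm vcpu_id bitmap; bit N set means vcpu_id N is in use. */
static uint64_t vcpumask;

static int vcpu_get(void)
{
	for (int vcpu = 0; vcpu < 64; vcpu++) {
		if (!(vcpumask & (1ULL << vcpu))) {
			vcpumask |= 1ULL << vcpu;
			return vcpu;
		}
	}
	return -1;
}

static void vcpu_put(int vcpu)
{
	if (vcpu >= 0)
		vcpumask &= ~(1ULL << vcpu);
}

struct toy_task { int mm_vcpu; int mm_vcpu_active; };

int main(void)
{
	/* sched_vcpu_fork(): active, but no id until first schedule-in. */
	struct toy_task t = { .mm_vcpu = -1, .mm_vcpu_active = 1 };

	/* Schedule-in (switch_mm_vcpu): the task acquires the lowest free id. */
	t.mm_vcpu = vcpu_get();
	assert(t.mm_vcpu == 0);

	/* sched_vcpu_before_execve(): drop the id across execve. */
	vcpu_put(t.mm_vcpu);
	t.mm_vcpu = -1;
	t.mm_vcpu_active = 0;

	/* sched_vcpu_after_execve(): re-acquire once the new mm is in place. */
	t.mm_vcpu = vcpu_get();
	t.mm_vcpu_active = 1;

	/* sched_vcpu_exit_signals(): release for good at exit. */
	vcpu_put(t.mm_vcpu);
	t.mm_vcpu = -1;
	t.mm_vcpu_active = 0;

	printf("bitmap empty again: %d\n", vcpumask == 0);
	return 0;
}
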
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..28a51cad8174 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3261,4 +3261,172 @@  static inline void update_current_exec_runtime(struct task_struct *curr,
 	cgroup_account_cputime(curr, delta_exec);
 }
 
+#ifdef CONFIG_SCHED_MM_VCPU
+static inline int __mm_vcpu_get_single_node(struct mm_struct *mm)
+{
+	struct cpumask *cpumask;
+	int vcpu;
+
+	cpumask = mm_vcpumask(mm);
+	vcpu = cpumask_first_zero(cpumask);
+	if (vcpu >= nr_cpu_ids)
+		return -1;
+	__cpumask_set_cpu(vcpu, cpumask);
+	return vcpu;
+}
+
+#ifdef CONFIG_NUMA
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+	if (num_possible_nodes() == 1)
+		return true;
+	return cpumask_test_cpu(vcpu_id, mm_node_vcpumask(mm, numa_node_id()));
+}
+
+static inline int __mm_vcpu_get(struct mm_struct *mm)
+{
+	struct cpumask *cpumask = mm_vcpumask(mm),
+		       *node_cpumask = mm_node_vcpumask(mm, numa_node_id()),
+		       *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
+	unsigned int node;
+	int vcpu;
+
+	if (num_possible_nodes() == 1)
+		return __mm_vcpu_get_single_node(mm);
+
+	/*
+	 * Try to reserve the lowest available vcpu number among those
+	 * already reserved for this NUMA node.
+	 */
+	vcpu = cpumask_first_andnot(node_cpumask, cpumask);
+	if (vcpu >= nr_cpu_ids)
+		goto alloc_numa;
+	__cpumask_set_cpu(vcpu, cpumask);
+	goto end;
+
+alloc_numa:
+	/*
+	 * Try to reserve the lowest available vcpu number among those not
+	 * already allocated to any NUMA node.
+	 */
+	vcpu = cpumask_first_notandnot(node_alloc_cpumask, cpumask);
+	if (vcpu >= nr_cpu_ids)
+		goto numa_update;
+	__cpumask_set_cpu(vcpu, cpumask);
+	__cpumask_set_cpu(vcpu, node_cpumask);
+	__cpumask_set_cpu(vcpu, node_alloc_cpumask);
+	goto end;
+
+numa_update:
+	/*
+	 * NUMA node id configuration changed for at least one CPU in the system.
+	 * We need to steal a currently unused vcpu_id from an overprovisioned
+	 * node for our current node. Userspace must handle the fact that the
+	 * node id associated with this vcpu_id may change due to node ID
+	 * reconfiguration.
+	 *
+	 * Count how many possible cpus are attached to each (other) node id,
+	 * and compare this with the per-mm node vcpumask cpu count. Steal
+	 * from a node whose vcpumask contains more vcpu_ids than it has cpus.
+	 */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_vcpumask(mm, node);
+		if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
+			/* Try to steal from this node. */
+			vcpu = cpumask_first_andnot(iter_cpumask, cpumask);
+			if (vcpu >= nr_cpu_ids)
+				goto steal_fail;
+			__cpumask_set_cpu(vcpu, cpumask);
+			__cpumask_clear_cpu(vcpu, iter_cpumask);
+			__cpumask_set_cpu(vcpu, node_cpumask);
+			goto end;
+		}
+	}
+
+steal_fail:
+	/*
+	 * Our attempt at gracefully stealing a vcpu_id from another
+	 * overprovisioned NUMA node failed. Fall back to grabbing the first
+	 * available vcpu_id.
+	 */
+	vcpu = cpumask_first_zero(cpumask);
+	if (vcpu >= nr_cpu_ids)
+		return -1;
+	__cpumask_set_cpu(vcpu, cpumask);
+	/* Clear the stolen vcpu from whichever NUMA node mask currently holds it. */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_vcpumask(mm, node);
+		if (cpumask_test_cpu(vcpu, iter_cpumask)) {
+			__cpumask_clear_cpu(vcpu, iter_cpumask);
+			break;
+		}
+	}
+	__cpumask_set_cpu(vcpu, node_cpumask);
+end:
+	return vcpu;
+}
+
+#else
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+	return true;
+}
+static inline int __mm_vcpu_get(struct mm_struct *mm)
+{
+	return __mm_vcpu_get_single_node(mm);
+}
+#endif
+
+static inline void mm_vcpu_put(struct mm_struct *mm, int vcpu)
+{
+	lockdep_assert_irqs_disabled();
+	if (vcpu < 0)
+		return;
+	spin_lock(&mm->vcpu_lock);
+	__cpumask_clear_cpu(vcpu, mm_vcpumask(mm));
+	spin_unlock(&mm->vcpu_lock);
+}
+
+static inline int mm_vcpu_get(struct mm_struct *mm)
+{
+	int ret;
+
+	lockdep_assert_irqs_disabled();
+	spin_lock(&mm->vcpu_lock);
+	ret = __mm_vcpu_get(mm);
+	spin_unlock(&mm->vcpu_lock);
+	return ret;
+}
+
+static inline void switch_mm_vcpu(struct task_struct *prev, struct task_struct *next)
+{
+	if (prev->mm_vcpu_active) {
+		if (next->mm_vcpu_active && next->mm == prev->mm) {
+			/*
+			 * Context switch between threads of the same mm: hand
+			 * over the mm_vcpu from prev to next.
+			 */
+			next->mm_vcpu = prev->mm_vcpu;
+			prev->mm_vcpu = -1;
+			return;
+		}
+		mm_vcpu_put(prev->mm, prev->mm_vcpu);
+		prev->mm_vcpu = -1;
+	}
+	if (next->mm_vcpu_active)
+		next->mm_vcpu = mm_vcpu_get(next->mm);
+}
+
+#else
+static inline void switch_mm_vcpu(struct task_struct *prev, struct task_struct *next) { }
+#endif
+
 #endif /* _KERNEL_SCHED_SCHED_H */
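
As an aside, the effect of the single-node allocation path and of the switch_mm_vcpu() handover can be modelled in plain userspace C. In the sketch below, toy_mm, toy_task, vcpu_get(), vcpu_put() and toy_switch() are inventions of the example (a 64-bit word stands in for the per-mm vcpumask); toy_switch() mirrors the shape of switch_mm_vcpu() above, so threads of the same mm hand the id over without touching the bitmap and a lone running thread keeps observing vcpu_id 0.

#include <stdio.h>
#include <stdint.h>

struct toy_mm {
	uint64_t vcpumask;	/* bit N set => vcpu_id N is in use */
};

struct toy_task {
	struct toy_mm *mm;
	int mm_vcpu;		/* -1 when no vcpu_id is held */
	int mm_vcpu_active;
};

/* Lowest free bit, like cpumask_first_zero() + __cpumask_set_cpu(). */
static int vcpu_get(struct toy_mm *mm)
{
	for (int vcpu = 0; vcpu < 64; vcpu++) {
		if (!(mm->vcpumask & (1ULL << vcpu))) {
			mm->vcpumask |= 1ULL << vcpu;
			return vcpu;
		}
	}
	return -1;
}

static void vcpu_put(struct toy_mm *mm, int vcpu)
{
	if (vcpu >= 0)
		mm->vcpumask &= ~(1ULL << vcpu);
}

/* Mirrors the shape of switch_mm_vcpu(): same-mm switches skip the bitmap. */
static void toy_switch(struct toy_task *prev, struct toy_task *next)
{
	if (prev->mm_vcpu_active) {
		if (next->mm_vcpu_active && next->mm == prev->mm) {
			next->mm_vcpu = prev->mm_vcpu;	/* handover, no bitmap op */
			prev->mm_vcpu = -1;
			return;
		}
		vcpu_put(prev->mm, prev->mm_vcpu);
		prev->mm_vcpu = -1;
	}
	if (next->mm_vcpu_active)
		next->mm_vcpu = vcpu_get(next->mm);
}

int main(void)
{
	struct toy_mm mm = { 0 };
	struct toy_task a = { &mm, -1, 1 }, b = { &mm, -1, 1 };

	/* a starts running and takes the lowest free vcpu_id: 0. */
	a.mm_vcpu = vcpu_get(&mm);

	/* Ping-pong between two threads of the same mm: id 0 is handed over. */
	toy_switch(&a, &b);
	printf("b runs with vcpu %d (a now holds %d)\n", b.mm_vcpu, a.mm_vcpu);
	toy_switch(&b, &a);
	printf("a runs with vcpu %d (b now holds %d)\n", a.mm_vcpu, b.mm_vcpu);
	return 0;
}
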
diff --git a/kernel/signal.c b/kernel/signal.c
index d140672185a4..44fb9fdca13e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2950,6 +2950,7 @@  void exit_signals(struct task_struct *tsk)
 	cgroup_threadgroup_change_begin(tsk);
 
 	if (thread_group_empty(tsk) || (tsk->signal->flags & SIGNAL_GROUP_EXIT)) {
+		sched_vcpu_exit_signals(tsk);
 		tsk->flags |= PF_EXITING;
 		cgroup_threadgroup_change_end(tsk);
 		return;
@@ -2960,6 +2961,7 @@  void exit_signals(struct task_struct *tsk)
 	 * From now this task is not visible for group-wide signals,
 	 * see wants_signal(), do_signal_stop().
 	 */
+	sched_vcpu_exit_signals(tsk);
 	tsk->flags |= PF_EXITING;
 
 	cgroup_threadgroup_change_end(tsk);