[RFC] sched: Fix performance regression introduced by mm_cid

Message ID 20230403181342.210896-1-mathieu.desnoyers@efficios.com
State New
Headers
Series [RFC] sched: Fix performance regression introduced by mm_cid |

Commit Message

Mathieu Desnoyers April 3, 2023, 6:13 p.m. UTC
  Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.

Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing them immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.

Thread migration is handled by introducing invocation to
sched_mm_cid_migrate_from() (with source runqueue lock held) and to
sched_mm_cid_migrate_to() (with destination runqueue lock held) in
move_queued_task().

sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
index corresponding to the source runqueue if it matches the last mm_cid
observed by the migrated task. This last mm_cid value is returned as a
hint to conditionally clear the mm's per-cpu mm_cid on the destination
cpu.

Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
the mm's destination cpu current mm_cid, clear the mm's destination cpu
current mm_cid. If the migrated task's mm is in use on the destination
cpu, the reclaim of the mm_cid will be done lazily on the next
destination cpu context switch, else it is performed immediately.

The source cpu's mm_cid is _not_ simply moved to the destination cpu on
migration, because passing ownership of the mm_cid value to the
destination cpu while an actively running tasks also has its own
mm_cid value (in case of lazy reclaim on next context switch) would
over-allocate mm_cid values beyond the number of possible cpus.

Because we want to ensure the mm_cid converges towards the smaller
values as migrations happen, the prior optimization that was done when
context switching between threads belonging to the same mm is removed,
because it could delay the lazy release of the destination runqueue
mm_cid after it has been replaced by a migration. Removing this prior
optimization is not an issue performance-wise because the introduced
per-mm/per-cpu mm_cid tracking also covers this more specific case.

This patch is based on v6.3-rc5 with this patch applied:

("mm: Fix memory leak on mm_init error handling")

https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@efficios.com/

Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/mm_types.h |  30 ++++++++++
 include/linux/sched.h    |   1 +
 kernel/fork.c            |   8 ++-
 kernel/sched/core.c      | 122 +++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h     |  83 +++++++++++++++++++-------
 5 files changed, 210 insertions(+), 34 deletions(-)
  

Comments

Mathieu Desnoyers April 4, 2023, 12:13 a.m. UTC | #1
On 2023-04-03 14:13, Mathieu Desnoyers wrote:
[...]
> +/*
> + * Migration to dst cpu. Called with dst_rq lock held.
> + */
> +void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t, int src_cid)
> +{
> +	struct mm_struct *mm = t->mm;
> +	int dst_cid, *dst_pcpu_cid;
> +
> +	lockdep_assert_rq_held(dst_rq);
> +
> +	if (!mm || src_cid == -1)
> +		return;
> +
> +	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
> +
> +	/*
> +	 * If destination cpu cid is greater than the source cpu cid, unset it
> +	 * so it can be reallocated.
> +	 */
> +	dst_cid = *dst_pcpu_cid;
> +	if (dst_cid == -1 || dst_cid < src_cid)

Small detail: I plan to change this from "dst_cid < src_cid" to
"dst_cid  <= src_cid" in my next version of the patch to handle the
unlikely case where a task would be migrated back and forth between
two runqueues without being scheduled. It would be possible that the
task's last_mm_cid is equal to the dst_cid here, in which case it
would be better to leave the mm's destination cpu cid set.

> +		return;
> +	*dst_pcpu_cid = -1;
> +	/*
> +	 * Put dst_cid if it is not currently in use, else it will be lazy put
> +	 * on the next context switch.
> +	 */
> +	if (dst_rq->curr->mm != mm)
> +		__mm_cid_put(mm, dst_cid);
> +}

[...]

Thanks,

Mathieu
  
Aaron Lu April 4, 2023, 1:59 a.m. UTC | #2
On Mon, Apr 03, 2023 at 08:13:08PM -0400, Mathieu Desnoyers wrote:
> On 2023-04-03 14:13, Mathieu Desnoyers wrote:
> [...]
> > +/*
> > + * Migration to dst cpu. Called with dst_rq lock held.
> > + */
> > +void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t, int src_cid)
> > +{
> > +	struct mm_struct *mm = t->mm;
> > +	int dst_cid, *dst_pcpu_cid;
> > +
> > +	lockdep_assert_rq_held(dst_rq);
> > +
> > +	if (!mm || src_cid == -1)
> > +		return;
> > +
> > +	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
> > +
> > +	/*
> > +	 * If destination cpu cid is greater than the source cpu cid, unset it
> > +	 * so it can be reallocated.
> > +	 */
> > +	dst_cid = *dst_pcpu_cid;
> > +	if (dst_cid == -1 || dst_cid < src_cid)
> 
> Small detail: I plan to change this from "dst_cid < src_cid" to
> "dst_cid  <= src_cid" in my next version of the patch to handle the
> unlikely case where a task would be migrated back and forth between
> two runqueues without being scheduled. It would be possible that the
> task's last_mm_cid is equal to the dst_cid here, in which case it
> would be better to leave the mm's destination cpu cid set.
>

This patch is still good regarding lock contention.
(I applied the above small change while testing)

> > +		return;
> > +	*dst_pcpu_cid = -1;
> > +	/*
> > +	 * Put dst_cid if it is not currently in use, else it will be lazy put
> > +	 * on the next context switch.
> > +	 */
> > +	if (dst_rq->curr->mm != mm)
> > +		__mm_cid_put(mm, dst_cid);
> > +}

Thanks,
Aaron
  
Peter Zijlstra April 4, 2023, 8:57 a.m. UTC | #3
Sorry for being a little late to the party.

On Mon, Apr 03, 2023 at 02:13:42PM -0400, Mathieu Desnoyers wrote:

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0722859c3647..21466fdc4dc6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -609,6 +609,14 @@ struct mm_struct {
>  		 * were being concurrently updated by the updaters.
>  		 */
>  		raw_spinlock_t cid_lock;
> +		/**
> +		 * @pcpu_cid: Per-cpu current cid.
> +		 *
> +		 * Keep track of the currently allocated mm_cid for each cpu.
> +		 * The per-cpu mm_cid values are serialized by their respective
> +		 * runqueue locks.
> +		 */
> +		int __percpu *pcpu_cid;

This *might* be a problem... Consider running *many* single threaded
processes, this could exhaust the limited per-cpu storage (32bit) or use
excessive memory on large CPU systems.

Consider having to allocate per-cpu storage on a 4K CPU system while
only running single threaded processes -- but *LOTS* of them..
  
Peter Zijlstra April 4, 2023, 9:05 a.m. UTC | #4
On Mon, Apr 03, 2023 at 02:13:42PM -0400, Mathieu Desnoyers wrote:
>  void sched_mm_cid_exit_signals(struct task_struct *t)
>  {
>  	struct mm_struct *mm = t->mm;
> -	unsigned long flags;
> +	struct rq_flags rf;
> +	struct rq *rq;
>  
>  	if (!mm)
>  		return;
> -	local_irq_save(flags);
> +
> +	preempt_disable();
> +	rq = this_rq();
> +	rq_lock_irqsave(rq, &rf);
>  	mm_cid_put(mm, t->mm_cid);
> -	t->mm_cid = -1;
> +	t->last_mm_cid = t->mm_cid = -1;
>  	t->mm_cid_active = 0;
> -	local_irq_restore(flags);
> +	rq_unlock_irqrestore(rq, &rf);
> +	preempt_enable();
>  }

FWIW a *slightly* cheaper form is:

	preempt_disable();
	rq = this_rq();
	rq_lock_irqsave(rq, &rf);
	preempt_enable_noresched(); /* holding spinlock */
	...
	rq_unlock_irqrestore(rq, &rf);
  
Peter Zijlstra April 4, 2023, 9:10 a.m. UTC | #5
On Tue, Apr 04, 2023 at 10:57:49AM +0200, Peter Zijlstra wrote:
> 
> Sorry for being a little late to the party.
> 
> On Mon, Apr 03, 2023 at 02:13:42PM -0400, Mathieu Desnoyers wrote:
> 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 0722859c3647..21466fdc4dc6 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -609,6 +609,14 @@ struct mm_struct {
> >  		 * were being concurrently updated by the updaters.
> >  		 */
> >  		raw_spinlock_t cid_lock;
> > +		/**
> > +		 * @pcpu_cid: Per-cpu current cid.
> > +		 *
> > +		 * Keep track of the currently allocated mm_cid for each cpu.
> > +		 * The per-cpu mm_cid values are serialized by their respective
> > +		 * runqueue locks.
> > +		 */
> > +		int __percpu *pcpu_cid;
> 
> This *might* be a problem... Consider running *many* single threaded
> processes, this could exhaust the limited per-cpu storage (32bit) or use
> excessive memory on large CPU systems.
> 
> Consider having to allocate per-cpu storage on a 4K CPU system while
> only running single threaded processes -- but *LOTS* of them..

Ah, mm_struct::rss_stat[] beat us and set precedent, so not our problem I
suppose :-)
  
Peter Zijlstra April 4, 2023, 9:21 a.m. UTC | #6
On Mon, Apr 03, 2023 at 02:13:42PM -0400, Mathieu Desnoyers wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0d18c3969f90..775c7da6ec7f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2326,16 +2326,20 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
>  				   struct task_struct *p, int new_cpu)
>  {
> +	int cid;
> +
>  	lockdep_assert_rq_held(rq);
>  
>  	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
>  	set_task_cpu(p, new_cpu);
> +	cid = sched_mm_cid_migrate_from(rq, p);
>  	rq_unlock(rq, rf);
>  
>  	rq = cpu_rq(new_cpu);
>  
>  	rq_lock(rq, rf);
>  	WARN_ON_ONCE(task_cpu(p) != new_cpu);
> +	sched_mm_cid_migrate_to(rq, p, cid);
>  	activate_task(rq, p, 0);
>  	check_preempt_curr(rq, p, 0);
>  

I can't find more uses? Yet there are a ton more ways to actually
migrate tasks.
  
Mathieu Desnoyers April 4, 2023, 1:23 p.m. UTC | #7
On 2023-04-04 05:10, Peter Zijlstra wrote:
> On Tue, Apr 04, 2023 at 10:57:49AM +0200, Peter Zijlstra wrote:
>>
>> Sorry for being a little late to the party.
>>
>> On Mon, Apr 03, 2023 at 02:13:42PM -0400, Mathieu Desnoyers wrote:
>>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 0722859c3647..21466fdc4dc6 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -609,6 +609,14 @@ struct mm_struct {
>>>   		 * were being concurrently updated by the updaters.
>>>   		 */
>>>   		raw_spinlock_t cid_lock;
>>> +		/**
>>> +		 * @pcpu_cid: Per-cpu current cid.
>>> +		 *
>>> +		 * Keep track of the currently allocated mm_cid for each cpu.
>>> +		 * The per-cpu mm_cid values are serialized by their respective
>>> +		 * runqueue locks.
>>> +		 */
>>> +		int __percpu *pcpu_cid;
>>
>> This *might* be a problem... Consider running *many* single threaded
>> processes, this could exhaust the limited per-cpu storage (32bit) or use
>> excessive memory on large CPU systems.
>>
>> Consider having to allocate per-cpu storage on a 4K CPU system while
>> only running single threaded processes -- but *LOTS* of them..
> 
> Ah, mm_struct::rss_stat[] beat us and set precedent, so not our problem I
> suppose :-)

Yes, exactly. When I've seen that mm_struct::rss_stat[] did not care 
about having per-mm/per-cpu data, I went the same way.

Thanks,

Mathieu
  
Mathieu Desnoyers April 4, 2023, 1:24 p.m. UTC | #8
On 2023-04-04 05:05, Peter Zijlstra wrote:
> On Mon, Apr 03, 2023 at 02:13:42PM -0400, Mathieu Desnoyers wrote:
>>   void sched_mm_cid_exit_signals(struct task_struct *t)
>>   {
>>   	struct mm_struct *mm = t->mm;
>> -	unsigned long flags;
>> +	struct rq_flags rf;
>> +	struct rq *rq;
>>   
>>   	if (!mm)
>>   		return;
>> -	local_irq_save(flags);
>> +
>> +	preempt_disable();
>> +	rq = this_rq();
>> +	rq_lock_irqsave(rq, &rf);
>>   	mm_cid_put(mm, t->mm_cid);
>> -	t->mm_cid = -1;
>> +	t->last_mm_cid = t->mm_cid = -1;
>>   	t->mm_cid_active = 0;
>> -	local_irq_restore(flags);
>> +	rq_unlock_irqrestore(rq, &rf);
>> +	preempt_enable();
>>   }
> 
> FWIW a *slightly* cheaper form is:
> 
> 	preempt_disable();
> 	rq = this_rq();
> 	rq_lock_irqsave(rq, &rf);
> 	preempt_enable_noresched(); /* holding spinlock */
> 	...
> 	rq_unlock_irqrestore(rq, &rf);

Good point, updated for next round.

Thanks,

Mathieu
  

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0722859c3647..21466fdc4dc6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -609,6 +609,14 @@  struct mm_struct {
 		 * were being concurrently updated by the updaters.
 		 */
 		raw_spinlock_t cid_lock;
+		/**
+		 * @pcpu_cid: Per-cpu current cid.
+		 *
+		 * Keep track of the currently allocated mm_cid for each cpu.
+		 * The per-cpu mm_cid values are serialized by their respective
+		 * runqueue locks.
+		 */
+		int __percpu *pcpu_cid;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
@@ -872,6 +880,7 @@  static inline void vma_iter_init(struct vma_iterator *vmi,
 }
 
 #ifdef CONFIG_SCHED_MM_CID
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
@@ -885,16 +894,37 @@  static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 
 static inline void mm_init_cid(struct mm_struct *mm)
 {
+	int i;
+
 	raw_spin_lock_init(&mm->cid_lock);
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(mm->pcpu_cid, i) = -1;
 	cpumask_clear(mm_cidmask(mm));
 }
 
+static inline int mm_alloc_cid(struct mm_struct *mm)
+{
+	mm->pcpu_cid = alloc_percpu(int);
+	if (!mm->pcpu_cid)
+		return -ENOMEM;
+	mm_init_cid(mm);
+	return 0;
+}
+
+static inline void mm_destroy_cid(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_cid);
+	mm->pcpu_cid = NULL;
+}
+
 static inline unsigned int mm_cid_size(void)
 {
 	return cpumask_size();
 }
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm) { }
+static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_cid(struct mm_struct *mm) { }
 static inline unsigned int mm_cid_size(void)
 {
 	return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..48d48b2c73a5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1313,6 +1313,7 @@  struct task_struct {
 
 #ifdef CONFIG_SCHED_MM_CID
 	int				mm_cid;		/* Current cid in mm */
+	int				last_mm_cid;	/* Most recent cid in mm */
 	int				mm_cid_active;	/* Whether cid bitmap is active */
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c983c4fe3090..0a2b905c2ed5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -790,6 +790,7 @@  void __mmdrop(struct mm_struct *mm)
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
+	mm_destroy_cid(mm);
 
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		percpu_counter_destroy(&mm->rss_stat[i]);
@@ -1054,6 +1055,7 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_SCHED_MM_CID
 	tsk->mm_cid = -1;
+	tsk->last_mm_cid = -1;
 	tsk->mm_cid_active = 0;
 #endif
 	return tsk;
@@ -1159,18 +1161,22 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
+	if (mm_alloc_cid(mm))
+		goto fail_cid;
+
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
 			goto fail_pcpu;
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
-	mm_init_cid(mm);
 	return mm;
 
 fail_pcpu:
 	while (i > 0)
 		percpu_counter_destroy(&mm->rss_stat[--i]);
+	mm_destroy_cid(mm);
+fail_cid:
 	destroy_context(mm);
 fail_nocontext:
 	mm_free_pgd(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..775c7da6ec7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2326,16 +2326,20 @@  static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
+	int cid;
+
 	lockdep_assert_rq_held(rq);
 
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, new_cpu);
+	cid = sched_mm_cid_migrate_from(rq, p);
 	rq_unlock(rq, rf);
 
 	rq = cpu_rq(new_cpu);
 
 	rq_lock(rq, rf);
 	WARN_ON_ONCE(task_cpu(p) != new_cpu);
+	sched_mm_cid_migrate_to(rq, p, cid);
 	activate_task(rq, p, 0);
 	check_preempt_curr(rq, p, 0);
 
@@ -11383,45 +11387,139 @@  void call_trace_sched_update_nr_running(struct rq *rq, int count)
 }
 
 #ifdef CONFIG_SCHED_MM_CID
+/*
+ * Migration from src cpu. Called with src_rq lock held.
+ */
+int sched_mm_cid_migrate_from(struct rq *src_rq, struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	int src_cid, *src_pcpu_cid, last_mm_cid;
+
+	lockdep_assert_rq_held(src_rq);
+
+	if (!mm)
+		return -1;
+
+	last_mm_cid = t->last_mm_cid;
+	/*
+	 * If the migrated task has no last cid, or if the current
+	 * task on src rq uses the cid, it means the destination cpu
+	 * does not have to reallocate its cid to keep the cid allocation
+	 * compact.
+	 */
+	if (last_mm_cid == -1 || src_rq->curr->mm == mm)
+		return -1;
+
+	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
+	src_cid = *src_pcpu_cid;
+	/*
+	 * If the source cpu cid is set, and matches the last cid of the
+	 * migrated task, clear the source cpu cid to keep cid allocation
+	 * compact to cover the case where this task is the last task using
+	 * this mm on the source cpu. If there happens to be other tasks left
+	 * on the source cpu using this mm, the next task using this mm will
+	 * reallocate its cid on context switch.
+	 */
+	if (src_cid != -1 && last_mm_cid == src_cid) {
+		/*
+		 * We cannot keep ownership of concurrency ID without runqueue
+		 * lock held when it is not used by a current task, because it
+		 * would lead to allocation of more concurrency ids than there
+		 * are possible cpus in the system. The last_mm_cid is used as
+		 * a hint to conditionally unset the dst cpu cid, keeping
+		 * allocated concurrency ids compact.
+		 */
+		__mm_cid_put(mm, src_cid);
+		*src_pcpu_cid = -1;
+	}
+	return last_mm_cid;
+}
+
+/*
+ * Migration to dst cpu. Called with dst_rq lock held.
+ */
+void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t, int src_cid)
+{
+	struct mm_struct *mm = t->mm;
+	int dst_cid, *dst_pcpu_cid;
+
+	lockdep_assert_rq_held(dst_rq);
+
+	if (!mm || src_cid == -1)
+		return;
+
+	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
+
+	/*
+	 * If destination cpu cid is greater than the source cpu cid, unset it
+	 * so it can be reallocated.
+	 */
+	dst_cid = *dst_pcpu_cid;
+	if (dst_cid == -1 || dst_cid < src_cid)
+		return;
+	*dst_pcpu_cid = -1;
+	/*
+	 * Put dst_cid if it is not currently in use, else it will be lazy put
+	 * on the next context switch.
+	 */
+	if (dst_rq->curr->mm != mm)
+		__mm_cid_put(mm, dst_cid);
+}
+
 void sched_mm_cid_exit_signals(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
 	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
+	t->last_mm_cid = t->mm_cid = -1;
 	t->mm_cid_active = 0;
-	local_irq_restore(flags);
+	rq_unlock_irqrestore(rq, &rf);
+	preempt_enable();
 }
 
 void sched_mm_cid_before_execve(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
 	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
+	t->last_mm_cid = t->mm_cid = -1;
 	t->mm_cid_active = 0;
-	local_irq_restore(flags);
+	rq_unlock_irqrestore(rq, &rf);
+	preempt_enable();
 }
 
 void sched_mm_cid_after_execve(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
-	t->mm_cid = mm_cid_get(mm);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	t->last_mm_cid = t->mm_cid = mm_cid_get(mm);
 	t->mm_cid_active = 1;
-	local_irq_restore(flags);
+	rq_unlock_irqrestore(rq, &rf);
+	preempt_enable();
 	rseq_set_notify_resume(t);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..2bc62cc412e1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3249,7 +3249,47 @@  static inline void update_current_exec_runtime(struct task_struct *curr,
 }
 
 #ifdef CONFIG_SCHED_MM_CID
-static inline int __mm_cid_get(struct mm_struct *mm)
+extern int sched_mm_cid_migrate_from(struct rq *src_rq, struct task_struct *t);
+extern void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t, int cid);
+
+static inline void __mm_cid_put(struct mm_struct *mm, int cid)
+{
+	lockdep_assert_irqs_disabled();
+	if (cid < 0)
+		return;
+	raw_spin_lock(&mm->cid_lock);
+	__cpumask_clear_cpu(cid, mm_cidmask(mm));
+	raw_spin_unlock(&mm->cid_lock);
+}
+
+static inline void mm_cid_put(struct mm_struct *mm, int thread_cid)
+{
+	int *pcpu_cid, cid;
+
+	lockdep_assert_irqs_disabled();
+	if (thread_cid < 0)
+		return;
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = *pcpu_cid;
+	if (cid == thread_cid)
+		*pcpu_cid = -1;
+	__mm_cid_put(mm, thread_cid);
+}
+
+static inline void mm_cid_put_lazy(struct mm_struct *mm, int thread_cid)
+{
+	int *pcpu_cid, cid;
+
+	lockdep_assert_irqs_disabled();
+	if (thread_cid < 0)
+		return;
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = *pcpu_cid;
+	if (cid != thread_cid)
+		__mm_cid_put(mm, thread_cid);
+}
+
+static inline int __mm_cid_get_locked(struct mm_struct *mm)
 {
 	struct cpumask *cpumask;
 	int cid;
@@ -3262,48 +3302,49 @@  static inline int __mm_cid_get(struct mm_struct *mm)
 	return cid;
 }
 
-static inline void mm_cid_put(struct mm_struct *mm, int cid)
+static inline int __mm_cid_get(struct mm_struct *mm)
 {
+	int ret;
+
 	lockdep_assert_irqs_disabled();
-	if (cid < 0)
-		return;
 	raw_spin_lock(&mm->cid_lock);
-	__cpumask_clear_cpu(cid, mm_cidmask(mm));
+	ret = __mm_cid_get_locked(mm);
 	raw_spin_unlock(&mm->cid_lock);
+	return ret;
 }
 
 static inline int mm_cid_get(struct mm_struct *mm)
 {
-	int ret;
+	int *pcpu_cid, cid;
 
 	lockdep_assert_irqs_disabled();
-	raw_spin_lock(&mm->cid_lock);
-	ret = __mm_cid_get(mm);
-	raw_spin_unlock(&mm->cid_lock);
-	return ret;
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = *pcpu_cid;
+	if (cid == -1) {
+		raw_spin_lock(&mm->cid_lock);
+		cid = __mm_cid_get_locked(mm);
+		raw_spin_unlock(&mm->cid_lock);
+		*pcpu_cid = cid;
+		return cid;
+	}
+	return cid;
 }
 
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
 {
 	if (prev->mm_cid_active) {
-		if (next->mm_cid_active && next->mm == prev->mm) {
-			/*
-			 * Context switch between threads in same mm, hand over
-			 * the mm_cid from prev to next.
-			 */
-			next->mm_cid = prev->mm_cid;
-			prev->mm_cid = -1;
-			return;
-		}
-		mm_cid_put(prev->mm, prev->mm_cid);
+		mm_cid_put_lazy(prev->mm, prev->mm_cid);
 		prev->mm_cid = -1;
 	}
 	if (next->mm_cid_active)
-		next->mm_cid = mm_cid_get(next->mm);
+		next->last_mm_cid = next->mm_cid = mm_cid_get(next->mm);
 }
 
 #else
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
+static inline void sched_mm_cid_migrate(struct rq *rq, struct task_struct *t, int new_cpu) { }
+static inline int sched_mm_cid_migrate_from(struct rq *src_rq, struct task_struct *t) { return 0; }
+static inline void sched_mm_cid_migrate_to(struct rq *src_rq, struct task_struct *t, int cid) { }
 #endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */