[RFC,v3] sched: Fix performance regression introduced by mm_cid

Message ID 20230405162635.225245-1-mathieu.desnoyers@efficios.com
State New
Series [RFC,v3] sched: Fix performance regression introduced by mm_cid

Commit Message

Mathieu Desnoyers April 5, 2023, 4:26 p.m. UTC
  Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.

Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing them immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.

Thread migration is handled by introducing calls to
sched_mm_cid_migrate_from() in set_task_cpu() and to
sched_mm_cid_migrate_to() (with the destination runqueue lock held) in
activate_task() for migrating tasks. set_task_cpu() is invoked with and
without the source rq lock held: the wakeup path does not hold the source
rq lock.

sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
index corresponding to the source runqueue if it matches the last mm_cid
observed by the migrated task. This last mm_cid value is returned as a
hint to conditionally clear the mm's per-cpu mm_cid on the destination
cpu.

Then, in sched_mm_cid_migrate_to(), if this last mm_cid is smaller than
the mm's current mm_cid on the destination cpu, clear the destination
cpu's mm_cid. If the migrated task's mm is in use on the destination cpu,
the mm_cid is reclaimed lazily on the destination cpu's next context
switch; otherwise it is reclaimed immediately.
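
For illustration, here is a rough sketch of that destination-side decision
(this is not the patch hunk itself; the signature and local names are only
illustrative):

void sched_mm_cid_migrate_to_sketch(struct rq *dst_rq, struct task_struct *t,
				    int last_mm_cid)
{
	struct mm_struct *mm = t->mm;
	int *dst_pcpu_cid, dst_cid;

	/* Called with the destination rq lock held (activate_task() path). */
	if (!mm || last_mm_cid == -1)
		return;
	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
	dst_cid = READ_ONCE(*dst_pcpu_cid);
	/*
	 * Only clear the destination cpu's cid when it is larger than the
	 * hint left by sched_mm_cid_migrate_from(); this is what makes the
	 * cid allocation converge towards the smaller values.
	 */
	if (!mm_cid_is_valid(dst_cid) || dst_cid <= last_mm_cid)
		return;
	/*
	 * If the mm is in use by the destination cpu's current task, defer
	 * the reclaim (lazily, on its next context switch); otherwise unset
	 * and release the cid immediately.
	 */
}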

The source cpu's mm_cid is _not_ simply moved to the destination cpu on
migration, because passing ownership of the mm_cid value to the
destination cpu while an actively running task also has its own
mm_cid value (in case of lazy reclaim on the next context switch) would
over-allocate mm_cid values beyond the number of possible cpus.

Because we want the mm_cid to converge towards the smaller values as
migrations happen, the prior optimization for context switches between
threads belonging to the same mm is removed: it could delay the lazy
release of the destination runqueue mm_cid after it has been replaced by
a migration. Removing this prior optimization is not an issue
performance-wise, because the introduced per-mm/per-cpu mm_cid tracking
also covers this more specific case.

This patch is based on v6.3-rc5 with this patch applied:

("mm: Fix memory leak on mm_init error handling")

https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@efficios.com/

Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/mm_types.h |  60 +++++++++++
 include/linux/sched.h    |   1 +
 kernel/fork.c            |   8 +-
 kernel/sched/core.c      | 218 +++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h     | 106 +++++++++++++++----
 5 files changed, 352 insertions(+), 41 deletions(-)
  

Comments

Mathieu Desnoyers April 5, 2023, 8:37 p.m. UTC | #1
On 2023-04-05 12:26, Mathieu Desnoyers wrote:
[...]
>   #ifdef CONFIG_SCHED_MM_CID
> +/*
> + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
> + * that the rq lock is held.
> + */
> +void sched_mm_cid_migrate_from(struct task_struct *t)
> +{
> +	int src_cid, *src_pcpu_cid, last_mm_cid;
> +	struct mm_struct *mm = t->mm;
> +	struct rq *src_rq;
> +	struct task_struct *src_task;
> +
> +	if (!mm)
> +		return;
> +
> +	last_mm_cid = t->last_mm_cid;
> +	/*
> +	 * If the migrated task has no last cid, or if the current
> +	 * task on src rq uses the cid, it means the destination cpu
> +	 * does not have to reallocate its cid to keep the cid allocation
> +	 * compact.
> +	 */
> +	if (last_mm_cid == -1)
> +		return;
> +
> +	src_rq = task_rq(t);
> +	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
> +	src_cid = READ_ONCE(*src_pcpu_cid);
> +
> +	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
> +		return;
> +
> +	/*
> +	 * If we observe an active task using the mm on this rq, it means we
> +	 * are not the last task to be migrated from this cpu for this mm, so
> +	 * there is no need to clear the src_cid.
> +	 */
> +	rcu_read_lock();
> +	src_task = rcu_dereference(src_rq->curr);
> +	if (src_task->mm_cid_active && src_task->mm == mm) {
> +		rcu_read_unlock();
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * If the source cpu cid is set, and matches the last cid of the
> +	 * migrated task, clear the source cpu cid to keep cid allocation
> +	 * compact to cover the case where this task is the last task using
> +	 * this mm on the source cpu. If there happens to be other tasks left
> +	 * on the source cpu using this mm, the next task using this mm will
> +	 * reallocate its cid on context switch.
> +	 *
> +	 * We cannot keep ownership of concurrency ID without runqueue
> +	 * lock held when it is not used by a current task, because it
> +	 * would lead to allocation of more concurrency ids than there
> +	 * are possible cpus in the system. The last_mm_cid is used as
> +	 * a hint to conditionally unset the dst cpu cid, keeping
> +	 * allocated concurrency ids compact.
> +	 */
> +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
> +		return;
> +
> +	/*
> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
> +	 * rq->curr->mm matches the scheduler barrier in context_switch()
> +	 * between store to rq->curr and load of prev and next task's
> +	 * per-mm/cpu cid.

Thinking about it further, I suspect the transition:

          *   user -> kernel   lazy + mmgrab() active

in context_switch() lacks a memory barrier we need here (it's only an
atomic_inc with mmgrab()).

The scenario is a transition from user to kernel thread happening 
concurrently with migrate-from.

Because there is no memory barrier between set rq->curr to a value that
has a NULL mm and loading per-mm/cpu cid value in mm_cid_put_lazy() for 
the prev task, nothing guarantees that the following src_rq->curr rcu 
dereference here will observe the store.

Or am I missing something ?

Thanks,

Mathieu


> +	 *
> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
> +	 * rq->curr->mm_cid_active matches the barrier in
> +	 * sched_mm_cid_exit_signals(), sched_mm_cid_before_execve(), and
> +	 * sched_mm_cid_after_execve() between store to t->mm_cid_active and
> +	 * load of per-mm/cpu cid.
> +	 */
> +
> +	/*
> +	 * If we observe an active task using the mm on this rq after setting the lazy-put
> +	 * flag, this task will be responsible for transitioning from lazy-put
> +	 * flag set to MM_CID_UNSET.
> +	 */
> +	rcu_read_lock();
> +	src_task = rcu_dereference(src_rq->curr);
> +	if (src_task->mm_cid_active && src_task->mm == mm) {
> +		rcu_read_unlock();
> +		/*
> +		 * We observed an active task for this mm, clearing the destination
> +		 * cpu mm_cid is not relevant for compactness.
> +		 */
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * The src_cid is unused, so it can be unset.
> +	 */
> +	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) != mm_cid_set_lazy_put(src_cid))
> +		return;
> +	__mm_cid_put(mm, src_cid);
> +}
  
Peter Zijlstra April 6, 2023, 9:51 a.m. UTC | #2
On Wed, Apr 05, 2023 at 04:37:42PM -0400, Mathieu Desnoyers wrote:
> On 2023-04-05 12:26, Mathieu Desnoyers wrote:
> [...]
> >   #ifdef CONFIG_SCHED_MM_CID
> > +/*
> > + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
> > + * that the rq lock is held.
> > + */
> > +void sched_mm_cid_migrate_from(struct task_struct *t)
> > +{
> > +	int src_cid, *src_pcpu_cid, last_mm_cid;
> > +	struct mm_struct *mm = t->mm;
> > +	struct rq *src_rq;
> > +	struct task_struct *src_task;
> > +
> > +	if (!mm)
> > +		return;
> > +
> > +	last_mm_cid = t->last_mm_cid;
> > +	/*
> > +	 * If the migrated task has no last cid, or if the current
> > +	 * task on src rq uses the cid, it means the destination cpu
> > +	 * does not have to reallocate its cid to keep the cid allocation
> > +	 * compact.
> > +	 */
> > +	if (last_mm_cid == -1)
> > +		return;
> > +
> > +	src_rq = task_rq(t);
> > +	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
> > +	src_cid = READ_ONCE(*src_pcpu_cid);
> > +
> > +	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
> > +		return;
> > +
> > +	/*
> > +	 * If we observe an active task using the mm on this rq, it means we
> > +	 * are not the last task to be migrated from this cpu for this mm, so
> > +	 * there is no need to clear the src_cid.
> > +	 */
> > +	rcu_read_lock();
> > +	src_task = rcu_dereference(src_rq->curr);
> > +	if (src_task->mm_cid_active && src_task->mm == mm) {
> > +		rcu_read_unlock();
> > +		t->last_mm_cid = -1;
> > +		return;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	/*
> > +	 * If the source cpu cid is set, and matches the last cid of the
> > +	 * migrated task, clear the source cpu cid to keep cid allocation
> > +	 * compact to cover the case where this task is the last task using
> > +	 * this mm on the source cpu. If there happens to be other tasks left
> > +	 * on the source cpu using this mm, the next task using this mm will
> > +	 * reallocate its cid on context switch.
> > +	 *
> > +	 * We cannot keep ownership of concurrency ID without runqueue
> > +	 * lock held when it is not used by a current task, because it
> > +	 * would lead to allocation of more concurrency ids than there
> > +	 * are possible cpus in the system. The last_mm_cid is used as
> > +	 * a hint to conditionally unset the dst cpu cid, keeping
> > +	 * allocated concurrency ids compact.
> > +	 */
> > +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
> > +		return;
> > +
> > +	/*
> > +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
> > +	 * rq->curr->mm matches the scheduler barrier in context_switch()
> > +	 * between store to rq->curr and load of prev and next task's
> > +	 * per-mm/cpu cid.
> 
> Thinking about it further, I suspect the transition:
> 
>          *   user -> kernel   lazy + mmgrab() active
> 
> in context_switch() lacks a memory barrier we need here (it's only an
> atomic_inc with mmgrab()).
> 
> The scenario is a transition from user to kernel thread happening
> concurrently with migrate-from.
> 
> Because there is no memory barrier between set rq->curr to a value that
> has a NULL mm and loading per-mm/cpu cid value in mm_cid_put_lazy() for the
> prev task, nothing guarantees that the following src_rq->curr rcu
> dereference here will observe the store.
> 
> Or am I missing something ?

OK, so last night I thought it was me needing sleep (which might still
be the case), but now I'm still not quite getting it. Let me draw a
picture, perhaps that'll help...

It reads to me like you're treating a barrier as 'makes visible', which
it never will.


CPU0 will run a user task A and switch to a kernel thread K;
CPU1 will concurrently run sched_mm_cid_migrate_from().


	CPU0				CPU1

	.. runs A ..

					if (src_task->mm_cid_active && src_task->mm == mm)
					  // false, proceed

					cmpxchg(); // set LAZY

	__schedule()
					(A)
	  rq->curr = K;
					(B)
	  context_switch()
	    prepare_task_switch()
	      switch_mm_cid()
	        cid_put_lazy(prev)
		cid_get(next)
            mmgrab(); // user->kernel active_mm
	    switch_to();



	.. runs K ..


And it is this second compare that can either observe A or K, right?
That is the locations A or B above. Later doesn't make sense because
then switch_mm_cid() will have taken care of things.

If A, still false, and we continue to mark UNSET and put.
If B, we see K, which will not have mm_cid_active set so still false and
same as above.

What am I missing?
  
Mathieu Desnoyers April 7, 2023, 11:50 p.m. UTC | #3
On 2023-04-06 05:51, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 04:37:42PM -0400, Mathieu Desnoyers wrote:
>> On 2023-04-05 12:26, Mathieu Desnoyers wrote:
>> [...]
>>>    #ifdef CONFIG_SCHED_MM_CID
>>> +/*
>>> + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
>>> + * that the rq lock is held.
>>> + */
>>> +void sched_mm_cid_migrate_from(struct task_struct *t)
>>> +{
>>> +	int src_cid, *src_pcpu_cid, last_mm_cid;
>>> +	struct mm_struct *mm = t->mm;
>>> +	struct rq *src_rq;
>>> +	struct task_struct *src_task;
>>> +
>>> +	if (!mm)
>>> +		return;
>>> +
>>> +	last_mm_cid = t->last_mm_cid;
>>> +	/*
>>> +	 * If the migrated task has no last cid, or if the current
>>> +	 * task on src rq uses the cid, it means the destination cpu
>>> +	 * does not have to reallocate its cid to keep the cid allocation
>>> +	 * compact.
>>> +	 */
>>> +	if (last_mm_cid == -1)
>>> +		return;
>>> +
>>> +	src_rq = task_rq(t);
>>> +	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
>>> +	src_cid = READ_ONCE(*src_pcpu_cid);
>>> +
>>> +	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * If we observe an active task using the mm on this rq, it means we
>>> +	 * are not the last task to be migrated from this cpu for this mm, so
>>> +	 * there is no need to clear the src_cid.
>>> +	 */
>>> +	rcu_read_lock();
>>> +	src_task = rcu_dereference(src_rq->curr);
>>> +	if (src_task->mm_cid_active && src_task->mm == mm) {
>>> +		rcu_read_unlock();
>>> +		t->last_mm_cid = -1;
>>> +		return;
>>> +	}
>>> +	rcu_read_unlock();
>>> +
>>> +	/*
>>> +	 * If the source cpu cid is set, and matches the last cid of the
>>> +	 * migrated task, clear the source cpu cid to keep cid allocation
>>> +	 * compact to cover the case where this task is the last task using
>>> +	 * this mm on the source cpu. If there happens to be other tasks left
>>> +	 * on the source cpu using this mm, the next task using this mm will
>>> +	 * reallocate its cid on context switch.
>>> +	 *
>>> +	 * We cannot keep ownership of concurrency ID without runqueue
>>> +	 * lock held when it is not used by a current task, because it
>>> +	 * would lead to allocation of more concurrency ids than there
>>> +	 * are possible cpus in the system. The last_mm_cid is used as
>>> +	 * a hint to conditionally unset the dst cpu cid, keeping
>>> +	 * allocated concurrency ids compact.
>>> +	 */
>>> +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
>>> +	 * rq->curr->mm matches the scheduler barrier in context_switch()
>>> +	 * between store to rq->curr and load of prev and next task's
>>> +	 * per-mm/cpu cid.
>>
>> Thinking about it further, I suspect the transition:
>>
>>           *   user -> kernel   lazy + mmgrab() active
>>
>> in context_switch() lacks a memory barrier we need here (it's only an
>> atomic_inc with mmgrab()).
>>
>> The scenario is a transition from user to kernel thread happening
>> concurrently with migrate-from.
>>
>> Because there is no memory barrier between set rq->curr to a value that
>> has a NULL mm and loading per-mm/cpu cid value in mm_cid_put_lazy() for the
>> prev task, nothing guarantees that the following src_rq->curr rcu
>> dereference here will observe the store.
>>
>> Or am I missing something ?
> 
> OK, so last night I thought it was me needing sleep (which might still
> be the case), but now I'm still not quite getting it. Let me draw a
> picture, perhaps that'll help...

Let me try to start from your explanation to expose the case I have in mind.

> 
> It reads to me like you're treating a barrier as 'makes visible', which
> it never will.

No. What I am trying to figure out here is how we can introduce a Dekker
memory ordering (ref. row 10 from table at https://lwn.net/Articles/573436/)
to guarantee that the loads observe at least one of the stores:

Taken from kernel/futex/waitwake.c:

  *      X = Y = 0
  *
  *      w[X]=1          w[Y]=1
  *      MB              MB
  *      r[Y]=y          r[X]=x
  *
  * Which guarantees that x==0 && y==0 is impossible [...]
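
As a stand-alone illustration of that pattern, here it is in user-space C11
atomics (the seq_cst fences stand in for smp_mb() and for the barrier implied
by cmpxchg; all names are made up for the example, and a single run of course
cannot prove the guarantee, the assert merely encodes the forbidden outcome):

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int rq_curr_is_Y;   /* w[X]: context switch publishes task (Y) */
static atomic_int pcpu_cid_lazy;  /* w[Y]: migrate-from sets the LAZY flag   */
static int r_lazy = -1, r_curr = -1;

static void *context_switch_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&rq_curr_is_Y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the barrier in question */
	r_lazy = atomic_load_explicit(&pcpu_cid_lazy, memory_order_relaxed);
	return NULL;
}

static void *migrate_from_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&pcpu_cid_lazy, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* implied by cmpxchg */
	r_curr = atomic_load_explicit(&rq_curr_is_Y, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, context_switch_side, NULL);
	pthread_create(&b, NULL, migrate_from_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* Forbidden outcome: neither load observed the other side's store. */
	assert(!(r_lazy == 0 && r_curr == 0));
	return 0;
}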

But let's first go back to the basic properties we want to guarantee before we
discuss how to actually guarantee them without holding the runqueue lock, and
ideally without additional atomic operations or barriers on the scheduler
fast-path.

There are two basic properties we want to guarantee here:

(1) Migrate-from should _never_ mark a per-cpu cid UNSET when it is actively used
     by a task. That would lead to concurrent allocation of the cid and userspace
     corruption.

(2) Migrate-from should ideally not leak a cid value when there are no more tasks
     using it on the cpu. This ensures we keep cid allocations as compact as possible
     (closest to 0). Getting this wrong will not cause userspace corruption, just a
     less compact id allocation.

> 
> 
> CPU0 will run a user task A and switch to a kernel thread K;
> CPU1 will concurrently run sched_mm_cid_migrate_from().
> 
> 
> 	CPU0				CPU1
> 
> 	.. runs A ..
> 
> 					if (src_task->mm_cid_active && src_task->mm == mm)
> 					  // false, proceed
> 
> 					cmpxchg(); // set LAZY
> 
> 	__schedule()
> 					(A)
> 	  rq->curr = K;
> 					(B)
> 	  context_switch()
> 	    prepare_task_switch()

Please excuse my earlier emails: I did get the order between the store to rq->curr,
the barriers in context_switch(), and prepare_task_switch() wrong. The documented
barriers in context_switch() are _after_ prepare_task_switch(), not before, so part
of my earlier explanation does not make sense.

We are in a situation where there are no barriers between store to rq->curr and the
call to prepare_task_switch() that follows just after.

> 	      switch_mm_cid()
> 	        cid_put_lazy(prev)
> 		cid_get(next)
>              mmgrab(); // user->kernel active_mm
> 	    switch_to();
> 
> 
> 
> 	.. runs K ..
> 
> 
> And it is this second compare that can either observe A or K, right?
> That is the locations A or B above. Later doesn't make sense because
> then switch_mm_cid() will have taken care of things.
> 
> If A, still false, and we continue to mark UNSET and put.
> If B, we see K, which will not have mm_cid_active set so still false and
> same as above.
> 
> What am I missing?

Let's introduce task (Y) which has mm == src_task->mm and task (N) which has
mm != src_task->mm for the rest of the discussion.

Let's consider the scheduling state transitions we want to consider here.
There are two scheduler state transitions on context switch we care about:

(TSA) Store to rq->curr with transition from (N) to (Y)

(TSB) Store to rq->curr with transition from (Y) to (N)

On the migrate-from side, there is one transition we care about:

(TMA) cmpxchg to *pcpu_cid to set the LAZY flag

There is also a transition to UNSET state which can be performed from all
sides (scheduler, migrate-from). It is performed with a cmpxchg everywhere
which guarantees that only a single thread will succeed:

(TMB) cmpxchg to *pcpu_cid to mark UNSET

Just to be clear (at the risk of repeating myself), what we do _not_ want
to happen is a transition to UNSET when a thread is actively using the cid
(property (1)). And ideally we do not want to leak a cid after migrating the
last task from a cpu (property (2)).
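
Tying those transition names to the helpers introduced by the patch, the two
cmpxchg edges can be sketched back to back like this (illustration only, not a
real call site from the patch):

static void mm_cid_transitions_sketch(struct mm_struct *mm, int *pcpu_cid)
{
	int cid = READ_ONCE(*pcpu_cid);

	/* (TMA): only migrate-from moves a valid cid to lazy-put. */
	if (!mm_cid_is_valid(cid) ||
	    cmpxchg(pcpu_cid, cid, mm_cid_set_lazy_put(cid)) != cid)
		return;

	/*
	 * (TMB): scheduler and migrate-from may both attempt lazy-put ->
	 * MM_CID_UNSET, but the cmpxchg guarantees a single winner, which
	 * then releases the cid back to the mm.
	 */
	if (cmpxchg(pcpu_cid, mm_cid_set_lazy_put(cid), MM_CID_UNSET) !=
	    mm_cid_set_lazy_put(cid))
		return;
	__mm_cid_put(mm, cid);
}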

Let's look at the relevant combinations of TSA/TSB, and TMA transitions.

Scenario A) (TSA)+(TMA) (from next task perspective)

CPU0                                                 CPU1

Context switch CS-1                                  Migrate-from
   - store to rq->curr: (N)->(Y) (TSA)                - cmpxchg to *pcpu_id to LAZY (TMA)
      *** missing barrier ?? ***                        (implied barrier after cmpxchg)
   - prepare_task_switch()
     - switch_mm_cid()
       - mm_cid_get (next)
         - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.

If task (Y) store is observed by rcu_dereference(), it means that there is still
an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
fulfills property (1). That observed task will itself eventually need a migrate-from
to be migrated away from that cpu, which fulfills property (2).

If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(), it will
move its state to UNSET, which clears the percpu cid perhaps uselessly (which is not
an issue for correctness). Because task (Y) is not observed, CPU1 can move ahead to
set the state to UNSET. Because moving state to UNSET is done with a cmpxchg expecting
that the old state has the LAZY flag set, only one thread will successfully UNSET.

If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
(Y) and do nothing more, which is fine.

What we are effectively preventing with this Dekker is a scenario where neither LAZY
flag nor store (Y) are observed, which would fail property (1) because this would
UNSET a cid which is actively used.


Scenario B) (TSB)+(TMA) (from prev task perspective)

CPU0                                                 CPU1

Context switch CS-1                                  Migrate-from
   - store to rq->curr: (Y)->(N) (TSB)                - cmpxchg to *pcpu_id to LAZY (TMA)
     *** missing barrier ?? ***                         (implied barrier after cmpxchg)
   - prepare_task_switch()
     - switch_mm_cid()
       - cid_put_lazy() (prev)
         - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (N) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.

If rcu_dereference observes (N) but LAZY is not observed, migrate-from will take care to
advance the state to UNSET, thus fulfilling property (2). Because (Y) is not running anymore
property (1) is fulfilled.

If rcu_dereference does not observe (N), but LAZY is observed, migrate-from does not
advance to UNSET because it observes (Y), but LAZY flag will make the task on CPU0
take care of advancing the state to UNSET, thus fulfilling property (2).

If both (N) and LAZY are observed, both migrate-from and CPU0 will try to advance the
state to UNSET, but only one will succeed its cmpxchg.

What we are effectively preventing with this Dekker is a scenario where neither LAZY
flag nor store (N) are observed, which would fail property (2) because it would leak
a cid on a cpu that has no task left using the mm.

So based on my analysis, we are indeed missing a barrier between store to rq->curr and
load of the per-mm/cpu cid within context_switch().

Am I missing something ? How can we provide this barrier with minimal overhead ?

Thanks,

Mathieu
  
Mathieu Desnoyers April 8, 2023, 1:14 a.m. UTC | #4
On 2023-04-07 19:50, Mathieu Desnoyers wrote:
[...]
> 
> Let's introduce task (Y) which has mm == src_task->mm and task (N) which 
> has
> mm != src_task->mm for the rest of the discussion.
> 
> Let's consider the scheduling state transitions we want to consider here.
> There are two scheduler state transitions on context switch we care about:
> 
> (TSA) Store to rq->curr with transition from (N) to (Y)
> 
> (TSB) Store to rq->curr with transition from (Y) to (N)
> 
> On the migrate-from side, there is one transition we care about:
> 
> (TMA) cmpxchg to *pcpu_cid to set the LAZY flag
> 
> There is also a transition to UNSET state which can be performed from all
> sides (scheduler, migrate-from). It is performed with a cmpxchg everywhere
> which guarantees that only a single thread will succeed:
> 
> (TMB) cmpxchg to *pcpu_cid to mark UNSET
> 
> Just to be clear (at the risk of repeating myself), what we do _not_ want
> to happen is a transition to UNSET when a thread is actively using the cid
> (property (1)). And ideally we do not want to leak a cid after migrating 
> the
> last task from a cpu (property (2)).
> 
> Let's look at the relevant combinations of TSA/TSB, and TMA transitions.
> 
> Scenario A) (TSA)+(TMA) (from next task perspective)
> 
> CPU0                                                 CPU1
> 
> Context switch CS-1                                  Migrate-from
>    - store to rq->curr: (N)->(Y) (TSA)                - cmpxchg to *pcpu_id to LAZY (TMA)
>       *** missing barrier ?? ***                        (implied barrier after cmpxchg)
>    - prepare_task_switch()
>      - switch_mm_cid()
>        - mm_cid_get (next)
>          - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)
> 
> This Dekker ensures that either task (Y) is observed by the 
> rcu_dereference() or the LAZY
> flag is observed by READ_ONCE(), or both are observed.
> 
> If task (Y) store is observed by rcu_dereference(), it means that there 
> is still
> an active task on the cpu. Migrate-from will therefore not transition to 
> UNSET, which
> fulfills property (1). That observed task will itself eventually need a 
> migrate-from
> to be migrated away from that cpu, which fulfills property (2).
> 
> If task (Y) is not observed, but the lazy flag is observed by 
> READ_ONCE(), it will
> move its state to UNSET, which clears the percpu cid perhaps uselessly 
> (which is not
> an issue for correctness). Because task (Y) is not observed, CPU1 can 
> move ahead to
> set the state to UNSET. Because moving state to UNSET is done with a 
> cmpxchg expecting
> that the old state has the LAZY flag set, only one thread will 
> successfully UNSET.
> 
> If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 
> will observe
> the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will 
> observe task
> (Y) and do nothing more, which is fine.
> 
> What we are effectively preventing with this Dekker is a scenario where 
> neither LAZY
> flag nor store (Y) are observed, which would fail property (1) because 
> this would
> UNSET a cid which is actively used.
> 
> 
> Scenario B) (TSB)+(TMA) (from prev task perspective)
> 
> CPU0                                                 CPU1
> 
> Context switch CS-1                                  Migrate-from
>    - store to rq->curr: (Y)->(N) (TSB)                - cmpxchg to *pcpu_id to LAZY (TMA)
>      *** missing barrier ?? ***                         (implied barrier after cmpxchg)
>    - prepare_task_switch()
>      - switch_mm_cid()
>        - cid_put_lazy() (prev)
>          - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)
> 
> This Dekker ensures that either task (N) is observed by the 
> rcu_dereference() or the LAZY
> flag is observed by READ_ONCE(), or both are observed.
> 
> If rcu_dereference observes (N) but LAZY is not observed, migrate-from 
> will take care to
> advance the state to UNSET, thus fulfilling property (2). Because (Y) is 
> not running anymore
> property (1) is fulfilled.
> 
> If rcu_dereference does not observe (N), but LAZY is observed, 
> migrate-from does not
> advance to UNSET because it observes (Y), but LAZY flag will make the 
> task on CPU0
> take care of advancing the state to UNSET, thus fulfilling property (2).
> 
> If both (N) and LAZY are observed, both migrate-from and CPU0 will try 
> to advance the
> state to UNSET, but only one will succeed its cmpxchg.
> 
> What we are effectively preventing with this Dekker is a scenario where 
> neither LAZY
> flag nor store (N) are observed, which would fail property (2) because 
> it would leak
> a cid on a cpu that has no task left using the mm.
> 
> So based on my analysis, we are indeed missing a barrier between store 
> to rq->curr and
> load of the per-mm/cpu cid within context_switch().
> 
> Am I missing something ? How can we provide this barrier with minimal 
> overhead ?

Here is the barrier placement I have in mind to minimize the impact:

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..f20fc0600fcc 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -37,6 +37,11 @@ static inline void mmgrab(struct mm_struct *mm)
  	atomic_inc(&mm->mm_count);
  }
  
+static inline void smp_mb__after_mmgrab(void)
+{
+	smp_mb__after_atomic();
+}
+
  extern void __mmdrop(struct mm_struct *mm);
  
  static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e0fa4193499..8d410c0dcb39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5117,7 +5117,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
  	sched_info_switch(rq, prev, next);
  	perf_event_task_sched_out(prev, next);
  	rseq_preempt(prev);
-	switch_mm_cid(prev, next);
  	fire_sched_out_preempt_notifiers(prev, next);
  	kmap_local_sched_out();
  	prepare_task(next);
@@ -5273,6 +5272,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
  	 *
  	 * kernel ->   user   switch + mmdrop() active
  	 *   user ->   user   switch
+	 *
+	 * switch_mm_cid() needs to be updated if the barriers provided
+	 * by context_switch() are modified.
  	 */
  	if (!next->mm) {                                // to kernel
  		enter_lazy_tlb(prev->active_mm, next);
@@ -5302,6 +5304,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
  		}
  	}
  
+	/* switch_mm_cid() requires the memory barriers above. */
+	switch_mm_cid(prev, next);
+
  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
  
  	prepare_lock_switch(rq, next, rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
  
  static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
  {
+	/*
+	 * Provide a memory barrier between rq->curr store and load of
+	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
+	 *
+	 * Should be adapted if context_switch() is modified.
+	 */
+	if (!next->mm) {                                // to kernel
+		/*
+		 * user -> kernel transition does not guarantee a barrier, but
+		 * we can use the fact that it performs an atomic operation in
+		 * mmgrab().
+		 */
+		if (prev->mm)                           // from user
+			smp_mb__after_mmgrab();
+		/*
+		 * kernel -> kernel transition does not change rq->curr->mm
+		 * state. It stays NULL.
+		 */
+	} else {                                        // to user
+		/*
+		 * kernel -> user transition does not provide a barrier
+		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
+		 * Provide it here.
+		 */
+		if (!prev->mm)                          // from kernel
+			smp_mb();
+		/*
+		 * user -> user transition guarantees a memory barrier through
+		 * switch_mm().
+		 */
+	}
  	if (prev->mm_cid_active) {
  		mm_cid_put_lazy(prev);
  		prev->mm_cid = -1;

Thanks,

Mathieu
  
Peter Zijlstra April 11, 2023, 8:46 a.m. UTC | #5
On Fri, Apr 07, 2023 at 07:50:42PM -0400, Mathieu Desnoyers wrote:

> Let's look at the relevant combinations of TSA/TSB, and TMA transitions.
> 
> Scenario A) (TSA)+(TMA) (from next task perspective)
> 
> CPU0                                                 CPU1
> 
> Context switch CS-1                                  Migrate-from
>   - store to rq->curr: (N)->(Y) (TSA)                - cmpxchg to *pcpu_id to LAZY (TMA)
>      *** missing barrier ?? ***                        (implied barrier after cmpxchg)
>   - prepare_task_switch()
>     - switch_mm_cid()
>       - mm_cid_get (next)
>         - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)
> 
> This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
> flag is observed by READ_ONCE(), or both are observed.
> 
> If task (Y) store is observed by rcu_dereference(), it means that there is still
> an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
> fulfills property (1). That observed task will itself eventually need a migrate-from
> to be migrated away from that cpu, which fulfills property (2).
> 
> If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(), it will
> move its state to UNSET, which clears the percpu cid perhaps uselessly (which is not
> an issue for correctness). Because task (Y) is not observed, CPU1 can move ahead to
> set the state to UNSET. Because moving state to UNSET is done with a cmpxchg expecting
> that the old state has the LAZY flag set, only one thread will successfully UNSET.
> 
> If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
> the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
> (Y) and do nothing more, which is fine.
> 
> What we are effectively preventing with this Dekker is a scenario where neither LAZY
> flag nor store (Y) are observed, which would fail property (1) because this would
> UNSET a cid which is actively used.

OK, this I'll buy. Let me go stare at this more.

> Scenario B) (TSB)+(TMA) (from prev task perspective)
> 
> CPU0                                                 CPU1
> 
> Context switch CS-1                                  Migrate-from
>   - store to rq->curr: (Y)->(N) (TSB)                - cmpxchg to *pcpu_id to LAZY (TMA)
>     *** missing barrier ?? ***                         (implied barrier after cmpxchg)
>   - prepare_task_switch()
>     - switch_mm_cid()
>       - cid_put_lazy() (prev)
>         - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)
> 

This I'm conflicted about: if we're running Y, then how the heck do we
get to setting LAZY in the first place?

For this scenario there must be at least an N->Y->N transition, such
that the first:

  if (src_task->mm_cid_active && src_task->mm == mm) {

can observe N and proceed to setting LAZY. But that then leads us to the
scenario above.

(And apparently I ended up doing an N->N transition, which really isn't
*that* interesting :-)
  
Peter Zijlstra April 11, 2023, 8:53 a.m. UTC | #6
On Wed, Apr 05, 2023 at 12:26:35PM -0400, Mathieu Desnoyers wrote:
> +void sched_mm_cid_migrate_from(struct task_struct *t)
> +{

> +	/*
> +	 * If the source cpu cid is set, and matches the last cid of the
> +	 * migrated task, clear the source cpu cid to keep cid allocation
> +	 * compact to cover the case where this task is the last task using
> +	 * this mm on the source cpu. If there happens to be other tasks left
> +	 * on the source cpu using this mm, the next task using this mm will
> +	 * reallocate its cid on context switch.
> +	 *
> +	 * We cannot keep ownership of concurrency ID without runqueue
> +	 * lock held when it is not used by a current task, because it
> +	 * would lead to allocation of more concurrency ids than there
> +	 * are possible cpus in the system. The last_mm_cid is used as
> +	 * a hint to conditionally unset the dst cpu cid, keeping
> +	 * allocated concurrency ids compact.
> +	 */
> +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
> +		return;
> +

FWIW, I'm thinking that if we write this using try_cmpxchg() it'll be a
little nicer:

	lazy_cid = mm_cid_set_lazy_put(src_cid);
	if (!try_cmpxchg(src_pcpu_cid, &src_cid, lazy_cid))
		return;

> +	/*
> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
> +	 * rq->curr->mm matches the scheduler barrier in context_switch()
> +	 * between store to rq->curr and load of prev and next task's
> +	 * per-mm/cpu cid.
> +	 *
> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
> +	 * rq->curr->mm_cid_active matches the barrier in
> +	 * sched_mm_cid_exit_signals(), sched_mm_cid_before_execve(), and
> +	 * sched_mm_cid_after_execve() between store to t->mm_cid_active and
> +	 * load of per-mm/cpu cid.
> +	 */
> +
> +	/*
> +	 * If we observe an active task using the mm on this rq after setting the lazy-put
> +	 * flag, this task will be responsible for transitioning from lazy-put
> +	 * flag set to MM_CID_UNSET.
> +	 */
> +	rcu_read_lock();
> +	src_task = rcu_dereference(src_rq->curr);
> +	if (src_task->mm_cid_active && src_task->mm == mm) {
> +		rcu_read_unlock();
> +		/*
> +		 * We observed an active task for this mm, clearing the destination
> +		 * cpu mm_cid is not relevant for compactness.
> +		 */
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * The src_cid is unused, so it can be unset.
> +	 */
> +	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) != mm_cid_set_lazy_put(src_cid))
> +		return;

	if (!try_cmpxchg(src_pcpu_cid, &lazy_cid, MM_CID_UNSET))
		return;

> +	__mm_cid_put(mm, src_cid);
> +}
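
Put together, the whole sequence would then read something like this (just the
shape of the suggestion, not a final patch; the re-check of src_rq->curr under
rcu_read_lock() is elided):

	int lazy_cid = mm_cid_set_lazy_put(src_cid);

	/* Mark the source cpu's cid lazy-put; bail out if we lost a race. */
	if (!try_cmpxchg(src_pcpu_cid, &src_cid, lazy_cid))
		return;

	/* ... re-check src_rq->curr under rcu_read_lock() as above ... */

	/* Transition lazy-put -> UNSET; only one contender can succeed. */
	if (!try_cmpxchg(src_pcpu_cid, &lazy_cid, MM_CID_UNSET))
		return;
	__mm_cid_put(mm, src_cid);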
  
Peter Zijlstra April 11, 2023, 9:37 a.m. UTC | #7
On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
>  static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
>  {
> +	/*
> +	 * Provide a memory barrier between rq->curr store and load of
> +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
> +	 *
> +	 * Should be adapted if context_switch() is modified.
> +	 */
> +	if (!next->mm) {                                // to kernel
> +		/*
> +		 * user -> kernel transition does not guarantee a barrier, but
> +		 * we can use the fact that it performs an atomic operation in
> +		 * mmgrab().
> +		 */
> +		if (prev->mm)                           // from user
> +			smp_mb__after_mmgrab();
> +		/*
> +		 * kernel -> kernel transition does not change rq->curr->mm
> +		 * state. It stays NULL.
> +		 */
> +	} else {                                        // to user
> +		/*
> +		 * kernel -> user transition does not provide a barrier
> +		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
> +		 * Provide it here.
> +		 */
> +		if (!prev->mm)                          // from kernel
> +			smp_mb();
> +		/*
> +		 * user -> user transition guarantees a memory barrier through
> +		 * switch_mm().
> +		 */

What about the user->user case where next->mm == prev->mm ? There
sys_membarrier() relies on finish_task_switch()'s mmdrop(), but we
can't.

> +	}
>  	if (prev->mm_cid_active) {
>  		mm_cid_put_lazy(prev);
>  		prev->mm_cid = -1;
>
  
Peter Zijlstra April 11, 2023, 10:25 a.m. UTC | #8
On Tue, Apr 11, 2023 at 11:37:05AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:
> 
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
> >  static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
> >  {
> > +	/*
> > +	 * Provide a memory barrier between rq->curr store and load of
> > +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
> > +	 *
> > +	 * Should be adapted if context_switch() is modified.
> > +	 */
> > +	if (!next->mm) {                                // to kernel
> > +		/*
> > +		 * user -> kernel transition does not guarantee a barrier, but
> > +		 * we can use the fact that it performs an atomic operation in
> > +		 * mmgrab().
> > +		 */
> > +		if (prev->mm)                           // from user
> > +			smp_mb__after_mmgrab();
> > +		/*
> > +		 * kernel -> kernel transition does not change rq->curr->mm
> > +		 * state. It stays NULL.
> > +		 */
> > +	} else {                                        // to user
> > +		/*
> > +		 * kernel -> user transition does not provide a barrier
> > +		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
> > +		 * Provide it here.
> > +		 */
> > +		if (!prev->mm)                          // from kernel
> > +			smp_mb();
> > +		/*
> > +		 * user -> user transition guarantees a memory barrier through
> > +		 * switch_mm().
> > +		 */
> 
> What about the user->user case where next->mm == prev->mm ? There
> sys_membarrier() relies on finish_task_switch()'s mmdrop(), but we
> can't.

Ah, I suppose that's either an N->N or Y->Y transition and we don't care.

Not the clearest comment though.
  
Peter Zijlstra April 11, 2023, 11:03 a.m. UTC | #9
On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:

> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 2a243616f222..f20fc0600fcc 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -37,6 +37,11 @@ static inline void mmgrab(struct mm_struct *mm)
>  	atomic_inc(&mm->mm_count);
>  }
> +static inline void smp_mb__after_mmgrab(void)
> +{
> +	smp_mb__after_atomic();
> +}
> +
>  extern void __mmdrop(struct mm_struct *mm);
>  static inline void mmdrop(struct mm_struct *mm)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9e0fa4193499..8d410c0dcb39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5117,7 +5117,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	sched_info_switch(rq, prev, next);
>  	perf_event_task_sched_out(prev, next);
>  	rseq_preempt(prev);
> -	switch_mm_cid(prev, next);
>  	fire_sched_out_preempt_notifiers(prev, next);
>  	kmap_local_sched_out();
>  	prepare_task(next);
> @@ -5273,6 +5272,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 *
>  	 * kernel ->   user   switch + mmdrop() active
>  	 *   user ->   user   switch
> +	 *
> +	 * switch_mm_cid() needs to be updated if the barriers provided
> +	 * by context_switch() are modified.
>  	 */
>  	if (!next->mm) {                                // to kernel
>  		enter_lazy_tlb(prev->active_mm, next);
> @@ -5302,6 +5304,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		}
>  	}
> +	/* switch_mm_cid() requires the memory barriers above. */
> +	switch_mm_cid(prev, next);
> +
>  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>  	prepare_lock_switch(rq, next, rf);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
>  static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
>  {
> +	/*
> +	 * Provide a memory barrier between rq->curr store and load of
> +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
> +	 *
> +	 * Should be adapted if context_switch() is modified.
> +	 */
> +	if (!next->mm) {                                // to kernel
> +		/*
> +		 * user -> kernel transition does not guarantee a barrier, but
> +		 * we can use the fact that it performs an atomic operation in
> +		 * mmgrab().
> +		 */
> +		if (prev->mm)                           // from user
> +			smp_mb__after_mmgrab();
> +		/*
> +		 * kernel -> kernel transition does not change rq->curr->mm
> +		 * state. It stays NULL.
> +		 */
> +	} else {                                        // to user
> +		/*
> +		 * kernel -> user transition does not provide a barrier
> +		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
> +		 * Provide it here.
> +		 */
> +		if (!prev->mm)                          // from kernel
> +			smp_mb();
> +		/*
> +		 * user -> user transition guarantees a memory barrier through
> +		 * switch_mm().
> +		 */
> +	}
>  	if (prev->mm_cid_active) {
>  		mm_cid_put_lazy(prev);
>  		prev->mm_cid = -1;
> 

This is going to be a pain wrt.:

  https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com

which is already in -next. Also, I reckon Nick isn't going to be too happy
-- although I reckon smp_mb() is better than an atomic op on Power. But
still.

Urgh...

For Nick: the TL;DR is that we need an smp_mb() after setting rq->curr and
before calling switch_mm_cid() *IFF* rq->curr->mm changes. Normally this
is provided by switch_mm() itself per actually changing the address
space, except for the whole active_mm/lazy swizzle nonsense, which gives
a few holes still.

The very much longer explanation is upthread here:

  https://lkml.kernel.org/r/fdaa7242-4ddd-fbe2-bc0e-6c62054dbde8@efficios.com
  
Peter Zijlstra April 11, 2023, 11:53 a.m. UTC | #10
On Tue, Apr 11, 2023 at 01:03:45PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:


> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
> >  static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
> >  {
> > +	/*
> > +	 * Provide a memory barrier between rq->curr store and load of
> > +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
> > +	 *
> > +	 * Should be adapted if context_switch() is modified.
> > +	 */
> > +	if (!next->mm) {                                // to kernel
> > +		/*
> > +		 * user -> kernel transition does not guarantee a barrier, but
> > +		 * we can use the fact that it performs an atomic operation in
> > +		 * mmgrab().
> > +		 */
> > +		if (prev->mm)                           // from user
> > +			smp_mb__after_mmgrab();
> > +		/*
> > +		 * kernel -> kernel transition does not change rq->curr->mm
> > +		 * state. It stays NULL.
> > +		 */
> > +	} else {                                        // to user
> > +		/*
> > +		 * kernel -> user transition does not provide a barrier
> > +		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
> > +		 * Provide it here.
> > +		 */
> > +		if (!prev->mm)                          // from kernel
> > +			smp_mb();
> > +		/*
> > +		 * user -> user transition guarantees a memory barrier through
> > +		 * switch_mm().
> > +		 */
> > +	}

The possibly nicer way to write all this is:

	if (!prev->mm != !next->mm)
		smp_mb();

And then clean up the mm{grab,drop}_lazy_tlb() helpers. But we can
always do that later if we indeed end up with all this ...
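
Something like this, roughly (untested sketch; the cid put/get of prev and
next stays as it is in your patch):

static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
{
	/*
	 * Only an mm <-> !mm transition of rq->curr needs the extra barrier:
	 * user -> user relies on switch_mm(), and kernel -> kernel does not
	 * change rq->curr->mm at all.
	 */
	if (!prev->mm != !next->mm)
		smp_mb();

	/* ... cid put/get of prev and next unchanged ... */
}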
  
Mathieu Desnoyers April 11, 2023, 12:38 p.m. UTC | #11
On 2023-04-11 04:46, Peter Zijlstra wrote:
> On Fri, Apr 07, 2023 at 07:50:42PM -0400, Mathieu Desnoyers wrote:
> 
>> Let's look at the relevant combinations of TSA/TSB, and TMA transitions.
>>
>> Scenario A) (TSA)+(TMA) (from next task perspective)
>>
>> CPU0                                                 CPU1
>>
>> Context switch CS-1                                  Migrate-from
>>    - store to rq->curr: (N)->(Y) (TSA)                - cmpxchg to *pcpu_id to LAZY (TMA)
>>       *** missing barrier ?? ***                        (implied barrier after cmpxchg)
>>    - prepare_task_switch()
>>      - switch_mm_cid()
>>        - mm_cid_get (next)
>>          - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)
>>
>> This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
>> flag is observed by READ_ONCE(), or both are observed.
>>
>> If task (Y) store is observed by rcu_dereference(), it means that there is still
>> an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
>> fulfills property (1). That observed task will itself eventually need a migrate-from
>> to be migrated away from that cpu, which fulfills property (2).
>>
>> If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(), it will
>> move its state to UNSET, which clears the percpu cid perhaps uselessly (which is not
>> an issue for correctness). Because task (Y) is not observed, CPU1 can move ahead to
>> set the state to UNSET. Because moving state to UNSET is done with a cmpxchg expecting
>> that the old state has the LAZY flag set, only one thread will successfully UNSET.
>>
>> If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
>> the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
>> (Y) and do nothing more, which is fine.
>>
>> What we are effectively preventing with this Dekker is a scenario where neither LAZY
>> flag nor store (Y) are observed, which would fail property (1) because this would
>> UNSET a cid which is actively used.
> 
> OK, this I'll buy. Let me go stare at this more.
> 
>> Scenario B) (TSB)+(TMA) (from prev task perspective)
>>
>> CPU0                                                 CPU1
>>
>> Context switch CS-1                                  Migrate-from
>>    - store to rq->curr: (Y)->(N) (TSB)                - cmpxchg to *pcpu_id to LAZY (TMA)
>>      *** missing barrier ?? ***                         (implied barrier after cmpxchg)
>>    - prepare_task_switch()
>>      - switch_mm_cid()
>>        - cid_put_lazy() (prev)
>>          - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)
>>
> 
> This I'm conflicted about, if we're running Y, then how the heck do we
> get to setting LAZY in the first place?
> 
> For this scenario there must be at least an N->Y->N transition, such
> that the first:
> 
>    if (src_task->mm_cid_active && src_task->mm == mm) {
> 
> can observe N and proceed to setting LAZY. But that then leads us to the
> scenario above.

Remember that migrate-from does not hold any rq lock. Therefore, it's 
very much possible that the first check:

   if (src_task->mm_cid_active && src_task->mm == mm) {

observes (N), then gets delayed for a while, and then only sets the
LAZY flag when (Y) has been scheduled, leading us to Scenario B).

Thanks,

Mathieu
  
Mathieu Desnoyers April 11, 2023, 12:41 p.m. UTC | #12
On 2023-04-11 04:53, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 12:26:35PM -0400, Mathieu Desnoyers wrote:
>> +void sched_mm_cid_migrate_from(struct task_struct *t)
>> +{
> 
>> +	/*
>> +	 * If the source cpu cid is set, and matches the last cid of the
>> +	 * migrated task, clear the source cpu cid to keep cid allocation
>> +	 * compact to cover the case where this task is the last task using
>> +	 * this mm on the source cpu. If there happens to be other tasks left
>> +	 * on the source cpu using this mm, the next task using this mm will
>> +	 * reallocate its cid on context switch.
>> +	 *
>> +	 * We cannot keep ownership of concurrency ID without runqueue
>> +	 * lock held when it is not used by a current task, because it
>> +	 * would lead to allocation of more concurrency ids than there
>> +	 * are possible cpus in the system. The last_mm_cid is used as
>> +	 * a hint to conditionally unset the dst cpu cid, keeping
>> +	 * allocated concurrency ids compact.
>> +	 */
>> +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
>> +		return;
>> +
> 
> FWIW, I'm thinking that if we write this using try_cmpxchg() it'll be a
> little nicer:
> 
> 	lazy_cid = mm_cid_set_lazy_put(src_cid);
> 	if (!try_cmpxchg(src_pcpu_cid, &src_cid, lazy_cid))
> 		return;
> 

Yes, done.

>> +	/*
>> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
>> +	 * rq->curr->mm matches the scheduler barrier in context_switch()
>> +	 * between store to rq->curr and load of prev and next task's
>> +	 * per-mm/cpu cid.
>> +	 *
>> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
>> +	 * rq->curr->mm_cid_active matches the barrier in
>> +	 * sched_mm_cid_exit_signals(), sched_mm_cid_before_execve(), and
>> +	 * sched_mm_cid_after_execve() between store to t->mm_cid_active and
>> +	 * load of per-mm/cpu cid.
>> +	 */
>> +
>> +	/*
>> +	 * If we observe an active task using the mm on this rq after setting the lazy-put
>> +	 * flag, this task will be responsible for transitioning from lazy-put
>> +	 * flag set to MM_CID_UNSET.
>> +	 */
>> +	rcu_read_lock();
>> +	src_task = rcu_dereference(src_rq->curr);
>> +	if (src_task->mm_cid_active && src_task->mm == mm) {
>> +		rcu_read_unlock();
>> +		/*
>> +		 * We observed an active task for this mm, clearing the destination
>> +		 * cpu mm_cid is not relevant for compactness.
>> +		 */
>> +		t->last_mm_cid = -1;
>> +		return;
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	/*
>> +	 * The src_cid is unused, so it can be unset.
>> +	 */
>> +	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) != mm_cid_set_lazy_put(src_cid))
>> +		return;
> 
> 	if (!try_cmpxchg(src_pcpu_cid, &lazy_cid, MM_CID_UNSET))
> 		return;

done,

Thanks!

Mathieu

> 
>> +	__mm_cid_put(mm, src_cid);
>> +}
  
Mathieu Desnoyers April 11, 2023, 12:57 p.m. UTC | #13
On 2023-04-11 05:37, Peter Zijlstra wrote:
> On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:
> 
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
>>   static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
>>   {
>> +	/*
>> +	 * Provide a memory barrier between rq->curr store and load of
>> +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
>> +	 *
>> +	 * Should be adapted if context_switch() is modified.
>> +	 */
>> +	if (!next->mm) {                                // to kernel
>> +		/*
>> +		 * user -> kernel transition does not guarantee a barrier, but
>> +		 * we can use the fact that it performs an atomic operation in
>> +		 * mmgrab().
>> +		 */
>> +		if (prev->mm)                           // from user
>> +			smp_mb__after_mmgrab();
>> +		/*
>> +		 * kernel -> kernel transition does not change rq->curr->mm
>> +		 * state. It stays NULL.
>> +		 */
>> +	} else {                                        // to user
>> +		/*
>> +		 * kernel -> user transition does not provide a barrier
>> +		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
>> +		 * Provide it here.
>> +		 */
>> +		if (!prev->mm)                          // from kernel
>> +			smp_mb();
>> +		/*
>> +		 * user -> user transition guarantees a memory barrier through
>> +		 * switch_mm().
>> +		 */
> 
> What about the user->user case where next->mm == prev->mm ? There
> sys_membarrier() relies on finish_task_switch()'s mmdrop(), but we
> can't.

AFAIU the finish_task_switch()'s mmdrop() is for the case where:

                  * [...] or in
                  * case 'prev->active_mm == next->mm' through
                  * finish_task_switch()'s mmdrop().

which applies for the case where we schedule from a kernel thread (which
kept the prior user task's mm as active mm) to a user task with the same
mm.

But this is really a transition from kernel -> user, not user -> user ?

Why should either membarrier or mm_cid care about a transition from
prev->mm to next->mm where mm is unchanged ? It does not register
as a transition from the comparison perspective.

I'll update my comment in switch_mm_cid to:

      /*
       * user -> user transition guarantees a memory barrier through
       * switch_mm() when current->mm changes. If current->mm is
       * unchanged, no barrier is needed.
       */

Thanks,

Mathieu


> 
>> +	}
>>   	if (prev->mm_cid_active) {
>>   		mm_cid_put_lazy(prev);
>>   		prev->mm_cid = -1;
>>
  
Mathieu Desnoyers April 11, 2023, 1:10 p.m. UTC | #14
On 2023-04-11 06:25, Peter Zijlstra wrote:
> On Tue, Apr 11, 2023 at 11:37:05AM +0200, Peter Zijlstra wrote:
>> On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:
>>
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
>>>   static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
>>>   {
>>> +	/*
>>> +	 * Provide a memory barrier between rq->curr store and load of
>>> +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
>>> +	 *
>>> +	 * Should be adapted if context_switch() is modified.
>>> +	 */
>>> +	if (!next->mm) {                                // to kernel
>>> +		/*
>>> +		 * user -> kernel transition does not guarantee a barrier, but
>>> +		 * we can use the fact that it performs an atomic operation in
>>> +		 * mmgrab().
>>> +		 */
>>> +		if (prev->mm)                           // from user
>>> +			smp_mb__after_mmgrab();
>>> +		/*
>>> +		 * kernel -> kernel transition does not change rq->curr->mm
>>> +		 * state. It stays NULL.
>>> +		 */
>>> +	} else {                                        // to user
>>> +		/*
>>> +		 * kernel -> user transition does not provide a barrier
>>> +		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
>>> +		 * Provide it here.
>>> +		 */
>>> +		if (!prev->mm)                          // from kernel
>>> +			smp_mb();
>>> +		/*
>>> +		 * user -> user transition guarantees a memory barrier through
>>> +		 * switch_mm().
>>> +		 */
>>
>> What about the user->user case where next->mm == prev->mm? There,
>> sys_membarrier() relies on finish_task_switch()'s mmdrop(), but we
>> can't.
> 
> Ah, I suppose that's either an N->N or Y->Y transition and we don't care.
> 
> Not the clearest comment though.

For the sake of completeness, here is the updated comment:



                 /*
                  * user -> user transition guarantees a memory barrier through
                  * switch_mm() when current->mm changes. If current->mm is
                  * unchanged, no barrier is needed.
                  */

Thanks,

Mathieu
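
As a reading aid for the patch below, the ordering discussed in this
thread can be sketched as the following pairing (a rough paraphrase of
the barrier comments in sched_mm_cid_migrate_from() and switch_mm_cid();
illustrative, not an authoritative litmus test):

  CPU A: context_switch()                 CPU B: sched_mm_cid_migrate_from()
  ------------------------                -----------------------------------
  store rq->curr = next                   cmpxchg(pcpu_cid, cid, cid | LAZY_PUT)
  full barrier (see switch_mm_cid())      full barrier (implied by cmpxchg)
  load pcpu_cid in mm_cid_get()           load rq->curr->mm / ->mm_cid_active

Either CPU A observes the lazy-put flag and takes over the reclaim on its
next context switch, or CPU B observes the new rq->curr still using the mm
and leaves the cid alone; the paired full barriers guarantee that at least
one side sees the other's store.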
  

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0722859c3647..0304f098bd50 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -609,6 +609,14 @@  struct mm_struct {
 		 * were being concurrently updated by the updaters.
 		 */
 		raw_spinlock_t cid_lock;
+		/**
+		 * @pcpu_cid: Per-cpu current cid.
+		 *
+		 * Keep track of the currently allocated mm_cid for each cpu.
+		 * The per-cpu mm_cid values are serialized by their respective
+		 * runqueue locks.
+		 */
+		int __percpu *pcpu_cid;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
@@ -872,6 +880,37 @@  static inline void vma_iter_init(struct vma_iterator *vmi,
 }
 
 #ifdef CONFIG_SCHED_MM_CID
+
+enum mm_cid_state {
+	MM_CID_UNSET = -1U,		/* Unset state has lazy_put flag set. */
+	MM_CID_LAZY_PUT = (1U << 31),
+};
+
+static inline bool mm_cid_is_unset(int cid)
+{
+	return cid == MM_CID_UNSET;
+}
+
+static inline bool mm_cid_is_lazy_put(int cid)
+{
+	return !mm_cid_is_unset(cid) && (cid & MM_CID_LAZY_PUT);
+}
+
+static inline bool mm_cid_is_valid(int cid)
+{
+	return !(cid & MM_CID_LAZY_PUT);
+}
+
+static inline int mm_cid_set_lazy_put(int cid)
+{
+	return cid | MM_CID_LAZY_PUT;
+}
+
+static inline int mm_cid_clear_lazy_put(int cid)
+{
+	return cid & ~MM_CID_LAZY_PUT;
+}
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
@@ -885,16 +924,37 @@  static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 
 static inline void mm_init_cid(struct mm_struct *mm)
 {
+	int i;
+
 	raw_spin_lock_init(&mm->cid_lock);
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(mm->pcpu_cid, i) = -1;
 	cpumask_clear(mm_cidmask(mm));
 }
 
+static inline int mm_alloc_cid(struct mm_struct *mm)
+{
+	mm->pcpu_cid = alloc_percpu(int);
+	if (!mm->pcpu_cid)
+		return -ENOMEM;
+	mm_init_cid(mm);
+	return 0;
+}
+
+static inline void mm_destroy_cid(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_cid);
+	mm->pcpu_cid = NULL;
+}
+
 static inline unsigned int mm_cid_size(void)
 {
 	return cpumask_size();
 }
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm) { }
+static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_cid(struct mm_struct *mm) { }
 static inline unsigned int mm_cid_size(void)
 {
 	return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..48d48b2c73a5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1313,6 +1313,7 @@  struct task_struct {
 
 #ifdef CONFIG_SCHED_MM_CID
 	int				mm_cid;		/* Current cid in mm */
+	int				last_mm_cid;	/* Most recent cid in mm */
 	int				mm_cid_active;	/* Whether cid bitmap is active */
 #endif
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c983c4fe3090..0a2b905c2ed5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -790,6 +790,7 @@  void __mmdrop(struct mm_struct *mm)
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
+	mm_destroy_cid(mm);
 
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		percpu_counter_destroy(&mm->rss_stat[i]);
@@ -1054,6 +1055,7 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_SCHED_MM_CID
 	tsk->mm_cid = -1;
+	tsk->last_mm_cid = -1;
 	tsk->mm_cid_active = 0;
 #endif
 	return tsk;
@@ -1159,18 +1161,22 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
+	if (mm_alloc_cid(mm))
+		goto fail_cid;
+
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
 			goto fail_pcpu;
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
-	mm_init_cid(mm);
 	return mm;
 
 fail_pcpu:
 	while (i > 0)
 		percpu_counter_destroy(&mm->rss_stat[--i]);
+	mm_destroy_cid(mm);
+fail_cid:
 	destroy_context(mm);
 fail_nocontext:
 	mm_free_pgd(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..707a0a5e036e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,8 +2084,10 @@  static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_on_rq_migrating(p))
+	if (task_on_rq_migrating(p)) {
 		flags |= ENQUEUE_MIGRATED;
+		sched_mm_cid_migrate_to(rq, p);
+	}
 
 	enqueue_task(rq, p, flags);
 
@@ -3195,6 +3197,7 @@  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		rseq_migrate(p);
+		sched_mm_cid_migrate_from(p);
 		perf_event_task_migrate(p);
 	}
 
@@ -11383,45 +11386,224 @@  void call_trace_sched_update_nr_running(struct rq *rq, int count)
 }
 
 #ifdef CONFIG_SCHED_MM_CID
+/*
+ * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
+ * that the rq lock is held.
+ */
+void sched_mm_cid_migrate_from(struct task_struct *t)
+{
+	int src_cid, *src_pcpu_cid, last_mm_cid;
+	struct mm_struct *mm = t->mm;
+	struct rq *src_rq;
+	struct task_struct *src_task;
+
+	if (!mm)
+		return;
+
+	last_mm_cid = t->last_mm_cid;
+	/*
+	 * If the migrated task has no last cid, or if the current
+	 * task on src rq uses the cid, it means the destination cpu
+	 * does not have to reallocate its cid to keep the cid allocation
+	 * compact.
+	 */
+	if (last_mm_cid == -1)
+		return;
+
+	src_rq = task_rq(t);
+	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
+	src_cid = READ_ONCE(*src_pcpu_cid);
+
+	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
+		return;
+
+	/*
+	 * If we observe an active task using the mm on this rq, it means we
+	 * are not the last task to be migrated from this cpu for this mm, so
+	 * there is no need to clear the src_cid.
+	 */
+	rcu_read_lock();
+	src_task = rcu_dereference(src_rq->curr);
+	if (src_task->mm_cid_active && src_task->mm == mm) {
+		rcu_read_unlock();
+		t->last_mm_cid = -1;
+		return;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * If the source cpu cid is set, and matches the last cid of the
+	 * migrated task, clear the source cpu cid to keep cid allocation
+	 * compact to cover the case where this task is the last task using
+	 * this mm on the source cpu. If there happens to be other tasks left
+	 * on the source cpu using this mm, the next task using this mm will
+	 * reallocate its cid on context switch.
+	 *
+	 * We cannot keep ownership of concurrency ID without runqueue
+	 * lock held when it is not used by a current task, because it
+	 * would lead to allocation of more concurrency ids than there
+	 * are possible cpus in the system. The last_mm_cid is used as
+	 * a hint to conditionally unset the dst cpu cid, keeping
+	 * allocated concurrency ids compact.
+	 */
+	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
+		return;
+
+	/*
+	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+	 * rq->curr->mm matches the scheduler barrier in context_switch()
+	 * between store to rq->curr and load of prev and next task's
+	 * per-mm/cpu cid.
+	 *
+	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+	 * rq->curr->mm_cid_active matches the barrier in
+	 * sched_mm_cid_exit_signals(), sched_mm_cid_before_execve(), and
+	 * sched_mm_cid_after_execve() between store to t->mm_cid_active and
+	 * load of per-mm/cpu cid.
+	 */
+
+	/*
+	 * If we observe an active task using the mm on this rq after setting the lazy-put
+	 * flag, this task will be responsible for transitioning from lazy-put
+	 * flag set to MM_CID_UNSET.
+	 */
+	rcu_read_lock();
+	src_task = rcu_dereference(src_rq->curr);
+	if (src_task->mm_cid_active && src_task->mm == mm) {
+		rcu_read_unlock();
+		/*
+		 * We observed an active task for this mm, clearing the destination
+		 * cpu mm_cid is not relevant for compactness.
+		 */
+		t->last_mm_cid = -1;
+		return;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * The src_cid is unused, so it can be unset.
+	 */
+	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) != mm_cid_set_lazy_put(src_cid))
+		return;
+	__mm_cid_put(mm, src_cid);
+}
+
+/*
+ * Migration to dst cpu. Called with dst_rq lock held.
+ */
+void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	int src_cid, dst_cid, *dst_pcpu_cid;
+	struct task_struct *dest_task;
+
+	lockdep_assert_rq_held(dst_rq);
+
+	src_cid = t->last_mm_cid;
+	if (!mm || src_cid == -1)
+		return;
+
+	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
+
+	/*
+	 * If destination cpu cid is greater than the source cpu cid, unset it
+	 * so it can be reallocated.
+	 */
+	dst_cid = READ_ONCE(*dst_pcpu_cid);
+	if (!mm_cid_is_valid(dst_cid) || dst_cid <= src_cid)
+		return;
+	/*
+	 * Put dst_cid if it is not currently in use, else it will be lazy put
+	 * on the next context switch.
+	 */
+	dest_task = dst_rq->curr;
+	if (dest_task->mm_cid_active && dest_task->mm == mm) {
+		WARN_ON_ONCE(dest_task->mm_cid != dst_cid);
+		/*
+		 * Attempt to set lazy-put flag. Can fail due to concurrent
+		 * migrate-from, which sets lazy-put flag or MM_CID_UNSET.
+		 */
+		(void) cmpxchg(dst_pcpu_cid, dst_cid, mm_cid_set_lazy_put(dst_cid));
+		return;
+	}
+	/*
+	 * Attempt to set MM_CID_UNSET. Can fail due to concurrent
+	 * migrate-from, which sets lazy-put flag or MM_CID_UNSET.
+	 */
+	if (cmpxchg(dst_pcpu_cid, dst_cid, MM_CID_UNSET) != dst_cid)
+		return;
+	__mm_cid_put(mm, dst_cid);
+}
+
 void sched_mm_cid_exit_signals(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
-	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
-	t->mm_cid_active = 0;
-	local_irq_restore(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	preempt_enable_no_resched();	/* holding spinlock */
+	WRITE_ONCE(t->mm_cid_active, 0);
+	/*
+	 * Store t->mm_cid_active before loading per-mm/cpu cid.
+	 * Matches barrier in sched_mm_cid_migrate_from().
+	 */
+	smp_mb();
+	mm_cid_put(t);
+	t->last_mm_cid = t->mm_cid = -1;
+	rq_unlock_irqrestore(rq, &rf);
 }
 
 void sched_mm_cid_before_execve(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
-	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
-	t->mm_cid_active = 0;
-	local_irq_restore(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	preempt_enable_no_resched();	/* holding spinlock */
+	WRITE_ONCE(t->mm_cid_active, 0);
+	/*
+	 * Store t->mm_cid_active before loading per-mm/cpu cid.
+	 * Matches barrier in sched_mm_cid_migrate_from().
+	 */
+	smp_mb();
+	mm_cid_put(t);
+	t->last_mm_cid = t->mm_cid = -1;
+	rq_unlock_irqrestore(rq, &rf);
 }
 
 void sched_mm_cid_after_execve(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
-	t->mm_cid = mm_cid_get(mm);
-	t->mm_cid_active = 1;
-	local_irq_restore(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	preempt_enable_no_resched();	/* holding spinlock */
+	WRITE_ONCE(t->mm_cid_active, 1);
+	/*
+	 * Store t->mm_cid_active before loading per-mm/cpu cid.
+	 * Matches barrier in sched_mm_cid_migrate_from().
+	 */
+	smp_mb();
+	t->last_mm_cid = t->mm_cid = mm_cid_get(mm);
+	rq_unlock_irqrestore(rq, &rf);
 	rseq_set_notify_resume(t);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..bc0e1cd0d6ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3249,7 +3249,66 @@  static inline void update_current_exec_runtime(struct task_struct *curr,
 }
 
 #ifdef CONFIG_SCHED_MM_CID
-static inline int __mm_cid_get(struct mm_struct *mm)
+extern void sched_mm_cid_migrate_from(struct task_struct *t);
+extern void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t);
+
+static inline void __mm_cid_put(struct mm_struct *mm, int cid)
+{
+	if (cid < 0)
+		return;
+	cpumask_clear_cpu(cid, mm_cidmask(mm));
+}
+
+/*
+ * The per-mm/cpu cid can have the MM_CID_LAZY_PUT flag set or transition to the
+ * MM_CID_UNSET state without holding the rq lock, but the rq lock needs to be
+ * held to transition to other states.
+ */
+static inline void mm_cid_put_lazy(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	int *pcpu_cid, cid;
+
+	lockdep_assert_rq_held(this_rq());
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = READ_ONCE(*pcpu_cid);
+	if (!mm_cid_is_lazy_put(cid))
+		return;
+	if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) != cid)
+		return;
+	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+static inline void mm_cid_put(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	int *pcpu_cid, cid, res;
+
+	lockdep_assert_rq_held(this_rq());
+	WARN_ON_ONCE(t->mm_cid < 0);
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = READ_ONCE(*pcpu_cid);
+	for (;;) {
+		if (mm_cid_is_unset(cid))
+			return;
+		WARN_ON_ONCE(mm_cid_clear_lazy_put(cid) != t->mm_cid);
+		/*
+		 * Attempt transition from valid or lazy-put to unset.
+		 */
+		res = cmpxchg(pcpu_cid, cid, MM_CID_UNSET);
+		if (res == cid)
+			break;
+		cid = res;
+	}
+	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+/*
+ * Lookup/set of mm_cidmask are serialized by cid_lock. Clear of mm_cidmask
+ * uses atomic operations, therefore requiring the set to be done with atomic
+ * operations as well.
+ */
+static inline int __mm_cid_get_locked(struct mm_struct *mm)
 {
 	struct cpumask *cpumask;
 	int cid;
@@ -3258,52 +3317,55 @@  static inline int __mm_cid_get(struct mm_struct *mm)
 	cid = cpumask_first_zero(cpumask);
 	if (cid >= nr_cpu_ids)
 		return -1;
-	__cpumask_set_cpu(cid, cpumask);
+	cpumask_set_cpu(cid, cpumask);
 	return cid;
 }
 
-static inline void mm_cid_put(struct mm_struct *mm, int cid)
+static inline int __mm_cid_get(struct mm_struct *mm)
 {
-	lockdep_assert_irqs_disabled();
-	if (cid < 0)
-		return;
+	int ret;
+
+	lockdep_assert_rq_held(this_rq());
 	raw_spin_lock(&mm->cid_lock);
-	__cpumask_clear_cpu(cid, mm_cidmask(mm));
+	ret = __mm_cid_get_locked(mm);
 	raw_spin_unlock(&mm->cid_lock);
+	return ret;
 }
 
 static inline int mm_cid_get(struct mm_struct *mm)
 {
-	int ret;
+	int *pcpu_cid, cid;
 
-	lockdep_assert_irqs_disabled();
+	lockdep_assert_rq_held(this_rq());
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = READ_ONCE(*pcpu_cid);
+	if (mm_cid_is_valid(cid))
+		return cid;
+	if (mm_cid_is_lazy_put(cid)) {
+		if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) == cid)
+			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+	}
 	raw_spin_lock(&mm->cid_lock);
-	ret = __mm_cid_get(mm);
+	cid = __mm_cid_get_locked(mm);
 	raw_spin_unlock(&mm->cid_lock);
-	return ret;
+	WRITE_ONCE(*pcpu_cid, cid);
+	return cid;
 }
 
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
 {
 	if (prev->mm_cid_active) {
-		if (next->mm_cid_active && next->mm == prev->mm) {
-			/*
-			 * Context switch between threads in same mm, hand over
-			 * the mm_cid from prev to next.
-			 */
-			next->mm_cid = prev->mm_cid;
-			prev->mm_cid = -1;
-			return;
-		}
-		mm_cid_put(prev->mm, prev->mm_cid);
+		mm_cid_put_lazy(prev);
 		prev->mm_cid = -1;
 	}
 	if (next->mm_cid_active)
-		next->mm_cid = mm_cid_get(next->mm);
+		next->last_mm_cid = next->mm_cid = mm_cid_get(next->mm);
 }
 
 #else
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
+static inline void sched_mm_cid_migrate_from(struct task_struct *t) { }
+static inline void sched_mm_cid_migrate_to(struct rq *src_rq, struct task_struct *t) { }
 #endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */
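
Finally, a self-contained user-space model of the lazy-put protocol,
written with C11 atomics. Every name is invented for the example, the mm
is reduced to a single "cpu" slot and a 32-bit cid bitmap, and the memory
ordering subtleties discussed above are elided, so treat this as a sketch
of the cmpxchg state machine rather than kernel code:

/*
 * User-space model of the per-mm/cpu cid lazy-put protocol (sketch only).
 * Build with: cc -std=c11 mm_cid_model.c
 */
#include <stdatomic.h>
#include <stdio.h>

#define CID_UNSET    (-1)
#define CID_LAZY_PUT (1 << 30)	/* flag bit, model only */

static atomic_int pcpu_cid = CID_UNSET;	/* models mm->pcpu_cid[cpu] */
static atomic_uint cid_bitmap;		/* models mm_cidmask(mm) */

/* Models __mm_cid_get_locked(); the kernel serializes this with cid_lock. */
static int cid_alloc(void)
{
	unsigned int mask, newmask;
	int cid;

	do {
		mask = atomic_load(&cid_bitmap);
		for (cid = 0; cid < 32 && (mask & (1u << cid)); cid++)
			;
		if (cid == 32)
			return -1;
		newmask = mask | (1u << cid);
	} while (!atomic_compare_exchange_weak(&cid_bitmap, &mask, newmask));
	return cid;
}

/* Models __mm_cid_put(): release the cid bit. */
static void cid_free(int cid)
{
	atomic_fetch_and(&cid_bitmap, ~(1u << cid));
}

/* Owner path, models mm_cid_get(): reuse, recover a lazy-put cid, or allocate. */
static int cid_get(void)
{
	int cid = atomic_load(&pcpu_cid);

	if (cid >= 0 && !(cid & CID_LAZY_PUT))
		return cid;			/* still valid, fast path */
	if (cid != CID_UNSET && (cid & CID_LAZY_PUT)) {
		int expected = cid;

		/* Take over the reclaim started by a remote migrate-from. */
		if (atomic_compare_exchange_strong(&pcpu_cid, &expected, CID_UNSET))
			cid_free(cid & ~CID_LAZY_PUT);
	}
	cid = cid_alloc();
	atomic_store(&pcpu_cid, cid);
	return cid;
}

/* Remote path, models sched_mm_cid_migrate_from(): flag instead of freeing. */
static void cid_lazy_put(int expected_cid)
{
	int expected = expected_cid;

	atomic_compare_exchange_strong(&pcpu_cid, &expected,
				       expected_cid | CID_LAZY_PUT);
}

int main(void)
{
	int cid = cid_get();

	printf("allocated cid %d\n", cid);
	cid_lazy_put(cid);	/* a simulated migration marks the slot lazily */
	printf("got cid %d again after lazy put\n", cid_get());
	return 0;
}

Running it shows cid 0 being allocated, lazily put by the simulated
migration, and then recovered and reallocated by the next cid_get(),
which is the compactness property the patch aims to preserve.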