[RFC,v4] sched: Fix performance regression introduced by mm_cid

Message ID 20230410150150.2179062-1-mathieu.desnoyers@efficios.com
State New
Series [RFC,v4] sched: Fix performance regression introduced by mm_cid

Commit Message

Mathieu Desnoyers April 10, 2023, 3:01 p.m. UTC
  Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.

Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing it immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.

Thread migration is handled by introducing invocations of
sched_mm_cid_migrate_from() in set_task_cpu() and of
sched_mm_cid_migrate_to() (with the destination runqueue lock held) in
activate_task() for migrating tasks. set_task_cpu() is invoked both with
and without the source rq lock held: the wakeup path does not hold the
source rq lock.

sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
index corresponding to the source runqueue if it matches the last mm_cid
observed by the migrated task. This last mm_cid value is returned as a
hint to conditionally clear the mm's per-cpu mm_cid on the destination
cpu.

Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
the mm's current mm_cid on the destination cpu, clear the destination
cpu's mm_cid. If the migrated task's mm is in use on the destination
cpu, the reclaim of that mm_cid is done lazily on the next context
switch on the destination cpu; otherwise it is performed immediately.

The source cpu's mm_cid is _not_ simply moved to the destination cpu on
migration, because passing ownership of the mm_cid value to the
destination cpu while an actively running task also has its own
mm_cid value (in case of lazy reclaim on next context switch) would
over-allocate mm_cid values beyond the number of possible cpus.

Because we want the mm_cid values to converge towards the smaller
values as migrations happen, the prior optimization for context
switches between threads belonging to the same mm is removed: it could
delay the lazy release of the destination runqueue mm_cid after it has
been replaced by a migration. Removing this prior optimization is not
an issue performance-wise because the introduced per-mm/per-cpu mm_cid
tracking also covers this more specific case.
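
To make the flow above concrete, here is a small stand-alone model of the
migrate-from/migrate-to decisions. All names are illustrative and the model
is single-threaded; the actual implementation operates on mm->pcpu_cid and
the mm cidmask under the runqueue locks and with cmpxchg, and also handles
the lazy-reclaim cases, so treat this only as a sketch of the policy:

#include <stdio.h>

#define NR_CPUS   4
#define CID_UNSET (-1)

struct mm_model {
        int pcpu_cid[NR_CPUS];  /* models mm->pcpu_cid */
        int cidmask[NR_CPUS];   /* models the mm cidmask: 1 = cid allocated */
};

/* Allocate the lowest free cid for @cpu; bounded by NR_CPUS. */
static int cid_get(struct mm_model *mm, int cpu)
{
        for (int cid = 0; cid < NR_CPUS; cid++) {
                if (!mm->cidmask[cid]) {
                        mm->cidmask[cid] = 1;
                        mm->pcpu_cid[cpu] = cid;
                        return cid;
                }
        }
        return CID_UNSET;
}

/*
 * migrate-from: clear the source cpu's slot only if it still holds the
 * cid the migrating task last observed, and return that cid as a hint
 * for the destination side. The cid is released, not moved.
 */
static int migrate_from(struct mm_model *mm, int src_cpu, int last_mm_cid)
{
        if (last_mm_cid != CID_UNSET && mm->pcpu_cid[src_cpu] == last_mm_cid) {
                mm->pcpu_cid[src_cpu] = CID_UNSET;
                mm->cidmask[last_mm_cid] = 0;
        }
        return last_mm_cid;
}

/*
 * migrate-to: if the hint is smaller than the destination cpu's current
 * cid, drop the destination cid so cids converge towards small values.
 * (The real code defers this reclaim when the mm is in use there.)
 */
static void migrate_to(struct mm_model *mm, int dst_cpu, int last_mm_cid)
{
        int dst_cid = mm->pcpu_cid[dst_cpu];

        if (last_mm_cid == CID_UNSET || dst_cid == CID_UNSET ||
            last_mm_cid >= dst_cid)
                return;
        mm->pcpu_cid[dst_cpu] = CID_UNSET;
        mm->cidmask[dst_cid] = 0;
}

int main(void)
{
        struct mm_model mm = {
                .pcpu_cid = { CID_UNSET, CID_UNSET, CID_UNSET, CID_UNSET },
        };
        int cid0 = cid_get(&mm, 0);     /* thread on cpu 0 gets cid 0 */
        int cid1 = cid_get(&mm, 1);     /* thread on cpu 1 gets cid 1 */

        /* The cpu 0 thread migrates to cpu 1: the larger cid 1 is dropped,
         * so the next allocation on cpu 1 converges back to cid 0. */
        migrate_to(&mm, 1, migrate_from(&mm, 0, cid0));
        printf("cpu1 cid after migration: %d\n", mm.pcpu_cid[1]);
        (void)cid1;
        return 0;
}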

This patch is based on v6.3-rc6 with this patch applied:

("mm: Fix memory leak on mm_init error handling")

https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@efficios.com/

Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: michael.christie@oracle.com
---
Changes since v3:
- Rebase on v6.3-rc6,
- Move switch_mm_cid() from prepare_task_switch() to context_switch()
  after the switch_to() barrier,
- Add missing barriers at the beginning of switch_mm_cid(),
- Document the lock-free migrate-from algorithm.
- Add memory barriers in sched_mm_cid_exit_signals(),
  sched_mm_cid_before_execve(), and sched_mm_cid_after_execve() to
  order t->mm_cid_active store, per-mm/cpu cid load wrt migrate-from
  per-mm-cid cid cmpxchg, t->mm_cid_active load.
---
 include/linux/mm_types.h |  60 +++++++
 include/linux/sched.h    |   1 +
 include/linux/sched/mm.h |   5 +
 kernel/fork.c            |   8 +-
 kernel/sched/core.c      | 361 ++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h     | 137 ++++++++++++---
 6 files changed, 530 insertions(+), 42 deletions(-)
  

Comments

Mike Christie April 10, 2023, 9:51 p.m. UTC | #1
On 4/10/23 10:01 AM, Mathieu Desnoyers wrote:
> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> sysbench regression reported by Aaron Lu.
> 
> Keep track of the currently allocated mm_cid for each mm/cpu rather than
> freeing them immediately on context switch. This eliminates most atomic
> operations when context switching back and forth between threads
> belonging to different memory spaces in multi-threaded scenarios (many
> processes, each with many threads). The per-mm/per-cpu mm_cid values are
> serialized by their respective runqueue locks.
> 
> Thread migration is handled by introducing invocation to
> sched_mm_cid_migrate_from() in set_task_cpu() and to
> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
> activate_task() for migrating tasks. set_task_cpu() is invoked with and
> without source rq lock held: the wakeup path does not hold the source rq
> lock.
> 
> sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
> index corresponding to the source runqueue if it matches the last mm_cid
> observed by the migrated task. This last mm_cid value is returned as a
> hint to conditionally clear the mm's per-cpu mm_cid on the destination
> cpu.
> 
> Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
> the mm's destination cpu current mm_cid, clear the mm's destination cpu
> current mm_cid. If the migrated task's mm is in use on the destination
> cpu, the reclaim of the mm_cid will be done lazily on the next
> destination cpu context switch, else it is performed immediately.
> 
> The source cpu's mm_cid is _not_ simply moved to the destination cpu on
> migration, because passing ownership of the mm_cid value to the
> destination cpu while an actively running tasks also has its own
> mm_cid value (in case of lazy reclaim on next context switch) would
> over-allocate mm_cid values beyond the number of possible cpus.
> 
> Because we want to ensure the mm_cid converges towards the smaller
> values as migrations happen, the prior optimization that was done when
> context switching between threads belonging to the same mm is removed,
> because it could delay the lazy release of the destination runqueue
> mm_cid after it has been replaced by a migration. Removing this prior
> optimization is not an issue performance-wise because the introduced
> per-mm/per-cpu mm_cid tracking also covers this more specific case.
> 
> This patch is based on v6.3-rc6 with this patch applied:
> 
> ("mm: Fix memory leak on mm_init error handling")
> 
> https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@efficios.com/
> 
> Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Aaron Lu <aaron.lu@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Olivier Dion <odion@efficios.com>
> Cc: michael.christie@oracle.com

Hey thanks for fixing this.

When testing linux-next with vhost devices, without this patch IOPS get stuck at around
1.3 million total when using 8 or more devices per VM (you get a worker thread per
device). With this patch applied IOPS scale again, and we get up to 2.4M IOPS when using
up to 16 devices per VM.

Tested-by: Mike Christie <michael.christie@oracle.com>
  
Aaron Lu April 11, 2023, 4:52 a.m. UTC | #2
On Mon, Apr 10, 2023 at 11:01:50AM -0400, Mathieu Desnoyers wrote:
> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> sysbench regression reported by Aaron Lu.
> 
> Keep track of the currently allocated mm_cid for each mm/cpu rather than
> freeing them immediately on context switch. This eliminates most atomic
> operations when context switching back and forth between threads
> belonging to different memory spaces in multi-threaded scenarios (many
> processes, each with many threads). The per-mm/per-cpu mm_cid values are
> serialized by their respective runqueue locks.
> 
> Thread migration is handled by introducing invocation to
> sched_mm_cid_migrate_from() in set_task_cpu() and to
> sched_mm_cid_migrate_to() (with destination runqueue lock held) in
> activate_task() for migrating tasks. set_task_cpu() is invoked with and
> without source rq lock held: the wakeup path does not hold the source rq
> lock.
> 
> sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
> index corresponding to the source runqueue if it matches the last mm_cid
> observed by the migrated task. This last mm_cid value is returned as a
> hint to conditionally clear the mm's per-cpu mm_cid on the destination
> cpu.
> 
> Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
> the mm's destination cpu current mm_cid, clear the mm's destination cpu
> current mm_cid. If the migrated task's mm is in use on the destination
> cpu, the reclaim of the mm_cid will be done lazily on the next
> destination cpu context switch, else it is performed immediately.
> 
> The source cpu's mm_cid is _not_ simply moved to the destination cpu on
> migration, because passing ownership of the mm_cid value to the
> destination cpu while an actively running tasks also has its own
> mm_cid value (in case of lazy reclaim on next context switch) would
> over-allocate mm_cid values beyond the number of possible cpus.
> 
> Because we want to ensure the mm_cid converges towards the smaller
> values as migrations happen, the prior optimization that was done when
> context switching between threads belonging to the same mm is removed,
> because it could delay the lazy release of the destination runqueue
> mm_cid after it has been replaced by a migration. Removing this prior
> optimization is not an issue performance-wise because the introduced
> per-mm/per-cpu mm_cid tracking also covers this more specific case.
> 
> This patch is based on v6.3-rc6 with this patch applied:
> 
> ("mm: Fix memory leak on mm_init error handling")
> 
> https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@efficios.com/

Running the previously mentioned postgres_sysbench workload with this
patch applied showed there is single-digit lock contention from
cid_lock, ranging from 1.x% - 7.x% during the 3-minute run. This is worse
than v1, which I tested before, where there is almost no lock contention:
https://lore.kernel.org/lkml/20230404015949.GA14939@ziqianlu-desk2/

Detailed lock contention numbers for the 3-minute run are:

$ grep "\[k\] native_queued" *profile
perf_0.profile:     5.44%     5.44%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_1.profile:     7.49%     7.49%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_2.profile:     6.65%     6.65%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_3.profile:     3.38%     3.38%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_4.profile:     3.01%     3.01%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
perf_5.profile:     1.74%     1.74%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

Cycles were recorded roughly every 30 seconds, for a 5s window each time, on all CPUs.
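
For reference, each profile can be collected with something along these
lines (this is just one plausible invocation, not necessarily the exact
options used here):

$ perf record -a -g -e cycles -- sleep 5        # all CPUs, call graphs, 5s window
$ perf report --stdio > perf_0.profile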

And for the worst profile perf_1.profile, the call traces for the
contention are:

     7.49%     7.49%  [kernel.vmlinux]                           [k] native_queued_spin_lock_slowpath                                                           -      -
5.46% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
1.22% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;0x7fdcfd755d16
0.47% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;exit_to_user_mode_prepare;syscall_exit_to_user_mode;do_syscall_64;entry_SYSCALL_64;0x7fdcfdf9044c
0.11% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;0x7f2edc54e494

I then also tested v3 and v2; it turns out lock contention is even worse on
those two versions: v3: 5.x% - 22.x%, v2: 6% - 22.x%. It feels to me like
something changed in v2 that brought back lock contention, and then some
optimization done in v4 made things better, but still not as good as v1.

Thanks,
Aaron
  
Aaron Lu April 11, 2023, 1:12 p.m. UTC | #3
On Tue, Apr 11, 2023 at 12:52:25PM +0800, Aaron Lu wrote:
> On Mon, Apr 10, 2023 at 11:01:50AM -0400, Mathieu Desnoyers wrote:
> > Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
> > sysbench regression reported by Aaron Lu.
> > 
> > Keep track of the currently allocated mm_cid for each mm/cpu rather than
> > freeing them immediately on context switch. This eliminates most atomic
> > operations when context switching back and forth between threads
> > belonging to different memory spaces in multi-threaded scenarios (many
> > processes, each with many threads). The per-mm/per-cpu mm_cid values are
> > serialized by their respective runqueue locks.
> > 
> > Thread migration is handled by introducing invocation to
> > sched_mm_cid_migrate_from() in set_task_cpu() and to
> > sched_mm_cid_migrate_to() (with destination runqueue lock held) in
> > activate_task() for migrating tasks. set_task_cpu() is invoked with and
> > without source rq lock held: the wakeup path does not hold the source rq
> > lock.
> > 
> > sched_mm_cid_migrate_from() clears the mm_cid from the task's mm per-cpu
> > index corresponding to the source runqueue if it matches the last mm_cid
> > observed by the migrated task. This last mm_cid value is returned as a
> > hint to conditionally clear the mm's per-cpu mm_cid on the destination
> > cpu.
> > 
> > Then, in sched_mm_cid_migrate_to(), if the last mm_cid is smaller than
> > the mm's destination cpu current mm_cid, clear the mm's destination cpu
> > current mm_cid. If the migrated task's mm is in use on the destination
> > cpu, the reclaim of the mm_cid will be done lazily on the next
> > destination cpu context switch, else it is performed immediately.
> > 
> > The source cpu's mm_cid is _not_ simply moved to the destination cpu on
> > migration, because passing ownership of the mm_cid value to the
> > destination cpu while an actively running tasks also has its own
> > mm_cid value (in case of lazy reclaim on next context switch) would
> > over-allocate mm_cid values beyond the number of possible cpus.
> > 
> > Because we want to ensure the mm_cid converges towards the smaller
> > values as migrations happen, the prior optimization that was done when
> > context switching between threads belonging to the same mm is removed,
> > because it could delay the lazy release of the destination runqueue
> > mm_cid after it has been replaced by a migration. Removing this prior
> > optimization is not an issue performance-wise because the introduced
> > per-mm/per-cpu mm_cid tracking also covers this more specific case.
> > 
> > This patch is based on v6.3-rc6 with this patch applied:
> > 
> > ("mm: Fix memory leak on mm_init error handling")
> > 
> > https://lore.kernel.org/lkml/20230330133822.66271-1-mathieu.desnoyers@efficios.com/
> 
> Running the previouslly mentioned postgres_sysbench workload with this
> patch applied showed there is single digit lock contention from
> cid_lock, ranging from 1.x% - 7.x% during 3 minutes run. This is worse
> than v1 which I tested before where there is almost no lock contention:
> https://lore.kernel.org/lkml/20230404015949.GA14939@ziqianlu-desk2/
> 
> Detail lock contention number for the 3 minutes run are:
> 
> $ grep "\[k\] native_queued" *profile
> perf_0.profile:     5.44%     5.44%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_1.profile:     7.49%     7.49%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_2.profile:     6.65%     6.65%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_3.profile:     3.38%     3.38%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_4.profile:     3.01%     3.01%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> perf_5.profile:     1.74%     1.74%  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 
> cycles are recorded roughly every 30 seconds for a 5s window on all CPUs.
> 
> And for the worst profile perf_1.profile, the call traces for the
> contention are:
> 
>      7.49%     7.49%  [kernel.vmlinux]                           [k] native_queued_spin_lock_slowpath                                                           -      -
> 5.46% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
> 1.22% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;0x7fdcfd755d16
> 0.47% native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;exit_to_user_mode_prepare;syscall_exit_to_user_mode;do_syscall_64;entry_SYSCALL_64;0x7fdcfdf9044c
> 0.11% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;0x7f2edc54e494
> 
> I then also tested v3 and v2, turns out lock contention is even worse on
> those two versions. v3: 5.x% - 22.x%, v2: 6% - 22.x%. It feels to me
> something changed in v2 that brought back lock contention and then some
> optimization done in v4 made things better, but still not as good as v1.

Forget about the "v4 is better than v2 and v3" part; my later test
showed the contention can also rise to around 18% for v4.

As for why v2-v4 show more lock contention than v1: I think that's because
v1 invoked sched_mm_cid_migrate_from/to() in the move_queued_task() path,
which didn't affect tasks migrated at wakeup time, while v2-v4 invoke
sched_mm_cid_migrate_from() in set_task_cpu(), which does affect tasks
migrated at wakeup time. And for this workload, tasks migrate a lot at
wakeup time: during a 5s window there are about 5 million wakeup-time
migrations, while move_queued_task() only accounts for some hundreds or
thousands in the same window.
  
Aaron Lu April 12, 2023, 4:27 a.m. UTC | #4
Just noticed the below warning in dmesg on v4; the warning is triggered by
WARN_ON_ONCE((int) mm_cid < 0); in rseq_update_cpu_node_id().

[ 1819.649803] ------------[ cut here ]------------
[ 1819.649813] WARNING: CPU: 188 PID: 29881 at kernel/rseq.c:95 __rseq_handle_notify_resume+0x49b/0x590
[ 1819.649823] Modules linked in: veth tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel binfmt_misc kvm nls_iso8859_1 rapl intel_cstate mei_me isst_if_mbox_pci isst_if_mmio idxd input_leds joydev isst_if_common idxd_bus mei ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops pstore_blk reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect
[ 1819.649903]  sysimgblt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel dax_hmem cxl_acpi crypto_simd nvme cryptd cxl_core nvme_core i2c_i801 xhci_pci i40e drm i2c_smbus xhci_pci_renesas i2c_ismt wmi pinctrl_emmitsburg
[ 1819.649924] CPU: 188 PID: 29881 Comm: hackbench Not tainted 6.3.0-rc6-00002-g1acfd6ae9afc #2
[ 1819.649927] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.8901.D03.2210131232 10/13/2022
[ 1819.649929] RIP: 0010:__rseq_handle_notify_resume+0x49b/0x590
[ 1819.649934] Code: f0 ff ff ff ff ff 00 e9 00 fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 3a fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 66 fe ff ff <0f> 0b e9 7d fc ff ff 0f 01 ca e9 a6 fc ff ff 48 8b 4c 24 30 48 8b
[ 1819.649936] RSP: 0018:ffa0000018b0fe60 EFLAGS: 00010286
[ 1819.649939] RAX: ff11007f73500000 RBX: 00007f81a7226fe0 RCX: 0000000000000000
[ 1819.649941] RDX: 00000000ffffffff RSI: 0000000000000001 RDI: ffffffff828f9477
[ 1819.649943] RBP: ffa0000018b0fee8 R08: ff110040c192cc28 R09: ff1100408e2a3980
[ 1819.649944] R10: 0000000000000001 R11: 0000000000000000 R12: ff110040caa64000
[ 1819.649946] R13: 000000000002fa40 R14: 0000000000000000 R15: 00000000000000bc
[ 1819.649947] FS:  00007f81a7226640(0000) GS:ff11007f73500000(0000) knlGS:0000000000000000
[ 1819.649950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1819.649951] CR2: 00007fd8b00d5000 CR3: 00000040cf060001 CR4: 0000000000f71ee0
[ 1819.649953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1819.649955] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1819.649957] PKRU: 55555554
[ 1819.649958] Call Trace:
[ 1819.649960]  <TASK>
[ 1819.649964]  exit_to_user_mode_prepare+0x13b/0x1a0
[ 1819.649970]  syscall_exit_to_user_mode+0x2a/0x50
[ 1819.649976]  ? __x64_sys_read+0x1d/0x30
[ 1819.649980]  do_syscall_64+0x6d/0x90
[ 1819.649984]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1819.649990] RIP: 0033:0x7f81a77149cc
[ 1819.649996] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 c0 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff c0 f7 ff 48
[ 1819.649998] RSP: 002b:00007f81a7225d70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1819.650004] RAX: 0000000000000064 RBX: 00007f81a7225da0 RCX: 00007f81a77149cc
[ 1819.650005] RDX: 0000000000000064 RSI: 00007f81a7225da0 RDI: 0000000000000135
[ 1819.650007] RBP: 00007f81a7225e50 R08: 0000000000000000 R09: 0000000000000000
[ 1819.650008] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000004a3c12
[ 1819.650009] R13: 00007f81a7225e10 R14: 0000000000000000 R15: 0000558df7af6d00
[ 1819.650012]  </TASK>
[ 1819.650013] ---[ end trace 0000000000000000 ]---
  
Peter Zijlstra April 12, 2023, 9:10 a.m. UTC | #5
On Tue, Apr 11, 2023 at 09:12:21PM +0800, Aaron Lu wrote:

> Forget about this "v4 is better than v2 and v3" part, my later test
> showed the contention can also rise to around 18% for v4.

So while I can reproduce the initial regression on a HSW-EX system
(4*18*2) and get lovely things like:

  34.47%--schedule_hrtimeout_range_clock
          schedule
          |
          --34.42%--__schedule
                    |
                    |--31.86%--_raw_spin_lock
                    |          |
                    |           --31.65%--native_queued_spin_lock_slowpath
	            |
                    --0.72%--dequeue_task_fair
                             |
                             --0.60%--dequeue_entity

On a --threads=144 run; it is completely gone when I use v4:

  6.92%--__schedule
         |
         |--2.16%--dequeue_task_fair
         |          |
         |           --1.69%--dequeue_entity
         |                     |
         |                     |--0.61%--update_load_avg
         |                     |
         |                      --0.54%--update_curr
         |
         |--1.30%--pick_next_task_fair
         |          |
         |           --0.54%--set_next_entity
         |
         |--0.77%--psi_task_switch
         |
         --0.69%--switch_mm_irqs_off


:-(
  
Aaron Lu April 12, 2023, 11:42 a.m. UTC | #6
On Wed, Apr 12, 2023 at 11:10:43AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 11, 2023 at 09:12:21PM +0800, Aaron Lu wrote:
> 
> > Forget about this "v4 is better than v2 and v3" part, my later test
> > showed the contention can also rise to around 18% for v4.
> 
> So while I can reproduce the initial regression on a HSW-EX system
> (4*18*2) and get lovely things like:
> 
>   34.47%--schedule_hrtimeout_range_clock
>           schedule
>           |
>           --34.42%--__schedule
>                     |
>                     |--31.86%--_raw_spin_lock
>                     |          |
>                     |           --31.65%--native_queued_spin_lock_slowpath
> 	            |
>                     --0.72%--dequeue_task_fair
>                              |
>                              --0.60%--dequeue_entity
> 
> On a --threads=144 run; it is completely gone when I use v4:
> 
>   6.92%--__schedule
>          |
>          |--2.16%--dequeue_task_fair
>          |          |
>          |           --1.69%--dequeue_entity
>          |                     |
>          |                     |--0.61%--update_load_avg
>          |                     |
>          |                      --0.54%--update_curr
>          |
>          |--1.30%--pick_next_task_fair
>          |          |
>          |           --0.54%--set_next_entity
>          |
>          |--0.77%--psi_task_switch
>          |
>          --0.69%--switch_mm_irqs_off
> 
> 
> :-(

Hmm... I also tested on a 2-socket/64-core/128-CPU Icelake; the
contention number is about 20%-48% with vanilla v6.3-rc6, and after
applying v4, the contention is gone.

But it's still there on a 2-socket/112-core/224-CPU Sapphire Rapids (SPR)
with v4 (and v2, v3):

    18.38%     1.24%  [kernel.vmlinux]                           [k] __schedule
            |
            |--17.14%--__schedule
            |          |
            |          |--10.63%--mm_cid_get
            |          |          |
            |          |           --10.22%--_raw_spin_lock
            |          |                     |
            |          |                      --10.07%--native_queued_spin_lock_slowpath
            |          |
            |          |--3.43%--dequeue_task
            |          |          |
            |          |           --3.39%--dequeue_task_fair
            |          |                     |
            |          |                     |--2.60%--dequeue_entity
            |          |                     |          |
            |          |                     |          |--1.22%--update_cfs_group
            |          |                     |          |
            |          |                     |           --1.05%--update_load_avg
            |          |                     |
            |          |                      --0.63%--update_cfs_group
            |          |
            |          |--0.68%--switch_mm_irqs_off
            |          |
            |          |--0.60%--finish_task_switch.isra.0
            |          |
            |           --0.50%--psi_task_switch
            |
             --0.53%--0x55a8385c088b

It's much better than the initial 70% contention on SPR of course.

BTW, I found hackbench can also show this problem on both Icelake and SPR.

With v4, on SPR:
~/src/rt-tests-2.4/hackbench --pipe --threads -l 500000
Profile was captured 20s after starting hackbench.

    40.89%     7.71%  [kernel.vmlinux]            [k] __schedule
            |
            |--33.19%--__schedule
            |          |
            |          |--22.25%--mm_cid_get
            |          |          |
            |          |           --18.78%--_raw_spin_lock
            |          |                     |
            |          |                      --18.46%--native_queued_spin_lock_slowpath
            |          |
            |          |--7.46%--finish_task_switch.isra.0
            |          |          |
            |          |           --0.52%--asm_sysvec_call_function_single
            |          |                     sysvec_call_function_single
            |          |
            |          |--0.95%--dequeue_task
            |          |          |
            |          |           --0.93%--dequeue_task_fair
            |          |                     |
            |          |                      --0.76%--dequeue_entity
            |          |
            |           --0.75%--debug_smp_processor_id
            |


With v4, on Icelake:
~/src/rt-tests-2.4/hackbench --pipe --threads -l 500000
Profile was captured 20s after starting hackbench.

    25.83%     4.11%  [kernel.kallsyms]  [k] __schedule
            |
            |--21.72%--__schedule
            |          |
            |          |--11.68%--mm_cid_get
            |          |          |
            |          |           --9.36%--_raw_spin_lock
            |          |                     |
            |          |                      --9.09%--native_queued_spin_lock_slowpath
            |          |
            |          |--3.80%--finish_task_switch.isra.0
            |          |          |
            |          |           --0.70%--asm_sysvec_call_function_single
            |          |                     |
            |          |                      --0.69%--sysvec_call_function_single
            |          |
            |          |--2.58%--dequeue_task
            |          |          |
            |          |           --2.53%--dequeue_task_fair

I *guess* you might be able to see some contention with hackbench on
that HSW-EX system with v4.
  
Peter Zijlstra April 12, 2023, 2:26 p.m. UTC | #7
On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:

> I *guess* you might be able to see some contention with hackbench on
> that HSW-EX system with v4.

Indeed! Notably it seems to be the wakeup from idle that trips it
hardest:

11.31%     0.03%  swapper          [kernel.vmlinux]            [k] schedule_idle
        |
        --11.28%--schedule_idle
                  |
                  --11.27%--__schedule
                            |
                            |--8.61%--mm_cid_get
                            |          |
                            |           --5.78%--_raw_spin_lock
                            |                     |
                            |                      --5.56%--native_queued_spin_lock_slowpath
                            |
                            |--0.81%--finish_task_switch.isra.0
                            |          |
                            |           --0.69%--asm_sysvec_call_function_single
                            |                     |
                            |                      --0.58%--sysvec_call_function_single
                            |
                            --0.57%--switch_mm_irqs_off
  
Aaron Lu April 12, 2023, 2:39 p.m. UTC | #8
On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
> 
> > I *guess* you might be able to see some contention with hackbench on
> > that HSW-EX system with v4.
> 
> Indeed! Notably it seems to be the wakeup from idle that trips it
> hardest:

Could it be because, for idle cpus, the per-cpu/mm cid is no longer valid
(it was reclaimed for the sake of compaction), so when a task wakes there
it has to re-allocate a new cid through mm_get_cid(), which needs to
acquire mm->cid_lock?

> 11.31%     0.03%  swapper          [kernel.vmlinux]            [k] schedule_idle
>         |
>         --11.28%--schedule_idle
>                   |
>                   --11.27%--__schedule
>                             |
>                             |--8.61%--mm_cid_get
>                             |          |
>                             |           --5.78%--_raw_spin_lock
>                             |                     |
>                             |                      --5.56%--native_queued_spin_lock_slowpath
>                             |
>                             |--0.81%--finish_task_switch.isra.0
>                             |          |
>                             |           --0.69%--asm_sysvec_call_function_single
>                             |                     |
>                             |                      --0.58%--sysvec_call_function_single
>                             |
>                             --0.57%--switch_mm_irqs_off
>                     
>                     
>
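
For reference, the allocation path I have in mind looks roughly like the
following (reconstructed from the mm_cid_get() hunks quoted elsewhere in
this thread; the fast-path check is paraphrased, so this is only an
illustration, not the exact v4 code):

/*
 * Roughly how a wakeup on a previously idle cpu can end up on
 * mm->cid_lock. Only the lock/unlock shape matches the v4 hunks.
 */
static inline int mm_cid_get_sketch(struct mm_struct *mm)
{
        int *pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
        int cid = READ_ONCE(*pcpu_cid);

        if (cid >= 0)           /* this cpu still owns a cid for this mm */
                return cid;
        /*
         * If the cpu's cid for this mm was reclaimed while it sat idle
         * (for compaction), the waking task must allocate a fresh cid
         * under the mm-wide lock: this is the contention in the profiles.
         */
        raw_spin_lock(&mm->cid_lock);
        cid = __mm_cid_get_locked(mm);  /* scan the mm cidmask */
        raw_spin_unlock(&mm->cid_lock);
        WRITE_ONCE(*pcpu_cid, cid);
        return cid;
}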
  
Mathieu Desnoyers April 12, 2023, 8:57 p.m. UTC | #9
On 2023-04-12 00:27, Aaron Lu wrote:
> Just noticed below warning in dmesg on v4, the warning is triggered by
> WARN_ON_ONCE((int) mm_cid < 0); in rseq_update_cpu_node_id().

I think I know what triggers this in v4.

See sched_mm_cid_migrate_from() (near the end):

+       /*
+        * The src_cid is unused, so it can be unset.
+        */
+       if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
+           mm_cid_set_lazy_put(src_cid))
+               return;
+       __mm_cid_put(mm, src_cid);

There is a short window here, between the successful cmpxchg and clearing
the mask bit, where the thread has ownership of the cid without the rq
lock held. This can lead to rare over-allocation of cids beyond
nr_possible_cpus if the src cpu runqueue reallocates a cid concurrently,
which causes __mm_cid_get() to return -1.
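
Concretely, the interleaving I have in mind is roughly the following
(simplified; the column labels are only for illustration):

  migrate-from path (no rq lock)       src cpu (rq lock held)
  ------------------------------       ----------------------
  cmpxchg(src_pcpu_cid,
          mm_cid_set_lazy_put(src_cid),
          MM_CID_UNSET) succeeds
                                       __mm_cid_get(): the src_cid bit is
                                       still set in the mm cidmask, so a
                                       different cid must be chosen; if
                                       every other cid is in use,
                                       cpumask_first_zero() returns
                                       >= nr_cpu_ids and we end up with -1
  __mm_cid_put(mm, src_cid)
  (finally clears the bit)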

Because we want to avoid taking the src rq lock in migrate-from, I plan to
fix this by doing the following:

- disable interrupts around this cmpxchg and __mm_cid_put(),
- modify __mm_cid_get so it retries if cpumask_first_zero() returns -1.

This should take care of this kind of extremely rare over-allocation
scenario through retry, which will be bounded by the duration of the
instruction sequence between cmpxchg and clearing the bit in the bitmask.

I have not reproduced the warning on my end, only figured this out from
code review.

Thoughts ?

Thanks,

Mathieu

> 
> [ 1819.649803] ------------[ cut here ]------------
> [ 1819.649813] WARNING: CPU: 188 PID: 29881 at kernel/rseq.c:95 __rseq_handle_notify_resume+0x49b/0x590
> [ 1819.649823] Modules linked in: veth tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel binfmt_misc kvm nls_iso8859_1 rapl intel_cstate mei_me isst_if_mbox_pci isst_if_mmio idxd input_leds joydev isst_if_common idxd_bus mei ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops pstore_blk reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect
> [ 1819.649903]  sysimgblt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel dax_hmem cxl_acpi crypto_simd nvme cryptd cxl_core nvme_core i2c_i801 xhci_pci i40e drm i2c_smbus xhci_pci_renesas i2c_ismt wmi pinctrl_emmitsburg
> [ 1819.649924] CPU: 188 PID: 29881 Comm: hackbench Not tainted 6.3.0-rc6-00002-g1acfd6ae9afc #2
> [ 1819.649927] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.8901.D03.2210131232 10/13/2022
> [ 1819.649929] RIP: 0010:__rseq_handle_notify_resume+0x49b/0x590
> [ 1819.649934] Code: f0 ff ff ff ff ff 00 e9 00 fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 3a fe ff ff 48 ba 00 f0 ff ffff ff ff 00 e9 66 fe ff ff <0f> 0b e9 7d fc ff ff 0f 01 ca e9 a6 fc ff ff 48 8b 4c 24 30 48 8b
> [ 1819.649936] RSP: 0018:ffa0000018b0fe60 EFLAGS: 00010286
> [ 1819.649939] RAX: ff11007f73500000 RBX: 00007f81a7226fe0 RCX: 0000000000000000
> [ 1819.649941] RDX: 00000000ffffffff RSI: 0000000000000001 RDI: ffffffff828f9477
> [ 1819.649943] RBP: ffa0000018b0fee8 R08: ff110040c192cc28 R09: ff1100408e2a3980
> [ 1819.649944] R10: 0000000000000001 R11: 0000000000000000 R12: ff110040caa64000
> [ 1819.649946] R13: 000000000002fa40 R14: 0000000000000000 R15: 00000000000000bc
> [ 1819.649947] FS:  00007f81a7226640(0000) GS:ff11007f73500000(0000) knlGS:0000000000000000
> [ 1819.649950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1819.649951] CR2: 00007fd8b00d5000 CR3: 00000040cf060001 CR4: 0000000000f71ee0
> [ 1819.649953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1819.649955] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1819.649957] PKRU: 55555554
> [ 1819.649958] Call Trace:
> [ 1819.649960]  <TASK>
> [ 1819.649964]  exit_to_user_mode_prepare+0x13b/0x1a0
> [ 1819.649970]  syscall_exit_to_user_mode+0x2a/0x50
> [ 1819.649976]  ? __x64_sys_read+0x1d/0x30
> [ 1819.649980]  do_syscall_64+0x6d/0x90
> [ 1819.649984]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [ 1819.649990] RIP: 0033:0x7f81a77149cc
> [ 1819.649996] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 c0 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff c0 f7 ff 48
> [ 1819.649998] RSP: 002b:00007f81a7225d70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 1819.650004] RAX: 0000000000000064 RBX: 00007f81a7225da0 RCX: 00007f81a77149cc
> [ 1819.650005] RDX: 0000000000000064 RSI: 00007f81a7225da0 RDI: 0000000000000135
> [ 1819.650007] RBP: 00007f81a7225e50 R08: 0000000000000000 R09: 0000000000000000
> [ 1819.650008] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000004a3c12
> [ 1819.650009] R13: 00007f81a7225e10 R14: 0000000000000000 R15: 0000558df7af6d00
> [ 1819.650012]  </TASK>
> [ 1819.650013] ---[ end trace 0000000000000000 ]---

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
  
Peter Zijlstra April 13, 2023, 11:10 a.m. UTC | #10
On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
> On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
> > 
> > > I *guess* you might be able to see some contention with hackbench on
> > > that HSW-EX system with v4.
> > 
> > Indeed! Notably it seems to be the wakeup from idle that trips it
> > hardest:
> 
> Could it because for idle cpus, the per-cpu/mm cid is no longer valid for
> the sake of compacting and when task wakes there, it will have to
> re-allocate a new cid through mm_get_cid() which needs to acquire
> mm->cid_lock?

Yup. And I'm thinking it is futile (and counter-productive) to strive
for compactness in this (nr_threads >= nr_cpus) case.

The below on v4 solves the contention I see with hackbench (which runs
400 threads, well above the 144 cpu count on that machine).

This obviously leaves a problem with the nr_threads = nr_cpus - 1 case,
but I'm thinking we can add some fuzz (nr_cpu_ids - ilog2(nr_cpu_ids + 1)
perhaps). Also, I would think that's not something that typically
happens.

Mathieu, WDYT? -- other than that the patch is an obvious hack :-)

---
 include/linux/mm_types.h | 8 ++++++++
 kernel/fork.c            | 4 +++-
 kernel/sched/core.c      | 9 +++++++++
 kernel/sched/sched.h     | 2 ++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4160ff5c6ebd..598d1b657afa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -609,6 +609,7 @@ struct mm_struct {
 		 * were being concurrently updated by the updaters.
 		 */
 		raw_spinlock_t cid_lock;
+		int cid_saturated;
 		/**
 		 * @pcpu_cid: Per-cpu current cid.
 		 *
@@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
 	return cid & ~MM_CID_LAZY_PUT;
 }
 
+static inline void mm_cid_desaturate(struct mm_struct *mm)
+{
+	if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
+		mm->cid_saturated = 0;
+}
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
@@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
 	int i;
 
 	raw_spin_lock_init(&mm->cid_lock);
+	mm->cid_saturated = 0;
 	for_each_possible_cpu(i)
 		*per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
 	cpumask_clear(mm_cidmask(mm));
diff --git a/kernel/fork.c b/kernel/fork.c
index 3832bea713c4..a5233e450435 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users))
-		__mmput(mm);
+		return __mmput(mm);
+
+	mm_cid_desaturate(mm);
 }
 EXPORT_SYMBOL_GPL(mmput);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 425766cc1300..d5004d179531 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
 	if (last_mm_cid == -1)
 		return;
 
+	/*
+	 * When nr_threads > nr_cpus, there is no point in moving anything
+	 * around to keep it compact.
+	 */
+	if (mm->cid_saturated) {
+		t->last_mm_cid = -1;
+		return;
+	}
+
 	src_rq = task_rq(t);
 	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
 	src_cid = READ_ONCE(*src_pcpu_cid);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3e7dc2cd1cc..6c4af2992e79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
 	}
 	raw_spin_lock(&mm->cid_lock);
 	cid = __mm_cid_get_locked(mm);
+	if (cid == nr_cpu_ids - 1)
+		mm->cid_saturated = 1;
 	raw_spin_unlock(&mm->cid_lock);
 	WRITE_ONCE(*pcpu_cid, cid);
 	return cid;
  
Aaron Lu April 13, 2023, 1:13 p.m. UTC | #11
On Wed, Apr 12, 2023 at 04:57:50PM -0400, Mathieu Desnoyers wrote:
> On 2023-04-12 00:27, Aaron Lu wrote:
> > Just noticed below warning in dmesg on v4, the warning is triggered by
> > WARN_ON_ONCE((int) mm_cid < 0); in rseq_update_cpu_node_id().
> 
> I think I know what triggers this in v4.
> 
> See sched_mm_cid_migrate_from() (near the end):
> 
> +       /*
> +        * The src_cid is unused, so it can be unset.
> +        */
> +       if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
> +           mm_cid_set_lazy_put(src_cid))
> +               return;
> +       __mm_cid_put(mm, src_cid);
> 
> There is a short window here between successful cmpxchg and clear mask bit
> where the thread has ownership of the cid without rq lock held. This can
> lead to rare over-allocation of cids beyond nr_possible_cpus if the src cpu
> runqueue reallocates a cid concurrently, which causes __mm_cid_get() to
> return -1.
> 
> Because we want to avoid taking the src rq lock in migrate-from, I plan to
> fix this by doing the following:
> 
> - disable interrupts around this cmpxchg and __mm_cid_put(),

To reduce the short window?

> - modify __mm_cid_get so it retries if cpumask_first_zero() returns -1.
> 
> This should take care of this kind of extremely rare over-allocation
> scenario through retry, which will be bounded by the duration of the
> instruction sequence between cmpxchg and clearing the bit in the bitmask.
> 
> I have not reproduced the warning on my end, only figured this out from
> code review.
> 
> Thoughts ?

Sounds reasonable to me.

I tested the below diff to verify your theory:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3e7dc2cd1cc..2c0764e53b83 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)

        cpumask = mm_cidmask(mm);
        cid = cpumask_first_zero(cpumask);
-       if (cid >= nr_cpu_ids)
+       if (cid >= nr_cpu_ids) {
+               WARN_ON_ONCE(1);
                return -1;
+       }
        cpumask_set_cpu(cid, cpumask);
        return cid;
 }

And it indeed fired.

I then applied the below diff according to your description:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 425766cc1300..881571519a90 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11536,6 +11536,7 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
 	struct mm_struct *mm = t->mm;
 	struct rq *src_rq;
 	struct task_struct *src_task;
+	unsigned long flags;
 
 	if (!mm)
 		return;
@@ -11623,10 +11624,12 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
 	/*
 	 * The src_cid is unused, so it can be unset.
 	 */
+	local_irq_save(flags);
 	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
 	    mm_cid_set_lazy_put(src_cid))
 		return;
 	__mm_cid_put(mm, src_cid);
+	local_irq_restore(flags);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3e7dc2cd1cc..41ed7022bab0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)
 
 	cpumask = mm_cidmask(mm);
 	cid = cpumask_first_zero(cpumask);
-	if (cid >= nr_cpu_ids)
+	if (cid >= nr_cpu_ids) {
+		WARN_ON_ONCE(1);
 		return -1;
+	}
 	cpumask_set_cpu(cid, cpumask);
 	return cid;
 }
@@ -3345,8 +3347,10 @@ static inline int mm_cid_get(struct mm_struct *mm)
 		if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) == cid)
 			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 	}
+	cid = -1;
 	raw_spin_lock(&mm->cid_lock);
-	cid = __mm_cid_get_locked(mm);
+	while (cid == -1)
+		cid = __mm_cid_get_locked(mm);
 	raw_spin_unlock(&mm->cid_lock);
 	WRITE_ONCE(*pcpu_cid, cid);
 	return cid;

And I did several iterations of hackbench; no more warnings from the
original place, rseq_update_cpu_node_id(). It appears to have solved the
issue; I will let you know if I ever hit it again.

One more thing: with the above diff applied, I still get the warning
about cid == -1 in __mm_cid_get_locked(), so I suppose disabling irqs in
migrate_from can't entirely close the race.

Another thing: it doesn't appear there is a user of __mm_get_cid()?

Thanks,
Aaron
 
> > 
> > [ 1819.649803] ------------[ cut here ]------------
> > [ 1819.649813] WARNING: CPU: 188 PID: 29881 at kernel/rseq.c:95 __rseq_handle_notify_resume+0x49b/0x590
> > [ 1819.649823] Modules linked in: veth tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel binfmt_misc kvm nls_iso8859_1 rapl intel_cstate mei_me isst_if_mbox_pci isst_if_mmio idxd input_leds joydev isst_if_common idxd_bus mei ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops pstore_blk reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect
> > [ 1819.649903]  sysimgblt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel dax_hmem cxl_acpi crypto_simd nvme cryptd cxl_core nvme_core i2c_i801 xhci_pci i40e drm i2c_smbus xhci_pci_renesas i2c_ismt wmi pinctrl_emmitsburg
> > [ 1819.649924] CPU: 188 PID: 29881 Comm: hackbench Not tainted 6.3.0-rc6-00002-g1acfd6ae9afc #2
> > [ 1819.649927] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.8901.D03.2210131232 10/13/2022
> > [ 1819.649929] RIP: 0010:__rseq_handle_notify_resume+0x49b/0x590
> > [ 1819.649934] Code: f0 ff ff ff ff ff 00 e9 00 fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 3a fe ff ff 48 ba 00 f0 ff ffff ff ff 00 e9 66 fe ff ff <0f> 0b e9 7d fc ff ff 0f 01 ca e9 a6 fc ff ff 48 8b 4c 24 30 48 8b
> > [ 1819.649936] RSP: 0018:ffa0000018b0fe60 EFLAGS: 00010286
> > [ 1819.649939] RAX: ff11007f73500000 RBX: 00007f81a7226fe0 RCX: 0000000000000000
> > [ 1819.649941] RDX: 00000000ffffffff RSI: 0000000000000001 RDI: ffffffff828f9477
> > [ 1819.649943] RBP: ffa0000018b0fee8 R08: ff110040c192cc28 R09: ff1100408e2a3980
> > [ 1819.649944] R10: 0000000000000001 R11: 0000000000000000 R12: ff110040caa64000
> > [ 1819.649946] R13: 000000000002fa40 R14: 0000000000000000 R15: 00000000000000bc
> > [ 1819.649947] FS:  00007f81a7226640(0000) GS:ff11007f73500000(0000) knlGS:0000000000000000
> > [ 1819.649950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1819.649951] CR2: 00007fd8b00d5000 CR3: 00000040cf060001 CR4: 0000000000f71ee0
> > [ 1819.649953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 1819.649955] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> > [ 1819.649957] PKRU: 55555554
> > [ 1819.649958] Call Trace:
> > [ 1819.649960]  <TASK>
> > [ 1819.649964]  exit_to_user_mode_prepare+0x13b/0x1a0
> > [ 1819.649970]  syscall_exit_to_user_mode+0x2a/0x50
> > [ 1819.649976]  ? __x64_sys_read+0x1d/0x30
> > [ 1819.649980]  do_syscall_64+0x6d/0x90
> > [ 1819.649984]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > [ 1819.649990] RIP: 0033:0x7f81a77149cc
> > [ 1819.649996] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 c0 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff c0 f7 ff 48
> > [ 1819.649998] RSP: 002b:00007f81a7225d70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > [ 1819.650004] RAX: 0000000000000064 RBX: 00007f81a7225da0 RCX: 00007f81a77149cc
> > [ 1819.650005] RDX: 0000000000000064 RSI: 00007f81a7225da0 RDI: 0000000000000135
> > [ 1819.650007] RBP: 00007f81a7225e50 R08: 0000000000000000 R09: 0000000000000000
> > [ 1819.650008] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000004a3c12
> > [ 1819.650009] R13: 00007f81a7225e10 R14: 0000000000000000 R15: 0000558df7af6d00
> > [ 1819.650012]  </TASK>
> > [ 1819.650013] ---[ end trace 0000000000000000 ]---
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
  
Mathieu Desnoyers April 13, 2023, 1:38 p.m. UTC | #12
On 2023-04-13 09:13, Aaron Lu wrote:
> On Wed, Apr 12, 2023 at 04:57:50PM -0400, Mathieu Desnoyers wrote:
>> On 2023-04-12 00:27, Aaron Lu wrote:
>>> Just noticed below warning in dmesg on v4, the warning is triggered by
>>> WARN_ON_ONCE((int) mm_cid < 0); in rseq_update_cpu_node_id().
>>
>> I think I know what triggers this in v4.
>>
>> See sched_mm_cid_migrate_from() (near the end):
>>
>> +       /*
>> +        * The src_cid is unused, so it can be unset.
>> +        */
>> +       if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
>> +           mm_cid_set_lazy_put(src_cid))
>> +               return;
>> +       __mm_cid_put(mm, src_cid);
>>
>> There is a short window here between successful cmpxchg and clear mask bit
>> where the thread has ownership of the cid without rq lock held. This can
>> lead to rare over-allocation of cids beyond nr_possible_cpus if the src cpu
>> runqueue reallocates a cid concurrently, which causes __mm_cid_get() to
>> return -1.
>>
>> Because we want to avoid taking the src rq lock in migrate-from, I plan to
>> fix this by doing the following:
>>
>> - disable interrupts around this cmpxchg and __mm_cid_put(),
> 
> To reduce the short window?

Yes, just reduce. Eliminating the window would require grabbing the
src rq lock which we want to avoid.

> 
>> - modify __mm_cid_get so it retries if cpumask_first_zero() returns -1.
>>
>> This should take care of this kind of extremely rare over-allocation
>> scenario through retry, which will be bounded by the duration of the
>> instruction sequence between cmpxchg and clearing the bit in the bitmask.
>>
>> I have not reproduced the warning on my end, only figured this out from
>> code review.
>>
>> Thoughts ?
> 
> Sounds reasonable to me.
> 
> I tested below diff to verify your theory:
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..2c0764e53b83 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)
> 
>          cpumask = mm_cidmask(mm);
>          cid = cpumask_first_zero(cpumask);
> -       if (cid >= nr_cpu_ids)
> +       if (cid >= nr_cpu_ids) {
> +               WARN_ON_ONCE(1);
>                  return -1;
> +       }
>          cpumask_set_cpu(cid, cpumask);
>          return cid;
>   }
> 
> And it indeed fired.
> 
> I then applied the below diff according to your description:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 425766cc1300..881571519a90 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11536,6 +11536,7 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
>   	struct mm_struct *mm = t->mm;
>   	struct rq *src_rq;
>   	struct task_struct *src_task;
> +	unsigned long flags;
>   
>   	if (!mm)
>   		return;
> @@ -11623,10 +11624,12 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
>   	/*
>   	 * The src_cid is unused, so it can be unset.
>   	 */
> +	local_irq_save(flags);
>   	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
>   	    mm_cid_set_lazy_put(src_cid))
>   		return;
>   	__mm_cid_put(mm, src_cid);
> +	local_irq_restore(flags);
>   }
>   
>   /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..41ed7022bab0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)
>   
>   	cpumask = mm_cidmask(mm);
>   	cid = cpumask_first_zero(cpumask);
> -	if (cid >= nr_cpu_ids)
> +	if (cid >= nr_cpu_ids) {
> +		WARN_ON_ONCE(1);
>   		return -1;
> +	}
>   	cpumask_set_cpu(cid, cpumask);
>   	return cid;
>   }
> @@ -3345,8 +3347,10 @@ static inline int mm_cid_get(struct mm_struct *mm)
>   		if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) == cid)
>   			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
>   	}
> +	cid = -1;
>   	raw_spin_lock(&mm->cid_lock);
> -	cid = __mm_cid_get_locked(mm);
> +	while (cid == -1)
> +		cid = __mm_cid_get_locked(mm);
>   	raw_spin_unlock(&mm->cid_lock);
>   	WRITE_ONCE(*pcpu_cid, cid);
>   	return cid;
> 
> And did several iterations of hackbench, no more warnings from the
> original place rseq_update_cpu_node_id(). It appears solved the issue,
> will let you know if I ever hit it again.
> 
> One more thing is, with the above diff applied, I still get the warning
> about cid == -1 in __mm_cid_get_locked() so I suppose disabling irq in
> migrate_from can't entirely close the race.

Yes, this is expected. We need a retry loop on "get" when we find a
fully set bitmap. This is why I want to disable irqs in migrate-from:
to keep the retry loop, which runs with irqs off and the rq lock held,
as short as possible. Disabling interrupts on the other side achieves
this.

> 
> Another thing, it doesn't appear there is an user of __mm_get_cid()?

True. I'm preparing a v5 which significantly changes things: I introduce
a lock-free get, which hopefully should help reduce contention on the
qspinlock.

Thanks,

Mathieu


> 
> Thanks,
> Aaron
>   
>>>
>>> [ 1819.649803] ------------[ cut here ]------------
>>> [ 1819.649813] WARNING: CPU: 188 PID: 29881 at kernel/rseq.c:95 __rseq_handle_notify_resume+0x49b/0x590
>>> [ 1819.649823] Modules linked in: veth tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel binfmt_misc kvm nls_iso8859_1 rapl intel_cstate mei_me isst_if_mbox_pci isst_if_mmio idxd input_leds joydev isst_if_common idxd_bus mei ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops pstore_blk reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect
>>> [ 1819.649903]  sysimgblt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel dax_hmem cxl_acpi crypto_simd nvme cryptd cxl_core nvme_core i2c_i801 xhci_pci i40e drm i2c_smbus xhci_pci_renesas i2c_ismt wmi pinctrl_emmitsburg
>>> [ 1819.649924] CPU: 188 PID: 29881 Comm: hackbench Not tainted 6.3.0-rc6-00002-g1acfd6ae9afc #2
>>> [ 1819.649927] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.8901.D03.2210131232 10/13/2022
>>> [ 1819.649929] RIP: 0010:__rseq_handle_notify_resume+0x49b/0x590
>>> [ 1819.649934] Code: f0 ff ff ff ff ff 00 e9 00 fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 3a fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 66 fe ff ff <0f> 0b e9 7d fc ff ff 0f 01 ca e9 a6 fc ff ff 48 8b 4c 24 30 48 8b
>>> [ 1819.649936] RSP: 0018:ffa0000018b0fe60 EFLAGS: 00010286
>>> [ 1819.649939] RAX: ff11007f73500000 RBX: 00007f81a7226fe0 RCX: 0000000000000000
>>> [ 1819.649941] RDX: 00000000ffffffff RSI: 0000000000000001 RDI: ffffffff828f9477
>>> [ 1819.649943] RBP: ffa0000018b0fee8 R08: ff110040c192cc28 R09: ff1100408e2a3980
>>> [ 1819.649944] R10: 0000000000000001 R11: 0000000000000000 R12: ff110040caa64000
>>> [ 1819.649946] R13: 000000000002fa40 R14: 0000000000000000 R15: 00000000000000bc
>>> [ 1819.649947] FS:  00007f81a7226640(0000) GS:ff11007f73500000(0000) knlGS:0000000000000000
>>> [ 1819.649950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1819.649951] CR2: 00007fd8b00d5000 CR3: 00000040cf060001 CR4: 0000000000f71ee0
>>> [ 1819.649953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 1819.649955] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>>> [ 1819.649957] PKRU: 55555554
>>> [ 1819.649958] Call Trace:
>>> [ 1819.649960]  <TASK>
>>> [ 1819.649964]  exit_to_user_mode_prepare+0x13b/0x1a0
>>> [ 1819.649970]  syscall_exit_to_user_mode+0x2a/0x50
>>> [ 1819.649976]  ? __x64_sys_read+0x1d/0x30
>>> [ 1819.649980]  do_syscall_64+0x6d/0x90
>>> [ 1819.649984]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>> [ 1819.649990] RIP: 0033:0x7f81a77149cc
>>> [ 1819.649996] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 c0 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff c0 f7 ff 48
>>> [ 1819.649998] RSP: 002b:00007f81a7225d70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>>> [ 1819.650004] RAX: 0000000000000064 RBX: 00007f81a7225da0 RCX: 00007f81a77149cc
>>> [ 1819.650005] RDX: 0000000000000064 RSI: 00007f81a7225da0 RDI: 0000000000000135
>>> [ 1819.650007] RBP: 00007f81a7225e50 R08: 0000000000000000 R09: 0000000000000000
>>> [ 1819.650008] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000004a3c12
>>> [ 1819.650009] R13: 00007f81a7225e10 R14: 0000000000000000 R15: 0000558df7af6d00
>>> [ 1819.650012]  </TASK>
>>> [ 1819.650013] ---[ end trace 0000000000000000 ]---
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
  
Aaron Lu April 13, 2023, 1:49 p.m. UTC | #13
On Thu, Apr 13, 2023 at 01:10:47PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
> > On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
> > > 
> > > > I *guess* you might be able to see some contention with hackbench on
> > > > that HSW-EX system with v4.
> > > 
> > > Indeed! Notably it seems to be the wakeup from idle that trips it
> > > hardest:
> > 
> > Could it be because, for idle cpus, the per-cpu/mm cid is no longer valid
> > for the sake of compacting, so when a task wakes there it will have to
> > re-allocate a new cid through mm_cid_get(), which needs to acquire
> > mm->cid_lock?
> 
> Yup. And I'm thinking it is futile (and counter productive) to strive
> for compactness in this (nr_threads >= nr_cpus) case.
> 
> The below on v4 solves the contention I see with hackbench (which runs
> 400 threads which is well above the 144 cpu count on that machine).

I also tested on Icelake and SPR using hackbench and the contention is
also gone :-)

I then tested with sysbench_postgres; the contention is still there on
SPR:

    11.63%    11.62%  [kernel.vmlinux]        [k]     native_queued_spin_lock_slowpath
    7.91%     native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
    3.17%     native_queued_spin_lock_slowpath;_raw_spin_lock;mm_cid_get;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;epoll_wait

> 
> This obviously leaves a problem with the nr_threads = nr_cpus - 1 case,
> but I'm thinking we can add some fuzz (nr_cpu_ids - ilog2(nr_cpu_ids+1)
> perhaps). Also, I would be thinking that's not something that typically
> happens.

Let me see if adding some fuzz helps here.
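
If I read the suggestion right, that would turn the exact check in the
hack into something like the below (untested sketch, just to be sure I
test the right thing); on the 144 cpu machine, assuming nr_cpu_ids ==
144, the threshold becomes 144 - ilog2(145) = 144 - 7 = 137:

	/* In mm_cid_get(), replacing the "cid == nr_cpu_ids - 1" check: */
	if (cid >= nr_cpu_ids - ilog2(nr_cpu_ids + 1))
		mm->cid_saturated = 1;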

For sysbench_postgres, the server uses the process model while the
client uses the thread model, and the client (sysbench) starts nr_cpus
threads, so it likely didn't trigger the (nr_cpus - 1) bar here. I
suspect the bar may have to be as low as 1/2 nr_cpus, considering that
some of the cpus are running the server processes.

> 
> Mathieu, WDYT? -- other than that the patch is an obvious hack :-)
> 
> ---
>  include/linux/mm_types.h | 8 ++++++++
>  kernel/fork.c            | 4 +++-
>  kernel/sched/core.c      | 9 +++++++++
>  kernel/sched/sched.h     | 2 ++
>  4 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4160ff5c6ebd..598d1b657afa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -609,6 +609,7 @@ struct mm_struct {
>  		 * were being concurrently updated by the updaters.
>  		 */
>  		raw_spinlock_t cid_lock;
> +		int cid_saturated;
>  		/**
>  		 * @pcpu_cid: Per-cpu current cid.
>  		 *
> @@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
>  	return cid & ~MM_CID_LAZY_PUT;
>  }
>  
> +static inline void mm_cid_desaturate(struct mm_struct *mm)
> +{
> +	if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
> +		mm->cid_saturated = 0;
> +}
> +
>  /* Accessor for struct mm_struct's cidmask. */
>  static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
>  {
> @@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
>  	int i;
>  
>  	raw_spin_lock_init(&mm->cid_lock);
> +	mm->cid_saturated = 0;
>  	for_each_possible_cpu(i)
>  		*per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
>  	cpumask_clear(mm_cidmask(mm));
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3832bea713c4..a5233e450435 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
>  	might_sleep();
>  
>  	if (atomic_dec_and_test(&mm->mm_users))
> -		__mmput(mm);
> +		return __mmput(mm);
> +
> +	mm_cid_desaturate(mm);
>  }
>  EXPORT_SYMBOL_GPL(mmput);
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 425766cc1300..d5004d179531 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
>  	if (last_mm_cid == -1)
>  		return;
>  
> +	/*
> +	 * When nr_threads > nr_cpus, there is no point in moving anything
> +	 * around to keep it compact.
> +	 */
> +	if (mm->cid_saturated) {
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +
>  	src_rq = task_rq(t);
>  	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
>  	src_cid = READ_ONCE(*src_pcpu_cid);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..6c4af2992e79 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
>  	}
>  	raw_spin_lock(&mm->cid_lock);
>  	cid = __mm_cid_get_locked(mm);
> +	if (cid == nr_cpu_ids - 1)
> +		mm->cid_saturated = 1;
>  	raw_spin_unlock(&mm->cid_lock);
>  	WRITE_ONCE(*pcpu_cid, cid);
>  	return cid;
  
Mathieu Desnoyers April 13, 2023, 1:56 p.m. UTC | #14
On 2023-04-13 07:10, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 10:39:34PM +0800, Aaron Lu wrote:
>> On Wed, Apr 12, 2023 at 04:26:16PM +0200, Peter Zijlstra wrote:
>>> On Wed, Apr 12, 2023 at 07:42:40PM +0800, Aaron Lu wrote:
>>>
>>>> I *guess* you might be able to see some contention with hackbench on
>>>> that HSW-EX system with v4.
>>>
>>> Indeed! Notably it seems to be the wakeup from idle that trips it
>>> hardest:
>>
>> Could it be because, for idle cpus, the per-cpu/mm cid is no longer valid
>> for the sake of compacting, so when a task wakes there it will have to
>> re-allocate a new cid through mm_cid_get(), which needs to acquire
>> mm->cid_lock?
> 
> Yup. And I'm thinking it is futile (and counter productive) to strive
> for compactness in this (nr_threads >= nr_cpus) case.
> 
> The below on v4 solves the contention I see with hackbench (which runs
> 400 threads which is well above the 144 cpu count on that machine).
> 
> This obviously leaves a problem with the nr_threads = nr_cpus - 1 case,
> but I'm thinking we can add some fuzz (nr_cpu_ids - ilog2(nr_cpu_ids+1)
> perhaps). Also, I would be thinking that's not something that typically
> happens.
> 
> Mathieu, WDYT? -- other than that the patch is an obvious hack :-)

I hate it with passion :-)

It is quite specific to your workload/configuration.

If we take for instance a process with a large mm_users count which is 
eventually affined to a subset of the cpus with cpusets or 
sched_setaffinity, your patch will prevent compaction of the concurrency 
ids when it really should not.

I'm working on a lock-free cid-get; hopefully my next version will
eliminate the performance regression.

Thanks,

Mathieu

> 
> ---
>   include/linux/mm_types.h | 8 ++++++++
>   kernel/fork.c            | 4 +++-
>   kernel/sched/core.c      | 9 +++++++++
>   kernel/sched/sched.h     | 2 ++
>   4 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4160ff5c6ebd..598d1b657afa 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -609,6 +609,7 @@ struct mm_struct {
>   		 * were being concurrently updated by the updaters.
>   		 */
>   		raw_spinlock_t cid_lock;
> +		int cid_saturated;
>   		/**
>   		 * @pcpu_cid: Per-cpu current cid.
>   		 *
> @@ -912,6 +913,12 @@ static inline int mm_cid_clear_lazy_put(int cid)
>   	return cid & ~MM_CID_LAZY_PUT;
>   }
>   
> +static inline void mm_cid_desaturate(struct mm_struct *mm)
> +{
> +	if (mm->cid_saturated && atomic_read(&mm->mm_users) < nr_cpu_ids)
> +		mm->cid_saturated = 0;
> +}
> +
>   /* Accessor for struct mm_struct's cidmask. */
>   static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
>   {
> @@ -928,6 +935,7 @@ static inline void mm_init_cid(struct mm_struct *mm)
>   	int i;
>   
>   	raw_spin_lock_init(&mm->cid_lock);
> +	mm->cid_saturated = 0;
>   	for_each_possible_cpu(i)
>   		*per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
>   	cpumask_clear(mm_cidmask(mm));
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3832bea713c4..a5233e450435 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1233,7 +1233,9 @@ void mmput(struct mm_struct *mm)
>   	might_sleep();
>   
>   	if (atomic_dec_and_test(&mm->mm_users))
> -		__mmput(mm);
> +		return __mmput(mm);
> +
> +	mm_cid_desaturate(mm);
>   }
>   EXPORT_SYMBOL_GPL(mmput);
>   
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 425766cc1300..d5004d179531 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11550,6 +11550,15 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
>   	if (last_mm_cid == -1)
>   		return;
>   
> +	/*
> +	 * When nr_threads > nr_cpus, there is no point in moving anything
> +	 * around to keep it compact.
> +	 */
> +	if (mm->cid_saturated) {
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +
>   	src_rq = task_rq(t);
>   	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
>   	src_cid = READ_ONCE(*src_pcpu_cid);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..6c4af2992e79 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3347,6 +3347,8 @@ static inline int mm_cid_get(struct mm_struct *mm)
>   	}
>   	raw_spin_lock(&mm->cid_lock);
>   	cid = __mm_cid_get_locked(mm);
> +	if (cid == nr_cpu_ids - 1)
> +		mm->cid_saturated = 1;
>   	raw_spin_unlock(&mm->cid_lock);
>   	WRITE_ONCE(*pcpu_cid, cid);
>   	return cid;
  
Peter Zijlstra April 13, 2023, 3:20 p.m. UTC | #15
On Thu, Apr 13, 2023 at 09:56:38AM -0400, Mathieu Desnoyers wrote:

> > Mathieu, WDYT? -- other than that the patch is an obvious hack :-)
> 
> I hate it with passion :-)
> 
> It is quite specific to your workload/configuration.
> 
> If we take for instance a process with a large mm_users count which is
> eventually affined to a subset of the cpus with cpusets or
> sched_setaffinity, your patch will prevent compaction of the concurrency ids
> when it really should not.

I don't think it will; it will only kick in once the highest cid is
handed out (I should've used num_online_cpus() instead of nr_cpu_ids),
and with affinity at play that should never happen.

Now, the fancier scheme with:

	min(t->nr_cpus_allowed, atomic_read(&t->mm->mm_users))

does get more complex; and I've yet to find a working version that
doesn't also need a for_each_cpu() loop for reclaim :/
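
For the record, the shape I was poking at is roughly the below (sketch
only, the helper name is invented, and the reclaim side is exactly the
part that is missing):

	/* Upper bound on the number of cids worth compacting for this mm. */
	static inline int mm_cid_useful_max(struct task_struct *t)
	{
		return min(t->nr_cpus_allowed, atomic_read(&t->mm->mm_users));
	}

The trouble is that whenever that bound shrinks (affinity change,
threads exiting), any cid at or above the new bound has to be hunted
down, which is where the for_each_cpu() scan over the per-cpu cids
comes in.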

Anyway, I think the hack as presented is safe, but a hack none-the-less.
  
Mathieu Desnoyers April 13, 2023, 3:37 p.m. UTC | #16
On 2023-04-13 11:20, Peter Zijlstra wrote:
> On Thu, Apr 13, 2023 at 09:56:38AM -0400, Mathieu Desnoyers wrote:
> 
>>> Mathieu, WDYT? -- other than that the patch is an obvious hack :-)
>>
>> I hate it with passion :-)
>>
>> It is quite specific to your workload/configuration.
>>
>> If we take for instance a process with a large mm_users count which is
>> eventually affined to a subset of the cpus with cpusets or
>> sched_setaffinity, your patch will prevent compaction of the concurrency ids
>> when it really should not.
> 
> I don't think it will; it will only kick in once the highest cid is
> handed out (I should've used num_online_cpus() instead of nr_cpu_ids),
> and with affinity at play that should never happen.

So in that case, this optimization will only work if affinity is not
set. E.g., a hackbench run with a cpuset or sched_setaffinity mask
excluding one core from the set will still be slower.

> 
> Now, the fancier scheme with:
> 
> 	min(t->nr_cpus_allowed, atomic_read(&t->mm->mm_users))
> 
> does get more complex; and I've yet to find a working version that
> doesn't also need a for_each_cpu() loop for reclaim :/

Indeed. And with an allowed-cpus approach, we need to carefully consider
what happens if we change an allowed cpu mask from one set to another
set, e.g., from allowing cpus 0, 1 to allowing only cpus 2, 3. There
will be task migration, and we need to reclaim the cids from 0, 1, but
we can very well be in a case where the number of mm_users is above the
number of allowed cpus.

> 
> Anyway, I think the hack as presented is safe, but a hack none-the-less.

I don't think it is _unsafe_, but it will only trigger in specific
scenarios, which makes it harder to understand the more subtle
performance regressions in scenarios that are not covered by this hack.

Thanks,

Mathieu
  

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a57e6ae78e65..4160ff5c6ebd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -609,6 +609,14 @@  struct mm_struct {
 		 * were being concurrently updated by the updaters.
 		 */
 		raw_spinlock_t cid_lock;
+		/**
+		 * @pcpu_cid: Per-cpu current cid.
+		 *
+		 * Keep track of the currently allocated mm_cid for each cpu.
+		 * The per-cpu mm_cid values are serialized by their respective
+		 * runqueue locks.
+		 */
+		int __percpu *pcpu_cid;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
@@ -873,6 +881,37 @@  static inline void vma_iter_init(struct vma_iterator *vmi,
 }
 
 #ifdef CONFIG_SCHED_MM_CID
+
+enum mm_cid_state {
+	MM_CID_UNSET = -1U,		/* Unset state has lazy_put flag set. */
+	MM_CID_LAZY_PUT = (1U << 31),
+};
+
+static inline bool mm_cid_is_unset(int cid)
+{
+	return cid == MM_CID_UNSET;
+}
+
+static inline bool mm_cid_is_lazy_put(int cid)
+{
+	return !mm_cid_is_unset(cid) && (cid & MM_CID_LAZY_PUT);
+}
+
+static inline bool mm_cid_is_valid(int cid)
+{
+	return !(cid & MM_CID_LAZY_PUT);
+}
+
+static inline int mm_cid_set_lazy_put(int cid)
+{
+	return cid | MM_CID_LAZY_PUT;
+}
+
+static inline int mm_cid_clear_lazy_put(int cid)
+{
+	return cid & ~MM_CID_LAZY_PUT;
+}
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
@@ -886,16 +925,37 @@  static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 
 static inline void mm_init_cid(struct mm_struct *mm)
 {
+	int i;
+
 	raw_spin_lock_init(&mm->cid_lock);
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(mm->pcpu_cid, i) = MM_CID_UNSET;
 	cpumask_clear(mm_cidmask(mm));
 }
 
+static inline int mm_alloc_cid(struct mm_struct *mm)
+{
+	mm->pcpu_cid = alloc_percpu(int);
+	if (!mm->pcpu_cid)
+		return -ENOMEM;
+	mm_init_cid(mm);
+	return 0;
+}
+
+static inline void mm_destroy_cid(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_cid);
+	mm->pcpu_cid = NULL;
+}
+
 static inline unsigned int mm_cid_size(void)
 {
 	return cpumask_size();
 }
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm) { }
+static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_cid(struct mm_struct *mm) { }
 static inline unsigned int mm_cid_size(void)
 {
 	return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..48d48b2c73a5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1313,6 +1313,7 @@  struct task_struct {
 
 #ifdef CONFIG_SCHED_MM_CID
 	int				mm_cid;		/* Current cid in mm */
+	int				last_mm_cid;	/* Most recent cid in mm */
 	int				mm_cid_active;	/* Whether cid bitmap is active */
 #endif
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..f20fc0600fcc 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -37,6 +37,11 @@  static inline void mmgrab(struct mm_struct *mm)
 	atomic_inc(&mm->mm_count);
 }
 
+static inline void smp_mb__after_mmgrab(void)
+{
+	smp_mb__after_atomic();
+}
+
 extern void __mmdrop(struct mm_struct *mm);
 
 static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index ea332319dffe..3832bea713c4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -793,6 +793,7 @@  void __mmdrop(struct mm_struct *mm)
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
+	mm_destroy_cid(mm);
 
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		percpu_counter_destroy(&mm->rss_stat[i]);
@@ -1057,6 +1058,7 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_SCHED_MM_CID
 	tsk->mm_cid = -1;
+	tsk->last_mm_cid = -1;
 	tsk->mm_cid_active = 0;
 #endif
 	return tsk;
@@ -1162,18 +1164,22 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
+	if (mm_alloc_cid(mm))
+		goto fail_cid;
+
 	for (i = 0; i < NR_MM_COUNTERS; i++)
 		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
 			goto fail_pcpu;
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
-	mm_init_cid(mm);
 	return mm;
 
 fail_pcpu:
 	while (i > 0)
 		percpu_counter_destroy(&mm->rss_stat[--i]);
+	mm_destroy_cid(mm);
+fail_cid:
 	destroy_context(mm);
 fail_nocontext:
 	mm_free_pgd(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..425766cc1300 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,8 +2084,10 @@  static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_on_rq_migrating(p))
+	if (task_on_rq_migrating(p)) {
 		flags |= ENQUEUE_MIGRATED;
+		sched_mm_cid_migrate_to(rq, p);
+	}
 
 	enqueue_task(rq, p, flags);
 
@@ -3195,6 +3197,7 @@  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		rseq_migrate(p);
+		sched_mm_cid_migrate_from(p);
 		perf_event_task_migrate(p);
 	}
 
@@ -5114,7 +5117,6 @@  prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
-	switch_mm_cid(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5270,6 +5272,9 @@  context_switch(struct rq *rq, struct task_struct *prev,
 	 *
 	 * kernel ->   user   switch + mmdrop() active
 	 *   user ->   user   switch
+	 *
+	 * switch_mm_cid() needs to be updated if the barriers provided
+	 * by context_switch() are modified.
 	 */
 	if (!next->mm) {                                // to kernel
 		enter_lazy_tlb(prev->active_mm, next);
@@ -5299,6 +5304,9 @@  context_switch(struct rq *rq, struct task_struct *prev,
 		}
 	}
 
+	/* switch_mm_cid() requires the memory barriers above. */
+	switch_mm_cid(prev, next);
+
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 
 	prepare_lock_switch(rq, next, rf);
@@ -11383,45 +11391,360 @@  void call_trace_sched_update_nr_running(struct rq *rq, int count)
 }
 
 #ifdef CONFIG_SCHED_MM_CID
+
+/*
+ * mm_cid migrate-from implements a lock-free algorithm to clear per-mm/cpu cid
+ * concurrently with respect to the execution of the source runqueue context
+ * switch.
+ *
+ * There are two basic properties we want is to guarantee here:
+ *
+ * (1) Migrate-from should _never_ mark a per-cpu cid UNSET when it is actively
+ *     used by a task. That would lead to concurrent allocation of the cid and
+ *     userspace corruption.
+ *
+ * (2) Migrate-from should ideally not leak a cid value when there are no more
+ *     tasks using it on the cpu. This ensures we keep cid allocations as
+ *     compact as possible (closest to 0). Getting this wrong will not cause
+ *     userspace corruption, just a less compact id allocation.
+ *
+ * Provide those guarantees by introducing a Dekker memory ordering to
+ * guarantee that a pair of loads observe at least one of a pair of stores,
+ * which can be shown as:
+ *
+ *      X = Y = 0
+ *
+ *      w[X]=1          w[Y]=1
+ *      MB              MB
+ *      r[Y]=y          r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible. But rather than using
+ * values 0 and 1, this algorithm cares about specific state transitions of the
+ * runqueue current task (as updated by the scheduler context switch), and the
+ * per-mm/cpu cid value.
+ *
+ * Let's introduce task (Y) which has mm == src_task->mm and task (N) which has
+ * mm != src_task->mm for the rest of the discussion. There are two scheduler
+ * state transitions on context switch we care about:
+ *
+ * (TSA) Store to rq->curr with transition from (N) to (Y)
+ *
+ * (TSB) Store to rq->curr with transition from (Y) to (N)
+ *
+ * On the migrate-from side, there is one transition we care about:
+ *
+ * (TMA) cmpxchg to *pcpu_cid to set the LAZY flag
+ *
+ * There is also a transition to UNSET state which can be performed from all
+ * sides (scheduler, migrate-from). It is always performed with a cmpxchg which
+ * guarantees that only a single thread will succeed:
+ *
+ * (TMB) cmpxchg to *pcpu_cid to mark UNSET
+ *
+ * Just to be clear, what we do _not_ want to happen is a transition to UNSET
+ * when a thread is actively using the cid (property (1)). And ideally we do
+ * not want to leak a cid after migrating the last task from a cpu (property
+ * (2)).
+ *
+ * Let's looks at the relevant combinations of TSA/TSB, and TMA transitions.
+ *
+ * Scenario A) (TSA)+(TMA) (from next task perspective)
+ *
+ * CPU0                                      CPU1
+ *
+ * Context switch CS-1                       Migrate-from
+ *   - store to rq->curr: (N)->(Y) (TSA)     - cmpxchg to *pcpu_id to LAZY (TMA)
+ *                                             (implied barrier after cmpxchg)
+ *   - switch_mm_cid()
+ *     - memory barrier (see switch_mm_cid()
+ *       comment explaining how this barrier
+ *       is combined with other scheduler
+ *       barriers)
+ *     - mm_cid_get (next)
+ *       - READ_ONCE(*pcpu_cid)              - rcu_dereference(src_rq->curr)
+ *
+ * This Dekker ensures that either task (Y) is observed by the
+ * rcu_dereference() or the LAZY flag is observed by READ_ONCE(), or both are
+ * observed.
+ *
+ * If task (Y) store is observed by rcu_dereference(), it means that there is
+ * still an active task on the cpu. Migrate-from will therefore not transition
+ * to UNSET, which fulfills property (1). That observed task will itself
+ * eventually need a migrate-from to be migrated away from that cpu, which
+ * fulfills property (2).
+ *
+ * If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(),
+ * it will move its state to UNSET, which clears the percpu cid perhaps
+ * uselessly (which is not an issue for correctness). Because task (Y) is not
+ * observed, CPU1 can move ahead to set the state to UNSET. Because moving
+ * state to UNSET is done with a cmpxchg expecting that the old state has the
+ * LAZY flag set, only one thread will successfully UNSET.
+ *
+ * If both states (LAZY flag and task (Y)) are observed, the thread on CPU0
+ * will observe the LAZY flag and transition to UNSET (perhaps uselessly), and
+ * CPU1 will observe task (Y) and do nothing more, which is fine.
+ *
+ * What we are effectively preventing with this Dekker is a scenario where
+ * neither LAZY flag nor store (Y) are observed, which would fail property (1)
+ * because this would UNSET a cid which is actively used.
+ *
+ *
+ * Scenario B) (TSB)+(TMA) (from prev task perspective)
+ *
+ * CPU0                                      CPU1
+ *
+ * Context switch CS-1                       Migrate-from
+ *   - store to rq->curr: (Y)->(N) (TSB)     - cmpxchg to *pcpu_id to LAZY (TMA)
+ *                                           (implied barrier after cmpxchg)
+ *   - switch_mm_cid()
+ *     - memory barrier (see switch_mm_cid()
+ *       comment explaining how this barrier
+ *       is combined with other scheduler
+ *       barriers)
+ *     - cid_put_lazy() (prev)
+ *       - READ_ONCE(*pcpu_cid)              - rcu_dereference(src_rq->curr)
+ *
+ * This Dekker ensures that either task (N) is observed by the
+ * rcu_dereference() or the LAZY flag is observed by READ_ONCE(), or both are
+ * observed.
+ *
+ * If rcu_dereference observes (N) but LAZY is not observed, migrate-from will
+ * take care to advance the state to UNSET, thus fulfilling property (2).
+ * Because (Y) is not running anymore property (1) is fulfilled.
+ *
+ * If rcu_dereference does not observe (N), but LAZY is observed, migrate-from
+ * does not advance to UNSET because it observes (Y), but LAZY flag will make
+ * the task on CPU0 take care of advancing the state to UNSET, thus fulfilling
+ * property (2).
+ *
+ * If both (N) and LAZY are observed, both migrate-from and CPU0 will try to
+ * advance the state to UNSET, but only one will succeed its cmpxchg.
+ *
+ * What we are effectively preventing with this Dekker is a scenario where
+ * neither LAZY flag nor store (N) are observed, which would fail property (2)
+ * because it would leak a cid on a cpu that has no task left using the
+ * mm.
+ */
+
+/*
+ * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
+ * that the rq lock is held.
+ */
+void sched_mm_cid_migrate_from(struct task_struct *t)
+{
+	int src_cid, *src_pcpu_cid, last_mm_cid;
+	struct mm_struct *mm = t->mm;
+	struct rq *src_rq;
+	struct task_struct *src_task;
+
+	if (!mm)
+		return;
+
+	last_mm_cid = t->last_mm_cid;
+	/*
+	 * If the migrated task has no last cid, or if the current
+	 * task on src rq uses the cid, it means the destination cpu
+	 * does not have to reallocate its cid to keep the cid allocation
+	 * compact.
+	 */
+	if (last_mm_cid == -1)
+		return;
+
+	src_rq = task_rq(t);
+	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
+	src_cid = READ_ONCE(*src_pcpu_cid);
+
+	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
+		return;
+
+	/*
+	 * If we observe an active task using the mm on this rq, it means we
+	 * are not the last task to be migrated from this cpu for this mm, so
+	 * there is no need to clear the src_cid.
+	 */
+	rcu_read_lock();
+	src_task = rcu_dereference(src_rq->curr);
+	if (READ_ONCE(src_task->mm_cid_active) && src_task->mm == mm) {
+		rcu_read_unlock();
+		t->last_mm_cid = -1;
+		return;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * If the source cpu cid is set, and matches the last cid of the
+	 * migrated task, clear the source cpu cid to keep cid allocation
+	 * compact to cover the case where this task is the last task using
+	 * this mm on the source cpu. If there happens to be other tasks left
+	 * on the source cpu using this mm, the next task using this mm will
+	 * reallocate its cid on context switch.
+	 *
+	 * We cannot keep ownership of concurrency ID without runqueue
+	 * lock held when it is not used by a current task, because it
+	 * would lead to allocation of more concurrency ids than there
+	 * are possible cpus in the system. The last_mm_cid is used as
+	 * a hint to conditionally unset the dst cpu cid, keeping
+	 * allocated concurrency ids compact.
+	 */
+	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
+		return;
+
+	/*
+	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+	 * rq->curr->mm matches the scheduler barrier in context_switch()
+	 * between store to rq->curr and load of prev and next task's
+	 * per-mm/cpu cid.
+	 *
+	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+	 * rq->curr->mm_cid_active matches the barrier in
+	 * sched_mm_cid_exit_signals(), sched_mm_cid_before_execve(), and
+	 * sched_mm_cid_after_execve() between store to t->mm_cid_active and
+	 * load of per-mm/cpu cid.
+	 */
+
+	/*
+	 * If we observe an active task using the mm on this rq after setting the lazy-put
+	 * flag, this task will be responsible for transitioning from lazy-put
+	 * flag set to MM_CID_UNSET.
+	 */
+	rcu_read_lock();
+	src_task = rcu_dereference(src_rq->curr);
+	if (READ_ONCE(src_task->mm_cid_active) && src_task->mm == mm) {
+		rcu_read_unlock();
+		/*
+		 * We observed an active task for this mm, clearing the destination
+		 * cpu mm_cid is not relevant for compactness.
+		 */
+		t->last_mm_cid = -1;
+		return;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * The src_cid is unused, so it can be unset.
+	 */
+	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
+	    mm_cid_set_lazy_put(src_cid))
+		return;
+	__mm_cid_put(mm, src_cid);
+}
+
+/*
+ * Migration to dst cpu. Called with dst_rq lock held.
+ */
+void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	int src_cid, dst_cid, *dst_pcpu_cid;
+	struct task_struct *dest_task;
+
+	lockdep_assert_rq_held(dst_rq);
+
+	src_cid = t->last_mm_cid;
+	if (!mm || src_cid == -1)
+		return;
+
+	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
+
+	/*
+	 * If destination cpu cid is greater than the source cpu cid, unset it
+	 * so it can be reallocated.
+	 */
+	dst_cid = READ_ONCE(*dst_pcpu_cid);
+	if (!mm_cid_is_valid(dst_cid) || dst_cid <= src_cid)
+		return;
+	/*
+	 * Put dst_cid if it is not currently in use, else it will be lazy put
+	 * on the next context switch.
+	 */
+	dest_task = rcu_dereference_protected(dst_rq->curr, lockdep_is_held(__rq_lockp(dst_rq)));
+	if (dest_task->mm_cid_active && dest_task->mm == mm) {
+		WARN_ON_ONCE(dest_task->mm_cid != dst_cid);
+		/*
+		 * Attempt to set lazy-put flag. Can fail due to concurrent
+		 * migrate-from, which sets lazy-put flag or MM_CID_UNSET.
+		 */
+		(void) cmpxchg(dst_pcpu_cid, dst_cid, mm_cid_set_lazy_put(dst_cid));
+		return;
+	}
+	/*
+	 * Attempt to set MM_CID_UNSET. Can fail due to concurrent
+	 * migrate-from, which sets lazy-put flag or MM_CID_UNSET.
+	 */
+	if (cmpxchg(dst_pcpu_cid, dst_cid, MM_CID_UNSET) != dst_cid)
+		return;
+	__mm_cid_put(mm, dst_cid);
+}
+
 void sched_mm_cid_exit_signals(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
-	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
-	t->mm_cid_active = 0;
-	local_irq_restore(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	preempt_enable_no_resched();	/* holding spinlock */
+	WRITE_ONCE(t->mm_cid_active, 0);
+	/*
+	 * Store t->mm_cid_active before loading per-mm/cpu cid.
+	 * Matches barrier in sched_mm_cid_migrate_from().
+	 */
+	smp_mb();
+	mm_cid_put(t);
+	t->last_mm_cid = t->mm_cid = -1;
+	rq_unlock_irqrestore(rq, &rf);
 }
 
 void sched_mm_cid_before_execve(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
-	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
-	t->mm_cid_active = 0;
-	local_irq_restore(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	preempt_enable_no_resched();	/* holding spinlock */
+	WRITE_ONCE(t->mm_cid_active, 0);
+	/*
+	 * Store t->mm_cid_active before loading per-mm/cpu cid.
+	 * Matches barrier in sched_mm_cid_migrate_from().
+	 */
+	smp_mb();
+	mm_cid_put(t);
+	t->last_mm_cid = t->mm_cid = -1;
+	rq_unlock_irqrestore(rq, &rf);
 }
 
 void sched_mm_cid_after_execve(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
-	unsigned long flags;
+	struct rq_flags rf;
+	struct rq *rq;
 
 	if (!mm)
 		return;
-	local_irq_save(flags);
-	t->mm_cid = mm_cid_get(mm);
-	t->mm_cid_active = 1;
-	local_irq_restore(flags);
+
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	preempt_enable_no_resched();	/* holding spinlock */
+	WRITE_ONCE(t->mm_cid_active, 1);
+	/*
+	 * Store t->mm_cid_active before loading per-mm/cpu cid.
+	 * Matches barrier in sched_mm_cid_migrate_from().
+	 */
+	smp_mb();
+	t->last_mm_cid = t->mm_cid = mm_cid_get(mm);
+	rq_unlock_irqrestore(rq, &rf);
 	rseq_set_notify_resume(t);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..f3e7dc2cd1cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3249,7 +3249,66 @@  static inline void update_current_exec_runtime(struct task_struct *curr,
 }
 
 #ifdef CONFIG_SCHED_MM_CID
-static inline int __mm_cid_get(struct mm_struct *mm)
+extern void sched_mm_cid_migrate_from(struct task_struct *t);
+extern void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t);
+
+static inline void __mm_cid_put(struct mm_struct *mm, int cid)
+{
+	if (cid < 0)
+		return;
+	cpumask_clear_cpu(cid, mm_cidmask(mm));
+}
+
+/*
+ * The per-mm/cpu cid can have the MM_CID_LAZY_PUT flag set or transition to the
+ * MM_CID_UNSET state without holding the rq lock, but the rq lock needs to be
+ * held to transition to other states.
+ */
+static inline void mm_cid_put_lazy(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	int *pcpu_cid, cid;
+
+	lockdep_assert_rq_held(this_rq());
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = READ_ONCE(*pcpu_cid);
+	if (!mm_cid_is_lazy_put(cid))
+		return;
+	if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) != cid)
+		return;
+	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+static inline void mm_cid_put(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	int *pcpu_cid, cid, res;
+
+	lockdep_assert_rq_held(this_rq());
+	WARN_ON_ONCE(t->mm_cid < 0);
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = READ_ONCE(*pcpu_cid);
+	for (;;) {
+		if (mm_cid_is_unset(cid))
+			return;
+		WARN_ON_ONCE(mm_cid_clear_lazy_put(cid) != t->mm_cid);
+		/*
+		 * Attempt transition from valid or lazy-put to unset.
+		 */
+		res = cmpxchg(pcpu_cid, cid, MM_CID_UNSET);
+		if (res == cid)
+			break;
+		cid = res;
+	}
+	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+/*
+ * Lookup/set of mm_cidmask are serialized by cid_lock. Clear of mm_cidmask
+ * uses atomic operations, therefore requiring the set to be done with atomic
+ * operations as well.
+ */
+static inline int __mm_cid_get_locked(struct mm_struct *mm)
 {
 	struct cpumask *cpumask;
 	int cid;
@@ -3258,52 +3317,86 @@  static inline int __mm_cid_get(struct mm_struct *mm)
 	cid = cpumask_first_zero(cpumask);
 	if (cid >= nr_cpu_ids)
 		return -1;
-	__cpumask_set_cpu(cid, cpumask);
+	cpumask_set_cpu(cid, cpumask);
 	return cid;
 }
 
-static inline void mm_cid_put(struct mm_struct *mm, int cid)
+static inline int __mm_cid_get(struct mm_struct *mm)
 {
-	lockdep_assert_irqs_disabled();
-	if (cid < 0)
-		return;
+	int ret;
+
+	lockdep_assert_rq_held(this_rq());
 	raw_spin_lock(&mm->cid_lock);
-	__cpumask_clear_cpu(cid, mm_cidmask(mm));
+	ret = __mm_cid_get_locked(mm);
 	raw_spin_unlock(&mm->cid_lock);
+	return ret;
 }
 
 static inline int mm_cid_get(struct mm_struct *mm)
 {
-	int ret;
+	int *pcpu_cid, cid;
 
-	lockdep_assert_irqs_disabled();
+	lockdep_assert_rq_held(this_rq());
+	pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
+	cid = READ_ONCE(*pcpu_cid);
+	if (mm_cid_is_valid(cid))
+		return cid;
+	if (mm_cid_is_lazy_put(cid)) {
+		if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) == cid)
+			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+	}
 	raw_spin_lock(&mm->cid_lock);
-	ret = __mm_cid_get(mm);
+	cid = __mm_cid_get_locked(mm);
 	raw_spin_unlock(&mm->cid_lock);
-	return ret;
+	WRITE_ONCE(*pcpu_cid, cid);
+	return cid;
 }
 
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
 {
+	/*
+	 * Provide a memory barrier between rq->curr store and load of
+	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
+	 *
+	 * Should be adapted if context_switch() is modified.
+	 */
+	if (!next->mm) {                                // to kernel
+		/*
+		 * user -> kernel transition does not guarantee a barrier, but
+		 * we can use the fact that it performs an atomic operation in
+		 * mmgrab().
+		 */
+		if (prev->mm)                           // from user
+			smp_mb__after_mmgrab();
+		/*
+		 * kernel -> kernel transition does not change rq->curr->mm
+		 * state. It stays NULL.
+		 */
+	} else {                                        // to user
+		/*
+		 * kernel -> user transition does not provide a barrier
+		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
+		 * Provide it here.
+		 */
+		if (!prev->mm)                          // from kernel
+			smp_mb();
+		/*
+		 * user -> user transition guarantees a memory barrier through
+		 * switch_mm().
+		 */
+	}
 	if (prev->mm_cid_active) {
-		if (next->mm_cid_active && next->mm == prev->mm) {
-			/*
-			 * Context switch between threads in same mm, hand over
-			 * the mm_cid from prev to next.
-			 */
-			next->mm_cid = prev->mm_cid;
-			prev->mm_cid = -1;
-			return;
-		}
-		mm_cid_put(prev->mm, prev->mm_cid);
+		mm_cid_put_lazy(prev);
 		prev->mm_cid = -1;
 	}
 	if (next->mm_cid_active)
-		next->mm_cid = mm_cid_get(next->mm);
+		next->last_mm_cid = next->mm_cid = mm_cid_get(next->mm);
 }
 
 #else
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
+static inline void sched_mm_cid_migrate_from(struct task_struct *t) { }
+static inline void sched_mm_cid_migrate_to(struct rq *src_rq, struct task_struct *t) { }
 #endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */