[V1,0/1] sched/numa: Fix mm numa_scan_seq based unconditional scan

Message ID cover.1697816692.git.raghavendra.kt@amd.com
State New

Commit Message

Raghavendra K T Oct. 20, 2023, 3:57 p.m. UTC
The NUMA balancing code that updates PTEs allows an unconditional scan
based on the value of the process's mm numa_scan_seq. This check is not perfect.

More description is in patch1.

I used the patch below to identify the corner case.

Detailed results (only part of them is included in patch1, to save
space in the commit log):

SUT: AMD EPYC Milan with 2 NUMA nodes, 256 CPUs.

Base kernel: upstream 6.6-rc6 (dd72f9c7e512) with Mel's patch series
from tip/sched/core [1] applied.

Summary: Some benchmarks improve. System time increases because of the
additional scanning, but elapsed time still shows a gain.

However, some overhead is also seen for benchmarks like NUMA01.

kernbench
=========
                          base                patched
Amean     user-128    13799.58 (   0.00%)    13789.86 *   0.07%*
Amean     syst-128     3280.80 (   0.00%)     3249.67 *   0.95%*
Amean     elsp-128      165.09 (   0.00%)      164.78 *   0.19%*

Duration User       41404.28    41375.08
Duration System      9862.22     9768.48
Duration Elapsed      519.87      518.72

Ops NUMA PTE updates                 1041416.00      831536.00
Ops NUMA hint faults                  263296.00      220966.00
Ops NUMA pages migrated               258021.00      212769.00
Ops AutoNUMA cost                       1328.67        1114.69

autonumabench

NUMA01_THREADLOCAL
==================
Amean     syst-NUMA01_THREADLOCAL       10.65 (   0.00%)       26.47 *-148.59%*
Amean     elsp-NUMA01_THREADLOCAL       81.79 (   0.00%)       67.74 *  17.18%*

Duration User       54832.73    47379.67
Duration System        75.00      185.75
Duration Elapsed      576.72      476.09

Ops NUMA PTE updates                  394429.00    11121044.00
Ops NUMA hint faults                    1001.00     8906404.00
Ops NUMA pages migrated                  288.00     2998694.00
Ops AutoNUMA cost                          7.77       44666.84

NUMA01
======
Amean     syst-NUMA01       31.97 (   0.00%)       52.95 * -65.62%*
Amean     elsp-NUMA01      143.16 (   0.00%)      150.81 *  -5.34%*

Duration User       84839.49    91342.19
Duration System       224.26      371.12
Duration Elapsed     1005.64     1059.01

Ops NUMA PTE updates                33929508.00    50116313.00
Ops NUMA hint faults                34993820.00    52895783.00
Ops NUMA pages migrated              5456115.00     7441228.00
Ops AutoNUMA cost                     175310.27      264971.11

NUMA02
======
Amean     syst-NUMA02        0.86 (   0.00%)        0.86 *  -0.50%*
Amean     elsp-NUMA02        3.99 (   0.00%)        3.82 *   4.40%*

Duration User        1186.06     1092.07
Duration System         6.44        6.47
Duration Elapsed       31.28       30.30

Ops NUMA PTE updates                     776.00         731.00
Ops NUMA hint faults                     527.00         490.00
Ops NUMA pages migrated                  183.00         153.00
Ops AutoNUMA cost                          2.64           2.46
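
For readers of the tables above, the starred percentage column can be read as
the relative gain of the patched kernel over base for a lower-is-better time.
The helper below is a minimal sketch of that arithmetic. The name gain_pct is
hypothetical, and the formula is an assumption inferred from the numbers, not
taken from the reporting tool. The tool appears to work on unrounded samples,
so the last digits can differ from what the rounded means give.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Relative gain of "patched" over "base", in percent, for a metric where
 * lower is better (user, system or elapsed time).
 * Positive: the patched kernel took less time. Negative: it took more.
 * Assumption: this mirrors how the starred column is derived.
 */
static double gain_pct(double base, double patched)
{
	assert(base > 0.0);	/* a zero baseline has no meaningful ratio */
	return (base - patched) / base * 100.0;
}

/* Print one table-style row from two rounded means. */
static void print_gain(const char *name, double base, double patched)
{
	printf("%-26s %10.2f %10.2f %8.2f%%\n",
	       name, base, patched, gain_pct(base, patched));
}
```

Calling gain_pct(81.79, 67.74) for elsp-NUMA01_THREADLOCAL gives roughly 17.18,
which matches the table. For syst-NUMA01_THREADLOCAL, gain_pct(10.65, 26.47)
gives roughly -148.5, close to the reported -148.59.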

[1] https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/

Raghavendra K T (1):
  sched/numa: Fix mm numa_scan_seq based unconditional scan

 include/linux/mm_types.h | 3 +++
 kernel/sched/fair.c      | 4 +++-
 2 files changed, 6 insertions(+), 1 deletion(-)

---8<---
  

Comments

Raghavendra K T Oct. 23, 2023, 5:25 a.m. UTC | #1
On 10/20/2023 9:27 PM, Raghavendra K T wrote:
> NUMA balancing code that updates PTEs by allowing unconditional scan
> based on the value of processes' mm numa_scan_seq is not perfect.
> 
> More description is in patch1.
> 
> Have used the below patch to identify the corner case.
> 
> Detailed Result: (Only part of the result is updated
> in patch1 to save space in commit log)
> 
> Detailed Result:
> 
> SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
> 
> Base kernel: upstream 6.6-rc6 (dd72f9c7e512) with Mel's patch series
> from tip/sched/core [1] applied.
> 
> Summary: Some benchmarks improve. There is an increase in system
> time due to additional scanning. But elapsed time shows gain.
> 
> However there is also some overhead seen for benchmarks like NUMA01.
> 
> kernbench
> ==========		base                  patched
> Amean     user-128    13799.58 (   0.00%)    13789.86 *   0.07%*
> Amean     syst-128     3280.80 (   0.00%)     3249.67 *   0.95%*
> Amean     elsp-128      165.09 (   0.00%)      164.78 *   0.19%*
> 
> Duration User       41404.28    41375.08
> Duration System      9862.22     9768.48
> Duration Elapsed      519.87      518.72
> 
> Ops NUMA PTE updates                 1041416.00      831536.00
> Ops NUMA hint faults                  263296.00      220966.00
> Ops NUMA pages migrated               258021.00      212769.00
> Ops AutoNUMA cost                       1328.67        1114.69
> 
> autonumabench
> 
> NUMA01_THREADLOCAL
> ==================
> Amean     syst-NUMA01_THREADLOCAL       10.65 (   0.00%)       26.47 *-148.59%*
> Amean     elsp-NUMA01_THREADLOCAL       81.79 (   0.00%)       67.74 *  17.18%*
> 
> Duration User       54832.73    47379.67
> Duration System        75.00      185.75
> Duration Elapsed      576.72      476.09
> 
> Ops NUMA PTE updates                  394429.00    11121044.00
> Ops NUMA hint faults                    1001.00     8906404.00
> Ops NUMA pages migrated                  288.00     2998694.00
> Ops AutoNUMA cost                          7.77       44666.84
> 
> NUMA01
> =====
> Amean     syst-NUMA01       31.97 (   0.00%)       52.95 * -65.62%*
> Amean     elsp-NUMA01      143.16 (   0.00%)      150.81 *  -5.34%*
> 
> Duration User       84839.49    91342.19
> Duration System       224.26      371.12
> Duration Elapsed     1005.64     1059.01
> 
> Ops NUMA PTE updates                33929508.00    50116313.00
> Ops NUMA hint faults                34993820.00    52895783.00
> Ops NUMA pages migrated              5456115.00     7441228.00
> Ops AutoNUMA cost                     175310.27      264971.11
> 
> NUMA02
> =========
> Amean     syst-NUMA02        0.86 (   0.00%)        0.86 *  -0.50%*
> Amean     elsp-NUMA02        3.99 (   0.00%)        3.82 *   4.40%*
> 
> Duration User        1186.06     1092.07
> Duration System         6.44        6.47
> Duration Elapsed       31.28       30.30
> 
> Ops NUMA PTE updates                     776.00         731.00
> Ops NUMA hint faults                     527.00         490.00
> Ops NUMA pages migrated                  183.00         153.00
> Ops AutoNUMA cost                          2.64           2.46
> 
> Link: https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/
> 

I forgot to add the skip_vma_count trace results:

autonumabench: NUMA01_THREADLOCAL, 3 iterations

base:
inaccessible:13133
pid_inactive:15807
scan_delay:471
seq_completed:50
shared_ro:6983
unsuitable:3917

patched:
inaccessible:4727
pid_inactive:5119
scan_delay:455
seq_completed:7
shared_ro:2551
unsuitable:5402



> Raghavendra K T (1):
>    sched/numa: Fix mm numa_scan_seq based unconditional scan
> 
>   include/linux/mm_types.h | 3 +++
>   kernel/sched/fair.c      | 4 +++-
>   2 files changed, 6 insertions(+), 1 deletion(-)
> 
> ---8<---
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 010ba1b7cb0e..a4870b01c8a1 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -10,6 +10,30 @@
>   #include <linux/tracepoint.h>
>   #include <linux/binfmts.h>
>   
> +TRACE_EVENT(sched_vma_start_seq,
> +
> +	TP_PROTO(struct task_struct *t, struct vm_area_struct *vma, int start_seq),
> +
> +	TP_ARGS(t, vma, start_seq),
> +
> +	TP_STRUCT__entry(
> +		__array(	char,	comm,	TASK_COMM_LEN	)
> +		__field(	pid_t,	pid			)
> +		__field(	void *,	vma			)
> +		__field(	int, start_seq		)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->pid	= t->pid;
> +		__entry->vma	= vma;
> +		__entry->start_seq	= start_seq;
> +	),
> +
> +	TP_printk("comm=%s pid=%d vma = %px start_seq=%d", __entry->comm, __entry->pid, __entry->vma,
> +			 __entry->start_seq)
> +);
> +
>   /*
>    * Tracepoint for calling kthread_stop, performed to end a kthread:
>    */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8af3a7ccba7..e0c16ea8470b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3335,6 +3335,7 @@ static void task_numa_work(struct callback_head *work)
>   				continue;
>   
>   			vma->numab_state->start_scan_seq = mm->numa_scan_seq;
> +			trace_sched_vma_start_seq(p, vma, mm->numa_scan_seq);
>   
>   			vma->numab_state->next_scan = now +
>   				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> 
>
  
Raghavendra K T Oct. 27, 2023, 5:24 a.m. UTC | #2
On 10/20/2023 9:27 PM, Raghavendra K T wrote:
> NUMA balancing code that updates PTEs by allowing unconditional scan
> based on the value of processes' mm numa_scan_seq is not perfect.
> 
> More description is in patch1.
> 
> Have used the below patch to identify the corner case.
> 
> Detailed Result: (Only part of the result is updated
> in patch1 to save space in commit log)
> 

Gentle ping to check if there are any concerns / comments
on the patch :)

Thanks and Regards
- Raghu
  

Patch

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 010ba1b7cb0e..a4870b01c8a1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,30 @@ 
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
 
+TRACE_EVENT(sched_vma_start_seq,
+
+	TP_PROTO(struct task_struct *t, struct vm_area_struct *vma, int start_seq),
+
+	TP_ARGS(t, vma, start_seq),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	void *,	vma			)
+		__field(	int, start_seq		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid	= t->pid;
+		__entry->vma	= vma;
+		__entry->start_seq	= start_seq;
+	),
+
+	TP_printk("comm=%s pid=%d vma = %px start_seq=%d", __entry->comm, __entry->pid, __entry->vma,
+			 __entry->start_seq)
+);
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8af3a7ccba7..e0c16ea8470b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3335,6 +3335,7 @@  static void task_numa_work(struct callback_head *work)
 				continue;
 
 			vma->numab_state->start_scan_seq = mm->numa_scan_seq;
+			trace_sched_vma_start_seq(p, vma, mm->numa_scan_seq);
 
 			vma->numab_state->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
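
The tracepoint above prints its payload as
"comm=%s pid=%d vma = %px start_seq=%d". A user-space helper along the
following lines could pull those fields out of a captured trace line for
post-processing. This is only a sketch: the struct and function names are
hypothetical, the prefix the tracing core puts in front of the payload is
skipped rather than parsed, and a comm that contains spaces would not be
handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define COMM_LEN 16	/* same size as the kernel's TASK_COMM_LEN */

/* One decoded sched_vma_start_seq event (hypothetical user-space type). */
struct vma_seq_event {
	char comm[COMM_LEN];
	int pid;
	unsigned long long vma;	/* %px prints the raw pointer in hex */
	int start_seq;
};

/*
 * Parse the event payload out of one trace line.
 * Returns 0 on success, -1 if the line does not carry all four fields.
 */
static int parse_vma_start_seq(const char *line, struct vma_seq_event *ev)
{
	const char *p;

	assert(line && ev);

	/* Skip whatever prefix precedes the payload. */
	p = strstr(line, "comm=");
	if (!p)
		return -1;

	/* Whitespace in the format also matches the spaces around "=". */
	if (sscanf(p, "comm=%15s pid=%d vma = %llx start_seq=%d",
		   ev->comm, &ev->pid, &ev->vma, &ev->start_seq) != 4)
		return -1;

	return 0;
}
```

Grouping the decoded events by vma and looking at which start_seq values each
VMA was assigned is one way to spot VMAs that start life at a late scan
sequence.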