[V1,1/1] sched/numa: Fix mm numa_scan_seq based unconditional scan

Message ID 2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
State New
Series [V1,1/1] sched/numa: Fix mm numa_scan_seq based unconditional scan

Commit Message

Raghavendra K T Oct. 20, 2023, 3:57 p.m. UTC
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"),
NUMA Balancing allows updating PTEs to trap NUMA hinting faults only if
the task had previously accessed the VMA. However, unconditional scans of
VMAs are allowed during the initial phase after VMA creation, until the
process's mm numa_scan_seq reaches 2, even if the current task has not
accessed the VMA.

Rationale:
 - Without an initial scan, subsequent PTE updates may never happen.
 - Give every VMA a fair opportunity to be scanned, so that the access
   pattern of all the VMAs can subsequently be understood.

But this has a corner case: if a VMA is created after some time, the
process's mm numa_scan_seq could already be 2 or greater.

For example, the values of mm numa_scan_seq at VMA creation time, observed
while briefly running the mmtests autonuma benchmark, look like:
start_seq=0 : 459
start_seq=2 : 138
start_seq=3 : 144
start_seq=4 : 8
start_seq=8 : 1
start_seq=9 : 1
As a result, VMAs created after some time never get unconditional PTE
updates.

Fix:
- Note down the initial value of mm numa_scan_seq in the per-VMA
  start_scan_seq.
- Allow unconditional scan until mm numa_scan_seq reaches
  start_scan_seq + 2.

Result:
SUT: AMD EPYC Milan with 2 NUMA nodes, 256 CPUs.
base kernel: upstream 6.6-rc6 with Mel's patches [1] applied.

kernbench
=========
                       base                 patched    %gain
Amean    elsp-128      165.09 ( 0.00%)      164.78 *   0.19%*

Duration User       41404.28    41375.08
Duration System      9862.22     9768.48
Duration Elapsed      519.87      518.72

Ops NUMA PTE updates           1041416.00      831536.00
Ops NUMA hint faults            263296.00      220966.00
Ops NUMA pages migrated         258021.00      212769.00
Ops AutoNUMA cost                 1328.67        1114.69

autonumabench

NUMA01_THREADLOCAL
==================
Amean  elsp-NUMA01_THREADLOCAL   81.79 (0.00%)  67.74 *  17.18%*

Duration User       54832.73    47379.67
Duration System        75.00      185.75
Duration Elapsed      576.72      476.09

Ops NUMA PTE updates                  394429.00    11121044.00
Ops NUMA hint faults                    1001.00     8906404.00
Ops NUMA pages migrated                  288.00     2998694.00
Ops AutoNUMA cost                          7.77       44666.84

Link: https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/ [1]

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h | 3 +++
 kernel/sched/fair.c      | 4 +++-
 2 files changed, 6 insertions(+), 1 deletion(-)
  

Comments

Mel Gorman Nov. 1, 2023, 9:21 a.m. UTC | #1
On Fri, Oct 20, 2023 at 09:27:46PM +0530, Raghavendra K T wrote:
> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> 
> NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
> task had previously accessed VMA. However unconditional scan of VMAs are
> allowed during initial phase of VMA creation until process's
> mm numa_scan_seq reaches 2 even though current task had not accessed VMA.
> 
> Rationale:
>  - Without initial scan subsequent PTE update may never happen.
>  - Give fair opportunity to all the VMAs to be scanned and subsequently
> understand the access pattern of all the VMAs.
> 
> But it has a corner case where, if a VMA is created after some time,
> process's mm numa_scan_seq could be already greater than 2.
> 
> For e.g., values of mm numa_scan_seq when VMAs are created by running
> mmtest autonuma benchmark briefly looks like:
> start_seq=0 : 459
> start_seq=2 : 138
> start_seq=3 : 144
> start_seq=4 : 8
> start_seq=8 : 1
> start_seq=9 : 1
> This results in no unconditional PTE updates for those VMAs created after
> some time.
> 
> Fix:
> - Note down the initial value of mm numa_scan_seq in per VMA start_seq.
> - Allow unconditional scan till start_seq + 2.
> 
> Result:
> SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
> base kernel: upstream 6.6-rc6 with Mels patches [1] applied.
> 
> kernbench
> ==========		base                  patched %gain
> Amean    elsp-128      165.09 ( 0.00%)      164.78 *   0.19%*
> 
> Duration User       41404.28    41375.08
> Duration System      9862.22     9768.48
> Duration Elapsed      519.87      518.72
> 
> Ops NUMA PTE updates           1041416.00      831536.00
> Ops NUMA hint faults            263296.00      220966.00
> Ops NUMA pages migrated         258021.00      212769.00
> Ops AutoNUMA cost                 1328.67        1114.69
> 
> autonumabench
> 
> NUMA01_THREADLOCAL
> ==================
> Amean  elsp-NUMA01_THREADLOCAL   81.79 (0.00%)  67.74 *  17.18%*
> 
> Duration User       54832.73    47379.67
> Duration System        75.00      185.75
> Duration Elapsed      576.72      476.09
> 
> Ops NUMA PTE updates                  394429.00    11121044.00
> Ops NUMA hint faults                    1001.00     8906404.00
> Ops NUMA pages migrated                  288.00     2998694.00
> Ops AutoNUMA cost                          7.77       44666.84
> 
> Link: https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>

Acked-by: Mel Gorman <mgorman@suse.de>
  
Peter Zijlstra Nov. 1, 2023, 10:31 a.m. UTC | #2
On Wed, Nov 01, 2023 at 09:21:01AM +0000, Mel Gorman wrote:
> On Fri, Oct 20, 2023 at 09:27:46PM +0530, Raghavendra K T wrote:
> > Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
> > 
> > NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
> > task had previously accessed VMA. However unconditional scan of VMAs are
> > allowed during initial phase of VMA creation until process's
> > mm numa_scan_seq reaches 2 even though current task had not accessed VMA.
> > 
> > Rationale:
> >  - Without initial scan subsequent PTE update may never happen.
> >  - Give fair opportunity to all the VMAs to be scanned and subsequently
> > understand the access pattern of all the VMAs.
> > 
> > But it has a corner case where, if a VMA is created after some time,
> > process's mm numa_scan_seq could be already greater than 2.
> > 
> > For e.g., values of mm numa_scan_seq when VMAs are created by running
> > mmtest autonuma benchmark briefly looks like:
> > start_seq=0 : 459
> > start_seq=2 : 138
> > start_seq=3 : 144
> > start_seq=4 : 8
> > start_seq=8 : 1
> > start_seq=9 : 1
> > This results in no unconditional PTE updates for those VMAs created after
> > some time.
> > 
> > Fix:
> > - Note down the initial value of mm numa_scan_seq in per VMA start_seq.
> > - Allow unconditional scan till start_seq + 2.
> > 
> > Result:
> > SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
> > base kernel: upstream 6.6-rc6 with Mels patches [1] applied.
> > 
> > kernbench
> > ==========		base                  patched %gain
> > Amean    elsp-128      165.09 ( 0.00%)      164.78 *   0.19%*
> > 
> > Duration User       41404.28    41375.08
> > Duration System      9862.22     9768.48
> > Duration Elapsed      519.87      518.72
> > 
> > Ops NUMA PTE updates           1041416.00      831536.00
> > Ops NUMA hint faults            263296.00      220966.00
> > Ops NUMA pages migrated         258021.00      212769.00
> > Ops AutoNUMA cost                 1328.67        1114.69
> > 
> > autonumabench
> > 
> > NUMA01_THREADLOCAL
> > ==================
> > Amean  elsp-NUMA01_THREADLOCAL   81.79 (0.00%)  67.74 *  17.18%*
> > 
> > Duration User       54832.73    47379.67
> > Duration System        75.00      185.75
> > Duration Elapsed      576.72      476.09
> > 
> > Ops NUMA PTE updates                  394429.00    11121044.00
> > Ops NUMA hint faults                    1001.00     8906404.00
> > Ops NUMA pages migrated                  288.00     2998694.00
> > Ops AutoNUMA cost                          7.77       44666.84
> > 
> > Link: https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/
> > 
> > Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks, will queue for the next merge window (6.8 I think that is) once
6.7-rc1 comes around.
  
Raghavendra K T Nov. 2, 2023, 5:17 a.m. UTC | #3
On 11/1/2023 4:01 PM, Peter Zijlstra wrote:
> On Wed, Nov 01, 2023 at 09:21:01AM +0000, Mel Gorman wrote:
>> On Fri, Oct 20, 2023 at 09:27:46PM +0530, Raghavendra K T wrote:
>>> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
>>>
>>> NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
>>> task had previously accessed VMA. However unconditional scan of VMAs are
>>> allowed during initial phase of VMA creation until process's
>>> mm numa_scan_seq reaches 2 even though current task had not accessed VMA.
>>>
>>> Rationale:
>>>   - Without initial scan subsequent PTE update may never happen.
>>>   - Give fair opportunity to all the VMAs to be scanned and subsequently
>>> understand the access pattern of all the VMAs.
>>>
>>> But it has a corner case where, if a VMA is created after some time,
>>> process's mm numa_scan_seq could be already greater than 2.
>>>
>>> For e.g., values of mm numa_scan_seq when VMAs are created by running
>>> mmtest autonuma benchmark briefly looks like:
>>> start_seq=0 : 459
>>> start_seq=2 : 138
>>> start_seq=3 : 144
>>> start_seq=4 : 8
>>> start_seq=8 : 1
>>> start_seq=9 : 1
>>> This results in no unconditional PTE updates for those VMAs created after
>>> some time.
>>>
>>> Fix:
>>> - Note down the initial value of mm numa_scan_seq in per VMA start_seq.
>>> - Allow unconditional scan till start_seq + 2.
>>>
>>> Result:
>>> SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
>>> base kernel: upstream 6.6-rc6 with Mels patches [1] applied.
>>>
>>> kernbench
>>> ==========		base                  patched %gain
>>> Amean    elsp-128      165.09 ( 0.00%)      164.78 *   0.19%*
>>>
>>> Duration User       41404.28    41375.08
>>> Duration System      9862.22     9768.48
>>> Duration Elapsed      519.87      518.72
>>>
>>> Ops NUMA PTE updates           1041416.00      831536.00
>>> Ops NUMA hint faults            263296.00      220966.00
>>> Ops NUMA pages migrated         258021.00      212769.00
>>> Ops AutoNUMA cost                 1328.67        1114.69
>>>
>>> autonumabench
>>>
>>> NUMA01_THREADLOCAL
>>> ==================
>>> Amean  elsp-NUMA01_THREADLOCAL   81.79 (0.00%)  67.74 *  17.18%*
>>>
>>> Duration User       54832.73    47379.67
>>> Duration System        75.00      185.75
>>> Duration Elapsed      576.72      476.09
>>>
>>> Ops NUMA PTE updates                  394429.00    11121044.00
>>> Ops NUMA hint faults                    1001.00     8906404.00
>>> Ops NUMA pages migrated                  288.00     2998694.00
>>> Ops AutoNUMA cost                          7.77       44666.84
>>>
>>> Link: https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/
>>>
>>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>>
>> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> Thanks, will queue for the next merge window (6.8 I think that is) once
> 6.7-rc1 comes around.

Thank you Mel, PeterZ.

Meanwhile, I will check whether extending #history (suggested by PeterZ)
helps on this changed baseline, and look into the implications of
extending #bits for PIDs (suggested by Ingo), especially on larger
machines. I will come back if I find anything interesting.

Thanks and Regards
- Raghu
  

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 589f31ef2e84..679f076e3a91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -576,6 +576,9 @@  struct vma_numab_state {
 	 */
 	unsigned long pids_active[2];
 
+	/* MM scan sequence ID when scan first started after VMA creation */
+	int start_scan_seq;
+
 	/*
 	 * MM scan sequence ID when the VMA was last completely scanned.
 	 * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e03ced2b566..c8af3a7ccba7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3191,7 +3191,7 @@  static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
+	if ((READ_ONCE(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2)
 		return true;
 
 	pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
@@ -3334,6 +3334,8 @@  static void task_numa_work(struct callback_head *work)
 			if (!vma->numab_state)
 				continue;
 
+			vma->numab_state->start_scan_seq = mm->numa_scan_seq;
+
 			vma->numab_state->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);