[RFC,V2,1/1] sched/numa: Fix disjoint set vma scan regression

Message ID b0a8f3490b491d4fd003c3e0493e940afaea5f2c.1684228065.git.raghavendra.kt@amd.com
State New
Series sched/numa: Fix disjoint set vma scan regression

Commit Message

Raghavendra K T May 16, 2023, 9:19 a.m. UTC
With the numa scan enhancements [1], only the threads which had previously
accessed a VMA are allowed to scan it.

While this significantly reduced system time overhead, there were corner
cases which genuinely need some relaxation. For example:

1) A concern raised by PeterZ: if there are N disjoint sets of VMAs
belonging to tasks, then unfairness in which of these threads are allowed
to scan could potentially amplify the side effect of some of the VMAs
being left unscanned.

2) The LKP numa01 benchmark regression reported below.

Currently this is handled by allowing the first two scans unconditionally,
as indicated by mm->numa_scan_seq. This is imprecise, since for some
benchmarks VMA scanning might itself start at numa_scan_seq > 2.

Solution:
Allow unconditional scanning of a task's VMAs depending on VMA size. This
is achieved by maintaining a per-VMA scan counter, where

f(allowed_to_scan) = f(scan_counter < vma_size / scan_size)
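
As a rough illustration, here is a minimal userspace sketch of the gating
math (the standalone helper and demo are hypothetical, not the kernel code;
scan_size mirrors sysctl_numa_balancing_scan_size in MB, default 256, and
the threshold mirrors vma_is_accessed() in the patch below):

#include <stdbool.h>
#include <stdio.h>

/* Allow unconditional scanning until roughly half of the scans needed
 * to cover the VMA have happened. */
static bool allowed_to_scan(unsigned long vma_start, unsigned long vma_end,
			    unsigned int scan_counter, unsigned int scan_size_mb)
{
	unsigned int vma_size_mb = (vma_end - vma_start) >> 20;
	unsigned int scan_threshold = vma_size_mb / scan_size_mb;

	scan_threshold = 1 + (scan_threshold >> 1);
	return scan_counter <= scan_threshold;
}

int main(void)
{
	/* A 1024MB VMA with a 256MB scan_size needs 4 scans to be fully
	 * covered; scans 0..3 are allowed (threshold = 1 + 4/2 = 3). */
	for (unsigned int c = 0; c <= 5; c++)
		printf("scan %u: %s\n", c,
		       allowed_to_scan(0, 1024UL << 20, c, 256) ? "allowed" : "gated");
	return 0;
}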

Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")

Result:
numa01_THREAD_ALLOC result on 6.4.0-rc1 (which has the numascan enhancement):
                base-numascan           base                    base+fix
real            1m3.025s                1m24.163s               1m3.551s
user            213m44.232s             251m3.638s              219m55.662s
sys             6m26.598s               0m13.056s               2m35.767s

numa_hit                5478165         4395752         4907431
numa_local              5478103         4395366         4907044
numa_other                   62             386             387
numa_pte_updates        1989274           11606         1265014
numa_hint_faults        1756059             515         1135804
numa_hint_faults_local   971500             486          558076
numa_pages_migrated      784211              29          577728

Summary: Regression in base is recovered by allowing scanning as required.

[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t

Reported-by: Aithal Srikanth <sraithal@amd.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 41 ++++++++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)
  

Comments

Bharata B Rao May 19, 2023, 7:56 a.m. UTC | #1
On 16-May-23 2:49 PM, Raghavendra K T wrote:
>  With the numa scan enhancements [1], only the threads which had previously
> accessed a VMA are allowed to scan it.
[...]
> @@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
>  						vma->numab_state->next_scan))
>  			continue;
>  
> +		/*
> +		 * For long running tasks, renew the disjoint vma scanning
> +		 * periodically.
> +		 */
> +		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))

Don't you need a READ_ONCE() accessor for mm->numa_scan_seq?

Regards,
Bharata.
  
Raghavendra K T May 19, 2023, 12:05 p.m. UTC | #2
On 5/19/2023 1:26 PM, Bharata B Rao wrote:
> On 16-May-23 2:49 PM, Raghavendra K T wrote:
>>   With the numa scan enhancements [1], only the threads which had previously
[...]
>> -#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
>> +#define VMA_PID_RESET_PERIOD		(4 * sysctl_numa_balancing_scan_delay)
>> +#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
>>   
>>   /*
>>    * The expensive part of numa migration is done from task_work context.
>> @@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
>>   			/* Reset happens after 4 times scan delay of scan start */
>>   			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
>>   				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
>> +
>> +			WRITE_ONCE(vma->numab_state->scan_counter, 0);
>>   		}
>>   
>>   		/*
>> @@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
>>   						vma->numab_state->next_scan))
>>   			continue;
>>   
>> +		/*
>> +		 * For long running tasks, renew the disjoint vma scanning
>> +		 * periodically.
>> +		 */
>> +		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))
> 
> Don't you need a READ_ONCE() accessor for mm->numa_scan_seq?
> 

Hello Bharata,

Yes, thanks for pointing out. In V1 I did ensure that, but in V2 it was
somehow left out :(.
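
For reference, a sketch of that renewal check with the missing accessor
added (not posted code; the local variable is hypothetical):

	/* Read numa_scan_seq once so both the non-zero test and the
	 * modulo test see a single consistent value. */
	unsigned int seq = READ_ONCE(mm->numa_scan_seq);

	if (seq && !(seq % DISJOINT_VMA_SCAN_RENEW_THRESH))
		WRITE_ONCE(vma->numab_state->scan_counter, 0);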

On the other hand, I see that vma->numab_state->scan_counter does not need
READ_ONCE/WRITE_ONCE since it is not modified outside of this function
(i.e. it is all done after the cmpxchg above).

Also, thinking more about it, the DISJOINT_VMA_SCAN_RENEW_THRESH reset
change itself may need some correction, and it doesn't seem to be
absolutely necessary here. (I will post that separately, with more detail,
for improving long-running benchmarks as per my experiments.)

I will wait a while for any confirmation of the reported regression fix
with this patch and/or any better idea/ack, and then repost.
  
kernel test robot May 26, 2023, 1:45 a.m. UTC | #3
Hello,

kernel test robot noticed a -46.3% improvement of autonuma-benchmark.numa01.seconds on:


commit: d281d36ed007eabb243ad2d489c52c43961f8ac3 ("[RFC PATCH V2 1/1] sched/numa: Fix disjoint set vma scan regression")
url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Fix-disjoint-set-vma-scan-regression/20230516-180954
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a6fcdd8d95f7486150b3faadfea119fc3dfc3b74
patch link: https://lore.kernel.org/all/b0a8f3490b491d4fd003c3e0493e940afaea5f2c.1684228065.git.raghavendra.kt@amd.com/
patch subject: [RFC PATCH V2 1/1] sched/numa: Fix disjoint set vma scan regression


we noticed this patch addressed the performance regression we reported
https://lore.kernel.org/all/202305101547.20f4c32a-oliver.sang@intel.com/

we also noticed there is still some discussion in the thread
https://lore.kernel.org/all/202305101547.20f4c32a-oliver.sang@intel.com/

since we didn't see a V3 patch, we are sending out this report for your
information about its performance impact.


testcase: autonuma-benchmark
test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz (Cascade Lake) with 128G memory
parameters:

	iterations: 4x
	test: numa02_SMT
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
  gcc-11/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-csl-2sp9/numa02_SMT/autonuma-benchmark

commit: 
  a6fcdd8d95 ("sched/debug: Correct printing for rq->nr_uninterruptible")
  d281d36ed0 ("sched/numa: Fix disjoint set vma scan regression")

a6fcdd8d95f74861 d281d36ed007eabb243ad2d489c 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      1899 ±  2%     -37.4%       1189 ± 18%  uptime.boot
      1809            +3.1%       1866        vmstat.system.cs
 1.685e+10 ±  3%     -52.2%  8.052e+09 ± 11%  cpuidle..time
  17400470 ±  3%     -52.3%    8308590 ± 11%  cpuidle..usage
     26350 ±  7%      -9.2%      23932        meminfo.Active
     26238 ±  7%      -9.2%      23828        meminfo.Active(anon)
     38666 ±  8%     +21.4%      46957 ±  7%  meminfo.Mapped
      1996           -16.5%       1666        meminfo.Mlocked
     23350 ± 56%     +64.7%      38446 ±  8%  numa-meminfo.node0.Mapped
      5108 ±  6%     +78.7%       9132 ± 43%  numa-meminfo.node0.Shmem
     25075 ±  8%      -9.3%      22750        numa-meminfo.node1.Active
     25057 ±  8%      -9.5%      22681        numa-meminfo.node1.Active(anon)
   2038104 ±  7%     -34.9%    1327290 ± 12%  numa-numastat.node0.local_node
   2394647 ±  5%     -30.9%    1655892 ± 15%  numa-numastat.node0.numa_hit
   1988880 ±  7%     -27.5%    1442918 ± 12%  numa-numastat.node1.local_node
   2255986 ±  6%     -23.0%    1737172 ± 17%  numa-numastat.node1.numa_hit
     10.54 ±  3%      -1.7        8.83 ±  9%  mpstat.cpu.all.idle%
      0.00 ± 74%      +0.1        0.05 ± 13%  mpstat.cpu.all.iowait%
      2.48            -0.9        1.57        mpstat.cpu.all.irq%
      0.08 ±  2%      -0.0        0.06 ±  3%  mpstat.cpu.all.soft%
      1.48            +0.7        2.19 ±  4%  mpstat.cpu.all.sys%
    427.10           -46.3%     229.25 ±  3%  autonuma-benchmark.numa01.seconds
      1819           -43.5%       1027 ±  3%  autonuma-benchmark.time.elapsed_time
      1819           -43.5%       1027 ±  3%  autonuma-benchmark.time.elapsed_time.max
    791068 ±  2%     -43.6%     446212 ±  4%  autonuma-benchmark.time.involuntary_context_switches
   2089497           -16.8%    1737489 ±  2%  autonuma-benchmark.time.minor_page_faults
      7603            +3.2%       7848        autonuma-benchmark.time.percent_of_cpu_this_job_got
    136519           -42.2%      78864 ±  3%  autonuma-benchmark.time.user_time
     22402           +44.8%      32429 ±  4%  autonuma-benchmark.time.voluntary_context_switches
      5919 ± 55%     +64.7%       9747 ±  9%  numa-vmstat.node0.nr_mapped
      1277 ±  6%     +77.6%       2268 ± 43%  numa-vmstat.node0.nr_shmem
   2394430 ±  5%     -30.9%    1655441 ± 15%  numa-vmstat.node0.numa_hit
   2037887 ±  7%     -34.9%    1326839 ± 12%  numa-vmstat.node0.numa_local
      6261 ±  8%      -9.2%       5683        numa-vmstat.node1.nr_active_anon
      6261 ±  8%      -9.2%       5683        numa-vmstat.node1.nr_zone_active_anon
   2255543 ±  6%     -23.0%    1736429 ± 17%  numa-vmstat.node1.numa_hit
   1988436 ±  7%     -27.5%    1442174 ± 12%  numa-vmstat.node1.numa_local
     35815 ±  5%     -23.8%      27284 ± 17%  turbostat.C1
      0.03 ± 17%      +0.0        0.07 ± 12%  turbostat.C1E%
  17197885 ±  3%     -52.7%    8127065 ± 11%  turbostat.C6
     10.48 ±  3%      -1.7        8.80 ± 10%  turbostat.C6%
     10.23 ±  3%     -17.3%       8.46 ± 10%  turbostat.CPU%c1
      0.24 ±  7%     +61.3%       0.38 ± 11%  turbostat.CPU%c6
 1.615e+08           -42.4%   93035289 ±  3%  turbostat.IRQ
     48830 ± 13%     -35.2%      31632 ± 11%  turbostat.POLL
      0.19 ±  7%     +61.1%       0.30 ± 11%  turbostat.Pkg%pc2
    238.01            +5.2%     250.27        turbostat.PkgWatt
     22.38           +25.3%      28.03        turbostat.RAMWatt
      6557 ±  7%      -9.1%       5963        proc-vmstat.nr_active_anon
   1539398            -4.8%    1465253        proc-vmstat.nr_anon_pages
      2955            -5.8%       2785        proc-vmstat.nr_anon_transparent_hugepages
   1541555            -4.7%    1468824        proc-vmstat.nr_inactive_anon
      9843 ±  8%     +21.4%      11949 ±  7%  proc-vmstat.nr_mapped
    499.00           -16.5%     416.67        proc-vmstat.nr_mlock
      3896            -3.2%       3770        proc-vmstat.nr_page_table_pages
      6557 ±  7%      -9.1%       5963        proc-vmstat.nr_zone_active_anon
   1541555            -4.7%    1468824        proc-vmstat.nr_zone_inactive_anon
     30446 ± 15%    +397.7%     151532 ±  4%  proc-vmstat.numa_hint_faults
     21562 ± 12%    +312.9%      89028 ±  3%  proc-vmstat.numa_hint_faults_local
   4651965           -27.0%    3395711 ±  2%  proc-vmstat.numa_hit
      5122 ±  7%   +1393.9%      76529 ±  5%  proc-vmstat.numa_huge_pte_updates
   4028316           -31.2%    2772852 ±  2%  proc-vmstat.numa_local
   1049660          +672.3%    8106150 ±  6%  proc-vmstat.numa_pages_migrated
   2725369 ±  7%   +1343.9%   39352403 ±  5%  proc-vmstat.numa_pte_updates
     45132 ± 31%     +31.9%      59519        proc-vmstat.pgactivate
 1.816e+08 ±  2%      +5.6%  1.918e+08 ±  3%  proc-vmstat.pgalloc_normal
   5863913           -30.2%    4092045 ±  2%  proc-vmstat.pgfault
 1.815e+08 ±  2%      +5.7%  1.918e+08 ±  3%  proc-vmstat.pgfree
   1049660          +672.3%    8106150 ±  6%  proc-vmstat.pgmigrate_success
    264923           -35.1%     171993 ±  2%  proc-vmstat.pgreuse
      2037          +675.1%      15790 ±  6%  proc-vmstat.thp_migration_success
  13598464           -42.9%    7770880 ±  3%  proc-vmstat.unevictable_pgs_scanned
      3208 ± 14%     +44.1%       4624 ± 12%  sched_debug.cfs_rq:/.load.min
      2.73 ± 16%     +52.6%       4.16 ± 14%  sched_debug.cfs_rq:/.load_avg.min
  94294753 ±  2%     -45.7%   51173318 ±  3%  sched_debug.cfs_rq:/.min_vruntime.avg
  98586361 ±  2%     -46.3%   52983552 ±  3%  sched_debug.cfs_rq:/.min_vruntime.max
  85615972 ±  2%     -45.4%   46737672 ±  3%  sched_debug.cfs_rq:/.min_vruntime.min
   2806211 ±  7%     -53.1%    1314959 ±  6%  sched_debug.cfs_rq:/.min_vruntime.stddev
      2.63 ± 23%     +65.8%       4.36 ± 24%  sched_debug.cfs_rq:/.removed.load_avg.avg
      1.08 ± 23%     +55.4%       1.68 ± 19%  sched_debug.cfs_rq:/.removed.runnable_avg.avg
      1.07 ± 24%     +56.7%       1.68 ± 19%  sched_debug.cfs_rq:/.removed.util_avg.avg
   7252565 ± 13%     -46.6%    3874343 ± 15%  sched_debug.cfs_rq:/.spread0.avg
  11534195 ± 10%     -50.8%    5673455 ± 12%  sched_debug.cfs_rq:/.spread0.max
  -1406320           -61.2%    -546147        sched_debug.cfs_rq:/.spread0.min
   2795186 ±  7%     -53.2%    1309202 ±  6%  sched_debug.cfs_rq:/.spread0.stddev
      6.57 ± 40%   +6868.7%     457.74 ±  6%  sched_debug.cfs_rq:/.util_est_enqueued.avg
    275.73 ± 43%    +333.0%       1193 ±  5%  sched_debug.cfs_rq:/.util_est_enqueued.max
     37.70 ± 41%    +722.1%     309.90 ±  3%  sched_debug.cfs_rq:/.util_est_enqueued.stddev
    794516 ±  5%     -26.2%     586654 ± 17%  sched_debug.cpu.avg_idle.min
    224.87 ±  4%     -42.6%     129.01 ± 10%  sched_debug.cpu.clock.stddev
    885581 ±  2%     -42.6%     508722 ±  3%  sched_debug.cpu.clock_task.min
     19466 ±  2%     -30.4%      13558 ±  4%  sched_debug.cpu.curr->pid.avg
     26029 ±  2%     -33.9%      17216 ±  2%  sched_debug.cpu.curr->pid.max
     13168 ± 11%     -33.3%       8788 ± 12%  sched_debug.cpu.curr->pid.min
      2761 ± 14%     -37.3%       1730 ± 29%  sched_debug.cpu.curr->pid.stddev
    958735           -24.7%     721786 ± 11%  sched_debug.cpu.max_idle_balance_cost.max
     97977 ±  3%     -52.2%      46860 ± 25%  sched_debug.cpu.max_idle_balance_cost.stddev
      0.00 ±  4%     -41.9%       0.00 ± 10%  sched_debug.cpu.next_balance.stddev
     20557 ±  3%     -37.3%      12889 ±  3%  sched_debug.cpu.nr_switches.avg
     81932 ±  7%     -25.4%      61097 ± 15%  sched_debug.cpu.nr_switches.max
      6643 ±  6%     -39.2%       4037 ± 15%  sched_debug.cpu.nr_switches.min
     13929 ±  6%     -30.1%       9740 ±  4%  sched_debug.cpu.nr_switches.stddev
     20.30 ± 23%     +73.3%      35.19 ± 28%  sched_debug.cpu.nr_uninterruptible.max
    -12.43          +145.8%     -30.55        sched_debug.cpu.nr_uninterruptible.min
      5.28 ± 12%     +82.4%       9.63 ± 13%  sched_debug.cpu.nr_uninterruptible.stddev
    925729 ±  2%     -42.4%     533314 ±  3%  sched_debug.sched_clk
     36.08           +51.2%      54.54 ±  2%  perf-stat.i.MPKI
 1.037e+08            +9.7%  1.137e+08        perf-stat.i.branch-instructions
      1.36            +0.0        1.39        perf-stat.i.branch-miss-rate%
   1602349           +19.8%    1918946 ±  3%  perf-stat.i.branch-misses
  11889864           +56.5%   18603954 ±  2%  perf-stat.i.cache-misses
  17973544           +54.5%   27773059 ±  2%  perf-stat.i.cache-references
      1771            +3.1%       1826        perf-stat.i.context-switches
 2.147e+11            +3.2%  2.215e+11        perf-stat.i.cpu-cycles
    112.60           +16.6%     131.26        perf-stat.i.cpu-migrations
     18460           -33.1%      12347        perf-stat.i.cycles-between-cache-misses
      0.03 ±  4%      +0.0        0.04 ±  8%  perf-stat.i.dTLB-load-miss-rate%
     52266 ±  4%     +28.9%      67377 ±  7%  perf-stat.i.dTLB-load-misses
 1.442e+08            +7.7%  1.553e+08        perf-stat.i.dTLB-loads
      0.25            +0.0        0.27        perf-stat.i.dTLB-store-miss-rate%
    189901           +13.6%     215670        perf-stat.i.dTLB-store-misses
  80186719            +6.8%   85653052        perf-stat.i.dTLB-stores
    400061 ±  4%     +18.9%     475847 ±  4%  perf-stat.i.iTLB-load-misses
    361622 ±  2%     -18.0%     296709 ± 12%  perf-stat.i.iTLB-loads
 5.358e+08            +9.1%  5.845e+08        perf-stat.i.instructions
      1420 ±  2%      -4.7%       1353 ±  3%  perf-stat.i.instructions-per-iTLB-miss
      0.00 ±  5%     +24.9%       0.01 ±  8%  perf-stat.i.ipc
      0.06 ±  8%     -21.0%       0.04 ±  8%  perf-stat.i.major-faults
      2.44            +3.3%       2.52        perf-stat.i.metric.GHz
      1592            +3.9%       1655 ±  2%  perf-stat.i.metric.K/sec
      2.46           +17.0%       2.88        perf-stat.i.metric.M/sec
      3139           +21.5%       3815        perf-stat.i.minor-faults
     52.88            +1.2       54.08        perf-stat.i.node-load-miss-rate%
    238363           +53.4%     365608        perf-stat.i.node-load-misses
    219288 ±  4%     +35.5%     297188        perf-stat.i.node-loads
     50.27            -6.3       44.01 ±  4%  perf-stat.i.node-store-miss-rate%
   5122757           +33.6%    6845784 ±  3%  perf-stat.i.node-store-misses
   5214111           +77.3%    9242761 ±  5%  perf-stat.i.node-stores
      3139           +21.5%       3815        perf-stat.i.page-faults
     33.59           +41.4%      47.50 ±  2%  perf-stat.overall.MPKI
      1.54            +0.2        1.70 ±  2%  perf-stat.overall.branch-miss-rate%
    407.04            -6.3%     381.24        perf-stat.overall.cpi
     18139           -34.2%      11935 ±  2%  perf-stat.overall.cycles-between-cache-misses
      0.03 ±  4%      +0.0        0.04 ±  7%  perf-stat.overall.dTLB-load-miss-rate%
      0.24            +0.0        0.25        perf-stat.overall.dTLB-store-miss-rate%
     54.20 ±  2%      +8.4       62.57 ±  5%  perf-stat.overall.iTLB-load-miss-rate%
      1353 ±  3%      -8.7%       1236 ±  5%  perf-stat.overall.instructions-per-iTLB-miss
      0.00            +6.8%       0.00        perf-stat.overall.ipc
     51.26 ±  2%      +2.9       54.20        perf-stat.overall.node-load-miss-rate%
     49.80            -7.1       42.74 ±  4%  perf-stat.overall.node-store-miss-rate%
 1.031e+08           +10.1%  1.136e+08        perf-stat.ps.branch-instructions
   1590263           +21.1%    1925935 ±  3%  perf-stat.ps.branch-misses
  11959851           +55.9%   18644201 ±  2%  perf-stat.ps.cache-misses
  17905275           +54.8%   27718539 ±  2%  perf-stat.ps.cache-references
      1778            +2.7%       1826        perf-stat.ps.context-switches
 2.169e+11            +2.5%  2.224e+11        perf-stat.ps.cpu-cycles
    112.21           +16.9%     131.18        perf-stat.ps.cpu-migrations
     50162 ±  4%     +30.8%      65621 ±  8%  perf-stat.ps.dTLB-load-misses
 1.434e+08            +8.0%  1.549e+08        perf-stat.ps.dTLB-loads
    191061           +13.1%     216009        perf-stat.ps.dTLB-store-misses
  79737750            +7.0%   85349028        perf-stat.ps.dTLB-stores
    394455 ±  4%     +20.0%     473256 ±  4%  perf-stat.ps.iTLB-load-misses
  5.33e+08            +9.5%  5.835e+08        perf-stat.ps.instructions
      0.06 ±  9%     -21.7%       0.04 ±  8%  perf-stat.ps.major-faults
      3088           +22.3%       3775        perf-stat.ps.minor-faults
    236351           +54.5%     365167        perf-stat.ps.node-load-misses
    225011 ±  4%     +37.1%     308593        perf-stat.ps.node-loads
   5183662           +32.7%    6879838 ±  3%  perf-stat.ps.node-store-misses
   5224274           +76.7%    9231733 ±  5%  perf-stat.ps.node-stores
      3088           +22.3%       3775        perf-stat.ps.page-faults
 9.703e+11           -38.2%  5.999e+11 ±  2%  perf-stat.total.instructions
     14.84 ± 13%     -12.7        2.15 ± 14%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
      1.37 ± 11%      -0.6        0.76 ± 31%  perf-profile.calltrace.cycles-pp.evsel__read_counter.read_counters.process_interval.dispatch_events.cmd_stat
      0.53 ± 72%      +0.5        1.03 ± 18%  perf-profile.calltrace.cycles-pp.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
      1.26 ± 18%      +0.8        2.11 ± 12%  perf-profile.calltrace.cycles-pp.serial8250_console_write.console_flush_all.console_unlock.vprintk_emit._printk
      1.26 ± 18%      +0.9        2.13 ± 12%  perf-profile.calltrace.cycles-pp.irq_work_run_list.irq_work_run.__sysvec_irq_work.sysvec_irq_work.asm_sysvec_irq_work
      1.26 ± 18%      +0.9        2.13 ± 12%  perf-profile.calltrace.cycles-pp.irq_work_single.irq_work_run_list.irq_work_run.__sysvec_irq_work.sysvec_irq_work
      1.26 ± 18%      +0.9        2.13 ± 12%  perf-profile.calltrace.cycles-pp._printk.irq_work_single.irq_work_run_list.irq_work_run.__sysvec_irq_work
      1.26 ± 18%      +0.9        2.13 ± 12%  perf-profile.calltrace.cycles-pp.vprintk_emit._printk.irq_work_single.irq_work_run_list.irq_work_run
      1.26 ± 18%      +0.9        2.13 ± 12%  perf-profile.calltrace.cycles-pp.console_unlock.vprintk_emit._printk.irq_work_single.irq_work_run_list
      1.26 ± 18%      +0.9        2.13 ± 12%  perf-profile.calltrace.cycles-pp.console_flush_all.console_unlock.vprintk_emit._printk.irq_work_single
      2.10 ±112%      +6.0        8.06 ± 77%  perf-profile.calltrace.cycles-pp.__libc_start_main
      2.10 ±112%      +6.0        8.06 ± 77%  perf-profile.calltrace.cycles-pp.main.__libc_start_main
      2.10 ±112%      +6.0        8.06 ± 77%  perf-profile.calltrace.cycles-pp.run_builtin.main.__libc_start_main
      1.37 ±105%      +6.0        7.34 ± 77%  perf-profile.calltrace.cycles-pp.record__pushfn.perf_mmap__push.record__mmap_read_evlist.__cmd_record.cmd_record
      4.80 ± 11%      +9.9       14.67 ± 51%  perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      3.42 ± 17%     +10.6       13.98 ± 55%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      3.82 ± 17%     +10.6       14.41 ± 54%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault
      4.01 ± 17%     +10.6       14.60 ± 53%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault
      3.78 ± 17%     +10.6       14.40 ± 54%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
     17.34 ± 10%     -13.4        3.91 ±  9%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      1.54 ± 94%      -1.3        0.20 ± 52%  perf-profile.children.cycles-pp.zero_user_segments
      4.02 ± 15%      -1.3        2.75 ± 16%  perf-profile.children.cycles-pp.exit_to_user_mode_loop
      3.52 ± 22%      -1.1        2.45 ± 13%  perf-profile.children.cycles-pp.get_perf_callchain
      3.37 ±  6%      -1.0        2.40 ± 12%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      1.16 ±  7%      -1.0        0.20 ± 39%  perf-profile.children.cycles-pp.rcu_gp_kthread
      1.56 ± 15%      -0.8        0.71 ± 23%  perf-profile.children.cycles-pp.__irq_exit_rcu
      2.62 ± 23%      -0.8        1.82 ±  8%  perf-profile.children.cycles-pp.perf_callchain_kernel
      0.92 ± 10%      -0.8        0.16 ± 42%  perf-profile.children.cycles-pp.rcu_gp_fqs_loop
      2.66 ± 17%      -0.7        1.94 ± 18%  perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
      0.93 ± 11%      -0.7        0.24 ± 31%  perf-profile.children.cycles-pp.schedule_timeout
      1.53 ± 28%      -0.7        0.85 ± 19%  perf-profile.children.cycles-pp.perf_trace_sched_switch
      2.26 ± 24%      -0.6        1.62 ±  8%  perf-profile.children.cycles-pp.unwind_next_frame
      1.38 ± 11%      -0.6        0.76 ± 31%  perf-profile.children.cycles-pp.evsel__read_counter
      1.89 ± 15%      -0.6        1.29 ± 16%  perf-profile.children.cycles-pp.__do_softirq
      0.60 ± 23%      -0.5        0.06 ± 74%  perf-profile.children.cycles-pp.rebalance_domains
      0.69 ± 20%      -0.4        0.24 ± 65%  perf-profile.children.cycles-pp.load_balance
      1.30 ±  8%      -0.4        0.88 ± 20%  perf-profile.children.cycles-pp.readn
      0.96 ± 16%      -0.4        0.56 ± 14%  perf-profile.children.cycles-pp.task_mm_cid_work
      0.44 ± 22%      -0.3        0.10 ±  9%  perf-profile.children.cycles-pp.__evlist__disable
      0.75 ±  8%      -0.3        0.47 ± 24%  perf-profile.children.cycles-pp.perf_read
      0.72 ± 14%      -0.2        0.47 ± 40%  perf-profile.children.cycles-pp.pick_next_task_fair
      0.63 ± 23%      -0.2        0.38 ± 33%  perf-profile.children.cycles-pp.put_prev_entity
      0.52 ± 18%      -0.2        0.28 ± 32%  perf-profile.children.cycles-pp.asm_sysvec_reschedule_ipi
      0.26 ± 16%      -0.2        0.07 ± 48%  perf-profile.children.cycles-pp.swake_up_one
      0.26 ± 22%      -0.2        0.08 ± 11%  perf-profile.children.cycles-pp.rcu_report_qs_rdp
      0.41 ± 26%      -0.2        0.24 ± 30%  perf-profile.children.cycles-pp.__fdget_pos
      0.19 ± 16%      -0.2        0.03 ±100%  perf-profile.children.cycles-pp.detach_tasks
      0.19 ± 52%      -0.2        0.03 ±100%  perf-profile.children.cycles-pp.ioctl
      0.30 ± 20%      -0.1        0.15 ± 49%  perf-profile.children.cycles-pp.evlist__id2evsel
      0.22 ± 16%      -0.1        0.13 ± 57%  perf-profile.children.cycles-pp.__folio_throttle_swaprate
      0.22 ± 19%      -0.1        0.13 ± 57%  perf-profile.children.cycles-pp.blk_cgroup_congested
      0.17 ± 27%      -0.1        0.08 ± 22%  perf-profile.children.cycles-pp.generic_exec_single
      0.18 ± 24%      -0.1        0.08 ± 22%  perf-profile.children.cycles-pp.smp_call_function_single
      0.17 ± 21%      -0.1        0.10 ± 32%  perf-profile.children.cycles-pp.__perf_read_group_add
      0.13 ± 35%      -0.1        0.08 ± 66%  perf-profile.children.cycles-pp.__kmalloc
      0.10 ± 31%      -0.0        0.05 ± 74%  perf-profile.children.cycles-pp.__perf_event_read
      0.02 ±144%      +0.1        0.10 ± 28%  perf-profile.children.cycles-pp.mntput_no_expire
      0.29 ± 19%      +0.1        0.40 ± 19%  perf-profile.children.cycles-pp.dput
      0.05 ±101%      +0.1        0.18 ± 40%  perf-profile.children.cycles-pp.free_unref_page_prepare
      0.04 ±152%      +0.2        0.20 ± 22%  perf-profile.children.cycles-pp.devkmsg_read
      0.86 ±  8%      +0.2        1.08 ± 19%  perf-profile.children.cycles-pp.step_into
      0.35 ± 27%      +0.3        0.61 ± 15%  perf-profile.children.cycles-pp.run_ksoftirqd
      1.46 ± 18%      +0.9        2.34 ± 22%  perf-profile.children.cycles-pp.wait_for_lsr
      1.57 ± 14%      +1.0        2.61 ± 18%  perf-profile.children.cycles-pp.serial8250_console_write
      1.58 ± 14%      +1.0        2.63 ± 19%  perf-profile.children.cycles-pp.console_unlock
      1.58 ± 14%      +1.0        2.63 ± 19%  perf-profile.children.cycles-pp.console_flush_all
      1.58 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp.asm_sysvec_irq_work
      1.58 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp.irq_work_run_list
      1.57 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp.sysvec_irq_work
      1.57 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp.__sysvec_irq_work
      1.57 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp.irq_work_run
      1.57 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp.irq_work_single
      1.57 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp._printk
      1.57 ± 14%      +1.1        2.63 ± 19%  perf-profile.children.cycles-pp.vprintk_emit
      2.17 ± 21%      +1.3        3.44 ± 28%  perf-profile.children.cycles-pp.io_serial_in
      2.24 ±100%      +5.8        8.06 ± 77%  perf-profile.children.cycles-pp.__libc_start_main
      2.24 ±100%      +5.8        8.06 ± 77%  perf-profile.children.cycles-pp.main
      2.24 ±100%      +5.8        8.06 ± 77%  perf-profile.children.cycles-pp.run_builtin
      1.68 ± 86%      +6.4        8.06 ± 77%  perf-profile.children.cycles-pp.cmd_record
      0.51 ± 59%      +7.9        8.43 ± 94%  perf-profile.children.cycles-pp.copy_page
      0.33 ±109%      +7.9        8.27 ± 97%  perf-profile.children.cycles-pp.folio_copy
      0.36 ±101%      +7.9        8.30 ± 96%  perf-profile.children.cycles-pp.move_to_new_folio
      0.36 ±101%      +7.9        8.30 ± 96%  perf-profile.children.cycles-pp.migrate_folio_extra
      0.42 ± 93%      +8.8        9.21 ± 88%  perf-profile.children.cycles-pp.migrate_pages_batch
      0.44 ± 91%      +8.8        9.24 ± 88%  perf-profile.children.cycles-pp.migrate_misplaced_page
      0.42 ± 93%      +8.8        9.22 ± 88%  perf-profile.children.cycles-pp.migrate_pages
     10.36 ±  5%      +9.5       19.84 ± 36%  perf-profile.children.cycles-pp.asm_exc_page_fault
      9.62 ±  5%      +9.5       19.12 ± 37%  perf-profile.children.cycles-pp.exc_page_fault
      7.83 ±  2%      +9.6       17.45 ± 42%  perf-profile.children.cycles-pp.handle_mm_fault
      9.20 ±  6%      +9.6       18.84 ± 38%  perf-profile.children.cycles-pp.do_user_addr_fault
      6.79 ±  2%      +9.7       16.49 ± 44%  perf-profile.children.cycles-pp.__handle_mm_fault
      0.30 ±117%      +9.8       10.06 ± 79%  perf-profile.children.cycles-pp.do_huge_pmd_numa_page
     12.05 ± 12%     -11.7        0.31 ± 36%  perf-profile.self.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.93 ± 14%      -0.4        0.56 ± 14%  perf-profile.self.cycles-pp.task_mm_cid_work
      0.78 ± 25%      -0.2        0.55 ± 10%  perf-profile.self.cycles-pp.unwind_next_frame
      0.29 ± 19%      -0.1        0.15 ± 50%  perf-profile.self.cycles-pp.evlist__id2evsel
      0.20 ± 24%      -0.1        0.11 ± 49%  perf-profile.self.cycles-pp.exc_page_fault
      0.20 ± 20%      -0.1        0.12 ± 53%  perf-profile.self.cycles-pp.blk_cgroup_congested
      0.17 ± 14%      -0.1        0.12 ± 20%  perf-profile.self.cycles-pp.perf_swevent_event
      0.02 ±144%      +0.1        0.10 ± 28%  perf-profile.self.cycles-pp.mntput_no_expire
      0.12 ± 29%      +0.1        0.23 ± 35%  perf-profile.self.cycles-pp.mod_objcg_state
      0.04 ±104%      +0.1        0.17 ± 36%  perf-profile.self.cycles-pp.free_unref_page_prepare
      1.40 ± 13%      +0.9        2.29 ± 19%  perf-profile.self.cycles-pp.io_serial_in
      0.50 ± 59%      +7.8        8.28 ± 94%  perf-profile.self.cycles-pp.copy_page




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
  

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..992e460a713e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
 	unsigned long next_scan;
 	unsigned long next_pid_reset;
 	unsigned long access_pids[2];
+	unsigned int scan_counter;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..2c3e17e7fc2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2931,20 +2931,34 @@ static bool vma_is_accessed(struct vm_area_struct *vma)
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
 	unsigned long pids;
+	unsigned int vma_size;
+	unsigned int scan_threshold;
+	unsigned int scan_size;
+
+	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+
+	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+		return true;
+
+	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
+	/* vma size in MB */
+	vma_size = (vma->vm_end - vma->vm_start) >> 20;
+
+	/* Total scans needed to cover VMA */
+	scan_threshold = (vma_size / scan_size);
+
 	/*
-	 * Allow unconditional access first two times, so that all the (pages)
-	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * Allow the scanning of half of disjoint set's VMA to induce
+	 * prot_none fault irrespective of accesses.
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
-		return true;
-
-	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
-	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+	scan_threshold = 1 + (scan_threshold >> 1);
+	return (READ_ONCE(vma->numab_state->scan_counter) <= scan_threshold);
 }
 
-#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+#define VMA_PID_RESET_PERIOD		(4 * sysctl_numa_balancing_scan_delay)
+#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
 
 /*
  * The expensive part of numa migration is done from task_work context.
@@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			WRITE_ONCE(vma->numab_state->scan_counter, 0);
 		}
 
 		/*
@@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
 						vma->numab_state->next_scan))
 			continue;
 
+		/*
+		 * For long running tasks, renew the disjoint vma scanning
+		 * periodically.
+		 */
+		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))
+			WRITE_ONCE(vma->numab_state->scan_counter, 0);
+
 		/* Do not scan the VMA if task has not accessed */
 		if (!vma_is_accessed(vma))
 			continue;
@@ -3083,6 +3106,8 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
 			vma->numab_state->access_pids[1] = 0;
 		}
+		WRITE_ONCE(vma->numab_state->scan_counter,
+				READ_ONCE(vma->numab_state->scan_counter) + 1);
 
 		do {
 			start = max(start, vma->vm_start);