[-V3,7/9] mm: tune PCP high automatically

Message ID	20231016053002.756205-8-ying.huang@intel.com
State	New
Headers
  Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37;
  From: Huang Ying <ying.huang@intel.com>
  To: Andrew Morton <akpm@linux-foundation.org>
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven <arjan@linux.intel.com>, Huang Ying <ying.huang@intel.com>, Mel Gorman <mgorman@techsingularity.net>, Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>, David Hildenbrand <david@redhat.com>, Johannes Weiner <jweiner@redhat.com>, Dave Hansen <dave.hansen@linux.intel.com>, Pavel Tatashin <pasha.tatashin@soleen.com>, Matthew Wilcox <willy@infradead.org>, Christoph Lameter <cl@linux.com>
  Subject: [PATCH -V3 7/9] mm: tune PCP high automatically
  Date: Mon, 16 Oct 2023 13:30:00 +0800
  Message-Id: <20231016053002.756205-8-ying.huang@intel.com>
  In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
  References: <20231016053002.756205-1-ying.huang@intel.com>
  MIME-Version: 1.0
  Content-Transfer-Encoding: 8bit
  Precedence: bulk
Series	mm: PCP high auto-tuning
  [-V3,0/9] mm: PCP high auto-tuning
  [-V3,1/9] mm, pcp: avoid to drain PCP when process exit
  [-V3,2/9] cacheinfo: calculate size of per-CPU data cache slice
  [-V3,3/9] mm, pcp: reduce lock contention for draining high-order pages
  [-V3,4/9] mm: restrict the pcp batch scale factor to avoid too long latency
  [-V3,5/9] mm, page_alloc: scale the number of pages that are batch allocated
  [-V3,6/9] mm: add framework for PCP high auto-tuning
  [-V3,7/9] mm: tune PCP high automatically
  [-V3,8/9] mm, pcp: decrease PCP high if free pages < high watermark
  [-V3,9/9] mm, pcp: reduce detecting time of consecutive high order page freeing

Commit Message

Huang, Ying Oct. 16, 2023, 5:30 a.m. UTC

The goals of tuning PCP high automatically are as follows:

- Minimize allocation/freeing from/to the shared zone

- Minimize idle pages in PCP

- Minimize pages in PCP if the system has too few free pages

To reach these goals, the following tuning algorithm is designed:

- When we refill PCP by allocating from the zone, increase PCP high,
  because a larger PCP could have avoided allocating from the zone.

- In the periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
  decrease PCP high to try to free possibly idle PCP pages.

- When page reclaiming is active for the zone, stop increasing PCP
  high in the allocating path, and decrease PCP high and free some
  pages in the freeing path.

So, PCP high can eventually be tuned to the page allocating/freeing
depth of the workload.
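
As an illustration only, the three rules above can be modeled in a few
lines of standalone C.  This is a simplified sketch, not the code in
this patch: the struct, the function names, the high_min/high_max
bounds, and the one-batch step used for increasing (and for decreasing
under reclaim) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of one per-CPU page list (not the kernel struct). */
struct pcp_model {
	int high;      /* current limit on pages kept in the PCP */
	int high_min;  /* assumed lower bound for high */
	int high_max;  /* assumed upper bound for high */
	int batch;     /* assumed step size, in pages */
};

static int clamp_int(int v, int lo, int hi)
{
	assert(lo <= hi);
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Refill from the zone: grow high, unless reclaim is active for the zone. */
static void pcp_on_refill(struct pcp_model *p, bool reclaim_active)
{
	if (!reclaim_active)
		p->high = clamp_int(p->high + p->batch, p->high_min, p->high_max);
}

/* Freeing path: shrink high while reclaim is active for the zone. */
static void pcp_on_free(struct pcp_model *p, bool reclaim_active)
{
	if (reclaim_active)
		p->high = clamp_int(p->high - p->batch, p->high_min, p->high_max);
}

/* Periodic vmstat work: decrease high by 1/8 to release possibly idle pages. */
static void pcp_on_periodic(struct pcp_model *p)
{
	p->high = clamp_int(p->high - (p->high >> 3), p->high_min, p->high_max);
}
```

For example, starting from high = 64 with batch = 32 and bounds
[64, 1024], four refills without reclaim raise high to 192, and one
periodic pass then lowers it to 168 (192 - 192/8).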

One issue with the algorithm is that if far more pages are allocated
than freed on a CPU, PCP high may reach its maximum value even if the
allocating/freeing depth is small.  But this isn't a severe issue,
because there are no idle pages in this case.

One alternative is to increase PCP high when we drain PCP by freeing
pages to the zone, and not to increase PCP high during PCP refilling.
This avoids the issue above.  But if far fewer pages are allocated
than freed on a CPU, there will be many idle pages in PCP and it is
hard to free them.

PCP high is decreased by 1/8 (high >> 3) periodically.  The value 1/8
is somewhat arbitrary; it is chosen just to make sure that idle PCP
pages will eventually be freed.
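
To get a feel for how quickly a 1/8 decrement releases idle pages, the
sketch below counts how many periodic passes it takes for high to fall
back to a lower bound.  The function names and the high_min floor are
hypothetical, and the real code may limit the per-pass decrease in
other ways.

```c
#include <assert.h>

/* One periodic pass: drop 1/8 of high (high >> 3), never going below high_min. */
static int pcp_decay_once(int high, int high_min)
{
	int next = high - (high >> 3);

	return next < high_min ? high_min : next;
}

/* Count periodic passes until high stops changing. */
static int pcp_decay_passes(int high, int high_min)
{
	int passes = 0;

	assert(high_min >= 0 && high >= high_min);
	for (;;) {
		int next = pcp_decay_once(high, high_min);

		/* Stops at the floor, or when high < 8 and the shift yields 0. */
		if (next == high)
			break;
		high = next;
		passes++;
	}
	return passes;
}
```

Under these assumptions, the first pass takes high from 1024 to 896,
and decaying from 1024 down to a floor of 64 takes 22 passes.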

On a 2-socket Intel server with 224 logical CPUs, we ran 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server that is used by the 0-Day kbuild service.
With the patch, the build time decreases by 3.5%.  The cycles% of
spinlock contention (mostly for the zone lock) decreases from 11.0% to
0.5%.  The number of PCP drains for high-order page
freeing (free_high) decreases by 65.6%.  The number of pages allocated
from the zone (instead of from PCP) decreases by 83.9%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/gfp.h |   1 +
 mm/page_alloc.c     | 119 ++++++++++++++++++++++++++++++++++----------
 mm/vmstat.c         |   8 +--
 3 files changed, 99 insertions(+), 29 deletions(-)

Comments

kernel test robot Oct. 31, 2023, 2:50 a.m. UTC | #1

Hello,

kernel test robot noticed an 8.4% improvement of will-it-scale.per_process_ops on:


commit: ba6149e96007edcdb01284c1531ebd49b4720f72 ("[PATCH -V3 7/9] mm: tune PCP high automatically")
url: https://github.com/intel-lab-lkp/linux/commits/Huang-Ying/mm-pcp-avoid-to-drain-PCP-when-process-exit/20231017-143633
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git 36b2d7dd5a8ac95c8c1e69bdc93c4a6e2dc28a23
patch link: https://lore.kernel.org/all/20231016053002.756205-8-ying.huang@intel.com/
patch subject: [PATCH -V3 7/9] mm: tune PCP high automatically

testcase: will-it-scale
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
parameters:

	nr_task: 16
	mode: process
	test: page_fault2
	cpufreq_governor: performance

Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231031/202310311001.edbc5817-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/process/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/page_fault2/will-it-scale

commit: 
  9f9d0b0869 ("mm: add framework for PCP high auto-tuning")
  ba6149e960 ("mm: tune PCP high automatically")

9f9d0b08696fb316 ba6149e96007edcdb01284c1531 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      0.29            +0.0        0.32        mpstat.cpu.all.usr%
   1434135 ±  2%     +15.8%    1660688 ±  4%  numa-meminfo.node0.AnonPages.max
     22.97            +2.0%      23.43        turbostat.RAMWatt
    213121 ±  5%     -19.5%     171478 ±  7%  meminfo.DirectMap4k
   8031428           +12.0%    8998346        meminfo.Memused
   9777522           +14.3%   11178004        meminfo.max_used_kB
   4913700            +8.4%    5326025        will-it-scale.16.processes
    307105            +8.4%     332876        will-it-scale.per_process_ops
   4913700            +8.4%    5326025        will-it-scale.workload
 1.488e+09            +8.5%  1.614e+09        proc-vmstat.numa_hit
 1.487e+09            +8.4%  1.612e+09        proc-vmstat.numa_local
 1.486e+09            +8.3%  1.609e+09        proc-vmstat.pgalloc_normal
 1.482e+09            +8.3%  1.604e+09        proc-vmstat.pgfault
 1.486e+09            +8.3%  1.609e+09        proc-vmstat.pgfree
   2535424 ±  2%      +6.2%    2693888 ±  2%  proc-vmstat.unevictable_pgs_scanned
      0.04 ±  9%     +62.2%       0.06 ± 20%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
     85.33 ±  7%     +36.1%     116.17 ±  8%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
    475.33 ±  3%     +24.8%     593.33 ±  4%  perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.16 ± 17%    +449.1%       0.87 ± 39%  perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
      0.03 ± 10%     +94.1%       0.07 ± 26%  perf-sched.wait_time.avg.ms.__cond_resched.__alloc_pages.__folio_alloc.vma_alloc_folio.do_cow_fault
      0.04 ±  9%     +62.2%       0.06 ± 20%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
      0.16 ± 17%    +449.1%       0.87 ± 39%  perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
     14.01            +6.0%      14.85        perf-stat.i.MPKI
  5.79e+09            +3.6%  6.001e+09        perf-stat.i.branch-instructions
      0.20 ±  2%      +0.0        0.21 ±  2%  perf-stat.i.branch-miss-rate%
  12098037 ±  2%      +8.5%   13122446 ±  2%  perf-stat.i.branch-misses
     82.90            +2.1       85.03        perf-stat.i.cache-miss-rate%
 4.005e+08            +9.8%  4.399e+08        perf-stat.i.cache-misses
  4.83e+08            +7.1%  5.174e+08        perf-stat.i.cache-references
      2.29            -3.2%       2.22        perf-stat.i.cpi
    164.08            -9.0%     149.33        perf-stat.i.cycles-between-cache-misses
 7.091e+09            +4.2%  7.392e+09        perf-stat.i.dTLB-loads
      0.97            +0.0        1.01        perf-stat.i.dTLB-store-miss-rate%
  40301594            +8.8%   43829422        perf-stat.i.dTLB-store-misses
 4.121e+09            +4.4%  4.302e+09        perf-stat.i.dTLB-stores
     83.96            +2.6       86.59        perf-stat.i.iTLB-load-miss-rate%
  10268085 ±  3%     +23.0%   12628681 ±  3%  perf-stat.i.iTLB-load-misses
 2.861e+10            +3.7%  2.966e+10        perf-stat.i.instructions
      2796 ±  3%     -15.7%       2356 ±  3%  perf-stat.i.instructions-per-iTLB-miss
      0.44            +3.3%       0.45        perf-stat.i.ipc
    984.67            +9.6%       1078        perf-stat.i.metric.K/sec
     78.05            +4.2%      81.29        perf-stat.i.metric.M/sec
   4913856            +8.4%    5329060        perf-stat.i.minor-faults
 1.356e+08           +10.6%  1.499e+08        perf-stat.i.node-loads
  32443508            +7.6%   34908277        perf-stat.i.node-stores
   4913858            +8.4%    5329062        perf-stat.i.page-faults
     14.00            +6.0%      14.83        perf-stat.overall.MPKI
      0.21 ±  2%      +0.0        0.22 ±  2%  perf-stat.overall.branch-miss-rate%
     82.92            +2.1       85.02        perf-stat.overall.cache-miss-rate%
      2.29            -3.1%       2.21        perf-stat.overall.cpi
    163.33            -8.6%     149.29        perf-stat.overall.cycles-between-cache-misses
      0.97            +0.0        1.01        perf-stat.overall.dTLB-store-miss-rate%
     84.00            +2.6       86.61        perf-stat.overall.iTLB-load-miss-rate%
      2789 ±  3%     -15.7%       2350 ±  3%  perf-stat.overall.instructions-per-iTLB-miss
      0.44            +3.2%       0.45        perf-stat.overall.ipc
   1754985            -4.7%    1673375        perf-stat.overall.path-length
 5.771e+09            +3.6%  5.981e+09        perf-stat.ps.branch-instructions
  12074113 ±  2%      +8.4%   13094204 ±  2%  perf-stat.ps.branch-misses
 3.992e+08            +9.8%  4.384e+08        perf-stat.ps.cache-misses
 4.814e+08            +7.1%  5.157e+08        perf-stat.ps.cache-references
 7.068e+09            +4.2%  7.367e+09        perf-stat.ps.dTLB-loads
  40167519            +8.7%   43680173        perf-stat.ps.dTLB-store-misses
 4.107e+09            +4.4%  4.288e+09        perf-stat.ps.dTLB-stores
  10234325 ±  3%     +23.0%   12587000 ±  3%  perf-stat.ps.iTLB-load-misses
 2.852e+10            +3.6%  2.956e+10        perf-stat.ps.instructions
   4897507            +8.4%    5310921        perf-stat.ps.minor-faults
 1.351e+08           +10.5%  1.494e+08        perf-stat.ps.node-loads
  32335421            +7.6%   34789913        perf-stat.ps.node-stores
   4897509            +8.4%    5310923        perf-stat.ps.page-faults
 8.623e+12            +3.4%  8.912e+12        perf-stat.total.instructions
      9.86 ±  3%      -8.4        1.49 ±  5%  perf-profile.calltrace.cycles-pp.rmqueue_bulk.__rmqueue_pcplist.rmqueue.get_page_from_freelist.__alloc_pages
      8.11 ±  3%      -7.5        0.58 ±  8%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.rmqueue_bulk.__rmqueue_pcplist.rmqueue.get_page_from_freelist
      8.10 ±  3%      -7.5        0.58 ±  8%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.rmqueue_bulk.__rmqueue_pcplist.rmqueue
      7.52 ±  3%      -6.4        1.15 ±  5%  perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush.zap_pte_range
      7.90 ±  4%      -6.4        1.55 ±  4%  perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range
      5.78 ±  4%      -5.8        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush
      5.78 ±  4%      -5.8        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.free_pcppages_bulk.free_unref_page_list.release_pages
     10.90 ±  3%      -5.3        5.59 ±  2%  perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.__folio_alloc.vma_alloc_folio.do_cow_fault
     10.57 ±  3%      -5.3        5.26 ±  3%  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.__folio_alloc.vma_alloc_folio
     10.21 ±  3%      -5.3        4.94 ±  3%  perf-profile.calltrace.cycles-pp.__rmqueue_pcplist.rmqueue.get_page_from_freelist.__alloc_pages.__folio_alloc
     11.18 ±  3%      -5.3        5.91 ±  2%  perf-profile.calltrace.cycles-pp.__folio_alloc.vma_alloc_folio.do_cow_fault.do_fault.__handle_mm_fault
     11.15 ±  3%      -5.3        5.88 ±  2%  perf-profile.calltrace.cycles-pp.__alloc_pages.__folio_alloc.vma_alloc_folio.do_cow_fault.do_fault
     11.56 ±  3%      -5.2        6.37 ±  2%  perf-profile.calltrace.cycles-pp.vma_alloc_folio.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
      9.76 ±  3%      -4.3        5.50 ±  6%  perf-profile.calltrace.cycles-pp.release_pages.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range.unmap_page_range
     10.18 ±  3%      -4.2        5.95 ±  5%  perf-profile.calltrace.cycles-pp.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.do_vmi_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     15.39 ±  3%      -3.7       11.70        perf-profile.calltrace.cycles-pp.unmap_region.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap.__x64_sys_munmap
     14.08 ±  3%      -3.6       10.49        perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region
     14.10 ±  3%      -3.6       10.52        perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap
     14.10 ±  3%      -3.6       10.52        perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.do_vmi_align_munmap.do_vmi_munmap
     14.10 ±  3%      -3.6       10.52        perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region.do_vmi_align_munmap
      1.60 ±  2%      -0.7        0.86 ±  6%  perf-profile.calltrace.cycles-pp.__list_del_entry_valid_or_report.rmqueue_bulk.__rmqueue_pcplist.rmqueue.get_page_from_freelist
      0.96 ±  3%      -0.4        0.56 ±  3%  perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush.tlb_finish_mmu
      1.00 ±  4%      -0.4        0.62 ±  4%  perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region
      1.26 ±  4%      -0.1        1.11 ±  2%  perf-profile.calltrace.cycles-pp.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region.do_vmi_align_munmap
      1.28 ±  3%      -0.1        1.16 ±  3%  perf-profile.calltrace.cycles-pp.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region.do_vmi_align_munmap.do_vmi_munmap
      1.28 ±  4%      -0.1        1.17 ±  2%  perf-profile.calltrace.cycles-pp.tlb_finish_mmu.unmap_region.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap
      0.60 ±  3%      -0.0        0.57        perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
      0.55 ±  3%      +0.0        0.60        perf-profile.calltrace.cycles-pp.__perf_sw_event.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      0.73 ±  3%      +0.1        0.79 ±  2%  perf-profile.calltrace.cycles-pp.lock_vma_under_rcu.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      0.68 ±  3%      +0.1        0.78 ±  3%  perf-profile.calltrace.cycles-pp.page_remove_rmap.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
      0.57 ±  7%      +0.1        0.71 ±  8%  perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.folio_add_new_anon_rmap.set_pte_range.finish_fault.do_cow_fault
      1.41 ±  3%      +0.1        1.55        perf-profile.calltrace.cycles-pp.sync_regs.asm_exc_page_fault.testcase
      0.77 ±  4%      +0.2        0.93 ±  5%  perf-profile.calltrace.cycles-pp.folio_add_new_anon_rmap.set_pte_range.finish_fault.do_cow_fault.do_fault
      0.94 ±  3%      +0.2        1.12 ±  3%  perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru_vma.set_pte_range.finish_fault
      0.36 ± 70%      +0.2        0.57        perf-profile.calltrace.cycles-pp.__perf_sw_event.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      1.26 ±  5%      +0.2        1.47 ±  3%  perf-profile.calltrace.cycles-pp.filemap_get_entry.shmem_get_folio_gfp.shmem_fault.__do_fault.do_cow_fault
      1.61 ±  5%      +0.3        1.87 ±  3%  perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fault.__do_fault.do_cow_fault.do_fault
      1.75 ±  5%      +0.3        2.05 ±  3%  perf-profile.calltrace.cycles-pp.shmem_fault.__do_fault.do_cow_fault.do_fault.__handle_mm_fault
      1.86 ±  4%      +0.3        2.17 ±  2%  perf-profile.calltrace.cycles-pp.__do_fault.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
      0.17 ±141%      +0.4        0.58 ±  3%  perf-profile.calltrace.cycles-pp.xas_load.filemap_get_entry.shmem_get_folio_gfp.shmem_fault.__do_fault
      2.60 ±  3%      +0.5        3.14 ±  5%  perf-profile.calltrace.cycles-pp._compound_head.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
      4.51 ±  3%      +0.7        5.16        perf-profile.calltrace.cycles-pp._raw_spin_lock.__pte_offset_map_lock.finish_fault.do_cow_fault.do_fault
      4.65 ±  3%      +0.7        5.32        perf-profile.calltrace.cycles-pp.__pte_offset_map_lock.finish_fault.do_cow_fault.do_fault.__handle_mm_fault
      1.61 ±  3%      +1.9        3.52 ±  6%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru_vma
      0.85 ±  2%      +1.9        2.77 ± 13%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.zap_pte_range
      0.84 ±  2%      +1.9        2.76 ± 13%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush
      0.85 ±  2%      +1.9        2.78 ± 12%  perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.zap_pte_range.zap_pmd_range
      1.71 ±  3%      +1.9        3.64 ±  6%  perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru_vma.set_pte_range.finish_fault
      1.70 ±  2%      +1.9        3.63 ±  6%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru_vma.set_pte_range
      3.31 ±  2%      +2.2        5.52 ±  5%  perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru_vma.set_pte_range.finish_fault.do_cow_fault
      3.46 ±  2%      +2.2        5.71 ±  5%  perf-profile.calltrace.cycles-pp.folio_add_lru_vma.set_pte_range.finish_fault.do_cow_fault.do_fault
      4.47 ±  2%      +2.4        6.90 ±  4%  perf-profile.calltrace.cycles-pp.set_pte_range.finish_fault.do_cow_fault.do_fault.__handle_mm_fault
      9.22 ±  2%      +3.1       12.33 ±  2%  perf-profile.calltrace.cycles-pp.finish_fault.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
     44.13 ±  3%      +3.2       47.34        perf-profile.calltrace.cycles-pp.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
     44.27 ±  3%      +3.2       47.49        perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
     45.63 ±  2%      +3.3       48.95        perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      0.00            +3.4        3.37 ±  2%  perf-profile.calltrace.cycles-pp.__list_del_entry_valid_or_report.__rmqueue_pcplist.rmqueue.get_page_from_freelist.__alloc_pages
     46.88 ±  3%      +3.4       50.29        perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
     49.40 ±  2%      +3.6       53.03        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
     49.59 ±  2%      +3.7       53.24        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
     59.06 ±  2%      +4.5       63.60        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
     56.32 ±  3%      +4.6       60.89        perf-profile.calltrace.cycles-pp.testcase
     20.16 ±  3%      +4.9       25.10        perf-profile.calltrace.cycles-pp.copy_page.do_cow_fault.do_fault.__handle_mm_fault.handle_mm_fault
     16.66 ±  3%      -8.8        7.83 ±  8%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     16.48 ±  3%      -8.8        7.66 ±  8%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      9.90 ±  3%      -8.4        1.50 ±  5%  perf-profile.children.cycles-pp.rmqueue_bulk
      8.92 ±  3%      -6.7        2.18 ±  2%  perf-profile.children.cycles-pp.free_unref_page_list
      8.47 ±  3%      -6.7        1.74 ±  4%  perf-profile.children.cycles-pp.free_pcppages_bulk
     10.96 ±  3%      -5.3        5.64 ±  2%  perf-profile.children.cycles-pp.get_page_from_freelist
     10.62 ±  3%      -5.3        5.30 ±  2%  perf-profile.children.cycles-pp.rmqueue
     10.26 ±  3%      -5.3        4.97 ±  3%  perf-profile.children.cycles-pp.__rmqueue_pcplist
     11.24 ±  3%      -5.3        5.96 ±  2%  perf-profile.children.cycles-pp.__alloc_pages
     11.18 ±  3%      -5.3        5.92 ±  2%  perf-profile.children.cycles-pp.__folio_alloc
     11.57 ±  3%      -5.2        6.37 ±  2%  perf-profile.children.cycles-pp.vma_alloc_folio
     11.19 ±  3%      -4.4        6.82 ±  5%  perf-profile.children.cycles-pp.release_pages
     11.46 ±  3%      -4.3        7.12 ±  5%  perf-profile.children.cycles-pp.tlb_batch_pages_flush
     15.52 ±  3%      -3.7       11.81        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     15.52 ±  3%      -3.7       11.81        perf-profile.children.cycles-pp.do_syscall_64
     15.41 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.__munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.do_vmi_munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.do_vmi_align_munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.__x64_sys_munmap
     15.40 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.__vm_munmap
     15.39 ±  3%      -3.7       11.70        perf-profile.children.cycles-pp.unmap_region
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.unmap_vmas
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.unmap_page_range
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.zap_pmd_range
     14.10 ±  3%      -3.6       10.52        perf-profile.children.cycles-pp.zap_pte_range
      2.60 ±  3%      -2.0        0.56 ±  4%  perf-profile.children.cycles-pp.__free_one_page
      1.28 ±  3%      -0.1        1.17 ±  2%  perf-profile.children.cycles-pp.tlb_finish_mmu
      0.15 ± 19%      -0.1        0.08 ± 14%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
      0.61 ±  3%      -0.0        0.58 ±  2%  perf-profile.children.cycles-pp.__mem_cgroup_charge
      0.11 ±  6%      -0.0        0.08 ±  7%  perf-profile.children.cycles-pp.__mod_zone_page_state
      0.25 ±  4%      +0.0        0.26        perf-profile.children.cycles-pp.error_entry
      0.15 ±  3%      +0.0        0.17 ±  4%  perf-profile.children.cycles-pp.free_unref_page_commit
      0.12 ±  8%      +0.0        0.14 ±  3%  perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
      0.18 ±  3%      +0.0        0.20 ±  4%  perf-profile.children.cycles-pp.access_error
      0.07 ±  5%      +0.0        0.09 ±  7%  perf-profile.children.cycles-pp.task_tick_fair
      0.04 ± 45%      +0.0        0.06 ±  7%  perf-profile.children.cycles-pp.page_counter_try_charge
      0.30 ±  4%      +0.0        0.32        perf-profile.children.cycles-pp.down_read_trylock
      0.27 ±  3%      +0.0        0.30 ±  2%  perf-profile.children.cycles-pp.up_read
      0.15 ±  8%      +0.0        0.18 ±  3%  perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
      0.02 ±142%      +0.1        0.07 ± 29%  perf-profile.children.cycles-pp.ret_from_fork_asm
      0.44 ±  2%      +0.1        0.49 ±  3%  perf-profile.children.cycles-pp.mas_walk
      0.46 ±  4%      +0.1        0.52 ±  2%  perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
      0.67 ±  3%      +0.1        0.73 ±  4%  perf-profile.children.cycles-pp.lock_mm_and_find_vma
      0.42 ±  3%      +0.1        0.48 ±  2%  perf-profile.children.cycles-pp.free_swap_cache
      0.43 ±  4%      +0.1        0.49 ±  2%  perf-profile.children.cycles-pp.free_pages_and_swap_cache
      0.30 ±  5%      +0.1        0.37 ±  3%  perf-profile.children.cycles-pp.xas_descend
      0.86 ±  3%      +0.1        0.92        perf-profile.children.cycles-pp.___perf_sw_event
      0.73 ±  3%      +0.1        0.80        perf-profile.children.cycles-pp.lock_vma_under_rcu
      0.40 ±  2%      +0.1        0.47        perf-profile.children.cycles-pp.__mod_node_page_state
      0.01 ±223%      +0.1        0.09 ± 12%  perf-profile.children.cycles-pp.shmem_get_policy
      0.53 ±  2%      +0.1        0.62 ±  2%  perf-profile.children.cycles-pp.__mod_lruvec_state
      1.09 ±  3%      +0.1        1.18        perf-profile.children.cycles-pp.__perf_sw_event
      0.50 ±  5%      +0.1        0.60 ±  3%  perf-profile.children.cycles-pp.xas_load
      0.68 ±  3%      +0.1        0.78 ±  3%  perf-profile.children.cycles-pp.page_remove_rmap
      1.45 ±  3%      +0.1        1.60        perf-profile.children.cycles-pp.sync_regs
      0.77 ±  4%      +0.2        0.93 ±  5%  perf-profile.children.cycles-pp.folio_add_new_anon_rmap
      0.84 ±  5%      +0.2        1.02 ±  7%  perf-profile.children.cycles-pp.__mod_lruvec_page_state
      0.96 ±  4%      +0.2        1.15 ±  3%  perf-profile.children.cycles-pp.lru_add_fn
      1.27 ±  5%      +0.2        1.48 ±  3%  perf-profile.children.cycles-pp.filemap_get_entry
      1.62 ±  4%      +0.3        1.88 ±  3%  perf-profile.children.cycles-pp.shmem_get_folio_gfp
      1.75 ±  5%      +0.3        2.06 ±  3%  perf-profile.children.cycles-pp.shmem_fault
      1.87 ±  4%      +0.3        2.18 ±  2%  perf-profile.children.cycles-pp.__do_fault
      2.19 ±  2%      +0.3        2.51        perf-profile.children.cycles-pp.native_irq_return_iret
      2.64 ±  4%      +0.5        3.18 ±  6%  perf-profile.children.cycles-pp._compound_head
      4.62 ±  3%      +0.6        5.26        perf-profile.children.cycles-pp._raw_spin_lock
      4.67 ±  3%      +0.7        5.34        perf-profile.children.cycles-pp.__pte_offset_map_lock
      3.32 ±  2%      +2.2        5.54 ±  5%  perf-profile.children.cycles-pp.folio_batch_move_lru
      3.47 ±  2%      +2.2        5.72 ±  5%  perf-profile.children.cycles-pp.folio_add_lru_vma
      4.49 ±  2%      +2.4        6.92 ±  4%  perf-profile.children.cycles-pp.set_pte_range
      9.25 ±  2%      +3.1       12.36 ±  2%  perf-profile.children.cycles-pp.finish_fault
      2.25 ±  2%      +3.1        5.36 ±  2%  perf-profile.children.cycles-pp.__list_del_entry_valid_or_report
     44.16 ±  3%      +3.2       47.37        perf-profile.children.cycles-pp.do_cow_fault
     44.28 ±  3%      +3.2       47.50        perf-profile.children.cycles-pp.do_fault
     45.66 ±  2%      +3.3       48.98        perf-profile.children.cycles-pp.__handle_mm_fault
     46.91 ±  2%      +3.4       50.33        perf-profile.children.cycles-pp.handle_mm_fault
     49.44 ±  2%      +3.6       53.08        perf-profile.children.cycles-pp.do_user_addr_fault
     49.62 ±  2%      +3.6       53.27        perf-profile.children.cycles-pp.exc_page_fault
      2.70 ±  3%      +4.1        6.75 ±  8%  perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
     55.26 ±  2%      +4.2       59.44        perf-profile.children.cycles-pp.asm_exc_page_fault
     58.13 ±  3%      +4.6       62.72        perf-profile.children.cycles-pp.testcase
     20.19 ±  3%      +4.9       25.14        perf-profile.children.cycles-pp.copy_page
     16.48 ±  3%      -8.8        7.66 ±  8%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      2.53 ±  3%      -2.0        0.54 ±  3%  perf-profile.self.cycles-pp.__free_one_page
      0.12 ±  4%      -0.1        0.05 ± 46%  perf-profile.self.cycles-pp.rmqueue_bulk
      0.14 ± 19%      -0.1        0.08 ± 14%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
      0.10 ±  3%      -0.0        0.08 ± 10%  perf-profile.self.cycles-pp.__mod_zone_page_state
      0.13 ±  5%      +0.0        0.14 ±  2%  perf-profile.self.cycles-pp.free_unref_page_commit
      0.13 ±  3%      +0.0        0.14 ±  3%  perf-profile.self.cycles-pp.exc_page_fault
      0.15 ±  5%      +0.0        0.17 ±  4%  perf-profile.self.cycles-pp.__pte_offset_map
      0.04 ± 44%      +0.0        0.06 ±  6%  perf-profile.self.cycles-pp.page_counter_try_charge
      0.18 ±  3%      +0.0        0.20 ±  4%  perf-profile.self.cycles-pp.access_error
      0.30 ±  3%      +0.0        0.32 ±  2%  perf-profile.self.cycles-pp.down_read_trylock
      0.16 ±  6%      +0.0        0.18        perf-profile.self.cycles-pp.set_pte_range
      0.26 ±  2%      +0.0        0.29 ±  3%  perf-profile.self.cycles-pp.up_read
      0.15 ±  8%      +0.0        0.18 ±  4%  perf-profile.self.cycles-pp.folio_add_lru_vma
      0.15 ±  8%      +0.0        0.18 ±  3%  perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
      0.22 ±  6%      +0.0        0.26 ±  5%  perf-profile.self.cycles-pp.__alloc_pages
      0.32 ±  6%      +0.0        0.36 ±  3%  perf-profile.self.cycles-pp.shmem_get_folio_gfp
      0.28 ±  5%      +0.0        0.32 ±  4%  perf-profile.self.cycles-pp.do_cow_fault
      0.14 ±  7%      +0.0        0.18 ±  6%  perf-profile.self.cycles-pp.shmem_fault
      0.34 ±  5%      +0.0        0.38 ±  4%  perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.__cond_resched
      0.44 ±  3%      +0.1        0.49 ±  4%  perf-profile.self.cycles-pp.page_remove_rmap
      0.41 ±  3%      +0.1        0.47 ±  3%  perf-profile.self.cycles-pp.free_swap_cache
      0.75 ±  3%      +0.1        0.81 ±  2%  perf-profile.self.cycles-pp.___perf_sw_event
      0.91 ±  2%      +0.1        0.98 ±  2%  perf-profile.self.cycles-pp.__handle_mm_fault
      0.29 ±  6%      +0.1        0.36 ±  3%  perf-profile.self.cycles-pp.xas_descend
      0.38 ±  2%      +0.1        0.45 ±  2%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.01 ±223%      +0.1        0.09 ±  8%  perf-profile.self.cycles-pp.shmem_get_policy
      0.58 ±  3%      +0.1        0.66 ±  2%  perf-profile.self.cycles-pp.release_pages
      0.44 ±  4%      +0.1        0.54 ±  3%  perf-profile.self.cycles-pp.lru_add_fn
      1.44 ±  3%      +0.1        1.59        perf-profile.self.cycles-pp.sync_regs
      2.18 ±  2%      +0.3        2.50        perf-profile.self.cycles-pp.native_irq_return_iret
      4.36 ±  3%      +0.4        4.76        perf-profile.self.cycles-pp.testcase
      2.61 ±  4%      +0.5        3.14 ±  5%  perf-profile.self.cycles-pp._compound_head
      4.60 ±  3%      +0.6        5.23        perf-profile.self.cycles-pp._raw_spin_lock
      2.23 ±  2%      +3.1        5.34 ±  2%  perf-profile.self.cycles-pp.__list_del_entry_valid_or_report
     20.10 ±  3%      +4.9       25.02        perf-profile.self.cycles-pp.copy_page




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665edc11fb9f..5b917e5b9350 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -320,6 +320,7 @@  extern void page_frag_free(void *addr);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1fb2c6ebde9c..8382ad2cdfd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2157,6 +2157,40 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	return i;
 }
 
+/*
+ * Called from the vmstat counter updater to decay the PCP high.
+ * Return whether there is additional work to do.
+ */
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+{
+	int high_min, to_drain, batch;
+	int todo = 0;
+
+	high_min = READ_ONCE(pcp->high_min);
+	batch = READ_ONCE(pcp->batch);
+	/*
+	 * Decrease pcp->high periodically to free PCP pages that are
+	 * possibly idle.  Avoid freeing too many pages at once to
+	 * control latency.  This also caps the pcp->high decrement.
+	 */
+	if (pcp->high > high_min) {
+		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+				 pcp->high - (pcp->high >> 3), high_min);
+		if (pcp->high > high_min)
+			todo++;
+	}
+
+	to_drain = pcp->count - pcp->high;
+	if (to_drain > 0) {
+		spin_lock(&pcp->lock);
+		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		spin_unlock(&pcp->lock);
+		todo++;
+	}
+
+	return todo;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the vmstat counter updater to drain pagesets of this
@@ -2318,14 +2352,13 @@  static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
 	return true;
 }
 
-static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
+static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high)
 {
 	int min_nr_free, max_nr_free;
-	int batch = READ_ONCE(pcp->batch);
 
-	/* Free everything if batch freeing high-order pages. */
+	/* Free as much as possible if batch freeing high-order pages. */
 	if (unlikely(free_high))
-		return pcp->count;
+		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
 
 	/* Check for PCP disabled or boot pageset */
 	if (unlikely(high < batch))
@@ -2340,7 +2373,7 @@  static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+	if (batch <= max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
 
@@ -2348,28 +2381,48 @@  static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 }
 
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
-		       bool free_high)
+		       int batch, bool free_high)
 {
-	int high = READ_ONCE(pcp->high_min);
+	int high, high_min, high_max;
 
-	if (unlikely(!high || free_high))
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
+
+	if (unlikely(!high))
 		return 0;
 
-	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
-		return high;
+	if (unlikely(free_high)) {
+		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+				high_min);
+		return 0;
+	}
 
 	/*
 	 * If reclaim is active, limit the number of pages that can be
 	 * stored on pcp lists
 	 */
-	return min(READ_ONCE(pcp->batch) << 2, high);
+	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
+		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		return min(batch << 2, pcp->high);
+	}
+
+	if (pcp->count >= high && high_min != high_max) {
+		int need_high = (batch << pcp->free_factor) + batch;
+
+		/* pcp->high should be large enough to hold batch freed pages */
+		if (pcp->high < need_high)
+			pcp->high = clamp(need_high, high_min, high_max);
+	}
+
+	return high;
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
 				   unsigned int order)
 {
-	int high;
+	int high, batch;
 	int pindex;
 	bool free_high = false;
 
@@ -2384,6 +2437,7 @@  static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
+	batch = READ_ONCE(pcp->batch);
 	/*
 	 * As high-order pages other than THP's stored on PCP can contribute
 	 * to fragmentation, limit the number stored when PCP is heavily
@@ -2394,14 +2448,15 @@  static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 		free_high = (pcp->free_factor &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
-			      pcp->count >= READ_ONCE(pcp->batch)));
+			      pcp->count >= batch));
 		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	high = nr_pcp_high(pcp, zone, free_high);
+	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
-		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
+		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
+				   pcp, pindex);
 	}
 }
 
@@ -2685,24 +2740,38 @@  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 	return page;
 }
 
-static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 {
-	int high, batch, max_nr_alloc;
+	int high, base_batch, batch, max_nr_alloc;
+	int high_max, high_min;
 
-	high = READ_ONCE(pcp->high_min);
-	batch = READ_ONCE(pcp->batch);
+	base_batch = READ_ONCE(pcp->batch);
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
 
 	/* Check for PCP disabled or boot pageset */
-	if (unlikely(high < batch))
+	if (unlikely(high < base_batch))
 		return 1;
 
+	if (order)
+		batch = base_batch;
+	else
+		batch = (base_batch << pcp->alloc_factor);
+
 	/*
-	 * Double the number of pages allocated each time there is subsequent
-	 * allocation of order-0 pages without any freeing.
+	 * With a larger pcp->high, we could avoid allocating from the
+	 * zone.
 	 */
+	if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+		high = pcp->high = min(high + batch, high_max);
+
 	if (!order) {
-		max_nr_alloc = max(high - pcp->count - batch, batch);
-		batch <<= pcp->alloc_factor;
+		max_nr_alloc = max(high - pcp->count - base_batch, base_batch);
+		/*
+		 * Double the number of pages allocated each time there is
+		 * subsequent allocation of order-0 pages without any freeing.
+		 */
 		if (batch <= max_nr_alloc &&
 		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
 			pcp->alloc_factor++;
@@ -2733,7 +2802,7 @@  struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 	do {
 		if (list_empty(list)) {
-			int batch = nr_pcp_alloc(pcp, order);
+			int batch = nr_pcp_alloc(pcp, zone, order);
 			int alloced;
 
 			alloced = rmqueue_bulk(zone, order,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..2f716ad14168 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,9 +814,7 @@  static int refresh_cpu_vm_stats(bool do_pagesets)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
 		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
@@ -832,10 +830,12 @@  static int refresh_cpu_vm_stats(bool do_pagesets)
 #endif
 			}
 		}
-#ifdef CONFIG_NUMA
 
 		if (do_pagesets) {
 			cond_resched();
+
+			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+#ifdef CONFIG_NUMA
 			/*
 			 * Deal with draining the remote pageset of this
 			 * processor
@@ -862,8 +862,8 @@  static int refresh_cpu_vm_stats(bool do_pagesets)
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
 				changes++;
 			}
-		}
 #endif
+		}
 	}
 
 	for_each_online_pgdat(pgdat) {
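
To illustrate the decay arithmetic in decay_pcp_high() above, here is a
minimal userspace sketch. It is not kernel code. struct pcp_model,
model_decay() and BATCH_SCALE_MAX are hypothetical names made up for this
example, and the value 5 is only an assumed stand-in for
CONFIG_PCP_BATCH_SCALE_MAX. Locking and the real free_pcppages_bulk() are
left out; the drain is modelled by subtracting from count.

```c
/* Assumed stand-in for CONFIG_PCP_BATCH_SCALE_MAX; 5 is only an example value. */
#define BATCH_SCALE_MAX 5

/* Hypothetical, simplified stand-in for struct per_cpu_pages. */
struct pcp_model {
	int count;	/* pages currently on the PCP lists */
	int high;	/* auto-tuned upper limit */
	int high_min;	/* lower bound for high */
	int batch;	/* base batch size */
};

static int max3_int(int a, int b, int c)
{
	int m = a > b ? a : b;

	return m > c ? m : c;
}

/*
 * One decay step, following the arithmetic in decay_pcp_high().
 * Returns the number of pages that would be drained to the zone.
 */
static int model_decay(struct pcp_model *p)
{
	int to_drain;

	/*
	 * Lower high by 1/8 per step, but never below high_min and never
	 * so far below count that more than (batch << BATCH_SCALE_MAX)
	 * pages would have to be drained in one step.
	 */
	if (p->high > p->high_min)
		p->high = max3_int(p->count - (p->batch << BATCH_SCALE_MAX),
				   p->high - (p->high >> 3), p->high_min);

	to_drain = p->count - p->high;
	if (to_drain <= 0)
		return 0;

	p->count -= to_drain;	/* stands in for free_pcppages_bulk() */
	return to_drain;
}
```

With batch = 63, high_min = 100 and an empty list (count = 0), starting
from high = 800 the model steps high through 700, 613 and 537, that is,
roughly a 12.5% reduction per period. With count = 5000 and high = 1000
the count-based term dominates, so a single step drains only
63 << 5 = 2016 pages and leaves high at 2984.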