[00/10] mm: PCP high auto-tuning

Message ID 20230920061856.257597-1-ying.huang@intel.com
Series: mm: PCP high auto-tuning

Message

Huang, Ying Sept. 20, 2023, 6:18 a.m. UTC
  The page allocation performance requirements of different workloads
are often different.  So, we need to tune the PCP (Per-CPU Pageset)
high on each CPU automatically to optimize the page allocation
performance.

The list of patches in the series is as follows,

 1 mm, pcp: avoid to drain PCP when process exit
 2 cacheinfo: calculate per-CPU data cache size
 3 mm, pcp: reduce lock contention for draining high-order pages
 4 mm: restrict the pcp batch scale factor to avoid too long latency
 5 mm, page_alloc: scale the number of pages that are batch allocated
 6 mm: add framework for PCP high auto-tuning
 7 mm: tune PCP high automatically
 8 mm, pcp: decrease PCP high if free pages < high watermark
 9 mm, pcp: avoid to reduce PCP high unnecessarily
10 mm, pcp: reduce detecting time of consecutive high order page freeing

Patches 1/2/3 optimize the PCP draining for consecutive high-order page
freeing.

Patches 4/5 optimize batch freeing and allocation.

Patches 6/7/8/9 implement and optimize a PCP high auto-tuning method.

Patch 10 optimizes the PCP draining for consecutive high-order page
freeing based on PCP high auto-tuning.

The test results for patches with performance impact are as follows,

kbuild
======

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.

	build time	zone lock%	free_high	alloc_zone
	----------	----------	---------	----------
base	     100.0	      43.6          100.0            100.0
patch1	      96.6	      40.3	     49.2	      95.2
patch3	      96.4	      40.5	     11.3	      95.1
patch5	      96.1	      37.9	     13.3	      96.8
patch7	      86.4	       9.8	      6.2	      22.0
patch9	      85.9	       9.4	      4.8	      16.3
patch10	      87.7	      12.6	     29.0	      32.3

The PCP draining optimization (patches 1/3) improves performance a
little.  The PCP batch allocation optimization (patch 5) reduces zone
lock contention a little.  The PCP high auto-tuning (patches 7/9)
improves performance greatly: the tuning target, the number of pages
allocated from the zone, reduces greatly, and with it the zone lock
contention cycles%.  The further PCP draining optimization (patch 10)
based on PCP tuning reduces the performance a little, but it benefits
network workloads, as shown below.

With the PCP tuning patches (patches 7/9/10), the maximum memory usage
during the test increases by up to 50.6% because more pages are cached
in the PCP.  But by the end of the test, memory usage decreases to the
same level as that of the base kernel.  That is, the pages cached in
the PCP are released back to the zone once they are no longer used
actively.

netperf SCTP_STREAM_MANY
========================

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64 pairs of
processes.

	     score	zone lock%	free_high	alloc_zone  cache miss rate%
	     -----	----------	---------	----------  ----------------
base	     100.0	       2.0          100.0            100.0	         1.3
patch1	      99.7	       2.0	     99.7	      99.7		 1.3
patch3	     105.5	       1.2	     13.2	     105.4		 1.2
patch5	     106.9	       1.2	     13.4	     106.9		 1.3
patch7	     103.5	       1.8	      6.8	      90.8		 7.6
patch9	     103.7	       1.8	      6.6	      89.8		 7.7
patch10	     106.9	       1.2	     13.5	     106.9		 1.2

The PCP draining optimization (patches 1/3) improves performance.  The
PCP high auto-tuning (patches 7/9) reduces performance a little because
PCP draining sometimes cannot be triggered in time, so the cache miss
rate% increases.  The further PCP draining optimization (patch 10)
based on PCP tuning restores the performance.

lmbench3 UNIX (AF_UNIX)
=======================

On a 2-socket Intel server with 128 logical CPUs, we tested the UNIX
(AF_UNIX socket) test case of the lmbench3 test suite with 16 pairs of
processes.

	     score	zone lock%	free_high	alloc_zone  cache miss rate%
	     -----	----------	---------	----------  ----------------
base	     100.0	      50.0          100.0            100.0	         0.3
patch1	     117.1	      45.8           72.6	     108.9	         0.2
patch3	     201.6	      21.2            7.4	     111.5	         0.2
patch5	     201.9	      20.9            7.5	     112.7	         0.3
patch7	     194.2	      19.3            7.3	     111.5	         2.9
patch9	     193.1	      19.2            7.2	     110.4	         2.9
patch10	     196.8	      21.0            7.4	     111.2	         2.1

The PCP draining optimization (patches 1/3) improves performance
greatly.  The PCP tuning (patches 7/9) reduces performance a little
because PCP draining sometimes cannot be triggered in time.  The
further PCP draining optimization (patch 10) based on PCP tuning
partly restores the performance.

The patchset adds several fields to struct per_cpu_pages.  The struct
layout before/after the patchset is as follows,

base
====

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        high;                 /*     8     4 */
	int                        batch;                /*    12     4 */
	short int                  free_factor;          /*    16     2 */
	short int                  expire;               /*    18     2 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           lists[13];            /*    24   208 */

	/* size: 256, cachelines: 4, members: 7 */
	/* sum members: 228, holes: 1, sum holes: 4 */
	/* padding: 24 */
} __attribute__((__aligned__(64)));

patched
=======

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        count_min;            /*     8     4 */
	int                        high;                 /*    12     4 */
	int                        high_min;             /*    16     4 */
	int                        high_max;             /*    20     4 */
	int                        batch;                /*    24     4 */
	u8                         flags;                /*    28     1 */
	u8                         alloc_factor;         /*    29     1 */
	u8                         expire;               /*    30     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  free_count;           /*    32     2 */

	/* XXX 6 bytes hole, try to pack */

	struct list_head           lists[13];            /*    40   208 */

	/* size: 256, cachelines: 4, members: 12 */
	/* sum members: 241, holes: 2, sum holes: 7 */
	/* padding: 8 */
} __attribute__((__aligned__(64)));

The size of the struct doesn't change with the patchset.
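
To make the new fields more concrete, below is a minimal user-space
sketch of the kind of grow/decay heuristic the series implements
(patches 6-9).  The field names mirror the pahole output above, but
the update rules and constants are simplified assumptions, not the
actual patch logic:

/* Illustrative model only: grow the PCP high limit under allocation
 * pressure, decay it back toward high_min when the CPU goes idle. */
#include <stdio.h>

#define MIN(a, b)	((a) < (b) ? (a) : (b))
#define MAX(a, b)	((a) > (b) ? (a) : (b))

struct pcp_model {
	int high;	/* current, auto-tuned limit */
	int high_min;	/* floor, e.g. the old static default */
	int high_max;	/* ceiling, e.g. derived from cache size */
	int batch;	/* refill/drain granularity */
};

/* On a PCP refill: sustained demand raises "high", so more pages stay
 * cached per CPU and fewer allocations need to take zone->lock. */
static void on_refill(struct pcp_model *p)
{
	p->high = MIN(p->high + p->batch, p->high_max);
}

/* Periodic decay (e.g. from the vmstat worker): an idle CPU gives its
 * cached pages back to the zone by shrinking "high". */
static void on_idle_decay(struct pcp_model *p)
{
	p->high = MAX(p->high - p->high / 8, p->high_min);
}

int main(void)
{
	struct pcp_model p = {
		.high = 64, .high_min = 64, .high_max = 4096, .batch = 63,
	};
	int i;

	for (i = 0; i < 20; i++)
		on_refill(&p);		/* allocation burst */
	printf("after burst: high=%d\n", p.high);

	for (i = 0; i < 20; i++)
		on_idle_decay(&p);	/* idle periods */
	printf("after idle:  high=%d\n", p.high);
	return 0;
}

This mirrors the behavior described above: during the kbuild test
"high" grows so allocations stay on the per-CPU lists, and after the
test the cached pages decay back to the zone.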

Best Regards,
Huang, Ying
  

Comments

Andrew Morton Sept. 20, 2023, 4:41 p.m. UTC | #1
On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <ying.huang@intel.com> wrote:

> The page allocation performance requirements of different workloads
> are often different.  So, we need to tune the PCP (Per-CPU Pageset)
> high on each CPU automatically to optimize the page allocation
> performance.

Some of the performance changes here are downright scary.

I've never been very sure that percpu pages was very beneficial (and
hey, I invented the thing back in the Mesozoic era).  But these numbers
make me think it's very important and we should have been paying more
attention.

> The list of patches in the series is as follows,
> 
>  1 mm, pcp: avoid to drain PCP when process exit
>  2 cacheinfo: calculate per-CPU data cache size
>  3 mm, pcp: reduce lock contention for draining high-order pages
>  4 mm: restrict the pcp batch scale factor to avoid too long latency
>  5 mm, page_alloc: scale the number of pages that are batch allocated
>  6 mm: add framework for PCP high auto-tuning
>  7 mm: tune PCP high automatically
>  8 mm, pcp: decrease PCP high if free pages < high watermark
>  9 mm, pcp: avoid to reduce PCP high unnecessarily
> 10 mm, pcp: reduce detecting time of consecutive high order page freeing
> 
> Patches 1/2/3 optimize the PCP draining for consecutive high-order page
> freeing.
> 
> Patches 4/5 optimize batch freeing and allocation.
> 
> Patches 6/7/8/9 implement and optimize a PCP high auto-tuning method.
> 
> Patch 10 optimizes the PCP draining for consecutive high-order page
> freeing based on PCP high auto-tuning.
> 
> The test results for patches with performance impact are as follows,
> 
> kbuild
> ======
> 
> On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
> one socket with `make -j 112`.
> 
> 	build time	zone lock%	free_high	alloc_zone
> 	----------	----------	---------	----------
> base	     100.0	      43.6          100.0            100.0
> patch1	      96.6	      40.3	     49.2	      95.2
> patch3	      96.4	      40.5	     11.3	      95.1
> patch5	      96.1	      37.9	     13.3	      96.8
> patch7	      86.4	       9.8	      6.2	      22.0
> patch9	      85.9	       9.4	      4.8	      16.3
> patch10	      87.7	      12.6	     29.0	      32.3

You're seriously saying that kbuild got 12% faster?

I see that [07/10] (autotuning) alone sped up kbuild by 10%?

Other thoughts:

- What if any facilities are provided to permit users/developers to
  monitor the operation of the autotuning algorithm?

- I'm not seeing any Documentation/ updates.  Surely there are things
  we can tell users?

- This:

  : It's possible that PCP high auto-tuning doesn't work well for some
  : workloads.  So, when PCP high is tuned by hand via the sysctl knob,
  : the auto-tuning will be disabled.  The PCP high set by hand will be
  : used instead.

  Is it a bit hacky to disable autotuning when the user alters
  pcp-high?  Would it be cleaner to have a separate on/off knob for
  autotuning?

  And how is the user to determine that "PCP high auto-tuning doesn't work
  well" for their workload?
  
Huang, Ying Sept. 21, 2023, 1:32 p.m. UTC | #2
Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <ying.huang@intel.com> wrote:
>
>> The page allocation performance requirements of different workloads
>> are often different.  So, we need to tune the PCP (Per-CPU Pageset)
>> high on each CPU automatically to optimize the page allocation
>> performance.
>
> Some of the performance changes here are downright scary.
>
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era).  But these numbers
> make me think it's very important and we should have been paying more
> attention.
>
>> The list of patches in the series is as follows,
>> 
>>  1 mm, pcp: avoid to drain PCP when process exit
>>  2 cacheinfo: calculate per-CPU data cache size
>>  3 mm, pcp: reduce lock contention for draining high-order pages
>>  4 mm: restrict the pcp batch scale factor to avoid too long latency
>>  5 mm, page_alloc: scale the number of pages that are batch allocated
>>  6 mm: add framework for PCP high auto-tuning
>>  7 mm: tune PCP high automatically
>>  8 mm, pcp: decrease PCP high if free pages < high watermark
>>  9 mm, pcp: avoid to reduce PCP high unnecessarily
>> 10 mm, pcp: reduce detecting time of consecutive high order page freeing
>> 
>> Patches 1/2/3 optimize the PCP draining for consecutive high-order page
>> freeing.
>> 
>> Patches 4/5 optimize batch freeing and allocation.
>> 
>> Patches 6/7/8/9 implement and optimize a PCP high auto-tuning method.
>> 
>> Patch 10 optimizes the PCP draining for consecutive high-order page
>> freeing based on PCP high auto-tuning.
>> 
>> The test results for patches with performance impact are as follows,
>> 
>> kbuild
>> ======
>> 
>> On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
>> one socket with `make -j 112`.
>> 
>> 	build time	zone lock%	free_high	alloc_zone
>> 	----------	----------	---------	----------
>> base	     100.0	      43.6          100.0            100.0
>> patch1	      96.6	      40.3	     49.2	      95.2
>> patch3	      96.4	      40.5	     11.3	      95.1
>> patch5	      96.1	      37.9	     13.3	      96.8
>> patch7	      86.4	       9.8	      6.2	      22.0
>> patch9	      85.9	       9.4	      4.8	      16.3
>> patch10	      87.7	      12.6	     29.0	      32.3
>
> You're seriously saying that kbuild got 12% faster?
>
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?

Thank you very much for questioning!

I double-checked my test results and configuration and found that I
had used an uncommon configuration.  So the description of the test
should have been,

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild
with `numactl -m 1 -- make -j 112`.

This makes processes running on socket 0 use the normal zone of
socket 1.  The remote accesses to zone->lock cause heavy lock
contention.

I apologize for any confusion caused by the above test results.

If we test kbuild with `make -j 224` on the machine, the test results
become,

	build time	     lock%	free_high	alloc_zone
	----------	----------	---------	----------
base	     100.0	      16.8          100.0            100.0
patch5	      99.2	      13.9	      9.5	      97.0
patch7	      98.5	       5.4	      4.8	      19.2

Although the lock contention cycles%, the PCP draining for high-order
freeing, and the allocation from the zone all reduce greatly, the
build time almost doesn't change.

We also tested kbuild in the following way: we created 8 cgroups and
ran `make -j 28` in each cgroup.  That is, the total parallelism is
the same, but the LRU lock contention can be eliminated via cgroups.
And, the single-process link stage takes a smaller proportion of the
run relative to the parallel compiling stage.  This isn't common for
personal usage, but it can be used by something like the 0Day kbuild
service.  The test result is as follows,

	build time	     lock%	free_high	alloc_zone
	----------	----------	---------	----------
base	     100.0	      14.2          100.0            100.0
patch5	      98.5	       8.5	      8.1	      97.1
patch7	      95.0	       0.7	      3.0	      19.0

The lock contention cycles% reduces to nearly 0 because the LRU lock
contention is eliminated too.  The build time reduction becomes
visible too.  We will continue to run a full test with this
configuration.

> Other thoughts:
>
> - What if any facilities are provided to permit users/developers to
>   monitor the operation of the autotuning algorithm?

/proc/zoneinfo can be used to observe PCP high and count for each CPU.
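
For example, the per-CPU pagesets appear under each zone in
/proc/zoneinfo, so the auto-tuned high can be watched per CPU (the
values below are illustrative only):

  pagesets
    cpu: 0
              count: 245
              high:  529
              batch: 63
    cpu: 1
              count: 12
              high:  64
              batch: 63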

> - I'm not seeing any Documentation/ updates.  Surely there are things
>   we can tell users?

I will think about that.

> - This:
>
>   : It's possible that PCP high auto-tuning doesn't work well for some
>   : workloads.  So, when PCP high is tuned by hand via the sysctl knob,
>   : the auto-tuning will be disabled.  The PCP high set by hand will be
>   : used instead.
>
>   Is it a bit hacky to disable autotuning when the user alters
>   pcp-high?  Would it be cleaner to have a separate on/off knob for
>   autotuning?

This was suggested by Mel Gorman,

https://lore.kernel.org/linux-mm/20230714140710.5xbesq6xguhcbyvi@techsingularity.net/

"
I'm not opposed to having an adaptive pcp->high in concept. I think it would
be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
though. I expect that users of that tunable are rare and that if it *is*
used that there is a very good reason for it.
"

Do you think that this is reasonable?

>   And how is the user to determine that "PCP high auto-tuning doesn't work
>   well" for their workload?

One way is to check the perf profiling results.  If there is still
heavy zone lock contention, the PCP high auto-tuning isn't working
well enough to eliminate it, and users may try to tune PCP high by
hand.
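
For example (the numbers and call chain below are illustrative only),
heavy zone lock contention typically shows up in the profile as cycles
spent in the lock slow path under the PCP refill/drain functions:

  # perf record -a -g -- sleep 10
  # perf report
    43.6%  native_queued_spin_lock_slowpath
           |--free_pcppages_bulk
           |--rmqueue_bulk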

--
Best Regards,
Huang, Ying
  
Andrew Morton Sept. 21, 2023, 3:46 p.m. UTC | #3
On Thu, 21 Sep 2023 21:32:35 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> >   : It's possible that PCP high auto-tuning doesn't work well for some
> >   : workloads.  So, when PCP high is tuned by hand via the sysctl knob,
> >   : the auto-tuning will be disabled.  The PCP high set by hand will be
> >   : used instead.
> >
> >   Is it a bit hacky to disable autotuning when the user alters
> >   pcp-high?  Would it be cleaner to have a separate on/off knob for
> >   autotuning?
> 
> This was suggested by Mel Gorman,
> 
> https://lore.kernel.org/linux-mm/20230714140710.5xbesq6xguhcbyvi@techsingularity.net/
> 
> "
> I'm not opposed to having an adaptive pcp->high in concept. I think it would
> be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
> though. I expect that users of that tunable are rare and that if it *is*
> used that there is a very good reason for it.
> "
> 
> Do you think that this is reasonable?

I suppose so, if it's documented!

Documentation/admin-guide/sysctl/vm.rst describes
percpu_pagelist_high_fraction.
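
For example, `sysctl -w vm.percpu_pagelist_high_fraction=8` makes each
zone's PCP high a fixed fraction of the zone's pages (8 is the
documented minimum), and with this series applied it would also
disable the auto-tuning; writing 0 restores the default, auto-tuned
behavior.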
  
Huang, Ying Sept. 22, 2023, 12:33 a.m. UTC | #4
Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 21 Sep 2023 21:32:35 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> >   : It's possible that PCP high auto-tuning doesn't work well for some
>> >   : workloads.  So, when PCP high is tuned by hand via the sysctl knob,
>> >   : the auto-tuning will be disabled.  The PCP high set by hand will be
>> >   : used instead.
>> >
>> >   Is it a bit hacky to disable autotuning when the user alters
>> >   pcp-high?  Would it be cleaner to have a separate on/off knob for
>> >   autotuning?
>> 
>> This was suggested by Mel Gorman,
>> 
>> https://lore.kernel.org/linux-mm/20230714140710.5xbesq6xguhcbyvi@techsingularity.net/
>> 
>> "
>> I'm not opposed to having an adaptive pcp->high in concept. I think it would
>> be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
>> though. I expect that users of that tunable are rare and that if it *is*
>> used that there is a very good reason for it.
>> "
>> 
>> Do you think that this is reasonable?
>
> I suppose so, if it's documented!
>
> Documentation/admin-guide/sysctl/vm.rst describes
> percpu_pagelist_high_fraction.

Sure.  I will add documentation about the auto-tuning behavior to the
above document.

--
Best Regards,
Huang, Ying
  
Mel Gorman Oct. 11, 2023, 1:05 p.m. UTC | #5
On Wed, Sep 20, 2023 at 09:41:18AM -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <ying.huang@intel.com> wrote:
> 
> > The page allocation performance requirements of different workloads
> > are often different.  So, we need to tune the PCP (Per-CPU Pageset)
> > high on each CPU automatically to optimize the page allocation
> > performance.
> 
> Some of the performance changes here are downright scary.
> 
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era).  But these numbers
> make me think it's very important and we should have been paying more
> attention.
> 

FWIW, it is because PCP not only avoids lock contention issues, it
also avoids excessive splitting/merging of buddies as well as the
slower paths of the allocator.  It is not very satisfactory and,
frankly, the whole page allocator needs a revisit to account for very
large zones, but that is far from a trivial project.  PCP just masks
the worst of the issues, and replacing it is far harder than tweaking
it.
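
To illustrate that masking effect, here is a minimal user-space model
(a sketch under assumed names and numbers, not kernel code): a
batch-refilled per-CPU cache turns roughly one shared-lock acquisition
per page into one per batch.

/* Illustrative model only: a per-thread page cache in front of a
 * locked global "zone" free list. */
#include <pthread.h>
#include <stdio.h>

#define BATCH	63

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static long zone_free = 1L << 20;	/* pages in the shared "zone" */
static long lock_acquisitions;

static __thread long pcp_count;		/* per-"CPU" cached pages */

static void alloc_page(void)
{
	if (pcp_count == 0) {
		/* Slow path: refill a batch under the shared lock. */
		pthread_mutex_lock(&zone_lock);
		zone_free -= BATCH;
		lock_acquisitions++;
		pthread_mutex_unlock(&zone_lock);
		pcp_count = BATCH;
	}
	pcp_count--;			/* fast path: no locking */
}

int main(void)
{
	long i;

	for (i = 0; i < 1000000; i++)
		alloc_page();
	printf("allocations: %ld, zone lock acquisitions: %ld\n",
	       i, lock_acquisitions);
	return 0;
}

Running it shows roughly one lock acquisition per 63 pages instead of
one per page; raising the PCP limits pushes that ratio down further.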

> > The list of patches in series is as follows,
> > 
> >  1 mm, pcp: avoid to drain PCP when process exit
> >  2 cacheinfo: calculate per-CPU data cache size
> >  3 mm, pcp: reduce lock contention for draining high-order pages
> >  4 mm: restrict the pcp batch scale factor to avoid too long latency
> >  5 mm, page_alloc: scale the number of pages that are batch allocated
> >  6 mm: add framework for PCP high auto-tuning
> >  7 mm: tune PCP high automatically
> >  8 mm, pcp: decrease PCP high if free pages < high watermark
> >  9 mm, pcp: avoid to reduce PCP high unnecessarily
> > 10 mm, pcp: reduce detecting time of consecutive high order page freeing
> > 
> > Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
> > freeing.
> > 
> > Patch 4/5 optimize batch freeing and allocating.
> > 
> > Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
> > 
> > Patch 10 optimize the PCP draining for consecutive high order page
> > freeing based on PCP high auto-tuning.
> > 
> > The test results for patches with performance impact are as follows,
> > 
> > kbuild
> > ======
> > 
> > On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> > one socket with `make -j 112`.
> > 
> > 	build time	zone lock%	free_high	alloc_zone
> > 	----------	----------	---------	----------
> > base	     100.0	      43.6          100.0            100.0
> > patch1	      96.6	      40.3	     49.2	      95.2
> > patch3	      96.4	      40.5	     11.3	      95.1
> > patch5	      96.1	      37.9	     13.3	      96.8
> > patch7	      86.4	       9.8	      6.2	      22.0
> > patch9	      85.9	       9.4	      4.8	      16.3
> > patch10	      87.7	      12.6	     29.0	      32.3
> 
> You're seriously saying that kbuild got 12% faster?
> 
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?
> 
> Other thoughts:
> 
> - What if any facilities are provided to permit users/developers to
>   monitor the operation of the autotuning algorithm?
> 

Not that I've seen yet, but I'm still partway through the series.  It
could be monitored with tracepoints, but it can also be inferred from
lock contention issues.  I think monitoring this closely would only be
meaningful to developers, at least that's what I think now.  Honestly,
I'm more worried about potential changes in behaviour depending on the
exact CPU and cache implementation than I am about being able to
actively monitor it.

> - I'm not seeing any Documentation/ updates.  Surely there are things
>   we can tell users?
> 
> - This:
> 
>   : It's possible that PCP high auto-tuning doesn't work well for some
>   : workloads.  So, when PCP high is tuned by hand via the sysctl knob,
>   : the auto-tuning will be disabled.  The PCP high set by hand will be
>   : used instead.
> 
>   Is it a bit hacky to disable autotuning when the user alters
>   pcp-high?  Would it be cleaner to have a separate on/off knob for
>   autotuning?
> 

It might be but tuning the allocator is very specific and once we
introduce that tunable, we're probably stuck with it. I would prefer to
see it introduced if and only if we have to.

>   And how is the user to determine that "PCP high auto-tuning doesn't work
>   well" for their workload?

Not easily.  It may manifest as variable lock contention issues when
the workload is at a steady state, but that would increase the
pressure to split the allocator away from being zone-based entirely
instead of tweaking PCP further.