Message ID | 20230710065325.290366-3-ying.huang@intel.com |
---|---|
State | New |
Headers |
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven <arjan@linux.intel.com>, Huang Ying <ying.huang@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mgorman@techsingularity.net>, Vlastimil Babka <vbabka@suse.cz>, David Hildenbrand <david@redhat.com>, Johannes Weiner <jweiner@redhat.com>, Dave Hansen <dave.hansen@linux.intel.com>, Michal Hocko <mhocko@suse.com>, Pavel Tatashin <pasha.tatashin@soleen.com>, Matthew Wilcox <willy@infradead.org>
Subject: [RFC 2/2] mm: alloc/free depth based PCP high auto-tuning
Date: Mon, 10 Jul 2023 14:53:25 +0800
Message-Id: <20230710065325.290366-3-ying.huang@intel.com>
In-Reply-To: <20230710065325.290366-1-ying.huang@intel.com>
References: <20230710065325.290366-1-ying.huang@intel.com> |
Series |
mm: PCP high auto-tuning
|
|
Commit Message
Huang, Ying
July 10, 2023, 6:53 a.m. UTC
To auto-tune PCP high for each CPU, an allocation/freeing depth based
PCP high auto-tuning algorithm is implemented in this patch.
The basic idea behind the algorithm is to detect the repetitive
allocation and freeing pattern with a short enough period (about 1
second). The period needs to be short to respond quickly to changes
in the allocation and freeing pattern and to control the memory
wasted by unnecessary caching.
To detect the repetitive allocation and freeing pattern, the
alloc/free depth is calculated for each tuning period (1 second) on
each CPU. To calculate the alloc/free depth, we track the alloc
count, which increases for page allocation from PCP and decreases for
page freeing to PCP. The alloc depth is the maximum rise of the alloc
count from an earlier low value to a later high value, while the free
depth is the maximum fall of the alloc count from an earlier high
value to a later low value.
Then, the average alloc/free depth over multiple tuning periods is
calculated, with the old alloc/free depth decaying gradually in the
average.
Finally, the PCP high is set to the smaller of the average alloc
depth and the average free depth, clamped between the default and the
maximum PCP high. In this way, pure allocation or pure freeing will
not enlarge the PCP high, because the PCP doesn't help there.
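The following is a minimal sketch of the depth tracking and tuning
described above, not the patch itself; the structure, the function
names (pcp_depth_state, pcp_track_alloc, pcp_tune_high) and the 3/4
decay factor are invented for illustration only:

/* Illustrative per-CPU state for the alloc/free depth heuristic. */
struct pcp_depth_state {
	int alloc_count;     /* ++ on alloc from PCP, -- on free to PCP */
	int count_min;       /* lowest alloc_count seen this period */
	int count_max;       /* highest alloc_count seen this period */
	int alloc_depth;     /* max rise from an earlier low */
	int free_depth;      /* max fall from an earlier high */
	int alloc_depth_avg; /* decayed averages over periods */
	int free_depth_avg;
};

/* Called on every PCP allocation/free in the sketch. */
static void pcp_track_alloc(struct pcp_depth_state *s, int nr, bool is_alloc)
{
	if (is_alloc) {
		s->alloc_count += nr;
		if (s->alloc_count > s->count_max)
			s->count_max = s->alloc_count;
		/* rise above the earlier minimum contributes to alloc depth */
		if (s->alloc_count - s->count_min > s->alloc_depth)
			s->alloc_depth = s->alloc_count - s->count_min;
	} else {
		s->alloc_count -= nr;
		if (s->alloc_count < s->count_min)
			s->count_min = s->alloc_count;
		/* fall below the earlier maximum contributes to free depth */
		if (s->count_max - s->alloc_count > s->free_depth)
			s->free_depth = s->count_max - s->alloc_count;
	}
}

/* Called once per tuning period (~1s) to update the decayed averages
 * and derive the new PCP high, clamped to [high_def, high_max]. */
static int pcp_tune_high(struct pcp_depth_state *s, int high_def, int high_max)
{
	int high;

	s->alloc_depth_avg = (s->alloc_depth_avg * 3 + s->alloc_depth) / 4;
	s->free_depth_avg = (s->free_depth_avg * 3 + s->free_depth) / 4;

	high = min(s->alloc_depth_avg, s->free_depth_avg);
	high = clamp(high, high_def, high_max);

	/* reset per-period tracking */
	s->count_min = s->count_max = s->alloc_count;
	s->alloc_depth = s->free_depth = 0;

	return high;
}

Because the new high is min(alloc depth, free depth), a pure
allocation stream (free depth near 0) or a pure freeing stream (alloc
depth near 0) keeps the result small, which is how the "PCP doesn't
help" case above avoids growing the cache.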
We have tested the algorithm with several workloads on Intel's
2-socket server machines.
Will-it-scale/page_fault1
=========================
On one socket of the system with 56 cores, 56 workload processes are
run to stress the kernel memory allocator. Each workload process is
put in a different memcg to eliminate the LRU lock contention.
                                   base   optimized
                                   ----   ---------
Throughput (GB/s)                  34.3   75.0
native_queued_spin_lock_slowpath%  60.9   0.2
This is a simple workload in which each process repeatedly allocates
128MB of pages and then frees them. So, it's quite easy to detect its
allocation and freeing pattern and adjust PCP high accordingly. The
optimized kernel almost eliminates the lock contention cycles% (from
60.9% to 0.2%), and its benchmark score increases by 118.7%.
Kbuild
======
"make -j 224" is used to build the kernel in parallel on the 2-socket
server system with 224 logical CPUs.
                                   base     optimized
                                   ----     ---------
Build time (s)                     162.67   162.67
native_queued_spin_lock_slowpath%  17.00    12.28
rmqueue%                           11.53    8.33
free_unref_page_list%              3.85     0.54
folio_lruvec_lock_irqsave%         1.21     1.96
The optimized kernel reduces the cycles% of the page
allocation/freeing functions from ~15.38% to ~8.87% by enlarging the
PCP high when necessary. The lock contention cycles% decreases too.
But the benchmark score shows no visible change; there should be
other bottlenecks.
We also captured /proc/meminfo during the test. After a short
while (about 10s) from the beginning of the test, the
Memused (MemTotal - MemFree) of the optimized kernel is higher than
that of the base kernel because of the increased PCP high. But in the
second half of the test, the Memused of the optimized kernel
decreases to the same level as that of the base kernel. That is, PCP
high is decreased effectively when the page allocation requirements
decrease.
Netperf/SCTP_STREAM_MANY
========================
On another 2-socket server with 128 logical CPUs, 64 netperf
processes are run with netserver running on the same machine (that
is, the loopback network is used).
                               base       optimized
                               ----       ---------
Throughput (MB/s)              7136       8489       +19.0%
vmstat.cpu.id%                 73.05      63.73      -9.3
vmstat.procs.r                 34.1       45.6       +33.5%
meminfo.Memused                5479861    8492875    +55.0%
perf-stat.ps.cpu-cycles        1.04e+11   1.38e+11   +32.3%
perf-stat.ps.instructions      0.96e+11   1.14e+11   +17.8%
perf-profile.free_unref_page%  2.46       1.65       -0.8
latency.99%.__alloc_pages      4.28       2.21       -48.4%
latency.99%.__free_unref_page  4.11       0.87       -78.8%
From the test results, the throughput of the benchmark increases by
19.0%. That comes from the increased CPU cycles and instructions per
second (perf-stat.ps.cpu-cycles and perf-stat.ps.instructions), that
is, reduced CPU idle. And perf-profile shows that the page allocator
cycles% isn't high. So, the reduced CPU idle may come from the
reduction in page allocation/freeing latency, which influences the
network behavior.
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
---
include/linux/mmzone.h | 8 +++++++
mm/page_alloc.c | 50 ++++++++++++++++++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 2 deletions(-)
Comments
On Mon 10-07-23 14:53:25, Huang Ying wrote: > To auto-tune PCP high for each CPU automatically, an > allocation/freeing depth based PCP high auto-tuning algorithm is > implemented in this patch. > > The basic idea behind the algorithm is to detect the repetitive > allocation and freeing pattern with short enough period (about 1 > second). The period needs to be short to respond to allocation and > freeing pattern changes quickly and control the memory wasted by > unnecessary caching. 1s is an ethernity from the allocation POV. Is a time based sampling really a good choice? I would have expected a natural allocation/freeing feedback mechanism. I.e. double the batch size when the batch is consumed and it requires to be refilled and shrink it under memory pressure (GFP_NOWAIT allocation fails) or when the surplus grows too high over batch (e.g. twice as much). Have you considered something as simple as that? Quite honestly I am not sure time based approach is a good choice because memory consumptions tends to be quite bulky (e.g. application starts or workload transitions based on requests). > To detect the repetitive allocation and freeing pattern, the > alloc/free depth is calculated for each tuning period (1 second) on > each CPU. To calculate the alloc/free depth, we track the alloc > count. Which increases for page allocation from PCP and decreases for > page freeing to PCP. The alloc depth is the maximum alloc count > difference between the later large value and former small value. > While, the free depth is the maximum alloc count difference between > the former large value and the later small value. > > Then, the average alloc/free depth in multiple tuning periods is > calculated, with the old alloc/free depth decay in the average > gradually. > > Finally, the PCP high is set to be the smaller value of average alloc > depth and average free depth, after clamped between the default and > the max PCP high. In this way, pure allocation or freeing will not > enlarge the PCP high because PCP doesn't help. > > We have tested the algorithm with several workloads on Intel's > 2-socket server machines. How does this scheme deal with memory pressure?
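For reference, the simple event-driven feedback Michal sketches above
could look roughly like this; pcp_on_refill(), pcp_on_pressure() and
pcp_on_free() are invented names for the sketch, not existing kernel
interfaces, and the growth/shrink factors are only placeholders:

/* Grow the batch when the PCP ran dry and had to be refilled from buddy. */
static void pcp_on_refill(struct per_cpu_pages *pcp, int batch_max)
{
	pcp->batch = min(pcp->batch * 2, batch_max);
}

/* Shrink under memory pressure, e.g. when a GFP_NOWAIT allocation fails. */
static void pcp_on_pressure(struct per_cpu_pages *pcp, int batch_min)
{
	pcp->batch = max(pcp->batch / 2, batch_min);
}

/* Also shrink when the cached surplus grows well beyond the batch. */
static void pcp_on_free(struct per_cpu_pages *pcp, int batch_min)
{
	if (pcp->count > 2 * pcp->batch)
		pcp_on_pressure(pcp, batch_min);
}

No timers are involved: growth is driven by refill events and
shrinking by pressure or surplus events.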
On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: > On Mon 10-07-23 14:53:25, Huang Ying wrote: > > To auto-tune PCP high for each CPU automatically, an > > allocation/freeing depth based PCP high auto-tuning algorithm is > > implemented in this patch. > > > > The basic idea behind the algorithm is to detect the repetitive > > allocation and freeing pattern with short enough period (about 1 > > second). The period needs to be short to respond to allocation and > > freeing pattern changes quickly and control the memory wasted by > > unnecessary caching. > > 1s is an ethernity from the allocation POV. Is a time based sampling > really a good choice? I would have expected a natural allocation/freeing > feedback mechanism. I.e. double the batch size when the batch is > consumed and it requires to be refilled and shrink it under memory > pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > high over batch (e.g. twice as much). Have you considered something as > simple as that? > Quite honestly I am not sure time based approach is a good choice > because memory consumptions tends to be quite bulky (e.g. application > starts or workload transitions based on requests). > I tend to agree. Tuning based on the recent allocation pattern without frees would make more sense and also be symmetric with how free_factor works. I suspect that time-based may be heavily orientated around the will-it-scale benchmark. While I only glanced at this, a few things jumped out 1. Time-based heuristics are not ideal. congestion_wait() and friends was an obvious case where time-based heuristics fell apart even before the event it waited on was removed. For congestion, it happened to work for slow storage for a while but that was about it. For allocation stream detection, it has a similar problem. If a process is allocating heavily, then fine, if it's in bursts of less than a second more than one second apart then it will not adapt. While I do not think it is explicitly mentioned anywhere, my understanding was that heuristics like this within mm/ should be driven by explicit events as much as possible and not time. 2. If time was to be used, it would be cheaper to have the simpliest possible state tracking in the fast paths and decay any resizing of the PCP within the vmstat updates (reuse pcp->expire except it applies to local pcps). Even this is less than ideal as the PCP may be too large for short periods of time but it may also act as a backstop for worst-case behaviour 3. free_factor is an existing mechanism for detecting recent patterns and adapting the PCP sizes. The allocation side should be symmetric and the events that should drive it are "refills" on the alloc side and "drains" on the free side. Initially it might be easier to have a single parameter that scales batch and high up to a limit 4. The amount of state tracked seems excessive and increases the size of the per-cpu structure by more than 1 cache line. That in itself may not be a problem but the state is tracked on every page alloc/free that goes through the fast path and it's relatively complex to track. That is a constant penalty in fast paths that may not may not be relevant to the workload and only sustained bursty allocation streams may offset the cost. 5. Memory pressure and reclaim activity does not appear to be accounted for and it's not clear if pcp->high is bounded or if it's possible for a single PCP to hide a large number of pages from other CPUs sharing the same node. 
The max size of the PCP should probably be explicitly clamped.
Michal Hocko <mhocko@suse.com> writes: > On Mon 10-07-23 14:53:25, Huang Ying wrote: >> To auto-tune PCP high for each CPU automatically, an >> allocation/freeing depth based PCP high auto-tuning algorithm is >> implemented in this patch. >> >> The basic idea behind the algorithm is to detect the repetitive >> allocation and freeing pattern with short enough period (about 1 >> second). The period needs to be short to respond to allocation and >> freeing pattern changes quickly and control the memory wasted by >> unnecessary caching. > > 1s is an ethernity from the allocation POV. Is a time based sampling > really a good choice? I would have expected a natural allocation/freeing > feedback mechanism. I.e. double the batch size when the batch is > consumed and it requires to be refilled and shrink it under memory > pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > high over batch (e.g. twice as much). Have you considered something as > simple as that? > Quite honestly I am not sure time based approach is a good choice > because memory consumptions tends to be quite bulky (e.g. application > starts or workload transitions based on requests). If my understanding were correct, we are targeting different allocation and freeing patterns. Your target pattern (A): - allocate several GB or more (application starts) in short time - allocate and free small number of pages in long time - free several GB or more (application exits) in short time My target pattern (B): - allocate several hundreds MB, free small number, in short time - free several hundreds MB, allocate small number, in short time - repeat the above pattern for long time For pattern A, my patchset will almost not help it. It may be helpful to increase PCP batch (I will do some test to verify that). But, it may be not a good idea to increase PCP high too much? Because we need to put more than several GB in PCP for long time. At the same time, these memory may be needed by other CPUs. It's hard to determine how long should we keep these pages in PCP. My patchset will help pattern B via identify the allocation/freeing depth and increasing PCP high to hold several hundreds MB in PCP. Then most allocation and freeing needs not to call buddy. And when the pattern completes, the PCP high will be restored and pages will be put back in buddy. I can observe the pattern B in kernel building and netperf/SCTP_STREAM_MANY workload. So, the zone lock contention or page allocation/freeing latency reduction can be observed. For example, the alloc_count for kbuild is as follows, +-------------------------------------------------------------------+ 90000 |-+ + + + + + + + + + E| | mm_pcp_alloc_count.0.2.alloc_count ***E****| 85000 |-+ **E * *| | EE E EE E** * E *| 80000 |-+ EE ** * * ** * * *| | E EEEE E * E E * * * E * *| 75000 |-+ E* EE * E * ** E * EE * E* E *| | E * E * * * * * * * E E * * * E* *| 70000 |-+ EE E EE *E E * E *E* E * * * E E| | E E EE *EE E *E E*EEEE E EE *EE | 65000 |EEE**EE EEE EE* EE EE E EEE EEEE EE | | E * E EE EEE E | 60000 |-+ *E* | | EE | 55000 |-+ + + EE + + + + + + | +-------------------------------------------------------------------+ 74 75 76 77 78 79 80 81 82 83 84 The X axis is time in second, the Y axis is alloc count (start from 0, ++ for allocation, -- for freeing). From the figure you can find that from 75s on, the allocation depth and freeing depth is about 30000, while repetitive period is about 1s. IMHO, both pattern A and pattern B exists in reality. 
>> To detect the repetitive allocation and freeing pattern, the >> alloc/free depth is calculated for each tuning period (1 second) on >> each CPU. To calculate the alloc/free depth, we track the alloc >> count. Which increases for page allocation from PCP and decreases for >> page freeing to PCP. The alloc depth is the maximum alloc count >> difference between the later large value and former small value. >> While, the free depth is the maximum alloc count difference between >> the former large value and the later small value. >> >> Then, the average alloc/free depth in multiple tuning periods is >> calculated, with the old alloc/free depth decay in the average >> gradually. >> >> Finally, the PCP high is set to be the smaller value of average alloc >> depth and average free depth, after clamped between the default and >> the max PCP high. In this way, pure allocation or freeing will not >> enlarge the PCP high because PCP doesn't help. >> >> We have tested the algorithm with several workloads on Intel's >> 2-socket server machines. > > How does this scheme deal with memory pressure? If my understanding were correct, ZONE_RECLAIM_ACTIVE will be set when kswapd is waken up (balance_pgdat()->set_reclaim_active()). Then (pcp->batch * 4) will be used as PCP high (free_unref_page_commit() -> nr_pcp_high()). So, the pages in PCP will be reduced. Best Regards, Huang, Ying
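For context, the existing reclaim-time clamp that Ying refers to here
looks roughly like the following, simplified from mm/page_alloc.c of
that era (the exact signature and details differ between kernel
versions):

static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
{
	int high = READ_ONCE(pcp->high);

	if (unlikely(!high))
		return 0;

	/* No reclaim running on this zone: the full high applies. */
	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
		return high;

	/*
	 * Reclaim is active: cap the number of pages that can be
	 * stored on the PCP lists to batch * 4.
	 */
	return min(READ_ONCE(pcp->batch) << 2, high);
}

So while kswapd is active on the zone, free_unref_page_commit()
effectively treats batch * 4 as the high mark regardless of any
auto-tuned value.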
Mel Gorman <mgorman@techsingularity.net> writes: > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: >> On Mon 10-07-23 14:53:25, Huang Ying wrote: >> > To auto-tune PCP high for each CPU automatically, an >> > allocation/freeing depth based PCP high auto-tuning algorithm is >> > implemented in this patch. >> > >> > The basic idea behind the algorithm is to detect the repetitive >> > allocation and freeing pattern with short enough period (about 1 >> > second). The period needs to be short to respond to allocation and >> > freeing pattern changes quickly and control the memory wasted by >> > unnecessary caching. >> >> 1s is an ethernity from the allocation POV. Is a time based sampling >> really a good choice? I would have expected a natural allocation/freeing >> feedback mechanism. I.e. double the batch size when the batch is >> consumed and it requires to be refilled and shrink it under memory >> pressure (GFP_NOWAIT allocation fails) or when the surplus grows too >> high over batch (e.g. twice as much). Have you considered something as >> simple as that? >> Quite honestly I am not sure time based approach is a good choice >> because memory consumptions tends to be quite bulky (e.g. application >> starts or workload transitions based on requests). >> > > I tend to agree. Tuning based on the recent allocation pattern without > frees would make more sense and also be symmetric with how free_factor > works. This sounds good to me. I will give it a try to tune PCP batch. Have some questions about how to tune PCP high based on that. Details are in the following. > I suspect that time-based may be heavily orientated around the > will-it-scale benchmark. I totally agree that will-it-scale isn't a real workload. So, I tried to find some more practical ones. I found that a repetitive allocation/freeing several hundreds MB pages pattern exists in kernel building and netperf/SCTP_STREAM_MANY workloads. More details can be found in my reply to Michal as follows, https://lore.kernel.org/linux-mm/877cr4dydo.fsf@yhuang6-desk2.ccr.corp.intel.com/ > While I only glanced at this, a few things jumped out > > 1. Time-based heuristics are not ideal. congestion_wait() and > friends was an obvious case where time-based heuristics fell apart even > before the event it waited on was removed. For congestion, it happened to > work for slow storage for a while but that was about it. For allocation > stream detection, it has a similar problem. If a process is allocating > heavily, then fine, if it's in bursts of less than a second more than one > second apart then it will not adapt. While I do not think it is explicitly > mentioned anywhere, my understanding was that heuristics like this within > mm/ should be driven by explicit events as much as possible and not time. The proposed heuristics can be changed to be not time-based. When it is found that the allocation/freeing depth is larger than previous value, the PCP high can be increased immediately. We use time-based implementation to try to reduce the overhead. And, we mainly targeted long-time allocation pattern before. > 2. If time was to be used, it would be cheaper to have the simpliest possible > state tracking in the fast paths and decay any resizing of the PCP > within the vmstat updates (reuse pcp->expire except it applies to local > pcps). Even this is less than ideal as the PCP may be too large for short > periods of time but it may also act as a backstop for worst-case behaviour This sounds reasonable. Thanks! 
We will try this if we choose to use time-based implementation. > 3. free_factor is an existing mechanism for detecting recent patterns > and adapting the PCP sizes. The allocation side should be symmetric > and the events that should drive it are "refills" on the alloc side and > "drains" on the free side. Initially it might be easier to have a single > parameter that scales batch and high up to a limit For example, when a workload is started, several GB pages will be allocated. We will observe many "refills", and almost no "drains". So, we will scales batch and high up to a limit. When the workload exits, large number of pages of the workload will be put in PCP because the PCP high is increased. When should we decrease the PCP batch and high? > 4. The amount of state tracked seems excessive and increases the size of > the per-cpu structure by more than 1 cache line. That in itself may not > be a problem but the state is tracked on every page alloc/free that goes > through the fast path and it's relatively complex to track. That is > a constant penalty in fast paths that may not may not be relevant to the > workload and only sustained bursty allocation streams may offset the > cost. Yes. Thanks for pointing this out. We will optimize this if the other aspects of the basic idea could be accepted. > 5. Memory pressure and reclaim activity does not appear to be accounted > for and it's not clear if pcp->high is bounded or if it's possible for > a single PCP to hide a large number of pages from other CPUs sharing the > same node. The max size of the PCP should probably be explicitly clamped. As in my reply to Michal's email, previously I thought ZONE_RECLAIM_ACTIVE will be set for kswap reclaiming, and PCP high will be decreased to (batch * 4) effectively. Or I miss something? There's a upper limit for PCP high, which comes from MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, we calculate the PCP high as if percpu_pagelist_high_fraction is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION (8). Then, we use that value as the upper limit for PCP high. Best Regards, Huang, Ying
On Wed 12-07-23 10:05:26, Mel Gorman wrote: > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: > > On Mon 10-07-23 14:53:25, Huang Ying wrote: > > > To auto-tune PCP high for each CPU automatically, an > > > allocation/freeing depth based PCP high auto-tuning algorithm is > > > implemented in this patch. > > > > > > The basic idea behind the algorithm is to detect the repetitive > > > allocation and freeing pattern with short enough period (about 1 > > > second). The period needs to be short to respond to allocation and > > > freeing pattern changes quickly and control the memory wasted by > > > unnecessary caching. > > > > 1s is an ethernity from the allocation POV. Is a time based sampling > > really a good choice? I would have expected a natural allocation/freeing > > feedback mechanism. I.e. double the batch size when the batch is > > consumed and it requires to be refilled and shrink it under memory > > pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > > high over batch (e.g. twice as much). Have you considered something as > > simple as that? > > Quite honestly I am not sure time based approach is a good choice > > because memory consumptions tends to be quite bulky (e.g. application > > starts or workload transitions based on requests). > > > > I tend to agree. Tuning based on the recent allocation pattern without frees > would make more sense and also be symmetric with how free_factor works. I > suspect that time-based may be heavily orientated around the will-it-scale > benchmark. While I only glanced at this, a few things jumped out > > 1. Time-based heuristics are not ideal. congestion_wait() and > friends was an obvious case where time-based heuristics fell apart even > before the event it waited on was removed. For congestion, it happened to > work for slow storage for a while but that was about it. For allocation > stream detection, it has a similar problem. If a process is allocating > heavily, then fine, if it's in bursts of less than a second more than one > second apart then it will not adapt. While I do not think it is explicitly > mentioned anywhere, my understanding was that heuristics like this within > mm/ should be driven by explicit events as much as possible and not time. Agreed. I would also like to point out that it is also important to realize those events that we should care about. Remember the primary motivation of the tuning is to reduce the lock contention. That being said, it is less of a problem to have stream or bursty demand for memory if that doesn't really cause the said contention, right? So any auto-tuning should consider that as well and do not inflate the batch in an absense of the contention. That of course means that a solely deallocation based monitoring.
On Thu, Jul 13, 2023 at 04:56:54PM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@techsingularity.net> writes: > > > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: > >> On Mon 10-07-23 14:53:25, Huang Ying wrote: > >> > To auto-tune PCP high for each CPU automatically, an > >> > allocation/freeing depth based PCP high auto-tuning algorithm is > >> > implemented in this patch. > >> > > >> > The basic idea behind the algorithm is to detect the repetitive > >> > allocation and freeing pattern with short enough period (about 1 > >> > second). The period needs to be short to respond to allocation and > >> > freeing pattern changes quickly and control the memory wasted by > >> > unnecessary caching. > >> > >> 1s is an ethernity from the allocation POV. Is a time based sampling > >> really a good choice? I would have expected a natural allocation/freeing > >> feedback mechanism. I.e. double the batch size when the batch is > >> consumed and it requires to be refilled and shrink it under memory > >> pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > >> high over batch (e.g. twice as much). Have you considered something as > >> simple as that? > >> Quite honestly I am not sure time based approach is a good choice > >> because memory consumptions tends to be quite bulky (e.g. application > >> starts or workload transitions based on requests). > >> > > > > I tend to agree. Tuning based on the recent allocation pattern without > > frees would make more sense and also be symmetric with how free_factor > > works. > > This sounds good to me. I will give it a try to tune PCP batch. Have > some questions about how to tune PCP high based on that. Details are in > the following. > > > I suspect that time-based may be heavily orientated around the > > will-it-scale benchmark. > > I totally agree that will-it-scale isn't a real workload. So, I tried > to find some more practical ones. I found that a repetitive > allocation/freeing several hundreds MB pages pattern exists in kernel > building and netperf/SCTP_STREAM_MANY workloads. More details can be > found in my reply to Michal as follows, > > https://lore.kernel.org/linux-mm/877cr4dydo.fsf@yhuang6-desk2.ccr.corp.intel.com/ > > > While I only glanced at this, a few things jumped out > > > > 1. Time-based heuristics are not ideal. congestion_wait() and > > friends was an obvious case where time-based heuristics fell apart even > > before the event it waited on was removed. For congestion, it happened to > > work for slow storage for a while but that was about it. For allocation > > stream detection, it has a similar problem. If a process is allocating > > heavily, then fine, if it's in bursts of less than a second more than one > > second apart then it will not adapt. While I do not think it is explicitly > > mentioned anywhere, my understanding was that heuristics like this within > > mm/ should be driven by explicit events as much as possible and not time. > > The proposed heuristics can be changed to be not time-based. When it is > found that the allocation/freeing depth is larger than previous value, > the PCP high can be increased immediately. We use time-based > implementation to try to reduce the overhead. And, we mainly targeted > long-time allocation pattern before. > Time simply has too many corner cases. When it's reset for example, all state is lost so patterns that are longer than the time window are unpredictable. 
It tends to work slightly better than time decays state rather than resets but it gets very hand-wavey. > > 2. If time was to be used, it would be cheaper to have the simpliest possible > > state tracking in the fast paths and decay any resizing of the PCP > > within the vmstat updates (reuse pcp->expire except it applies to local > > pcps). Even this is less than ideal as the PCP may be too large for short > > periods of time but it may also act as a backstop for worst-case behaviour > > This sounds reasonable. Thanks! We will try this if we choose to use > time-based implementation. > > > 3. free_factor is an existing mechanism for detecting recent patterns > > and adapting the PCP sizes. The allocation side should be symmetric > > and the events that should drive it are "refills" on the alloc side and > > "drains" on the free side. Initially it might be easier to have a single > > parameter that scales batch and high up to a limit > > For example, when a workload is started, several GB pages will be > allocated. We will observe many "refills", and almost no "drains". So, > we will scales batch and high up to a limit. When the workload exits, > large number of pages of the workload will be put in PCP because the PCP > high is increased. When should we decrease the PCP batch and high? > Honestly, I'm not 100% certain as I haven't spent time with paper to sketch out the different combinations with "all allocs" at one end, "all frees" at the other and "ratio of alloc:free" in the middle. Intuitively I would expect the time to shrink is when there is a mix. All allocs -- maximise batch and high All frees -- maximise batch and high Mix -- adjust high to appoximate the minimum value of high such that a drain/refill does not occur or rarely occurs Batch should have a much lower maximum than high because it's a deferred cost that gets assigned to an arbitrary task. The worst case is where a process that is a light user of the allocator incurs the full cost of a refill/drain. Again, intuitively this may be PID Control problem for the "Mix" case to estimate the size of high required to minimise drains/allocs as each drain/alloc is potentially a lock contention. The catchall for corner cases would be to decay high from vmstat context based on pcp->expires. The decay would prevent the "high" being pinned at an artifically high value without any zone lock contention for prolonged periods of time and also mitigate worst-case due to state being per-cpu. The downside is that "high" would also oscillate for a continuous steady allocation pattern as the PID control might pick an ideal value suitable for a long period of time with the "decay" disrupting that ideal value. > > 4. The amount of state tracked seems excessive and increases the size of > > the per-cpu structure by more than 1 cache line. That in itself may not > > be a problem but the state is tracked on every page alloc/free that goes > > through the fast path and it's relatively complex to track. That is > > a constant penalty in fast paths that may not may not be relevant to the > > workload and only sustained bursty allocation streams may offset the > > cost. > > Yes. Thanks for pointing this out. We will optimize this if the other > aspects of the basic idea could be accepted. > I'm not opposed to having an adaptive pcp->high in concept. I think it would be best to disable adaptive tuning if percpu_pagelist_high_fraction is set though. 
I expect that users of that tunable are rare and that if it *is* used that there is a very good reason for it. > > 5. Memory pressure and reclaim activity does not appear to be accounted > > for and it's not clear if pcp->high is bounded or if it's possible for > > a single PCP to hide a large number of pages from other CPUs sharing the > > same node. The max size of the PCP should probably be explicitly clamped. > > As in my reply to Michal's email, previously I thought > ZONE_RECLAIM_ACTIVE will be set for kswap reclaiming, and PCP high will > be decreased to (batch * 4) effectively. Or I miss something? > I don't think you did, but it deserves a big comment in the tuning at minimum and potentially even disabling adaptive tuning entirely if reclaim is active.
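A rough sketch of the vmstat-side decay Mel suggests above; it is
illustrative only, the helper name and the halving rate are invented,
and hooking it into the periodic vmstat update (alongside the
existing pcp->expire handling) is an assumption:

/*
 * Called for the local CPU from the periodic vmstat update. Any boost
 * applied to pcp->high decays back toward the default over a few
 * intervals once refills/drains stop renewing it.
 */
static void decay_pcp_high(struct per_cpu_pages *pcp, int high_default)
{
	if (pcp->high <= high_default)
		return;

	/* halve the boost each interval instead of resetting it outright */
	pcp->high = high_default + (pcp->high - high_default) / 2;
}

The decay acts as the backstop described above: a sustained mix of
allocations and frees keeps renewing the boost, while a CPU that has
gone quiet hands its cached pages back to buddy within a few vmstat
intervals.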
Mel Gorman <mgorman@techsingularity.net> writes: > On Thu, Jul 13, 2023 at 04:56:54PM +0800, Huang, Ying wrote: >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: >> >> On Mon 10-07-23 14:53:25, Huang Ying wrote: >> >> > To auto-tune PCP high for each CPU automatically, an >> >> > allocation/freeing depth based PCP high auto-tuning algorithm is >> >> > implemented in this patch. >> >> > >> >> > The basic idea behind the algorithm is to detect the repetitive >> >> > allocation and freeing pattern with short enough period (about 1 >> >> > second). The period needs to be short to respond to allocation and >> >> > freeing pattern changes quickly and control the memory wasted by >> >> > unnecessary caching. >> >> >> >> 1s is an ethernity from the allocation POV. Is a time based sampling >> >> really a good choice? I would have expected a natural allocation/freeing >> >> feedback mechanism. I.e. double the batch size when the batch is >> >> consumed and it requires to be refilled and shrink it under memory >> >> pressure (GFP_NOWAIT allocation fails) or when the surplus grows too >> >> high over batch (e.g. twice as much). Have you considered something as >> >> simple as that? >> >> Quite honestly I am not sure time based approach is a good choice >> >> because memory consumptions tends to be quite bulky (e.g. application >> >> starts or workload transitions based on requests). >> >> >> > >> > I tend to agree. Tuning based on the recent allocation pattern without >> > frees would make more sense and also be symmetric with how free_factor >> > works. >> >> This sounds good to me. I will give it a try to tune PCP batch. Have >> some questions about how to tune PCP high based on that. Details are in >> the following. >> >> > I suspect that time-based may be heavily orientated around the >> > will-it-scale benchmark. >> >> I totally agree that will-it-scale isn't a real workload. So, I tried >> to find some more practical ones. I found that a repetitive >> allocation/freeing several hundreds MB pages pattern exists in kernel >> building and netperf/SCTP_STREAM_MANY workloads. More details can be >> found in my reply to Michal as follows, >> >> https://lore.kernel.org/linux-mm/877cr4dydo.fsf@yhuang6-desk2.ccr.corp.intel.com/ >> >> > While I only glanced at this, a few things jumped out >> > >> > 1. Time-based heuristics are not ideal. congestion_wait() and >> > friends was an obvious case where time-based heuristics fell apart even >> > before the event it waited on was removed. For congestion, it happened to >> > work for slow storage for a while but that was about it. For allocation >> > stream detection, it has a similar problem. If a process is allocating >> > heavily, then fine, if it's in bursts of less than a second more than one >> > second apart then it will not adapt. While I do not think it is explicitly >> > mentioned anywhere, my understanding was that heuristics like this within >> > mm/ should be driven by explicit events as much as possible and not time. >> >> The proposed heuristics can be changed to be not time-based. When it is >> found that the allocation/freeing depth is larger than previous value, >> the PCP high can be increased immediately. We use time-based >> implementation to try to reduce the overhead. And, we mainly targeted >> long-time allocation pattern before. >> > > Time simply has too many corner cases. 
When it's reset for example, all > state is lost so patterns that are longer than the time window are > unpredictable. It tends to work slightly better than time decays state > rather than resets but it gets very hand-wavey. > >> > 2. If time was to be used, it would be cheaper to have the simpliest possible >> > state tracking in the fast paths and decay any resizing of the PCP >> > within the vmstat updates (reuse pcp->expire except it applies to local >> > pcps). Even this is less than ideal as the PCP may be too large for short >> > periods of time but it may also act as a backstop for worst-case behaviour >> >> This sounds reasonable. Thanks! We will try this if we choose to use >> time-based implementation. >> >> > 3. free_factor is an existing mechanism for detecting recent patterns >> > and adapting the PCP sizes. The allocation side should be symmetric >> > and the events that should drive it are "refills" on the alloc side and >> > "drains" on the free side. Initially it might be easier to have a single >> > parameter that scales batch and high up to a limit >> >> For example, when a workload is started, several GB pages will be >> allocated. We will observe many "refills", and almost no "drains". So, >> we will scales batch and high up to a limit. When the workload exits, >> large number of pages of the workload will be put in PCP because the PCP >> high is increased. When should we decrease the PCP batch and high? >> > > Honestly, I'm not 100% certain as I haven't spent time with paper to > sketch out the different combinations with "all allocs" at one end, "all > frees" at the other and "ratio of alloc:free" in the middle. Intuitively > I would expect the time to shrink is when there is a mix. > > All allocs -- maximise batch and high > All frees -- maximise batch and high > Mix -- adjust high to appoximate the minimum value of high such > that a drain/refill does not occur or rarely occurs > > Batch should have a much lower maximum than high because it's a deferred cost > that gets assigned to an arbitrary task. The worst case is where a process > that is a light user of the allocator incurs the full cost of a refill/drain. > > Again, intuitively this may be PID Control problem for the "Mix" case > to estimate the size of high required to minimise drains/allocs as each > drain/alloc is potentially a lock contention. The catchall for corner > cases would be to decay high from vmstat context based on pcp->expires. The > decay would prevent the "high" being pinned at an artifically high value > without any zone lock contention for prolonged periods of time and also > mitigate worst-case due to state being per-cpu. The downside is that "high" > would also oscillate for a continuous steady allocation pattern as the PID > control might pick an ideal value suitable for a long period of time with > the "decay" disrupting that ideal value. Maybe we can track the minimal value of pcp->count. If it's small enough recently, we can avoid to decay pcp->high. Because the pages in PCP are used for allocations instead of idle. Another question is as follows. For example, on CPU A, a large number of pages are freed, and we maximize batch and high. So, a large number of pages are put in PCP. Then, the possible situations may be, a) a large number of pages are allocated on CPU A after some time b) a large number of pages are allocated on another CPU B For a), we want the pages are kept in PCP of CPU A as long as possible. For b), we want the pages are kept in PCP of CPU A as short as possible. 
I think that we need to balance between them. What is the reasonable time to keep pages in PCP without many allocations? >> > 4. The amount of state tracked seems excessive and increases the size of >> > the per-cpu structure by more than 1 cache line. That in itself may not >> > be a problem but the state is tracked on every page alloc/free that goes >> > through the fast path and it's relatively complex to track. That is >> > a constant penalty in fast paths that may not may not be relevant to the >> > workload and only sustained bursty allocation streams may offset the >> > cost. >> >> Yes. Thanks for pointing this out. We will optimize this if the other >> aspects of the basic idea could be accepted. >> > > I'm not opposed to having an adaptive pcp->high in concept. I think it would > be best to disable adaptive tuning if percpu_pagelist_high_fraction is set > though. I expect that users of that tunable are rare and that if it *is* > used that there is a very good reason for it. OK. Will do that in the future version. >> > 5. Memory pressure and reclaim activity does not appear to be accounted >> > for and it's not clear if pcp->high is bounded or if it's possible for >> > a single PCP to hide a large number of pages from other CPUs sharing the >> > same node. The max size of the PCP should probably be explicitly clamped. >> >> As in my reply to Michal's email, previously I thought >> ZONE_RECLAIM_ACTIVE will be set for kswap reclaiming, and PCP high will >> be decreased to (batch * 4) effectively. Or I miss something? >> > > I don't think you did, but it deserves a big comment in the tuning at minimum > and potentially even disabling adaptive tuning entirely if reclaim is active. Sure. Will do that. Best Regards, Huang, Ying
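A small sketch of the "minimal pcp->count" idea from the message
above (illustrative; the count_min state and both helpers are
invented, not part of struct per_cpu_pages or this series):

/* Track the low-water mark of the PCP page count within a decay interval. */
static inline void pcp_note_count(struct per_cpu_pages *pcp, int *count_min)
{
	if (pcp->count < *count_min)
		*count_min = pcp->count;
}

/*
 * Skip the decay if the cache nearly emptied recently: the cached
 * pages were being consumed, not sitting idle. Returns true when the
 * decay should proceed, and restarts the tracking either way.
 */
static bool pcp_high_decay_needed(struct per_cpu_pages *pcp, int *count_min,
				  int idle_threshold)
{
	bool idle = *count_min > idle_threshold;

	*count_min = pcp->count;
	return idle;
}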
On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@techsingularity.net> writes: > > > Batch should have a much lower maximum than high because it's a deferred cost > > that gets assigned to an arbitrary task. The worst case is where a process > > that is a light user of the allocator incurs the full cost of a refill/drain. > > > > Again, intuitively this may be PID Control problem for the "Mix" case > > to estimate the size of high required to minimise drains/allocs as each > > drain/alloc is potentially a lock contention. The catchall for corner > > cases would be to decay high from vmstat context based on pcp->expires. The > > decay would prevent the "high" being pinned at an artifically high value > > without any zone lock contention for prolonged periods of time and also > > mitigate worst-case due to state being per-cpu. The downside is that "high" > > would also oscillate for a continuous steady allocation pattern as the PID > > control might pick an ideal value suitable for a long period of time with > > the "decay" disrupting that ideal value. > > Maybe we can track the minimal value of pcp->count. If it's small > enough recently, we can avoid to decay pcp->high. Because the pages in > PCP are used for allocations instead of idle. Implement as a separate patch. I suspect this type of heuristic will be very benchmark specific and the complexity may not be worth it in the general case. > > Another question is as follows. > > For example, on CPU A, a large number of pages are freed, and we > maximize batch and high. So, a large number of pages are put in PCP. > Then, the possible situations may be, > > a) a large number of pages are allocated on CPU A after some time > b) a large number of pages are allocated on another CPU B > > For a), we want the pages are kept in PCP of CPU A as long as possible. > For b), we want the pages are kept in PCP of CPU A as short as possible. > I think that we need to balance between them. What is the reasonable > time to keep pages in PCP without many allocations? > This would be a case where you're relying on vmstat to drain the PCP after a period of time as it is a corner case. You cannot reasonably detect the pattern on two separate per-cpu lists without either inspecting remote CPU state or maintaining global state. Either would incur cache miss penalties that probably cost more than the heuristic saves.
Mel Gorman <mgorman@techsingularity.net> writes: > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> > Batch should have a much lower maximum than high because it's a deferred cost >> > that gets assigned to an arbitrary task. The worst case is where a process >> > that is a light user of the allocator incurs the full cost of a refill/drain. >> > >> > Again, intuitively this may be PID Control problem for the "Mix" case >> > to estimate the size of high required to minimise drains/allocs as each >> > drain/alloc is potentially a lock contention. The catchall for corner >> > cases would be to decay high from vmstat context based on pcp->expires. The >> > decay would prevent the "high" being pinned at an artifically high value >> > without any zone lock contention for prolonged periods of time and also >> > mitigate worst-case due to state being per-cpu. The downside is that "high" >> > would also oscillate for a continuous steady allocation pattern as the PID >> > control might pick an ideal value suitable for a long period of time with >> > the "decay" disrupting that ideal value. >> >> Maybe we can track the minimal value of pcp->count. If it's small >> enough recently, we can avoid to decay pcp->high. Because the pages in >> PCP are used for allocations instead of idle. > > Implement as a separate patch. I suspect this type of heuristic will be > very benchmark specific and the complexity may not be worth it in the > general case. OK. >> Another question is as follows. >> >> For example, on CPU A, a large number of pages are freed, and we >> maximize batch and high. So, a large number of pages are put in PCP. >> Then, the possible situations may be, >> >> a) a large number of pages are allocated on CPU A after some time >> b) a large number of pages are allocated on another CPU B >> >> For a), we want the pages are kept in PCP of CPU A as long as possible. >> For b), we want the pages are kept in PCP of CPU A as short as possible. >> I think that we need to balance between them. What is the reasonable >> time to keep pages in PCP without many allocations? >> > > This would be a case where you're relying on vmstat to drain the PCP after > a period of time as it is a corner case. Yes. The remaining question is how long should "a period of time" be? If it's long, the pages in PCP can be used for allocation after some time. If it's short the pages can be put in buddy, so can be used by other workloads if needed. Anyway, I will do some experiment for that. > You cannot reasonably detect the pattern on two separate per-cpu lists > without either inspecting remote CPU state or maintaining global > state. Either would incur cache miss penalties that probably cost more > than the heuristic saves. Yes. Totally agree. Best Regards, Huang, Ying
On Tue, Jul 18, 2023 at 08:55:16AM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@techsingularity.net> writes: > > > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: > >> Mel Gorman <mgorman@techsingularity.net> writes: > >> > >> > Batch should have a much lower maximum than high because it's a deferred cost > >> > that gets assigned to an arbitrary task. The worst case is where a process > >> > that is a light user of the allocator incurs the full cost of a refill/drain. > >> > > >> > Again, intuitively this may be PID Control problem for the "Mix" case > >> > to estimate the size of high required to minimise drains/allocs as each > >> > drain/alloc is potentially a lock contention. The catchall for corner > >> > cases would be to decay high from vmstat context based on pcp->expires. The > >> > decay would prevent the "high" being pinned at an artifically high value > >> > without any zone lock contention for prolonged periods of time and also > >> > mitigate worst-case due to state being per-cpu. The downside is that "high" > >> > would also oscillate for a continuous steady allocation pattern as the PID > >> > control might pick an ideal value suitable for a long period of time with > >> > the "decay" disrupting that ideal value. > >> > >> Maybe we can track the minimal value of pcp->count. If it's small > >> enough recently, we can avoid to decay pcp->high. Because the pages in > >> PCP are used for allocations instead of idle. > > > > Implement as a separate patch. I suspect this type of heuristic will be > > very benchmark specific and the complexity may not be worth it in the > > general case. > > OK. > > >> Another question is as follows. > >> > >> For example, on CPU A, a large number of pages are freed, and we > >> maximize batch and high. So, a large number of pages are put in PCP. > >> Then, the possible situations may be, > >> > >> a) a large number of pages are allocated on CPU A after some time > >> b) a large number of pages are allocated on another CPU B > >> > >> For a), we want the pages are kept in PCP of CPU A as long as possible. > >> For b), we want the pages are kept in PCP of CPU A as short as possible. > >> I think that we need to balance between them. What is the reasonable > >> time to keep pages in PCP without many allocations? > >> > > > > This would be a case where you're relying on vmstat to drain the PCP after > > a period of time as it is a corner case. > > Yes. The remaining question is how long should "a period of time" be? Match the time used for draining "remote" pages from the PCP lists. The choice is arbitrary and no matter what value is chosen, it'll be possible to build an adverse workload. > If it's long, the pages in PCP can be used for allocation after some > time. If it's short the pages can be put in buddy, so can be used by > other workloads if needed. > Assume that the main reason to expire pages and put them back on the buddy list is to avoid premature allocation failures due to pages pinned on the PCP. Once pages are going back onto the buddy list and the expiry is hit, it might as well be assumed that the pages are cache-cold. Some bad corner cases should be mitigated by disabling the adapative sizing when reclaim is active. The big remaaining corner case to watch out for is where the sum of the boosted pcp->high exceeds the low watermark. If that should ever happen then potentially a premature OOM happens because the watermarks are fine so no reclaim is active but no pages are available. 
It may even be the case that the sum of pcp->high should not exceed *min* as that corner case means that processes may prematurely enter direct reclaim (not as bad as OOM but still bad). > Anyway, I will do some experiment for that. > > > You cannot reasonably detect the pattern on two separate per-cpu lists > > without either inspecting remote CPU state or maintaining global > > state. Either would incur cache miss penalties that probably cost more > > than the heuristic saves. > > Yes. Totally agree. > > Best Regards, > Huang, Ying
Mel Gorman <mgorman@techsingularity.net> writes: > On Tue, Jul 18, 2023 at 08:55:16AM +0800, Huang, Ying wrote: >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: >> >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> >> >> > Batch should have a much lower maximum than high because it's a deferred cost >> >> > that gets assigned to an arbitrary task. The worst case is where a process >> >> > that is a light user of the allocator incurs the full cost of a refill/drain. >> >> > >> >> > Again, intuitively this may be PID Control problem for the "Mix" case >> >> > to estimate the size of high required to minimise drains/allocs as each >> >> > drain/alloc is potentially a lock contention. The catchall for corner >> >> > cases would be to decay high from vmstat context based on pcp->expires. The >> >> > decay would prevent the "high" being pinned at an artifically high value >> >> > without any zone lock contention for prolonged periods of time and also >> >> > mitigate worst-case due to state being per-cpu. The downside is that "high" >> >> > would also oscillate for a continuous steady allocation pattern as the PID >> >> > control might pick an ideal value suitable for a long period of time with >> >> > the "decay" disrupting that ideal value. >> >> >> >> Maybe we can track the minimal value of pcp->count. If it's small >> >> enough recently, we can avoid to decay pcp->high. Because the pages in >> >> PCP are used for allocations instead of idle. >> > >> > Implement as a separate patch. I suspect this type of heuristic will be >> > very benchmark specific and the complexity may not be worth it in the >> > general case. >> >> OK. >> >> >> Another question is as follows. >> >> >> >> For example, on CPU A, a large number of pages are freed, and we >> >> maximize batch and high. So, a large number of pages are put in PCP. >> >> Then, the possible situations may be, >> >> >> >> a) a large number of pages are allocated on CPU A after some time >> >> b) a large number of pages are allocated on another CPU B >> >> >> >> For a), we want the pages are kept in PCP of CPU A as long as possible. >> >> For b), we want the pages are kept in PCP of CPU A as short as possible. >> >> I think that we need to balance between them. What is the reasonable >> >> time to keep pages in PCP without many allocations? >> >> >> > >> > This would be a case where you're relying on vmstat to drain the PCP after >> > a period of time as it is a corner case. >> >> Yes. The remaining question is how long should "a period of time" be? > > Match the time used for draining "remote" pages from the PCP lists. The > choice is arbitrary and no matter what value is chosen, it'll be possible > to build an adverse workload. OK. >> If it's long, the pages in PCP can be used for allocation after some >> time. If it's short the pages can be put in buddy, so can be used by >> other workloads if needed. >> > > Assume that the main reason to expire pages and put them back on the buddy > list is to avoid premature allocation failures due to pages pinned on the > PCP. Once pages are going back onto the buddy list and the expiry is hit, > it might as well be assumed that the pages are cache-cold. Some bad corner > cases should be mitigated by disabling the adapative sizing when reclaim is > active. Yes. This can be mitigated, but the page allocation performance may be hurt. 
> The big remaining corner case to watch out for is where the sum
> of the boosted pcp->high exceeds the low watermark. If that should ever
> happen then potentially a premature OOM happens because the watermarks
> are fine so no reclaim is active but no pages are available. It may even
> be the case that the sum of pcp->high should not exceed *min* as that
> corner case means that processes may prematurely enter direct reclaim
> (not as bad as OOM but still bad).

Sorry, I don't understand this. When pages are moved from the buddy
allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
even if the number of pages in the PCP is larger than the low/min
watermark, we can still trigger reclaim. Is my understanding correct?

>> Anyway, I will do some experiment for that.
>>
>> > You cannot reasonably detect the pattern on two separate per-cpu lists
>> > without either inspecting remote CPU state or maintaining global
>> > state. Either would incur cache miss penalties that probably cost more
>> > than the heuristic saves.
>>
>> Yes. Totally agree.

Best Regards,
Huang, Ying
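A minimal model of the accounting described in this question (plain userspace C, not the mm/ code; the struct and helper names are simplified stand-ins): pages refilled into a PCP are subtracted from the zone's free counter, so the watermark check can fail and reclaim can be triggered even while the PCP still holds plenty of usable pages.

#include <stdbool.h>
#include <stdio.h>

/* Simplified zone: free_pages plays the role of NR_FREE_PAGES. */
struct toy_zone {
	long free_pages;
	long min_wmark;
};

/* rmqueue_bulk()-like refill: pages move buddy -> PCP and stop being "free". */
static long refill_pcp(struct toy_zone *z, long nr)
{
	z->free_pages -= nr;
	return nr;		/* pages now sitting on the per-CPU list */
}

/* zone_watermark_ok()-like check against the free counter only. */
static bool watermark_ok(const struct toy_zone *z)
{
	return z->free_pages > z->min_wmark;
}

int main(void)
{
	struct toy_zone z = { .free_pages = 20000, .min_wmark = 11264 };
	long pcp_pages = 0;

	pcp_pages += refill_pcp(&z, 10000);	/* big PCP refill on one CPU */

	printf("free=%ld pcp=%ld watermark_ok=%d\n",
	       z.free_pages, pcp_pages, watermark_ok(&z));
	/* watermark_ok is now 0: reclaim would be triggered even though
	 * 10000 perfectly usable pages are sitting on the PCP list. */
	return 0;
}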
On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
> > The big remaining corner case to watch out for is where the sum
> > of the boosted pcp->high exceeds the low watermark. If that should ever
> > happen then potentially a premature OOM happens because the watermarks
> > are fine so no reclaim is active but no pages are available. It may even
> > be the case that the sum of pcp->high should not exceed *min* as that
> > corner case means that processes may prematurely enter direct reclaim
> > (not as bad as OOM but still bad).
>
> Sorry, I don't understand this. When pages are moved from the buddy
> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
> even if the number of pages in the PCP is larger than the low/min
> watermark, we can still trigger reclaim. Is my understanding correct?
>

You're right, I didn't check the timing of the accounting and all that
occurred to me was "the timing of when watermarks trigger kswapd or direct
reclaim may change as a result of PCP adaptive resizing". Even though I got
the timing wrong, the shape of the problem just changes. I suspect that
excessively large PCP high relative to the watermarks may mean that reclaim
happens prematurely if too many pages are pinned in PCP lists as the zone's
free pages approach the watermark.

While disabling the adaptive resizing during reclaim will limit the worst
of the problem, it may still be the case that kswapd is woken early simply
because there are enough CPUs pinning pages in PCP lists. Similarly,
depending on the size of pcp->high and the gap between the watermarks, it's
possible for direct reclaim to happen prematurely. I could still be wrong
because I'm not thinking the problem through fully, examining the code or
thinking about the implementation. It's simply worth keeping in mind the
impact elevated PCP high values have on the timing of watermarks failing.
If it's complex enough, it may be necessary to have a separate patch
dealing with the impact of elevated pcp->high on watermarks.
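For a rough sense of when "kswapd woken early" can turn into "direct reclaim happens prematurely", the interesting quantity is the gap between the low and min watermarks. The sketch below compares that gap with the worst-case number of PCP-held pages; it is only an approximation (the gap is modelled here as max(min/4, 0.1% of the zone), mirroring the default watermark_scale_factor of 10 in recent kernels), and the zone size, watermark and pcp->high values are all invented:

#include <stdio.h>

int main(void)
{
	const long managed_pages = 16L << 20;	/* ~64 GiB zone with 4 KiB pages */
	const long min_wmark = 11264;		/* example zone min watermark */
	const long wmark_scale_factor = 10;	/* default watermark_scale_factor */
	const long nr_cpus = 64;
	const long pcp_high = 1024;		/* boosted per-CPU pcp->high */

	/* Approximate step between watermarks: max(min/4, 0.1% of the zone). */
	long step = managed_pages * wmark_scale_factor / 10000;
	if (step < min_wmark / 4)
		step = min_wmark / 4;

	long low_to_min_gap = step;
	long pcp_capacity = nr_cpus * pcp_high;

	printf("low->min gap:        %ld pages\n", low_to_min_gap);
	printf("worst-case PCP-held: %ld pages\n", pcp_capacity);
	if (pcp_capacity > low_to_min_gap)
		printf("PCP lists alone can take a zone from 'kswapd not yet "
		       "needed' straight into direct-reclaim territory\n");
	return 0;
}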
Mel Gorman <mgorman@techsingularity.net> writes:

> On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
>> > The big remaining corner case to watch out for is where the sum
>> > of the boosted pcp->high exceeds the low watermark. If that should ever
>> > happen then potentially a premature OOM happens because the watermarks
>> > are fine so no reclaim is active but no pages are available. It may even
>> > be the case that the sum of pcp->high should not exceed *min* as that
>> > corner case means that processes may prematurely enter direct reclaim
>> > (not as bad as OOM but still bad).
>>
>> Sorry, I don't understand this. When pages are moved from the buddy
>> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
>> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
>> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
>> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
>> even if the number of pages in the PCP is larger than the low/min
>> watermark, we can still trigger reclaim. Is my understanding correct?
>>
>
> You're right, I didn't check the timing of the accounting and all that
> occurred to me was "the timing of when watermarks trigger kswapd or direct
> reclaim may change as a result of PCP adaptive resizing". Even though I got
> the timing wrong, the shape of the problem just changes. I suspect that
> excessively large PCP high relative to the watermarks may mean that reclaim
> happens prematurely if too many pages are pinned in PCP lists as the zone's
> free pages approach the watermark.

Yes. I think so too. In addition to reclaim, falling back to a remote NUMA
node may happen prematurely too.

> While disabling the adaptive resizing during reclaim will limit the worst
> of the problem, it may still be the case that kswapd is woken early simply
> because there are enough CPUs pinning pages in PCP lists. Similarly,
> depending on the size of pcp->high and the gap between the watermarks, it's
> possible for direct reclaim to happen prematurely. I could still be wrong
> because I'm not thinking the problem through fully, examining the code or
> thinking about the implementation. It's simply worth keeping in mind the
> impact elevated PCP high values have on the timing of watermarks failing.
> If it's complex enough, it may be necessary to have a separate patch
> dealing with the impact of elevated pcp->high on watermarks.

Sure. I will keep this in mind. We may need to check the zone watermark
when tuning pcp->high and free some pages from the PCP before falling back
to another node or reclaiming.

--
Best Regards,
Huang, Ying
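One possible shape for "check the zone watermark when tuning pcp->high" is to clamp the tuned value by the zone's free-page headroom above the low watermark, split across CPUs. The following is only a sketch of that idea, not the posted patch, and every function name and number in it is hypothetical:

#include <stdio.h>

/* Toy model: clamp a proposed per-CPU high by the number of pages the zone
 * can spare above its low watermark, shared across all CPUs. */
static long clamp_pcp_high(long proposed_high, long high_def,
			   long zone_free, long low_wmark, long nr_cpus)
{
	long headroom = zone_free - low_wmark;
	long budget = headroom > 0 ? headroom / nr_cpus : 0;

	if (proposed_high > budget)
		proposed_high = budget;
	if (proposed_high < high_def)
		proposed_high = high_def;	/* never go below the default */
	return proposed_high;
}

int main(void)
{
	/* Plenty of memory: the auto-tuned value is used as-is. */
	printf("roomy zone: high = %ld\n",
	       clamp_pcp_high(1024, 200, 4000000, 50000, 64));
	/* Near the watermark: fall back towards the default behaviour. */
	printf("tight zone: high = %ld\n",
	       clamp_pcp_high(1024, 200, 60000, 50000, 64));
	return 0;
}

The design intent the sketch tries to capture is that the adaptive sizing only spends memory the zone can afford, and degrades back towards the default pcp->high as the watermarks get close.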
On Fri, Jul 21, 2023 at 03:28:43PM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
>
> > On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
> >> > The big remaining corner case to watch out for is where the sum
> >> > of the boosted pcp->high exceeds the low watermark. If that should ever
> >> > happen then potentially a premature OOM happens because the watermarks
> >> > are fine so no reclaim is active but no pages are available. It may even
> >> > be the case that the sum of pcp->high should not exceed *min* as that
> >> > corner case means that processes may prematurely enter direct reclaim
> >> > (not as bad as OOM but still bad).
> >>
> >> Sorry, I don't understand this. When pages are moved from the buddy
> >> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
> >> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
> >> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
> >> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
> >> even if the number of pages in the PCP is larger than the low/min
> >> watermark, we can still trigger reclaim. Is my understanding correct?
> >>
> >
> > You're right, I didn't check the timing of the accounting and all that
> > occurred to me was "the timing of when watermarks trigger kswapd or direct
> > reclaim may change as a result of PCP adaptive resizing". Even though I got
> > the timing wrong, the shape of the problem just changes. I suspect that
> > excessively large PCP high relative to the watermarks may mean that reclaim
> > happens prematurely if too many pages are pinned in PCP lists as the zone's
> > free pages approach the watermark.
>
> Yes. I think so too. In addition to reclaim, falling back to a remote
> NUMA node may happen prematurely too.
>

Yes, with the added bonus that this is relatively easy to detect from the
NUMA miss stats. I say "relative" because in a lot of cases, it'll be
difficult to distinguish from the noise. Hence, it's better to be explicit
in the change log that the potential problem is known and has been
considered. That way, if bisect points the finger at adaptive resizing,
there will be some notes on how to investigate the bug.

> > While disabling the adaptive resizing during reclaim will limit the worst
> > of the problem, it may still be the case that kswapd is woken early simply
> > because there are enough CPUs pinning pages in PCP lists. Similarly,
> > depending on the size of pcp->high and the gap between the watermarks, it's
> > possible for direct reclaim to happen prematurely. I could still be wrong
> > because I'm not thinking the problem through fully, examining the code or
> > thinking about the implementation. It's simply worth keeping in mind the
> > impact elevated PCP high values have on the timing of watermarks failing.
> > If it's complex enough, it may be necessary to have a separate patch
> > dealing with the impact of elevated pcp->high on watermarks.
>
> Sure. I will keep this in mind. We may need to check the zone watermark
> when tuning pcp->high and free some pages from the PCP before falling back
> to another node or reclaiming.
>

That would certainly be one option, a cap on adaptive resizing as memory
gets lower. It's not perfect but ideally the worst-case behaviour would be
that PCP adaptive sizing returns to existing behaviour when memory usage is
persistently high and near watermarks within a zone.
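On the observability point: per-node NUMA hit/miss counters are exported in /sys/devices/system/node/node<N>/numastat (and summarised by the numastat tool), so premature fall-back to a remote node shows up as numa_miss/numa_foreign growing on an otherwise lightly loaded machine. A small reader, purely as an illustration of where to look (node0 and the minimal error handling are assumptions of the example):

#include <stdio.h>
#include <string.h>

/* Print the numa_miss and numa_foreign counters for node 0. */
int main(void)
{
	FILE *f = fopen("/sys/devices/system/node/node0/numastat", "r");
	char name[64];
	unsigned long long val;

	if (!f) {
		perror("numastat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "numa_miss") || !strcmp(name, "numa_foreign"))
			printf("%s %llu\n", name, val);
	}
	fclose(f);
	return 0;
}

Sampling these counters before and after a benchmark run and diffing them is usually enough to see whether remote fall-backs increased, though, as noted above, the signal can be hard to separate from normal noise.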
Mel Gorman <mgorman@techsingularity.net> writes:

> On Fri, Jul 21, 2023 at 03:28:43PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>>
>> > On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
>> >> > The big remaining corner case to watch out for is where the sum
>> >> > of the boosted pcp->high exceeds the low watermark. If that should ever
>> >> > happen then potentially a premature OOM happens because the watermarks
>> >> > are fine so no reclaim is active but no pages are available. It may even
>> >> > be the case that the sum of pcp->high should not exceed *min* as that
>> >> > corner case means that processes may prematurely enter direct reclaim
>> >> > (not as bad as OOM but still bad).
>> >>
>> >> Sorry, I don't understand this. When pages are moved from the buddy
>> >> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
>> >> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
>> >> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
>> >> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
>> >> even if the number of pages in the PCP is larger than the low/min
>> >> watermark, we can still trigger reclaim. Is my understanding correct?
>> >>
>> >
>> > You're right, I didn't check the timing of the accounting and all that
>> > occurred to me was "the timing of when watermarks trigger kswapd or direct
>> > reclaim may change as a result of PCP adaptive resizing". Even though I got
>> > the timing wrong, the shape of the problem just changes. I suspect that
>> > excessively large PCP high relative to the watermarks may mean that reclaim
>> > happens prematurely if too many pages are pinned in PCP lists as the zone's
>> > free pages approach the watermark.
>>
>> Yes. I think so too. In addition to reclaim, falling back to a remote
>> NUMA node may happen prematurely too.
>>
>
> Yes, with the added bonus that this is relatively easy to detect from the
> NUMA miss stats. I say "relative" because in a lot of cases, it'll be
> difficult to distinguish from the noise. Hence, it's better to be explicit
> in the change log that the potential problem is known and has been
> considered. That way, if bisect points the finger at adaptive resizing,
> there will be some notes on how to investigate the bug.

Sure. Will do that.

>> > While disabling the adaptive resizing during reclaim will limit the worst
>> > of the problem, it may still be the case that kswapd is woken early simply
>> > because there are enough CPUs pinning pages in PCP lists. Similarly,
>> > depending on the size of pcp->high and the gap between the watermarks, it's
>> > possible for direct reclaim to happen prematurely. I could still be wrong
>> > because I'm not thinking the problem through fully, examining the code or
>> > thinking about the implementation. It's simply worth keeping in mind the
>> > impact elevated PCP high values have on the timing of watermarks failing.
>> > If it's complex enough, it may be necessary to have a separate patch
>> > dealing with the impact of elevated pcp->high on watermarks.
>>
>> Sure. I will keep this in mind. We may need to check the zone watermark
>> when tuning pcp->high and free some pages from the PCP before falling back
>> to another node or reclaiming.
>>
>
> That would certainly be one option, a cap on adaptive resizing as memory
> gets lower.
> It's not perfect but ideally the worst-case behaviour would be that PCP
> adaptive sizing returns to existing behaviour when memory usage is
> persistently high and near watermarks within a zone.

--
Best Regards,
Huang, Ying
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7e2c1864a9ea..cd9b497cd596 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -670,6 +670,14 @@ struct per_cpu_pages {
 #ifdef CONFIG_NUMA
 	short expire;			/* When 0, remote pagesets are drained */
 #endif
+	int alloc_count;		/* alloc/free count from tune period start */
+	int alloc_high;			/* max alloc count from tune period start */
+	int alloc_low;			/* min alloc count from tune period start */
+	int alloc_depth;		/* alloc depth from tune period start */
+	int free_depth;			/* free depth from tune period start */
+	int avg_alloc_depth;		/* average alloc depth */
+	int avg_free_depth;		/* average free depth */
+	unsigned long tune_start;	/* tune period start timestamp */
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd83c19f25c6..4d627d96e41a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2616,9 +2616,38 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	return min(READ_ONCE(pcp->batch) << 2, high);
 }
 
+#define PCP_AUTO_TUNE_PERIOD	HZ
+
 static void tune_pcp_high(struct per_cpu_pages *pcp, int high_def)
 {
-	pcp->high = high_def;
+	unsigned long now = jiffies;
+	int high_max, high;
+
+	if (likely(now - pcp->tune_start <= PCP_AUTO_TUNE_PERIOD))
+		return;
+
+	/* No alloc/free in last 2 tune period, reset */
+	if (now - pcp->tune_start > 2 * PCP_AUTO_TUNE_PERIOD) {
+		pcp->tune_start = now;
+		pcp->alloc_high = pcp->alloc_low = pcp->alloc_count = 0;
+		pcp->alloc_depth = pcp->free_depth = 0;
+		pcp->avg_alloc_depth = pcp->avg_free_depth = 0;
+		pcp->high = high_def;
+		return;
+	}
+
+	/* End of tune period, try to tune PCP high automatically */
+	pcp->tune_start = now;
+	/* The old alloc/free depth decay with time */
+	pcp->avg_alloc_depth = (pcp->avg_alloc_depth + pcp->alloc_depth) / 2;
+	pcp->avg_free_depth = (pcp->avg_free_depth + pcp->free_depth) / 2;
+	/* Reset for next tune period */
+	pcp->alloc_high = pcp->alloc_low = pcp->alloc_count = 0;
+	pcp->alloc_depth = pcp->free_depth = 0;
+	/* Pure alloc/free will not increase PCP high */
+	high = min(pcp->avg_alloc_depth, pcp->avg_free_depth);
+	high_max = READ_ONCE(pcp->high_max);
+	pcp->high = clamp(high, high_def, high_max);
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
@@ -2630,7 +2659,19 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	bool free_high;
 
 	high_def = READ_ONCE(pcp->high_def);
-	tune_pcp_high(pcp, high_def);
+	/* PCP is disabled or boot pageset */
+	if (unlikely(!high_def)) {
+		pcp->high = high_def;
+		pcp->tune_start = 0;
+	} else {
+		/* free count as negative allocation */
+		pcp->alloc_count -= (1 << order);
+		pcp->alloc_low = min(pcp->alloc_low, pcp->alloc_count);
+		/* max free depth from the start of current tune period */
+		pcp->free_depth = max(pcp->free_depth,
+				      pcp->alloc_high - pcp->alloc_count);
+		tune_pcp_high(pcp, high_def);
+	}
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2998,6 +3039,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		return NULL;
 	}
 
+	pcp->alloc_count += (1 << order);
+	pcp->alloc_high = max(pcp->alloc_high, pcp->alloc_count);
+	/* max alloc depth from the start of current tune period */
+	pcp->alloc_depth = max(pcp->alloc_depth, pcp->alloc_count - pcp->alloc_low);
+
 	/*
 	 * On allocation, reduce the number of pages that are batch freed.
 	 * See nr_pcp_free() where free_factor is increased for subsequent