[RFC,00/26] mm: reliable huge page allocator

Message ID 20230418191313.268131-1-hannes@cmpxchg.org
Series: mm: reliable huge page allocator

Johannes Weiner April 18, 2023, 7:12 p.m. UTC
  As memory capacity continues to grow, 4k TLB coverage has not been
able to keep up. On Meta's 64G webservers, close to 20% of execution
cycles are observed to be handling TLB misses when using 4k pages
only. Huge pages are shifting from being a nice-to-have optimization
for HPC workloads to becoming a necessity for common applications.

However, while trying to deploy THP more universally, we observe a
fragmentation problem in the page allocator that often prevents larger
requests from being met quickly, or met at all, at runtime. Since we
have to provision hardware capacity for worst case performance,
unreliable huge page coverage isn't of much help.

Drilling into the allocator, we find that existing defrag efforts,
such as mobility grouping and watermark boosting, help, but are
insufficient by themselves. We still observe a high number of blocks
being routinely shared by allocations of different migratetypes. This
in turn results in inefficient or ineffective reclaim/compaction runs.

In a broad sample of Meta servers, we find that unmovable allocations
make up less than 7% of total memory on average, yet occupy 34% of the
2M blocks in the system. We also found that this effect isn't
correlated with high uptimes, and that servers can get heavily
fragmented within the first hour of running a workload.
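
To see why a small unmovable share can occupy so many blocks, here is a back-of-the-envelope model (illustrative arithmetic only, not the fleet data):

```python
# Toy arithmetic: how far a small unmovable share can spread across blocks.
PAGES_PER_BLOCK = 2048 // 4        # 2M pageblock / 4k pages = 512
unmovable_share = 0.07             # ~7% of memory, per the sample above

avg_per_block = unmovable_share * PAGES_PER_BLOCK
print(f"avg unmovable pages per 2M block: {avg_per_block:.1f}")

# Best case: pages packed tightly, occupying the minimum number of blocks,
# i.e. 7% of them. Worst case: a single unmovable page is enough to poison
# a block for compaction, so ~36 scattered pages per block could poison
# every block in the system. The observed 34% sits between these extremes.
print(f"blocks occupied if perfectly grouped: {unmovable_share:.0%}")
```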

The following experiment shows that only 20 minutes of build load under
moderate memory pressure already results in a significant number of
type-mixed blocks (block analysis run after the system is back to idle):

vanilla:
unmovable 50
movable 701
reclaimable 149
unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)

patched:
unmovable 65
movable 457
reclaimable 159
free 219
unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)

[ The remaining "mixed blocks" in the patched kernel are false
  positives: LRU pages without migrate callbacks (empty_aops), and
  i915 shmem that is pinned until reclaimed through shrinkers. ]

Root causes

One of the behaviors that sabotage the page allocator's mobility
grouping is the fact that requests of one migratetype are allowed to
fall back into blocks of another type before reclaim and compaction
occur. This is a design decision to prioritize memory utilization over
block fragmentation - especially considering the history of lumpy
reclaim and its tendency to overreclaim. However, with compaction
available, these two goals are no longer in conflict: the scratch
space of free pages for compaction to work is only twice the size of
the allocation request; in most cases, only small amounts of
proactive, coordinated reclaim and compaction are required to prevent a
fallback which may fragment a pageblock indefinitely.

Another problem lies in how the page allocator drives reclaim and
compaction when it does invoke it. While the page allocator targets
migratetype grouping at the pageblock level, it calls reclaim and
compaction with the order of the allocation request. As most requests
are smaller than a pageblock, this results in partial block freeing
and subsequent fallbacks and type mixing.

Note that in combination, these two design decisions have a
self-reinforcing effect on fragmentation: 1. Partially used unmovable
blocks are filled up with fallback movable pages. 2. A subsequent
unmovable allocation, instead of grouping up, will then need to enter
reclaim, which most likely results in a partially freed movable block
that it falls back into. Over time, unmovable allocations are sparsely
scattered throughout the address space and poison many pageblocks.
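
The loop can be demonstrated with a toy allocator model (a deliberately simplified Python sketch, nothing like the real allocator internals):

```python
# Toy model of migratetype fallback, NOT the kernel's real allocator:
# each pageblock tracks its claimed type and which types actually landed in it.
PAGES = 8  # tiny blocks for illustration (real 2M pageblocks hold 512 4k pages)

def alloc(blocks, mt):
    """Allocate one page: prefer same-type blocks, then empty blocks,
    then fall back into any block with room (mixing its types)."""
    for b in blocks:
        if b["type"] == mt and b["used"] < PAGES:
            b["used"] += 1; b["types"].add(mt); return
    for b in blocks:
        if b["used"] == 0:
            b.update(type=mt, used=1, types={mt}); return
    for b in blocks:  # fallback: type mixing
        if b["used"] < PAGES:
            b["used"] += 1; b["types"].add(mt); return
    raise MemoryError

blocks = [{"type": None, "used": 0, "types": set()} for _ in range(4)]
for _ in range(30):          # movable load fills nearly everything
    alloc(blocks, "movable")
alloc(blocks, "unmovable")   # no empty block left -> falls back, mixes a block
mixed = sum(len(b["types"]) > 1 for b in blocks)
print(mixed)  # 1
```

One stray unmovable page is enough to mix a movable block; over many alloc/free cycles, these mixed blocks accumulate instead of healing.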

Note that block fragmentation is driven by lower-order requests. It is
not reliably mitigated by the mere presence of higher-order requests.

Proposal

This series proposes to make THP allocations reliable by enforcing
pageblock hygiene, and aligning the allocator, reclaim and compaction
on the pageblock as the base unit for managing free memory. All orders
up to and including the pageblock are made first-class requests that
(outside of OOM situations) are expected to succeed without
exceptional investment by the allocating thread.

A neutral pageblock type is introduced, MIGRATE_FREE. The first
allocation to be placed into such a block claims it exclusively for
the allocation's migratetype. Fallbacks from a different type are no
longer allowed, and the block is "kept open" for more allocations of
the same type to ensure tight grouping. A pageblock becomes neutral
again only once all its pages have been freed.
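
In pseudocode terms, the claiming rule looks roughly like this (an illustrative Python sketch, not the kernel implementation):

```python
# Minimal sketch of the MIGRATE_FREE claiming rule (not actual kernel code).
FREE = "MIGRATE_FREE"

class PageBlock:
    def __init__(self, pages=512):        # 2M block of 4k pages
        self.pages, self.used, self.type = pages, 0, FREE

    def alloc(self, migratetype):
        if self.type == FREE:             # first allocation claims the block
            self.type = migratetype
        if self.type != migratetype:      # cross-type fallback no longer allowed
            return False
        if self.used == self.pages:       # block is full
            return False
        self.used += 1                    # block stays open for its own type
        return True

    def free(self):
        self.used -= 1
        if self.used == 0:                # fully freed -> neutral again
            self.type = FREE

b = PageBlock()
assert b.alloc("MIGRATE_UNMOVABLE")       # claims the neutral block
assert not b.alloc("MIGRATE_MOVABLE")     # fallback rejected
b.free()
print(b.type)  # MIGRATE_FREE
```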

Reclaim and compaction are changed from partial block reclaim to
producing whole neutral page blocks. The watermark logic is adjusted
to apply to neutral blocks, ensuring that background and direct
reclaim always maintain a readily-available reserve of them.

The defragmentation effort changes from reactive to proactive. In
turn, this makes defragmentation actually more efficient: compaction
only has to scan movable blocks and can skip other blocks entirely;
since movable blocks aren't poisoned by unmovable pages, the chances
of successful compaction in each block are greatly improved as well.

Defragmentation becomes an ongoing responsibility of all allocations,
rather than being the burden of only higher-order asks. This prevents
sub-block allocations - which cause block fragmentation in the first
place - from starving the increasingly important larger requests.

There is a slight increase in worst-case memory overhead by requiring
the watermarks to be met against neutral blocks even when there might
be free pages in typed blocks. However, the high watermarks are less
than 1% of the zone, so the increase is relatively small.

These changes only apply to CONFIG_COMPACTION kernels. Without
compaction, fallbacks and partial block reclaim remain the best
trade-off between memory utilization and fragmentation.

Initial Test Results

The following is purely an allocation reliability test. Achieving full
THP benefits in practice is tied to other pending changes, such as the
THP shrinker to avoid memory pressure from excessive internal
fragmentation, and tweaks to the kernel's THP allocation strategy.

The test is a kernel build under moderate-to-high memory pressure,
with a concurrent process trying to repeatedly fault THPs (madvise):

                                              HUGEALLOC-VANILLA       HUGEALLOC-PATCHED
Real time                                   265.04 (    +0.00%)     268.12 (    +1.16%)
User time                                  1131.05 (    +0.00%)    1131.13 (    +0.01%)
System time                                 474.66 (    +0.00%)     478.97 (    +0.91%)
THP fault alloc                           17913.24 (    +0.00%)   19647.50 (    +9.68%)
THP fault fallback                         1947.12 (    +0.00%)     223.40 (   -88.48%)
THP fault fail rate %                         9.80 (    +0.00%)       1.12 (   -80.34%)
Direct compact stall                        282.44 (    +0.00%)     543.90 (   +92.25%)
Direct compact fail                         262.44 (    +0.00%)     239.90 (    -8.56%)
Direct compact success                       20.00 (    +0.00%)     304.00 ( +1352.38%)
Direct compact success rate %                 7.15 (    +0.00%)      57.10 (  +612.90%)
Compact daemon scanned migrate            21643.80 (    +0.00%)  387479.80 ( +1690.18%)
Compact daemon scanned free              188462.36 (    +0.00%) 2842824.10 ( +1408.42%)
Compact direct scanned migrate          1601294.84 (    +0.00%)  275670.70 (   -82.78%)
Compact direct scanned free             4476155.60 (    +0.00%) 2438835.00 (   -45.51%)
Compact migrate scanned daemon %              1.32 (    +0.00%)      59.18 ( +2499.00%)
Compact free scanned daemon %                 3.95 (    +0.00%)      54.31 ( +1018.20%)
Alloc stall                                2425.00 (    +0.00%)     992.00 (   -59.07%)
Pages kswapd scanned                     586756.68 (    +0.00%)  975390.20 (   +66.23%)
Pages kswapd reclaimed                   385468.20 (    +0.00%)  437767.50 (   +13.57%)
Pages direct scanned                     335199.56 (    +0.00%)  501824.20 (   +49.71%)
Pages direct reclaimed                   127953.72 (    +0.00%)  151880.70 (   +18.70%)
Pages scanned kswapd %                       64.43 (    +0.00%)      66.39 (    +2.99%)
Swap out                                  14083.88 (    +0.00%)   45034.60 (  +219.74%)
Swap in                                    3395.08 (    +0.00%)    7767.50 (  +128.75%)
File refaults                             93546.68 (    +0.00%)  129648.30 (   +38.59%)

The THP fault success rate is drastically improved. A bigger share of
the work is done by the background threads, as they now proactively
maintain MIGRATE_FREE block reserves. The increase in memory pressure
is shown by the uptick in swap activity.

Status

Initial test results look promising, but production testing has been
lagging behind the effort to generalize this code for upstream, and
putting all the pieces together to make THP work. I'll follow up as I
gather more data.

Sending this out now as an RFC to get input on the overall direction.

The patches are based on v6.2.

 Documentation/admin-guide/sysctl/vm.rst |  21 -
 block/bdev.c                            |   2 +-
 include/linux/compaction.h              | 100 +---
 include/linux/gfp.h                     |   2 -
 include/linux/mm.h                      |   1 -
 include/linux/mmzone.h                  |  30 +-
 include/linux/page-isolation.h          |  28 +-
 include/linux/pageblock-flags.h         |   4 +-
 include/linux/vmstat.h                  |   8 -
 include/trace/events/mmflags.h          |   4 +-
 kernel/sysctl.c                         |   8 -
 mm/compaction.c                         | 242 +++-----
 mm/internal.h                           |  14 +-
 mm/memory_hotplug.c                     |   4 +-
 mm/page_alloc.c                         | 930 +++++++++++++-----------------
 mm/page_isolation.c                     |  42 +-
 mm/vmscan.c                             | 251 ++------
 mm/vmstat.c                             |   6 +-
 18 files changed, 629 insertions(+), 1068 deletions(-)
  

Comments

Kirill A. Shutemov April 18, 2023, 11:54 p.m. UTC | #1
On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> As memory capacity continues to grow, 4k TLB coverage has not been
> able to keep up. On Meta's 64G webservers, close to 20% of execution
> cycles are observed to be handling TLB misses when using 4k pages
> only. Huge pages are shifting from being a nice-to-have optimization
> for HPC workloads to becoming a necessity for common applications.
> 
> However, while trying to deploy THP more universally, we observe a
> fragmentation problem in the page allocator that often prevents larger
> requests from being met quickly, or met at all, at runtime. Since we
> have to provision hardware capacity for worst case performance,
> unreliable huge page coverage isn't of much help.
> 
> Drilling into the allocator, we find that existing defrag efforts,
> such as mobility grouping and watermark boosting, help, but are
> insufficient by themselves. We still observe a high number of blocks
> being routinely shared by allocations of different migratetypes. This
> in turn results in inefficient or ineffective reclaim/compaction runs.
> 
> In a broad sample of Meta servers, we find that unmovable allocations
> make up less than 7% of total memory on average, yet occupy 34% of the
> 2M blocks in the system. We also found that this effect isn't
> correlated with high uptimes, and that servers can get heavily
> fragmented within the first hour of running a workload.
> 
> The following experiment shows that only 20min of build load under
> moderate memory pressure already results in a significant number of
> typemixed blocks (block analysis run after system is back to idle):
> 
> vanilla:
> unmovable 50
> movable 701
> reclaimable 149
> unmovable blocks with slab/lru pages: 13 ({'slab': 17, 'lru': 19} pages)
> movable blocks with non-LRU pages: 77 ({'slab': 4257, 'kmem': 77, 'other': 2} pages)
> reclaimable blocks with non-slab pages: 16 ({'lru': 37, 'kmem': 311, 'other': 26} pages)
> 
> patched:
> unmovable 65
> movable 457
> reclaimable 159
> free 219
> unmovable blocks with slab/lru pages: 22 ({'slab': 0, 'lru': 38} pages)
> movable blocks with non-LRU pages: 0 ({'slab': 0, 'kmem': 0, 'other': 0} pages)
> reclaimable blocks with non-slab pages: 3 ({'lru': 36, 'kmem': 0, 'other': 23} pages)
> 
> [ The remaining "mixed blocks" in the patched kernel are false
>   positives: LRU pages without migrate callbacks (empty_aops), and
>   i915 shmem that is pinned until reclaimed through shrinkers. ]
> 
> Root causes
> 
> One of the behaviors that sabotage the page allocator's mobility
> grouping is the fact that requests of one migratetype are allowed to
> fall back into blocks of another type before reclaim and compaction
> occur. This is a design decision to prioritize memory utilization over
> block fragmentation - especially considering the history of lumpy
> reclaim and its tendency to overreclaim. However, with compaction
> available, these two goals are no longer in conflict: the scratch
> space of free pages for compaction to work is only twice the size of
> the allocation request; in most cases, only small amounts of
> proactive, coordinated reclaim and compaction are required to prevent a
> fallback which may fragment a pageblock indefinitely.
> 
> Another problem lies in how the page allocator drives reclaim and
> compaction when it does invoke it. While the page allocator targets
> migratetype grouping at the pageblock level, it calls reclaim and
> compaction with the order of the allocation request. As most requests
> are smaller than a pageblock, this results in partial block freeing
> and subsequent fallbacks and type mixing.
> 
> Note that in combination, these two design decisions have a
> self-reinforcing effect on fragmentation: 1. Partially used unmovable
> blocks are filled up with fallback movable pages. 2. A subsequent
> unmovable allocation, instead of grouping up, will then need to enter
> reclaim, which most likely results in a partially freed movable block
> that it falls back into. Over time, unmovable allocations are sparsely
> scattered throughout the address space and poison many pageblocks.
> 
> Note that block fragmentation is driven by lower-order requests. It is
> not reliably mitigated by the mere presence of higher-order requests.
> 
> Proposal
> 
> This series proposes to make THP allocations reliable by enforcing
> pageblock hygiene, and aligning the allocator, reclaim and compaction
> on the pageblock as the base unit for managing free memory. All orders
> up to and including the pageblock are made first-class requests that
> (outside of OOM situations) are expected to succeed without
> exceptional investment by the allocating thread.
> 
> A neutral pageblock type is introduced, MIGRATE_FREE. The first
> allocation to be placed into such a block claims it exclusively for
> the allocation's migratetype. Fallbacks from a different type are no
> longer allowed, and the block is "kept open" for more allocations of
> the same type to ensure tight grouping. A pageblock becomes neutral
> again only once all its pages have been freed.

Sounds like this will cause earlier OOM, no?

I guess with 2M pageblock on 64G server it shouldn't matter much. But how
about smaller machines?

> Reclaim and compaction are changed from partial block reclaim to
> producing whole neutral page blocks.

How does it affect allocation latencies? I see direct compact stall grew
substantially. Hm?

> The watermark logic is adjusted
> to apply to neutral blocks, ensuring that background and direct
> reclaim always maintain a readily-available reserve of them.
> 
> The defragmentation effort changes from reactive to proactive. In
> turn, this makes defragmentation actually more efficient: compaction
> only has to scan movable blocks and can skip other blocks entirely;
> since movable blocks aren't poisoned by unmovable pages, the chances
> of successful compaction in each block are greatly improved as well.
> 
> Defragmentation becomes an ongoing responsibility of all allocations,
> rather than being the burden of only higher-order asks. This prevents
> sub-block allocations - which cause block fragmentation in the first
> place - from starving the increasingly important larger requests.
> 
> There is a slight increase in worst-case memory overhead by requiring
> the watermarks to be met against neutral blocks even when there might
> be free pages in typed blocks. However, the high watermarks are less
> than 1% of the zone, so the increase is relatively small.
> 
> These changes only apply to CONFIG_COMPACTION kernels. Without
> compaction, fallbacks and partial block reclaim remain the best
> trade-off between memory utilization and fragmentation.
> 
> Initial Test Results
> 
> The following is purely an allocation reliability test. Achieving full
> THP benefits in practice is tied to other pending changes, such as the
> THP shrinker to avoid memory pressure from excessive internal
> fragmentation, and tweaks to the kernel's THP allocation strategy.
> 
> The test is a kernel build under moderate-to-high memory pressure,
> with a concurrent process trying to repeatedly fault THPs (madvise):
> 
>                                               HUGEALLOC-VANILLA       HUGEALLOC-PATCHED
> Real time                                   265.04 (    +0.00%)     268.12 (    +1.16%)
> User time                                  1131.05 (    +0.00%)    1131.13 (    +0.01%)
> System time                                 474.66 (    +0.00%)     478.97 (    +0.91%)
> THP fault alloc                           17913.24 (    +0.00%)   19647.50 (    +9.68%)
> THP fault fallback                         1947.12 (    +0.00%)     223.40 (   -88.48%)
> THP fault fail rate %                         9.80 (    +0.00%)       1.12 (   -80.34%)
> Direct compact stall                        282.44 (    +0.00%)     543.90 (   +92.25%)
> Direct compact fail                         262.44 (    +0.00%)     239.90 (    -8.56%)
> Direct compact success                       20.00 (    +0.00%)     304.00 ( +1352.38%)
> Direct compact success rate %                 7.15 (    +0.00%)      57.10 (  +612.90%)
> Compact daemon scanned migrate            21643.80 (    +0.00%)  387479.80 ( +1690.18%)
> Compact daemon scanned free              188462.36 (    +0.00%) 2842824.10 ( +1408.42%)
> Compact direct scanned migrate          1601294.84 (    +0.00%)  275670.70 (   -82.78%)
> Compact direct scanned free             4476155.60 (    +0.00%) 2438835.00 (   -45.51%)
> Compact migrate scanned daemon %              1.32 (    +0.00%)      59.18 ( +2499.00%)
> Compact free scanned daemon %                 3.95 (    +0.00%)      54.31 ( +1018.20%)
> Alloc stall                                2425.00 (    +0.00%)     992.00 (   -59.07%)
> Pages kswapd scanned                     586756.68 (    +0.00%)  975390.20 (   +66.23%)
> Pages kswapd reclaimed                   385468.20 (    +0.00%)  437767.50 (   +13.57%)
> Pages direct scanned                     335199.56 (    +0.00%)  501824.20 (   +49.71%)
> Pages direct reclaimed                   127953.72 (    +0.00%)  151880.70 (   +18.70%)
> Pages scanned kswapd %                       64.43 (    +0.00%)      66.39 (    +2.99%)
> Swap out                                  14083.88 (    +0.00%)   45034.60 (  +219.74%)
> Swap in                                    3395.08 (    +0.00%)    7767.50 (  +128.75%)
> File refaults                             93546.68 (    +0.00%)  129648.30 (   +38.59%)
> 
> The THP fault success rate is drastically improved. A bigger share of
> the work is done by the background threads, as they now proactively
> maintain MIGRATE_FREE block reserves. The increase in memory pressure
> is shown by the uptick in swap activity.
> 
> Status
> 
> Initial test results look promising, but production testing has been
> lagging behind the effort to generalize this code for upstream, and
> putting all the pieces together to make THP work. I'll follow up as I
> gather more data.
> 
> Sending this out now as an RFC to get input on the overall direction.
> 
> The patches are based on v6.2.
> 
>  Documentation/admin-guide/sysctl/vm.rst |  21 -
>  block/bdev.c                            |   2 +-
>  include/linux/compaction.h              | 100 +---
>  include/linux/gfp.h                     |   2 -
>  include/linux/mm.h                      |   1 -
>  include/linux/mmzone.h                  |  30 +-
>  include/linux/page-isolation.h          |  28 +-
>  include/linux/pageblock-flags.h         |   4 +-
>  include/linux/vmstat.h                  |   8 -
>  include/trace/events/mmflags.h          |   4 +-
>  kernel/sysctl.c                         |   8 -
>  mm/compaction.c                         | 242 +++-----
>  mm/internal.h                           |  14 +-
>  mm/memory_hotplug.c                     |   4 +-
>  mm/page_alloc.c                         | 930 +++++++++++++-----------------
>  mm/page_isolation.c                     |  42 +-
>  mm/vmscan.c                             | 251 ++------
>  mm/vmstat.c                             |   6 +-
>  18 files changed, 629 insertions(+), 1068 deletions(-)
> 
>
  
Johannes Weiner April 19, 2023, 2:08 a.m. UTC | #2
Hi Kirill, thanks for taking a look so quickly.

On Wed, Apr 19, 2023 at 02:54:02AM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> > This series proposes to make THP allocations reliable by enforcing
> > pageblock hygiene, and aligning the allocator, reclaim and compaction
> > on the pageblock as the base unit for managing free memory. All orders
> > up to and including the pageblock are made first-class requests that
> > (outside of OOM situations) are expected to succeed without
> > exceptional investment by the allocating thread.
> > 
> > A neutral pageblock type is introduced, MIGRATE_FREE. The first
> > allocation to be placed into such a block claims it exclusively for
> > the allocation's migratetype. Fallbacks from a different type are no
> > longer allowed, and the block is "kept open" for more allocations of
> > the same type to ensure tight grouping. A pageblock becomes neutral
> > again only once all its pages have been freed.
> 
> Sounds like this will cause earlier OOM, no?
> 
> I guess with 2M pageblock on 64G server it shouldn't matter much. But how
> about smaller machines?

Yes, it's a tradeoff.

It's not really possible to reduce external fragmentation and increase
contiguity without also increasing the risk of internal fragmentation
to some extent. The tradeoff is slightly less, but overall faster, memory.

A 2M block size *seems* reasonable for most current setups. It's
actually still somewhat on the lower side, if you consider that we had
4k blocks when memory was a few megabytes. (4k pages for 4M RAM is the
same ratio as 2M pages for 2G RAM. My phone has 8G and my desktop 32G.
64G is unusually small for a datacenter server.)

I wouldn't be opposed to sticking this behind a separate config option
if there are setups that WOULD want to keep the current best-effort
compaction without the block hygiene. But obviously, from a
maintenance POV life would be much easier if we didn't have to.

FWIW, I have been doing tests in an environment constrained to 2G and
haven't had any issues with premature OOMs. But I'm happy to test
other situations and workloads that might be of interest to people.

> > Reclaim and compaction are changed from partial block reclaim to
> > producing whole neutral page blocks.
> 
> How does it affect allocation latencies? I see direct compact stall grew
> substantially. Hm?

Good question.

There are 260 more compact stalls but also 1,734 more successful THP
allocations. And 1,433 fewer allocation stalls. There seems to be much
less direct work performed per successful allocation.
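
For reference, those figures fall straight out of the per-run averages in
the table (approximate, since the columns are themselves averages):

```python
# Deltas between the PATCHED and VANILLA columns of the results table.
vanilla = {"compact stalls": 282.44, "thp allocs": 17913.24, "alloc stalls": 2425.00}
patched = {"compact stalls": 543.90, "thp allocs": 19647.50, "alloc stalls": 992.00}
for key in vanilla:
    print(key, round(patched[key] - vanilla[key], 1))
```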

But of course, that's not the whole story. Let me trace the actual
latencies.

Thanks for your thoughts!
Johannes
  
Matthew Wilcox April 19, 2023, 4:11 a.m. UTC | #3
On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> This series proposes to make THP allocations reliable by enforcing
> pageblock hygiene, and aligning the allocator, reclaim and compaction
> on the pageblock as the base unit for managing free memory. All orders
> up to and including the pageblock are made first-class requests that
> (outside of OOM situations) are expected to succeed without
> exceptional investment by the allocating thread.
> 
> A neutral pageblock type is introduced, MIGRATE_FREE. The first
> allocation to be placed into such a block claims it exclusively for
> the allocation's migratetype. Fallbacks from a different type are no
> longer allowed, and the block is "kept open" for more allocations of
> the same type to ensure tight grouping. A pageblock becomes neutral
> again only once all its pages have been freed.

YES!  This is exactly what I've been thinking is the right solution
for some time.  Thank you for doing it.
  
Vlastimil Babka April 19, 2023, 10:56 a.m. UTC | #4
On 4/19/23 04:08, Johannes Weiner wrote:
> Hi Kirill, thanks for taking a look so quickly.
> 
> On Wed, Apr 19, 2023 at 02:54:02AM +0300, Kirill A. Shutemov wrote:
>> On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
>> > This series proposes to make THP allocations reliable by enforcing
>> > pageblock hygiene, and aligning the allocator, reclaim and compaction
>> > on the pageblock as the base unit for managing free memory. All orders
>> > up to and including the pageblock are made first-class requests that
>> > (outside of OOM situations) are expected to succeed without
>> > exceptional investment by the allocating thread.
>> > 
>> > A neutral pageblock type is introduced, MIGRATE_FREE. The first
>> > allocation to be placed into such a block claims it exclusively for
>> > the allocation's migratetype. Fallbacks from a different type are no
>> > longer allowed, and the block is "kept open" for more allocations of
>> > the same type to ensure tight grouping. A pageblock becomes neutral
>> > again only once all its pages have been freed.
>> 
>> Sounds like this will cause earlier OOM, no?
>> 
>> I guess with 2M pageblock on 64G server it shouldn't matter much. But how
>> about smaller machines?
> 
> Yes, it's a tradeoff.
> 
> It's not really possible to reduce external fragmentation and increase
> contiguity without also increasing the risk of internal fragmentation
> to some extent. The tradeoff is slightly less, but overall faster, memory.
> 
> A 2M block size *seems* reasonable for most current setups. It's
> actually still somewhat on the lower side, if you consider that we had
> 4k blocks when memory was a few megabytes. (4k pages for 4M RAM is the
> same ratio as 2M pages for 2G RAM. My phone has 8G and my desktop 32G.
> 64G is unusually small for a datacenter server.)
> 
> I wouldn't be opposed to sticking this behind a separate config option
> if there are setups that WOULD want to keep the current best-effort
> compaction without the block hygiene. But obviously, from a
> maintenance POV life would be much easier if we didn't have to.

As much as tunables are frowned upon in general, this could make sense to me
even as a runtime tunable (maybe with defaults based on how large the
system is), because a datacenter server and a phone are after all not the
same thing. But of course it would be preferable to find out it works
reasonably well even for the smaller systems. For example, we already
completely disable mobility grouping if there's too little RAM for it to
make sense, which is a somewhat similar (but not completely identical) decision.

> FWIW, I have been doing tests in an environment constrained to 2G and
> haven't had any issues with premature OOMs. But I'm happy to test
> other situations and workloads that might be of interest to people.
> 
>> > Reclaim and compaction are changed from partial block reclaim to
>> > producing whole neutral page blocks.
>> 
>> How does it affect allocation latencies? I see direct compact stall grew
>> substantially. Hm?
> 
> Good question.
> 
> There are 260 more compact stalls but also 1,734 more successful THP
> allocations. And 1,433 fewer allocation stalls. There seems to be much
> less direct work performed per successful allocation.

Yeah, if there's a workload that uses THP madvise to indicate it prefers the
compaction stalls to base page fallbacks, and compaction is more successful,
it won't defer further attempts, so as a result there will be more stalls.
What we should rather watch out for are the latencies of allocations that
don't prefer the stalls, but might now be forced to clean up new MIGRATE_FREE
pageblocks for their order-0 allocation that would previously just fall back,
etc.

> But of course, that's not the whole story. Let me trace the actual
> latencies.
> 
> Thanks for your thoughts!
> Johannes
  
Mel Gorman April 21, 2023, 4:11 p.m. UTC | #5
On Wed, Apr 19, 2023 at 05:11:45AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 18, 2023 at 03:12:47PM -0400, Johannes Weiner wrote:
> > This series proposes to make THP allocations reliable by enforcing
> > pageblock hygiene, and aligning the allocator, reclaim and compaction
> > on the pageblock as the base unit for managing free memory. All orders
> > up to and including the pageblock are made first-class requests that
> > (outside of OOM situations) are expected to succeed without
> > exceptional investment by the allocating thread.
> > 
> > A neutral pageblock type is introduced, MIGRATE_FREE. The first
> > allocation to be placed into such a block claims it exclusively for
> > the allocation's migratetype. Fallbacks from a different type are no
> > longer allowed, and the block is "kept open" for more allocations of
> > the same type to ensure tight grouping. A pageblock becomes neutral
> > again only once all its pages have been freed.
> 
> YES!  This is exactly what I've been thinking is the right solution
> for some time.  Thank you for doing it.
> 

It was considered once upon a time and comes up every so often as variants
of a "sticky" pageblock bit that prevents mixing. The risk was ending up
in a context where memory within a suitable pageblock cannot be freed and
all of the available MOVABLE pageblocks have at least one pinned page that
cannot migrate from the allocating context. It can also potentially hit a
case where the majority of memory is UNMOVABLE pageblocks, each of which
has a single pagetable page that cannot be freed without an OOM kill.
Variants of issues like this would manifest as an OOM kill with plenty of
memory free, or excessive CPU usage on reclaim or compaction.

It doesn't kill the idea of the series at all, but it puts a lot of emphasis
on splitting the series by low-risk and high-risk. Maybe to the extent where
the absolute protection against mixing can be broken in OOM situations, or
via kernel command line or sysctl.
  
Matthew Wilcox April 21, 2023, 5:14 p.m. UTC | #6
On Fri, Apr 21, 2023 at 05:11:56PM +0100, Mel Gorman wrote:
> It was considered once upon a time and comes up every so often as variants
> of a "sticky" pageblock bit that prevents mixing. The risk was ending up
> in a context where memory within a suitable pageblock cannot be freed and
> all of the available MOVABLE pageblocks have at least one pinned page that
> cannot migrate from the allocating context. It can also potentially hit a
> case where the majority of memory is UNMOVABLE pageblocks, each of which
> has a single pagetable page that cannot be freed without an OOM kill.
> Variants of issues like this would manifest as an OOM kill with plenty of
> memory free, or excessive CPU usage on reclaim or compaction.
> 
> It doesn't kill the idea of the series at all but it puts a lot of
> emphasis on splitting the series into low-risk and high-risk parts.
> Maybe to the extent where the absolute protection against mixing can be
> broken in OOM situations, or via the kernel command line or a sysctl.

Has a variant been previously considered where MOVABLE allocations are
allowed to come from UNMOVABLE blocks?  After all, MOVABLE allocations
are generally, well, movable.  So an UNMOVABLE allocation could try to
migrate pages from a MIXED pageblock in order to turn the MIXED pageblock
back into an UNMOVABLE pageblock.

This might work better in practice because GFP_NOFS allocations tend
to also be MOVABLE, so allowing them to take up some of the UNMOVABLE
space temporarily feels like a get-out-of-OOM card.

(I've resisted talking about plans to make page table pages movable
because I don't think that's your point; that's just an example of a
currently-unmovable allocation, right?)

I mention this in part because on my laptop, ZONE_DMA is almost unused:

Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      2      2
Node 0, zone    DMA32   1685   1345   1152    554    424    212    104     40      2      0      0
Node 0, zone   Normal   6959   3530   1893   1862    629    483    107     10      0      0      0

That's 2 order-10 (=8MB), 2 order-9 (=4MB) and 1 order-8 (=1MB) for a
total of 13MB of memory.  That's insignificant to a 16GB laptop, but on
smaller machines, it might be worth allowing MOVABLE allocations to come
from ZONE_DMA on the grounds that they can be easily freed if anybody
ever allocated from ZONE_DMA.
  
David Hildenbrand May 2, 2023, 3:21 p.m. UTC | #7
On 21.04.23 19:14, Matthew Wilcox wrote:
> On Fri, Apr 21, 2023 at 05:11:56PM +0100, Mel Gorman wrote:
>> It was considered once upon a time and comes up every so often as
>> variants of a "sticky" pageblock bit that prevents mixing. The risk was
>> ending up in a context where memory within a suitable pageblock cannot
>> be freed and all of the available MOVABLE pageblocks have at least one
>> pinned page that cannot migrate from the allocating context. It can also
>> potentially hit a case where the majority of memory is UNMOVABLE
>> pageblocks, each of which has a single pagetable page that cannot be
>> freed without an OOM kill. Variants of issues like this would manifest
>> as an OOM-kill-with-plenty-of-memory-free bug, or as excessive CPU usage
>> on reclaim or compaction.
>>
>> It doesn't kill the idea of the series at all but it puts a lot of
>> emphasis on splitting the series into low-risk and high-risk parts.
>> Maybe to the extent where the absolute protection against mixing can be
>> broken in OOM situations, or via the kernel command line or a sysctl.
> 
> Has a variant been previously considered where MOVABLE allocations are
> allowed to come from UNMOVABLE blocks?  After all, MOVABLE allocations
> are generally, well, movable.  So an UNMOVABLE allocation could try to
> migrate pages from a MIXED pageblock in order to turn the MIXED pageblock
> back into an UNMOVABLE pageblock.

I might be completely off, but my understanding was that movable 
allocations can already happily be placed into unmovable blocks if 
required?

IIRC, it's primarily the zone fallback rules that prevent e.g. ZONE_DMA 
from getting filled immediately with movable data in your example. I 
might be wrong, though.

I guess what you mean is serving movable allocations much earlier from 
these other zones.

Having memory hotunplug in mind (as always ;)), I'd expect that such 
fragmentation must be allowed to happen to guarantee that memory (esp. 
ZONE_MOVABLE) can be properly evacuated even if there are not sufficient 
MOVABLE pageblocks around to hold all that (movable) data.