[RFC,0/4] Enable >0 order folio memory compaction

Message ID 20230912162815.440749-1-zi.yan@sent.com

Zi Yan Sept. 12, 2023, 4:28 p.m. UTC
  From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset enables >0 order folio memory compaction, which is one of
the prerequisites for large folio support[1]. It is on top of
mm-everything-2023-09-11-22-56.

Overview
===

To support >0 order folio compaction, the patchset changes how free pages used
for migration are kept during compaction. Free pages used to be split into
order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
page order stored in page->private is zeroed, and page reference is set to 1).
Now all free pages are kept in a MAX_ORDER+1 array of page lists based
on their order without post allocation process. When migrate_pages() asks for
a new page, one of the free pages, based on the requested page order, is
then processed and given out.
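
As a rough illustration of the bookkeeping described above, here is a toy Python model (not the kernel code). The class, its method names and the MAX_ORDER value are assumptions made up for this sketch; the only points it tries to show are the MAX_ORDER+1 lists indexed by order and the fact that processing is deferred until a page is handed out.

```python
MAX_ORDER = 10  # assumed value for illustration; the real one depends on the kernel config


class FreePages:
    """Toy model of the free pages isolated for compaction, bucketed by order."""

    def __init__(self):
        # One list per order 0..MAX_ORDER, i.e. MAX_ORDER + 1 lists.
        self.lists = [[] for _ in range(MAX_ORDER + 1)]

    def add(self, pfn, order):
        # The free page is kept whole: no split into order-0 pages and
        # no post allocation processing at this point.
        self.lists[order].append(pfn)

    def take(self, order):
        """Hand out one free page of exactly `order`, or None if there is none."""
        if not self.lists[order]:
            return None
        pfn = self.lists[order].pop()
        # Post allocation processing would happen here, right before
        # the page is given to the migration code.
        return pfn
```

In this model a request for an order with an empty list simply fails; the free page split described under Optimizations is what relaxes that.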


Optimizations
===

1. Free page split is added to increase migration success rate in case
a source page does not have a matched free page in the free page lists.
Free page merge is possible but not implemented, since existing
PFN-based buddy page merge algorithm requires the identification of
buddy pages, but free pages kept for memory compaction cannot have
PageBuddy set to avoid confusing other PFN scanners.

2. Sort source pages in ascending order before migration is added to
reduce free page split. Otherwise, high order free pages might be
prematurely split, causing undesired high order folio migration failures.
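
The interaction between free page split and the order in which source pages are processed can be sketched with a toy Python model. Everything here is hypothetical (function names, the dict of per-order counts, the example numbers) and it ignores PFNs entirely. The cover letter says "ascending", while a reviewer later in the thread suggests "descending"; the sketch just runs both orders so the effect is visible.

```python
def alloc_with_split(free, order):
    """Take one free page of `order` from `free` (dict: order -> count).

    Falls back to splitting the smallest larger free page. Returns True on
    success and False if nothing large enough is left.
    """
    top = max(free, default=-1)
    for cur in range(order, top + 1):
        if free.get(cur, 0) > 0:
            break
    else:
        return False
    free[cur] -= 1
    # Buddy-style split: each halving keeps one half and returns the
    # other half to the list one order down.
    while cur > order:
        cur -= 1
        free[cur] = free.get(cur, 0) + 1
    return True


def migrate_in_order(source_orders, free):
    """Try to find a target for each source folio, in the order given."""
    free = dict(free)  # work on a copy so both runs start from the same state
    return [(order, alloc_with_split(free, order)) for order in source_orders]


free = {1: 1, 2: 1}   # one order-1 and one order-2 free page
sources = [1, 2, 1]   # orders of the folios to migrate
ascending = migrate_in_order(sorted(sources), free)
descending = migrate_in_order(sorted(sources, reverse=True), free)
```

With these made-up numbers, processing small folios first splits the only order-2 free page and the order-2 folio then fails, whereas processing large folios first lets the order-2 folio succeed and one order-1 folio fails instead.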


TODOs
===

1. Refactor free page post allocation and free page preparation code so
that compaction_alloc() and compaction_free() can call functions instead
of hard coding.

2. One possible optimization is to allow migrate_pages() to continue
even if get_new_folio() returns a NULL. In general, that means there is
not enough memory. But in >0 order folio compaction case, that means
there is no suitable free page at source page order. It might be better
to skip that page and finish the rest of migration to achieve a better
compaction result.

3. Another possible optimization is to enable free page merge. It is
possible that a to-be-migrated page causes free page split then fails to
migrate eventually. We would lose a high order free page without free
page merge function. But a way of identifying free pages for memory
compaction is needed to reuse existing PFN-based buddy page merge.

4. The implemented >0 order folio compaction algorithm is quite naive
and does not consider all possible situations. A better algorithm can
improve compaction success rate.


Feel free to give comments and ask questions.

Thanks.


[1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/

Zi Yan (4):
  mm/compaction: add support for >0 order folio memory compaction.
  mm/compaction: optimize >0 order folio compaction with free page
    split.
  mm/compaction: optimize >0 order folio compaction by sorting source
    pages.
  mm/compaction: enable compacting >0 order folios.

 mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
 mm/internal.h   |   7 +-
 2 files changed, 176 insertions(+), 36 deletions(-)
  

Comments

Luis Chamberlain Sept. 21, 2023, 12:55 a.m. UTC | #1
On Tue, Sep 12, 2023 at 12:28:11PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Feel free to give comments and ask questions.

How about testing? I'm looking with an eye towards creating a
pathological situation which can be automated for fragmentation and see
how things go.

The description of Mel Gorman's original artificial fragmentation test,
taken from his first patches to help with fragmentation avoidance from
2018, suggests he tried [0]:

------ From 2018
a) Create an XFS filesystem

b) Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate is
not used (fallocate=none). With multiple IO issuers this creates a mix
of slab and page cache allocations over time. The total size of the
files is 150% physical memory so that the slabs and page cache pages get
mixed

c) Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it took to
create the files. It'll fault back in old data and further interleave
slab and page cache allocations. As it's now low on memory due to step
2, fragmentation occurs as pageblocks get stolen. While step 3 is still
running, start a process that tries to allocate 75% of memory as huge
pages with a number of threads. The number of threads is based on a
(NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with
fio, any other threads or forcing cross-NUMA scheduling. Note that the
test has not been used on a machine with less than 8 cores. The
benchmark records whether huge pages were allocated and what the fault
latency was in microseconds

d) Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
------- end of extract

These days we can probably do a bit more damage. There have been concerns
that LBS support (block size > page size) could make fragmentation worse.
One of the reasons is that any file created, regardless of its size, will
require at least the block size, so with a 64k block size that means a 64k
allocation for each new file on that filesystem, and clearly you may run
out of lower order allocations pretty quickly. You can also create
different large block filesystems too, one with 64k and another with 32k.
Although LBS is new and we're still ironing out the kinks, if you want to
give it a go we've rebased the patches onto Linus' tree [1], and if you
want to ramp up fast you could use kdevops [2], which lets you pick that
branch and also a series of NVMe drives (by enabling
CONFIG_LIBVIRT_EXTRA_STORAGE_DRIVE_NVME) for large IO experimentation (by
enabling CONFIG_VAGRANT_ENABLE_LARGEIO). Creating different filesystems
with large block sizes (64k, 32k, 16k) on a 4k sector size drive
(mkfs.xfs -f -b size=64k -s size=4k) should let you easily do tons of
crazy pathological things.

Are there other known recipes to help test this stuff?
How do we measure success in your patches for fragmentation exactly?

[0] https://lwn.net/Articles/770235/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev
[2] https://github.com/linux-kdevops/kdevops

  Luis
  
Luis Chamberlain Sept. 21, 2023, 1:16 a.m. UTC | #2
On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
> Are there other known recipes to help test this stuff?

You know, it got me wondering... how memory fragmented a system
might be by just running fstests, because, well, we already have
that automated in kdevops and it also has LBS support for all the
different large block sizes on 4k sector size. So if we just had a
way to "measure" or "quantify" memory fragmentation with a score,
we could just tally up how we did after 4 hours of testing for each
block size with a set of memory on the guest / target node / cloud
system.

  Luis
  
John Hubbard Sept. 21, 2023, 2:05 a.m. UTC | #3
On 9/20/23 18:16, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
>> Are there other known recipes to help test this stuff?
> 
> You know, it got me wondering... how memory fragmented a system
> might be by just running fstests, because, well, we already have
> that automated in kdevops and it also has LBS support for all the
> different large block sizes on 4k sector size. So if we just had a
> way to "measure" or "quantify" memory fragmentation with a score,
> we could just tally up how we did after 4 hours of testing for each
> block size with a set of memory on the guest / target node / cloud
> system.
> 
>    Luis

I thought about it, and here is one possible way to quantify
fragmentation with just a single number. Take this with some
skepticism because it is a first draft sort of thing:

a) Let BLOCKS be the number of 4KB pages (or more generally, the number
of smallest sized objects allowed) in the area.

b) Let FRAGS be the number of free *or* allocated chunks (no need to
consider the size of each, as that is automatically taken into
consideration).

Then:
       fragmentation percentage = (FRAGS / BLOCKS) * 100%

This has some nice properties. For one thing, it's easy to calculate.
For another, it can discern between these cases:

Assume a 12-page area:

Case 1) 6 pages allocated unevenly:

1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated

fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%

Case 2) 6 pages allocated evenly: every other page is allocated:

fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%
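
For what it is worth, the metric above is short enough to sketch in a few lines of Python. The function name and the representation of the area as a list of booleans (True = allocated, one entry per smallest-sized block) are assumptions for this sketch only.

```python
from itertools import groupby


def fragmentation_percent(pages):
    """Return (FRAGS / BLOCKS) * 100 for a sequence of allocated/free flags.

    A "chunk" is a maximal run of consecutive pages in the same state,
    whether free or allocated.
    """
    blocks = len(pages)
    if blocks == 0:
        raise ValueError("empty area")
    frags = sum(1 for _ in groupby(pages))
    return 100.0 * frags / blocks


A, F = True, False
# Case 1: 1 allocated | 1 free | 1 allocated | 5 free | 4 allocated
case1 = [A, F, A, F, F, F, F, F, A, A, A, A]
# Case 2: every other page is allocated
case2 = [A, F] * 6
```

Here fragmentation_percent(case1) comes out at about 41.7 and fragmentation_percent(case2) at 100.0, matching the two cases. One property that falls out of the formula: a completely free (or completely allocated) area scores 1/BLOCKS rather than zero.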



thanks,
  
Luis Chamberlain Sept. 21, 2023, 3:14 a.m. UTC | #4
On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
> On 9/20/23 18:16, Luis Chamberlain wrote:
> > On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
> > > Are there other known recipes to help test this stuff?
> > 
> > You know, it got me wondering... how memory fragmented a system
> > might be by just running fstests, because, well, we already have
> > that automated in kdevops and it also has LBS support for all the
> > different large block sizes on 4k sector size. So if we just had a
> > way to "measure" or "quantify" memory fragmentation with a score,
> > we could just tally up how we did after 4 hours of testing for each
> > block size with a set of memory on the guest / target node / cloud
> > system.
> > 
> >    Luis
> 
> I thought about it, and here is one possible way to quantify
> fragmentation with just a single number. Take this with some
> skepticism because it is a first draft sort of thing:
> 
> a) Let BLOCKS be the number of 4KB pages (or more generally, the number
> of smallest sized objects allowed) in the area.
> 
> b) Let FRAGS be the number of free *or* allocated chunks (no need to
> consider the size of each, as that is automatically taken into
> consideration).
> 
> Then:
>       fragmentation percentage = (FRAGS / BLOCKS) * 100%
> 
> This has some nice properties. For one thing, it's easy to calculate.
> For another, it can discern between these cases:
> 
> Assume a 12-page area:
> 
> Case 1) 6 pages allocated unevenly:
> 
> 1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated
> 
> fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%
> 
> Case 2) 6 pages allocated evenly: every other page is allocated:
> 
> fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%

Thanks! Will try this!

BTW stress-ng might also be a nice way to do other pathological things here.

  Luis
  
Zi Yan Sept. 21, 2023, 3:56 p.m. UTC | #5
On 20 Sep 2023, at 23:14, Luis Chamberlain wrote:

> On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
>> On 9/20/23 18:16, Luis Chamberlain wrote:
>>> On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
>>>> Are there other known recipes to help test this stuff?
>>>
>>> You know, it got me wondering... how memory fragmented a system
>>> might be by just running fstests, because, well, we already have
>>> that automated in kdevops and it also has LBS support for all the
>>> different large block sizes on 4k sector size. So if we just had a
>>> way to "measure" or "quantify" memory fragmentation with a score,
>>> we could just tally up how we did after 4 hours of testing for each
>>> block size with a set of memory on the guest / target node / cloud
>>> system.
>>>
>>>    Luis
>>
>> I thought about it, and here is one possible way to quantify
>> fragmentation with just a single number. Take this with some
>> skepticism because it is a first draft sort of thing:
>>
>> a) Let BLOCKS be the number of 4KB pages (or more generally, the number
>> of smallest sized objects allowed) in the area.
>>
>> b) Let FRAGS be the number of free *or* allocated chunks (no need to
>> consider the size of each, as that is automatically taken into
>> consideration).
>>
>> Then:
>>       fragmentation percentage = (FRAGS / BLOCKS) * 100%
>>
>> This has some nice properties. For one thing, it's easy to calculate.
>> For another, it can discern between these cases:
>>
>> Assume a 12-page area:
>>
>> Case 1) 6 pages allocated unevenly:
>>
>> 1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated
>>
>> fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%
>>
>> Case 2) 6 pages allocated evenly: every other page is allocated:
>>
>> fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%
>
> Thanks! Will try this!
>
> BTW stress-ng might also be a nice way to do other pathological things here.
>
>   Luis

Thanks. These are all good performance tests and a good fragmentation metric.
I would like to get it working properly first. As I mentioned in another email,
there will be tons of exploration to do to improve >0 order folio memory compaction
with the consideration of:

1. the distribution of free pages,
2. the goal of compaction, e.g., to allocate a single order folio or reduce
the overall fragmentation level,
3. the runtime cost of compaction, and more.
My patchset aims to provide a reasonably working compaction functionality.


In terms of correctness testing, what I have done locally is to:

1. have an XFS partition,
2. create files with various sizes from 4KB to 2MB,
3. mmap each of these files to use one folio at the file size,
4. get the physical addresses of these folios,
5. trigger global memory compaction via sysctl,
6. read the physical addresses of these folios again.
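
A rough sketch of how steps 4 to 6 could be scripted from userspace, in Python. This is not the script used for the testing above; the function names are made up. It relies on two standard procfs interfaces: /proc/self/pagemap (one 64-bit entry per virtual page, PFN in bits 0-54, bit 63 set when the page is present) and /proc/sys/vm/compact_memory. Both need root, and without CAP_SYS_ADMIN the PFN field reads as zero.

```python
import ctypes
import mmap
import struct

PAGEMAP_ENTRY_BYTES = 8
PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number
PRESENT_BIT = 1 << 63      # bit 63: page is present in RAM


def decode_pagemap_entry(raw):
    """Return the PFN from one 8-byte pagemap entry, or None if not present."""
    (entry,) = struct.unpack("=Q", raw)
    if not entry & PRESENT_BIT:
        return None
    return entry & PFN_MASK


def mapping_address(mm):
    """Virtual address of a writable mmap object (ctypes idiom)."""
    return ctypes.addressof(ctypes.c_char.from_buffer(mm))


def read_pfn(vaddr):
    """Look up the PFN backing `vaddr` in this process (steps 4 and 6)."""
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((vaddr // mmap.PAGESIZE) * PAGEMAP_ENTRY_BYTES)
        return decode_pagemap_entry(f.read(PAGEMAP_ENTRY_BYTES))


def trigger_compaction():
    """Ask the kernel to compact all zones (step 5); needs root."""
    with open("/proc/sys/vm/compact_memory", "w") as f:
        f.write("1")
```

The intended use would be: mmap a file, touch it so the folio is faulted in, call read_pfn(mapping_address(mm)) before and after trigger_compaction(), and compare the two values. A changed PFN only shows that the page moved; checking that the folio is still intact at its original order needs more than this.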

--
Best Regards,
Yan, Zi
  
Ryan Roberts Oct. 2, 2023, 12:32 p.m. UTC | #6
Hi Zi,

On 12/09/2023 17:28, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisites for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.

I've taken a quick look at these and realize I'm not well equipped to provide
much in the way of meaningful review comments; All I can say is thanks for
putting this together, and yes, I think it will become even more important for
my work on anonymous large folios.


> 
> Overview
> ===
> 
> To support >0 order folio compaction, the patchset changes how free pages used
> for migration are kept during compaction. Free pages used to be split into
> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
> page order stored in page->private is zeroed, and page reference is set to 1).
> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
> on their order without post allocation process. When migrate_pages() asks for
> a new page, one of the free pages, based on the requested page order, is
> then processed and given out.
> 
> 
> Optimizations
> ===
> 
> 1. Free page split is added to increase migration success rate in case
> a source page does not have a matched free page in the free page lists.
> Free page merge is possible but not implemented, since existing
> PFN-based buddy page merge algorithm requires the identification of
> buddy pages, but free pages kept for memory compaction cannot have
> PageBuddy set to avoid confusing other PFN scanners.
> 
> 2. Sort source pages in ascending order before migration is added to
> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.

Not knowing much about how compaction actually works, naively I would imagine
that if you are just trying to free up a known amount of contiguous physical
space, then working through the pages in PFN order is more likely to yield the
result quicker? Unless all of the pages in the set must be successfully migrated
in order to free up the required amount of space...

Thanks,
Ryan
  
Huang, Ying Oct. 9, 2023, 7:12 a.m. UTC | #7
Hi, Zi,

Thanks for your patch!

Zi Yan <zi.yan@sent.com> writes:

> From: Zi Yan <ziy@nvidia.com>
>
> Hi all,
>
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisites for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.
>
> Overview
> ===
>
> To support >0 order folio compaction, the patchset changes how free pages used
> for migration are kept during compaction.

migrate_pages() can split the large folio for allocation failure.  So
the minimal implementation could be

- allow migrating large folios in compaction
- return -ENOMEM for order > 0 in compaction_alloc()

The performance may not be desirable.  But that may be a baseline for
further optimization.
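
To make the proposed baseline concrete, here is a toy Python model of it. The two functions are hypothetical stand-ins, not the kernel's compaction_alloc() or migrate_pages(): None stands in for the allocation failure, a list of PFNs stands in for the order-0 free pages, and the all-or-nothing handling after a split is a simplification of this sketch.

```python
def compaction_alloc_baseline(order, free_base_pages):
    """Baseline allocator: only order-0 targets are ever handed out."""
    if order > 0 or not free_base_pages:
        return None  # stands in for the allocation failure
    return free_base_pages.pop()


def migrate_folio(order, free_base_pages):
    """Return the target PFNs used for one source folio ([] if it stays put)."""
    target = compaction_alloc_baseline(order, free_base_pages)
    if target is not None:
        return [target]
    nr_pages = 1 << order
    if order == 0 or len(free_base_pages) < nr_pages:
        return []
    # Allocation failed for a large folio: split the source into base
    # pages and migrate each of them to an order-0 target.
    return [compaction_alloc_baseline(0, free_base_pages) for _ in range(nr_pages)]
```

In this model every large folio that compaction touches ends up as base pages, which is the cost a per-order allocator is meant to avoid.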

And, if we can measure the performance for each step of optimization,
that will be even better.

> Free pages used to be split into
> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
> page order stored in page->private is zeroed, and page reference is set to 1).
> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
> on their order without post allocation process. When migrate_pages() asks for
> a new page, one of the free pages, based on the requested page order, is
> then processed and given out.
>
>
> Optimizations
> ===
>
> 1. Free page split is added to increase migration success rate in case
> a source page does not have a matched free page in the free page lists.
> Free page merge is possible but not implemented, since existing
> PFN-based buddy page merge algorithm requires the identification of
> buddy pages, but free pages kept for memory compaction cannot have
> PageBuddy set to avoid confusing other PFN scanners.
>
> 2. Sort source pages in ascending order before migration is added to

Trivial.

s/ascending/descending/

> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.
>
>
> TODOs
> ===
>
> 1. Refactor free page post allocation and free page preparation code so
> that compaction_alloc() and compaction_free() can call functions instead
> of hard coding.
>
> 2. One possible optimization is to allow migrate_pages() to continue
> even if get_new_folio() returns a NULL. In general, that means there is
> not enough memory. But in >0 order folio compaction case, that means
> there is no suitable free page at source page order. It might be better
> to skip that page and finish the rest of migration to achieve a better
> compaction result.

We can split the source folio if get_new_folio() returns NULL.  So, do
we really need this?

In general, we may reconsider all further optimizations given splitting
is available already.

> 3. Another possible optimization is to enable free page merge. It is
> possible that a to-be-migrated page causes free page split then fails to
> migrate eventually. We would lose a high order free page without free
> page merge function. But a way of identifying free pages for memory
> compaction is needed to reuse existing PFN-based buddy page merge.
>
> 4. The implemented >0 order folio compaction algorithm is quite naive
> and does not consider all possible situations. A better algorithm can
> improve compaction success rate.
>
>
> Feel free to give comments and ask questions.
>
> Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>
> Zi Yan (4):
>   mm/compaction: add support for >0 order folio memory compaction.
>   mm/compaction: optimize >0 order folio compaction with free page
>     split.
>   mm/compaction: optimize >0 order folio compaction by sorting source
>     pages.
>   mm/compaction: enable compacting >0 order folios.
>
>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>  mm/internal.h   |   7 +-
>  2 files changed, 176 insertions(+), 36 deletions(-)

--
Best Regards,
Huang, Ying
  
Zi Yan Oct. 9, 2023, 1:24 p.m. UTC | #8
On 2 Oct 2023, at 8:32, Ryan Roberts wrote:

> Hi Zi,
>
> On 12/09/2023 17:28, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset enables >0 order folio memory compaction, which is one of
>> the prerequisites for large folio support[1]. It is on top of
>> mm-everything-2023-09-11-22-56.
>
> I've taken a quick look at these and realize I'm not well equipped to provide
> much in the way of meaningful review comments; All I can say is thanks for
> putting this together, and yes, I think it will become even more important for
> my work on anonymous large folios.
>
>
>>
>> Overview
>> ===
>>
>> To support >0 order folio compaction, the patchset changes how free pages used
>> for migration are kept during compaction. Free pages used to be split into
>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>> page order stored in page->private is zeroed, and page reference is set to 1).
>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>> on their order without post allocation process. When migrate_pages() asks for
>> a new page, one of the free pages, based on the requested page order, is
>> then processed and given out.
>>
>>
>> Optimizations
>> ===
>>
>> 1. Free page split is added to increase migration success rate in case
>> a source page does not have a matched free page in the free page lists.
>> Free page merge is possible but not implemented, since existing
>> PFN-based buddy page merge algorithm requires the identification of
>> buddy pages, but free pages kept for memory compaction cannot have
>> PageBuddy set to avoid confusing other PFN scanners.
>>
>> 2. Sort source pages in ascending order before migration is added to
>> reduce free page split. Otherwise, high order free pages might be
>> prematurely split, causing undesired high order folio migration failures.
>
> Not knowing much about how compaction actually works, naively I would imagine
> that if you are just trying to free up a known amount of contiguous physical
> space, then working through the pages in PFN order is more likely to yield the
> result quicker? Unless all of the pages in the set must be successfully migrated
> in order to free up the required amount of space...

During compaction, pages are not freed, since that is the job of page reclaim.
The goal of compaction is to get a high order free page without freeing existing
pages to avoid potential high cost IO operations. If compaction does not work,
page reclaim would free pages to get us there (and potentially another follow-up
compaction). So either pages are migrated or stay where they are during compaction.

BTW compaction works by scanning in use pages from lower PFN to higher PFN,
and free pages from higher PFN to lower PFN until two scanners meet in the middle.
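
The two-scanner scheme can be illustrated with a toy Python model over a flat list of pages (True = in use, index = PFN). This is only a sketch of the idea with a made-up function name; the real code works on pageblocks, isolates pages in batches, and has to deal with pages that cannot be moved.

```python
def compact(zone):
    """Move in-use pages from low PFNs into free pages at high PFNs.

    Returns the new layout and the number of pages migrated.
    """
    pages = list(zone)
    migrate, free = 0, len(pages) - 1
    moved = 0
    while True:
        # Migration scanner: walk up from the lowest PFN to an in-use page.
        while migrate < free and not pages[migrate]:
            migrate += 1
        # Free scanner: walk down from the highest PFN to a free page.
        while migrate < free and pages[free]:
            free -= 1
        if migrate >= free:
            break  # the two scanners met
        pages[migrate], pages[free] = False, True
        moved += 1
        migrate += 1
        free -= 1
    return pages, moved


layout, moved = compact([True, False, True, False, False, True])
```

For the six-page example, two pages are migrated and the three free pages end up contiguous at the low end.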

--
Best Regards,
Yan, Zi
  
Zi Yan Oct. 9, 2023, 1:43 p.m. UTC | #9
On 9 Oct 2023, at 3:12, Huang, Ying wrote:

> Hi, Zi,
>
> Thanks for your patch!
>
> Zi Yan <zi.yan@sent.com> writes:
>
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset enables >0 order folio memory compaction, which is one of
>> the prerequisites for large folio support[1]. It is on top of
>> mm-everything-2023-09-11-22-56.
>>
>> Overview
>> ===
>>
>> To support >0 order folio compaction, the patchset changes how free pages used
>> for migration are kept during compaction.
>
> migrate_pages() can split the large folio for allocation failure.  So
> the minimal implementation could be
>
> - allow migrating large folios in compaction
> - return -ENOMEM for order > 0 in compaction_alloc()
>
> The performance may not be desirable.  But that may be a baseline for
> further optimization.

I would imagine it might cause a regression since compaction might gradually
split high order folios in the system. But I can move Patch 4 first to make this
the baseline and see how system performance changes.

>
> And, if we can measure the performance for each step of optimization,
> that will be even better.

Do you have any benchmark in mind for the performance tests? vm-scalability?

>
>> Free pages used to be split into
>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>> page order stored in page->private is zeroed, and page reference is set to 1).
>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>> on their order without post allocation process. When migrate_pages() asks for
>> a new page, one of the free pages, based on the requested page order, is
>> then processed and given out.
>>
>>
>> Optimizations
>> ===
>>
>> 1. Free page split is added to increase migration success rate in case
>> a source page does not have a matched free page in the free page lists.
>> Free page merge is possible but not implemented, since existing
>> PFN-based buddy page merge algorithm requires the identification of
>> buddy pages, but free pages kept for memory compaction cannot have
>> PageBuddy set to avoid confusing other PFN scanners.
>>
>> 2. Sort source pages in ascending order before migration is added to
>
> Trivial.
>
> s/ascending/descending/
>
>> reduce free page split. Otherwise, high order free pages might be
>> prematurely split, causing undesired high order folio migration failures.
>>
>>
>> TODOs
>> ===
>>
>> 1. Refactor free page post allocation and free page preparation code so
>> that compaction_alloc() and compaction_free() can call functions instead
>> of hard coding.
>>
>> 2. One possible optimization is to allow migrate_pages() to continue
>> even if get_new_folio() returns a NULL. In general, that means there is
>> not enough memory. But in >0 order folio compaction case, that means
>> there is no suitable free page at source page order. It might be better
>> to skip that page and finish the rest of migration to achieve a better
>> compaction result.
>
> We can split the source folio if get_new_folio() returns NULL.  So, do
> we really need this?

It depends. The situation where it can help is when the system is going
to allocate a high order free page and triggers a compaction: it is possible to
get the high order free page by migrating a bunch of base pages instead of
splitting an existing high order folio.

>
> In general, we may reconsider all further optimizations given splitting
> is available already.

In my mind, split should be avoided as much as possible. But it really depends
on the actual situation, e.g., how much effort and cost the compaction wants
to pay to get memory defragmented. If the system really wants to get a high
order free page at any cost, split can be used without any issue. But applications
might lose performance because existing large folios are split just to get a
new one.

Like I said in the email, there are tons of optimizations and policies for us
to explore. We can start with the bare minimum support (if no performance
regression is observed, we can even start with splitting all high order folios
like you suggested) and add optimizations one by one.

>
>> 3. Another possible optimization is to enable free page merge. It is
>> possible that a to-be-migrated page causes free page split then fails to
>> migrate eventually. We would lose a high order free page without free
>> page merge function. But a way of identifying free pages for memory
>> compaction is needed to reuse existing PFN-based buddy page merge.
>>
>> 4. The implemented >0 order folio compaction algorithm is quite naive
>> and does not consider all possible situations. A better algorithm can
>> improve compaction success rate.
>>
>>
>> Feel free to give comments and ask questions.
>>
>> Thanks.
>>
>>
>> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>
>> Zi Yan (4):
>>   mm/compaction: add support for >0 order folio memory compaction.
>>   mm/compaction: optimize >0 order folio compaction with free page
>>     split.
>>   mm/compaction: optimize >0 order folio compaction by sorting source
>>     pages.
>>   mm/compaction: enable compacting >0 order folios.
>>
>>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>  mm/internal.h   |   7 +-
>>  2 files changed, 176 insertions(+), 36 deletions(-)
>
> --
> Best Regards,
> Huang, Ying


--
Best Regards,
Yan, Zi
  
Ryan Roberts Oct. 9, 2023, 2:10 p.m. UTC | #10
On 09/10/2023 14:24, Zi Yan wrote:
> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
> 
>> Hi Zi,
>>
>> On 12/09/2023 17:28, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset enables >0 order folio memory compaction, which is one of
>>> the prerequisites for large folio support[1]. It is on top of
>>> mm-everything-2023-09-11-22-56.
>>
>> I've taken a quick look at these and realize I'm not well equipped to provide
>> much in the way of meaningful review comments; All I can say is thanks for
>> putting this together, and yes, I think it will become even more important for
>> my work on anonymous large folios.
>>
>>
>>>
>>> Overview
>>> ===
>>>
>>> To support >0 order folio compaction, the patchset changes how free pages used
>>> for migration are kept during compaction. Free pages used to be split into
>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>> on their order without post allocation process. When migrate_pages() asks for
>>> a new page, one of the free pages, based on the requested page order, is
>>> then processed and given out.
>>>
>>>
>>> Optimizations
>>> ===
>>>
>>> 1. Free page split is added to increase migration success rate in case
>>> a source page does not have a matched free page in the free page lists.
>>> Free page merge is possible but not implemented, since existing
>>> PFN-based buddy page merge algorithm requires the identification of
>>> buddy pages, but free pages kept for memory compaction cannot have
>>> PageBuddy set to avoid confusing other PFN scanners.
>>>
>>> 2. Sort source pages in ascending order before migration is added to
>>> reduce free page split. Otherwise, high order free pages might be
>>> prematurely split, causing undesired high order folio migration failures.
>>
>> Not knowing much about how compaction actually works, naively I would imagine
>> that if you are just trying to free up a known amount of contiguous physical
>> space, then working through the pages in PFN order is more likely to yield the
>> result quicker? Unless all of the pages in the set must be successfully migrated
>> in order to free up the required amount of space...
> 
> During compaction, pages are not freed, since that is the job of page reclaim.

Sorry yes - my fault for using sloppy language. When I said "free up a known
amount of contiguous physical space", I really meant "move pages in order to
recover an amount of contiguous physical space". But I still think the rest of
what I said applies; wouldn't you be more likely to reach your goal quicker if
you sort by PFN?

> The goal of compaction is to get a high order free page without freeing existing
> pages to avoid potential high cost IO operations. If compaction does not work,
> page reclaim would free pages to get us there (and potentially another follow-up
> compaction). So either pages are migrated or stay where they are during compaction.
> 
> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
> 
> --
> Best Regards,
> Yan, Zi
  
Zi Yan Oct. 9, 2023, 3:52 p.m. UTC | #11
(resent as plain text)
On 9 Oct 2023, at 10:10, Ryan Roberts wrote:

> On 09/10/2023 14:24, Zi Yan wrote:
>> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>>
>>> Hi Zi,
>>>
>>> On 12/09/2023 17:28, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>> the prerequisites for large folio support[1]. It is on top of
>>>> mm-everything-2023-09-11-22-56.
>>>
>>> I've taken a quick look at these and realize I'm not well equipped to provide
>>> much in the way of meaningful review comments; All I can say is thanks for
>>> putting this together, and yes, I think it will become even more important for
>>> my work on anonymous large folios.
>>>
>>>
>>>>
>>>> Overview
>>>> ===
>>>>
>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>> for migration are kept during compaction. Free pages used to be split into
>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>> on their order without post allocation process. When migrate_pages() asks for
>>>> a new page, one of the free pages, based on the requested page order, is
>>>> then processed and given out.
>>>>
>>>>
>>>> Optimizations
>>>> ===
>>>>
>>>> 1. Free page split is added to increase migration success rate in case
>>>> a source page does not have a matched free page in the free page lists.
>>>> Free page merge is possible but not implemented, since existing
>>>> PFN-based buddy page merge algorithm requires the identification of
>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>
>>>> 2. Sort source pages in ascending order before migration is added to
>>>> reduce free page split. Otherwise, high order free pages might be
>>>> prematurely split, causing undesired high order folio migration failures.
>>>
>>> Not knowing much about how compaction actually works, naively I would imagine
>>> that if you are just trying to free up a known amount of contiguous physical
>>> space, then working through the pages in PFN order is more likely to yield the
>>> result quicker? Unless all of the pages in the set must be successfully migrated
>>> in order to free up the required amount of space...
>>
>> During compaction, pages are not freed, since that is the job of page reclaim.
>
> Sorry yes - my fault for using sloppy language. When I said "free up a known
> amount of contiguous physical space", I really meant "move pages in order to
> recover an amount of contiguous physical space". But I still think the rest of
> what I said applies; wouldn't you be more likely to reach your goal quicker if
> you sort by PFN?

Not always. Suppose the in-use folios on the left are order-2, order-2, order-4
(all contiguous in one pageblock) and the free pages on the right are order-4
(pageblock N), order-2, order-2 (pageblock N-1). The free pages are not a single
order-8, since there are in-use folios in the middle. Going in PFN order will not
get you an order-8 free page, since the first order-4 free page will be split into
two order-2 for the first two order-2 in-use folios. But if you migrate in
descending order of in-use page orders, you can get an order-8 free page at the end.

The patchset minimizes free page splits to avoid the situation described above,
since once a high order free page is split, the opportunity to migrate a high order
in-use folio into it is gone and hard to recover.
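
To make the ordering argument concrete, here is a small userspace C sketch of the
scenario above. It is not code from the patchset: struct sim, take_or_split(),
sim_alloc() and count_migrated() are hypothetical names, and the sketch assumes
that free pages are isolated lazily, one at a time in free-scanner order, only
when the per-order lists cannot satisfy a request. Splitting follows the usual
buddy rule, so an order-4 page split for an order-2 request leaves one order-3
and one order-2 page on the lists.

```c
#include <assert.h>

#define NR_ORDERS 11	/* orders 0..10, standing in for a MAX_ORDER+1 array */

struct sim {
	int nr_free[NR_ORDERS];	/* isolated free pages, counted per order */
	const int *pending;	/* orders of free pages not yet isolated */
	int nr_pending;
	int next;		/* index of the next pending free page */
};

/* Take a free page of exactly @order, else split the smallest larger one. */
static int take_or_split(struct sim *s, int order)
{
	int o;

	for (o = order; o < NR_ORDERS; o++)
		if (s->nr_free[o] > 0)
			break;
	if (o == NR_ORDERS)
		return 0;

	s->nr_free[o]--;
	while (o > order) {
		o--;
		s->nr_free[o]++;	/* unused buddy half stays on the lists */
	}
	return 1;
}

/* Serve one request, isolating further free pages only when needed. */
static int sim_alloc(struct sim *s, int order)
{
	assert(order >= 0 && order < NR_ORDERS);
	for (;;) {
		int o;

		if (take_or_split(s, order))
			return 1;
		if (s->next == s->nr_pending)
			return 0;	/* nothing left to isolate */
		o = s->pending[s->next++];
		assert(o >= 0 && o < NR_ORDERS);
		s->nr_free[o]++;
	}
}

/* Number of source folios that found a destination, in the given order. */
static int count_migrated(const int *src, int nr_src,
			  const int *freep, int nr_freep)
{
	struct sim s = { .pending = freep, .nr_pending = nr_freep };
	int migrated = 0;

	for (int i = 0; i < nr_src; i++)
		migrated += sim_alloc(&s, src[i]);
	return migrated;
}
```

With free pages {4, 2, 2}, passing the sources in PFN order {2, 2, 4} migrates
only the two order-2 folios, because the order-4 free page is split before the
order-4 folio asks for it. Passing them in descending order {4, 2, 2} migrates
all three.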


>> The goal of compaction is to get a high order free page without freeing existing
>> pages to avoid potential high cost IO operations. If compaction does not work,
>> page reclaim would free pages to get us there (and potentially another follow-up
>> compaction). So either pages are migrated or stay where they are during compaction.
>>
>> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
>> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
>>
>> --
>> Best Regards,
>> Yan, Zi


Best Regards,
Yan, Zi
  
Huang, Ying Oct. 10, 2023, 6:08 a.m. UTC | #12
Something is wrong with my mailbox.  Sorry if you received duplicate
mail.

Zi Yan <ziy@nvidia.com> writes:

> On 9 Oct 2023, at 3:12, Huang, Ying wrote:
>
>> Hi, Zi,
>>
>> Thanks for your patch!
>>
>> Zi Yan <zi.yan@sent.com> writes:
>>
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset enables >0 order folio memory compaction, which is one of
>>> the prerequisites for large folio support[1]. It is on top of
>>> mm-everything-2023-09-11-22-56.
>>>
>>> Overview
>>> ===
>>>
>>> To support >0 order folio compaction, the patchset changes how free pages used
>>> for migration are kept during compaction.
>>
>> migrate_pages() can split the large folio for allocation failure.  So
>> the minimal implementation could be
>>
>> - allow to migrate large folios in compaction
>> - return -ENOMEM for order > 0 in compaction_alloc()
>>
>> The performance may be not desirable.  But that may be a baseline for
>> further optimization.
>
> I would imagine it might cause a regression since compaction might gradually
> split high order folios in the system.

I would not call it a pure regression, since large folios can then be
migrated during compaction, but it is possible that this hurts
performance.

Anyway, this can be a not-so-good minimal baseline.
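
For illustration only, the baseline could be modelled in userspace C roughly as
below. This is a sketch under assumptions, not kernel code: baseline_alloc()
and baseline_migrate() are hypothetical names, free memory is reduced to a
count of order-0 pages, and "split the source folio on allocation failure" is
modelled as retrying each base page of the folio separately.

```c
#include <assert.h>

/* Hypothetical allocation callback: refuses every order > 0 request. */
static int baseline_alloc(int order, int *free_base_pages)
{
	if (order > 0 || *free_base_pages == 0)
		return 0;
	(*free_base_pages)--;
	return 1;
}

/*
 * Returns the number of base pages migrated. *nr_split counts how many
 * source folios had to be split because no same-order page was offered.
 */
static int baseline_migrate(const int *src_orders, int nr_src,
			    int free_base_pages, int *nr_split)
{
	int moved = 0;

	*nr_split = 0;
	for (int i = 0; i < nr_src; i++) {
		int nr_pages;

		assert(src_orders[i] >= 0);
		if (baseline_alloc(src_orders[i], &free_base_pages)) {
			moved++;	/* order-0 source, moved as is */
			continue;
		}
		if (src_orders[i] == 0)
			continue;	/* out of free pages */

		/* Large source folio: split it and move the pieces. */
		(*nr_split)++;
		nr_pages = 1 << src_orders[i];
		for (int p = 0; p < nr_pages; p++)
			moved += baseline_alloc(0, &free_base_pages);
	}
	return moved;
}
```

In this model every large source folio is split, which is the cost being
discussed here: the pages do get moved, but the large folios do not survive
compaction.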

> But I can move Patch 4 first to make this the baseline and see how
> system performance changes.

Thanks!

>>
>> And, if we can measure the performance for each step of optimization,
>> that will be even better.
>
> Do you have any benchmark in mind for the performance tests? vm-scalability?

I remember Mel Gorman has done some tests for defragmentation before.
But that's for order-0 pages.

>>> Free pages used to be split into
>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>> on their order without post allocation process. When migrate_pages() asks for
>>> a new page, one of the free pages, based on the requested page order, is
>>> then processed and given out.
>>>
>>>
>>> Optimizations
>>> ===
>>>
>>> 1. Free page split is added to increase migration success rate in case
>>> a source page does not have a matched free page in the free page lists.
>>> Free page merge is possible but not implemented, since existing
>>> PFN-based buddy page merge algorithm requires the identification of
>>> buddy pages, but free pages kept for memory compaction cannot have
>>> PageBuddy set to avoid confusing other PFN scanners.
>>>
>>> 2. Sort source pages in ascending order before migration is added to
>>
>> Trivial.
>>
>> s/ascending/descending/
>>
>>> reduce free page split. Otherwise, high order free pages might be
>>> prematurely split, causing undesired high order folio migration failures.
>>>
>>>
>>> TODOs
>>> ===
>>>
>>> 1. Refactor free page post allocation and free page preparation code so
>>> that compaction_alloc() and compaction_free() can call functions instead
>>> of hard coding.
>>>
>>> 2. One possible optimization is to allow migrate_pages() to continue
>>> even if get_new_folio() returns a NULL. In general, that means there is
>>> not enough memory. But in >0 order folio compaction case, that means
>>> there is no suitable free page at source page order. It might be better
>>> to skip that page and finish the rest of migration to achieve a better
>>> compaction result.
>>
>> We can split the source folio if get_new_folio() returns NULL.  So, do
>> we really need this?
>
> It depends. The situation it can benefit is that when the system is going
> to allocate a high order free page and trigger a compaction, it is possible to
> get the high order free page by migrating a bunch of base pages instead of
> splitting an existing high order folio.
>
>>
>> In general, we may reconsider all further optimizations given splitting
>> is available already.
>
> In my mind, split should be avoided as much as possible.

If so, should we use "nosplit" logic in migrate_pages_batch() in some
situations?

> But it really depends
> on the actual situation, e.g., how much effort and cost the compaction wants
> to pay to get memory defragmented. If the system really wants to get a high
> order free page at any cost, split can be used without any issue. But applications
> might lose performance because existing large folios are split just to get a
> new one.

Is it possible that splitting is desirable in some situations?  For
example, allocating some large DMA buffers at the cost of large anonymous
folios?

> Like I said in the email, there are tons of optimizations and policies for us
> to explore. We can start with the bare minimum support (if no performance
> regression is observed, we can even start with split all high folios like you
> suggested) and add optimizations one by one.

Sounds good to me!  Thanks!

>>
>>> 3. Another possible optimization is to enable free page merge. It is
>>> possible that a to-be-migrated page causes free page split then fails to
>>> migrate eventually. We would lose a high order free page without free
>>> page merge function. But a way of identifying free pages for memory
>>> compaction is needed to reuse existing PFN-based buddy page merge.
>>>
>>> 4. The implemented >0 order folio compaction algorithm is quite naive
>>> and does not consider all possible situations. A better algorithm can
>>> improve compaction success rate.
>>>
>>>
>>> Feel free to give comments and ask questions.
>>>
>>> Thanks.
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>>
>>> Zi Yan (4):
>>>   mm/compaction: add support for >0 order folio memory compaction.
>>>   mm/compaction: optimize >0 order folio compaction with free page
>>>     split.
>>>   mm/compaction: optimize >0 order folio compaction by sorting source
>>>     pages.
>>>   mm/compaction: enable compacting >0 order folios.
>>>
>>>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>>  mm/internal.h   |   7 +-
>>>  2 files changed, 176 insertions(+), 36 deletions(-)

--
Best Regards,
Huang, Ying
  
Ryan Roberts Oct. 10, 2023, 10 a.m. UTC | #13
On 09/10/2023 16:52, Zi Yan wrote:
> (resent as plain text)
> On 9 Oct 2023, at 10:10, Ryan Roberts wrote:
> 
>> On 09/10/2023 14:24, Zi Yan wrote:
>>> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>>>
>>>> Hi Zi,
>>>>
>>>> On 12/09/2023 17:28, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>>> the prerequisites for large folio support[1]. It is on top of
>>>>> mm-everything-2023-09-11-22-56.
>>>>
>>>> I've taken a quick look at these and realize I'm not well equipped to provide
>>>> much in the way of meaningful review comments; All I can say is thanks for
>>>> putting this together, and yes, I think it will become even more important for
>>>> my work on anonymous large folios.
>>>>
>>>>
>>>>>
>>>>> Overview
>>>>> ===
>>>>>
>>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>>> for migration are kept during compaction. Free pages used to be split into
>>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>>> on their order without post allocation process. When migrate_pages() asks for
>>>>> a new page, one of the free pages, based on the requested page order, is
>>>>> then processed and given out.
>>>>>
>>>>>
>>>>> Optimizations
>>>>> ===
>>>>>
>>>>> 1. Free page split is added to increase migration success rate in case
>>>>> a source page does not have a matched free page in the free page lists.
>>>>> Free page merge is possible but not implemented, since existing
>>>>> PFN-based buddy page merge algorithm requires the identification of
>>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>>
>>>>> 2. Sort source pages in ascending order before migration is added to
>>>>> reduce free page split. Otherwise, high order free pages might be
>>>>> prematurely split, causing undesired high order folio migration failures.
>>>>
>>>> Not knowing much about how compaction actually works, naively I would imagine
>>>> that if you are just trying to free up a known amount of contiguous physical
>>>> space, then working through the pages in PFN order is more likely to yield the
>>>> result quicker? Unless all of the pages in the set must be successfully migrated
>>>> in order to free up the required amount of space...
>>>
>>> During compaction, pages are not freed, since that is the job of page reclaim.
>>
>> Sorry yes - my fault for using sloppy language. When I said "free up a known
>> amount of contiguous physical space", I really meant "move pages in order to
>> recover an amount of contiguous physical space". But I still think the rest of
>> what I said applies; wouldn't you be more likely to reach your goal quicker if
>> you sort by PFN?
> 
> Not always. Suppose the in-use folios on the left are order-2, order-2, order-4
> (all contiguous in one pageblock) and the free pages on the right are order-4
> (pageblock N), order-2, order-2 (pageblock N-1). The free pages are not a single
> order-8, since there are in-use folios in the middle. Going in PFN order will not
> get you an order-8 free page, since the first order-4 free page will be split into
> two order-2 for the first two order-2 in-use folios. But if you migrate in
> descending order of in-use page orders, you can get an order-8 free page at the end.
> 
> The patchset minimizes free page splits to avoid the situation described above,
> since once a high order free page is split, the opportunity to migrate a high order
> in-use folio into it is gone and hard to recover.

OK I get it now - thanks!

> 
> 
>>> The goal of compaction is to get a high order free page without freeing existing
>>> pages to avoid potential high cost IO operations. If compaction does not work,
>>> page reclaim would free pages to get us there (and potentially another follow-up
>>> compaction). So either pages are migrated or stay where they are during compaction.
>>>
>>> BTW compaction works by scanning in use pages from lower PFN to higher PFN,
>>> and free pages from higher PFN to lower PFN until two scanners meet in the middle.
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
> 
> 
> Best Regards,
> Yan, Zi
  
Zi Yan Oct. 10, 2023, 4:48 p.m. UTC | #14
On 10 Oct 2023, at 2:08, Huang, Ying wrote:

> Something wrong with my mail box.  Sorry, if you received duplicated
> mail.
>
> Zi Yan <ziy@nvidia.com> writes:
>
>> On 9 Oct 2023, at 3:12, Huang, Ying wrote:
>>
>>> Hi, Zi,
>>>
>>> Thanks for your patch!
>>>
>>> Zi Yan <zi.yan@sent.com> writes:
>>>
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset enables >0 order folio memory compaction, which is one of
>>>> the prerequisites for large folio support[1]. It is on top of
>>>> mm-everything-2023-09-11-22-56.
>>>>
>>>> Overview
>>>> ===
>>>>
>>>> To support >0 order folio compaction, the patchset changes how free pages used
>>>> for migration are kept during compaction.
>>>
>>> migrate_pages() can split the large folio for allocation failure.  So
>>> the minimal implementation could be
>>>
>>> - allow to migrate large folios in compaction
>>> - return -ENOMEM for order > 0 in compaction_alloc()
>>>
>>> The performance may be not desirable.  But that may be a baseline for
>>> further optimization.
>>
>> I would imagine it might cause a regression since compaction might gradually
>> split high order folios in the system.
>
> I may not call it a pure regression, since large folio can be migrated
> during compaction with that, but it's possible that this hurts
> performance.
>
> Anyway, this can be a not-so-good minimal baseline.
>
>> But I can move Patch 4 first to make this the baseline and see how
>> system performance changes.
>
> Thanks!
>
>>>
>>> And, if we can measure the performance for each step of optimization,
>>> that will be even better.
>>
>> Do you have any benchmark in mind for the performance tests? vm-scalability?
>
> I remember Mel Gorman has done some tests for defragmentation before.
> But that's for order-0 pages.

OK, I will try to find that.

>
>>>> Free pages used to be split into
>>>> order-0 pages that are post allocation processed (i.e., PageBuddy flag cleared,
>>>> page order stored in page->private is zeroed, and page reference is set to 1).
>>>> Now all free pages are kept in a MAX_ORDER+1 array of page lists based
>>>> on their order without post allocation process. When migrate_pages() asks for
>>>> a new page, one of the free pages, based on the requested page order, is
>>>> then processed and given out.
>>>>
>>>>
>>>> Optimizations
>>>> ===
>>>>
>>>> 1. Free page split is added to increase migration success rate in case
>>>> a source page does not have a matched free page in the free page lists.
>>>> Free page merge is possible but not implemented, since existing
>>>> PFN-based buddy page merge algorithm requires the identification of
>>>> buddy pages, but free pages kept for memory compaction cannot have
>>>> PageBuddy set to avoid confusing other PFN scanners.
>>>>
>>>> 2. Sort source pages in ascending order before migration is added to
>>>
>>> Trivial.
>>>
>>> s/ascending/descending/
>>>
>>>> reduce free page split. Otherwise, high order free pages might be
>>>> prematurely split, causing undesired high order folio migration failures.
>>>>
>>>>
>>>> TODOs
>>>> ===
>>>>
>>>> 1. Refactor free page post allocation and free page preparation code so
>>>> that compaction_alloc() and compaction_free() can call functions instead
>>>> of hard coding.
>>>>
>>>> 2. One possible optimization is to allow migrate_pages() to continue
>>>> even if get_new_folio() returns a NULL. In general, that means there is
>>>> not enough memory. But in >0 order folio compaction case, that means
>>>> there is no suitable free page at source page order. It might be better
>>>> to skip that page and finish the rest of migration to achieve a better
>>>> compaction result.
>>>
>>> We can split the source folio if get_new_folio() returns NULL.  So, do
>>> we really need this?
>>
>> It depends. The situation it can benefit is that when the system is going
>> to allocate a high order free page and trigger a compaction, it is possible to
>> get the high order free page by migrating a bunch of base pages instead of
>> splitting an existing high order folio.
>>
>>>
>>> In general, we may reconsider all further optimizations given splitting
>>> is available already.
>>
>> In my mind, split should be avoided as much as possible.
>
> If so, should we use "nosplit" logic in migrate_pages_batch() in some
> situation?

A possible future optimization.

>
>> But it really depends
>> on the actual situation, e.g., how much effort and cost the compaction wants
>> to pay to get memory defragmented. If the system really wants to get a high
>> order free page at any cost, split can be used without any issue. But applications
>> might lose performance because existing large folios are split just to get a
>> new one.
>
> Is it possible that splitting is desirable in some situation?  For
> example, allocate some large DMA buffers at the cost of large anonymous
> folios?

Sure. There are definitely cases where split is better than non-split. But let's
leave that until large anonymous folios are deployed.

>
>> Like I said in the email, there are tons of optimizations and policies for us
>> to explore. We can start with the bare minimum support (if no performance
>> regression is observed, we can even start with split all high folios like you
>> suggested) and add optimizations one by one.
>
> Sound good to me!  Thanks!
>
>>>
>>>> 3. Another possible optimization is to enable free page merge. It is
>>>> possible that a to-be-migrated page causes free page split then fails to
>>>> migrate eventually. We would lose a high order free page without free
>>>> page merge function. But a way of identifying free pages for memory
>>>> compaction is needed to reuse existing PFN-based buddy page merge.
>>>>
>>>> 4. The implemented >0 order folio compaction algorithm is quite naive
>>>> and does not consider all possible situations. A better algorithm can
>>>> improve compaction success rate.
>>>>
>>>>
>>>> Feel free to give comments and ask questions.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
>>>>
>>>> Zi Yan (4):
>>>>   mm/compaction: add support for >0 order folio memory compaction.
>>>>   mm/compaction: optimize >0 order folio compaction with free page
>>>>     split.
>>>>   mm/compaction: optimize >0 order folio compaction by sorting source
>>>>     pages.
>>>>   mm/compaction: enable compacting >0 order folios.
>>>>
>>>>  mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
>>>>  mm/internal.h   |   7 +-
>>>>  2 files changed, 176 insertions(+), 36 deletions(-)
>
> --
> Best Regards,
> Huang, Ying


--
Best Regards,
Yan, Zi