[v2,0/5] variable-order, large folios for anonymous memory

Message ID 20230703135330.1865927-1-ryan.roberts@arm.com
Series: variable-order, large folios for anonymous memory

Ryan Roberts July 3, 2023, 1:53 p.m. UTC
Hi All,

This is v2 of a series to implement variable order, large folios for anonymous
memory. The objective of this is to improve performance by allocating larger
chunks of memory during anonymous page faults. See [1] for background.

I've significantly reworked and simplified the patch set based on comments from
Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
FLEXIBLE_THP, on Yu's advice.

The last patch is for arm64 to explicitly override the default
arch_wants_pte_order() and is intended as an example. If this series is
accepted, I suggest taking the first 4 patches through the mm tree; the arm64
change could then be handled through the arm64 tree separately. Neither has
any build dependency on the other.
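
For illustration, the override has roughly the following shape (a simplified
sketch only, not the code in the patch; see the actual patch for the real
signature and details):

#define arch_wants_pte_order arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
        /*
         * Prefer folios spanning one contiguous-PTE block: with 4K base
         * pages, CONT_PTE_SHIFT - PAGE_SHIFT = 4, i.e. 64K folios.
         */
        return CONT_PTE_SHIFT - PAGE_SHIFT;
}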

The one area where I haven't followed Yu's advice is in the determination of the
size of folio to use. It was suggested that I have a single preferred large
order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
being existing overlapping populated PTEs, etc.) then fall back immediately to
order-0. It turned out that this approach caused a performance regression in the
Speedometer benchmark. With my v1 patch, there were significant quantities of
memory which could not be placed in the 64K bucket and were instead being
allocated for the 32K and 16K buckets. With the proposed simplification, that
memory ended up using the 4K bucket, so page faults increased by 2.75x compared
to the v1 patch (although due to the 64K bucket, this number is still a bit
lower than the baseline). So instead, I continue to calculate a folio order that
is somewhere between the preferred order and 0. (See below for more details).
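
To illustrate, the calculation is roughly as follows (a simplified sketch of
the idea rather than the code in the patch; the real version must also check
for already-populated PTEs before settling on an order):

static int anon_folio_order(struct vm_area_struct *vma, unsigned long addr,
                            int preferred_order)
{
        int order;

        /*
         * Walk down from the preferred order until the naturally-aligned
         * folio around the faulting address fits entirely within the VMA.
         */
        for (order = preferred_order; order > 0; order--) {
                unsigned long start = addr & ~((PAGE_SIZE << order) - 1);
                unsigned long end = start + (PAGE_SIZE << order);

                if (start >= vma->vm_start && end <= vma->vm_end)
                        return order;
        }

        return 0;       /* fall back to a single 4K page */
}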

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
[2], which is a hard dependency. I have a branch at [3].


Changes since v1 [1]
--------------------

  - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
  - replaced with arch-independent alloc_anon_folio()
      - follows THP allocation approach
  - no longer retry with intermediate orders if allocation fails
      - fallback directly to order-0
  - remove folio_add_new_anon_rmap_range() patch
      - instead add its new functionality to folio_add_new_anon_rmap()
  - remove batch-zap pte mappings optimization patch
      - remove enabler folio_remove_rmap_range() patch too
      - these offer a real perf improvement, so I will submit them separately
  - simplify Kconfig
      - single FLEXIBLE_THP option, which is independent of arch
      - depends on TRANSPARENT_HUGEPAGE
      - when enabled, default to a max anon folio size of 64K unless the arch
        explicitly overrides
  - simplify changes to do_anonymous_page():
      - no more retry loop


Performance
-----------

The results below cover 3 benchmarks: kernel compilation with 8 jobs, kernel
compilation with 80 jobs, and Speedometer 2.0 (a JavaScript benchmark running
in Chromium). All cases are running on Ampere Altra with 1 NUMA node enabled,
Ubuntu 22.04 and an XFS filesystem. Each benchmark is repeated 15 times over 5
reboots and averaged.

'anonfolio-lkml-v1' is the v1 patchset at [1]. 'anonfolio-lkml-v2' is this v2
patchset. 'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2 but with the
order selection simplification that Yu Zhao suggested - I'm trying to justify
here why I did not follow the advice.


Kernel compilation with 8 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -5.3% |      -42.9% |       -0.6% |
| anonfolio-lkml-v2-simple-order |       -4.4% |      -36.5% |       -0.4% |
| anonfolio-lkml-v2              |       -4.8% |      -38.6% |       -0.6% |

We can see that the simple-order approach costs 0.4% of real-time performance
relative to v2's full order selection.


Kernel compilation with 80 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -4.6% |      -45.7% |        1.4% |
| anonfolio-lkml-v2-simple-order |       -4.7% |      -40.2% |       -0.1% |
| anonfolio-lkml-v2              |       -5.0% |      -42.6% |       -0.3% |

simple-order costs 0.3% here. v2 actually performs better than v1 on real-time
due to fixing the v1 regression on user-time.


Speedometer 2.0:

| kernel                         |   runs_per_min |
|:-------------------------------|---------------:|
| baseline-4k                    |           0.0% |
| anonfolio-lkml-v1              |           0.7% |
| anonfolio-lkml-v2-simple-order |          -0.9% |
| anonfolio-lkml-v2              |           0.5% |

simple-order regresses performance by 0.9% vs the baseline, for a total negative
swing of 1.6% vs v1. This is fixed by keeping the more complex order selection
mechanism from v1.


The remaining (kernel time) performance gap between v1 and v2 for the above
benchmarks is due to the removal of the "batch zap" patch in v2. Adding that
back in recovers the performance. I intend to submit it as a separate
series once this series is accepted.


[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2

Thanks,
Ryan


Ryan Roberts (5):
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Default implementation of arch_wants_pte_order()
  mm: FLEXIBLE_THP for improved performance
  arm64: mm: Override arch_wants_pte_order()

 arch/arm64/Kconfig               |  12 +++
 arch/arm64/include/asm/pgtable.h |   4 +
 arch/arm64/mm/mmu.c              |   8 ++
 include/linux/pgtable.h          |  13 +++
 mm/Kconfig                       |  10 ++
 mm/memory.c                      | 168 ++++++++++++++++++++++++++++---
 mm/rmap.c                        |  28 ++++--
 7 files changed, 222 insertions(+), 21 deletions(-)

--
2.25.1
  

Comments

Yu Zhao July 4, 2023, 2:18 a.m. UTC | #1
On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.

Thanks for the quick response!

> [...]
>
> The one area where I haven't followed Yu's advice is in the determination of the
> size of folio to use. It was suggested that I have a single preferred large
> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> being existing overlapping populated PTEs, etc.) then fall back immediately to
> order-0. It turned out that this approach caused a performance regression in the
> Speedometer benchmark.

I suppose it's a regression against v1, not the unpatched kernel.

> With my v1 patch, there were significant quantities of
> memory which could not be placed in the 64K bucket and were instead being
> allocated for the 32K and 16K buckets. With the proposed simplification, that
> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
> to the v1 patch (although due to the 64K bucket, this number is still a bit
> lower than the baseline). So instead, I continue to calculate a folio order that
> is somewhere between the preferred order and 0. (See below for more details).

I suppose the benchmark wasn't running under memory pressure, which is
uncommon for client devices. It could easily be the other way around: using
32/16KB shows a regression whereas order-0 shows better performance under
memory pressure.

I'm not sure we should use v1 as the baseline. Unpatched kernel sounds
more reasonable at this point. If 32/16KB is proven to be better in
most scenarios including under memory pressure, we can reintroduce
that policy. I highly doubt this is the case: we tried 16KB base page
size on client devices, and overall, the regressions outweigh the
benefits.

> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I have a branch at [3].

It's not clear to me why [2] is a hard dependency.

It seems to me we are getting close and I was hoping we could get into
mm-unstable soon without depending on other series...
  
Yin Fengwei July 4, 2023, 6:22 a.m. UTC | #2
On 7/4/2023 10:18 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> [...]
> 
> I suppose it's a regression against v1, not the unpatched kernel.
From the performance data Ryan shared, it's against the unpatched kernel:

Speedometer 2.0:

| kernel                         |   runs_per_min |
|:-------------------------------|---------------:|
| baseline-4k                    |           0.0% |
| anonfolio-lkml-v1              |           0.7% |
| anonfolio-lkml-v2-simple-order |          -0.9% |
| anonfolio-lkml-v2              |           0.5% |


What if we use 32K or 16K instead of 64K as the default anonymous folio size?
I suspect this app's sweet spot may be 32K or 16K anon folios.


Regards
Yin, Fengwei

  
Yu Zhao July 4, 2023, 7:11 a.m. UTC | #3
On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> > [...]
> >
> > I suppose it's a regression against v1, not the unpatched kernel.
> From the performance data Ryan shared, it's against the unpatched kernel:
>
> Speedometer 2.0:
>
> | kernel                         |   runs_per_min |
> |:-------------------------------|---------------:|
> | baseline-4k                    |           0.0% |
> | anonfolio-lkml-v1              |           0.7% |
> | anonfolio-lkml-v2-simple-order |          -0.9% |
> | anonfolio-lkml-v2              |           0.5% |

I see. Thanks.

A couple of questions:
1. Do we have a stddev?
2. Do we have a theory why it regressed?
Assuming no bugs, I don't see how a real regression could happen --
falling back to order-0 isn't different from the original behavior.
Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
  
Ryan Roberts July 4, 2023, 3:36 p.m. UTC | #4
On 04/07/2023 08:11, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>> [...]
> 
> I see. Thanks.
> 
> A couple of questions:
> 1. Do we have a stddev?

| kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
|:------------------------- |-----------:|----------:|-----------:|----------:|
| baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
| anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
| anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
| anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |

This is with 3 runs per reboot across 5 reboots, with first run after reboot
trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
points per kernel in total.

I've rerun the test multiple times and see similar results each time.

I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
see the same performance as baseline-4k.


> 2. Do we have a theory why it regressed?

I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
mean when we fault, order-4 is often too big to fit in the VMA. So we fall back
to order-0. I guess this is happening so often for this workload that the cost
of doing the checks and fallback is outweighing the benefit of the memory that
does end up with order-4 folios.

I've sampled the memory in each bucket (once per second) while running and it's
roughly:

64K: 25%
32K: 15%
16K: 15%
4K: 45%

32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
But potentially, I suspect there is lots of mmap/munmap for the smaller sizes
while the 64K content is more static - that's just a guess though.

> Assuming no bugs, I don't see how a real regression could happen --
> falling back to order-0 isn't different from the original behavior.
> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?

I can, but it will have to be a bit later in the week. I'll do some more test
runs overnight so we have a larger number of runs - hopefully that might tell us
that this is noise to a certain extent.

I'd still like to hear a clear technical argument for why the bin-packing
approach is not the correct one!

Thanks,
Ryan
  
Yin Fengwei July 4, 2023, 11:52 p.m. UTC | #5
On 7/4/23 23:36, Ryan Roberts wrote:
> On 04/07/2023 08:11, Yu Zhao wrote:
>> [...]
>> 2. Do we have a theory why it regressed?
> 
> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> mean when we fault, order-4 is often too big to fit in the VMA. So we fall back
> to order-0. I guess this is happening so often for this workload that the cost
> of doing the checks and fallback is outweighing the benefit of the memory that
> does end up with order-4 folios.
> 
> I've sampled the memory in each bucket (once per second) while running and it's
> roughly:
> 
> 64K: 25%
> 32K: 15%
> 16K: 15%
> 4K: 45%
> 
> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> But potentially, I suspect there is lots of mmap/munmap for the smaller sizes
> while the 64K content is more static - that's just a guess though.
So this is the out-of-VMA-range thing.

> 
> [...]
> I'd still like to hear a clear technical argument for why the bin-packing
> approach is not the correct one!
My understanding of Yu's comments (Yu, correct me if I am wrong) is that we
postpone this part of the change and get basic anon large folio support in
first. Then we can discuss which approach to take. Maybe people will agree the
retry is the right choice, maybe another approach will be taken...

For example, for this out-of-VMA-range case, a per-VMA order should be
considered. We don't need to decide now that the retry approach is the one to
take.


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
>
  
Yu Zhao July 5, 2023, 12:21 a.m. UTC | #6
On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> [...]
> > I'd still like to hear a clear technical argument for why the bin-packing
> > approach is not the correct one!
> My understanding of Yu's comments (Yu, correct me if I am wrong) is that we
> postpone this part of the change and get basic anon large folio support in
> first. Then we can discuss which approach to take. Maybe people will agree
> the retry is the right choice, maybe another approach will be taken...
>
> For example, for this out-of-VMA-range case, a per-VMA order should be
> considered. We don't need to decide now that the retry approach is the one
> to take.

I've articulated the reasons in another email, so I'll just summarize the most
important point here:
using more fallback orders makes a system reach equilibrium faster, at
which point it can't allocate the order of arch_wants_pte_order()
anymore. IOW, this best-fit policy can reduce the number of folios of
the h/w preferred order for a system that has been running long enough.
  
Ryan Roberts July 5, 2023, 10:16 a.m. UTC | #7
On 05/07/2023 01:21, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>> [...]
> 
> I've articulated the reasons in another email, so I'll just summarize the most
> important point here:
> using more fallback orders makes a system reach equilibrium faster, at
> which point it can't allocate the order of arch_wants_pte_order()
> anymore. IOW, this best-fit policy can reduce the number of folios of
> the h/w preferred order for a system that has been running long enough.

Thanks for taking the time to write all the arguments down. I understand what
you are saying. If we are considering the whole system, then we also need to
think about the page cache though, and that will allocate multiple orders, so
you are still going to suffer fragmentation from that user.

That said, I like the proposal patch posted where we have up to 3 orders that we
try in order of preference; hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That
feels like a good compromise that allows me to fulfil my objectives. I'm going
to pull this together into a v3 patch set and aim to post towards the end of the
week.
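
For concreteness, the shape I have in mind is roughly this (a sketch only;
fits_vma() is a made-up stand-in for the VMA-bounds and populated-PTE checks):

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        int preferred = arch_wants_pte_order();
        int orders[] = { preferred, PAGE_ALLOC_COSTLY_ORDER, 0 };
        int i;

        for (i = 0; i < ARRAY_SIZE(orders); i++) {
                int order = orders[i];
                unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
                struct folio *folio;

                /* skip orders above the preference or that don't fit */
                if (order > preferred || !fits_vma(vma, addr, order))
                        continue;

                folio = vma_alloc_folio(GFP_TRANSHUGE_LIGHT, order, vma,
                                        addr, true);
                if (folio)
                        return folio;
        }

        return NULL;
}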

Are you ok for me to add a Suggested-by: for you? (submitting-patches.rst says I
need your explicit permission).

On the regression front, I've done a much bigger test run and see the regression
is still present (although the mean has shifted a little bit). I've also built a
kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns
order-3. The aim was to test your hypothesis that 64K allocation is slow. This
kernel is performing even better, so I think that confirms your hypothesis:

| kernel                         |   runs_per_min |   runs |   sessions |
|:-------------------------------|---------------:|-------:|-----------:|
| baseline-4k                    |           0.0% |     75 |         15 |
| anonfolio-lkml-v1              |           1.0% |     75 |         15 |
| anonfolio-lkml-v2-simple-order |          -0.4% |     75 |         15 |
| anonfolio-lkml-v2              |           0.9% |     75 |         15 |
| anonfolio-lkml-v2-32k          |           1.4% |     10 |          5 |

Thanks,
Ryan
  
Yu Zhao July 5, 2023, 7 p.m. UTC | #8
On Wed, Jul 5, 2023 at 4:16 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/07/2023 01:21, Yu Zhao wrote:
> > On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 7/4/23 23:36, Ryan Roberts wrote:
> >>> On 04/07/2023 08:11, Yu Zhao wrote:
> >>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>
> >>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> This is v2 of a series to implement variable order, large folios for anonymous
> >>>>>>> memory. The objective of this is to improve performance by allocating larger
> >>>>>>> chunks of memory during anonymous page faults. See [1] for background.
> >>>>>>
> >>>>>> Thanks for the quick response!
> >>>>>>
> >>>>>>> I've significantly reworked and simplified the patch set based on comments from
> >>>>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> >>>>>>> VARIABLE_THP, on Yu's advice.
> >>>>>>>
> >>>>>>> The last patch is for arm64 to explicitly override the default
> >>>>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
> >>>>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
> >>>>>>> could be handled through the arm64 tree separately. Neither has any build
> >>>>>>> dependency on the other.
> >>>>>>>
> >>>>>>> The one area where I haven't followed Yu's advice is in the determination of the
> >>>>>>> size of folio to use. It was suggested that I have a single preferred large
> >>>>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> >>>>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
> >>>>>>> order-0. It turned out that this approach caused a performance regression in the
> >>>>>>> Speedometer benchmark.
> >>>>>>
> >>>>>> I suppose it's regression against the v1, not the unpatched kernel.
> >>>>> From the performance data Ryan shared, it's against unpatched kernel:
> >>>>>
> >>>>> Speedometer 2.0:
> >>>>>
> >>>>> | kernel                         |   runs_per_min |
> >>>>> |:-------------------------------|---------------:|
> >>>>> | baseline-4k                    |           0.0% |
> >>>>> | anonfolio-lkml-v1              |           0.7% |
> >>>>> | anonfolio-lkml-v2-simple-order |          -0.9% |
> >>>>> | anonfolio-lkml-v2              |           0.5% |
> >>>>
> >>>> I see. Thanks.
> >>>>
> >>>> A couple of questions:
> >>>> 1. Do we have a stddev?
> >>>
> >>> | kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
> >>> |:------------------------- |-----------:|----------:|-----------:|----------:|
> >>> | baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
> >>> | anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
> >>> | anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
> >>> | anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |
> >>>
> >>> This is with 3 runs per reboot across 5 reboots, with first run after reboot
> >>> trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
> >>> points per kernel in total.
> >>>
> >>> I've rerun the test multiple times and see similar results each time.
> >>>
> >>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
> >>> see the same performance as baseline-4k.
> >>>
> >>>
> >>>> 2. Do we have a theory why it regressed?
> >>>
> >>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> >>> mean when we fault, order-4 is often too big to fit in the VMA. So we fallback
> >>> to order-0. I guess this is happening so often for this workload that the cost
> >>> of doing the checks and fallback is outweighing the benefit of the memory that
> >>> does end up with order-4 folios.
> >>>
> >>> I've sampled the memory in each bucket (once per second) while running and its
> >>> roughly:
> >>>
> >>> 64K: 25%
> >>> 32K: 15%
> >>> 16K: 15%
> >>> 4K: 45%
> >>>
> >>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> >>> But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and
> >>> the 64K contents is more static - that's just a guess though.
> >> So this is like out of vma range thing.
> >>
> >>>
> >>>> Assuming no bugs, I don't see how a real regression could happen --
> >>>> falling back to order-0 isn't different from the original behavior.
> >>>> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >>>
> >>> I can, but it will have to be a bit later in the week. I'll do some more test
> >>> runs overnight so we have a larger number of runs - hopefully that might tell us
> >>> that this is noise to a certain extent.
> >>>
> >>> I'd still like to hear a clear technical argument for why the bin-packing
> >>> approach is not the correct one!
> >> My understanding to Yu's (Yu, correct me if I am wrong) comments is that we
> >> postpone this part of change and make basic anon large folio support in. Then
> >> discuss which approach we should take. Maybe people will agree retry is the
> >> choice, maybe other approach will be taken...
> >>
> >> For example, for this out of VMA range case, per VMA order should be considered.
> >> We don't need make decision that the retry should be taken now.
> >
> > I've articulated the reasons in another email. Just summarize the most
> > important point here:
> > using more fallback orders makes a system reach equilibrium faster, at
> > which point it can't allocate the order of arch_wants_pte_order()
> > anymore. IOW, this best-fit policy can reduce the number of folios of
> > the h/w prefered order for a system running long enough.
>
> Thanks for taking the time to write all the arguments down. I understand what
> you are saying. If we are considering the whole system, then we also need to
> think about the page cache though, and that will allocate multiple orders, so
> you are still going to suffer fragmentation from that user.

1. page cache doesn't use the best-fit policy -- it has the advantage
of having RA hit/miss numbers -- IOW, it doesn't try all orders
without an estimated ROI.
2. page cache causes far less fragmentation in my experience: clean
page cache gets reclaimed first under memory pressure; unmapped page cache is
less costly to migrate. Neither is true for anon, and what makes it
worse is that heavy anon users usually enable zram/zswap: allocating
memory (to store compressed data) under memory pressure makes
reclaim/compaction even harder.

> That said, I like the proposal patch posted where we have up to 3 orders that we
> try in order of preference; hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That
> feels like a good compromise that allows me to fulfil my objectives. I'm going
> to pull this together into a v3 patch set and aim to post towards the end of the
> week.
>
> Are you ok for me to add a Suggested-by: for you? (submitting-patches.rst says I
> need your explicit permission).

Thanks for asking. No need to worry about it -- it's been great teamwork
with you, Fengwei, Yang et al.

I'm attaching a single patch containing all pieces I spelled
out/implied/forgot to mention. It doesn't depend on other series -- I
just stress-tested it on top of the latest mm-unstable. Please feel
free to reuse any bits you see fit. Again no need to worry about
Suggested-by.

> On the regression front, I've done a much bigger test run and see the regression
> is still present (although the mean has shifted a little bit). I've also built a
> kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns
> order-3. The aim was to test your hypothesis that 64K allocation is slow. This
> kernel is performing even better, so I think that confirms your hypothesis:

Great, thanks for confirming.

> | kernel                         |   runs_per_min |   runs |   sessions |
> |:-------------------------------|---------------:|-------:|-----------:|
> | baseline-4k                    |           0.0% |     75 |         15 |
> | anonfolio-lkml-v1              |           1.0% |     75 |         15 |
> | anonfolio-lkml-v2-simple-order |          -0.4% |     75 |         15 |
> | anonfolio-lkml-v2              |           0.9% |     75 |         15 |
> | anonfolio-lkml-v2-32k          |           1.4% |     10 |          5 |

Since we are all committed to the effort long term, the last number is
good enough for the initial step to conclude. Hopefully v3 can address
all pending comments and get into mm-unstable.
  
David Hildenbrand July 5, 2023, 7:38 p.m. UTC | #9
On 03.07.23 15:53, Ryan Roberts wrote:
> Hi All,
> 
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.
> 
> [...]

Hi Ryan,

is page migration already working as expected (and what about page
compaction?), and do we handle migration -ENOMEM when allocating a target
page: do we split and fall back to 4k page migration?
  
Ryan Roberts July 6, 2023, 8:02 a.m. UTC | #10
On 05/07/2023 20:38, David Hildenbrand wrote:
> On 03.07.23 15:53, Ryan Roberts wrote:
>> Hi All,
>>
>> This is v2 of a series to implement variable order, large folios for anonymous
>> memory. The objective of this is to improve performance by allocating larger
>> chunks of memory during anonymous page faults. See [1] for background.
>>

[...]

>> Thanks,
>> Ryan
> 
> Hi Ryan,
> 
> is page migration already working as expected (and what about page
> compaction?), and do we handle migration -ENOMEM when allocating a target
> page: do we split and fall back to 4k page migration?
> 

Hi David, All,

This series aims to be the bare minimum to demonstrate allocation of large anon
folios. As such, there is a laundry list of things that need to be done for this
feature to play nicely with other features. My preferred route is to merge this
with its Kconfig defaulted to disabled, and its Kconfig description clearly
shouting that it's EXPERIMENTAL with an explanation of why (similar to
READ_ONLY_THP_FOR_FS).

That said, I've put together a table of the items that I'm aware of that need
attention. It would be great if people can review and add any missing items.
Then we can hopefully parallelize the implementation work. David, I don't think
the items you raised are covered - would you mind providing a bit more detail so
I can add them to the list? (or just add them to the list yourself, if you prefer).

---

- item:
    mlock

  description: >-
    Large, pte-mapped folios are ignored when mlock is requested. Code comment
    for mlock_vma_folio() says "...filter out pte mappings of THPs, which
    cannot be consistently counted: a pte mapping of the THP head cannot be
    distinguished by the page alone." (A rough sketch of this filter follows
    the list.)

  location:
    - mlock_pte_range()
    - mlock_vma_folio()

  assignee:
    Yin, Fengwei


- item:
    numa balancing

  description: >-
    Large, pte-mapped folios are ignored by numa-balancing code. Commit
    comment (e81c480): "We're going to have THP mapped with PTEs. It will
    confuse numabalancing. Let's skip them for now."

  location:
    - do_numa_page()

  assignee:
    <none>


- item:
    madvise

  description: >-
    MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes
    exclusive only if mapcount==1, else skips remainder of operation. For
    large, pte-mapped folios, exclusive folios can have mapcount up to nr_pages
    and still be exclusive. Even better: don't split the folio if it fits
    entirely within the range? Discussion at

https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/
    talks about changing folio mapcounting - may help determine if exclusive
    without pgtable scan?

  location:
    - madvise_cold_or_pageout_pte_range()
    - madvise_free_pte_range()

  assignee:
    <none>


- item:
    shrink_folio_list

  description: >-
    Raised by Yu Zhao; I can't see the problem in the code - need
    clarification

  location:
    - shrink_folio_list()

  assignee:
    <none>


- item:
    compaction

  description: >-
    Raised at LSFMM: Compaction skips non-order-0 pages. Already a problem for
    page-cache pages today. Is my understanding correct?

  location:
    - <where?>

  assignee:
    <none>
---
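
For reference, the mlock filter mentioned in the first item looks roughly like
this (paraphrased from memory; check mm/internal.h for the exact code):

static inline void mlock_vma_folio(struct folio *folio,
                                   struct vm_area_struct *vma, bool compound)
{
        /*
         * Pte mappings of a large folio (compound == false) are filtered
         * out here, which is why large anon folios currently escape mlock
         * accounting.
         */
        if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
            (compound || !folio_test_large(folio)))
                mlock_folio(folio);
}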

Thanks,
Ryan
  
David Hildenbrand July 7, 2023, 11:40 a.m. UTC | #11
On 06.07.23 10:02, Ryan Roberts wrote:
> On 05/07/2023 20:38, David Hildenbrand wrote:
>> [...]
>> Hi Ryan,
>>
>> is page migration already working as expected (and what about page
>> compaction?), and do we handle migration -ENOMEM when allocating a target
>> page: do we split and fall back to 4k page migration?
>>
> 
> Hi David, All,

Hi Ryan,

thanks a lot for the list.

But can you comment on the page migration part (IOW did you try it already)?

For example, memory hotunplug, CMA, MCE handling, compaction all rely on 
page migration of something that was allocated using GFP_MOVABLE to 
actually work.

Compaction seems to skip any higher-order folios, but the question is whether
the underlying migration itself works.

If it already works: great! If not, this really has to be tackled early, 
because otherwise we'll be breaking the GFP_MOVABLE semantics.

> 
> This series aims to be the bare minimum to demonstrate allocation of large anon
> folios. As such, there is a laundry list of things that need to be done for this
> feature to play nicely with other features. My preferred route is to merge this
> with it's Kconfig defaulted to disabled, and its Kconfig description clearly
> shouting that it's EXPERIMENTAL with an explanation of why (similar to
> READ_ONLY_THP_FOR_FS).
As long as we are not sure about the user space control, and as long as
basic functionality is not working (for example, page migration), I would
tend not to merge this early just for the sake of it.

But yes, something like mlock can eventually be tackled later: as long 
as there is a runtime interface to disable it ;)

> 
> That said, I've put together a table of the items that I'm aware of that need
> attention. It would be great if people can review and add any missing items.
> Then we can hopefully parallelize the implementation work. David, I don't think
> the items you raised are covered - would you mind providing a bit more detail so
> I can add them to the list? (or just add them to the list yourself, if you prefer).
> 
> ---
> 
> - item:
>      mlock
> 
>    description: >-
>      Large, pte-mapped folios are ignored when mlock is requested. Code comment
>      for mlock_vma_folio() says "...filter out pte mappings of THPs, which
>      cannot be consistently counted: a pte mapping of the THP head cannot be
>      distinguished by the page alone."
> 
>    location:
>      - mlock_pte_range()
>      - mlock_vma_folio()
> 
>    assignee:
>      Yin, Fengwei
> 
> 
> - item:
>      numa balancing
> 
>    description: >-
>      Large, pte-mapped folios are ignored by numa-balancing code. Commit
>      comment (e81c480): "We're going to have THP mapped with PTEs. It will
>      confuse numabalancing. Let's skip them for now."
> 
>    location:
>      - do_numa_page()
> 
>    assignee:
>      <none>
> 
> 
> - item:
>      madvise
> 
>    description: >-
>      MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, the code assumes
>      the folio is exclusive only if mapcount==1, else skips the remainder of
>      the operation. But a large, pte-mapped folio can have mapcount up to
>      nr_pages and still be exclusive. Even better: don't split the folio if
>      it fits entirely within the range? Discussion at
>      https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/
>      talks about changing folio mapcounting - may help determine if exclusive
>      without a pgtable scan?
> 
>    location:
>      - madvise_cold_or_pageout_pte_range()
>      - madvise_free_pte_range()
> 
>    assignee:
>      <none>
> 
> 
> - item:
>      shrink_folio_list
> 
>    description: >-
>      Raised by Yu Zhao; I can't see the problem in the code - need
>      clarification
> 
>    location:
>      - shrink_folio_list()
> 
>    assignee:
>      <none>
> 
> 
> - item:
>      compaction
> 
>    description: >-
>      Raised at LSFMM: Compaction skips non-order-0 pages. Already a problem for
>      page-cache pages today. Is my understanding correct?
> 
>    location:
>      - <where?>
> 
>    assignee:
>      <none>

I'm still thinking about the whole mapcount thingy (and I burned way too 
much time on that yesterday), which is a big item for such a list and 
affects some of these items.

A pagetable scan is pretty much irrelevant for order-2 pages. But once 
we're talking about higher orders we really don't want to do that.

I'm preparing a writeup with users and challenges.


Is swapping working as expected? zswap?
  
Matthew Wilcox July 7, 2023, 1:12 p.m. UTC | #12
On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
> On 06.07.23 10:02, Ryan Roberts wrote:
> But can you comment on the page migration part (IOW did you try it already)?
> 
> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
> page migration of something that was allocated using GFP_MOVABLE to actually
> work.
> 
> Compaction seems to skip any higher-order folios, but the question is if the
> underlying migration itself works.
> 
> If it already works: great! If not, this really has to be tackled early,
> because otherwise we'll be breaking the GFP_MOVABLE semantics.

I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
is not.

If you look at a function like folio_migrate_mapping(), it all seems
appropriately folio-ised.  There might be something in there that is
slightly wrong, but that would just be a bug to fix, not a huge
architectural problem.

The problem comes in the callers of migrate_pages().  They pass a
new_folio_t callback.  alloc_migration_target() is the usual one passed
and as far as I can tell is fine.  I've seen no problems reported with it.
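
It allocates the replacement at the same order as the source; roughly this
(simplified from-memory sketch; the hugetlb path and the highmem/movable
zone handling are omitted):

struct folio *alloc_migration_target(struct folio *src, unsigned long private)
{
	struct migration_target_control *mtc;
	gfp_t gfp_mask;
	unsigned int order = 0;
	int nid;

	mtc = (struct migration_target_control *)private;
	gfp_mask = mtc->gfp_mask;
	nid = mtc->nid;
	if (nid == NUMA_NO_NODE)
		nid = folio_nid(src);

	if (folio_test_large(src)) {
		/* Allocate at the source's order, as for a fresh THP. */
		gfp_mask &= ~__GFP_RECLAIM;
		gfp_mask |= GFP_TRANSHUGE;
		order = folio_order(src);
	}

	return __folio_alloc(gfp_mask, order, nid, mtc->nmask);
}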

compaction_alloc() is a disaster, and I don't know how to fix it.
The compaction code has its own allocator which is populated with order-0
folios.  How it populates that freelist is awful ... see split_map_pages()
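
That freelist is built by splitting everything isolate_freepages() grabbed
down to order-0; roughly this (simplified from-memory sketch):

static void split_map_pages(struct list_head *list)
{
	unsigned int i, order, nr_pages;
	struct page *page, *next;
	LIST_HEAD(tmp_list);

	list_for_each_entry_safe(page, next, list, lru) {
		list_del(&page->lru);

		order = page_private(page);
		nr_pages = 1 << order;

		post_alloc_hook(page, order, __GFP_MOVABLE);
		if (order)
			split_page(page, order);	/* everything becomes order-0 */

		for (i = 0; i < nr_pages; i++) {
			list_add(&page->lru, &tmp_list);
			page++;
		}
	}

	list_splice(&tmp_list, list);
}

compaction_alloc() then pops a single order-0 page off that list per source
page, so there is never anything suitable to hand back for a large folio.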

> Is swapping working as expected? zswap?

Suboptimally.  Swap will split folios in order to swap them.  Somebody
needs to fix that, but it should work.
  
David Hildenbrand July 7, 2023, 1:24 p.m. UTC | #13
On 07.07.23 15:12, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>> On 06.07.23 10:02, Ryan Roberts wrote:
>> But can you comment on the page migration part (IOW did you try it already)?
>>
>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>> page migration of something that was allocated using GFP_MOVABLE to actually
>> work.
>>
>> Compaction seems to skip any higher-order folios, but the question is if the
>> underlying migration itself works.
>>
>> If it already works: great! If not, this really has to be tackled early,
>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
> 
> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
> is not.

Thanks! Very nice if at least ordinary migration works.

> 
> If you look at a function like folio_migrate_mapping(), it all seems
> appropriately folio-ised.  There might be something in there that is
> slightly wrong, but that would just be a bug to fix, not a huge
> architectural problem.
> 
> The problem comes in the callers of migrate_pages().  They pass a
> new_folio_t callback.  alloc_migration_target() is the usual one passed
> and as far as I can tell is fine.  I've seen no problems reported with it.
> 
> compaction_alloc() is a disaster, and I don't know how to fix it.
> The compaction code has its own allocator which is populated with order-0
> folios.  How it populates that freelist is awful ... see split_map_pages()

Yeah, all that code was written under the assumption that we're moving 
order-0 pages (which is what the anon+pagecache pages used to be).

From what I recall, we're allocating order-0 pages from the high memory 
addresses, so we can migrate from low memory addresses, effectively 
freeing up low memory addresses and filling high memory addresses.

Adjusting that will be ... interesting. Instead of allocating order-0 
pages from high addresses, we might want to allocate "as large as 
possible" ("grab what we can") from high addresses and then have our own 
kind of buddy for allocating from that pool a compaction destination 
page, depending on our source page. Nasty.

What should always work is the split->migrate. But that's definitely not 
what we want in many cases.

> 
>> Is swapping working as expected? zswap?
> 
> Suboptimally.  Swap will split folios in order to swap them.  Somebody
> needs to fix that, but it should work.

Good!

It would be great to have some kind of a feature matrix that tells us 
what works perfectly, sub-optimally, barely, not at all (and what has 
not been tested). Maybe (likely!) we'll also find things that are 
sub-optimal for ordinary THP (like swapping - I'm not even sure about that).

I suspect that KSM should work mostly fine with flexible-thp. When 
deduplicating, we'll simply split the compound page and proceed as 
expected. But might be worth testing as well.
  
Ryan Roberts July 10, 2023, 10:07 a.m. UTC | #14
On 07/07/2023 14:24, David Hildenbrand wrote:
> On 07.07.23 15:12, Matthew Wilcox wrote:
>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>> But can you comment on the page migration part (IOW did you try it already)?
>>>
>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>> work.
>>>
>>> Compaction seems to skip any higher-order folios, but the question is if the
>>> underlying migration itself works.
>>>
>>> If it already works: great! If not, this really has to be tackled early,
>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>
>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>> is not.
> 
> Thanks! Very nice if at least ordinary migration works.

That's good to hear - I hadn't personally investigated.

> 
>>
>> If you look at a function like folio_migrate_mapping(), it all seems
>> appropriately folio-ised.  There might be something in there that is
>> slightly wrong, but that would just be a bug to fix, not a huge
>> architectural problem.
>>
>> The problem comes in the callers of migrate_pages().  They pass a
>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>
>> compaction_alloc() is a disaster, and I don't know how to fix it.
>> The compaction code has its own allocator which is populated with order-0
>> folios.  How it populates that freelist is awful ... see split_map_pages()

I think this compaction issue also affects large folios in the page cache? So
really it is a pre-existing bug in the code base that needs to be fixed
independently of large anon folios? Should I assume you are tackling this, Matthew?

> 
> Yeah, all that code was written under the assumption that we're moving order-0
> pages (which is what the anon+pagecache pages used to be).
> 
> From what I recall, we're allocating order-0 pages from the high memory
> addresses, so we can migrate from low memory addresses, effectively freeing up
> low memory addresses and filling high memory addresses.
> 
> Adjusting that will be ... interesting. Instead of allocating order-0 pages from
> high addresses, we might want to allocate "as large as possible" ("grab what we
> can") from high addresses and then have our own kind of buddy for allocating
> from that pool a compaction destination page, depending on our source page. Nasty.
> 
> What should always work is the split->migrate. But that's definitely not what we
> want in many cases.
> 
>>
>>> Is swapping working as expected? zswap?
>>
>> Suboptimally.  Swap will split folios in order to swap them.  Somebody
>> needs to fix that, but it should work.
> 
> Good!
> 
> It would be great to have some kind of a feature matrix that tells us what works
> perfectly, sub-optimally, barely, not at all (and what has not been tested).
> Maybe (likely!) we'll also find things that are sub-optimal for ordinary THP
> (like swapping - I'm not even sure about that).

I'm building a list of known issues, but so far it has been based on code I've
found during review and things raised by people in these threads. Are there test
suites that explicitly test these features? If so I'll happily run them against
large anon folios, but at the moment I'm ignorant, I'm afraid. I have been
trying to get the mm selftests up and running, but I currently have a bunch of
failures on arm64, even without any of my patches - something I'm working through.

> 
> I suspect that KSM should work mostly fine with flexible-thp. When
> deduplicating, we'll simply split the compound page and proceed as expected. But
> might be worth testing as well.
>
  
Zi Yan July 10, 2023, 4:53 p.m. UTC | #15
On 7 Jul 2023, at 9:24, David Hildenbrand wrote:

> On 07.07.23 15:12, Matthew Wilcox wrote:
>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>> But can you comment on the page migration part (IOW did you try it already)?
>>>
>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>> work.
>>>
>>> Compaction seems to skip any higher-order folios, but the question is if the
>>> underlying migration itself works.
>>>
>>> If it already works: great! If not, this really has to be tackled early,
>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>
>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>> is not.
>
> Thanks! Very nice if at least ordinary migration works.
>
>>
>> If you look at a function like folio_migrate_mapping(), it all seems
>> appropriately folio-ised.  There might be something in there that is
>> slightly wrong, but that would just be a bug to fix, not a huge
>> architectural problem.
>>
>> The problem comes in the callers of migrate_pages().  They pass a
>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>
>> compaction_alloc() is a disaster, and I don't know how to fix it.
>> The compaction code has its own allocator which is populated with order-0
>> folios.  How it populates that freelist is awful ... see split_map_pages()
>
> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages used to be).
>
> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>
> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.

We probably do not need a pool: before migration we have already isolated the
folios to be migrated, so we can gather stats on how many folios there are at
each order. Then we can isolate free pages based on those stats rather than
splitting free pages all the way down to order-0. We can sort the source
folios by order and isolate free pages from largest order to smallest. That
could avoid a free page pool.
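
Something along these lines (pure sketch; isolate_free_targets() and
isolate_freepages_order() are hypothetical names, just to illustrate the
bookkeeping):

static void isolate_free_targets(struct compact_control *cc)
{
	unsigned int nr_at_order[MAX_ORDER + 1] = { 0 };
	struct folio *folio;
	unsigned int i;
	int order;

	/* Count the isolated source folios at each order. */
	list_for_each_entry(folio, &cc->migratepages, lru)
		nr_at_order[folio_order(folio)]++;

	/*
	 * Isolate matching free pages from largest order to smallest,
	 * splitting only as far down as the stats require instead of
	 * unconditionally splitting everything to order-0.
	 */
	for (order = MAX_ORDER; order >= 0; order--)
		for (i = 0; i < nr_at_order[order]; i++)
			isolate_freepages_order(cc, order);
}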

--
Best Regards,
Yan, Zi
  
Matthew Wilcox July 10, 2023, 4:57 p.m. UTC | #16
On Mon, Jul 10, 2023 at 11:07:47AM +0100, Ryan Roberts wrote:
> I think this compaction issue also affects large folios in the page cache? So
> really it is a pre-existing bug in the code base that needs to be fixed
> independently of large anon folios? Should I assume you are tackling this, Matthew?

It does need to be fixed independently of large anon folios.  Said fix
should probably be backported to 6.1 once it's suitably stable.  However,
I'm not working on it.  I have a lot of projects and this one's a
missed-opportunity, not a show-stopper.  Sounds like Zi Yan might be
interested in tackling it though!
  
Luis Chamberlain July 11, 2023, 9:11 p.m. UTC | #17
On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
> 
> > Is swapping working as expected? zswap?
> 
> Suboptimally.  Swap will split folios in order to swap them.

Wouldn't that mean that if high-order folios are used a lot but swap is
also used, then until this is fixed you wouldn't get the expected reclaim
gains for high-order folios, and we'd need compaction more?

> Somebody needs to fix that, but it should work.

As we look at the shmem stuff this was on the path, so it's something we
have considered doing. I.e., it's on our team's list of items to help with,
but it's currently on the back burner.

  Luis
  
Matthew Wilcox July 11, 2023, 9:59 p.m. UTC | #18
On Tue, Jul 11, 2023 at 02:11:19PM -0700, Luis Chamberlain wrote:
> On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
> > 
> > > Is swapping working as expected? zswap?
> > 
> > Suboptimally.  Swap will split folios in order to swap them.
> 
> Wouldn't that mean that if high-order folios are used a lot but swap is
> also used, then until this is fixed you wouldn't get the expected reclaim
> gains for high-order folios, and we'd need compaction more?

They're split in shrink_folio_list(), so they stay intact until
that point?
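
The relevant path is roughly this (simplified from-memory sketch of the anon
branch in shrink_folio_list()):

	if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
	    !folio_test_swapcache(folio)) {
		if (!add_to_swap(folio)) {
			if (!folio_test_large(folio))
				goto activate_locked_split;
			/* Fall back to splitting and swapping base pages. */
			if (split_folio_to_list(folio, folio_list))
				goto activate_locked;
			if (!add_to_swap(folio))
				goto activate_locked_split;
		}
	}

So the folio stays whole on the LRU and is only split at the point where
reclaim actually decides to swap it out.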

> > Somebody needs to fix that, but it should work.
> 
> As we look at the shmem stuff this was on the path, so it's something we
> have considered doing. I.e., it's on our team's list of items to help with,
> but it's currently on the back burner.

Something I was thinking about is that you'll need to prohibit swap
devices or swap files being created on large block devices.  Until
we rewrite the entire swap subsystem ...
  
Ryan Roberts July 19, 2023, 3:49 p.m. UTC | #19
On 10/07/2023 17:53, Zi Yan wrote:
> On 7 Jul 2023, at 9:24, David Hildenbrand wrote:
> 
>> On 07.07.23 15:12, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>>> But can you comment on the page migration part (IOW did you try it already)?
>>>>
>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>>> work.
>>>>
>>>> Compaction seems to skip any higher-order folios, but the question is if the
>>>> underlying migration itself works.
>>>>
>>>> If it already works: great! If not, this really has to be tackled early,
>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>>
>>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>>> is not.
>>
>> Thanks! Very nice if at least ordinary migration works.
>>
>>>
>>> If you look at a function like folio_migrate_mapping(), it all seems
>>> appropriately folio-ised.  There might be something in there that is
>>> slightly wrong, but that would just be a bug to fix, not a huge
>>> architectural problem.
>>>
>>> The problem comes in the callers of migrate_pages().  They pass a
>>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>>
>>> compaction_alloc() is a disaster, and I don't know how to fix it.
>>> The compaction code has its own allocator which is populated with order-0
>>> folios.  How it populates that freelist is awful ... see split_map_pages()
>>
>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages used to be).
>>
>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>>
>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.
> 
> We probably do not need a pool: before migration we have already isolated the
> folios to be migrated, so we can gather stats on how many folios there are at
> each order. Then we can isolate free pages based on those stats rather than
> splitting free pages all the way down to order-0. We can sort the source
> folios by order and isolate free pages from largest order to smallest. That
> could avoid a free page pool.

Hi Zi, I just wanted to check; is this something you are working on or planning
to work on? I'm trying to maintain a list of all the items that need to get
sorted for large anon folios. It would be great to put your name against it! ;-)

> 
> --
> Best Regards,
> Yan, Zi
  
Zi Yan July 19, 2023, 4:05 p.m. UTC | #20
On 19 Jul 2023, at 11:49, Ryan Roberts wrote:

> On 10/07/2023 17:53, Zi Yan wrote:
>> On 7 Jul 2023, at 9:24, David Hildenbrand wrote:
>>
>>> On 07.07.23 15:12, Matthew Wilcox wrote:
>>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>>>> But can you comment on the page migration part (IOW did you try it already)?
>>>>>
>>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>>>> work.
>>>>>
>>>>> Compaction seems to skip any higher-order folios, but the question is if the
>>>>> underlying migration itself works.
>>>>>
>>>>> If it already works: great! If not, this really has to be tackled early,
>>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>>>
>>>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>>>> is not.
>>>
>>> Thanks! Very nice if at least ordinary migration works.
>>>
>>>>
>>>> If you look at a function like folio_migrate_mapping(), it all seems
>>>> appropriately folio-ised.  There might be something in there that is
>>>> slightly wrong, but that would just be a bug to fix, not a huge
>>>> architectural problem.
>>>>
>>>> The problem comes in the callers of migrate_pages().  They pass a
>>>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>>>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>>>
>>>> compaction_alloc() is a disaster, and I don't know how to fix it.
>>>> The compaction code has its own allocator which is populated with order-0
>>>> folios.  How it populates that freelist is awful ... see split_map_pages()
>>>
>>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages used to be).
>>>
>>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>>>
>>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.
>>
>> We probably do not need a pool: before migration we have already isolated the
>> folios to be migrated, so we can gather stats on how many folios there are at
>> each order. Then we can isolate free pages based on those stats rather than
>> splitting free pages all the way down to order-0. We can sort the source
>> folios by order and isolate free pages from largest order to smallest. That
>> could avoid a free page pool.
>
> Hi Zi, I just wanted to check; is this something you are working on or planning
> to work on? I'm trying to maintain a list of all the items that need to get
> sorted for large anon folios. It would be great to put your name against it! ;-)

Sure. I can work on this one.

--
Best Regards,
Yan, Zi
  
Ryan Roberts July 19, 2023, 6:37 p.m. UTC | #21
On 19/07/2023 17:05, Zi Yan wrote:
> On 19 Jul 2023, at 11:49, Ryan Roberts wrote:
> 
>> On 10/07/2023 17:53, Zi Yan wrote:
>>> On 7 Jul 2023, at 9:24, David Hildenbrand wrote:
>>>
>>>> On 07.07.23 15:12, Matthew Wilcox wrote:
>>>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>>>>>> On 06.07.23 10:02, Ryan Roberts wrote:
>>>>>> But can you comment on the page migration part (IOW did you try it already)?
>>>>>>
>>>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>>>>>> page migration of something that was allocated using GFP_MOVABLE to actually
>>>>>> work.
>>>>>>
>>>>>> Compaction seems to skip any higher-order folios, but the question is if the
>>>>>> underlying migration itself works.
>>>>>>
>>>>>> If it already works: great! If not, this really has to be tackled early,
>>>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>>>>>
>>>>> I have looked at this a bit.  _Migration_ should be fine.  _Compaction_
>>>>> is not.
>>>>
>>>> Thanks! Very nice if at least ordinary migration works.
>>>>
>>>>>
>>>>> If you look at a function like folio_migrate_mapping(), it all seems
>>>>> appropriately folio-ised.  There might be something in there that is
>>>>> slightly wrong, but that would just be a bug to fix, not a huge
>>>>> architectural problem.
>>>>>
>>>>> The problem comes in the callers of migrate_pages().  They pass a
>>>>> new_folio_t callback.  alloc_migration_target() is the usual one passed
>>>>> and as far as I can tell is fine.  I've seen no problems reported with it.
>>>>>
>>>>> compaction_alloc() is a disaster, and I don't know how to fix it.
>>>>> The compaction code has its own allocator which is populated with order-0
>>>>> folios.  How it populates that freelist is awful ... see split_map_pages()
>>>>
>>>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages used to be).
>>>>
>>>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses.
>>>>
>>>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty.
>>>
>>> We probably do not need a pool: before migration we have already isolated the
>>> folios to be migrated, so we can gather stats on how many folios there are at
>>> each order. Then we can isolate free pages based on those stats rather than
>>> splitting free pages all the way down to order-0. We can sort the source
>>> folios by order and isolate free pages from largest order to smallest. That
>>> could avoid a free page pool.
>>
>> Hi Zi, I just wanted to check; is this something you are working on or planning
>> to work on? I'm trying to maintain a list of all the items that need to get
>> sorted for large anon folios. It would be great to put your name against it! ;-)
> 
> Sure. I can work on this one.

Awesome - thanks!

> 
> --
> Best Regards,
> Yan, Zi