[v3,0/4] variable-order, large folios for anonymous memory

Message ID	20230714160407.4142030-1-ryan.roberts@arm.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Ryan Roberts <ryan.roberts@arm.com> To: Andrew Morton <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Yin Fengwei <fengwei.yin@intel.com>, David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Anshuman Khandual <anshuman.khandual@arm.com>, Yang Shi <shy828301@gmail.com>, "Huang, Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>, Luis Chamberlain <mcgrof@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v3 0/4] variable-order, large folios for anonymous memory Date: Fri, 14 Jul 2023 17:04:03 +0100 Message-Id: <20230714160407.4142030-1-ryan.roberts@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	variable-order, large folios for anonymous memory \| [v3,0/4] variable-order, large folios for anonymous memory [v3,1/4] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() [v3,2/4] mm: Default implementation of arch_wants_pte_order() [v3,3/4] mm: FLEXIBLE_THP for improved performance [v3,4/4] arm64: mm: Override arch_wants_pte_order()

Message ID

20230714160407.4142030-1-ryan.roberts@arm.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
        Matthew Wilcox <willy@infradead.org>,
        "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
        Yin Fengwei <fengwei.yin@intel.com>,
        David Hildenbrand <david@redhat.com>,
        Yu Zhao <yuzhao@google.com>,
        Catalin Marinas <catalin.marinas@arm.com>,
        Will Deacon <will@kernel.org>,
        Anshuman Khandual <anshuman.khandual@arm.com>,
        Yang Shi <shy828301@gmail.com>,
        "Huang, Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
        Luis Chamberlain <mcgrof@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
        linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
        linux-mm@kvack.org
Subject: [PATCH v3 0/4] variable-order, large folios for anonymous memory
Date: Fri, 14 Jul 2023 17:04:03 +0100
Message-Id: <20230714160407.4142030-1-ryan.roberts@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

variable-order, large folios for anonymous memory |

Message

Ryan Roberts July 14, 2023, 4:04 p.m. UTC

  Hi All,

This is v3 of a series to implement variable order, large folios for anonymous
memory. (currently called "FLEXIBLE_THP") The objective of this is to improve
performance by allocating larger chunks of memory during anonymous page faults.
See [1] and [2] for background.

There has been quite a bit more rework and simplification, mainly based on
feedback from Yu Zhao. Additionally, I've added a command line parameter,
flexthp_unhinted_max, the idea for which came from discussion with David
Hildenbrand (thanks for all your feedback!).

The last patch is for arm64 to explicitly override the default
arch_wants_pte_order() and is intended as an example. If this series is accepted
I suggest taking the first 3 patches through the mm tree and the arm64 change
could be handled through the arm64 tree separately. Neither has any build
dependency on the other.

The patches are based on top of v6.5-rc1. I have a branch at [3].


Changes since v2 [2]
--------------------

  - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
      - Huang, Ying suggested the "batch zap" work (which I dropped from this
        series after v1) is a prerequisite for merging FLXEIBLE_THP, so I've
        moved the deferred split patch to a separate series along with the batch
        zap changes. I plan to submit this series early next week.
  - Changed folio order fallback policy
      - We no longer iterate from preferred to 0 looking for acceptable policy
      - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
  - Removed vma parameter from arch_wants_pte_order()
  - Added command line parameter `flexthp_unhinted_max`
      - clamps preferred order when vma hasn't explicitly opted-in to THP
  - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
    for process or system).
  - Simplified implementation and integration with do_anonymous_page()
  - Removed dependency on set_ptes()


Performance
-----------

Performance is still similar to v2; see cover letter at [2].


Opens
-----

  - Feature name: FLEXIBLE_THP or LARGE_ANON_FOLIO?
      - Given the closer policy ties to THP, I prefer FLEXIBLE_THP
  - Prerequisits for merging
      - Sounds like there is a concensus that we should wait until exisitng
        features are improved to place nicely with large folios.


[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230703135330.1865927-1-ryan.roberts@arm.com/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v3


Thanks,
Ryan


Ryan Roberts (4):
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: Default implementation of arch_wants_pte_order()
  mm: FLEXIBLE_THP for improved performance
  arm64: mm: Override arch_wants_pte_order()

 .../admin-guide/kernel-parameters.txt         |  10 +
 arch/arm64/include/asm/pgtable.h              |   6 +
 include/linux/pgtable.h                       |  13 ++
 mm/Kconfig                                    |  10 +
 mm/memory.c                                   | 187 ++++++++++++++++--
 mm/rmap.c                                     |  28 ++-
 6 files changed, 230 insertions(+), 24 deletions(-)

--
2.25.1

Comments

Ryan Roberts July 24, 2023, 11:59 a.m. UTC | #1

On 14/07/2023 17:04, Ryan Roberts wrote:
> Hi All,
> 
> This is v3 of a series to implement variable order, large folios for anonymous
> memory. (currently called "FLEXIBLE_THP") The objective of this is to improve
> performance by allocating larger chunks of memory during anonymous page faults.
> See [1] and [2] for background.

A question for anyone that can help; I'm preparing v4 and as part of that am
running the mm selftests, now that I've fixed them  up to run reliably for
arm64. This is showing 2 regressions vs the v6.5-rc3 baseline:

1) khugepaged test fails here:
# Run test: collapse_max_ptes_none (khugepaged:anon)
# Maybe collapse with max_ptes_none exceeded.... Fail
# Unexpected huge page

2) split_huge_page_test fails with:
# Still AnonHugePages not split

I *think* (but haven't yet verified) that (1) is due to khugepaged ignoring
non-order-0 folios when looking for candidates to collapse. Now that we have
large anon folios, the memory allocated by the test is in large folios and
therefore does not get collapsed. We understand this issue, and I believe
DavidH's new scheme for determining exclusive vs shared should give us the tools
to solve this.

But (2) is weird. If I run this test on its own immediately after booting, it
passes. If I then run the khugepaged test, then re-run this test, it fails.

The test is allocating 4 hugepages, then requesting they are split using the
debugfs interface. Then the test looks at /proc/self/smaps to check that
AnonHugePages is back to 0.

In both the passing and failing cases, the kernel thinks that it has
successfully split the pages; the debug logs in split_huge_pages_pid() confirm
this. In the failing case, I wonder if somehow khugepaged could be immediately
re-collapsing the pages before user sapce can observe the split? Perhaps the
failed khugepaged test has left khugepaged in an "awake" state and it
immediately pounces?

Thanks,
Ryan

Zi Yan July 24, 2023, 2:58 p.m. UTC | #2

On 24 Jul 2023, at 7:59, Ryan Roberts wrote:

> On 14/07/2023 17:04, Ryan Roberts wrote:
>> Hi All,
>>
>> This is v3 of a series to implement variable order, large folios for anonymous
>> memory. (currently called "FLEXIBLE_THP") The objective of this is to improve
>> performance by allocating larger chunks of memory during anonymous page faults.
>> See [1] and [2] for background.
>
> A question for anyone that can help; I'm preparing v4 and as part of that am
> running the mm selftests, now that I've fixed them  up to run reliably for
> arm64. This is showing 2 regressions vs the v6.5-rc3 baseline:
>
> 1) khugepaged test fails here:
> # Run test: collapse_max_ptes_none (khugepaged:anon)
> # Maybe collapse with max_ptes_none exceeded.... Fail
> # Unexpected huge page
>
> 2) split_huge_page_test fails with:
> # Still AnonHugePages not split
>
> I *think* (but haven't yet verified) that (1) is due to khugepaged ignoring
> non-order-0 folios when looking for candidates to collapse. Now that we have
> large anon folios, the memory allocated by the test is in large folios and
> therefore does not get collapsed. We understand this issue, and I believe
> DavidH's new scheme for determining exclusive vs shared should give us the tools
> to solve this.
>
> But (2) is weird. If I run this test on its own immediately after booting, it
> passes. If I then run the khugepaged test, then re-run this test, it fails.
>
> The test is allocating 4 hugepages, then requesting they are split using the
> debugfs interface. Then the test looks at /proc/self/smaps to check that
> AnonHugePages is back to 0.
>
> In both the passing and failing cases, the kernel thinks that it has
> successfully split the pages; the debug logs in split_huge_pages_pid() confirm
> this. In the failing case, I wonder if somehow khugepaged could be immediately
> re-collapsing the pages before user sapce can observe the split? Perhaps the
> failed khugepaged test has left khugepaged in an "awake" state and it
> immediately pounces?

This is more likely to be a stats issue. Have you checked smap to see if
AnonHugePages is 0 KB by placing a getchar() before the exit(EXIT_FAILURE)?
Since split_huge_page_test checks that stats to make sure the split indeed
happened.

--
Best Regards,
Yan, Zi

Ryan Roberts July 24, 2023, 3:41 p.m. UTC | #3

On 24/07/2023 15:58, Zi Yan wrote:
> On 24 Jul 2023, at 7:59, Ryan Roberts wrote:
> 
>> On 14/07/2023 17:04, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This is v3 of a series to implement variable order, large folios for anonymous
>>> memory. (currently called "FLEXIBLE_THP") The objective of this is to improve
>>> performance by allocating larger chunks of memory during anonymous page faults.
>>> See [1] and [2] for background.
>>
>> A question for anyone that can help; I'm preparing v4 and as part of that am
>> running the mm selftests, now that I've fixed them  up to run reliably for
>> arm64. This is showing 2 regressions vs the v6.5-rc3 baseline:
>>
>> 1) khugepaged test fails here:
>> # Run test: collapse_max_ptes_none (khugepaged:anon)
>> # Maybe collapse with max_ptes_none exceeded.... Fail
>> # Unexpected huge page
>>
>> 2) split_huge_page_test fails with:
>> # Still AnonHugePages not split
>>
>> I *think* (but haven't yet verified) that (1) is due to khugepaged ignoring
>> non-order-0 folios when looking for candidates to collapse. Now that we have
>> large anon folios, the memory allocated by the test is in large folios and
>> therefore does not get collapsed. We understand this issue, and I believe
>> DavidH's new scheme for determining exclusive vs shared should give us the tools
>> to solve this.
>>
>> But (2) is weird. If I run this test on its own immediately after booting, it
>> passes. If I then run the khugepaged test, then re-run this test, it fails.
>>
>> The test is allocating 4 hugepages, then requesting they are split using the
>> debugfs interface. Then the test looks at /proc/self/smaps to check that
>> AnonHugePages is back to 0.
>>
>> In both the passing and failing cases, the kernel thinks that it has
>> successfully split the pages; the debug logs in split_huge_pages_pid() confirm
>> this. In the failing case, I wonder if somehow khugepaged could be immediately
>> re-collapsing the pages before user sapce can observe the split? Perhaps the
>> failed khugepaged test has left khugepaged in an "awake" state and it
>> immediately pounces?
> 
> This is more likely to be a stats issue. Have you checked smap to see if
> AnonHugePages is 0 KB by placing a getchar() before the exit(EXIT_FAILURE)?

Yes - its still 8192K. But looking at the code that value is determined from the
fact that there is a PMD block mapping present. And the split definitely
succeeded so something must have re-collapsed it.

Looking into the khugepaged test suite, it saves the thp and khugepaged settings
out of sysfs, modifies them for the tests, then restores them when finished. But
it doesn't restore if exiting early (due to failure). It changes the settings
for alloc_sleep_millisecs and scan_sleep_millisecs from a large number of
seconds to 10 ms, for example. So I'm pretty sure this is the culprit.


> Since split_huge_page_test checks that stats to make sure the split indeed
> happened.
> 
> --
> Best Regards,
> Yan, Zi

Ryan Roberts July 26, 2023, 8:42 a.m. UTC | #4

On 26/07/2023 08:36, Itaru Kitayama wrote:
> Ryan,
> Do you have a kselfrest code for this new feature?
> I’d like to test it out on FVP when I have the chance.

A very timely question! I have modified the mm/cow tests to additionally test
large anon folios. That patch is part of v4, which I am about to start writing
the cover letter for. So look out for that in around an hour.