[v3,3/4] mm: FLEXIBLE_THP for improved performance

Message ID 20230714161733.4144503-3-ryan.roberts@arm.com
State New
Series: variable-order, large folios for anonymous memory

Commit Message

Ryan Roberts July 14, 2023, 4:17 p.m. UTC
  Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
allocated in large folios of a determined order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced since those ops now become per-folio.

The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
defaults to disabled for now. The long-term aim is for this to default to
enabled, but there are some risks around internal fragmentation that
need to be better understood first.

When enabled, the folio order is determined as follows: For a vma, process
or system that has explicitly disabled THP, we continue to allocate
order-0. THP is most likely disabled to avoid any possible internal
fragmentation, so we honour that request.

Otherwise, the return value of arch_wants_pte_order() is used. For vmas
that have not explicitly opted in to using transparent hugepages (e.g.
where thp=madvise and the vma does not have MADV_HUGEPAGE),
arch_wants_pte_order() is limited by the new cmdline parameter,
`flexthp_unhinted_max`. This allows for a performance boost without
requiring any explicit opt-in from the workload while allowing the
sysadmin to tune between performance and internal fragmentation.

arch_wants_pte_order() can be overridden by the architecture if desired.
Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
set of ptes maps physically contiguous, naturally aligned memory, so this
mechanism allows the architecture to optimize as required.

If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order; first
PAGE_ALLOC_COSTLY_ORDER, then order-0.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 .../admin-guide/kernel-parameters.txt         |  10 +
 mm/Kconfig                                    |  10 +
 mm/memory.c                                   | 187 ++++++++++++++++--
 3 files changed, 190 insertions(+), 17 deletions(-)
  

Comments

Yu Zhao July 14, 2023, 5:17 p.m. UTC | #1
On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a determined order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; The long term aim is for this to defaut to
> enabled, but there are some risks around internal fragmentation that
> need to be better understood first.
>
> When enabled, the folio order is determined as such: For a vma, process
> or system that has explicitly disabled THP, we continue to allocate
> order-0. THP is most likely disabled to avoid any possible internal
> fragmentation so we honour that request.
>
> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> that have not explicitly opted-in to use transparent hugepages (e.g.
> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> arch_wants_pte_order() is limited by the new cmdline parameter,
> `flexthp_unhinted_max`. This allows for a performance boost without
> requiring any explicit opt-in from the workload while allowing the
> sysadmin to tune between performance and internal fragmentation.
>
> arch_wants_pte_order() can be overridden by the architecture if desired.
> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> set of ptes map physically contigious, naturally aligned memory, so this
> mechanism allows the architecture to optimize as required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order; first
> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  10 +
>  mm/Kconfig                                    |  10 +
>  mm/memory.c                                   | 187 ++++++++++++++++--
>  3 files changed, 190 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a1457995fd41..405d624e2191 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1497,6 +1497,16 @@
>                         See Documentation/admin-guide/sysctl/net.rst for
>                         fb_tunnels_only_for_init_ns
>
> +       flexthp_unhinted_max=
> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
> +                       folio size that will be allocated for an anonymous vma
> +                       that has neither explicitly opted in nor out of using
> +                       transparent hugepages. The size must be a power-of-2 in
> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
> +                       performance by reducing page faults, while a smaller
> +                       size reduces internal fragmentation. Default: max(64K,
> +                       PAGE_SIZE). Format: size[KMG].
> +

Let's split this parameter into a separate patch.

And I'm going to ask many questions about it (I can live with a sysctl
parameter but this boot parameter is unacceptable to me).

> diff --git a/mm/memory.c b/mm/memory.c
> index 01f39e8144ef..e8bc729efb9d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         return ret;
>  }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +       int i;
> +
> +       if (nr_pages == 1)
> +               return vmf_pte_changed(vmf);
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +
> +#ifdef CONFIG_FLEXIBLE_THP
> +static int flexthp_unhinted_max_order =
> +               ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
> +
> +static int __init parse_flexthp_unhinted_max(char *s)
> +{
> +       unsigned long long size = memparse(s, NULL);
> +
> +       if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
> +               pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
> +                       s, PAGE_SIZE, PMD_SIZE);
> +               return 1;
> +       }
> +
> +       flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
> +
> +       /* THP machinery requires at least 3 struct pages for meta data. */
> +       if (flexthp_unhinted_max_order == 1)
> +               flexthp_unhinted_max_order--;
> +
> +       return 1;
> +}
> +
> +__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> +       int order;
> +
> +       /*
> +        * If THP is explicitly disabled for either the vma, the process or the
> +        * system, then this is very likely intended to limit internal
> +        * fragmentation; in this case, don't attempt to allocate a large
> +        * anonymous folio.
> +        *
> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> +        * size preferred by the arch. Or if the arch requested a very small
> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> +        * which still meets the arch's requirements but means we still take
> +        * advantage of SW optimizations (e.g. fewer page faults).
> +        *
> +        * Finally if thp is enabled but the vma isn't eligible, take the
> +        * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
> +        * parameter. This allows a sysadmin to tune performance vs internal
> +        * fragmentation.
> +        */
> +
> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> +           !hugepage_flags_enabled())
> +               order = 0;
> +       else {
> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +                       order = min(order, flexthp_unhinted_max_order);
> +       }
> +
> +       return order;
> +}
> +
> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> +{
> +       int i;
> +       gfp_t gfp;
> +       pte_t *pte;
> +       unsigned long addr;
> +       struct vm_area_struct *vma = vmf->vma;
> +       int prefer = anon_folio_order(vma);
> +       int orders[] = {
> +               prefer,
> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +               0,
> +       };
> +
> +       *folio = NULL;
> +
> +       if (vmf_orig_pte_uffd_wp(vmf))
> +               goto fallback;
> +
> +       for (i = 0; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               if (addr >= vma->vm_start &&
> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> +                       break;
> +       }
> +
> +       if (!orders[i])
> +               goto fallback;
> +
> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +       if (!pte)
> +               return -EAGAIN;

It would be a bug if this happens. So probably -EINVAL?

> +
> +       for (; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               vmf->pte = pte + pte_index(addr);
> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> +                       break;
> +       }
> +
> +       vmf->pte = NULL;
> +       pte_unmap(pte);
> +
> +       gfp = vma_thp_gfp_mask(vma);
> +
> +       for (; orders[i]; i++) {
> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> +               if (*folio) {
> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> +                       return 0;
> +               }
> +       }
> +
> +fallback:
> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +       return *folio ? 0 : -ENOMEM;
> +}
> +#else
> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)

Drop "inline" (it doesn't do anything in .c).

The rest looks good to me.
  
Ryan Roberts July 14, 2023, 5:59 p.m. UTC | #2
On 14/07/2023 18:17, Yu Zhao wrote:
> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a determined order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; The long term aim is for this to defaut to
>> enabled, but there are some risks around internal fragmentation that
>> need to be better understood first.
>>
>> When enabled, the folio order is determined as such: For a vma, process
>> or system that has explicitly disabled THP, we continue to allocate
>> order-0. THP is most likely disabled to avoid any possible internal
>> fragmentation so we honour that request.
>>
>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>> that have not explicitly opted-in to use transparent hugepages (e.g.
>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>> arch_wants_pte_order() is limited by the new cmdline parameter,
>> `flexthp_unhinted_max`. This allows for a performance boost without
>> requiring any explicit opt-in from the workload while allowing the
>> sysadmin to tune between performance and internal fragmentation.
>>
>> arch_wants_pte_order() can be overridden by the architecture if desired.
>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
>> set of ptes map physically contigious, naturally aligned memory, so this
>> mechanism allows the architecture to optimize as required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order; first
>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  .../admin-guide/kernel-parameters.txt         |  10 +
>>  mm/Kconfig                                    |  10 +
>>  mm/memory.c                                   | 187 ++++++++++++++++--
>>  3 files changed, 190 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index a1457995fd41..405d624e2191 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -1497,6 +1497,16 @@
>>                         See Documentation/admin-guide/sysctl/net.rst for
>>                         fb_tunnels_only_for_init_ns
>>
>> +       flexthp_unhinted_max=
>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>> +                       folio size that will be allocated for an anonymous vma
>> +                       that has neither explicitly opted in nor out of using
>> +                       transparent hugepages. The size must be a power-of-2 in
>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>> +                       performance by reducing page faults, while a smaller
>> +                       size reduces internal fragmentation. Default: max(64K,
>> +                       PAGE_SIZE). Format: size[KMG].
>> +
> 
> Let's split this parameter into a separate patch.

Ha - I had it as a separate patch originally, but thought you'd ask for it to be
a single patch so squashed it ;-). I can separate it again, no problem.

> 
> And I'm going to ask many questions about it (I can live with a sysctl
> parameter but this boot parameter is unacceptable to me).

Please do. Hopefully the thread with DavidH against v2 gives the rationale. Care
to elaborate on why it's unacceptable?

> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 01f39e8144ef..e8bc729efb9d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>         return ret;
>>  }
>>
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +       int i;
>> +
>> +       if (nr_pages == 1)
>> +               return vmf_pte_changed(vmf);
>> +
>> +       for (i = 0; i < nr_pages; i++) {
>> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +                       return true;
>> +       }
>> +
>> +       return false;
>> +}
>> +
>> +#ifdef CONFIG_FLEXIBLE_THP
>> +static int flexthp_unhinted_max_order =
>> +               ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
>> +
>> +static int __init parse_flexthp_unhinted_max(char *s)
>> +{
>> +       unsigned long long size = memparse(s, NULL);
>> +
>> +       if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
>> +               pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
>> +                       s, PAGE_SIZE, PMD_SIZE);
>> +               return 1;
>> +       }
>> +
>> +       flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
>> +
>> +       /* THP machinery requires at least 3 struct pages for meta data. */
>> +       if (flexthp_unhinted_max_order == 1)
>> +               flexthp_unhinted_max_order--;
>> +
>> +       return 1;
>> +}
>> +
>> +__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
>> +
>> +static int anon_folio_order(struct vm_area_struct *vma)
>> +{
>> +       int order;
>> +
>> +       /*
>> +        * If THP is explicitly disabled for either the vma, the process or the
>> +        * system, then this is very likely intended to limit internal
>> +        * fragmentation; in this case, don't attempt to allocate a large
>> +        * anonymous folio.
>> +        *
>> +        * Else, if the vma is eligible for thp, allocate a large folio of the
>> +        * size preferred by the arch. Or if the arch requested a very small
>> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>> +        * which still meets the arch's requirements but means we still take
>> +        * advantage of SW optimizations (e.g. fewer page faults).
>> +        *
>> +        * Finally if thp is enabled but the vma isn't eligible, take the
>> +        * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
>> +        * parameter. This allows a sysadmin to tune performance vs internal
>> +        * fragmentation.
>> +        */
>> +
>> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>> +           !hugepage_flags_enabled())
>> +               order = 0;
>> +       else {
>> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>> +
>> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> +                       order = min(order, flexthp_unhinted_max_order);
>> +       }
>> +
>> +       return order;
>> +}
>> +
>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
>> +{
>> +       int i;
>> +       gfp_t gfp;
>> +       pte_t *pte;
>> +       unsigned long addr;
>> +       struct vm_area_struct *vma = vmf->vma;
>> +       int prefer = anon_folio_order(vma);
>> +       int orders[] = {
>> +               prefer,
>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
>> +               0,
>> +       };
>> +
>> +       *folio = NULL;
>> +
>> +       if (vmf_orig_pte_uffd_wp(vmf))
>> +               goto fallback;
>> +
>> +       for (i = 0; orders[i]; i++) {
>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>> +               if (addr >= vma->vm_start &&
>> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
>> +                       break;
>> +       }
>> +
>> +       if (!orders[i])
>> +               goto fallback;
>> +
>> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +       if (!pte)
>> +               return -EAGAIN;
> 
> It would be a bug if this happens. So probably -EINVAL?

Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
possible for pte_offset_map() to fail (if I understood correctly) and we have to
handle this. The intent is that we will return from the fault without making any
change, then we will refault and try again.

> 
>> +
>> +       for (; orders[i]; i++) {
>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>> +               vmf->pte = pte + pte_index(addr);
>> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
>> +                       break;
>> +       }
>> +
>> +       vmf->pte = NULL;
>> +       pte_unmap(pte);
>> +
>> +       gfp = vma_thp_gfp_mask(vma);
>> +
>> +       for (; orders[i]; i++) {
>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
>> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
>> +               if (*folio) {
>> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
>> +                       return 0;
>> +               }
>> +       }
>> +
>> +fallback:
>> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +       return *folio ? 0 : -ENOMEM;
>> +}
>> +#else
>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> 
> Drop "inline" (it doesn't do anything in .c).

There are 38 instances of inline in memory.c alone, so it looks like a well-used
convention, even if the compiler may choose to ignore it. Perhaps you can educate
me; what's the benefit of dropping it?

> 
> The rest looks good to me.

Great - just in case it wasn't obvious, I decided not to overwrite vmf->address
with the aligned version, as you suggested, for 2 reasons: 1) address is const
in the struct, so I would have had to change that; 2) there is a uffd path that
can be taken after the vmf->address fixup would have occurred, and that path
consumes the member, so it would have had to be un-fixed-up, making it messier
than the way I opted for.

Thanks for the quick review as always!
  
Yu Zhao July 14, 2023, 10:11 p.m. UTC | #3
On Fri, Jul 14, 2023 at 11:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 14/07/2023 18:17, Yu Zhao wrote:
> > On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> >> allocated in large folios of a determined order. All pages of the large
> >> folio are pte-mapped during the same page fault, significantly reducing
> >> the number of page faults. The number of per-page operations (e.g. ref
> >> counting, rmap management lru list management) are also significantly
> >> reduced since those ops now become per-folio.
> >>
> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> >> defaults to disabled for now; The long term aim is for this to defaut to
> >> enabled, but there are some risks around internal fragmentation that
> >> need to be better understood first.
> >>
> >> When enabled, the folio order is determined as such: For a vma, process
> >> or system that has explicitly disabled THP, we continue to allocate
> >> order-0. THP is most likely disabled to avoid any possible internal
> >> fragmentation so we honour that request.
> >>
> >> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
> >> that have not explicitly opted-in to use transparent hugepages (e.g.
> >> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
> >> arch_wants_pte_order() is limited by the new cmdline parameter,
> >> `flexthp_unhinted_max`. This allows for a performance boost without
> >> requiring any explicit opt-in from the workload while allowing the
> >> sysadmin to tune between performance and internal fragmentation.
> >>
> >> arch_wants_pte_order() can be overridden by the architecture if desired.
> >> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous
> >> set of ptes map physically contigious, naturally aligned memory, so this
> >> mechanism allows the architecture to optimize as required.
> >>
> >> If the preferred order can't be used (e.g. because the folio would
> >> breach the bounds of the vma, or because ptes in the region are already
> >> mapped) then we fall back to a suitable lower order; first
> >> PAGE_ALLOC_COSTLY_ORDER, then order-0.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  .../admin-guide/kernel-parameters.txt         |  10 +
> >>  mm/Kconfig                                    |  10 +
> >>  mm/memory.c                                   | 187 ++++++++++++++++--
> >>  3 files changed, 190 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> >> index a1457995fd41..405d624e2191 100644
> >> --- a/Documentation/admin-guide/kernel-parameters.txt
> >> +++ b/Documentation/admin-guide/kernel-parameters.txt
> >> @@ -1497,6 +1497,16 @@
> >>                         See Documentation/admin-guide/sysctl/net.rst for
> >>                         fb_tunnels_only_for_init_ns
> >>
> >> +       flexthp_unhinted_max=
> >> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
> >> +                       folio size that will be allocated for an anonymous vma
> >> +                       that has neither explicitly opted in nor out of using
> >> +                       transparent hugepages. The size must be a power-of-2 in
> >> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
> >> +                       performance by reducing page faults, while a smaller
> >> +                       size reduces internal fragmentation. Default: max(64K,
> >> +                       PAGE_SIZE). Format: size[KMG].
> >> +
> >
> > Let's split this parameter into a separate patch.
>
> Ha - I had it as a separate patch originally, but thought you'd ask for it to be
> a single patch so squashed it ;-). I can separate it again, no problem.
>
> >
> > And I'm going to ask many questions about it (I can live with a sysctl
> > parameter but this boot parameter is unacceptable to me).
>
> Please do. Hopefully the thread with DavidH against v2 gives the rationale. Care
> to elaborate on why its unacceptable?

For starters, it's bad practice not to start with the first option on this
list that works, and only add the later ones if/when new use cases arise:
1. per-vma/process policies, e.g., madvise
2. per-memcg policies, e.g., cgroup/memory.*
3. systemwide runtime knobs, e.g., sysctl
4. boot parameters, e.g., this one
5. Kconfig options

Assuming the optimal state has one systemwide value, we would have to
find it by trial and error. And each trial would require a reboot,
which could be avoided if it was a sysctl parameter.

And why can we assume the optimal state has only one value?

(CONFIG_FLEXIBLE_THP is better than sysctl only because we plan to get
rid of it once we are ready -- using sysctl would cause an ABI
breakage, which might be acceptable but why do so if it can be
avoided.)

> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 01f39e8144ef..e8bc729efb9d 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4050,6 +4050,148 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>         return ret;
> >>  }
> >>
> >> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> >> +{
> >> +       int i;
> >> +
> >> +       if (nr_pages == 1)
> >> +               return vmf_pte_changed(vmf);
> >> +
> >> +       for (i = 0; i < nr_pages; i++) {
> >> +               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> >> +                       return true;
> >> +       }
> >> +
> >> +       return false;
> >> +}
> >> +
> >> +#ifdef CONFIG_FLEXIBLE_THP
> >> +static int flexthp_unhinted_max_order =
> >> +               ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
> >> +
> >> +static int __init parse_flexthp_unhinted_max(char *s)
> >> +{
> >> +       unsigned long long size = memparse(s, NULL);
> >> +
> >> +       if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
> >> +               pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
> >> +                       s, PAGE_SIZE, PMD_SIZE);
> >> +               return 1;
> >> +       }
> >> +
> >> +       flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
> >> +
> >> +       /* THP machinery requires at least 3 struct pages for meta data. */
> >> +       if (flexthp_unhinted_max_order == 1)
> >> +               flexthp_unhinted_max_order--;
> >> +
> >> +       return 1;
> >> +}
> >> +
> >> +__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
> >> +
> >> +static int anon_folio_order(struct vm_area_struct *vma)
> >> +{
> >> +       int order;
> >> +
> >> +       /*
> >> +        * If THP is explicitly disabled for either the vma, the process or the
> >> +        * system, then this is very likely intended to limit internal
> >> +        * fragmentation; in this case, don't attempt to allocate a large
> >> +        * anonymous folio.
> >> +        *
> >> +        * Else, if the vma is eligible for thp, allocate a large folio of the
> >> +        * size preferred by the arch. Or if the arch requested a very small
> >> +        * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
> >> +        * which still meets the arch's requirements but means we still take
> >> +        * advantage of SW optimizations (e.g. fewer page faults).
> >> +        *
> >> +        * Finally if thp is enabled but the vma isn't eligible, take the
> >> +        * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
> >> +        * parameter. This allows a sysadmin to tune performance vs internal
> >> +        * fragmentation.
> >> +        */
> >> +
> >> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> >> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> >> +           !hugepage_flags_enabled())
> >> +               order = 0;
> >> +       else {
> >> +               order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> >> +
> >> +               if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> >> +                       order = min(order, flexthp_unhinted_max_order);
> >> +       }
> >> +
> >> +       return order;
> >> +}
> >> +
> >> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >> +{
> >> +       int i;
> >> +       gfp_t gfp;
> >> +       pte_t *pte;
> >> +       unsigned long addr;
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int prefer = anon_folio_order(vma);
> >> +       int orders[] = {
> >> +               prefer,
> >> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> >> +               0,
> >> +       };
> >> +
> >> +       *folio = NULL;
> >> +
> >> +       if (vmf_orig_pte_uffd_wp(vmf))
> >> +               goto fallback;
> >> +
> >> +       for (i = 0; orders[i]; i++) {
> >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >> +               if (addr >= vma->vm_start &&
> >> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> >> +                       break;
> >> +       }
> >> +
> >> +       if (!orders[i])
> >> +               goto fallback;
> >> +
> >> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >> +       if (!pte)
> >> +               return -EAGAIN;
> >
> > It would be a bug if this happens. So probably -EINVAL?
>
> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> handle this. The intent is that we will return from the fault without making any
> change, then we will refault and try again.

Thanks for checking that -- it's very relevant. One detail is that
that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
happen while we are holding mmap_lock for read here, and therefore,
the race that could cause pte_offset_map() on shmem/file PTEs to fail
doesn't apply here.

+Hugh Dickins for further consultation if you need it.

> >> +
> >> +       for (; orders[i]; i++) {
> >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >> +               vmf->pte = pte + pte_index(addr);
> >> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> >> +                       break;
> >> +       }
> >> +
> >> +       vmf->pte = NULL;
> >> +       pte_unmap(pte);
> >> +
> >> +       gfp = vma_thp_gfp_mask(vma);
> >> +
> >> +       for (; orders[i]; i++) {
> >> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> >> +               if (*folio) {
> >> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> >> +                       return 0;
> >> +               }
> >> +       }
> >> +
> >> +fallback:
> >> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >> +       return *folio ? 0 : -ENOMEM;
> >> +}
> >> +#else
> >> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >
> > Drop "inline" (it doesn't do anything in .c).
>
> There are 38 instances of inline in memory.c alone, so looks like a well used
> convention, even if the compiler may choose to ignore. Perhaps you can educate
> me; what's the benefit of dropping it?

I'll let Willy and Andrew educate both of us :)

+Matthew Wilcox +Andrew Morton please. Thank you.

> > The rest looks good to me.
>
> Great - just in case it wasn't obvious, I decided not to overwrite vmf->address
> with the aligned version, as you suggested

Yes, I've noticed. Not overwriting has its own merits for sure.

> for 2 reasons; 1) address is const
> in the struct, so would have had to change that. 2) there is a uffd path that
> can be taken after the vmf->address fixup would have occurred and the path
> consumes that member, so it would have had to be un-fixed-up making it more
> messy than the way I opted for.
>
> Thanks for the quick review as always!
  
David Hildenbrand July 17, 2023, 1:06 p.m. UTC | #4
On 14.07.23 19:17, Yu Zhao wrote:
> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a determined order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management, lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; the long term aim is for this to default to
>> enabled, but there are some risks around internal fragmentation that
>> need to be better understood first.
>>
>> When enabled, the folio order is determined as such: For a vma, process
>> or system that has explicitly disabled THP, we continue to allocate
>> order-0. THP is most likely disabled to avoid any possible internal
>> fragmentation so we honour that request.
>>
>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>> that have not explicitly opted-in to use transparent hugepages (e.g.
>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>> arch_wants_pte_order() is limited by the new cmdline parameter,
>> `flexthp_unhinted_max`. This allows for a performance boost without
>> requiring any explicit opt-in from the workload while allowing the
>> sysadmin to tune between performance and internal fragmentation.
>>
>> arch_wants_pte_order() can be overridden by the architecture if desired.
>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>> set of ptes map physically contiguous, naturally aligned memory, so this
>> mechanism allows the architecture to optimize as required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order; first
>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   .../admin-guide/kernel-parameters.txt         |  10 +
>>   mm/Kconfig                                    |  10 +
>>   mm/memory.c                                   | 187 ++++++++++++++++--
>>   3 files changed, 190 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index a1457995fd41..405d624e2191 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -1497,6 +1497,16 @@
>>                          See Documentation/admin-guide/sysctl/net.rst for
>>                          fb_tunnels_only_for_init_ns
>>
>> +       flexthp_unhinted_max=
>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>> +                       folio size that will be allocated for an anonymous vma
>> +                       that has neither explicitly opted in nor out of using
>> +                       transparent hugepages. The size must be a power-of-2 in
>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>> +                       performance by reducing page faults, while a smaller
>> +                       size reduces internal fragmentation. Default: max(64K,
>> +                       PAGE_SIZE). Format: size[KMG].
>> +
> 
> Let's split this parameter into a separate patch.
> 

Just a general comment after stumbling over patch #2, let's not start 
splitting patches into things that don't make any sense on their own; 
that just makes review a lot harder.

For this case here, I'd suggest first adding the general infrastructure 
and then adding tunables we want to have on top.

I agree that toggling that at runtime (for example via sysfs as raised 
by me previously) would be nicer.
  
Ryan Roberts July 17, 2023, 1:20 p.m. UTC | #5
On 17/07/2023 14:06, David Hildenbrand wrote:
> On 14.07.23 19:17, Yu Zhao wrote:
>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> [...]
>>
>> Let's split this parameter into a separate patch.
>>
> 
> Just a general comment after stumbling over patch #2, let's not start splitting
> patches into things that don't make any sense on their own; that just makes
> review a lot harder.

ACK

> 
> For this case here, I'd suggest first adding the general infrastructure and then
> adding tunables we want to have on top.

OK, so 1 patch for the main infrastructure, then a patch to disable for
MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
via a sysctl?

> 
> I agree that toggling that at runtime (for example via sysfs as raised by me
> previously) would be nicer.

OK, I clearly misunderstood, I thought you were requesting a boot parameter.
> What's the ABI compat guarantee for sysctls? I assumed that for a boot
> parameter it would be easier to remove in future if we wanted, but for sysctl,
> it's there forever?

Also, how do you feel about the naming and behavior of the parameter?

>
  
Ryan Roberts July 17, 2023, 1:36 p.m. UTC | #6
>>>> [...]
>>>> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>> +       if (!pte)
>>>> +               return -EAGAIN;
>>>
>>> It would be a bug if this happens. So probably -EINVAL?
>>
>> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
>> possible for pte_offset_map() to fail (if I understood correctly) and we have to
>> handle this. The intent is that we will return from the fault without making any
>> change, then we will refault and try again.
> 
> Thanks for checking that -- it's very relevant. One detail is that
> that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> happen while we are holding mmap_lock for read here, and therefore,
> the race that could cause pte_offset_map() on shmem/file PTEs to fail
> doesn't apply here.

But Hugh's patches have changed do_anonymous_page() to handle failure from
pte_offset_map_lock(). So I was just following that pattern. If this really
can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
prototype to just return a `struct folio *` (and if it's null that means ENOMEM).

Hugh, perhaps you can comment?

As an aside, it was my understanding from LWN that we are now using a per-VMA
lock, so presumably we don't hold mmap_lock for read here? Or perhaps that only
applies to file-backed memory?

> [...]
  
David Hildenbrand July 17, 2023, 1:56 p.m. UTC | #7
On 17.07.23 15:20, Ryan Roberts wrote:
> On 17/07/2023 14:06, David Hildenbrand wrote:
>> On 14.07.23 19:17, Yu Zhao wrote:
>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> [...]
>>>
>>> Let's split this parameter into a separate patch.
>>>
>>
>> Just a general comment after stumbling over patch #2, let's not start splitting
>> patches into things that don't make any sense on their own; that just makes
>> review a lot harder.
> 
> ACK
> 
>>
>> For this case here, I'd suggest first adding the general infrastructure and then
>> adding tunables we want to have on top.
> 
> OK, so 1 patch for the main infrastructure, then a patch to disable for
> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
> via a sysctl?

MADV_NOHUGEPAGE handling for me falls under the category "required for 
correctness to not break existing workloads" and has to be there initially.

Anything that is rather a performance tunable (e.g., a sysctl to 
optimize) can be added on top and discussed separately.

At least IMHO :)

> 
>>
>> I agree that toggling that at runtime (for example via sysfs as raised by me
>> previously) would be nicer.
> 
> OK, I clearly misunderstood, I thought you were requesting a boot parameter.

Oh, sorry about that. I wanted to actually express 
"/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that 
later at runtime as well.

> What's the ABI compat guarantee for sysctls? I assumed that for a boot
> parameter it would be easier to remove in future if we wanted, but for sysctl,
> it's there forever?

sysctls are hard/impossible to remove, yes. So we better make sure what 
we add has clear semantics.

If we ever want some real auto-tunable mode (and can actually implement 
it without harming performance; and I am skeptical), we might want to 
allow for setting such a parameter to "auto", for example.

> 
> Also, how do you feel about the naming and behavior of the parameter?

Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.

For example, I'm not so sure if we should expose the feature to user 
space as "flexthp" at all. I think we should find a clearer feature name 
to begin with.

... maybe we can initially get away with dropping that parameter and 
default to something reasonably small (i.e., 64k as you have above)?

/sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any 
thp.
  
Ryan Roberts July 17, 2023, 2:47 p.m. UTC | #8
On 17/07/2023 14:56, David Hildenbrand wrote:
> On 17.07.23 15:20, Ryan Roberts wrote:
>> On 17/07/2023 14:06, David Hildenbrand wrote:
>>> On 14.07.23 19:17, Yu Zhao wrote:
>>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> [...]
>>>>
>>>> Let's split this parameter into a separate patch.
>>>>
>>>
>>> Just a general comment after stumbling over patch #2, let's not start splitting
>>> patches into things that don't make any sense on their own; that just makes
>>> review a lot harder.
>>
>> ACK
>>
>>>
>>> For this case here, I'd suggest first adding the general infrastructure and then
>>> adding tunables we want to have on top.
>>
>> OK, so 1 patch for the main infrastructure, then a patch to disable for
>> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
>> via a sysctl?
> 
> MADV_NOHUGEPAGE handling for me falls under the category "required for
> correctness to not break existing workloads" and has to be there initially.
> 
> Anything that is rather a performance tunable (e.g., a sysctl to optimize) can
> be added on top and discussed separately.>
> At least IMHO :)
> 
>>
>>>
>>> I agree that toggling that at runtime (for example via sysfs as raised by me
>>> previously) would be nicer.
>>
>> OK, I clearly misunderstood, I thought you were requesting a boot parameter.
> 
> Oh, sorry about that. I wanted to actually express
> "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at
> runtime as well.
> 
>> What's the ABI compat guarantee for sysctls? I assumed that for a boot
>> parameter it would be easier to remove in future if we wanted, but for sysctl,
>> it's there forever?
> 
> sysctl are hard/impossible to remove, yes. So we better make sure what we add
> has clear semantics.
> 
> If we ever want some real auto-tunable mode (and can actually implement it
> without harming performance; and I am skeptical), we might want to allow for
> setting such a parameter to "auto", for example.
> 
>>
>> Also, how do you feel about the naming and behavior of the parameter?
> 
> Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.
> 
> For example, I'm not so sure if we should expose the feature to user space as
> "flexthp" at all. I think we should find a clearer feature name to begin with.
> 
> ... maybe we can initially get away with dropping that parameter and default to
> something reasonably small (i.e., 64k as you have above)?

That would certainly get my vote. But it was you who was arguing for a tunable
previously ;-). I propose we use the following as the "unhinted ceiling" for
now, then we can add a tunable if/when we find a use case that doesn't work
optimally with this value:

	static int flexthp_unhinted_max_order =
		ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;

(Using PAGE_SIZE when it's greater than 64K to cover the ppc case, which looks
like it can support 256K pages. Open coding the max because max() can't be used
outside a function).

> 
> /sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any thp.

Yes, that should work with the patch as it is today.

>
  
David Hildenbrand July 17, 2023, 2:55 p.m. UTC | #9
On 17.07.23 16:47, Ryan Roberts wrote:
> On 17/07/2023 14:56, David Hildenbrand wrote:
>> On 17.07.23 15:20, Ryan Roberts wrote:
>>> On 17/07/2023 14:06, David Hildenbrand wrote:
>>>> On 14.07.23 19:17, Yu Zhao wrote:
>>>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> [...]
>>>>>
>>>>> Let's split this parameter into a separate patch.
>>>>>
>>>>
>>>> Just a general comment after stumbling over patch #2, let's not start splitting
>>>> patches into things that don't make any sense on their own; that just makes
>>>> review a lot harder.
>>>
>>> ACK
>>>
>>>>
>>>> For this case here, I'd suggest first adding the general infrastructure and then
>>>> adding tunables we want to have on top.
>>>
>>> OK, so 1 patch for the main infrastructure, then a patch to disable for
>>> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
>>> via a sysctl?
>>
>> MADV_NOHUGEPAGE handling for me falls under the category "required for
>> correctness to not break existing workloads" and has to be there initially.
>>
>> Anything that is rather a performance tunable (e.g., a sysctl to optimize) can
>> be added on top and discussed separately.
>> At least IMHO :)
>>
>>>
>>>>
>>>> I agree that toggling that at runtime (for example via sysfs as raised by me
>>>> previously) would be nicer.
>>>
>>> OK, I clearly misunderstood, I thought you were requesting a boot parameter.
>>
>> Oh, sorry about that. I wanted to actually express
>> "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at
>> runtime as well.
>>
>>> What's the ABI compat guarantee for sysctls? I assumed that for a boot
>>> parameter it would be easier to remove in future if we wanted, but for sysctl,
>>> it's there forever?
>>
>> sysctl are hard/impossible to remove, yes. So we better make sure what we add
>> has clear semantics.
>>
>> If we ever want some real auto-tunable mode (and can actually implement it
>> without harming performance; and I am skeptical), we might want to allow for
>> setting such a parameter to "auto", for example.
>>
>>>
>>> Also, how do you feel about the naming and behavior of the parameter?
>>
>> Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.
>>
>> For example, I'm not so sure if we should expose the feature to user space as
>> "flexthp" at all. I think we should find a clearer feature name to begin with.
>>
>> ... maybe we can initially get away with dropping that parameter and default to
>> something reasonably small (i.e., 64k as you have above)?
> 
> That would certainly get my vote. But it was you who was arguing for a tunable
> previously ;-). I propose we use the following as the "unhinted ceiling" for

Yes, I still think having tunables makes sense.


But it's certainly something to add separately, especially if it makes 
your work here easier.

As long as it can be disabled, good enough for me for the initial version.
  
Yu Zhao July 17, 2023, 5:07 p.m. UTC | #10
On Mon, Jul 17, 2023 at 7:06 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.07.23 19:17, Yu Zhao wrote:
> > On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> [...]
> >> +       flexthp_unhinted_max=
> >> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
> >> +                       folio size that will be allocated for an anonymous vma
> >> +                       that has neither explicitly opted in nor out of using
> >> +                       transparent hugepages. The size must be a power-of-2 in
> >> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
> >> +                       performance by reducing page faults, while a smaller
> >> +                       size reduces internal fragmentation. Default: max(64K,
> >> +                       PAGE_SIZE). Format: size[KMG].
> >> +
> >
> > Let's split this parameter into a separate patch.
> >
>
> Just a general comment after stumbling over patch #2, let's not start
> splitting patches into things that don't make any sense on their own;
> that just makes review a lot harder.

Sorry to hear this -- but there are also non-subjective reasons we
split patches this way.

Initially we had minimal to no common ground, so we had to divide and
conquer in the smallest steps.

If you look at previous discussions: there was a disagreement on patch
2 in v2 -- that's the patch you asked to be squashed into the main
patch 3. Fortunately we've resolved that. If that disagreement had
persisted, we would have left patch 2 out rather than let it bog down
patch 3, which would work the same for all arches except arm and
could be merged separately.

> For this case here, I'd suggest first adding the general infrastructure
> and then adding tunables we want to have on top.
>
> I agree that toggling that at runtime (for example via sysfs as raised
> by me previously) would be nicer.
  
David Hildenbrand July 17, 2023, 5:16 p.m. UTC | #11
On 17.07.23 19:07, Yu Zhao wrote:
> On Mon, Jul 17, 2023 at 7:06 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 14.07.23 19:17, Yu Zhao wrote:
>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
> >>>> [...]
>>>> +       flexthp_unhinted_max=
>>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>>>> +                       folio size that will be allocated for an anonymous vma
>>>> +                       that has neither explicitly opted in nor out of using
>>>> +                       transparent hugepages. The size must be a power-of-2 in
>>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>>>> +                       performance by reducing page faults, while a smaller
>>>> +                       size reduces internal fragmentation. Default: max(64K,
>>>> +                       PAGE_SIZE). Format: size[KMG].
>>>> +
>>>
>>> Let's split this parameter into a separate patch.
>>>
>>
>> Just a general comment after stumbling over patch #2, let's not start
>> splitting patches into things that don't make any sense on their own;
>> that just makes review a lot harder.
> 
> Sorry to hear this -- but there are also non-subjective reasons we
> split patches this way.
> 
> Initially we had minimum to no common ground, so we had to divide and
> conquer by smallest steps.
> 
> if you look at previous discussions: there was a disagreement on patch
> 2 in v2 -- that's the patch you asked to be squashed into the main
> patch 3. Fortunately we've resolved that. If that disagreement had
> persisted, we would leave patch 2 out rather than let it bog down
> patch 3, which would work indifferently for all arches except arm and
> could be merged separately.

All makes sense to me, and squashing it now is most probably the logical 
step and was different before.

As I said, just a general comment when we talk about splitting stuff out.
  
Yu Zhao July 17, 2023, 7:31 p.m. UTC | #12
On Mon, Jul 17, 2023 at 7:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> >>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >>>> +{
> >>>> +       int i;
> >>>> +       gfp_t gfp;
> >>>> +       pte_t *pte;
> >>>> +       unsigned long addr;
> >>>> +       struct vm_area_struct *vma = vmf->vma;
> >>>> +       int prefer = anon_folio_order(vma);
> >>>> +       int orders[] = {
> >>>> +               prefer,
> >>>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> >>>> +               0,
> >>>> +       };
> >>>> +
> >>>> +       *folio = NULL;
> >>>> +
> >>>> +       if (vmf_orig_pte_uffd_wp(vmf))
> >>>> +               goto fallback;
> >>>> +
> >>>> +       for (i = 0; orders[i]; i++) {
> >>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >>>> +               if (addr >= vma->vm_start &&
> >>>> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> >>>> +                       break;
> >>>> +       }
> >>>> +
> >>>> +       if (!orders[i])
> >>>> +               goto fallback;
> >>>> +
> >>>> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>>> +       if (!pte)
> >>>> +               return -EAGAIN;
> >>>
> >>> It would be a bug if this happens. So probably -EINVAL?
> >>
> >> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> >> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> >> handle this. The intent is that we will return from the fault without making any
> >> change, then we will refault and try again.
> >
> > Thanks for checking that -- it's very relevant. One detail is that
> > that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> > happen while we are holding mmap_lock for read here, and therefore,
> > the race that could cause pte_offset_map() on shmem/file PTEs to fail
> > doesn't apply here.
>
> But Hugh's patches have changed do_anonymous_page() to handle failure from
> pte_offset_map_lock(). So I was just following that pattern. If this really
> can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
> prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
>
> Hugh, perhaps you can comment?
>
> As an aside, it was my understanding from LWN, that we are now using a per-VMA
> lock so presumably we don't hold mmap_lock for read here? Or perhaps that only
> applies to file-backed memory?

For anon under mmap_lock for read:
1. pte_offset_map[_lock]() fails when a parallel PF changes PMD from
none to leaf.
2. changing PMD from non-leaf to leaf is a bug. See the comments in
the "else" branch in handle_pte_fault().

So for do_anonymous_page(), there is only one case
pte_offset_map[_lock]() can fail. For the code above, this case was
ruled out by vmf_orig_pte_uffd_wp().

Checking the return value from pte_offset_map[_lock]() is a good
practice. What I'm saying is that -EAGAIN would mislead people to
think, in our case, !pte is legitimate, and hence the suggestion of
replacing it with -EINVAL.

No BUG_ON() please. As I've previously mentioned, it's against
Documentation/process/coding-style.rst.

> > +Hugh Dickins for further consultation if you need it.
> >
> >>>> +
> >>>> +       for (; orders[i]; i++) {
> >>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >>>> +               vmf->pte = pte + pte_index(addr);
> >>>> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> >>>> +                       break;
> >>>> +       }
> >>>> +
> >>>> +       vmf->pte = NULL;
> >>>> +       pte_unmap(pte);
> >>>> +
> >>>> +       gfp = vma_thp_gfp_mask(vma);
> >>>> +
> >>>> +       for (; orders[i]; i++) {
> >>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> >>>> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> >>>> +               if (*folio) {
> >>>> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> >>>> +                       return 0;
> >>>> +               }
> >>>> +       }
> >>>> +
> >>>> +fallback:
> >>>> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >>>> +       return *folio ? 0 : -ENOMEM;
> >>>> +}
> >>>> +#else
> >>>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> >>>
> >>> Drop "inline" (it doesn't do anything in .c).
> >>
> >> There are 38 instances of inline in memory.c alone, so looks like a well used
> >> convention, even if the compiler may choose to ignore. Perhaps you can educate
> >> me; what's the benefit of dropping it?
> >
> > I'll let Willy and Andrew educate both of us :)
> >
> > +Matthew Wilcox +Andrew Morton please. Thank you.
> >
> >>> The rest looks good to me.
> >>
> >> Great - just in case it wasn't obvious, I decided not to overwrite vmf->address
> >> with the aligned version, as you suggested
> >
> > Yes, I've noticed. Not overwriting has its own merits for sure.
> >
> >> for 2 reasons; 1) address is const
> >> in the struct, so would have had to change that. 2) there is a uffd path that
> >> can be taken after the vmf->address fixup would have occurred and the path
> >> consumes that member, so it would have had to be un-fixed-up making it more
> >> messy than the way I opted for.
> >>
> >> Thanks for the quick review as always!
>
  
Yu Zhao July 17, 2023, 8:35 p.m. UTC | #13
On Mon, Jul 17, 2023 at 1:31 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jul 17, 2023 at 7:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > [...]
>
> For anon under mmap_lock for read:
> 1. pte_offset_map[_lock]() fails when a parallel PF changes PMD from
> none to leaf.
> 2. changing PMD from non-leaf to leaf is a bug. See the comments in
> the "else" branch in handle_pte_fault().
>
> So for do_anonymous_page(), there is only one case
> pte_offset_map[_lock]() can fail.

===
> For the code above, this case was
> ruled out by vmf_orig_pte_uffd_wp().

Actually I was wrong about this part.
===

> Checking the return value from pte_offset_map[_lock]() is a good
> practice. What I'm saying is that -EAGAIN would mislead people to
> think, in our case, !pte is legitimate, and hence the suggestion of
> replacing it with -EINVAL.

Yes, -EAGAIN is suitable.

> [...]
  
Hugh Dickins July 17, 2023, 11:37 p.m. UTC | #14
On Mon, 17 Jul 2023, Ryan Roberts wrote:

> >>>> [...]
> >>>> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>>> +       if (!pte)
> >>>> +               return -EAGAIN;
> >>>
> >>> It would be a bug if this happens. So probably -EINVAL?
> >>
> >> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> >> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> >> handle this. The intent is that we will return from the fault without making any
> >> change, then we will refault and try again.
> > 
> > Thanks for checking that -- it's very relevant. One detail is that
> > that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> > happen while we are holding mmap_lock for read here, and therefore,
> > the race that could cause pte_offset_map() on shmem/file PTEs to fail
> > doesn't apply here.
> 
> But Hugh's patches have changed do_anonymous_page() to handle failure from
> pte_offset_map_lock(). So I was just following that pattern. If this really
> can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
> prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
> 
> Hugh, perhaps you can comment?

I agree with your use of -EAGAIN there: I find it better to allow for the
possibility, than to go to great effort persuading that it's impossible;
especially because what's possible tomorrow may differ from today.

And notice that, before my changes, there used to be a pmd_trans_unstable()
check above, implying that it is possible for it to fail (for more reasons
than corruption causing pmd_bad()) - one scenario would be that the
pte_alloc() above succeeded *because* someone else had managed to insert
a huge pmd there already (maybe we have MMF_DISABLE_THP but they did not).

But I see from later mail that Yu Zhao now agrees with your -EAGAIN too,
so we are all on the same folio.

Hugh

p.s. while giving opinions, I'm one of those against using "THP" for
large but not pmd-mappable folios; and was glad to see Matthew arguing
the same way when considering THP_SWPOUT in another thread today.
  
Ryan Roberts July 18, 2023, 10:36 a.m. UTC | #15
On 18/07/2023 00:37, Hugh Dickins wrote:
> On Mon, 17 Jul 2023, Ryan Roberts wrote:
> 
>> [...]
>> But Hugh's patches have changed do_anonymous_page() to handle failure from
>> pte_offset_map_lock(). So I was just following that pattern. If this really
>> can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
>> prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
>>
>> Hugh, perhaps you can comment?
> 
> I agree with your use of -EAGAIN there: I find it better to allow for the
> possibility, than to go to great effort persuading that it's impossible;
> especially because what's possible tomorrow may differ from today.
> 
> And notice that, before my changes, there used to be a pmd_trans_unstable()
> check above, implying that it is possible for it to fail (for more reasons
> than corruption causing pmd_bad()) - one scenario would be that the
> pte_alloc() above succeeded *because* someone else had managed to insert
> a huge pmd there already (maybe we have MMF_DISABLE_THP but they did not).
> 
> But I see from later mail that Yu Zhao now agrees with your -EAGAIN too,
> so we are all on the same folio.

Thanks for the explanation. I think we are all now agreed that the error case
needs handling and -EAGAIN is the correct code.

> 
> Hugh
> 
> p.s. while giving opinions, I'm one of those against using "THP" for
> large but not pmd-mappable folios; and was glad to see Matthew arguing
> the same way when considering THP_SWPOUT in another thread today.

Honestly, I don't have an opinion either way on this (probably because I don't
have the full context and history of THP like many of you do). So given there is
a fair bit of opposition to FLEXIBLE_THP, I'll change it back to
LARGE_ANON_FOLIO (and move it out of the THP sub-menu) in the next revision.
  
Ryan Roberts July 21, 2023, 10:57 a.m. UTC | #16
On 14/07/2023 17:17, Ryan Roberts wrote:
> [...]
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

...

> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4057,11 +4199,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   */
>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  {
> +	int i = 0;
> +	int nr_pages = 1;
>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
>  	vm_fault_t ret = 0;
>  	pte_t entry;
> +	unsigned long addr;
>  
>  	/* File mapping without ->vm_ops ? */
>  	if (vma->vm_flags & VM_SHARED)
> @@ -4101,10 +4246,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	/* Allocate our own private page. */
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +	ret = alloc_anon_folio(vmf, &folio);
> +	if (unlikely(ret == -EAGAIN))
> +		return 0;
>  	if (!folio)
>  		goto oom;
>  
> +	nr_pages = folio_nr_pages(folio);
> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>  		goto oom_free_page;
>  	folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4116,17 +4266,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	 */
>  	__folio_mark_uptodate(folio);
>  
> -	entry = mk_pte(&folio->page, vma->vm_page_prot);
> -	entry = pte_sw_mkyoung(entry);
> -	if (vma->vm_flags & VM_WRITE)
> -		entry = pte_mkwrite(pte_mkdirty(entry));
> -
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	if (!vmf->pte)
>  		goto release;
> -	if (vmf_pte_changed(vmf)) {
> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
> +		for (i = 0; i < nr_pages; i++)
> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>  		goto release;
>  	}
>  
> @@ -4141,16 +4286,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
>  
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	folio_add_new_anon_rmap(folio, vma, addr);
>  	folio_add_lru_vma(folio, vma);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
> +		entry = pte_sw_mkyoung(entry);
> +		if (vma->vm_flags & VM_WRITE)
> +			entry = pte_mkwrite(pte_mkdirty(entry));
>  setpte:
> -	if (uffd_wp)
> -		entry = pte_mkuffd_wp(entry);
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +		if (uffd_wp)
> +			entry = pte_mkuffd_wp(entry);
> +		set_pte_at(vma->vm_mm, addr + PAGE_SIZE * i, vmf->pte + i, entry);

I've just spotted a bug here: for the case where we arrive via goto setpte,
addr is not initialized. This crept in during the refactoring, and I have no
idea how it could possibly have not fallen over in a heap when executed. Sorry
about that. I'm fixing it in v4.

>  
> -	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache(vma, vmf->address, vmf->pte);
> +		/* No need to invalidate - it was non-present before */
> +		update_mmu_cache(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> +	}
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
  

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a1457995fd41..405d624e2191 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1497,6 +1497,16 @@ 
 			See Documentation/admin-guide/sysctl/net.rst for
 			fb_tunnels_only_for_init_ns
 
+	flexthp_unhinted_max=
+			[KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
+			folio size that will be allocated for an anonymous vma
+			that has neither explicitly opted in nor out of using
+			transparent hugepages. The size must be a power-of-2 in
+			the range [PAGE_SIZE, PMD_SIZE). A larger size improves
+			performance by reducing page faults, while a smaller
+			size reduces internal fragmentation. Default: max(64K,
+			PAGE_SIZE). Format: size[KMG].
+
 	floppy=		[HW]
 			See Documentation/admin-guide/blockdev/floppy.rst.
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..26c5e51ef11d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -848,6 +848,16 @@  config READ_ONLY_THP_FOR_FS
 	  support of file THPs will be developed in the next few release
 	  cycles.
 
+config FLEXIBLE_THP
+	bool "Flexible order THP"
+	depends on TRANSPARENT_HUGEPAGE
+	default n
+	help
+	  Use large (bigger than order-0) folios to back anonymous memory where
+	  possible, even for pte-mapped memory. This reduces the number of page
+	  faults, as well as other per-page overheads to improve performance for
+	  many workloads.
+
 endif # TRANSPARENT_HUGEPAGE
 
 #
diff --git a/mm/memory.c b/mm/memory.c
index 01f39e8144ef..e8bc729efb9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4050,6 +4050,148 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }
 
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+	int i;
+
+	if (nr_pages == 1)
+		return vmf_pte_changed(vmf);
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+			return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_FLEXIBLE_THP
+static int flexthp_unhinted_max_order =
+		ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;
+
+static int __init parse_flexthp_unhinted_max(char *s)
+{
+	unsigned long long size = memparse(s, NULL);
+
+	if (!is_power_of_2(size) || size < PAGE_SIZE || size > PMD_SIZE) {
+		pr_warn("flexthp: flexthp_unhinted_max=%s must be power-of-2 between PAGE_SIZE (%lu) and PMD_SIZE (%lu), ignoring\n",
+			s, PAGE_SIZE, PMD_SIZE);
+		return 1;
+	}
+
+	flexthp_unhinted_max_order = ilog2(size) - PAGE_SHIFT;
+
+	/* THP machinery requires at least 3 struct pages for meta data. */
+	if (flexthp_unhinted_max_order == 1)
+		flexthp_unhinted_max_order--;
+
+	return 1;
+}
+
+__setup("flexthp_unhinted_max=", parse_flexthp_unhinted_max);
+
+static int anon_folio_order(struct vm_area_struct *vma)
+{
+	int order;
+
+	/*
+	 * If THP is explicitly disabled for either the vma, the process or the
+	 * system, then this is very likely intended to limit internal
+	 * fragmentation; in this case, don't attempt to allocate a large
+	 * anonymous folio.
+	 *
+	 * Else, if the vma is eligible for thp, allocate a large folio of the
+	 * size preferred by the arch. Or if the arch requested a very small
+	 * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
+	 * which still meets the arch's requirements but means we still take
+	 * advantage of SW optimizations (e.g. fewer page faults).
+	 *
+	 * Finally if thp is enabled but the vma isn't eligible, take the
+	 * arch-preferred size and limit it to the flexthp_unhinted_max cmdline
+	 * parameter. This allows a sysadmin to tune performance vs internal
+	 * fragmentation.
+	 */
+
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
+	    !hugepage_flags_enabled())
+		order = 0;
+	else {
+		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
+
+		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+			order = min(order, flexthp_unhinted_max_order);
+	}
+
+	return order;
+}
+
+static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
+{
+	int i;
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct vm_area_struct *vma = vmf->vma;
+	int prefer = anon_folio_order(vma);
+	int orders[] = {
+		prefer,
+		prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+		0,
+	};
+
+	*folio = NULL;
+
+	if (vmf_orig_pte_uffd_wp(vmf))
+		goto fallback;
+
+	for (i = 0; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		if (addr >= vma->vm_start &&
+		    addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
+			break;
+	}
+
+	if (!orders[i])
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return -EAGAIN;
+
+	for (; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		vmf->pte = pte + pte_index(addr);
+		if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
+			break;
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	for (; orders[i]; i++) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
+		*folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
+		if (*folio) {
+			clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
+			return 0;
+		}
+	}
+
+fallback:
+	*folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	return *folio ? 0 : -ENOMEM;
+}
+#else
+static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
+{
+	*folio = vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
+	return *folio ? 0 : -ENOMEM;
+}
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4057,11 +4199,14 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i = 0;
+	int nr_pages = 1;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	unsigned long addr;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4101,10 +4246,15 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	ret = alloc_anon_folio(vmf, &folio);
+	if (unlikely(ret == -EAGAIN))
+		return 0;
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4116,17 +4266,12 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 */
 	__folio_mark_uptodate(folio);
 
-	entry = mk_pte(&folio->page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
-	if (vma->vm_flags & VM_WRITE)
-		entry = pte_mkwrite(pte_mkdirty(entry));
-
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (vmf_pte_range_changed(vmf, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4141,16 +4286,24 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
+
+	for (i = 0; i < nr_pages; i++) {
+		entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
+		entry = pte_sw_mkyoung(entry);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pte_mkwrite(pte_mkdirty(entry));
 setpte:
-	if (uffd_wp)
-		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, addr + PAGE_SIZE * i, vmf->pte + i, entry);
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, addr + PAGE_SIZE * i, vmf->pte + i);
+	}
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);