[v2,3/5] mm: Default implementation of arch_wants_pte_order()

Message ID 20230703135330.1865927-4-ryan.roberts@arm.com
State New
Series: variable-order, large folios for anonymous memory

Commit Message

Ryan Roberts July 3, 2023, 1:53 p.m. UTC
  arch_wants_pte_order() can be overridden by the arch to return the
preferred folio order for pte-mapped memory. This is useful as some
architectures (e.g. arm64) can coalesce TLB entries when the physical
memory is suitably contiguous.

The first user for this hint will be FLEXIBLE_THP, which aims to
allocate large folios for anonymous memory to reduce page faults and
other per-page operation costs.

Here we add the default implementation of the function, used when the
architecture does not define it, which returns the order corresponding
to 64K.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)
  

Comments

Yu Zhao July 3, 2023, 7:50 p.m. UTC | #1
On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
>
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
>
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns the order corresponding
> to 64K.

I don't really mind a non-zero default value. But people would ask why
non-zero and why 64KB. Probably you could argue this is the large size
all known archs support if they have TLB coalescing. For x86, AMD CPUs
would want to override this. I'll leave it to Fengwei to decide
whether Intel wants a different default value.

Also I don't like the vma parameter because it makes
arch_wants_pte_order() a mix of hw preference and vma policy. From my
POV, the function should be only about the former; the latter should
be decided by arch-independent MM code. However, I can live with it if
ARM MM people think this is really what you want. ATM, I'm skeptical
they do.

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

After another CPU vendor, e.g., Fengwei, and an ARM MM person, e.g.,
Will give the green light:
Reviewed-by: Yu Zhao <yuzhao@google.com>

> ---
>  include/linux/pgtable.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a661a17173fa..f7e38598f20b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -13,6 +13,7 @@
>  #include <linux/errno.h>
>  #include <asm-generic/pgtable_uffd.h>
>  #include <linux/page_table_check.h>
> +#include <linux/sizes.h>
>
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>         defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios

The warning is helpful.

> + * to be at least order-2.
> + */
> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> +{
> +       return ilog2(SZ_64K >> PAGE_SHIFT);
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>                                        unsigned long address,
  
Yin Fengwei July 4, 2023, 2:22 a.m. UTC | #2
On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
> 
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
> 
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns the order corresponding
> to 64K.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/pgtable.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a661a17173fa..f7e38598f20b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -13,6 +13,7 @@
>  #include <linux/errno.h>
>  #include <asm-generic/pgtable_uffd.h>
>  #include <linux/page_table_check.h>
> +#include <linux/sizes.h>
>  
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>  	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>  
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2.
> + */
> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> +{
> +	return ilog2(SZ_64K >> PAGE_SHIFT);
A default value that is not tied to any particular silicon might be PAGE_ALLOC_COSTLY_ORDER?

Also, the pcp lists currently cache pages of order 0...PAGE_ALLOC_COSTLY_ORDER, and order 9.
If the pcp lists can cover the allocation, they reduce the pressure on the zone lock.


Regards
Yin, Fengwei

> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				       unsigned long address,
  
Yu Zhao July 4, 2023, 3:02 a.m. UTC | #3
On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> > arch_wants_pte_order() can be overridden by the arch to return the
> > preferred folio order for pte-mapped memory. This is useful as some
> > architectures (e.g. arm64) can coalesce TLB entries when the physical
> > memory is suitably contiguous.
> >
> > The first user for this hint will be FLEXIBLE_THP, which aims to
> > allocate large folios for anonymous memory to reduce page faults and
> > other per-page operation costs.
> >
> > Here we add the default implementation of the function, used when the
> > architecture does not define it, which returns the order corresponding
> > to 64K.
> >
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >  include/linux/pgtable.h | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index a661a17173fa..f7e38598f20b 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -13,6 +13,7 @@
> >  #include <linux/errno.h>
> >  #include <asm-generic/pgtable_uffd.h>
> >  #include <linux/page_table_check.h>
> > +#include <linux/sizes.h>
> >
> >  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> > @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >  }
> >  #endif
> >
> > +#ifndef arch_wants_pte_order
> > +/*
> > + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> > + * to be at least order-2.
> > + */
> > +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> > +{
> > +     return ilog2(SZ_64K >> PAGE_SHIFT);
> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>
> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.

The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
s/w policy not a h/w preference. Besides, I don't think we can include
mmzone.h in pgtable.h.
  
Yu Zhao July 4, 2023, 3:59 a.m. UTC | #4
On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >
> >
> >
> > On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> > > arch_wants_pte_order() can be overridden by the arch to return the
> > > preferred folio order for pte-mapped memory. This is useful as some
> > > architectures (e.g. arm64) can coalesce TLB entries when the physical
> > > memory is suitably contiguous.
> > >
> > > The first user for this hint will be FLEXIBLE_THP, which aims to
> > > allocate large folios for anonymous memory to reduce page faults and
> > > other per-page operation costs.
> > >
> > > Here we add the default implementation of the function, used when the
> > > architecture does not define it, which returns the order corresponding
> > > to 64K.
> > >
> > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > > ---
> > >  include/linux/pgtable.h | 13 +++++++++++++
> > >  1 file changed, 13 insertions(+)
> > >
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index a661a17173fa..f7e38598f20b 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -13,6 +13,7 @@
> > >  #include <linux/errno.h>
> > >  #include <asm-generic/pgtable_uffd.h>
> > >  #include <linux/page_table_check.h>
> > > +#include <linux/sizes.h>
> > >
> > >  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> > >       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> > > @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> > >  }
> > >  #endif
> > >
> > > +#ifndef arch_wants_pte_order
> > > +/*
> > > + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> > > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> > > + * to be at least order-2.
> > > + */
> > > +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> > > +{
> > > +     return ilog2(SZ_64K >> PAGE_SHIFT);
> > Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
> >
> > Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> > If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>
> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> s/w policy not a h/w preference. Besides, I don't think we can include
> mmzone.h in pgtable.h.

I think we can make a compromise:
1. change the default implementation of arch_has_hw_pte_young() to return 0, and
2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
don't override arch_has_hw_pte_young(), or if its return value is too
large to fit.
This should also take care of the regression, right?
  
Yin Fengwei July 4, 2023, 5:22 a.m. UTC | #5
On 7/4/2023 11:59 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>> memory is suitably contiguous.
>>>>
>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>> allocate large folios for anonymous memory to reduce page faults and
>>>> other per-page operation costs.
>>>>
>>>> Here we add the default implementation of the function, used when the
>>>> architecture does not define it, which returns the order corresponding
>>>> to 64K.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a661a17173fa..f7e38598f20b 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -13,6 +13,7 @@
>>>>  #include <linux/errno.h>
>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>  #include <linux/page_table_check.h>
>>>> +#include <linux/sizes.h>
>>>>
>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>  }
>>>>  #endif
>>>>
>>>> +#ifndef arch_wants_pte_order
>>>> +/*
>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>> + * to be at least order-2.
>>>> + */
>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>> +{
>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>>>
>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>>
>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>> s/w policy not a h/w preference. Besides, I don't think we can include
>> mmzone.h in pgtable.h.
> 
> I think we can make a compromise:
> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> don't override arch_has_hw_pte_young(), or if its return value is too
> large to fit.
Do you mean arch_wants_pte_order()? Yes. This looks good to me. Thanks.


Regards
Yin, Fengwei

> This should also take care of the regression, right?
  
Yu Zhao July 4, 2023, 5:42 a.m. UTC | #6
On Mon, Jul 3, 2023 at 11:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/4/2023 11:59 AM, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>> memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>> allocate large folios for anonymous memory to reduce page faults and
> >>>> other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the
> >>>> architecture does not define it, which returns the order corresponding
> >>>> to 64K.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> ---
> >>>>  include/linux/pgtable.h | 13 +++++++++++++
> >>>>  1 file changed, 13 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>> index a661a17173fa..f7e38598f20b 100644
> >>>> --- a/include/linux/pgtable.h
> >>>> +++ b/include/linux/pgtable.h
> >>>> @@ -13,6 +13,7 @@
> >>>>  #include <linux/errno.h>
> >>>>  #include <asm-generic/pgtable_uffd.h>
> >>>>  #include <linux/page_table_check.h>
> >>>> +#include <linux/sizes.h>
> >>>>
> >>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>>>  }
> >>>>  #endif
> >>>>
> >>>> +#ifndef arch_wants_pte_order
> >>>> +/*
> >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> >>>> + * to be at least order-2.
> >>>> + */
> >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> >>>> +{
> >>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
> >>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
> >>>
> >>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
> >>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
> >>
> >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> >> s/w policy not a h/w preference. Besides, I don't think we can include
> >> mmzone.h in pgtable.h.
> >
> > I think we can make a compromise:
> > 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> > don't override arch_has_hw_pte_young(), or if its return value is too
> > large to fit.
> Do you mean arch_wants_pte_order()? Yes. This looks good to me. Thanks.

Sorry, copied the wrong function from above and pasted without looking...
  
Ryan Roberts July 4, 2023, 12:36 p.m. UTC | #7
On 04/07/2023 04:59, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>> memory is suitably contiguous.
>>>>
>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>> allocate large folios for anonymous memory to reduce page faults and
>>>> other per-page operation costs.
>>>>
>>>> Here we add the default implementation of the function, used when the
>>>> architecture does not define it, which returns the order corresponding
>>>> to 64K.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index a661a17173fa..f7e38598f20b 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -13,6 +13,7 @@
>>>>  #include <linux/errno.h>
>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>  #include <linux/page_table_check.h>
>>>> +#include <linux/sizes.h>
>>>>
>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>  }
>>>>  #endif
>>>>
>>>> +#ifndef arch_wants_pte_order
>>>> +/*
>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>> + * to be at least order-2.
>>>> + */
>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>> +{
>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>>>
>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>>
>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>> s/w policy not a h/w preference. Besides, I don't think we can include
>> mmzone.h in pgtable.h.
> 
> I think we can make a compromise:
> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> don't override arch_has_hw_pte_young(), or if its return value is too
> large to fit.
> This should also take care of the regression, right?

I think you are suggesting that we use 0 as a sentinel which we then translate
to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
memory.c (actually it is currently a macro defined as arch_wants_pte_order()).

So it would become (I'll talk about the vma concern separately in the thread
where you raised it):

static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order(vma);

	return order ? order : PAGE_ALLOC_COSTLY_ORDER;
}

Correct?

I don't see how it fixes the regression (assume you're talking about
Speedometer) though? On arm64 arch_wants_pte_order() will still be returning
order-4.
  
Ryan Roberts July 4, 2023, 1:20 p.m. UTC | #8
On 03/07/2023 20:50, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> arch_wants_pte_order() can be overridden by the arch to return the
>> preferred folio order for pte-mapped memory. This is useful as some
>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>> memory is suitably contiguous.
>>
>> The first user for this hint will be FLEXIBLE_THP, which aims to
>> allocate large folios for anonymous memory to reduce page faults and
>> other per-page operation costs.
>>
>> Here we add the default implementation of the function, used when the
>> architecture does not define it, which returns the order corresponding
>> to 64K.
> 
> I don't really mind a non-zero default value. But people would ask why
> non-zero and why 64KB. Probably you could argue this is the large size
> all known archs support if they have TLB coalescing. For x86, AMD CPUs
> would want to override this. I'll leave it to Fengwei to decide
> whether Intel wants a different default value.
>
> Also I don't like the vma parameter because it makes
> arch_wants_pte_order() a mix of hw preference and vma policy. From my
> POV, the function should be only about the former; the latter should
> be decided by arch-independent MM code. However, I can live with it if
> ARM MM people think this is really what you want. ATM, I'm skeptical
> they do.

Here's the big picture for what I'm trying to achieve:

 - In the common case, I'd like all programs to get a performance bump by
automatically and transparently using large anon folios - so no explicit
requirement on the process to opt-in.

 - On arm64, in the above case, I'd like the preferred folio size to be 64K;
from the (admittedly limited) testing I've done that's about where the
performance knee is and it doesn't appear to increase the memory wastage very
much. It also has the benefit that for 4K base pages this is the contpte size
(order-4) so I can take full benefit of contpte mappings transparently to the
process. And for 16K this is the HPA size (order-2).

 - On arm64 when the process has marked the VMA for THP (or when
transparent_hugepage=always) but the VMA does not meet the requirements for a
PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
and for 64K this is 2M (order-5). The 64K base page case is very important since
the PMD size for that base page is 512MB which is almost impossible to allocate
in practice.
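
[Editorial sketch, not part of the original mail] The size-to-order arithmetic in the list above can be checked with a small userspace program. This is only an illustration: ilog2_sketch() and order_for_size() are hypothetical stand-ins for the kernel's ilog2() and for the expression ilog2(SZ_64K >> PAGE_SHIFT) used in the patch, and the page shifts 12, 14 and 16 are assumed to correspond to 4K, 16K and 64K base pages.

```c
#include <assert.h>

/*
 * Hypothetical userspace stand-in for the kernel's ilog2().
 * Returns floor(log2(n)); callers here only pass powers of two.
 */
static int ilog2_sketch(unsigned long n)
{
	int order = 0;

	while (n > 1) {
		n >>= 1;
		order++;
	}
	return order;
}

/*
 * Folio order needed to cover 'size' bytes with base pages of
 * (1 << page_shift) bytes. 'size' must be at least one base page.
 */
static int order_for_size(unsigned long size, int page_shift)
{
	assert((size >> page_shift) != 0);
	return ilog2_sketch(size >> page_shift);
}
```

With these helpers, order_for_size(64UL << 10, 12) gives the order-4 contpte case for 4K pages, and order_for_size(2UL << 20, 14) and order_for_size(2UL << 20, 16) give the order-7 and order-5 cases for 16K and 64K pages.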

So one approach would be to define arch_wants_pte_order() as always returning
the contpte size (remove the vma parameter). Then max_anon_folio_order() in
memory.c could do this:


#define MAX_ANON_FOLIO_ORDER_NOTHP	ilog2(SZ_64K >> PAGE_SHIFT)

static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order();

	// Fix up default case which returns 0 because PAGE_ALLOC_COSTLY_ORDER
	// can't be used directly in pgtable.h
	order = order ? order : PAGE_ALLOC_COSTLY_ORDER;

	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		return order;
	else
		return min(order, MAX_ANON_FOLIO_ORDER_NOTHP);
}


This moves the SW policy into memory.c and gives you PAGE_ALLOC_COSTLY_ORDER (or
whatever default we decide on) as the default for arches with no override, and
also meets all my goals above.

> 
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> After another CPU vendor, e.g., Fengwei, and an ARM MM person, e.g.,
> Will give the green light:
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> 
>> ---
>>  include/linux/pgtable.h | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a661a17173fa..f7e38598f20b 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -13,6 +13,7 @@
>>  #include <linux/errno.h>
>>  #include <asm-generic/pgtable_uffd.h>
>>  #include <linux/page_table_check.h>
>> +#include <linux/sizes.h>
>>
>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>         defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>  }
>>  #endif
>>
>> +#ifndef arch_wants_pte_order
>> +/*
>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> 
> The warning is helpful.
> 
>> + * to be at least order-2.
>> + */
>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>> +{
>> +       return ilog2(SZ_64K >> PAGE_SHIFT);
>> +}
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>                                        unsigned long address,
  
Ryan Roberts July 4, 2023, 1:23 p.m. UTC | #9
On 04/07/2023 13:36, Ryan Roberts wrote:
> On 04/07/2023 04:59, Yu Zhao wrote:
>> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>>
>>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>> memory is suitably contiguous.
>>>>>
>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>> other per-page operation costs.
>>>>>
>>>>> Here we add the default implementation of the function, used when the
>>>>> architecture does not define it, which returns the order corresponding
>>>>> to 64K.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>>  1 file changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>> index a661a17173fa..f7e38598f20b 100644
>>>>> --- a/include/linux/pgtable.h
>>>>> +++ b/include/linux/pgtable.h
>>>>> @@ -13,6 +13,7 @@
>>>>>  #include <linux/errno.h>
>>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>>  #include <linux/page_table_check.h>
>>>>> +#include <linux/sizes.h>
>>>>>
>>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>  }
>>>>>  #endif
>>>>>
>>>>> +#ifndef arch_wants_pte_order
>>>>> +/*
>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>> + * to be at least order-2.
>>>>> + */
>>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>>> +{
>>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER?
>>>>
>>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9.
>>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp.
>>>
>>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>>> s/w policy not a h/w preference. Besides, I don't think we can include
>>> mmzone.h in pgtable.h.
>>
>> I think we can make a compromise:
>> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and
>> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
>> don't override arch_has_hw_pte_young(), or if its return value is too
>> large to fit.
>> This should also take care of the regression, right?
> 
> I think you are suggesting that we use 0 as a sentinel which we then translate
> to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
> memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
> 
> So it would become (I'll talk about the vma concern separately in the thread
> where you raised it):
> 
> static inline int max_anon_folio_order(struct vm_area_struct *vma)
> {
> 	int order = arch_wants_pte_order(vma);
> 
> 	return order ? order : PAGE_ALLOC_COSTLY_ORDER;
> }
> 
> Correct?

Actually, I'm not sure it's a good idea to default to a fixed order. If running
on an arch with big base pages (e.g. powerpc with 64K pages?), that will soon
add up to a big chunk of memory, which could be wasteful?

PAGE_ALLOC_COSTLY_ORDER = 3 so with 64K base pages, that's 512K. Is that a concern?
Wouldn't it be better to define this as an absolute size? Or even the min of
PAGE_ALLOC_COSTLY_ORDER and an absolute size?
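
[Editorial sketch, not part of the original mail] One way to read "the min of PAGE_ALLOC_COSTLY_ORDER and an absolute size" is sketched below as a userspace program. The names COSTLY_ORDER, folio_bytes() and capped_default_order() are hypothetical, the value 3 is the one quoted above, and the handling of order-1 follows the comment in the patch (order-1 is not allowed), which is an assumption about how such a cap would interact with that rule.

```c
/* Value of PAGE_ALLOC_COSTLY_ORDER as quoted in this thread. */
#define COSTLY_ORDER 3

/* Size in bytes of a folio of the given order for a given base page shift. */
static unsigned long folio_bytes(int order, int page_shift)
{
	return 1UL << (order + page_shift);
}

/*
 * Start from COSTLY_ORDER and lower the order until the folio fits
 * within cap_bytes. Order-1 is skipped because the patch comment says
 * large folios must be at least order-2.
 */
static int capped_default_order(unsigned long cap_bytes, int page_shift)
{
	int order = COSTLY_ORDER;

	while (order > 0 && folio_bytes(order, page_shift) > cap_bytes)
		order--;

	return order == 1 ? 0 : order;
}
```

For example, folio_bytes(3, 16) is the 512K figure mentioned above, while capped_default_order(64UL << 10, 16) falls back to order 0 on 64K base pages and capped_default_order(64UL << 10, 12) stays at order 3 (32K) on 4K base pages.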


> 
> I don't see how it fixes the regression (assume you're talking about
> Speedometer) though? On arm64 arch_wants_pte_order() will still be returning
> order-4.
>
  
Yu Zhao July 5, 2023, 1:23 a.m. UTC | #10
On Tue, Jul 4, 2023 at 6:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/07/2023 04:59, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>> memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>> allocate large folios for anonymous memory to reduce page faults and
> >>>> other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the
> >>>> architecture does not define it, which returns the order corresponding
> >>>> to 64K.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>> ---
> >>>>  include/linux/pgtable.h | 13 +++++++++++++
> >>>>  1 file changed, 13 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>> index a661a17173fa..f7e38598f20b 100644
> >>>> --- a/include/linux/pgtable.h
> >>>> +++ b/include/linux/pgtable.h
> >>>> @@ -13,6 +13,7 @@
> >>>>  #include <linux/errno.h>
> >>>>  #include <asm-generic/pgtable_uffd.h>
> >>>>  #include <linux/page_table_check.h>
> >>>> +#include <linux/sizes.h>
> >>>>
> >>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>>>  }
> >>>>  #endif
> >>>>
> >>>> +#ifndef arch_wants_pte_order
> >>>> +/*
> >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> >>>> + * to be at least order-2.
> >>>> + */
> >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> >>>> +{
> >>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
> >>> A default value that is not tied to any silicon could be PAGE_ALLOC_COSTLY_ORDER?
> >>>
> >>> Also, the current pcp lists cache pages of order 0...PAGE_ALLOC_COSTLY_ORDER, plus order 9.
> >>> If the pcp lists can cover the allocation, they reduce the pressure on the zone lock.
> >>
> >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> >> s/w policy not a h/w preference. Besides, I don't think we can include
> >> mmzone.h in pgtable.h.
> >
> > I think we can make a compromise:
> > 1. change the default implementation of arch_wants_pte_order() to return 0, and
> > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> > don't override arch_wants_pte_order(), or if its return value is too
> > large to fit.
> > This should also take care of the regression, right?
>
> I think you are suggesting that we use 0 as a sentinel which we then translate
> to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
> memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
>
> So it would become (I'll talk about the vma concern separately in the thread
> where you raised it):
>
> static inline int max_anon_folio_order(struct vm_area_struct *vma)
> {
>         int order = arch_wants_pte_order(vma);
>
>         return order ? order : PAGE_ALLOC_COSTLY_ORDER;
> }
>
> Correct?
>
> I don't see how it fixes the regression (assume you're talking about
> Speedometer) though? On arm64 arch_wants_pte_order() will still be returning
> order-4.

Here is what I was actually suggesting -- I think the problem was
because contpte is a bit too large for that benchmark and for the page
allocator too, unfortunately. The following allows one retry (32KB)
before fallback to order 0 when using contpte (64KB). There is no
retry for HPA (16KB) and other archs.

+       int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER;
+       int orders[] = {
+               preferred,
+               preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
+               0,
+       };

I'm attaching a patch which fills in the two helpers I left empty here [1].

Would the above work for Intel, Fengwei?

(AMD wouldn't need to override arch_wants_pte_order() since PTE
coalescing on Zen is also PAGE_ALLOC_COSTLY_ORDER.)

[1] https://lore.kernel.org/linux-mm/CAOUHufaK82K8Sa35T7z3=gkm4GB0cWD3aqeZF6mYx82v7cOTeA@mail.gmail.com/2-anon_folios.patch
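
The fallback sequence above can be sketched in user-space C. This only
illustrates the orders[] logic from the snippet and is not the attached
patch: anon_folio_orders() is a hypothetical helper, and
PAGE_ALLOC_COSTLY_ORDER is hard-coded to 3, the value quoted elsewhere
in this thread.

```c
#include <assert.h>

/* Assumption: mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER, which is 3. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper: fill orders[] with the allocation orders to try,
 * in sequence. arch_hint is what arch_wants_pte_order() would return;
 * 0 means the arch expressed no preference.
 */
static inline void anon_folio_orders(int arch_hint, int orders[3])
{
	int preferred;

	assert(arch_hint >= 0);
	preferred = arch_hint ? arch_hint : PAGE_ALLOC_COSTLY_ORDER;

	orders[0] = preferred;
	/* One intermediate retry, only when the preferred order is costly. */
	orders[1] = preferred > PAGE_ALLOC_COSTLY_ORDER ?
			PAGE_ALLOC_COSTLY_ORDER : 0;
	orders[2] = 0;
}
```

With a hint of 4 (contpte on 4K pages) the sequence is 4, 3, 0: 64KB,
one retry at 32KB, then a single page. With a hint of 2 (HPA) or 0 (no
override) the second slot is already 0, so there is no intermediate
retry.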
  
Yu Zhao July 5, 2023, 1:40 a.m. UTC | #11
On Tue, Jul 4, 2023 at 7:23 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/07/2023 13:36, Ryan Roberts wrote:
> > On 04/07/2023 04:59, Yu Zhao wrote:
> >> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> >>>
> >>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >>>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>>> memory is suitably contiguous.
> >>>>>
> >>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>>> allocate large folios for anonymous memory to reduce page faults and
> >>>>> other per-page operation costs.
> >>>>>
> >>>>> Here we add the default implementation of the function, used when the
> >>>>> architecture does not define it, which returns the order corresponding
> >>>>> to 64K.
> >>>>>
> >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>>> ---
> >>>>>  include/linux/pgtable.h | 13 +++++++++++++
> >>>>>  1 file changed, 13 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>>> index a661a17173fa..f7e38598f20b 100644
> >>>>> --- a/include/linux/pgtable.h
> >>>>> +++ b/include/linux/pgtable.h
> >>>>> @@ -13,6 +13,7 @@
> >>>>>  #include <linux/errno.h>
> >>>>>  #include <asm-generic/pgtable_uffd.h>
> >>>>>  #include <linux/page_table_check.h>
> >>>>> +#include <linux/sizes.h>
> >>>>>
> >>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
> >>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> >>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>>>>  }
> >>>>>  #endif
> >>>>>
> >>>>> +#ifndef arch_wants_pte_order
> >>>>> +/*
> >>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> >>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> >>>>> + * to be at least order-2.
> >>>>> + */
> >>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> >>>>> +{
> >>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
> >>>> A default value that is not tied to any silicon could be PAGE_ALLOC_COSTLY_ORDER?
> >>>>
> >>>> Also, the current pcp lists cache pages of order 0...PAGE_ALLOC_COSTLY_ORDER, plus order 9.
> >>>> If the pcp lists can cover the allocation, they reduce the pressure on the zone lock.
> >>>
> >>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
> >>> s/w policy not a h/w preference. Besides, I don't think we can include
> >>> mmzone.h in pgtable.h.
> >>
> >> I think we can make a compromise:
> >> 1. change the default implementation of arch_wants_pte_order() to return 0, and
> >> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
> >> don't override arch_wants_pte_order(), or if its return value is too
> >> large to fit.
> >> This should also take care of the regression, right?
> >
> > I think you are suggesting that we use 0 as a sentinel which we then translate
> > to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
> > memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
> >
> > So it would become (I'll talk about the vma concern separately in the thread
> > where you raised it):
> >
> > static inline int max_anon_folio_order(struct vm_area_struct *vma)
> > {
> >       int order = arch_wants_pte_order(vma);
> >
> >       return order ? order : PAGE_ALLOC_COSTLY_ORDER;
> > }
> >
> > Correct?
>
> Actually, I'm not sure it's a good idea to default to a fixed order. If running
> on an arch with big base pages (e.g. powerpc with 64K pages?), that will soon
> add up to a big chunk of memory, which could be wasteful?
>
> PAGE_ALLOC_COSTLY_ORDER = 3, so with a 64K base page, that's 512K. Is that a concern?
> Wouldn't it be better to define this as an absolute size? Or even the min of
> PAGE_ALLOC_COSTLY_ORDER and an absolute size?

From my POV, not at all. POWER could use smaller page sizes if they
wanted to -- I don't think they do: at least the distros I use on my
POWER9 all have THP=always by default (2MB).
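
The arithmetic behind the question can be checked with a small sketch.
It compares a fixed order against the "min of PAGE_ALLOC_COSTLY_ORDER
and an absolute size" idea. capped_default_order() is a hypothetical
helper, and any particular cap passed to it (64K in the note below,
borrowed from the original patch's default) is an assumption, not
something agreed in the thread.

```c
#include <assert.h>

/* Assumption: mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER, which is 3. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* floor(log2(n)), a stand-in for the kernel's ilog2(); n must be non-zero. */
static inline int ilog2_ul(unsigned long n)
{
	int r = -1;

	assert(n != 0);
	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

/* Size in bytes of a folio of the given order. */
static inline unsigned long folio_bytes(int page_shift, int order)
{
	return 1UL << (page_shift + order);
}

/*
 * Hypothetical policy: the smaller of PAGE_ALLOC_COSTLY_ORDER and the
 * order that fits within max_bytes for this base page size.
 */
static inline int capped_default_order(int page_shift, unsigned long max_bytes)
{
	unsigned long pages = max_bytes >> page_shift;
	int size_order = pages ? ilog2_ul(pages) : 0;

	return size_order < PAGE_ALLOC_COSTLY_ORDER ?
			size_order : PAGE_ALLOC_COSTLY_ORDER;
}
```

folio_bytes(16, 3) gives 512K for a 64K base page, the figure quoted
above. With a 64K cap, the capped order would be 3 on 4K pages, 2 on
16K pages and 0 on 64K pages.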
  
Yu Zhao July 5, 2023, 2:07 a.m. UTC | #12
On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/07/2023 20:50, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> arch_wants_pte_order() can be overridden by the arch to return the
> >> preferred folio order for pte-mapped memory. This is useful as some
> >> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >> memory is suitably contiguous.
> >>
> >> The first user for this hint will be FLEXIBLE_THP, which aims to
> >> allocate large folios for anonymous memory to reduce page faults and
> >> other per-page operation costs.
> >>
> >> Here we add the default implementation of the function, used when the
> >> architecture does not define it, which returns the order corresponding
> >> to 64K.
> >
> > I don't really mind a non-zero default value. But people would ask why
> > non-zero and why 64KB. Probably you could argue this is the large size
> > all known archs support if they have TLB coalescing. For x86, AMD CPUs
> > would want to override this. I'll leave it to Fengwei to decide
> > > whether Intel wants a different default value.
> > >
> > Also I don't like the vma parameter because it makes
> > arch_wants_pte_order() a mix of hw preference and vma policy. From my
> > POV, the function should be only about the former; the latter should
> > be decided by arch-independent MM code. However, I can live with it if
> > ARM MM people think this is really what you want. ATM, I'm skeptical
> > they do.
>
> Here's the big picture for what I'm trying to achieve:
>
>  - In the common case, I'd like all programs to get a performance bump by
> automatically and transparently using large anon folios - so no explicit
> requirement on the process to opt-in.

We all agree on this :)

>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
> from the (admittedly limited) testing I've done that's about where the
> performance knee is and it doesn't appear to increase the memory wastage very
> much. It also has the benefits that for 4K base pages this is the contpte size
> (order-4) so I can take full benefit of contpte mappings transparently to the
> process. And for 16K this is the HPA size (order-2).

My highest priority is to get 16KB proven first because it would
benefit both client and server devices. So it may be different from
yours but I don't see any conflict.

>  - On arm64 when the process has marked the VMA for THP (or when
> transparent_hugepage=always) but the VMA does not meet the requirements for a
> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
> and for 64K this is 2M (order-5). The 64K base page case is very important since
> the PMD size for that base page is 512MB which is almost impossible to allocate
> in practice.

Which case (server or client) are you focusing on here? For our client
devices, I can confidently say that 64KB has to be after 16KB, if it
happens at all. For servers in general, I don't know of any major
memory-intensive workloads that are not THP-aware, i.e., I don't think
"VMA does not meet the requirements" is a concern.
  
Yin Fengwei July 5, 2023, 2:18 a.m. UTC | #13
On 7/5/23 09:23, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 6:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 04/07/2023 04:59, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>>> memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>>> other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the
>>>>>> architecture does not define it, which returns the order corresponding
>>>>>> to 64K.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>  include/linux/pgtable.h | 13 +++++++++++++
>>>>>>  1 file changed, 13 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index a661a17173fa..f7e38598f20b 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -13,6 +13,7 @@
>>>>>>  #include <linux/errno.h>
>>>>>>  #include <asm-generic/pgtable_uffd.h>
>>>>>>  #include <linux/page_table_check.h>
>>>>>> +#include <linux/sizes.h>
>>>>>>
>>>>>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>>>>>       defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>>>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>>  }
>>>>>>  #endif
>>>>>>
>>>>>> +#ifndef arch_wants_pte_order
>>>>>> +/*
>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>>> + * to be at least order-2.
>>>>>> + */
>>>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +     return ilog2(SZ_64K >> PAGE_SHIFT);
>>>>> A default value that is not tied to any silicon could be PAGE_ALLOC_COSTLY_ORDER?
>>>>>
>>>>> Also, the current pcp lists cache pages of order 0...PAGE_ALLOC_COSTLY_ORDER, plus order 9.
>>>>> If the pcp lists can cover the allocation, they reduce the pressure on the zone lock.
>>>>
>>>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a
>>>> s/w policy not a h/w preference. Besides, I don't think we can include
>>>> mmzone.h in pgtable.h.
>>>
>>> I think we can make a compromise:
>>> 1. change the default implementation of arch_wants_pte_order() to return 0, and
>>> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that
>>> don't override arch_wants_pte_order(), or if its return value is too
>>> large to fit.
>>> This should also take care of the regression, right?
>>
>> I think you are suggesting that we use 0 as a sentinel which we then translate
>> to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in
>> memory.c (actually it is currently a macro defined as arch_wants_pte_order()).
>>
>> So it would become (I'll talk about the vma concern separately in the thread
>> where you raised it):
>>
>> static inline int max_anon_folio_order(struct vm_area_struct *vma)
>> {
>>         int order = arch_wants_pte_order(vma);
>>
>>         return order ? order : PAGE_ALLOC_COSTLY_ORDER;
>> }
>>
>> Correct?
>>
>> I don't see how it fixes the regression (assume you're talking about
>> Speedometer) though? On arm64 arch_wants_pte_order() will still be returning
>> order-4.
> 
> Here is what I was actually suggesting -- I think the problem was
> because contpte is a bit too large for that benchmark and for the page
> allocator too, unfortunately. The following allows one retry (32KB)
> before fallback to order 0 when using contpte (64KB). There is no
> retry for HPA (16KB) and other archs.
> 
> +       int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER;
> +       int orders[] = {
> +               preferred,
> +               preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> +               0,
> +       };
> 
> I'm attaching a patch which fills in the two helpers I left empty here [1].
> 
> Would the above work for Intel, Fengwei?
PAGE_ALLOC_COSTLY_ORDER is what Intel prefers because it fits the most common
Intel systems. So yes, this works for Intel.


Regards
Yin, Fengwei

> 
> (AMD wouldn't need to override arch_wants_pte_order() since PTE
> coalescing on Zen is also PAGE_ALLOC_COSTLY_ORDER.)
> 
> [1] https://lore.kernel.org/linux-mm/CAOUHufaK82K8Sa35T7z3=gkm4GB0cWD3aqeZF6mYx82v7cOTeA@mail.gmail.com/2-anon_folios.patch
  
Ryan Roberts July 5, 2023, 9:11 a.m. UTC | #14
On 05/07/2023 03:07, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/07/2023 20:50, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>> memory is suitably contiguous.
>>>>
>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>> allocate large folios for anonymous memory to reduce page faults and
>>>> other per-page operation costs.
>>>>
>>>> Here we add the default implementation of the function, used when the
>>>> architecture does not define it, which returns the order corresponding
>>>> to 64K.
>>>
>>> I don't really mind a non-zero default value. But people would ask why
>>> non-zero and why 64KB. Probably you could argue this is the large size
>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
>>> would want to override this. I'll leave it to Fengwei to decide
>>> whether Intel wants a different default value.
>>>
>>> Also I don't like the vma parameter because it makes
>>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
>>> POV, the function should be only about the former; the latter should
>>> be decided by arch-independent MM code. However, I can live with it if
>>> ARM MM people think this is really what you want. ATM, I'm skeptical
>>> they do.
>>
>> Here's the big picture for what I'm trying to achieve:
>>
>>  - In the common case, I'd like all programs to get a performance bump by
>> automatically and transparently using large anon folios - so no explicit
>> requirement on the process to opt-in.
> 
> We all agree on this :)
> 
>>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>> from the (admittedly limited) testing I've done that's about where the
>> performance knee is and it doesn't appear to increase the memory wastage very
>> much. It also has the benefits that for 4K base pages this is the contpte size
>> (order-4) so I can take full benefit of contpte mappings transparently to the
>> process. And for 16K this is the HPA size (order-2).
> 
> My highest priority is to get 16KB proven first because it would
> benefit both client and server devices. So it may be different from
> yours but I don't see any conflict.

Do you mean 16K folios on a 4K base page system, or large folios on a 16K base
page system? I thought your focus was on speeding up 4K base page client systems
but this statement has got me wondering?

> 
>>  - On arm64 when the process has marked the VMA for THP (or when
>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>> the PMD size for that base page is 512MB which is almost impossible to allocate
>> in practice.
> 
> Which case (server or client) are you focusing on here? For our client
> devices, I can confidently say that 64KB has to be after 16KB, if it
> happens at all. For servers in general, I don't know of any major
> memory-intensive workloads that are not THP-aware, i.e., I don't think
> "VMA does not meet the requirements" is a concern.

For the 64K base page case, the focus is server. The problem reported by our
partner is that the 512M huge page size is too big to reliably allocate and so
the faults always fall back to 64K base pages in practice. I would also speculate
(happy to be proved wrong) that there are many THP-aware workloads that assume
the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
huge page when running on 64K base page system.

But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
page system is a very real requirement. Our intent is that this will be the
mechanism we use to enable it.
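
The sizes quoted in this exchange follow from simple shifts, sketched
below. This is a rough illustration, not kernel code: pmd_shift_for()
assumes 8-byte page table entries with one full page of entries per
table level, and both helper names are hypothetical. The contpte sizes
themselves (64K, 2M, 2M) are taken from the message above rather than
derived.

```c
#include <assert.h>

/* floor(log2(n)), a stand-in for the kernel's ilog2(); n must be non-zero. */
static inline int ilog2_ul(unsigned long n)
{
	int r = -1;

	assert(n != 0);
	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

/*
 * Assumption: 8-byte entries, one page of entries per table, so a table
 * holds 2^(page_shift - 3) entries and a PMD entry maps
 * 2^(page_shift + page_shift - 3) bytes.
 */
static inline int pmd_shift_for(int page_shift)
{
	return page_shift + (page_shift - 3);
}

/* Folio order needed to cover `bytes` with pages of 2^page_shift bytes. */
static inline int order_for_size(unsigned long bytes, int page_shift)
{
	return ilog2_ul(bytes >> page_shift);
}
```

Under these assumptions a PMD maps 2M, 32M and 512M for 4K, 16K and 64K
base pages respectively. On 64K pages a PMD-sized folio is therefore
order-13, while the 2M contpte size is order-5.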
  
Yu Zhao July 5, 2023, 5:24 p.m. UTC | #15
On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/07/2023 03:07, Yu Zhao wrote:
> > On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 03/07/2023 20:50, Yu Zhao wrote:
> >>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> arch_wants_pte_order() can be overridden by the arch to return the
> >>>> preferred folio order for pte-mapped memory. This is useful as some
> >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >>>> memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to
> >>>> allocate large folios for anonymous memory to reduce page faults and
> >>>> other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the
> >>>> architecture does not define it, which returns the order corresponding
> >>>> to 64K.
> >>>
> >>> I don't really mind a non-zero default value. But people would ask why
> >>> non-zero and why 64KB. Probably you could argue this is the large size
> >>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
> >>> would want to override this. I'll leave it to Fengwei to decide
> >>> whether Intel wants a different default value.
> >>>
> >>> Also I don't like the vma parameter because it makes
> >>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
> >>> POV, the function should be only about the former; the latter should
> >>> be decided by arch-independent MM code. However, I can live with it if
> >>> ARM MM people think this is really what you want. ATM, I'm skeptical
> >>> they do.
> >>
> >> Here's the big picture for what I'm trying to achieve:
> >>
> >>  - In the common case, I'd like all programs to get a performance bump by
> >> automatically and transparently using large anon folios - so no explicit
> >> requirement on the process to opt-in.
> >
> > We all agree on this :)
> >
> >>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
> >> from the (admittedly limited) testing I've done that's about where the
> >> performance knee is and it doesn't appear to increase the memory wastage very
> >> much. It also has the benefits that for 4K base pages this is the contpte size
> >> (order-4) so I can take full benefit of contpte mappings transparently to the
> >> process. And for 16K this is the HPA size (order-2).
> >
> > My highest priority is to get 16KB proven first because it would
> > benefit both client and server devices. So it may be different from
> > yours but I don't see any conflict.
>
> Do you mean 16K folios on a 4K base page system

Yes.

> or large folios on a 16K base
> page system? I thought your focus was on speeding up 4K base page client systems
> but this statement has got me wondering?

Sorry, I should have said 4x4KB.

> >>  - On arm64 when the process has marked the VMA for THP (or when
> >> transparent_hugepage=always) but the VMA does not meet the requirements for a
> >> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
> >> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
> >> and for 64K this is 2M (order-5). The 64K base page case is very important since
> >> the PMD size for that base page is 512MB which is almost impossible to allocate
> >> in practice.
> >
> > Which case (server or client) are you focusing on here? For our client
> > devices, I can confidently say that 64KB has to be after 16KB, if it
> > happens at all. For servers in general, I don't know of any major
> > memory-intensive workloads that are not THP-aware, i.e., I don't think
> > "VMA does not meet the requirements" is a concern.
>
> For the 64K base page case, the focus is server. The problem reported by our
> partner is that the 512M huge page size is too big to reliably allocate and so
> the faults always fall back to 64K base pages in practice. I would also speculate
> (happy to be proved wrong) that there are many THP-aware workloads that assume
> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
> huge page when running on 64K base page system.

Interesting. When you have something ready to share, I might be able
to try it on our ARM servers as well.

> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
> page system is a very real requirement. Our intent is that this will be the
> mechanism we use to enable it.

Yes, contpte makes more sense for what you described. It'd fit in a
lot better in the hugetlb case, but I guess your partner uses anon.
  
Ryan Roberts July 5, 2023, 6:01 p.m. UTC | #16
On 05/07/2023 18:24, Yu Zhao wrote:
> On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 05/07/2023 03:07, Yu Zhao wrote:
>>> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 03/07/2023 20:50, Yu Zhao wrote:
>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>>> memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>>> other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the
>>>>>> architecture does not define it, which returns the order corresponding
>>>>>> to 64K.
>>>>>
>>>>> I don't really mind a non-zero default value. But people would ask why
>>>>> non-zero and why 64KB. Probably you could argue this is the large size
>>>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
>>>>> would want to override this. I'll leave it to Fengwei to decide
>>>>> whether Intel wants a different default value.
>>>>>
>>>>> Also I don't like the vma parameter because it makes
>>>>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
>>>>> POV, the function should be only about the former; the latter should
>>>>> be decided by arch-independent MM code. However, I can live with it if
>>>>> ARM MM people think this is really what you want. ATM, I'm skeptical
>>>>> they do.
>>>>
>>>> Here's the big picture for what I'm trying to achieve:
>>>>
>>>>  - In the common case, I'd like all programs to get a performance bump by
>>>> automatically and transparently using large anon folios - so no explicit
>>>> requirement on the process to opt-in.
>>>
>>> We all agree on this :)
>>>
>>>>  - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>>>> from the (admittedly limited) testing I've done that's about where the
>>>> performance knee is and it doesn't appear to increase the memory wastage very
>>>> much. It also has the benefits that for 4K base pages this is the contpte size
>>>> (order-4) so I can take full benefit of contpte mappings transparently to the
>>>> process. And for 16K this is the HPA size (order-2).
>>>
>>> My highest priority is to get 16KB proven first because it would
>>> benefit both client and server devices. So it may be different from
>>> yours but I don't see any conflict.
>>
>> Do you mean 16K folios on a 4K base page system
> 
> Yes.
> 
>> or large folios on a 16K base
>> page system? I thought your focus was on speeding up 4K base page client systems
>> but this statement has got me wondering?
> 
> Sorry, I should have said 4x4KB.

OK. Be aware that a number of Arm CPUs that support HPA don't have it enabled by
default (or at least don't have it enabled in the mode you would want in order
to see the best performance with large anon folios). You would need EL3 access to
reconfigure it.

> 
>>>>  - On arm64 when the process has marked the VMA for THP (or when
>>>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>>>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>>>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>>>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>>>> the PMD size for that base page is 512MB which is almost impossible to allocate
>>>> in practice.
>>>
>>> Which case (server or client) are you focusing on here? For our client
>>> devices, I can confidently say that 64KB has to be after 16KB, if it
>>> happens at all. For servers in general, I don't know of any major
>>> memory-intensive workloads that are not THP-aware, i.e., I don't think
>>> "VMA does not meet the requirements" is a concern.
>>
>> For the 64K base page case, the focus is server. The problem reported by our
>> partner is that the 512M huge page size is too big to reliably allocate and so
>> the faults always fall back to 64K base pages in practice. I would also speculate
>> (happy to be proved wrong) that there are many THP-aware workloads that assume
>> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M
>> huge page when running on 64K base page system.
> 
> Interesting. When you have something ready to share, I might be able
> to try it on our ARM servers as well.

That would be really helpful. I'm currently updating my branch that collates
everything to reflect the review comments in this patch set and the contpte
patch set. I'll share it in a couple of weeks.

> 
>> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base
>> page system is a very real requirement. Our intent is that this will be the
>> mechanism we use to enable it.
> 
> Yes, contpte makes more sense for what you described. It'd fit in a
> lot better in the hugetlb case, but I guess your partner uses anon.

arm64 already supports contpte for hugetlb, but they need it to work with anon
memory using THP.
  
Matthew Wilcox July 6, 2023, 7:33 p.m. UTC | #17
On Tue, Jul 04, 2023 at 08:07:19PM -0600, Yu Zhao wrote:
> >  - On arm64 when the process has marked the VMA for THP (or when
> > transparent_hugepage=always) but the VMA does not meet the requirements for a
> > PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
> > contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
> > and for 64K this is 2M (order-5). The 64K base page case is very important since
> > the PMD size for that base page is 512MB which is almost impossible to allocate
> > in practice.
> 
> Which case (server or client) are you focusing on here? For our client
> devices, I can confidently say that 64KB has to be after 16KB, if it
> happens at all. For servers in general, I don't know of any major
> memory-intensive workloads that are not THP-aware, i.e., I don't think
> "VMA does not meet the requirements" is a concern.

It sounds like you've done some measurements, and I'd like to understand
those a bit better.  There are a number of factors involved:

 - A larger page size shrinks the length of the LRU list, so systems
   which see heavy LRU lock contention benefit more
 - A larger page size has more internal fragmentation, so we run out of
   memory and have to do reclaim more often (and maybe workloads which
   used to fit in DRAM now do not)
(probably others; I'm not at 100% right now)

I think concerns about "allocating lots of order-2 folios makes it harder
to allocate order-4 folios" are _probably_ not warranted (without data
to prove otherwise).  All anonymous memory is movable, so our compaction
code should be able to create larger order folios.
  
Ryan Roberts July 7, 2023, 10 a.m. UTC | #18
On 06/07/2023 20:33, Matthew Wilcox wrote:
> On Tue, Jul 04, 2023 at 08:07:19PM -0600, Yu Zhao wrote:
>>>  - On arm64 when the process has marked the VMA for THP (or when
>>> transparent_hugepage=always) but the VMA does not meet the requirements for a
>>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using
>>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7)
>>> and for 64K this is 2M (order-5). The 64K base page case is very important since
>>> the PMD size for that base page is 512MB which is almost impossible to allocate
>>> in practice.
>>
>> Which case (server or client) are you focusing on here? For our client
>> devices, I can confidently say that 64KB has to be after 16KB, if it
>> happens at all. For servers in general, I don't know of any major
>> memory-intensive workloads that are not THP-aware, i.e., I don't think
>> "VMA does not meet the requirements" is a concern.
> 
> It sounds like you've done some measurements, and I'd like to understand
> those a bit better.  There are a number of factors involved:

I'm not sure if that's a question to me or Yu? I haven't personally done any
measurements for the 64K base page case. But Arm has a partner that is pushing
for this. I'm hoping to see some test results from them posted publicly in the
coming weeks. See [1] for more explanation on the rationale.

[1]
https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m8a7c4b71f94224ec3fe6d0a407f48d74c789ba4f

> 
>  - A larger page size shrinks the length of the LRU list, so systems
>    which see heavy LRU lock contention benefit more
>  - A larger page size has more internal fragmentation, so we run out of
>    memory and have to do reclaim more often (and maybe workloads which
>    used to fit in DRAM now do not)
> (probably others; I'm not at 100% right now)
> 
> I think concerns about "allocating lots of order-2 folios makes it harder
> to allocate order-4 folios" are _probably_ not warranted (without data
> to prove otherwise).  All anonymous memory is movable, so our compaction
> code should be able to create larger order folios.
>
  

Patch

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a661a17173fa..f7e38598f20b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -13,6 +13,7 @@ 
 #include <linux/errno.h>
 #include <asm-generic/pgtable_uffd.h>
 #include <linux/page_table_check.h>
+#include <linux/sizes.h>
 
 #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
 	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
@@ -336,6 +337,18 @@  static inline bool arch_has_hw_pte_young(void)
 }
 #endif
 
+#ifndef arch_wants_pte_order
+/*
+ * Returns preferred folio order for pte-mapped memory. Must be in range [0,
+ * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
+ * to be at least order-2.
+ */
+static inline int arch_wants_pte_order(struct vm_area_struct *vma)
+{
+	return ilog2(SZ_64K >> PAGE_SHIFT);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
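
For reference, the default order returned by the patch and the range
constraint stated in its comment can be checked outside the kernel with
a short sketch. ilog2_ul() is a stand-in for the kernel's ilog2(), and
default_pte_order() and pte_order_is_valid() are hypothetical names that
take PAGE_SHIFT and PMD_SHIFT as parameters instead of reading the
kernel macros.

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_64K 0x10000UL

/* floor(log2(n)), a stand-in for the kernel's ilog2(); n must be non-zero. */
static inline int ilog2_ul(unsigned long n)
{
	int r = -1;

	assert(n != 0);
	while (n) {
		n >>= 1;
		r++;
	}
	return r;
}

/* Same expression as the patch, with PAGE_SHIFT passed in as a parameter. */
static inline int default_pte_order(int page_shift)
{
	return ilog2_ul(SZ_64K >> page_shift);
}

/* The comment's constraint: in [0, PMD_SHIFT - PAGE_SHIFT) and not order-1. */
static inline bool pte_order_is_valid(int order, int page_shift, int pmd_shift)
{
	return order >= 0 && order < pmd_shift - page_shift && order != 1;
}
```

For 4K, 16K and 64K base pages the default evaluates to order-4, order-2
and order-0 respectively, so on 64K pages the default means no large
folio at all.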