[v6,5/9] mm: thp: Extend THP to allocate anonymous large folios

Message ID 20230929114421.3761121-6-ryan.roberts@arm.com
State New
Series variable-order, large folios for anonymous memory

Commit Message

Ryan Roberts Sept. 29, 2023, 11:44 a.m. UTC
Introduce the logic to allow THP to be configured (through the new
anon_orders interface we just added) to allocate large folios, smaller
than PMD-size (for example order-2, order-3, order-4, etc.), to back
anonymous memory.

These THPs continue to be PTE-mapped, but in many cases can still
provide similar benefits to traditional PMD-sized THP: Page faults are
significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
the configured order), but latency spikes are much less prominent
because the size of each page isn't as huge as the PMD-sized variant and
there is less memory to clear in each page fault. The number of per-page
operations (e.g. ref counting, rmap management, lru list management) are
also significantly reduced since those ops now become per-folio.
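
As a rough illustration of the arithmetic (a minimal userspace sketch,
assuming 4 KiB base pages; not part of the patch):

#include <stdio.h>

#define PAGE_SHIFT	12	/* assumes 4 KiB base pages */

int main(void)
{
	int order;

	for (order = 2; order <= 4; order++) {
		unsigned int nr_pages = 1u << order;

		/*
		 * One fault populates all nr_pages PTEs of the folio, so the
		 * number of faults - and of per-page ref/rmap/lru operations,
		 * which become per-folio - drops by a factor of nr_pages.
		 */
		printf("order-%d: %u KiB folio, faults/ops reduced %ux\n",
		       order, nr_pages << (PAGE_SHIFT - 10), nr_pages);
	}
	return 0;
}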

Some architectures also employ TLB compression mechanisms to squeeze
more entries in when a set of PTEs are virtually and physically
contiguous and appropriately aligned. In this case, TLB misses will
occur less often.

The new behaviour is disabled by default because anon_orders defaults
to only enabling PMD-order, but it can be enabled at runtime by
writing to anon_orders (see documentation in previous commit). The long
term aim is to default anon_orders to include suitable lower orders, but
there are some risks around internal fragmentation that need to be
better understood first.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |   9 +-
 include/linux/huge_mm.h                    |   6 +-
 mm/memory.c                                | 108 +++++++++++++++++++--
 3 files changed, 111 insertions(+), 12 deletions(-)
  

Comments

Daniel Gomez Oct. 5, 2023, 12:05 p.m. UTC | #1
On Fri, Sep 29, 2023 at 12:44:16PM +0100, Ryan Roberts wrote:

Hi Ryan,
> Introduce the logic to allow THP to be configured (through the new
> anon_orders interface we just added) to allocate large folios to back
> anonymous memory, which are smaller than PMD-size (for example order-2,
> order-3, order-4, etc).
>
> These THPs continue to be PTE-mapped, but in many cases can still
> provide similar benefits to traditional PMD-sized THP: Page faults are
> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> the configured order), but latency spikes are much less prominent
> because the size of each page isn't as huge as the PMD-sized variant and
> there is less memory to clear in each page fault. The number of per-page
> operations (e.g. ref counting, rmap management, lru list management) are
> also significantly reduced since those ops now become per-folio.
>
> Some architectures also employ TLB compression mechanisms to squeeze
> more entries in when a set of PTEs are virtually and physically
> contiguous and approporiately aligned. In this case, TLB misses will
> occur less often.
>
> The new behaviour is disabled by default because the anon_orders
> defaults to only enabling PMD-order, but can be enabled at runtime by
> writing to anon_orders (see documentation in previous commit). The long
> term aim is to default anon_orders to include suitable lower orders, but
> there are some risks around internal fragmentation that need to be
> better understood first.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst |   9 +-
>  include/linux/huge_mm.h                    |   6 +-
>  mm/memory.c                                | 108 +++++++++++++++++++--
>  3 files changed, 111 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 9f954e73a4ca..732c3b2f4ba8 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
>  ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
>  fields for each mapping. Note that in both cases, AnonHugePages refers
>  only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
> -using PTEs.
> +using PTEs. This includes all THPs whose order is smaller than
> +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
> +for other reasons.
>
>  The number of file transparent huge pages mapped to userspace is available
>  by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
> @@ -367,6 +369,11 @@ frequently will incur overhead.
>  There are a number of counters in ``/proc/vmstat`` that may be used to
>  monitor how successfully the system is providing huge pages for use.
>
> +.. note::
> +   Currently the below counters only record events relating to
> +   PMD-order THPs. Events relating to smaller order THPs are not
> +   included.
> +
>  thp_fault_alloc
>  	is incremented every time a huge page is successfully
>  	allocated to handle a page fault.
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2e7c338229a6..c4860476a1f5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
>  /*
> - * Mask of all large folio orders supported for anonymous THP.
> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> + * (which is a limitation of the THP implementation).
>   */
> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>
>  /*
>   * Mask of all large folio orders supported for file THP.
> diff --git a/mm/memory.c b/mm/memory.c
> index b5b82fc8e164..92ed9c782dc9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	return ret;
>  }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +	int i;
> +
> +	if (nr_pages == 1)
> +		return vmf_pte_changed(vmf);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +{
> +	gfp_t gfp;
> +	pte_t *pte;
> +	unsigned long addr;
> +	struct folio *folio;
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned int orders;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (userfaultfd_armed(vma))
> +		goto fallback;
> +
> +	/*
> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +	 * for this vma. Then filter out the orders that can't be allocated over
> +	 * the faulting address and still be fully contained in the vma.
> +	 */
> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
> +				    BIT(PMD_ORDER) - 1);
> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +	if (!pte)
> +		return ERR_PTR(-EAGAIN);
> +
> +	order = first_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		vmf->pte = pte + pte_index(addr);
> +		if (!vmf_pte_range_changed(vmf, 1 << order))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	vmf->pte = NULL;
> +	pte_unmap(pte);
> +
> +	gfp = vma_thp_gfp_mask(vma);
> +
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);

I was checking your series and noticed the hugepage flag. I think
you've changed it from v1 -> v2 from false to true when orders >= 2,
but I'm not sure about the reasoning. Is this because of your statement
in the cover letter [1]?

[1] cover letter snippet:

"to implement variable order, large folios for anonymous memory.
(previously called ..., but now exposed as an extension to THP;
"small-order THP")"

Thanks,
Daniel

> +		if (folio) {
> +			clear_huge_page(&folio->page, addr, 1 << order);
> +			return folio;
> +		}
> +		order = next_order(&orders, order);
> +	}
> +
> +fallback:
> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +}
> +#else
> +#define alloc_anon_folio(vmf) \
> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> +#endif
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   */
>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  {
> +	int i;
> +	int nr_pages = 1;
> +	unsigned long addr = vmf->address;
>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct folio *folio;
> @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	/* Allocate our own private page. */
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +	folio = alloc_anon_folio(vmf);
> +	if (IS_ERR(folio))
> +		return 0;
>  	if (!folio)
>  		goto oom;
>
> +	nr_pages = folio_nr_pages(folio);
> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>  		goto oom_free_page;
>  	folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	if (!vmf->pte)
>  		goto release;
> -	if (vmf_pte_changed(vmf)) {
> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
> +		for (i = 0; i < nr_pages; i++)
> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>  		goto release;
>  	}
>
> @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
>
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	folio_add_new_anon_rmap(folio, vma, addr);
>  	folio_add_lru_vma(folio, vma);
>  setpte:
>  	if (uffd_wp)
>  		entry = pte_mkuffd_wp(entry);
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> --
> 2.25.1
>
  
Ryan Roberts Oct. 5, 2023, 12:49 p.m. UTC | #2
On 05/10/2023 13:05, Daniel Gomez wrote:
> On Fri, Sep 29, 2023 at 12:44:16PM +0100, Ryan Roberts wrote:
> 
> Hi Ryan,
>> Introduce the logic to allow THP to be configured (through the new
>> anon_orders interface we just added) to allocate large folios to back
>> anonymous memory, which are smaller than PMD-size (for example order-2,
>> order-3, order-4, etc).
>>
>> These THPs continue to be PTE-mapped, but in many cases can still
>> provide similar benefits to traditional PMD-sized THP: Page faults are
>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>> the configured order), but latency spikes are much less prominent
>> because the size of each page isn't as huge as the PMD-sized variant and
>> there is less memory to clear in each page fault. The number of per-page
>> operations (e.g. ref counting, rmap management, lru list management) are
>> also significantly reduced since those ops now become per-folio.
>>
>> Some architectures also employ TLB compression mechanisms to squeeze
>> more entries in when a set of PTEs are virtually and physically
>> contiguous and approporiately aligned. In this case, TLB misses will
>> occur less often.
>>
>> The new behaviour is disabled by default because the anon_orders
>> defaults to only enabling PMD-order, but can be enabled at runtime by
>> writing to anon_orders (see documentation in previous commit). The long
>> term aim is to default anon_orders to include suitable lower orders, but
>> there are some risks around internal fragmentation that need to be
>> better understood first.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  Documentation/admin-guide/mm/transhuge.rst |   9 +-
>>  include/linux/huge_mm.h                    |   6 +-
>>  mm/memory.c                                | 108 +++++++++++++++++++--
>>  3 files changed, 111 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 9f954e73a4ca..732c3b2f4ba8 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
>>  ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
>>  fields for each mapping. Note that in both cases, AnonHugePages refers
>>  only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
>> -using PTEs.
>> +using PTEs. This includes all THPs whose order is smaller than
>> +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
>> +for other reasons.
>>
>>  The number of file transparent huge pages mapped to userspace is available
>>  by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
>> @@ -367,6 +369,11 @@ frequently will incur overhead.
>>  There are a number of counters in ``/proc/vmstat`` that may be used to
>>  monitor how successfully the system is providing huge pages for use.
>>
>> +.. note::
>> +   Currently the below counters only record events relating to
>> +   PMD-order THPs. Events relating to smaller order THPs are not
>> +   included.
>> +
>>  thp_fault_alloc
>>  	is incremented every time a huge page is successfully
>>  	allocated to handle a page fault.
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2e7c338229a6..c4860476a1f5 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>
>>  /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>   */
>> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>
>>  /*
>>   * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5b82fc8e164..92ed9c782dc9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	return ret;
>>  }
>>
>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +	int i;
>> +
>> +	if (nr_pages == 1)
>> +		return vmf_pte_changed(vmf);
>> +
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +	gfp_t gfp;
>> +	pte_t *pte;
>> +	unsigned long addr;
>> +	struct folio *folio;
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	unsigned int orders;
>> +	int order;
>> +
>> +	/*
>> +	 * If uffd is active for the vma we need per-page fault fidelity to
>> +	 * maintain the uffd semantics.
>> +	 */
>> +	if (userfaultfd_armed(vma))
>> +		goto fallback;
>> +
>> +	/*
>> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +	 * for this vma. Then filter out the orders that can't be allocated over
>> +	 * the faulting address and still be fully contained in the vma.
>> +	 */
>> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
>> +				    BIT(PMD_ORDER) - 1);
>> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
>> +
>> +	if (!orders)
>> +		goto fallback;
>> +
>> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +	if (!pte)
>> +		return ERR_PTR(-EAGAIN);
>> +
>> +	order = first_order(orders);
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		vmf->pte = pte + pte_index(addr);
>> +		if (!vmf_pte_range_changed(vmf, 1 << order))
>> +			break;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +	vmf->pte = NULL;
>> +	pte_unmap(pte);
>> +
>> +	gfp = vma_thp_gfp_mask(vma);
>> +
>> +	while (orders) {
>> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> 
> I was checking your series and noticed about the hugepage flag. I think
> you've changed it from v1 -> v2 from being false to true when orders >=2
> but I'm not sure about the reasoning. Is this because of your statement
> in the cover letter [1]?

That hugepage flag is spec'ed as follows:

 * @hugepage: For hugepages try only the preferred node if possible.

The intent of passing true for orders higher than 0 is that we would prefer to
allocate a smaller order folio that is on the preferred node rather than a higher
order folio that is not on the preferred node. The assumption is that the on-going
cost of accessing the memory on the non-preferred node will outweigh the benefit
of allocating it as a high order folio.
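
To make the fallback concrete, here is a minimal userspace sketch of that
idea; first_order()/next_order() below are hypothetical stand-ins for the
helpers introduced earlier in the series, not the kernel implementations:

#include <stdio.h>

/* Hypothetical stand-ins, roughly mirroring how alloc_anon_folio() walks
 * the bitmask of enabled orders from largest to smallest. */
static int first_order(unsigned int orders)
{
	return 31 - __builtin_clz(orders);	/* highest enabled order */
}

static int next_order(unsigned int *orders, int order)
{
	*orders &= ~(1u << order);		/* drop the order just tried */
	return *orders ? first_order(*orders) : 0;
}

int main(void)
{
	unsigned int orders = (1u << 4) | (1u << 3) | (1u << 2);
	int order = first_order(orders);

	/*
	 * Walk the enabled orders from largest to smallest. With
	 * hugepage=true, each attempt prefers the preferred NUMA node, so a
	 * smaller folio on the preferred node wins over a larger folio that
	 * would have to come from another node.
	 */
	while (orders) {
		printf("try order-%d (%d pages) on the preferred node first\n",
		       order, 1 << order);
		order = next_order(&orders, order);
	}
	return 0;
}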

Thanks,
Ryan


> 
> [1] cover letter snippet:
> 
> "to implement variable order, large folios for anonymous memory.
> (previously called ..., but now exposed as an extension to THP;
> "small-order THP")"
> 
> Thanks,
> Daniel
> 
>> +		if (folio) {
>> +			clear_huge_page(&folio->page, addr, 1 << order);
>> +			return folio;
>> +		}
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +fallback:
>> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +}
>> +#else
>> +#define alloc_anon_folio(vmf) \
>> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>> +#endif
>> +
>>  /*
>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>   * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>   */
>>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  {
>> +	int i;
>> +	int nr_pages = 1;
>> +	unsigned long addr = vmf->address;
>>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>  	struct vm_area_struct *vma = vmf->vma;
>>  	struct folio *folio;
>> @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	/* Allocate our own private page. */
>>  	if (unlikely(anon_vma_prepare(vma)))
>>  		goto oom;
>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +	folio = alloc_anon_folio(vmf);
>> +	if (IS_ERR(folio))
>> +		return 0;
>>  	if (!folio)
>>  		goto oom;
>>
>> +	nr_pages = folio_nr_pages(folio);
>> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>> +
>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>  		goto oom_free_page;
>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>> @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	if (vma->vm_flags & VM_WRITE)
>>  		entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (!vmf->pte)
>>  		goto release;
>> -	if (vmf_pte_changed(vmf)) {
>> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
>> +		for (i = 0; i < nr_pages; i++)
>> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>  		goto release;
>>  	}
>>
>> @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>>
>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +	folio_ref_add(folio, nr_pages - 1);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>  	folio_add_lru_vma(folio, vma);
>>  setpte:
>>  	if (uffd_wp)
>>  		entry = pte_mkuffd_wp(entry);
>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>
>>  	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>  unlock:
>>  	if (vmf->pte)
>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>> --
>> 2.25.1
>>
  
Daniel Gomez Oct. 5, 2023, 2:59 p.m. UTC | #3
On Thu, Oct 05, 2023 at 01:49:30PM +0100, Ryan Roberts wrote:
> On 05/10/2023 13:05, Daniel Gomez wrote:
> > On Fri, Sep 29, 2023 at 12:44:16PM +0100, Ryan Roberts wrote:
> >
> > Hi Ryan,
> >> Introduce the logic to allow THP to be configured (through the new
> >> anon_orders interface we just added) to allocate large folios to back
> >> anonymous memory, which are smaller than PMD-size (for example order-2,
> >> order-3, order-4, etc).
> >>
> >> These THPs continue to be PTE-mapped, but in many cases can still
> >> provide similar benefits to traditional PMD-sized THP: Page faults are
> >> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> >> the configured order), but latency spikes are much less prominent
> >> because the size of each page isn't as huge as the PMD-sized variant and
> >> there is less memory to clear in each page fault. The number of per-page
> >> operations (e.g. ref counting, rmap management, lru list management) are
> >> also significantly reduced since those ops now become per-folio.
> >>
> >> Some architectures also employ TLB compression mechanisms to squeeze
> >> more entries in when a set of PTEs are virtually and physically
> >> contiguous and approporiately aligned. In this case, TLB misses will
> >> occur less often.
> >>
> >> The new behaviour is disabled by default because the anon_orders
> >> defaults to only enabling PMD-order, but can be enabled at runtime by
> >> writing to anon_orders (see documentation in previous commit). The long
> >> term aim is to default anon_orders to include suitable lower orders, but
> >> there are some risks around internal fragmentation that need to be
> >> better understood first.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  Documentation/admin-guide/mm/transhuge.rst |   9 +-
> >>  include/linux/huge_mm.h                    |   6 +-
> >>  mm/memory.c                                | 108 +++++++++++++++++++--
> >>  3 files changed, 111 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >> index 9f954e73a4ca..732c3b2f4ba8 100644
> >> --- a/Documentation/admin-guide/mm/transhuge.rst
> >> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >> @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read
> >>  ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
> >>  fields for each mapping. Note that in both cases, AnonHugePages refers
> >>  only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
> >> -using PTEs.
> >> +using PTEs. This includes all THPs whose order is smaller than
> >> +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
> >> +for other reasons.
> >>
> >>  The number of file transparent huge pages mapped to userspace is available
> >>  by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
> >> @@ -367,6 +369,11 @@ frequently will incur overhead.
> >>  There are a number of counters in ``/proc/vmstat`` that may be used to
> >>  monitor how successfully the system is providing huge pages for use.
> >>
> >> +.. note::
> >> +   Currently the below counters only record events relating to
> >> +   PMD-order THPs. Events relating to smaller order THPs are not
> >> +   included.
> >> +
> >>  thp_fault_alloc
> >>  	is incremented every time a huge page is successfully
> >>  	allocated to handle a page fault.
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2e7c338229a6..c4860476a1f5 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
> >>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >>
> >>  /*
> >> - * Mask of all large folio orders supported for anonymous THP.
> >> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> >> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> >> + * (which is a limitation of the THP implementation).
> >>   */
> >> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
> >> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
> >>
> >>  /*
> >>   * Mask of all large folio orders supported for file THP.
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5b82fc8e164..92ed9c782dc9 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>  	return ret;
> >>  }
> >>
> >> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> >> +{
> >> +	int i;
> >> +
> >> +	if (nr_pages == 1)
> >> +		return vmf_pte_changed(vmf);
> >> +
> >> +	for (i = 0; i < nr_pages; i++) {
> >> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> >> +			return true;
> >> +	}
> >> +
> >> +	return false;
> >> +}
> >> +
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >> +{
> >> +	gfp_t gfp;
> >> +	pte_t *pte;
> >> +	unsigned long addr;
> >> +	struct folio *folio;
> >> +	struct vm_area_struct *vma = vmf->vma;
> >> +	unsigned int orders;
> >> +	int order;
> >> +
> >> +	/*
> >> +	 * If uffd is active for the vma we need per-page fault fidelity to
> >> +	 * maintain the uffd semantics.
> >> +	 */
> >> +	if (userfaultfd_armed(vma))
> >> +		goto fallback;
> >> +
> >> +	/*
> >> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >> +	 * for this vma. Then filter out the orders that can't be allocated over
> >> +	 * the faulting address and still be fully contained in the vma.
> >> +	 */
> >> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
> >> +				    BIT(PMD_ORDER) - 1);
> >> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
> >> +
> >> +	if (!orders)
> >> +		goto fallback;
> >> +
> >> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >> +	if (!pte)
> >> +		return ERR_PTR(-EAGAIN);
> >> +
> >> +	order = first_order(orders);
> >> +	while (orders) {
> >> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +		vmf->pte = pte + pte_index(addr);
> >> +		if (!vmf_pte_range_changed(vmf, 1 << order))
> >> +			break;
> >> +		order = next_order(&orders, order);
> >> +	}
> >> +
> >> +	vmf->pte = NULL;
> >> +	pte_unmap(pte);
> >> +
> >> +	gfp = vma_thp_gfp_mask(vma);
> >> +
> >> +	while (orders) {
> >> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >
> > I was checking your series and noticed about the hugepage flag. I think
> > you've changed it from v1 -> v2 from being false to true when orders >=2
> > but I'm not sure about the reasoning. Is this because of your statement
> > in the cover letter [1]?
>
> That hugepage flags is spec'ed as follows:
>
>  * @hugepage: For hugepages try only the preferred node if possible.
>
> The intent of passing true for orders higher than 0, is that we would prefer to
> allocate a smaller order folio that is on the preferred node than a higher order
> folio that is not on the preferred node. The assumption is that the on-going
> cost of accessing the memory on the non-preferred node will outweigh the benefit
> of allocating it as a high order folio.
>
> Thanks,
> Ryan

I think I'm confused about the @hugepage name. I guess activating that
for any order >= 2 doesn't imply you are in fact allocating a huge page,
does it? I can see order is passed from vma_alloc_folio -> __folio_alloc
-> __alloc_pages, but I assumed (before reading your patch) you always
want this disabled except for HPAGE_PMD_ORDER allocation cases. But
order is not a limitation for the preferred node here, regardless of
whether this is a huge page or not.

I see the motivation, thanks for sharing.
>
>
> >
> > [1] cover letter snippet:
> >
> > "to implement variable order, large folios for anonymous memory.
> > (previously called ..., but now exposed as an extension to THP;
> > "small-order THP")"
> >
> > Thanks,
> > Daniel
> >
> >> +		if (folio) {
> >> +			clear_huge_page(&folio->page, addr, 1 << order);
> >> +			return folio;
> >> +		}
> >> +		order = next_order(&orders, order);
> >> +	}
> >> +
> >> +fallback:
> >> +	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >> +}
> >> +#else
> >> +#define alloc_anon_folio(vmf) \
> >> +		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> >> +#endif
> >> +
> >>  /*
> >>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>   * but allow concurrent faults), and pte mapped but not yet locked.
> >> @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>   */
> >>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  {
> >> +	int i;
> >> +	int nr_pages = 1;
> >> +	unsigned long addr = vmf->address;
> >>  	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
> >>  	struct vm_area_struct *vma = vmf->vma;
> >>  	struct folio *folio;
> >> @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  	/* Allocate our own private page. */
> >>  	if (unlikely(anon_vma_prepare(vma)))
> >>  		goto oom;
> >> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> >> +	folio = alloc_anon_folio(vmf);
> >> +	if (IS_ERR(folio))
> >> +		return 0;
> >>  	if (!folio)
> >>  		goto oom;
> >>
> >> +	nr_pages = folio_nr_pages(folio);
> >> +	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> >> +
> >>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> >>  		goto oom_free_page;
> >>  	folio_throttle_swaprate(folio, GFP_KERNEL);
> >> @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  	if (vma->vm_flags & VM_WRITE)
> >>  		entry = pte_mkwrite(pte_mkdirty(entry), vma);
> >>
> >> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> -			&vmf->ptl);
> >> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> >>  	if (!vmf->pte)
> >>  		goto release;
> >> -	if (vmf_pte_changed(vmf)) {
> >> -		update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> +	if (vmf_pte_range_changed(vmf, nr_pages)) {
> >> +		for (i = 0; i < nr_pages; i++)
> >> +			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> >>  		goto release;
> >>  	}
> >>
> >> @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>  		return handle_userfault(vmf, VM_UFFD_MISSING);
> >>  	}
> >>
> >> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> +	folio_ref_add(folio, nr_pages - 1);
> >> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >> +	folio_add_new_anon_rmap(folio, vma, addr);
> >>  	folio_add_lru_vma(folio, vma);
> >>  setpte:
> >>  	if (uffd_wp)
> >>  		entry = pte_mkuffd_wp(entry);
> >> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
> >>
> >>  	/* No need to invalidate - it was non-present before */
> >> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >> +	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> >>  unlock:
> >>  	if (vmf->pte)
> >>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> --
> >> 2.25.1
> >>
>
  
John Hubbard Oct. 27, 2023, 11:04 p.m. UTC | #4
On 9/29/23 04:44, Ryan Roberts wrote:

Hi Ryan,

A few clarifying questions below.

...
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2e7c338229a6..c4860476a1f5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>   
>   /*
> - * Mask of all large folio orders supported for anonymous THP.
> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> + * (which is a limitation of the THP implementation).
>    */
> -#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
> +#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>   
>   /*
>    * Mask of all large folio orders supported for file THP.
> diff --git a/mm/memory.c b/mm/memory.c
> index b5b82fc8e164..92ed9c782dc9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	return ret;
>   }
>   
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +	int i;
> +
> +	if (nr_pages == 1)
> +		return vmf_pte_changed(vmf);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +			return true;

This seems like something different than the function name implies.
It's really confusing: for a single page case, return true if the
pte in the page tables has changed, yes that is very clear.

But then for multiple page cases, which is really the main
focus here--for that, claim that the range has changed if any
pte is present (!pte_none). Can you please help me understand
what this means?

> +	}
> +
> +	return false;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +{
> +	gfp_t gfp;
> +	pte_t *pte;
> +	unsigned long addr;
> +	struct folio *folio;
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned int orders;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (userfaultfd_armed(vma))
> +		goto fallback;
> +
> +	/*
> +	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> +	 * for this vma. Then filter out the orders that can't be allocated over
> +	 * the faulting address and still be fully contained in the vma.
> +	 */
> +	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
> +				    BIT(PMD_ORDER) - 1);
> +	orders = transhuge_vma_suitable(vma, vmf->address, orders);
> +
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> +	if (!pte)
> +		return ERR_PTR(-EAGAIN);

pte_offset_map() can only fail due to:

     a) Wrong pmd type. These include:
         pmd_none
         pmd_bad
         pmd migration entry
         pmd_trans_huge
         pmd_devmap

     b) __pte_map() failure

For (a), why is it that -EAGAIN is used here? I see that that
will lead to a re-fault, I got that far, but am missing something
still.

For (b), same question, actually. I'm not completely sure why
a retry is going to fix a __pte_map() failure?


> +
> +	order = first_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		vmf->pte = pte + pte_index(addr);
> +		if (!vmf_pte_range_changed(vmf, 1 << order))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	vmf->pte = NULL;
> +	pte_unmap(pte);
> +
> +	gfp = vma_thp_gfp_mask(vma);
> +
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +		if (folio) {
> +			clear_huge_page(&folio->page, addr, 1 << order);
> +			return folio;
> +		}
> +		order = next_order(&orders, order);
> +	}

And finally: is it accurate to say that there are *no* special
page flags being set, for PTE-mapped THPs? I don't see any here,
but want to confirm.


thanks,
  
Ryan Roberts Oct. 30, 2023, 11:43 a.m. UTC | #5
On 28/10/2023 00:04, John Hubbard wrote:
> On 9/29/23 04:44, Ryan Roberts wrote:
> 
> Hi Ryan,
> 
> A few clarifying questions below.

Excellent - keep them coming!

> 
> ...
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2e7c338229a6..c4860476a1f5 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>     /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>    */
>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>     /*
>>    * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5b82fc8e164..92ed9c782dc9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4059,6 +4059,87 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>       return ret;
>>   }
>>   +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>> +{
>> +    int i;
>> +
>> +    if (nr_pages == 1)
>> +        return vmf_pte_changed(vmf);
>> +
>> +    for (i = 0; i < nr_pages; i++) {
>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>> +            return true;
> 
> This seems like something different than the function name implies.
> It's really confusing: for a single page case, return true if the
> pte in the page tables has changed, yes that is very clear.
> 
> But then for multiple page cases, which is really the main
> focus here--for that, claim that the range has changed if any
> pte is present (!pte_none). Can you please help me understand
> what this means?

Yes I understand your confusion. Although I'm confident that the code is
correct, it's a bad name - I'll make the excuse that this has evolved through
rebasing to cope with additions to UFFD. Perhaps something like
vmf_is_large_folio_suitable() is a better name.

It used to be that we would only take the do_anonymous_page() path if the pte
was none; i.e. this is the first time we are faulting on an address covered by
an anon VMA and we need to allocate some memory. But more recently we also end
up here if the pte is a uffd_wp marker. So for a single pte, instead of checking
none, we can check if the pte has changed from our original check (where we
determined it was a uffd_wp marker or none). But for multiple ptes, we don't
have storage to store all the original ptes from the first check.

Fortunately, if uffd is in use for a vma, then we don't want to use a large
folio anyway (this would break uffd semantics because we would no longer get a
fault for every page). So we only care about the "same but not none" case for
nr_pages=1.

Would changing the name to vmf_is_large_folio_suitable() help here?


> 
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +    gfp_t gfp;
>> +    pte_t *pte;
>> +    unsigned long addr;
>> +    struct folio *folio;
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    unsigned int orders;
>> +    int order;
>> +
>> +    /*
>> +     * If uffd is active for the vma we need per-page fault fidelity to
>> +     * maintain the uffd semantics.
>> +     */
>> +    if (userfaultfd_armed(vma))
>> +        goto fallback;
>> +
>> +    /*
>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +     * for this vma. Then filter out the orders that can't be allocated over
>> +     * the faulting address and still be fully contained in the vma.
>> +     */
>> +    orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
>> +                    BIT(PMD_ORDER) - 1);
>> +    orders = transhuge_vma_suitable(vma, vmf->address, orders);
>> +
>> +    if (!orders)
>> +        goto fallback;
>> +
>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +    if (!pte)
>> +        return ERR_PTR(-EAGAIN);
> 
> pte_offset_map() can only fail due to:
> 
>     a) Wrong pmd type. These include:
>         pmd_none
>         pmd_bad
>         pmd migration entry
>         pmd_trans_huge
>         pmd_devmap
> 
>     b) __pte_map() failure
> 
> For (a), why is it that -EAGAIN is used here? I see that that
> will lead to a re-fault, I got that far, but am missing something
> still.
> 
> For (b), same question, actually. I'm not completely sure why
> why a retry is going to fix a __pte_map() failure?

I'm not going to claim to understand all the details of this. But this is due to
a change that Hugh introduced, and we concluded at [1] that it's always correct to
return EAGAIN here to rerun the fault. In fact, with the current implementation
pte_offset_map() should never fail for anon IIUC, but the view was that EAGAIN
makes it safe for tomorrow, and because this would only fail due to a race,
retrying is correct.

[1] https://lore.kernel.org/linux-mm/8bdfd8d8-5662-4615-86dc-d60259bd16d@google.com/


> 
> 
>> +
>> +    order = first_order(orders);
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        vmf->pte = pte + pte_index(addr);
>> +        if (!vmf_pte_range_changed(vmf, 1 << order))
>> +            break;
>> +        order = next_order(&orders, order);
>> +    }
>> +
>> +    vmf->pte = NULL;
>> +    pte_unmap(pte);
>> +
>> +    gfp = vma_thp_gfp_mask(vma);
>> +
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +        if (folio) {
>> +            clear_huge_page(&folio->page, addr, 1 << order);
>> +            return folio;
>> +        }
>> +        order = next_order(&orders, order);
>> +    }
> 
> And finally: is it accurate to say that there are *no* special
> page flags being set, for PTE-mapped THPs? I don't see any here,
> but want to confirm.

The page flags are coming from 'gfp = vma_thp_gfp_mask(vma)', which pulls in the
correct flags based on the transparent_hugepage/defrag file.

> 
> 
> thanks,
  
John Hubbard Oct. 30, 2023, 11:25 p.m. UTC | #6
On 10/30/23 04:43, Ryan Roberts wrote:
> On 28/10/2023 00:04, John Hubbard wrote:
>> On 9/29/23 04:44, Ryan Roberts wrote:
...
>>>    +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>> +{
>>> +    int i;
>>> +
>>> +    if (nr_pages == 1)
>>> +        return vmf_pte_changed(vmf);
>>> +
>>> +    for (i = 0; i < nr_pages; i++) {
>>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>> +            return true;
>>
>> This seems like something different than the function name implies.
>> It's really confusing: for a single page case, return true if the
>> pte in the page tables has changed, yes that is very clear.
>>
>> But then for multiple page cases, which is really the main
>> focus here--for that, claim that the range has changed if any
>> pte is present (!pte_none). Can you please help me understand
>> what this means?
> 
> Yes I understand your confusion. Although I'm confident that the code is
> correct, its a bad name - I'll make the excuse that this has evolved through
> rebasing to cope with additions to UFFD. Perhaps something like
> vmf_is_large_folio_suitable() is a better name.
> 
> It used to be that we would only take the do_anonymous_page() path if the pte
> was none; i.e. this is the first time we are faulting on an address covered by
> an anon VMA and we need to allocate some memory. But more recently we also end
> up here if the pte is a uffd_wp marker. So for a single pte, instead of checking
> none, we can check if the pte has changed from our original check (where we
> determined it was a uffd_wp marker or none). But for multiple ptes, we don't
> have storage to store all the original ptes from the first check.
> 
> Fortunately, if uffd is in use for a vma, then we don't want to use a large
> folio anyway (this would break uffd semantics because we would no longer get a
> fault for every page). So we only care about the "same but not none" case for
> nr_pages=1.
> 
> Would changing the name to vmf_is_large_folio_suitable() help here?

Yes it would! And adding in a sentence or two from above about the uffd as
a function-level comment might be just the right amount of demystification for
the code.

...
>> pte_offset_map() can only fail due to:
>>
>>      a) Wrong pmd type. These include:
>>          pmd_none
>>          pmd_bad
>>          pmd migration entry
>>          pmd_trans_huge
>>          pmd_devmap
>>
>>      b) __pte_map() failure
>>
>> For (a), why is it that -EAGAIN is used here? I see that that
>> will lead to a re-fault, I got that far, but am missing something
>> still.
>>
>> For (b), same question, actually. I'm not completely sure why
>> why a retry is going to fix a __pte_map() failure?
> 
> I'm not going to claim to understand all the details of this. But this is due to
> a change that Hugh introduced and we concluded at [1] that its always correct to
> return EAGAIN here to rerun the fault. In fact, with the current implementation
> pte_offset_map() should never fail for anon IIUC, but the view was that EAGAIN
> makes it safe for tomorrow, and because this would only fail due to a race,
> retrying is correct.
> 
> [1] https://lore.kernel.org/linux-mm/8bdfd8d8-5662-4615-86dc-d60259bd16d@google.com/
> 

OK, got it.

...
>> And finally: is it accurate to say that there are *no* special
>> page flags being set, for PTE-mapped THPs? I don't see any here,
>> but want to confirm.
> 
> The page flags are coming from 'gfp = vma_thp_gfp_mask(vma)', which pulls in the
> correct flags based on transparent_hugepage/defrag file.
> 

OK that all is pretty clear now, thanks for the answers!


thanks,
  
Ryan Roberts Nov. 1, 2023, 1:56 p.m. UTC | #7
On 30/10/2023 23:25, John Hubbard wrote:
> On 10/30/23 04:43, Ryan Roberts wrote:
>> On 28/10/2023 00:04, John Hubbard wrote:
>>> On 9/29/23 04:44, Ryan Roberts wrote:
> ...
>>>>    +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    if (nr_pages == 1)
>>>> +        return vmf_pte_changed(vmf);
>>>> +
>>>> +    for (i = 0; i < nr_pages; i++) {
>>>> +        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>> +            return true;
>>>
>>> This seems like something different than the function name implies.
>>> It's really confusing: for a single page case, return true if the
>>> pte in the page tables has changed, yes that is very clear.
>>>
>>> But then for multiple page cases, which is really the main
>>> focus here--for that, claim that the range has changed if any
>>> pte is present (!pte_none). Can you please help me understand
>>> what this means?
>>
>> Yes I understand your confusion. Although I'm confident that the code is
>> correct, its a bad name - I'll make the excuse that this has evolved through
>> rebasing to cope with additions to UFFD. Perhaps something like
>> vmf_is_large_folio_suitable() is a better name.
>>
>> It used to be that we would only take the do_anonymous_page() path if the pte
>> was none; i.e. this is the first time we are faulting on an address covered by
>> an anon VMA and we need to allocate some memory. But more recently we also end
>> up here if the pte is a uffd_wp marker. So for a single pte, instead of checking
>> none, we can check if the pte has changed from our original check (where we
>> determined it was a uffd_wp marker or none). But for multiple ptes, we don't
>> have storage to store all the original ptes from the first check.
>>
>> Fortunately, if uffd is in use for a vma, then we don't want to use a large
>> folio anyway (this would break uffd semantics because we would no longer get a
>> fault for every page). So we only care about the "same but not none" case for
>> nr_pages=1.
>>
>> Would changing the name to vmf_is_large_folio_suitable() help here?
> 
> Yes it would! And adding in a sentence or two from above about the uffd, as
> a function-level comment might be just the right of demystification for
> the code.

Actually I don't think the name I proposed is quite right either - this gets
called for small folios too.

I think it's cleaner to change the name to vmf_pte_range_none() and strip out the
nr_pages==1 case. The checking-for-none part is required by alloc_anon_folio()
and needs to be safe without holding the PTL. vmf_pte_changed() is not safe
without the lock. So I've just hoisted the nr_pages==1 case directly into
do_anonymous_page(). Shout if you think we can do better:


diff --git a/mm/memory.c b/mm/memory.c
index 569c828b1cdc..b48e4de1bf20 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4117,19 +4117,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        return ret;
 }

-static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+static bool pte_range_none(pte_t *pte, int nr_pages)
 {
        int i;

-       if (nr_pages == 1)
-               return vmf_pte_changed(vmf);
-
        for (i = 0; i < nr_pages; i++) {
-               if (!pte_none(ptep_get_lockless(vmf->pte + i)))
-                       return true;
+               if (!pte_none(ptep_get_lockless(pte + i)))
+                       return false;
        }

-       return false;
+       return true;
 }

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -4170,7 +4167,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
        while (orders) {
                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
                vmf->pte = pte + pte_index(addr);
-               if (!vmf_pte_range_changed(vmf, 1 << order))
+               if (pte_range_none(vmf->pte, 1 << order))
                        break;
                order = next_order(&orders, order);
        }
@@ -4280,7 +4277,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
        if (!vmf->pte)
                goto release;
-       if (vmf_pte_range_changed(vmf, nr_pages)) {
+       if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
+           (nr_pages  > 1 && !pte_range_none(vmf->pte, nr_pages))) {
                for (i = 0; i < nr_pages; i++)
                        update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
                goto release;
  

Patch

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9f954e73a4ca..732c3b2f4ba8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -353,7 +353,9 @@  anonymous transparent huge pages, it is necessary to read
 ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap
 fields for each mapping. Note that in both cases, AnonHugePages refers
 only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped
-using PTEs.
+using PTEs. This includes all THPs whose order is smaller than
+PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped
+for other reasons.
 
 The number of file transparent huge pages mapped to userspace is available
 by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
@@ -367,6 +369,11 @@  frequently will incur overhead.
 There are a number of counters in ``/proc/vmstat`` that may be used to
 monitor how successfully the system is providing huge pages for use.
 
+.. note::
+   Currently the below counters only record events relating to
+   PMD-order THPs. Events relating to smaller order THPs are not
+   included.
+
 thp_fault_alloc
 	is incremented every time a huge page is successfully
 	allocated to handle a page fault.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2e7c338229a6..c4860476a1f5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@  extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
 /*
- * Mask of all large folio orders supported for anonymous THP.
+ * Mask of all large folio orders supported for anonymous THP; all orders up to
+ * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
+ * (which is a limitation of the THP implementation).
  */
-#define THP_ORDERS_ALL_ANON	BIT(PMD_ORDER)
+#define THP_ORDERS_ALL_ANON	((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
 
 /*
  * Mask of all large folio orders supported for file THP.
diff --git a/mm/memory.c b/mm/memory.c
index b5b82fc8e164..92ed9c782dc9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4059,6 +4059,87 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	return ret;
 }
 
+static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+	int i;
+
+	if (nr_pages == 1)
+		return vmf_pte_changed(vmf);
+
+	for (i = 0; i < nr_pages; i++) {
+		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+			return true;
+	}
+
+	return false;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+	gfp_t gfp;
+	pte_t *pte;
+	unsigned long addr;
+	struct folio *folio;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned int orders;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (userfaultfd_armed(vma))
+		goto fallback;
+
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * for this vma. Then filter out the orders that can't be allocated over
+	 * the faulting address and still be fully contained in the vma.
+	 */
+	orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true,
+				    BIT(PMD_ORDER) - 1);
+	orders = transhuge_vma_suitable(vma, vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (!pte)
+		return ERR_PTR(-EAGAIN);
+
+	order = first_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		vmf->pte = pte + pte_index(addr);
+		if (!vmf_pte_range_changed(vmf, 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	vmf->pte = NULL;
+	pte_unmap(pte);
+
+	gfp = vma_thp_gfp_mask(vma);
+
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio) {
+			clear_huge_page(&folio->page, addr, 1 << order);
+			return folio;
+		}
+		order = next_order(&orders, order);
+	}
+
+fallback:
+	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) \
+		vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4066,6 +4147,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
+	int i;
+	int nr_pages = 1;
+	unsigned long addr = vmf->address;
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio;
@@ -4110,10 +4194,15 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vmf);
+	if (IS_ERR(folio))
+		return 0;
 	if (!folio)
 		goto oom;
 
+	nr_pages = folio_nr_pages(folio);
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4130,12 +4219,12 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry), vma);
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (!vmf->pte)
 		goto release;
-	if (vmf_pte_changed(vmf)) {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (vmf_pte_range_changed(vmf, nr_pages)) {
+		for (i = 0; i < nr_pages; i++)
+			update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
 		goto release;
 	}
 
@@ -4150,16 +4239,17 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
 setpte:
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);