[v2,4/5] mm: FLEXIBLE_THP for improved performance

Message ID 20230703135330.1865927-5-ryan.roberts@arm.com
State New
Series variable-order, large folios for anonymous memory

Commit Message

Ryan Roberts July 3, 2023, 1:53 p.m. UTC
  Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
allocated in large folios of a specified order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced since those ops now become per-folio.

The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
defaults to disabled for now; there is a long list of todos to make
FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
madvise ops, etc). These items will be tackled in subsequent patches.

When enabled, the preferred folio order is as returned by
arch_wants_pte_order(), which may be overridden by the arch as it sees
fit. Some architectures (e.g. arm64) can coalesce TLB entries if a
contiguous set of ptes map physically contiguous, naturally aligned
memory, so this mechanism allows the architecture to optimize as
required.

If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order.
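
For illustration, the allocation flow introduced by this patch reduces to
something like the sketch below. The helper names are the ones added by the
patch; the wrapper function is hypothetical, and the pte map/unmap and locking
around the order calculation are omitted for brevity.

/* Sketch only; see do_anonymous_page() in the diff for the real flow. */
static struct folio *anon_folio_with_fallback(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        int order = max_anon_folio_order(vma);  /* arch_wants_pte_order() */
        struct folio *folio;

        if (order > 0)
                order = calc_anon_folio_order_alloc(vmf, order);

        folio = alloc_anon_folio(vma, vmf->address, order);
        if (!folio && order > 0)        /* large allocation failed */
                folio = alloc_anon_folio(vma, vmf->address, 0);

        return folio;   /* NULL means OOM */
}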

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/Kconfig  |  10 ++++
 mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 165 insertions(+), 13 deletions(-)
  

Comments

kernel test robot July 3, 2023, 3:51 p.m. UTC | #1
Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v6.4]
[cannot apply to akpm-mm/mm-everything linus/master next-20230703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link:    https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com
patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
config: um-allyesconfig (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307032325.u93xmWbG-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
                                                     ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
                                                     ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsb(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsw(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsl(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesb(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesw(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesl(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
>> mm/memory.c:4271:2: error: implicit declaration of function 'set_ptes' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
           set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
           ^
   mm/memory.c:4271:2: note: did you mean 'set_pte'?
   arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here
   static inline void set_pte(pte_t *pteptr, pte_t pteval)
                      ^
>> mm/memory.c:4274:2: error: implicit declaration of function 'update_mmu_cache_range' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
           update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
           ^
   12 warnings and 2 errors generated.


vim +/set_ptes +4271 mm/memory.c

  4135	
  4136	/*
  4137	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  4138	 * but allow concurrent faults), and pte mapped but not yet locked.
  4139	 * We return with mmap_lock still held, but pte unmapped and unlocked.
  4140	 */
  4141	static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
  4142	{
  4143		bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
  4144		struct vm_area_struct *vma = vmf->vma;
  4145		struct folio *folio;
  4146		vm_fault_t ret = 0;
  4147		pte_t entry;
  4148		int order;
  4149		int pgcount;
  4150		unsigned long addr;
  4151	
  4152		/* File mapping without ->vm_ops ? */
  4153		if (vma->vm_flags & VM_SHARED)
  4154			return VM_FAULT_SIGBUS;
  4155	
  4156		/*
  4157		 * Use pte_alloc() instead of pte_alloc_map().  We can't run
  4158		 * pte_offset_map() on pmds where a huge pmd might be created
  4159		 * from a different thread.
  4160		 *
  4161		 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
  4162		 * parallel threads are excluded by other means.
  4163		 *
  4164		 * Here we only have mmap_read_lock(mm).
  4165		 */
  4166		if (pte_alloc(vma->vm_mm, vmf->pmd))
  4167			return VM_FAULT_OOM;
  4168	
  4169		/* See comment in handle_pte_fault() */
  4170		if (unlikely(pmd_trans_unstable(vmf->pmd)))
  4171			return 0;
  4172	
  4173		/* Use the zero-page for reads */
  4174		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
  4175				!mm_forbids_zeropage(vma->vm_mm)) {
  4176			entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
  4177							vma->vm_page_prot));
  4178			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4179					vmf->address, &vmf->ptl);
  4180			if (vmf_pte_changed(vmf)) {
  4181				update_mmu_tlb(vma, vmf->address, vmf->pte);
  4182				goto unlock;
  4183			}
  4184			ret = check_stable_address_space(vma->vm_mm);
  4185			if (ret)
  4186				goto unlock;
  4187			/* Deliver the page fault to userland, check inside PT lock */
  4188			if (userfaultfd_missing(vma)) {
  4189				pte_unmap_unlock(vmf->pte, vmf->ptl);
  4190				return handle_userfault(vmf, VM_UFFD_MISSING);
  4191			}
  4192			if (uffd_wp)
  4193				entry = pte_mkuffd_wp(entry);
  4194			set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
  4195	
  4196			/* No need to invalidate - it was non-present before */
  4197			update_mmu_cache(vma, vmf->address, vmf->pte);
  4198			goto unlock;
  4199		}
  4200	
  4201		/*
  4202		 * If allocating a large folio, determine the biggest suitable order for
  4203		 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
  4204		 * overlap with any populated PTEs, etc). We are not under the ptl here
  4205		 * so we will need to re-check that we are not overlapping any populated
  4206		 * PTEs once we have the lock.
  4207		 */
  4208		order = uffd_wp ? 0 : max_anon_folio_order(vma);
  4209		if (order > 0) {
  4210			vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
  4211			order = calc_anon_folio_order_alloc(vmf, order);
  4212			pte_unmap(vmf->pte);
  4213		}
  4214	
  4215		/* Allocate our own private folio. */
  4216		if (unlikely(anon_vma_prepare(vma)))
  4217			goto oom;
  4218		folio = alloc_anon_folio(vma, vmf->address, order);
  4219		if (!folio && order > 0) {
  4220			order = 0;
  4221			folio = alloc_anon_folio(vma, vmf->address, order);
  4222		}
  4223		if (!folio)
  4224			goto oom;
  4225	
  4226		pgcount = 1 << order;
  4227		addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
  4228	
  4229		if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
  4230			goto oom_free_page;
  4231		folio_throttle_swaprate(folio, GFP_KERNEL);
  4232	
  4233		/*
  4234		 * The memory barrier inside __folio_mark_uptodate makes sure that
  4235		 * preceding stores to the folio contents become visible before
  4236		 * the set_ptes() write.
  4237		 */
  4238		__folio_mark_uptodate(folio);
  4239	
  4240		entry = mk_pte(&folio->page, vma->vm_page_prot);
  4241		entry = pte_sw_mkyoung(entry);
  4242		if (vma->vm_flags & VM_WRITE)
  4243			entry = pte_mkwrite(pte_mkdirty(entry));
  4244	
  4245		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
  4246		if (vmf_pte_changed(vmf)) {
  4247			update_mmu_tlb(vma, vmf->address, vmf->pte);
  4248			goto release;
  4249		} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
  4250			goto release;
  4251		}
  4252	
  4253		ret = check_stable_address_space(vma->vm_mm);
  4254		if (ret)
  4255			goto release;
  4256	
  4257		/* Deliver the page fault to userland, check inside PT lock */
  4258		if (userfaultfd_missing(vma)) {
  4259			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4260			folio_put(folio);
  4261			return handle_userfault(vmf, VM_UFFD_MISSING);
  4262		}
  4263	
  4264		folio_ref_add(folio, pgcount - 1);
  4265		add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
  4266		folio_add_new_anon_rmap(folio, vma, addr);
  4267		folio_add_lru_vma(folio, vma);
  4268	
  4269		if (uffd_wp)
  4270			entry = pte_mkuffd_wp(entry);
> 4271		set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
  4272	
  4273		/* No need to invalidate - it was non-present before */
> 4274		update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
  4275	unlock:
  4276		pte_unmap_unlock(vmf->pte, vmf->ptl);
  4277		return ret;
  4278	release:
  4279		folio_put(folio);
  4280		goto unlock;
  4281	oom_free_page:
  4282		folio_put(folio);
  4283	oom:
  4284		return VM_FAULT_OOM;
  4285	}
  4286
  
kernel test robot July 3, 2023, 4:01 p.m. UTC | #2
Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v6.4]
[cannot apply to akpm-mm/mm-everything linus/master next-20230703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link:    https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com
patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307032330.TguyNttt-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from mm/memory.c:42:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     692 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     700 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     708 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     717 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     726 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     735 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
>> mm/memory.c:4271:2: error: call to undeclared function 'set_ptes'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    4271 |         set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
         |         ^
   mm/memory.c:4271:2: note: did you mean 'set_pte'?
   arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here
     232 | static inline void set_pte(pte_t *pteptr, pte_t pteval)
         |                    ^
>> mm/memory.c:4274:2: error: call to undeclared function 'update_mmu_cache_range'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    4274 |         update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
         |         ^
   12 warnings and 2 errors generated.


vim +/set_ptes +4271 mm/memory.c

  4135	
  4136	/*
  4137	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  4138	 * but allow concurrent faults), and pte mapped but not yet locked.
  4139	 * We return with mmap_lock still held, but pte unmapped and unlocked.
  4140	 */
  4141	static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
  4142	{
  4143		bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
  4144		struct vm_area_struct *vma = vmf->vma;
  4145		struct folio *folio;
  4146		vm_fault_t ret = 0;
  4147		pte_t entry;
  4148		int order;
  4149		int pgcount;
  4150		unsigned long addr;
  4151	
  4152		/* File mapping without ->vm_ops ? */
  4153		if (vma->vm_flags & VM_SHARED)
  4154			return VM_FAULT_SIGBUS;
  4155	
  4156		/*
  4157		 * Use pte_alloc() instead of pte_alloc_map().  We can't run
  4158		 * pte_offset_map() on pmds where a huge pmd might be created
  4159		 * from a different thread.
  4160		 *
  4161		 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
  4162		 * parallel threads are excluded by other means.
  4163		 *
  4164		 * Here we only have mmap_read_lock(mm).
  4165		 */
  4166		if (pte_alloc(vma->vm_mm, vmf->pmd))
  4167			return VM_FAULT_OOM;
  4168	
  4169		/* See comment in handle_pte_fault() */
  4170		if (unlikely(pmd_trans_unstable(vmf->pmd)))
  4171			return 0;
  4172	
  4173		/* Use the zero-page for reads */
  4174		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
  4175				!mm_forbids_zeropage(vma->vm_mm)) {
  4176			entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
  4177							vma->vm_page_prot));
  4178			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4179					vmf->address, &vmf->ptl);
  4180			if (vmf_pte_changed(vmf)) {
  4181				update_mmu_tlb(vma, vmf->address, vmf->pte);
  4182				goto unlock;
  4183			}
  4184			ret = check_stable_address_space(vma->vm_mm);
  4185			if (ret)
  4186				goto unlock;
  4187			/* Deliver the page fault to userland, check inside PT lock */
  4188			if (userfaultfd_missing(vma)) {
  4189				pte_unmap_unlock(vmf->pte, vmf->ptl);
  4190				return handle_userfault(vmf, VM_UFFD_MISSING);
  4191			}
  4192			if (uffd_wp)
  4193				entry = pte_mkuffd_wp(entry);
  4194			set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
  4195	
  4196			/* No need to invalidate - it was non-present before */
  4197			update_mmu_cache(vma, vmf->address, vmf->pte);
  4198			goto unlock;
  4199		}
  4200	
  4201		/*
  4202		 * If allocating a large folio, determine the biggest suitable order for
  4203		 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
  4204		 * overlap with any populated PTEs, etc). We are not under the ptl here
  4205		 * so we will need to re-check that we are not overlapping any populated
  4206		 * PTEs once we have the lock.
  4207		 */
  4208		order = uffd_wp ? 0 : max_anon_folio_order(vma);
  4209		if (order > 0) {
  4210			vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
  4211			order = calc_anon_folio_order_alloc(vmf, order);
  4212			pte_unmap(vmf->pte);
  4213		}
  4214	
  4215		/* Allocate our own private folio. */
  4216		if (unlikely(anon_vma_prepare(vma)))
  4217			goto oom;
  4218		folio = alloc_anon_folio(vma, vmf->address, order);
  4219		if (!folio && order > 0) {
  4220			order = 0;
  4221			folio = alloc_anon_folio(vma, vmf->address, order);
  4222		}
  4223		if (!folio)
  4224			goto oom;
  4225	
  4226		pgcount = 1 << order;
  4227		addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
  4228	
  4229		if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
  4230			goto oom_free_page;
  4231		folio_throttle_swaprate(folio, GFP_KERNEL);
  4232	
  4233		/*
  4234		 * The memory barrier inside __folio_mark_uptodate makes sure that
  4235		 * preceding stores to the folio contents become visible before
  4236		 * the set_ptes() write.
  4237		 */
  4238		__folio_mark_uptodate(folio);
  4239	
  4240		entry = mk_pte(&folio->page, vma->vm_page_prot);
  4241		entry = pte_sw_mkyoung(entry);
  4242		if (vma->vm_flags & VM_WRITE)
  4243			entry = pte_mkwrite(pte_mkdirty(entry));
  4244	
  4245		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
  4246		if (vmf_pte_changed(vmf)) {
  4247			update_mmu_tlb(vma, vmf->address, vmf->pte);
  4248			goto release;
  4249		} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
  4250			goto release;
  4251		}
  4252	
  4253		ret = check_stable_address_space(vma->vm_mm);
  4254		if (ret)
  4255			goto release;
  4256	
  4257		/* Deliver the page fault to userland, check inside PT lock */
  4258		if (userfaultfd_missing(vma)) {
  4259			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4260			folio_put(folio);
  4261			return handle_userfault(vmf, VM_UFFD_MISSING);
  4262		}
  4263	
  4264		folio_ref_add(folio, pgcount - 1);
  4265		add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
  4266		folio_add_new_anon_rmap(folio, vma, addr);
  4267		folio_add_lru_vma(folio, vma);
  4268	
  4269		if (uffd_wp)
  4270			entry = pte_mkuffd_wp(entry);
> 4271		set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
  4272	
  4273		/* No need to invalidate - it was non-present before */
> 4274		update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
  4275	unlock:
  4276		pte_unmap_unlock(vmf->pte, vmf->ptl);
  4277		return ret;
  4278	release:
  4279		folio_put(folio);
  4280		goto unlock;
  4281	oom_free_page:
  4282		folio_put(folio);
  4283	oom:
  4284		return VM_FAULT_OOM;
  4285	}
  4286
  
Yu Zhao July 4, 2023, 1:35 a.m. UTC | #3
On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; there is a long list of todos to make
> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> madvise ops, etc). These items will be tackled in subsequent patches.
>
> When enabled, the preferred folio order is as returned by
> arch_wants_pte_order(), which may be overridden by the arch as it sees
> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a

coalesce

> contiguous set of ptes map physically contigious, naturally aligned

contiguous

> memory, so this mechanism allows the architecture to optimize as
> required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/Kconfig  |  10 ++++
>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 165 insertions(+), 13 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..1c06b2c0a24e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>           support of file THPs will be developed in the next few release
>           cycles.
>
> +config FLEXIBLE_THP
> +       bool "Flexible order THP"
> +       depends on TRANSPARENT_HUGEPAGE
> +       default n

The default value is already N.

> +       help
> +         Use large (bigger than order-0) folios to back anonymous memory where
> +         possible, even if the order of the folio is smaller than the PMD
> +         order. This reduces the number of page faults, as well as other
> +         per-page overheads to improve performance for many workloads.
> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  #
> diff --git a/mm/memory.c b/mm/memory.c
> index fb30f7523550..abe2ea94f3f5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>         return 0;
>  }
>
> +#ifdef CONFIG_FLEXIBLE_THP
> +/*
> + * Allocates, zeros and returns a folio of the requested order for use as
> + * anonymous memory.
> + */
> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> +                                     unsigned long addr, int order)
> +{
> +       gfp_t gfp;
> +       struct folio *folio;
> +
> +       if (order == 0)
> +               return vma_alloc_zeroed_movable_folio(vma, addr);
> +
> +       gfp = vma_thp_gfp_mask(vma);
> +       folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +       if (folio)
> +               clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> +
> +       return folio;
> +}
> +
> +/*
> + * Preferred folio order to allocate for anonymous memory.
> + */
> +#define max_anon_folio_order(vma)      arch_wants_pte_order(vma)
> +#else
> +#define alloc_anon_folio(vma, addr, order) \
> +                               vma_alloc_zeroed_movable_folio(vma, addr)
> +#define max_anon_folio_order(vma)      0
> +#endif
> +
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +       int i;
> +
> +       for (i = 0; i < nr; i++) {
> +               if (!pte_none(ptep_get(pte++)))
> +                       return i;
> +       }
> +
> +       return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> +       /*
> +        * The aim here is to determine what size of folio we should allocate
> +        * for this fault. Factors include:
> +        * - Order must not be higher than `order` upon entry
> +        * - Folio must be naturally aligned within VA space
> +        * - Folio must be fully contained inside one pmd entry
> +        * - Folio must not breach boundaries of vma
> +        * - Folio must not overlap any non-none ptes
> +        *
> +        * Additionally, we do not allow order-1 since this breaks assumptions
> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> +        * store state up to the 3rd struct page subpage), and these pages must
> +        * be THP in order to correctly use pre-existing THP infrastructure such
> +        * as folio_split().
> +        *
> +        * Note that the caller may or may not choose to lock the pte. If
> +        * unlocked, the result is racy and the user must re-check any overlap
> +        * with non-none ptes under the lock.
> +        */
> +
> +       struct vm_area_struct *vma = vmf->vma;
> +       int nr;
> +       unsigned long addr;
> +       pte_t *pte;
> +       pte_t *first_set = NULL;
> +       int ret;
> +
> +       order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> +       for (; order > 1; order--) {

I'm not sure how we can justify this policy. As an initial step, it'd
be a lot easier to sell if we only considered the order of
arch_wants_pte_order() and the order 0.

> +               nr = 1 << order;
> +               addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> +               pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +               /* Check vma bounds. */
> +               if (addr < vma->vm_start ||
> +                   addr + (nr << PAGE_SHIFT) > vma->vm_end)
> +                       continue;
> +
> +               /* Ptes covered by order already known to be none. */
> +               if (pte + nr <= first_set)
> +                       break;
> +
> +               /* Already found set pte in range covered by order. */
> +               if (pte <= first_set)
> +                       continue;
> +
> +               /* Need to check if all the ptes are none. */
> +               ret = check_ptes_none(pte, nr);
> +               if (ret == nr)
> +                       break;
> +
> +               first_set = pte + ret;
> +       }
> +
> +       if (order == 1)
> +               order = 0;
> +
> +       return order;
> +}

Everything above can be simplified into two helpers:
vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you
prefer). Details below.

>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>                 goto oom;
>
>         if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> -               new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +               new_folio = alloc_anon_folio(vma, vmf->address, 0);

This seems unnecessary for now. Later on, we could fill in an aligned
area with multiple write-protected zero pages during a read fault and
then replace them with a large folio here.

>                 if (!new_folio)
>                         goto oom;
>         } else {
> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         struct folio *folio;
>         vm_fault_t ret = 0;
>         pte_t entry;
> +       int order;
> +       int pgcount;
> +       unsigned long addr;
>
>         /* File mapping without ->vm_ops ? */
>         if (vma->vm_flags & VM_SHARED)
> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>                 }
> -               goto setpte;
> +               if (uffd_wp)
> +                       entry = pte_mkuffd_wp(entry);
> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +               /* No need to invalidate - it was non-present before */
> +               update_mmu_cache(vma, vmf->address, vmf->pte);
> +               goto unlock;
> +       }

Not really needed IMO. Details below.

===

> +       /*
> +        * If allocating a large folio, determine the biggest suitable order for
> +        * the VMA (e.g. it must not exceed the VMA's bounds, it must not
> +        * overlap with any populated PTEs, etc). We are not under the ptl here
> +        * so we will need to re-check that we are not overlapping any populated
> +        * PTEs once we have the lock.
> +        */
> +       order = uffd_wp ? 0 : max_anon_folio_order(vma);
> +       if (order > 0) {
> +               vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +               order = calc_anon_folio_order_alloc(vmf, order);
> +               pte_unmap(vmf->pte);
>         }

===

The section above together with the section below should be wrapped in a helper.

> -       /* Allocate our own private page. */
> +       /* Allocate our own private folio. */
>         if (unlikely(anon_vma_prepare(vma)))
>                 goto oom;

===

> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +       folio = alloc_anon_folio(vma, vmf->address, order);
> +       if (!folio && order > 0) {
> +               order = 0;
> +               folio = alloc_anon_folio(vma, vmf->address, order);
> +       }

===

One helper returns a folio of order arch_wants_pte_order(), or order 0
if it fails to allocate that order, e.g.,

folio = alloc_anon_folio(vmf);

And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0
regardless of arch_wants_pte_order(). Upon success, it can update
vmf->address, since if we run into a race with another PF, we exit the
fault handler and retry anyway.
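
A minimal sketch of that suggested helper, under the assumptions above (the
shape, the uffd-wp handling and the aligned-address handling are taken from
this description, not from the posted series; VMA-bounds checks are omitted,
and the aligned base is returned through an out-parameter rather than by
writing vmf->address):

static struct folio *alloc_anon_folio(struct vm_fault *vmf, unsigned long *addr)
{
        struct vm_area_struct *vma = vmf->vma;
        int order = arch_wants_pte_order(vma);
        struct folio *folio;

        *addr = vmf->address;

        /* uffd-wp tracks state per pte, so stick to a single page. */
        if (order <= 0 || vmf_orig_pte_uffd_wp(vmf))
                goto fallback;

        folio = vma_alloc_folio(vma_thp_gfp_mask(vma), order, vma,
                                vmf->address, true);
        if (!folio)
                goto fallback;

        /* The fault address rounds down to the folio's natural base. */
        *addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
        clear_huge_page(&folio->page, *addr, folio_nr_pages(folio));
        return folio;

fallback:
        return vma_alloc_zeroed_movable_folio(vma, vmf->address);
}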

>         if (!folio)
>                 goto oom;
>
> +       pgcount = 1 << order;
> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);

As shown above, the helper already updates vmf->address. And mm/ never
used pgcount before -- the convention is nr_pages = folio_nr_pages().

>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>                 goto oom_free_page;
>         folio_throttle_swaprate(folio, GFP_KERNEL);
>
>         /*
>          * The memory barrier inside __folio_mark_uptodate makes sure that
> -        * preceding stores to the page contents become visible before
> -        * the set_pte_at() write.
> +        * preceding stores to the folio contents become visible before
> +        * the set_ptes() write.

We don't have set_ptes() yet.

>          */
>         __folio_mark_uptodate(folio);
>
> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         if (vma->vm_flags & VM_WRITE)
>                 entry = pte_mkwrite(pte_mkdirty(entry));
>
> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -                       &vmf->ptl);
> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>         if (vmf_pte_changed(vmf)) {
>                 update_mmu_tlb(vma, vmf->address, vmf->pte);
>                 goto release;
> +       } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
> +               goto release;
>         }

Need new helper:

  if (vmf_pte_range_changed(vmf, nr_pages)) {
    for (i = 0; i < nr_pages; i++)
      update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
    goto release;
  }

(It should be fine to call update_mmu_tlb() even if it's not really necessary.)
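
For reference, a possible shape for vmf_pte_range_changed(), built from the
vmf_pte_changed()/check_ptes_none() pattern already in the patch (a sketch,
not code from the series):

static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
{
        int i;

        if (nr_pages == 1)
                return vmf_pte_changed(vmf);

        /* For a large folio we expected the whole range to be pte_none(). */
        for (i = 0; i < nr_pages; i++) {
                if (!pte_none(ptep_get(vmf->pte + i)))
                        return true;
        }

        return false;
}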

>         ret = check_stable_address_space(vma->vm_mm);
> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>         }
>
> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> +       folio_ref_add(folio, pgcount - 1);
> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> +       folio_add_new_anon_rmap(folio, vma, addr);
>         folio_add_lru_vma(folio, vma);
> -setpte:
> +
>         if (uffd_wp)
>                 entry = pte_mkuffd_wp(entry);
> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);

We would have to do it one by one for now.

>         /* No need to invalidate - it was non-present before */
> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);

Ditto.

How about this (by moving mk_pte()  and its friends here):
...
        folio_add_lru_vma(folio, vma);

        for (i = 0; i < nr_pages; i++) {
                entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
                entry = pte_sw_mkyoung(entry);
                if (vma->vm_flags & VM_WRITE)
                        entry = pte_mkwrite(pte_mkdirty(entry));
setpte:
                if (uffd_wp)
                        entry = pte_mkuffd_wp(entry);
                set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i,
                           vmf->pte + i, entry);

                /* No need to invalidate - it was non-present before */
                update_mmu_cache(vma, vmf->address + PAGE_SIZE * i,
                                 vmf->pte + i);
        }

>  unlock:
>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>         return ret;

Attaching a small patch in case anything above is not clear. Please
take a look. Thanks.
  
Yin Fengwei July 4, 2023, 3:45 a.m. UTC | #4
On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
THP refers to huge pages, which are 2M in size. What we have here is not a
huge page, but I don't have a better name either.

> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.
> 
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; there is a long list of todos to make
> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> madvise ops, etc). These items will be tackled in subsequent patches.
> 
> When enabled, the preferred folio order is as returned by
> arch_wants_pte_order(), which may be overridden by the arch as it sees
> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
> contiguous set of ptes map physically contigious, naturally aligned
> memory, so this mechanism allows the architecture to optimize as
> required.
> 
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/Kconfig  |  10 ++++
>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 165 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..1c06b2c0a24e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>  	  support of file THPs will be developed in the next few release
>  	  cycles.
>  
> +config FLEXIBLE_THP
> +	bool "Flexible order THP"
> +	depends on TRANSPARENT_HUGEPAGE
> +	default n
> +	help
> +	  Use large (bigger than order-0) folios to back anonymous memory where
> +	  possible, even if the order of the folio is smaller than the PMD
> +	  order. This reduces the number of page faults, as well as other
> +	  per-page overheads to improve performance for many workloads.
> +
>  endif # TRANSPARENT_HUGEPAGE
>  
>  #
> diff --git a/mm/memory.c b/mm/memory.c
> index fb30f7523550..abe2ea94f3f5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_FLEXIBLE_THP
> +/*
> + * Allocates, zeros and returns a folio of the requested order for use as
> + * anonymous memory.
> + */
> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> +				      unsigned long addr, int order)
> +{
> +	gfp_t gfp;
> +	struct folio *folio;
> +
> +	if (order == 0)
> +		return vma_alloc_zeroed_movable_folio(vma, addr);
> +
> +	gfp = vma_thp_gfp_mask(vma);
> +	folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +	if (folio)
> +		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> +
> +	return folio;
> +}
> +
> +/*
> + * Preferred folio order to allocate for anonymous memory.
> + */
> +#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
> +#else
> +#define alloc_anon_folio(vma, addr, order) \
> +				vma_alloc_zeroed_movable_folio(vma, addr)
> +#define max_anon_folio_order(vma)	0
> +#endif
> +
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		if (!pte_none(ptep_get(pte++)))
> +			return i;
> +	}
> +
> +	return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> +	/*
> +	 * The aim here is to determine what size of folio we should allocate
> +	 * for this fault. Factors include:
> +	 * - Order must not be higher than `order` upon entry
> +	 * - Folio must be naturally aligned within VA space
> +	 * - Folio must be fully contained inside one pmd entry
> +	 * - Folio must not breach boundaries of vma
> +	 * - Folio must not overlap any non-none ptes
> +	 *
> +	 * Additionally, we do not allow order-1 since this breaks assumptions
> +	 * elsewhere in the mm; THP pages must be at least order-2 (since they
> +	 * store state up to the 3rd struct page subpage), and these pages must
> +	 * be THP in order to correctly use pre-existing THP infrastructure such
> +	 * as folio_split().
> +	 *
> +	 * Note that the caller may or may not choose to lock the pte. If
> +	 * unlocked, the result is racy and the user must re-check any overlap
> +	 * with non-none ptes under the lock.
> +	 */
> +
> +	struct vm_area_struct *vma = vmf->vma;
> +	int nr;
> +	unsigned long addr;
> +	pte_t *pte;
> +	pte_t *first_set = NULL;
> +	int ret;
> +
> +	order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> +	for (; order > 1; order--) {
> +		nr = 1 << order;
> +		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> +		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +		/* Check vma bounds. */
> +		if (addr < vma->vm_start ||
> +		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
> +			continue;
> +
> +		/* Ptes covered by order already known to be none. */
> +		if (pte + nr <= first_set)
> +			break;
> +
> +		/* Already found set pte in range covered by order. */
> +		if (pte <= first_set)
> +			continue;
> +
> +		/* Need to check if all the ptes are none. */
> +		ret = check_ptes_none(pte, nr);
> +		if (ret == nr)
> +			break;
> +
> +		first_set = pte + ret;
> +	}
> +
> +	if (order == 1)
> +		order = 0;
> +
> +	return order;
> +}
The logic in the above function that should be kept is the check that the
order fits within the vma range.

check_ptes_none() is not accurate here because the page table lock is not held
and a concurrent fault could happen. So maybe just drop the check here?
check_ptes_none() is done again after taking the page table lock anyway.

We would then pick either the arch preferred order or order 0.
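
A sketch of that simplified policy (illustrative only; the helper name is
hypothetical): use the arch preferred order when the naturally aligned folio
fits inside the vma, otherwise order 0, and leave the populated-pte check to
the locked path.

static int calc_anon_folio_order(struct vm_fault *vmf, int order)
{
        struct vm_area_struct *vma = vmf->vma;
        unsigned long nr = 1UL << order;
        unsigned long addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);

        /* Folio must be naturally aligned and fully inside the vma. */
        if (addr < vma->vm_start || addr + (nr << PAGE_SHIFT) > vma->vm_end)
                return 0;

        return order;
}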

> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>  		goto oom;
>  
>  	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> -		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +		new_folio = alloc_anon_folio(vma, vmf->address, 0);
>  		if (!new_folio)
>  			goto oom;
>  	} else {
> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	struct folio *folio;
>  	vm_fault_t ret = 0;
>  	pte_t entry;
> +	int order;
> +	int pgcount;
> +	unsigned long addr;
>  
>  	/* File mapping without ->vm_ops ? */
>  	if (vma->vm_flags & VM_SHARED)
> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>  		}
> -		goto setpte;
> +		if (uffd_wp)
> +			entry = pte_mkuffd_wp(entry);
> +		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +		/* No need to invalidate - it was non-present before */
> +		update_mmu_cache(vma, vmf->address, vmf->pte);
> +		goto unlock;
> +	}
> +
> +	/*
> +	 * If allocating a large folio, determine the biggest suitable order for
> +	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
> +	 * overlap with any populated PTEs, etc). We are not under the ptl here
> +	 * so we will need to re-check that we are not overlapping any populated
> +	 * PTEs once we have the lock.
> +	 */
> +	order = uffd_wp ? 0 : max_anon_folio_order(vma);
> +	if (order > 0) {
> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +		order = calc_anon_folio_order_alloc(vmf, order);
> +		pte_unmap(vmf->pte);
>  	}
>  
> -	/* Allocate our own private page. */
> +	/* Allocate our own private folio. */
>  	if (unlikely(anon_vma_prepare(vma)))
>  		goto oom;
> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +	folio = alloc_anon_folio(vma, vmf->address, order);
> +	if (!folio && order > 0) {
> +		order = 0;
> +		folio = alloc_anon_folio(vma, vmf->address, order);
> +	}
>  	if (!folio)
>  		goto oom;
>  
> +	pgcount = 1 << order;
> +	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> +
>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>  		goto oom_free_page;
>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>  
>  	/*
>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
> -	 * preceding stores to the page contents become visible before
> -	 * the set_pte_at() write.
> +	 * preceding stores to the folio contents become visible before
> +	 * the set_ptes() write.
>  	 */
>  	__folio_mark_uptodate(folio);
>  
> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
>  
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -			&vmf->ptl);
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	if (vmf_pte_changed(vmf)) {
>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>  		goto release;
> +	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
This could be the case where we allocated an order-4 folio and then find a
neighbouring PTE has been filled by a concurrent fault. Should we put the
current folio, fall back to order 0 and try again immediately (goto the
order-0 allocation instead of returning from this function, which would go
through the page fault path again)? A sketch of this flow follows the quoted
diff below.


Regards
Yin, Fengwei

> +		goto release;
>  	}
>  
>  	ret = check_stable_address_space(vma->vm_mm);
> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>  	}
>  
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
> +	folio_ref_add(folio, pgcount - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> +	folio_add_new_anon_rmap(folio, vma, addr);
>  	folio_add_lru_vma(folio, vma);
> -setpte:
> +
>  	if (uffd_wp)
>  		entry = pte_mkuffd_wp(entry);
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache(vma, vmf->address, vmf->pte);
> +	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	return ret;
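
A control-flow sketch of the immediate order-0 retry raised in the comment
above (illustrative only; the retry label is hypothetical, and the memcg
charge, swap throttling and counters would also need to be redone for the
replacement folio, which is elided here):

retry:
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
        if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
                /* Lost part of the range to a concurrent fault: retry at
                 * order 0 instead of returning and faulting again. */
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                folio_put(folio);

                order = 0;
                pgcount = 1;
                addr = vmf->address;
                folio = alloc_anon_folio(vma, addr, order);
                if (!folio)
                        goto oom;
                goto retry;
        }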
  
Ryan Roberts July 4, 2023, 2:08 p.m. UTC | #5
On 04/07/2023 02:35, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a specified order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; there is a long list of todos to make
>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
>> madvise ops, etc). These items will be tackled in subsequent patches.
>>
>> When enabled, the preferred folio order is as returned by
>> arch_wants_pte_order(), which may be overridden by the arch as it sees
>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
> 
> coalesce

ACK

> 
>> contiguous set of ptes map physically contigious, naturally aligned
> 
> contiguous

ACK

> 
>> memory, so this mechanism allows the architecture to optimize as
>> required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/Kconfig  |  10 ++++
>>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 165 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..1c06b2c0a24e 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>>           support of file THPs will be developed in the next few release
>>           cycles.
>>
>> +config FLEXIBLE_THP
>> +       bool "Flexible order THP"
>> +       depends on TRANSPARENT_HUGEPAGE
>> +       default n
> 
> The default value is already N.

Is there a coding standard for this? Personally I prefer to make it explicit.

> 
>> +       help
>> +         Use large (bigger than order-0) folios to back anonymous memory where
>> +         possible, even if the order of the folio is smaller than the PMD
>> +         order. This reduces the number of page faults, as well as other
>> +         per-page overheads to improve performance for many workloads.
>> +
>>  endif # TRANSPARENT_HUGEPAGE
>>
>>  #
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fb30f7523550..abe2ea94f3f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>         return 0;
>>  }
>>
>> +#ifdef CONFIG_FLEXIBLE_THP
>> +/*
>> + * Allocates, zeros and returns a folio of the requested order for use as
>> + * anonymous memory.
>> + */
>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
>> +                                     unsigned long addr, int order)
>> +{
>> +       gfp_t gfp;
>> +       struct folio *folio;
>> +
>> +       if (order == 0)
>> +               return vma_alloc_zeroed_movable_folio(vma, addr);
>> +
>> +       gfp = vma_thp_gfp_mask(vma);
>> +       folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +       if (folio)
>> +               clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
>> +
>> +       return folio;
>> +}
>> +
>> +/*
>> + * Preferred folio order to allocate for anonymous memory.
>> + */
>> +#define max_anon_folio_order(vma)      arch_wants_pte_order(vma)
>> +#else
>> +#define alloc_anon_folio(vma, addr, order) \
>> +                               vma_alloc_zeroed_movable_folio(vma, addr)
>> +#define max_anon_folio_order(vma)      0
>> +#endif
>> +
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < nr; i++) {
>> +               if (!pte_none(ptep_get(pte++)))
>> +                       return i;
>> +       }
>> +
>> +       return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>> +{
>> +       /*
>> +        * The aim here is to determine what size of folio we should allocate
>> +        * for this fault. Factors include:
>> +        * - Order must not be higher than `order` upon entry
>> +        * - Folio must be naturally aligned within VA space
>> +        * - Folio must be fully contained inside one pmd entry
>> +        * - Folio must not breach boundaries of vma
>> +        * - Folio must not overlap any non-none ptes
>> +        *
>> +        * Additionally, we do not allow order-1 since this breaks assumptions
>> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
>> +        * store state up to the 3rd struct page subpage), and these pages must
>> +        * be THP in order to correctly use pre-existing THP infrastructure such
>> +        * as folio_split().
>> +        *
>> +        * Note that the caller may or may not choose to lock the pte. If
>> +        * unlocked, the result is racy and the user must re-check any overlap
>> +        * with non-none ptes under the lock.
>> +        */
>> +
>> +       struct vm_area_struct *vma = vmf->vma;
>> +       int nr;
>> +       unsigned long addr;
>> +       pte_t *pte;
>> +       pte_t *first_set = NULL;
>> +       int ret;
>> +
>> +       order = min(order, PMD_SHIFT - PAGE_SHIFT);
>> +
>> +       for (; order > 1; order--) {
> 
> I'm not sure how we can justify this policy. As an initial step, it'd
> be a lot easier to sell if we only considered the order of
> arch_wants_pte_order() and the order 0.

My justification is in the cover letter; I see a performance regression (vs the
unpatched kernel) when using the policy you suggest. This policy performs much
better in my tests. (I'll reply directly to your follow-up questions in the
cover letter shortly).

What are your technical concerns about this approach? It is pretty lightweight
(I only touch each PTE once, regardless of the number of loops). If we have
strong technical reasons for reverting to the less performant approach then fair
enough, but I'd like to hear the rationale first.
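
For clarity, the alternative as I understand it would look roughly like the
below. This is just a sketch to frame the discussion; it reuses the helpers
from the patch and assumes vmf->pte is already mapped so that `pte` can point
at the first pte of the naturally aligned range:

	/* Sketch only: "arch preferred order or order 0", no intermediate fallback. */
	order = max_anon_folio_order(vma);
	nr = 1 << order;
	addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
	pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

	if (addr < vma->vm_start ||
	    addr + (nr << PAGE_SHIFT) > vma->vm_end ||
	    check_ptes_none(pte, nr) != nr)
		order = 0;

The question is whether walking down through the intermediate orders (as the
patch does) buys us anything over this all-or-nothing check; my benchmarks say
it does.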

> 
>> +               nr = 1 << order;
>> +               addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>> +               pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +               /* Check vma bounds. */
>> +               if (addr < vma->vm_start ||
>> +                   addr + (nr << PAGE_SHIFT) > vma->vm_end)
>> +                       continue;
>> +
>> +               /* Ptes covered by order already known to be none. */
>> +               if (pte + nr <= first_set)
>> +                       break;
>> +
>> +               /* Already found set pte in range covered by order. */
>> +               if (pte <= first_set)
>> +                       continue;
>> +
>> +               /* Need to check if all the ptes are none. */
>> +               ret = check_ptes_none(pte, nr);
>> +               if (ret == nr)
>> +                       break;
>> +
>> +               first_set = pte + ret;
>> +       }
>> +
>> +       if (order == 1)
>> +               order = 0;
>> +
>> +       return order;
>> +}
> 
> Everything above can be simplified into two helpers:
> vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you
> prefer). Details below.
> 
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>                 goto oom;
>>
>>         if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
>> -               new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +               new_folio = alloc_anon_folio(vma, vmf->address, 0);
> 
> This seems unnecessary for now. Later on, we could fill in an aligned
> area with multiple write-protected zero pages during a read fault and
> then replace them with a large folio here.

I don't have a strong opinion. I thought that it would be neater to use the same
API everywhere, but happy to revert.

> 
>>                 if (!new_folio)
>>                         goto oom;
>>         } else {
>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         struct folio *folio;
>>         vm_fault_t ret = 0;
>>         pte_t entry;
>> +       int order;
>> +       int pgcount;
>> +       unsigned long addr;
>>
>>         /* File mapping without ->vm_ops ? */
>>         if (vma->vm_flags & VM_SHARED)
>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>                         return handle_userfault(vmf, VM_UFFD_MISSING);
>>                 }
>> -               goto setpte;
>> +               if (uffd_wp)
>> +                       entry = pte_mkuffd_wp(entry);
>> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +
>> +               /* No need to invalidate - it was non-present before */
>> +               update_mmu_cache(vma, vmf->address, vmf->pte);
>> +               goto unlock;
>> +       }
> 
> Nor really needed IMO. Details below.
> 
> ===
> 
>> +       /*
>> +        * If allocating a large folio, determine the biggest suitable order for
>> +        * the VMA (e.g. it must not exceed the VMA's bounds, it must not
>> +        * overlap with any populated PTEs, etc). We are not under the ptl here
>> +        * so we will need to re-check that we are not overlapping any populated
>> +        * PTEs once we have the lock.
>> +        */
>> +       order = uffd_wp ? 0 : max_anon_folio_order(vma);
>> +       if (order > 0) {
>> +               vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> +               order = calc_anon_folio_order_alloc(vmf, order);
>> +               pte_unmap(vmf->pte);
>>         }
> 
> ===
> 
> The section above together with the section below should be wrapped in a helper.
> 
>> -       /* Allocate our own private page. */
>> +       /* Allocate our own private folio. */
>>         if (unlikely(anon_vma_prepare(vma)))
>>                 goto oom;
> 
> ===
> 
>> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +       folio = alloc_anon_folio(vma, vmf->address, order);
>> +       if (!folio && order > 0) {
>> +               order = 0;
>> +               folio = alloc_anon_folio(vma, vmf->address, order);
>> +       }
> 
> ===
> 
> One helper returns a folio of order arch_wants_pte_order(), or order 0
> if it fails to allocate that order, e.g.,
> 
> folio = alloc_anon_folio(vmf);
> 
> And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0
> regardless of arch_wants_pte_order(). Upon success, it can update
> vmf->address, since if we run into a race with another PF, we exit the
> fault handler and retry anyway.
> 
>>         if (!folio)
>>                 goto oom;
>>
>> +       pgcount = 1 << order;
>> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> 
> As shown above, the helper already updates vmf->address. And mm/ never
> used pgcount before -- the convention is nr_pages = folio_nr_pages().

ACK

> 
>>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>                 goto oom_free_page;
>>         folio_throttle_swaprate(folio, GFP_KERNEL);
>>
>>         /*
>>          * The memory barrier inside __folio_mark_uptodate makes sure that
>> -        * preceding stores to the page contents become visible before
>> -        * the set_pte_at() write.
>> +        * preceding stores to the folio contents become visible before
>> +        * the set_ptes() write.
> 
> We don't have set_ptes() yet.

Indeed, that's why I listed the set_ptes() patch set as a hard dependency ;-)

> 
>>          */
>>         __folio_mark_uptodate(folio);
>>
>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>         if (vma->vm_flags & VM_WRITE)
>>                 entry = pte_mkwrite(pte_mkdirty(entry));
>>
>> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -                       &vmf->ptl);
>> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>         if (vmf_pte_changed(vmf)) {
>>                 update_mmu_tlb(vma, vmf->address, vmf->pte);
>>                 goto release;
>> +       } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
>> +               goto release;
>>         }
> 
> Need new helper:
> 
>   if (vmf_pte_range_changed(vmf, nr_pages)) {
>     for (i = 0; i < nr_pages; i++)
>       update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i);
>     goto release;
>   }
> 
> (It should be fine to call update_mmu_tlb() even if it's not really necessary.)
> 
>>         ret = check_stable_address_space(vma->vm_mm);
>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>                 return handle_userfault(vmf, VM_UFFD_MISSING);
>>         }
>>
>> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +       folio_ref_add(folio, pgcount - 1);
>> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>> +       folio_add_new_anon_rmap(folio, vma, addr);
>>         folio_add_lru_vma(folio, vma);
>> -setpte:
>> +
>>         if (uffd_wp)
>>                 entry = pte_mkuffd_wp(entry);
>> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
> 
> We would have to do it one by one for now.
> 
>>         /* No need to invalidate - it was non-present before */
>> -       update_mmu_cache(vma, vmf->address, vmf->pte);
>> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
> 
> Ditto.
> 
> How about this (by moving mk_pte()  and its friends here):
> ...
>         folio_add_lru_vma(folio, vma);
> 
>         for (i = 0; i < nr_pages; i++) {
>                 entry = mk_pte(folio_page(folio, i), vma->vm_page_prot);
>                 entry = pte_sw_mkyoung(entry);
>                 if (vma->vm_flags & VM_WRITE)
>                         entry = pte_mkwrite(pte_mkdirty(entry));
> setpte:
>                 if (uffd_wp)
>                         entry = pte_mkuffd_wp(entry);
>                 set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i,
> vmf->pte + i, entry);
> 
>                 /* No need to invalidate - it was non-present before */
>                 update_mmu_cache(vma, vmf->address + PAGE_SIZE * i,
> vmf->pte + i);
>         }
> 
>>  unlock:
>>         pte_unmap_unlock(vmf->pte, vmf->ptl);
>>         return ret;
> 
> Attaching a small patch in case anything above is not clear. Please
> take a look. Thanks.

OK, I'll take a look and rework for v3.
  
Ryan Roberts July 4, 2023, 2:20 p.m. UTC | #6
On 04/07/2023 04:45, Yin, Fengwei wrote:
> 
> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> THP is for huge page which is 2M size. We are not huge page here. But
> I don't have good name either.

Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
base page, they are 512M. So huge pages already have a variable size. And they
sometimes get PTE-mapped. So can't we just think of this as an extension of the
THP feature?
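
(For reference, those sizes follow directly from a PMD's coverage being
PAGE_SIZE * PTRS_PER_PTE: 16K * 2048 = 32M, and 64K * 8192 = 512M.)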

> 
>> allocated in large folios of a specified order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>> defaults to disabled for now; there is a long list of todos to make
>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
>> madvise ops, etc). These items will be tackled in subsequent patches.
>>
>> When enabled, the preferred folio order is as returned by
>> arch_wants_pte_order(), which may be overridden by the arch as it sees
>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
>> contiguous set of ptes map physically contigious, naturally aligned
>> memory, so this mechanism allows the architecture to optimize as
>> required.
>>
>> If the preferred order can't be used (e.g. because the folio would
>> breach the bounds of the vma, or because ptes in the region are already
>> mapped) then we fall back to a suitable lower order.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/Kconfig  |  10 ++++
>>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 165 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 7672a22647b4..1c06b2c0a24e 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>>  	  support of file THPs will be developed in the next few release
>>  	  cycles.
>>  
>> +config FLEXIBLE_THP
>> +	bool "Flexible order THP"
>> +	depends on TRANSPARENT_HUGEPAGE
>> +	default n
>> +	help
>> +	  Use large (bigger than order-0) folios to back anonymous memory where
>> +	  possible, even if the order of the folio is smaller than the PMD
>> +	  order. This reduces the number of page faults, as well as other
>> +	  per-page overheads to improve performance for many workloads.
>> +
>>  endif # TRANSPARENT_HUGEPAGE
>>  
>>  #
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fb30f7523550..abe2ea94f3f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>  	return 0;
>>  }
>>  
>> +#ifdef CONFIG_FLEXIBLE_THP
>> +/*
>> + * Allocates, zeros and returns a folio of the requested order for use as
>> + * anonymous memory.
>> + */
>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
>> +				      unsigned long addr, int order)
>> +{
>> +	gfp_t gfp;
>> +	struct folio *folio;
>> +
>> +	if (order == 0)
>> +		return vma_alloc_zeroed_movable_folio(vma, addr);
>> +
>> +	gfp = vma_thp_gfp_mask(vma);
>> +	folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +	if (folio)
>> +		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
>> +
>> +	return folio;
>> +}
>> +
>> +/*
>> + * Preferred folio order to allocate for anonymous memory.
>> + */
>> +#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
>> +#else
>> +#define alloc_anon_folio(vma, addr, order) \
>> +				vma_alloc_zeroed_movable_folio(vma, addr)
>> +#define max_anon_folio_order(vma)	0
>> +#endif
>> +
>> +/*
>> + * Returns index of first pte that is not none, or nr if all are none.
>> + */
>> +static inline int check_ptes_none(pte_t *pte, int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; i++) {
>> +		if (!pte_none(ptep_get(pte++)))
>> +			return i;
>> +	}
>> +
>> +	return nr;
>> +}
>> +
>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>> +{
>> +	/*
>> +	 * The aim here is to determine what size of folio we should allocate
>> +	 * for this fault. Factors include:
>> +	 * - Order must not be higher than `order` upon entry
>> +	 * - Folio must be naturally aligned within VA space
>> +	 * - Folio must be fully contained inside one pmd entry
>> +	 * - Folio must not breach boundaries of vma
>> +	 * - Folio must not overlap any non-none ptes
>> +	 *
>> +	 * Additionally, we do not allow order-1 since this breaks assumptions
>> +	 * elsewhere in the mm; THP pages must be at least order-2 (since they
>> +	 * store state up to the 3rd struct page subpage), and these pages must
>> +	 * be THP in order to correctly use pre-existing THP infrastructure such
>> +	 * as folio_split().
>> +	 *
>> +	 * Note that the caller may or may not choose to lock the pte. If
>> +	 * unlocked, the result is racy and the user must re-check any overlap
>> +	 * with non-none ptes under the lock.
>> +	 */
>> +
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	int nr;
>> +	unsigned long addr;
>> +	pte_t *pte;
>> +	pte_t *first_set = NULL;
>> +	int ret;
>> +
>> +	order = min(order, PMD_SHIFT - PAGE_SHIFT);
>> +
>> +	for (; order > 1; order--) {
>> +		nr = 1 << order;
>> +		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>> +		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>> +
>> +		/* Check vma bounds. */
>> +		if (addr < vma->vm_start ||
>> +		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
>> +			continue;
>> +
>> +		/* Ptes covered by order already known to be none. */
>> +		if (pte + nr <= first_set)
>> +			break;
>> +
>> +		/* Already found set pte in range covered by order. */
>> +		if (pte <= first_set)
>> +			continue;
>> +
>> +		/* Need to check if all the ptes are none. */
>> +		ret = check_ptes_none(pte, nr);
>> +		if (ret == nr)
>> +			break;
>> +
>> +		first_set = pte + ret;
>> +	}
>> +
>> +	if (order == 1)
>> +		order = 0;
>> +
>> +	return order;
>> +}
> The logic in above function should be kept is whether the order fit in vma range.
> 
> check_ptes_none() is not accurate here because no page table lock hold and concurrent
> fault could happen. So may just drop the check here? Check_ptes_none() is done after
> take the page table lock.

I agree it is just an estimate given the lock is not held; the comment at the
top says the same. But I don't think we can wait until after the lock is taken
to measure this. We can't hold the lock while allocating the folio and we need a
guess at what to allocate. If we don't guess here, we will allocate the biggest,
then take the lock, see that it doesn't fit, and exit. Then the system will
re-fault and we will follow the exact same path, ending up in a livelock.
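
To spell out the ordering the patch relies on, condensed from the code above
(sketch only; `addr` is the aligned base as in the patch):

	/* 1) Estimate the largest order that fits, without the ptl (racy). The
	 *    patch maps vmf->pte around this call with pte_offset_map().
	 */
	order = calc_anon_folio_order_alloc(vmf, max_anon_folio_order(vma));

	/* 2) Allocate and zero the folio; this may sleep, so the ptl cannot be held. */
	folio = alloc_anon_folio(vma, vmf->address, order);

	/* 3) Take the ptl and re-check; if we raced, bail out and let the re-fault retry. */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
	if (check_ptes_none(vmf->pte, 1 << order) != (1 << order))
		goto release;

Without step 1 there is nothing to shrink the order, so a range with
already-populated neighbours would fail step 3 on every attempt and re-fault
down the same path indefinitely.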

> 
> We pick the arch prefered order or order 0 now.
> 
>> +
>>  /*
>>   * Handle write page faults for pages that can be reused in the current vma
>>   *
>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>  		goto oom;
>>  
>>  	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
>> -		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +		new_folio = alloc_anon_folio(vma, vmf->address, 0);
>>  		if (!new_folio)
>>  			goto oom;
>>  	} else {
>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	struct folio *folio;
>>  	vm_fault_t ret = 0;
>>  	pte_t entry;
>> +	int order;
>> +	int pgcount;
>> +	unsigned long addr;
>>  
>>  	/* File mapping without ->vm_ops ? */
>>  	if (vma->vm_flags & VM_SHARED)
>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>>  		}
>> -		goto setpte;
>> +		if (uffd_wp)
>> +			entry = pte_mkuffd_wp(entry);
>> +		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +
>> +		/* No need to invalidate - it was non-present before */
>> +		update_mmu_cache(vma, vmf->address, vmf->pte);
>> +		goto unlock;
>> +	}
>> +
>> +	/*
>> +	 * If allocating a large folio, determine the biggest suitable order for
>> +	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
>> +	 * overlap with any populated PTEs, etc). We are not under the ptl here
>> +	 * so we will need to re-check that we are not overlapping any populated
>> +	 * PTEs once we have the lock.
>> +	 */
>> +	order = uffd_wp ? 0 : max_anon_folio_order(vma);
>> +	if (order > 0) {
>> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> +		order = calc_anon_folio_order_alloc(vmf, order);
>> +		pte_unmap(vmf->pte);
>>  	}
>>  
>> -	/* Allocate our own private page. */
>> +	/* Allocate our own private folio. */
>>  	if (unlikely(anon_vma_prepare(vma)))
>>  		goto oom;
>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +	folio = alloc_anon_folio(vma, vmf->address, order);
>> +	if (!folio && order > 0) {
>> +		order = 0;
>> +		folio = alloc_anon_folio(vma, vmf->address, order);
>> +	}
>>  	if (!folio)
>>  		goto oom;
>>  
>> +	pgcount = 1 << order;
>> +	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
>> +
>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>  		goto oom_free_page;
>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>>  
>>  	/*
>>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
>> -	 * preceding stores to the page contents become visible before
>> -	 * the set_pte_at() write.
>> +	 * preceding stores to the folio contents become visible before
>> +	 * the set_ptes() write.
>>  	 */
>>  	__folio_mark_uptodate(folio);
>>  
>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  	if (vma->vm_flags & VM_WRITE)
>>  		entry = pte_mkwrite(pte_mkdirty(entry));
>>  
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	if (vmf_pte_changed(vmf)) {
>>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>>  		goto release;
>> +	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
> This could be the case that we allocated order 4 page and find a neighbor PTE is
> filled by concurrent fault. Should we put current folio and fallback to order 0
> and try again immedately (goto order 0 allocation instead of return from this
> function which will go through some page fault path again)?

That's how it worked in v1, but I had review comments from Yang Shi asking me to
re-fault instead. This approach is certainly cleaner from a code point of view.
And I expect races of that nature will be rare.

> 
> 
> Regards
> Yin, Fengwei
> 
>> +		goto release;
>>  	}
>>  
>>  	ret = check_stable_address_space(vma->vm_mm);
>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>  	}
>>  
>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +	folio_ref_add(folio, pgcount - 1);
>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>  	folio_add_lru_vma(folio, vma);
>> -setpte:
>> +
>>  	if (uffd_wp)
>>  		entry = pte_mkuffd_wp(entry);
>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>>  
>>  	/* No need to invalidate - it was non-present before */
>> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>> +	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>>  unlock:
>>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  	return ret;
  
Yin Fengwei July 4, 2023, 11:35 p.m. UTC | #7
On 7/4/23 22:20, Ryan Roberts wrote:
> On 04/07/2023 04:45, Yin, Fengwei wrote:
>>
>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> THP is for huge page which is 2M size. We are not huge page here. But
>> I don't have good name either.
> 
> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
> base page, they are 512M. So huge pages already have a variable size. And they
> sometimes get PTE-mapped. So can't we just think of this as an extension of the
> THP feature?
My understanding is that THP has several fixed sizes across different arches.
The 32K or 16K folios which could be picked here are not THP sizes.

> 
>>
>>> allocated in large folios of a specified order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management lru list management) are also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>> defaults to disabled for now; there is a long list of todos to make
>>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
>>> madvise ops, etc). These items will be tackled in subsequent patches.
>>>
>>> When enabled, the preferred folio order is as returned by
>>> arch_wants_pte_order(), which may be overridden by the arch as it sees
>>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
>>> contiguous set of ptes map physically contigious, naturally aligned
>>> memory, so this mechanism allows the architecture to optimize as
>>> required.
>>>
>>> If the preferred order can't be used (e.g. because the folio would
>>> breach the bounds of the vma, or because ptes in the region are already
>>> mapped) then we fall back to a suitable lower order.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  mm/Kconfig  |  10 ++++
>>>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
>>>  2 files changed, 165 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 7672a22647b4..1c06b2c0a24e 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
>>>  	  support of file THPs will be developed in the next few release
>>>  	  cycles.
>>>  
>>> +config FLEXIBLE_THP
>>> +	bool "Flexible order THP"
>>> +	depends on TRANSPARENT_HUGEPAGE
>>> +	default n
>>> +	help
>>> +	  Use large (bigger than order-0) folios to back anonymous memory where
>>> +	  possible, even if the order of the folio is smaller than the PMD
>>> +	  order. This reduces the number of page faults, as well as other
>>> +	  per-page overheads to improve performance for many workloads.
>>> +
>>>  endif # TRANSPARENT_HUGEPAGE
>>>  
>>>  #
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index fb30f7523550..abe2ea94f3f5 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
>>>  	return 0;
>>>  }
>>>  
>>> +#ifdef CONFIG_FLEXIBLE_THP
>>> +/*
>>> + * Allocates, zeros and returns a folio of the requested order for use as
>>> + * anonymous memory.
>>> + */
>>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
>>> +				      unsigned long addr, int order)
>>> +{
>>> +	gfp_t gfp;
>>> +	struct folio *folio;
>>> +
>>> +	if (order == 0)
>>> +		return vma_alloc_zeroed_movable_folio(vma, addr);
>>> +
>>> +	gfp = vma_thp_gfp_mask(vma);
>>> +	folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> +	if (folio)
>>> +		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
>>> +
>>> +	return folio;
>>> +}
>>> +
>>> +/*
>>> + * Preferred folio order to allocate for anonymous memory.
>>> + */
>>> +#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
>>> +#else
>>> +#define alloc_anon_folio(vma, addr, order) \
>>> +				vma_alloc_zeroed_movable_folio(vma, addr)
>>> +#define max_anon_folio_order(vma)	0
>>> +#endif
>>> +
>>> +/*
>>> + * Returns index of first pte that is not none, or nr if all are none.
>>> + */
>>> +static inline int check_ptes_none(pte_t *pte, int nr)
>>> +{
>>> +	int i;
>>> +
>>> +	for (i = 0; i < nr; i++) {
>>> +		if (!pte_none(ptep_get(pte++)))
>>> +			return i;
>>> +	}
>>> +
>>> +	return nr;
>>> +}
>>> +
>>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
>>> +{
>>> +	/*
>>> +	 * The aim here is to determine what size of folio we should allocate
>>> +	 * for this fault. Factors include:
>>> +	 * - Order must not be higher than `order` upon entry
>>> +	 * - Folio must be naturally aligned within VA space
>>> +	 * - Folio must be fully contained inside one pmd entry
>>> +	 * - Folio must not breach boundaries of vma
>>> +	 * - Folio must not overlap any non-none ptes
>>> +	 *
>>> +	 * Additionally, we do not allow order-1 since this breaks assumptions
>>> +	 * elsewhere in the mm; THP pages must be at least order-2 (since they
>>> +	 * store state up to the 3rd struct page subpage), and these pages must
>>> +	 * be THP in order to correctly use pre-existing THP infrastructure such
>>> +	 * as folio_split().
>>> +	 *
>>> +	 * Note that the caller may or may not choose to lock the pte. If
>>> +	 * unlocked, the result is racy and the user must re-check any overlap
>>> +	 * with non-none ptes under the lock.
>>> +	 */
>>> +
>>> +	struct vm_area_struct *vma = vmf->vma;
>>> +	int nr;
>>> +	unsigned long addr;
>>> +	pte_t *pte;
>>> +	pte_t *first_set = NULL;
>>> +	int ret;
>>> +
>>> +	order = min(order, PMD_SHIFT - PAGE_SHIFT);
>>> +
>>> +	for (; order > 1; order--) {
>>> +		nr = 1 << order;
>>> +		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
>>> +		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
>>> +
>>> +		/* Check vma bounds. */
>>> +		if (addr < vma->vm_start ||
>>> +		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
>>> +			continue;
>>> +
>>> +		/* Ptes covered by order already known to be none. */
>>> +		if (pte + nr <= first_set)
>>> +			break;
>>> +
>>> +		/* Already found set pte in range covered by order. */
>>> +		if (pte <= first_set)
>>> +			continue;
>>> +
>>> +		/* Need to check if all the ptes are none. */
>>> +		ret = check_ptes_none(pte, nr);
>>> +		if (ret == nr)
>>> +			break;
>>> +
>>> +		first_set = pte + ret;
>>> +	}
>>> +
>>> +	if (order == 1)
>>> +		order = 0;
>>> +
>>> +	return order;
>>> +}
>> The logic in above function should be kept is whether the order fit in vma range.
>>
>> check_ptes_none() is not accurate here because no page table lock hold and concurrent
>> fault could happen. So may just drop the check here? Check_ptes_none() is done after
>> take the page table lock.
> 
> I agree it is just an estimate given the lock is not held; the comment at the
> top says the same. But I don't think we can wait until after the lock is taken
> to measure this. We can't hold the lock while allocating the folio and we need a
> guess at what to allocate. If we don't guess here, we will allocate the biggest,
> then take the lock, see that it doesn't fit, and exit. Then the system will
> re-fault and we will follow the exact same path - ending up in live lock.
It will not if we try order 0 immediately. But see my comments on the refault below.
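
Something like this in do_anonymous_page() is what I had in mind (sketch only;
the retry label is hypothetical and would sit just before the folio
allocation):

	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
		/* Raced with another fault: drop the large folio, retry at order 0. */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		folio_put(folio);
		order = 0;
		goto retry;
	}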

> 
>>
>> We pick the arch prefered order or order 0 now.
>>
>>> +
>>>  /*
>>>   * Handle write page faults for pages that can be reused in the current vma
>>>   *
>>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>>  		goto oom;
>>>  
>>>  	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
>>> -		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> +		new_folio = alloc_anon_folio(vma, vmf->address, 0);
>>>  		if (!new_folio)
>>>  			goto oom;
>>>  	} else {
>>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  	struct folio *folio;
>>>  	vm_fault_t ret = 0;
>>>  	pte_t entry;
>>> +	int order;
>>> +	int pgcount;
>>> +	unsigned long addr;
>>>  
>>>  	/* File mapping without ->vm_ops ? */
>>>  	if (vma->vm_flags & VM_SHARED)
>>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>  			return handle_userfault(vmf, VM_UFFD_MISSING);
>>>  		}
>>> -		goto setpte;
>>> +		if (uffd_wp)
>>> +			entry = pte_mkuffd_wp(entry);
>>> +		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>> +
>>> +		/* No need to invalidate - it was non-present before */
>>> +		update_mmu_cache(vma, vmf->address, vmf->pte);
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If allocating a large folio, determine the biggest suitable order for
>>> +	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
>>> +	 * overlap with any populated PTEs, etc). We are not under the ptl here
>>> +	 * so we will need to re-check that we are not overlapping any populated
>>> +	 * PTEs once we have the lock.
>>> +	 */
>>> +	order = uffd_wp ? 0 : max_anon_folio_order(vma);
>>> +	if (order > 0) {
>>> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>>> +		order = calc_anon_folio_order_alloc(vmf, order);
>>> +		pte_unmap(vmf->pte);
>>>  	}
>>>  
>>> -	/* Allocate our own private page. */
>>> +	/* Allocate our own private folio. */
>>>  	if (unlikely(anon_vma_prepare(vma)))
>>>  		goto oom;
>>> -	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> +	folio = alloc_anon_folio(vma, vmf->address, order);
>>> +	if (!folio && order > 0) {
>>> +		order = 0;
>>> +		folio = alloc_anon_folio(vma, vmf->address, order);
>>> +	}
>>>  	if (!folio)
>>>  		goto oom;
>>>  
>>> +	pgcount = 1 << order;
>>> +	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
>>> +
>>>  	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>>  		goto oom_free_page;
>>>  	folio_throttle_swaprate(folio, GFP_KERNEL);
>>>  
>>>  	/*
>>>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
>>> -	 * preceding stores to the page contents become visible before
>>> -	 * the set_pte_at() write.
>>> +	 * preceding stores to the folio contents become visible before
>>> +	 * the set_ptes() write.
>>>  	 */
>>>  	__folio_mark_uptodate(folio);
>>>  
>>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  	if (vma->vm_flags & VM_WRITE)
>>>  		entry = pte_mkwrite(pte_mkdirty(entry));
>>>  
>>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>> -			&vmf->ptl);
>>> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>>  	if (vmf_pte_changed(vmf)) {
>>>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>  		goto release;
>>> +	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
>> This could be the case that we allocated order 4 page and find a neighbor PTE is
>> filled by concurrent fault. Should we put current folio and fallback to order 0
>> and try again immedately (goto order 0 allocation instead of return from this
>> function which will go through some page fault path again)?
> 
> That's how it worked in v1, but I had review comments from Yang Shi asking me to
> re-fault instead. This approach is certainly cleaner from a code point of view.
> And I expect races of that nature will be rare.
I must have missed that discussion in v1. My bad; I should have jumped into that
discussion. So I will drop my comment here even though I still think we should
avoid the refault. I don't want the comments to go back and forth.


Regards
Yin, Fengwei

> 
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>> +		goto release;
>>>  	}
>>>  
>>>  	ret = check_stable_address_space(vma->vm_mm);
>>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>  		return handle_userfault(vmf, VM_UFFD_MISSING);
>>>  	}
>>>  
>>> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> -	folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> +	folio_ref_add(folio, pgcount - 1);
>>> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
>>> +	folio_add_new_anon_rmap(folio, vma, addr);
>>>  	folio_add_lru_vma(folio, vma);
>>> -setpte:
>>> +
>>>  	if (uffd_wp)
>>>  		entry = pte_mkuffd_wp(entry);
>>> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>>>  
>>>  	/* No need to invalidate - it was non-present before */
>>> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>>> +	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>>>  unlock:
>>>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>  	return ret;
>
  
Yu Zhao July 4, 2023, 11:47 p.m. UTC | #8
On Tue, Jul 4, 2023 at 8:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/07/2023 02:35, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> >> allocated in large folios of a specified order. All pages of the large
> >> folio are pte-mapped during the same page fault, significantly reducing
> >> the number of page faults. The number of per-page operations (e.g. ref
> >> counting, rmap management lru list management) are also significantly
> >> reduced since those ops now become per-folio.
> >>
> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> >> defaults to disabled for now; there is a long list of todos to make
> >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> >> madvise ops, etc). These items will be tackled in subsequent patches.
> >>
> >> When enabled, the preferred folio order is as returned by
> >> arch_wants_pte_order(), which may be overridden by the arch as it sees
> >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
> >
> > coalesce
>
> ACK
>
> >
> >> contiguous set of ptes map physically contigious, naturally aligned
> >
> > contiguous
>
> ACK
>
> >
> >> memory, so this mechanism allows the architecture to optimize as
> >> required.
> >>
> >> If the preferred order can't be used (e.g. because the folio would
> >> breach the bounds of the vma, or because ptes in the region are already
> >> mapped) then we fall back to a suitable lower order.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/Kconfig  |  10 ++++
> >>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >>  2 files changed, 165 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 7672a22647b4..1c06b2c0a24e 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
> >>           support of file THPs will be developed in the next few release
> >>           cycles.
> >>
> >> +config FLEXIBLE_THP
> >> +       bool "Flexible order THP"
> >> +       depends on TRANSPARENT_HUGEPAGE
> >> +       default n
> >
> > The default value is already N.
>
> Is there a coding standard for this? Personally I prefer to make it explicit.
>
> >
> >> +       help
> >> +         Use large (bigger than order-0) folios to back anonymous memory where
> >> +         possible, even if the order of the folio is smaller than the PMD
> >> +         order. This reduces the number of page faults, as well as other
> >> +         per-page overheads to improve performance for many workloads.
> >> +
> >>  endif # TRANSPARENT_HUGEPAGE
> >>
> >>  #
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index fb30f7523550..abe2ea94f3f5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
> >>         return 0;
> >>  }
> >>
> >> +#ifdef CONFIG_FLEXIBLE_THP
> >> +/*
> >> + * Allocates, zeros and returns a folio of the requested order for use as
> >> + * anonymous memory.
> >> + */
> >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> >> +                                     unsigned long addr, int order)
> >> +{
> >> +       gfp_t gfp;
> >> +       struct folio *folio;
> >> +
> >> +       if (order == 0)
> >> +               return vma_alloc_zeroed_movable_folio(vma, addr);
> >> +
> >> +       gfp = vma_thp_gfp_mask(vma);
> >> +       folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >> +       if (folio)
> >> +               clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> >> +
> >> +       return folio;
> >> +}
> >> +
> >> +/*
> >> + * Preferred folio order to allocate for anonymous memory.
> >> + */
> >> +#define max_anon_folio_order(vma)      arch_wants_pte_order(vma)
> >> +#else
> >> +#define alloc_anon_folio(vma, addr, order) \
> >> +                               vma_alloc_zeroed_movable_folio(vma, addr)
> >> +#define max_anon_folio_order(vma)      0
> >> +#endif
> >> +
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >> +{
> >> +       /*
> >> +        * The aim here is to determine what size of folio we should allocate
> >> +        * for this fault. Factors include:
> >> +        * - Order must not be higher than `order` upon entry
> >> +        * - Folio must be naturally aligned within VA space
> >> +        * - Folio must be fully contained inside one pmd entry
> >> +        * - Folio must not breach boundaries of vma
> >> +        * - Folio must not overlap any non-none ptes
> >> +        *
> >> +        * Additionally, we do not allow order-1 since this breaks assumptions
> >> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> >> +        * store state up to the 3rd struct page subpage), and these pages must
> >> +        * be THP in order to correctly use pre-existing THP infrastructure such
> >> +        * as folio_split().
> >> +        *
> >> +        * Note that the caller may or may not choose to lock the pte. If
> >> +        * unlocked, the result is racy and the user must re-check any overlap
> >> +        * with non-none ptes under the lock.
> >> +        */
> >> +
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int nr;
> >> +       unsigned long addr;
> >> +       pte_t *pte;
> >> +       pte_t *first_set = NULL;
> >> +       int ret;
> >> +
> >> +       order = min(order, PMD_SHIFT - PAGE_SHIFT);
> >> +
> >> +       for (; order > 1; order--) {
> >
> > I'm not sure how we can justify this policy. As an initial step, it'd
> > be a lot easier to sell if we only considered the order of
> > arch_wants_pte_order() and the order 0.
>
> My justification is in the cover letter; I see performance regression (vs the
> unpatched kernel) when using the policy you suggest. This policy performs much
> better in my tests. (I'll reply directly to your follow up questions in the
> cover letter shortly).
>
> What are your technical concerns about this approach? It is pretty light weight
> (I only touch each PTE once, regardless of the number of loops). If we have
> strong technical reasons for reverting to the less performant approach then fair
> enough, but I'd like to hear the rational first.

Yes, mainly from three different angles:
1. The engineering principle: we'd want to separate the mechanical
part and the policy part when attacking something large. This way it'd
be easier to root cause any regressions if they happen. In our case,
assuming the regression is real, it might actually prove my point
here: I really don't think the two checks (whether a vma range fits and,
if it does, whether all 64 PTEs are none, which your description suggests
is unlikely) caused the regression. My theory is that 64KB itself
caused the regression, but smaller sizes made an improvement. If this
is really the case, I'd say the fallback policy masked the real
problem, which is that 64KB is too large to begin with.
2. The benchmark methodology: I appreciate your effort in doing it,
but we also need to consider that the setup is an uncommon scenario.
The common scenarios are devices that have been running for weeks
without reboots, generally having higher external fragmentation. In
addition, for client devices, they are often under memory pressure,
which makes fragmentation worse. So we should take the result with a
grain of salt, and the same goes for any results taken right after fresh
reboots.
3. The technical concern: an ideal policy would consider all three
major factors: the h/w features, userspace behaviors and the page
allocator behavior. So far we only have the first one handy. The
second one is too challenging, so let's forget about it for now. The
third one is why I really don't like this best-fit policy. By falling
back to smaller orders, we can waste a limited number of physically
contiguous pages on wrong vmas (small vmas only), leading to failures
to serve large vmas which otherwise would have a higher overall ROI.
This can only be addressed within the page allocator: we need to
enlighten it to return the highest order available, i.e., not breaking
up any higher orders.

I'm not really saying we should never try this fallback policy. I'm
just thinking we can leave it for later, probably after we've
addressed all the concerns with basic functionality.
  
Matthew Wilcox July 4, 2023, 11:57 p.m. UTC | #9
On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote:
> On 04/07/2023 04:45, Yin, Fengwei wrote:
> > 
> > On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> > THP is for huge page which is 2M size. We are not huge page here. But
> > I don't have good name either.
> 
> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
> base page, they are 512M. So huge pages already have a variable size. And they
> sometimes get PTE-mapped. So can't we just think of this as an extension of the
> THP feature?

The confusing thing is that we have counters for the number of THP
allocated (and number of THP mapped), and for those we always use
PMD-size folios.

If we must have a config option, then this is ANON_LARGE_FOLIOS.

But why do we need a config option?  We don't have one for the
page cache, and we're better off for it.  Yes, it depends on
CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental
heritage, and it'd be great to do away with that dependency eventually.

Hardware support isn't needed.  Large folios benefit us from a software
point of view.  If we need a chicken bit, we can edit the source code
to not create anon folios larger than order 0.
  
Ryan Roberts July 5, 2023, 9:54 a.m. UTC | #10
On 05/07/2023 00:57, Matthew Wilcox wrote:
> On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote:
>> On 04/07/2023 04:45, Yin, Fengwei wrote:
>>>
>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote:
>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>> THP is for huge page which is 2M size. We are not huge page here. But
>>> I don't have good name either.
>>
>> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K
>> base page, they are 512M. So huge pages already have a variable size. And they
>> sometimes get PTE-mapped. So can't we just think of this as an extension of the
>> THP feature?
> 
> The confusing thing is that we have counters for the number of THP
> allocated (and number of THP mapped), and for those we always use
> PMD-size folios.

OK fair point. I really don't have a strong opinion on the name - I changed it
from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm
happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the
consensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep
this name for now and focus on converging the actual implementation to something
that is agreeable. Once we are there, we can argue about the name.

> 
> If we must have a config option, then this is ANON_LARGE_FOLIOS.
> 
> But why do we need a config option?  We don't have one for the
> page cache, and we're better off for it.  Yes, it depends on
> CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental
> heritage, and it'd be great to do away with that dependency eventually.
> 
> Hardware support isn't needed.  Large folios benefit us from a software
> point of view.  if we need a chicken bit, we can edit the source code
> to not create anon folios larger than order 0.

From my PoV it's about managing risk; there are currently parts of the mm that
will interact poorly with large pte-mapped folios (madvise, compaction, ...). We
want to incrementally fix that stuff, but until it's all fixed, we can't deploy
this as always-on. Further down the line when things are more complete and there
is more test coverage, we could remove the Kconfig or default it to enabled.
  
Matthew Wilcox July 5, 2023, 12:08 p.m. UTC | #11
On Wed, Jul 05, 2023 at 10:54:30AM +0100, Ryan Roberts wrote:
> On 05/07/2023 00:57, Matthew Wilcox wrote:
> > The confusing thing is that we have counters for the number of THP
> > allocated (and number of THP mapped), and for those we always use
> > PMD-size folios.
> 
> OK fair point. I really don't have a strong opinion on the name - I changed it
> from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm
> happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the
> concensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep
> this name for now and focus on converging the actual implementation to something
> that is agreeable. Once we are there, we can argue about the name.

I didn't see Yu arguing for changing the name of the config options,
just having far fewer of them.

> > If we must have a config option, then this is ANON_LARGE_FOLIOS.
> > 
> > But why do we need a config option?  We don't have one for the
> > page cache, and we're better off for it.  Yes, it depends on
> > CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental
> > heritage, and it'd be great to do away with that dependency eventually.
> > 
> > Hardware support isn't needed.  Large folios benefit us from a software
> > point of view.  if we need a chicken bit, we can edit the source code
> > to not create anon folios larger than order 0.
> 
> From my PoV it's about managing risk; there are currently parts of the mm that
> will interact poorly with large pte-mapped folios (madvise, compaction, ...). We
> want to incrementally fix that stuff, but until it's all fixed, we can't deploy
> this as always-on. Further down the line when things are more complete and there
> is more test coverage, we could remove the Kconfig or default it to enabled.

We have to fix those places with the bad interactions, not merge a
Kconfig option that lets you turn it on to experiment.  That's how you
get a bad reputation and advice to disable a config option.  We had that
for years with CONFIG_TRANSPARENT_HUGEPAGE; people tried it out early on,
found the performance problems, and all these years later we still have
articles being published that say to turn it off.

By all means, we can have a golden patchset that we all agree is the
one to use for finding problems, and we can merge the pre-enabling work
"We don't have large anonymous folios yet, but when we do, this will
need to iterate over each page in the folio".
  
Huang, Ying July 7, 2023, 8:01 a.m. UTC | #12
Ryan Roberts <ryan.roberts@arm.com> writes:

> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management lru list management) are also significantly
> reduced since those ops now become per-folio.

I like the idea of sharing as much code as possible between large
(anonymous) folios and THP.  Finally, THP becomes just a special kind of
large folio.

Although we can use smaller page order for FLEXIBLE_THP, it's hard to
avoid internal fragmentation completely.  So, I think that eventually we
will need to provide a mechanism for the users to opt out, e.g.,
something like "always madvise never" via
/sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
a good idea to reuse the existing interface of THP.

Best Regards,
Huang, Ying
  
Ryan Roberts July 7, 2023, 9:52 a.m. UTC | #13
On 07/07/2023 09:01, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>> allocated in large folios of a specified order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
> 
> I likes the idea to share as much code as possible between large
> (anonymous) folio and THP.  Finally, THP becomes just a special kind of
> large folio.
> 
> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
> avoid internal fragmentation completely.  So, I think that finally we
> will need to provide a mechanism for the users to opt out, e.g.,
> something like "always madvise never" via
> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
> a good idea to reuse the existing interface of THP.

I wouldn't want to tie this to the existing interface, simply because that
implies that we would want to follow the "always" and "madvise" advice too; that
means that on a thp=madvise system (which is certainly the case for android and
other client systems) we would have to disable large anon folios for VMAs that
haven't explicitly opted in. That breaks the intention that this should be an
invisible performance boost. I think it's important to set the policy for use of
THP separately from the policy for use of large anon folios.

I could be persuaded on the merits of a new runtime enable/disable interface if
there is consensus.

> 
> Best Regards,
> Huang, Ying
  
David Hildenbrand July 7, 2023, 11:29 a.m. UTC | #14
On 07.07.23 11:52, Ryan Roberts wrote:
> On 07/07/2023 09:01, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>> allocated in large folios of a specified order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management lru list management) are also significantly
>>> reduced since those ops now become per-folio.
>>
>> I likes the idea to share as much code as possible between large
>> (anonymous) folio and THP.  Finally, THP becomes just a special kind of
>> large folio.
>>
>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>> avoid internal fragmentation completely.  So, I think that finally we
>> will need to provide a mechanism for the users to opt out, e.g.,
>> something like "always madvise never" via
>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>> a good idea to reuse the existing interface of THP.
> 
> I wouldn't want to tie this to the existing interface, simply because that
> implies that we would want to follow the "always" and "madvise" advice too; That
> means that on a thp=madvise system (which is certainly the case for android and
> other client systems) we would have to disable large anon folios for VMAs that
> haven't explicitly opted in. That breaks the intention that this should be an
> invisible performance boost. I think it's important to set the policy for use of

It will never ever be a completely invisible performance boost, just 
like ordinary THP.

Using the exact same existing toggle is the right thing to do. If 
someone specifies "never" or "madvise", then do exactly that.

It might make sense to have more modes or additional toggles, but 
"madvise=never" means no memory waste.


I remember I raised it already in the past, but you *absolutely* have to 
respect the MADV_NOHUGEPAGE flag. There is user space out there (for 
example, userfaultfd) that doesn't want the kernel to populate any 
additional page tables. So if you have to respect that already, then 
also respect MADV_HUGEPAGE, simple.
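
Concretely, I'd expect at least a gate along these lines when picking the
order (sketch only):

	/* MADV_NOHUGEPAGE sets VM_NOHUGEPAGE on the VMA; never exceed order 0 there. */
	if (vma->vm_flags & VM_NOHUGEPAGE)
		order = 0;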

> THP separately to use of large anon folios.
> 
> I could be persuaded on the merrits of a new runtime enable/disable interface if
> there is concensus.

There would have to be a very good reason for a completely separate 
control. Bypassing MADV_NOHUGEPAGE or "madvise=never" simply because we 
add a "flexible" before the THP sounds broken.
  
Matthew Wilcox July 7, 2023, 1:57 p.m. UTC | #15
On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
> On 07.07.23 11:52, Ryan Roberts wrote:
> > On 07/07/2023 09:01, Huang, Ying wrote:
> > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to
> > > avoid internal fragmentation completely.  So, I think that finally we
> > > will need to provide a mechanism for the users to opt out, e.g.,
> > > something like "always madvise never" via
> > > /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
> > > a good idea to reuse the existing interface of THP.
> > 
> > I wouldn't want to tie this to the existing interface, simply because that
> > implies that we would want to follow the "always" and "madvise" advice too; That
> > means that on a thp=madvise system (which is certainly the case for android and
> > other client systems) we would have to disable large anon folios for VMAs that
> > haven't explicitly opted in. That breaks the intention that this should be an
> > invisible performance boost. I think it's important to set the policy for use of
> 
> It will never ever be a completely invisible performance boost, just like
> ordinary THP.
> 
> Using the exact same existing toggle is the right thing to do. If someone
> specify "never" or "madvise", then do exactly that.
> 
> It might make sense to have more modes or additional toggles, but
> "madvise=never" means no memory waste.

I hate the existing mechanisms.  They are an abdication of our
responsibility, and an attempt to blame the users of our code (be they
sysadmins or programmers) for using it wrongly.  We should not
replicate this mistake.

Our code should be auto-tuning.  I posted a long, detailed outline here:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

> I remember I raised it already in the past, but you *absolutely* have to
> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
> example, userfaultfd) that doesn't want the kernel to populate any
> additional page tables. So if you have to respect that already, then also
> respect MADV_HUGEPAGE, simple.

Possibly having uffd enabled on a VMA should disable using large folios;
I can get behind that.  But the notion that userspace knows what it's
doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
know what it's doing.
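
For the uffd case, something along these lines would do it (sketch):

	/* Don't populate extra ptes behind userfaultfd's back. */
	if (userfaultfd_armed(vma))
		order = 0;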
  
David Hildenbrand July 7, 2023, 2:07 p.m. UTC | #16
On 07.07.23 15:57, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>> On 07.07.23 11:52, Ryan Roberts wrote:
>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>> something like "always madvise never" via
>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>> a good idea to reuse the existing interface of THP.
>>>
>>> I wouldn't want to tie this to the existing interface, simply because that
>>> implies that we would want to follow the "always" and "madvise" advice too; That
>>> means that on a thp=madvise system (which is certainly the case for android and
>>> other client systems) we would have to disable large anon folios for VMAs that
>>> haven't explicitly opted in. That breaks the intention that this should be an
>>> invisible performance boost. I think it's important to set the policy for use of
>>
>> It will never ever be a completely invisible performance boost, just like
>> ordinary THP.
>>
>> Using the exact same existing toggle is the right thing to do. If someone
>> specify "never" or "madvise", then do exactly that.
>>
>> It might make sense to have more modes or additional toggles, but
>> "madvise=never" means no memory waste.
> 
> I hate the existing mechanisms.  They are an abdication of our
> responsibility, and an attempt to blame the user (be it the sysadmin
> or the programmer) of our code for using it wrongly.  We should not
> replicate this mistake.

I don't agree regarding the programmer responsibility. In some cases the 
programmer really doesn't want to get more memory populated than 
requested -- and knows exactly why setting MADV_NOHUGEPAGE is the right 
thing to do.

Regarding the madvise=never/madvise/always (sys admin decision), memory 
waste (and nailing down bugs or working around them in customer setups) 
have been very good reasons to let the admin have a word.

> 
> Our code should be auto-tuning.  I posted a long, detailed outline here:
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
> 

Well, "auto-tuning" also should be perfect for everybody, but once 
reality strikes you know it isn't.

If people don't feel like using THP, let them have a word. The "madvise" 
config option is probably more controversial. But the "always vs. never" 
absolutely makes sense to me.

>> I remember I raised it already in the past, but you *absolutely* have to
>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>> example, userfaultfd) that doesn't want the kernel to populate any
>> additional page tables. So if you have to respect that already, then also
>> respect MADV_HUGEPAGE, simple.
> 
> Possibly having uffd enabled on a VMA should disable using large folios,

There are cases where we enable uffd *after* already touching memory 
(postcopy live migration in QEMU being the famous example). That doesn't 
fly.

> I can get behind that.  But the notion that userspace knows what it's
> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
> know what it's doing.

If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing 
... in some cases. And these include cases I care about messing with 
sparse VM memory :)

I have strong opinions against populating more than required when user 
space sets MADV_NOHUGEPAGE.
  
Ryan Roberts July 7, 2023, 3:13 p.m. UTC | #17
On 07/07/2023 15:07, David Hildenbrand wrote:
> On 07.07.23 15:57, Matthew Wilcox wrote:
>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>> something like "always madvise never" via
>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>> a good idea to reuse the existing interface of THP.
>>>>
>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>> That
>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>> invisible performance boost. I think it's important to set the policy for
>>>> use of
>>>
>>> It will never ever be a completely invisible performance boost, just like
>>> ordinary THP.
>>>
>>> Using the exact same existing toggle is the right thing to do. If someone
>>> specify "never" or "madvise", then do exactly that.
>>>
>>> It might make sense to have more modes or additional toggles, but
>>> "madvise=never" means no memory waste.
>>
>> I hate the existing mechanisms.  They are an abdication of our
>> responsibility, and an attempt to blame the user (be it the sysadmin
>> or the programmer) of our code for using it wrongly.  We should not
>> replicate this mistake.
> 
> I don't agree regarding the programmer responsibility. In some cases the
> programmer really doesn't want to get more memory populated than requested --
> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
> 
> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
> (and nailing down bugs or working around them in customer setups) have been very
> good reasons to let the admin have a word.
> 
>>
>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>
> 
> Well, "auto-tuning" also should be perfect for everybody, but once reality
> strikes you know it isn't.
> 
> If people don't feel like using THP, let them have a word. The "madvise" config
> option is probably more controversial. But the "always vs. never" absolutely
> makes sense to me.
> 
>>> I remember I raised it already in the past, but you *absolutely* have to
>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>> example, userfaultfd) that doesn't want the kernel to populate any
>>> additional page tables. So if you have to respect that already, then also
>>> respect MADV_HUGEPAGE, simple.
>>
>> Possibly having uffd enabled on a VMA should disable using large folios,
> 
> There are cases where we enable uffd *after* already touching memory (postcopy
> live migration in QEMU being the famous example). That doesn't fly.
> 
>> I can get behind that.  But the notion that userspace knows what it's
>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>> know what it's doing.
> 
> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
> some cases. And these include cases I care about messing with sparse VM memory :)
> 
> I have strong opinions against populating more than required when user space set
> MADV_NOHUGEPAGE.

I can see your point about honouring MADV_NOHUGEPAGE, so I think that it is
reasonable to fall back to allocating an order-0 page in a VMA that has it set.
The app has gone out of its way to explicitly set it, after all.

I think the correct behaviour for the global thp controls (cmdline and sysfs)
is less obvious though. I could get on board with disabling large anon folios
globally when thp="never". But for other situations, I would prefer to keep
large anon folios enabled (treat "madvise" as "always"), with the argument that
their order is much smaller than traditional THP and therefore the internal
fragmentation is significantly reduced. I really don't want to end up with user
space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
anon folios.

I still feel that it would be better for the thp and large anon folio controls
to be independent though - what's the argument for tying them together?

>
  
David Hildenbrand July 7, 2023, 4:06 p.m. UTC | #18
On 07.07.23 17:13, Ryan Roberts wrote:
> On 07/07/2023 15:07, David Hildenbrand wrote:
>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>> something like "always madvise never" via
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>> a good idea to reuse the existing interface of THP.
>>>>>
>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>> That
>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>> invisible performance boost. I think it's important to set the policy for
>>>>> use of
>>>>
>>>> It will never ever be a completely invisible performance boost, just like
>>>> ordinary THP.
>>>>
>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>> specify "never" or "madvise", then do exactly that.
>>>>
>>>> It might make sense to have more modes or additional toggles, but
>>>> "madvise=never" means no memory waste.
>>>
>>> I hate the existing mechanisms.  They are an abdication of our
>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>> or the programmer) of our code for using it wrongly.  We should not
>>> replicate this mistake.
>>
>> I don't agree regarding the programmer responsibility. In some cases the
>> programmer really doesn't want to get more memory populated than requested --
>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>
>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>> (and nailing down bugs or working around them in customer setups) have been very
>> good reasons to let the admin have a word.
>>
>>>
>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>
>>
>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>> strikes you know it isn't.
>>
>> If people don't feel like using THP, let them have a word. The "madvise" config
>> option is probably more controversial. But the "always vs. never" absolutely
>> makes sense to me.
>>
>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>> additional page tables. So if you have to respect that already, then also
>>>> respect MADV_HUGEPAGE, simple.
>>>
>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>
>> There are cases where we enable uffd *after* already touching memory (postcopy
>> live migration in QEMU being the famous example). That doesn't fly.
>>
>>> I can get behind that.  But the notion that userspace knows what it's
>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>> know what it's doing.
>>
>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>> some cases. And these include cases I care about messing with sparse VM memory :)
>>
>> I have strong opinions against populating more than required when user space set
>> MADV_NOHUGEPAGE.
> 
> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
> The app has gone out of its way to explicitly set it, after all.
> 
> I think the correct behaviour for the global thp controls (cmdline and sysfs)
> are less obvious though. I could get on board with disabling large anon folios
> globally when thp="never". But for other situations, I would prefer to keep
> large anon folios enabled (treat "madvise" as "always"), with the argument that
> their order is much smaller than traditional THP and therefore the internal
> fragmentation is significantly reduced. I really don't want to end up with user
> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
> anon folios.

I was briefly playing with a nasty idea of an additional "madvise-pmd" 
option (that could be the new default), that would use PMD THP only in 
madvise'd regions, and ordinary everywhere else. But let's disregard 
that for now. I think there is a bigger issue (below).

> 
> I still feel that it would be better for the thp and large anon folio controls
> to be independent though - what's the argument for tying them together?

Thinking about a desired 2 MiB flexible THP on aarch64 (64k kernel) vs. a 
2 MiB PMD THP on aarch64 (4k kernel), how are they any different? Just the 
way they are mapped ...

It's easy to say "64k vs. 2 MiB" is a difference and we want separate 
controls, but how is "2 MiB vs. 2 MiB" different?

That said, I think we have to make up our mind how much control 
we want to give user space. Again, the "2 MiB vs. 2 MiB" case nicely 
shows that it's not trivial: memory waste is a real issue on some 
systems where we limit THP to madvise().


Just throwing it out for discussing:

What about keeping the "all / madvise / never" semantics (and 
MADV_NOHUGEPAGE ...) but having an additional config knob that specifies 
in which cases we *still* allow flexible THP even though the system was 
configured for "madvise".

I can't come up with a good name for that, but something like 
"max_auto_size=64k" could be something reasonable to set. We could have 
an arch+hw specific default.

(we all hate config options, I know, but I think there are good reasons 
to have such bare-minimum ones)
  
Ryan Roberts July 7, 2023, 4:22 p.m. UTC | #19
On 07/07/2023 17:06, David Hildenbrand wrote:
> On 07.07.23 17:13, Ryan Roberts wrote:
>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>> something like "always madvise never" via
>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>
>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>> That
>>>>>> means that on a thp=madvise system (which is certainly the case for
>>>>>> android and
>>>>>> other client systems) we would have to disable large anon folios for VMAs
>>>>>> that
>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>> use of
>>>>>
>>>>> It will never ever be a completely invisible performance boost, just like
>>>>> ordinary THP.
>>>>>
>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>> specify "never" or "madvise", then do exactly that.
>>>>>
>>>>> It might make sense to have more modes or additional toggles, but
>>>>> "madvise=never" means no memory waste.
>>>>
>>>> I hate the existing mechanisms.  They are an abdication of our
>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>> or the programmer) of our code for using it wrongly.  We should not
>>>> replicate this mistake.
>>>
>>> I don't agree regarding the programmer responsibility. In some cases the
>>> programmer really doesn't want to get more memory populated than requested --
>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>
>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>> (and nailing down bugs or working around them in customer setups) have been very
>>> good reasons to let the admin have a word.
>>>
>>>>
>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>
>>>
>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>> strikes you know it isn't.
>>>
>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>> option is probably more controversial. But the "always vs. never" absolutely
>>> makes sense to me.
>>>
>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>> additional page tables. So if you have to respect that already, then also
>>>>> respect MADV_HUGEPAGE, simple.
>>>>
>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>
>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>> live migration in QEMU being the famous example). That doesn't fly.
>>>
>>>> I can get behind that.  But the notion that userspace knows what it's
>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>> know what it's doing.
>>>
>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>> some cases. And these include cases I care about messing with sparse VM
>>> memory :)
>>>
>>> I have strong opinions against populating more than required when user space set
>>> MADV_NOHUGEPAGE.
>>
>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>> The app has gone out of its way to explicitly set it, after all.
>>
>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>> are less obvious though. I could get on board with disabling large anon folios
>> globally when thp="never". But for other situations, I would prefer to keep
>> large anon folios enabled (treat "madvise" as "always"), with the argument that
>> their order is much smaller than traditional THP and therefore the internal
>> fragmentation is significantly reduced. I really don't want to end up with user
>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>> anon folios.
> 
> I was briefly playing with a nasty idea of an additional "madvise-pmd" option
> (that could be the new default), that would use PMD THP only in madvise'd
> regions, and ordinary everywhere else. But let's disregard that for now. I think
> there is a bigger issue (below).
> 
>>
>> I still feel that it would be better for the thp and large anon folio controls
>> to be independent though - what's the argument for tying them together?
> 
> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD
> THP on aarch64 (4k kernel), how are they any different? Just the way they are
> mapped ...

The last patch in the series shows my current approach to that:

int arch_wants_pte_order(struct vm_area_struct *vma)
{
	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size
	else
		return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K
}

But Yu has raised concerns that this type of policy needs to be in the core mm.
So we could have the arch blindly return the preferred order from the HW
perspective (which would be the contpte size for arm64). Then for
!hugepage_vma_check(), mm could take the min of that value and some determined
"acceptable" limit (which in my mind is 64K ;-).
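
To make that concrete, here is a minimal sketch of the core-mm side of that
idea (anon_folio_order() and ANON_FOLIO_MAX_ORDER_NOTHP are names invented
purely for illustration, not what's in the patches): the arch hook just
returns its blind HW preference and the core mm clamps it:

static int anon_folio_order(struct vm_area_struct *vma)
{
	/*
	 * Sketch only: ANON_FOLIO_MAX_ORDER_NOTHP is a hypothetical policy
	 * limit (e.g. 64K worth of base pages) applied when the VMA has not
	 * opted in via MADV_HUGEPAGE.
	 */
	int order = arch_wants_pte_order(vma);

	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		order = min(order, ANON_FOLIO_MAX_ORDER_NOTHP);

	return order;
}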

> 
> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
> but how is "2MiB vs. 2 MiB" different?
> 
> Having that said, I think we have to make up our mind how much control we want
> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not
> trivial: memory waste is a real issue on some systems where we limit THP to
> madvise().
> 
> 
> Just throwing it out for discussing:
> 
> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE
> ...) but having an additional config knob that specifies in which cases we
> *still* allow flexible THP even though the system was configured for "madvise".
> 
> I can't come up with a good name for that, but something like
> "max_auto_size=64k" could be something reasonable to set. We could have an
> arch+hw specific default.

Ahha, yes, that's essentially what I have above. I personally also like the idea
of the limit being an absolute value rather than an order. Although I know Yu
feels differently (see [1]).

[1]
https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m2aff6eebd7f14d0d0620b48497d26eacecf970e6


> 
> (we all hate config options, I know, but I think there are good reasons to have
> such bare-minimum ones)
>
  
David Hildenbrand July 7, 2023, 7:06 p.m. UTC | #20
>>> I still feel that it would be better for the thp and large anon folio controls
>>> to be independent though - what's the argument for tying them together?
>>
>> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD
>> THP on aarch64 (4k kernel), how are they any different? Just the way they are
>> mapped ...
> 
> The last patch in the series shows my current approach to that:
> 
> int arch_wants_pte_order(struct vm_area_struct *vma)
> {
> 	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> 		return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size
> 	else
> 		return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K
> }
> 
> But Yu has raised concerns that this type of policy needs to be in the core mm.
> So we could have the arch blindly return the preferred order from HW perspective
> (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm
> could take the min of that value and some determined "acceptable" limit (which
> in my mind is 64K ;-).

Yeah, it's really tricky. Because why should arm64 with 64k base pages 
*not* return 2 MiB (which is one possible cont-pte size IIRC)?

I share the idea that 64k might *currently* on *some platforms* be a 
reasonable choice. But that's where the "fun" begins.

> 
>>
>> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
>> but how is "2MiB vs. 2 MiB" different?
>>
>> Having that said, I think we have to make up our mind how much control we want
>> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not
>> trivial: memory waste is a real issue on some systems where we limit THP to
>> madvise().
>>
>>
>> Just throwing it out for discussing:
>>
>> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE
>> ...) but having an additional config knob that specifies in which cases we
>> *still* allow flexible THP even though the system was configured for "madvise".
>>
>> I can't come up with a good name for that, but something like
>> "max_auto_size=64k" could be something reasonable to set. We could have an
>> arch+hw specific default.
> 
> Ahha, yes, that's essentially what I have above. I personally also like the idea
> of the limit being an absolute value rather than an order. Although I know Yu
> feels differently (see [1]).

Exposed to user space I think it should be a human-readable value. 
Inside the kernel, I don't particularly care.

(Having databases/VMs on aarch64 with 64k in mind) I think it might be 
interesting to have something like the following:

thp=madvise
max_auto_size=64k/128k/256k


So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any 
flexible THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 
MiB THP? sure, give them to my VM. You're barely going to find 512 MiB 
THP either way in practice ....

But for the remainder of my system, just do something reasonable and 
don't go crazy on the memory waste.


I'll try reading all the previous discussions next week.
  
Huang, Ying July 10, 2023, 2:49 a.m. UTC | #21
Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>> On 07.07.23 11:52, Ryan Roberts wrote:
>> > On 07/07/2023 09:01, Huang, Ying wrote:
>> > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>> > > avoid internal fragmentation completely.  So, I think that finally we
>> > > will need to provide a mechanism for the users to opt out, e.g.,
>> > > something like "always madvise never" via
>> > > /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>> > > a good idea to reuse the existing interface of THP.
>> > 
>> > I wouldn't want to tie this to the existing interface, simply because that
>> > implies that we would want to follow the "always" and "madvise" advice too; That
>> > means that on a thp=madvise system (which is certainly the case for android and
>> > other client systems) we would have to disable large anon folios for VMAs that
>> > haven't explicitly opted in. That breaks the intention that this should be an
>> > invisible performance boost. I think it's important to set the policy for use of
>> 
>> It will never ever be a completely invisible performance boost, just like
>> ordinary THP.
>> 
>> Using the exact same existing toggle is the right thing to do. If someone
>> specify "never" or "madvise", then do exactly that.
>> 
>> It might make sense to have more modes or additional toggles, but
>> "madvise=never" means no memory waste.
>
> I hate the existing mechanisms.  They are an abdication of our
> responsibility, and an attempt to blame the user (be it the sysadmin
> or the programmer) of our code for using it wrongly.  We should not
> replicate this mistake.
>
> Our code should be auto-tuning.  I posted a long, detailed outline here:
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

Yes.  Auto-tuning should be preferable to any configuration
mechanism.

Something like the THP shrinker could be another way of auto-tuning.

https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/

That is, allocate the large folios on page fault, then try to detect
internal fragmentation.
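
As a rough sketch of that detection step (illustrative only, not the code in
the series linked above), internal fragmentation of a large folio could be
estimated by counting its zero-filled subpages and splitting folios above
some threshold:

/* Sketch only: count zero-filled subpages as a fragmentation estimate. */
static unsigned int folio_zero_subpages(struct folio *folio)
{
	unsigned int i, zero = 0;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);

		if (!memchr_inv(kaddr, 0, PAGE_SIZE))
			zero++;
		kunmap_local(kaddr);
	}

	/* A high count suggests the folio is a good split candidate. */
	return zero;
}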

>> I remember I raised it already in the past, but you *absolutely* have to
>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>> example, userfaultfd) that doesn't want the kernel to populate any
>> additional page tables. So if you have to respect that already, then also
>> respect MADV_HUGEPAGE, simple.
>
> Possibly having uffd enabled on a VMA should disable using large folios,
> I can get behind that.  But the notion that userspace knows what it's
> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
> know what it's doing.

Best Regards,
Huang, Ying
  
Huang, Ying July 10, 2023, 3:03 a.m. UTC | #22
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 07/07/2023 15:07, David Hildenbrand wrote:
>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>> something like "always madvise never" via
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>> a good idea to reuse the existing interface of THP.
>>>>>
>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>> That
>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>> invisible performance boost. I think it's important to set the policy for
>>>>> use of
>>>>
>>>> It will never ever be a completely invisible performance boost, just like
>>>> ordinary THP.
>>>>
>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>> specify "never" or "madvise", then do exactly that.
>>>>
>>>> It might make sense to have more modes or additional toggles, but
>>>> "madvise=never" means no memory waste.
>>>
>>> I hate the existing mechanisms.  They are an abdication of our
>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>> or the programmer) of our code for using it wrongly.  We should not
>>> replicate this mistake.
>> 
>> I don't agree regarding the programmer responsibility. In some cases the
>> programmer really doesn't want to get more memory populated than requested --
>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>> 
>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>> (and nailing down bugs or working around them in customer setups) have been very
>> good reasons to let the admin have a word.
>> 
>>>
>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>
>> 
>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>> strikes you know it isn't.
>> 
>> If people don't feel like using THP, let them have a word. The "madvise" config
>> option is probably more controversial. But the "always vs. never" absolutely
>> makes sense to me.
>> 
>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>> additional page tables. So if you have to respect that already, then also
>>>> respect MADV_HUGEPAGE, simple.
>>>
>>> Possibly having uffd enabled on a VMA should disable using large folios,
>> 
>> There are cases where we enable uffd *after* already touching memory (postcopy
>> live migration in QEMU being the famous example). That doesn't fly.
>> 
>>> I can get behind that.  But the notion that userspace knows what it's
>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>> know what it's doing.
>> 
>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>> some cases. And these include cases I care about messing with sparse VM memory :)
>> 
>> I have strong opinions against populating more than required when user space set
>> MADV_NOHUGEPAGE.
>
> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
> The app has gone out of its way to explicitly set it, after all.
>
> I think the correct behaviour for the global thp controls (cmdline and sysfs)
> are less obvious though. I could get on board with disabling large anon folios
> globally when thp="never". But for other situations, I would prefer to keep
> large anon folios enabled (treat "madvise" as "always"),

If we have some mechanism to auto-tune large folio usage, for
example, detecting internal fragmentation and splitting the large folio,
then we can use thp="always" as the default configuration.  If my memory
serves correctly, this is what Johannes and Alexander are working on.

> with the argument that
> their order is much smaller than traditional THP and therefore the internal
> fragmentation is significantly reduced.

Do you have any data for this?

> I really don't want to end up with user
> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
> anon folios.
>
> I still feel that it would be better for the thp and large anon folio controls
> to be independent though - what's the argument for tying them together?
>

Best Regards,
Huang, Ying
  
Ryan Roberts July 10, 2023, 8:41 a.m. UTC | #23
On 07/07/2023 20:06, David Hildenbrand wrote:
>>>> I still feel that it would be better for the thp and large anon folio controls
>>>> to be independent though - what's the argument for tying them together?
>>>
>>> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD
>>> THP on aarch64 (4k kernel), how are they any different? Just the way they are
>>> mapped ...
>>
>> The last patch in the series shows my current approach to that:
>>
>> int arch_wants_pte_order(struct vm_area_struct *vma)
>> {
>>     if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>         return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size
>>     else
>>         return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K
>> }
>>
>> But Yu has raised concerns that this type of policy needs to be in the core mm.
>> So we could have the arch blindly return the preferred order from HW perspective
>> (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm
>> could take the min of that value and some determined "acceptable" limit (which
>> in my mind is 64K ;-).
> 
> Yeah, it's really tricky. Because why should arm64 with 64k base pages *not*
> return 2MiB (which is one possible cont-pte size IIRC) ?
> 
> I share the idea that 64k might *currently* on *some platforms* be a reasonable
> choice. But that's where the "fun" begins.
> 
>>
>>>
>>> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
>>> but how is "2MiB vs. 2 MiB" different?
>>>
>>> Having that said, I think we have to make up our mind how much control we want
>>> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not
>>> trivial: memory waste is a real issue on some systems where we limit THP to
>>> madvise().
>>>
>>>
>>> Just throwing it out for discussing:
>>>
>>> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE
>>> ...) but having an additional config knob that specifies in which cases we
>>> *still* allow flexible THP even though the system was configured for "madvise".
>>>
>>> I can't come up with a good name for that, but something like
>>> "max_auto_size=64k" could be something reasonable to set. We could have an
>>> arch+hw specific default.
>>
>> Ahha, yes, that's essentially what I have above. I personally also like the idea
>> of the limit being an absolute value rather than an order. Although I know Yu
>> feels differently (see [1]).
> 
> Exposed to user space I think it should be a human-readable value. Inside the
> kernel, I don't particularly care.

My point was less about human-readable vs not. It was about expressing a value
that is relative to the base page size vs expressing a value that is independent
of base page size. If the concern is about limiting internal fragmentation, I
think it's the absolute size that matters.
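
As a sketch of why the absolute size is the natural unit here (the helper name
is invented for illustration), converting a byte limit into an order gives a
different order per base page size while keeping the same fragmentation bound:

/*
 * Sketch only: a 64K limit maps to order-4 with 4K base pages but order-0
 * with 64K base pages (assumes limit_bytes >= PAGE_SIZE).
 */
static int limit_to_max_order(unsigned long limit_bytes)
{
	return ilog2(limit_bytes >> PAGE_SHIFT);
}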

> 
> (Having databases/VMs on arch64 with 64k in mind) I think it might be
> interesting to have something like the following:
> 
> thp=madvise
> max_auto_size=64k/128k/256k
> 
> 
> So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any flexible
> THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 MiB THP? sure, give
> them to my VM. You're barely going to find 512 MiB THP either way in practice ....
> 
> But for the remainder of my system, just do something reasonable and don't go
> crazy on the memory waste.

Yep, we're on the same page. I've got a v3 that's almost ready to go, based on
Yu's previous round of review. I'm going to incorporate this mechanism into it
and then post, hopefully later in the week. Now I just need to figure out a
decent name for the max_auto_size control...

> 
> 
> I'll try reading all the previous discussions next week.
>
  
Ryan Roberts July 10, 2023, 8:55 a.m. UTC | #24
On 10/07/2023 04:03, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>> something like "always madvise never" via
>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>
>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>> That
>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>> use of
>>>>>
>>>>> It will never ever be a completely invisible performance boost, just like
>>>>> ordinary THP.
>>>>>
>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>> specify "never" or "madvise", then do exactly that.
>>>>>
>>>>> It might make sense to have more modes or additional toggles, but
>>>>> "madvise=never" means no memory waste.
>>>>
>>>> I hate the existing mechanisms.  They are an abdication of our
>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>> or the programmer) of our code for using it wrongly.  We should not
>>>> replicate this mistake.
>>>
>>> I don't agree regarding the programmer responsibility. In some cases the
>>> programmer really doesn't want to get more memory populated than requested --
>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>
>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>> (and nailing down bugs or working around them in customer setups) have been very
>>> good reasons to let the admin have a word.
>>>
>>>>
>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>
>>>
>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>> strikes you know it isn't.
>>>
>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>> option is probably more controversial. But the "always vs. never" absolutely
>>> makes sense to me.
>>>
>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>> additional page tables. So if you have to respect that already, then also
>>>>> respect MADV_HUGEPAGE, simple.
>>>>
>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>
>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>> live migration in QEMU being the famous example). That doesn't fly.
>>>
>>>> I can get behind that.  But the notion that userspace knows what it's
>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>> know what it's doing.
>>>
>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>
>>> I have strong opinions against populating more than required when user space set
>>> MADV_NOHUGEPAGE.
>>
>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>> The app has gone out of its way to explicitly set it, after all.
>>
>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>> are less obvious though. I could get on board with disabling large anon folios
>> globally when thp="never". But for other situations, I would prefer to keep
>> large anon folios enabled (treat "madvise" as "always"),
> 
> If we have some mechanism to auto-tune the large folios usage, for
> example, detect the internal fragmentation and split the large folio,
> then we can use thp="always" as default configuration.  If my memory
> were correct, this is what Johannes and Alexander is working on.

Could you point me to that work? I'd like to understand what the mechanism is.
The other half of my work aims to use arm64's pte "contiguous bit" to tell the
HW that a span of PTEs shares the same mapping and is therefore coalesced into a
single TLB entry. The side effect of this, however, is that we only have a
single access and dirty bit for the whole contpte extent. So I'd like to avoid
any mechanism that relies on getting access/dirty at the base page granularity
for a large folio.

> 
>> with the argument that
>> their order is much smaller than traditional THP and therefore the internal
>> fragmentation is significantly reduced.
> 
> Do you have any data for this?

Some; it's partly based on intuition that the smaller the allocation unit, the
smaller the internal fragmentation. And partly on peak memory usage data I've
collected for the benchmarks I'm running, comparing a baseline-4k kernel with
baseline-16k and baseline-64k kernels, along with a 4k kernel that supports large
anon folios (I appreciate that's not exactly what we are talking about here, and
it's not exactly an extensive set of results!):


Kernel Compilation with 8 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   0.1% |
| baseline-16k  |   6.3% |
| baseline-64k  |  28.1% |


Kernel Compilation with 80 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   1.7% |
| baseline-16k  |   2.6% |
| baseline-64k  |  12.3% |



> 
>> I really don't want to end up with user
>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>> anon folios.
>>
>> I still feel that it would be better for the thp and large anon folio controls
>> to be independent though - what's the argument for tying them together?
>>
> 
> Best Regards,
> Huang, Ying
>
  
Huang, Ying July 10, 2023, 9:18 a.m. UTC | #25
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 04:03, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>>> something like "always madvise never" via
>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>>
>>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>>> That
>>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>>> use of
>>>>>>
>>>>>> It will never ever be a completely invisible performance boost, just like
>>>>>> ordinary THP.
>>>>>>
>>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>>> specify "never" or "madvise", then do exactly that.
>>>>>>
>>>>>> It might make sense to have more modes or additional toggles, but
>>>>>> "madvise=never" means no memory waste.
>>>>>
>>>>> I hate the existing mechanisms.  They are an abdication of our
>>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>>> or the programmer) of our code for using it wrongly.  We should not
>>>>> replicate this mistake.
>>>>
>>>> I don't agree regarding the programmer responsibility. In some cases the
>>>> programmer really doesn't want to get more memory populated than requested --
>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>>
>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>>> (and nailing down bugs or working around them in customer setups) have been very
>>>> good reasons to let the admin have a word.
>>>>
>>>>>
>>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>>
>>>>
>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>>> strikes you know it isn't.
>>>>
>>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>>> option is probably more controversial. But the "always vs. never" absolutely
>>>> makes sense to me.
>>>>
>>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>>> additional page tables. So if you have to respect that already, then also
>>>>>> respect MADV_HUGEPAGE, simple.
>>>>>
>>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>>
>>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>>> live migration in QEMU being the famous example). That doesn't fly.
>>>>
>>>>> I can get behind that.  But the notion that userspace knows what it's
>>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>>> know what it's doing.
>>>>
>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>>
>>>> I have strong opinions against populating more than required when user space set
>>>> MADV_NOHUGEPAGE.
>>>
>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>>> The app has gone out of its way to explicitly set it, after all.
>>>
>>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>>> are less obvious though. I could get on board with disabling large anon folios
>>> globally when thp="never". But for other situations, I would prefer to keep
>>> large anon folios enabled (treat "madvise" as "always"),
>> 
>> If we have some mechanism to auto-tune the large folios usage, for
>> example, detect the internal fragmentation and split the large folio,
>> then we can use thp="always" as default configuration.  If my memory
>> were correct, this is what Johannes and Alexander is working on.
>
> Could you point me to that work? I'd like to understand what the mechanism is.
> The other half of my work aims to use arm64's pte "contiguous bit" to tell the
> HW that a span of PTEs share the same mapping and is therefore coalesced into a
> single TLB entry. The side effect of this, however, is that we only have a
> single access and dirty bit for the whole contpte extent. So I'd like to avoid
> any mechanism that relies on getting access/dirty at the base page granularity
> for a large folio.

Please take a look at the THP shrinker patchset,

https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/

>> 
>>> with the argument that
>>> their order is much smaller than traditional THP and therefore the internal
>>> fragmentation is significantly reduced.
>> 
>> Do you have any data for this?
>
> Some; its partly based on intuition that the smaller the allocation unit, the
> smaller the internal fragmentation. And partly on peak memory usage data I've
> collected for the benchmarks I'm running, comparing baseline-4k kernel with
> baseline-16k and baseline-64 kernels along with a 4k kernel that supports large
> anon folios (I appreciate that's not exactly what we are talking about here, and
> it's not exactly an extensive set of results!):
>
>
> Kernel Compliation with 8 Jobs:
> | kernel        |   peak |
> |:--------------|-------:|
> | baseline-4k   |   0.0% |
> | anonfolio     |   0.1% |
> | baseline-16k  |   6.3% |
> | baseline-64k  |  28.1% |
>
>
> Kernel Compliation with 80 Jobs:
> | kernel        |   peak |
> |:--------------|-------:|
> | baseline-4k   |   0.0% |
> | anonfolio     |   1.7% |
> | baseline-16k  |   2.6% |
> | baseline-64k  |  12.3% |
>

Why is anonfolio better than baseline-64k if you always allocate 64k
anonymous folios?  Because the page cache uses 64k pages in baseline-64k?

We may need to test some workloads with sparse access patterns too.

Best Regards,
Huang, Ying

>> 
>>> I really don't want to end up with user
>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>>> anon folios.
>>>
>>> I still feel that it would be better for the thp and large anon folio controls
>>> to be independent though - what's the argument for tying them together?
>>>
  
Ryan Roberts July 10, 2023, 9:25 a.m. UTC | #26
On 10/07/2023 10:18, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 10/07/2023 04:03, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>>>> something like "always madvise never" via
>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>>>
>>>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>>>> That
>>>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>>>> use of
>>>>>>>
>>>>>>> It will never ever be a completely invisible performance boost, just like
>>>>>>> ordinary THP.
>>>>>>>
>>>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>>>> specify "never" or "madvise", then do exactly that.
>>>>>>>
>>>>>>> It might make sense to have more modes or additional toggles, but
>>>>>>> "madvise=never" means no memory waste.
>>>>>>
>>>>>> I hate the existing mechanisms.  They are an abdication of our
>>>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>>>> or the programmer) of our code for using it wrongly.  We should not
>>>>>> replicate this mistake.
>>>>>
>>>>> I don't agree regarding the programmer responsibility. In some cases the
>>>>> programmer really doesn't want to get more memory populated than requested --
>>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>>>
>>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>>>> (and nailing down bugs or working around them in customer setups) have been very
>>>>> good reasons to let the admin have a word.
>>>>>
>>>>>>
>>>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>>>
>>>>>
>>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>>>> strikes you know it isn't.
>>>>>
>>>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>>>> option is probably more controversial. But the "always vs. never" absolutely
>>>>> makes sense to me.
>>>>>
>>>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>>>> additional page tables. So if you have to respect that already, then also
>>>>>>> respect MADV_HUGEPAGE, simple.
>>>>>>
>>>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>>>
>>>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>>>> live migration in QEMU being the famous example). That doesn't fly.
>>>>>
>>>>>> I can get behind that.  But the notion that userspace knows what it's
>>>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>>>> know what it's doing.
>>>>>
>>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>>>
>>>>> I have strong opinions against populating more than required when user space set
>>>>> MADV_NOHUGEPAGE.
>>>>
>>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
>>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
>>>> The app has gone out of its way to explicitly set it, after all.
>>>>
>>>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>>>> are less obvious though. I could get on board with disabling large anon folios
>>>> globally when thp="never". But for other situations, I would prefer to keep
>>>> large anon folios enabled (treat "madvise" as "always"),
>>>
>>> If we have some mechanism to auto-tune the large folios usage, for
>>> example, detect the internal fragmentation and split the large folio,
>>> then we can use thp="always" as default configuration.  If my memory
>>> were correct, this is what Johannes and Alexander is working on.
>>
>> Could you point me to that work? I'd like to understand what the mechanism is.
>> The other half of my work aims to use arm64's pte "contiguous bit" to tell the
>> HW that a span of PTEs share the same mapping and is therefore coalesced into a
>> single TLB entry. The side effect of this, however, is that we only have a
>> single access and dirty bit for the whole contpte extent. So I'd like to avoid
>> any mechanism that relies on getting access/dirty at the base page granularity
>> for a large folio.
> 
> Please take a look at the THP shrinker patchset,
> 
> https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/

Thanks!

> 
>>>
>>>> with the argument that
>>>> their order is much smaller than traditional THP and therefore the internal
>>>> fragmentation is significantly reduced.
>>>
>>> Do you have any data for this?
>>
>> Some; its partly based on intuition that the smaller the allocation unit, the
>> smaller the internal fragmentation. And partly on peak memory usage data I've
>> collected for the benchmarks I'm running, comparing baseline-4k kernel with
>> baseline-16k and baseline-64 kernels along with a 4k kernel that supports large
>> anon folios (I appreciate that's not exactly what we are talking about here, and
>> it's not exactly an extensive set of results!):
>>
>>
>> Kernel Compliation with 8 Jobs:
>> | kernel        |   peak |
>> |:--------------|-------:|
>> | baseline-4k   |   0.0% |
>> | anonfolio     |   0.1% |
>> | baseline-16k  |   6.3% |
>> | baseline-64k  |  28.1% |
>>
>>
>> Kernel Compliation with 80 Jobs:
>> | kernel        |   peak |
>> |:--------------|-------:|
>> | baseline-4k   |   0.0% |
>> | anonfolio     |   1.7% |
>> | baseline-16k  |   2.6% |
>> | baseline-64k  |  12.3% |
>>
> 
> Why is anonfolio better than baseline-64k if you always allocate a 64k
> anonymous folio?  Because page cache uses 64k in baseline-64k?

No, because the VMA boundaries are aligned to 4K and not 64K. Large Anon Folios
only allocates a 64K folio if it does not breach the bounds of the VMA (and if
it doesn't overlap other allocated PTEs).
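
As a concrete (purely hypothetical) illustration of that boundary effect,
assume a preferred order of 4 (64K folios) and a single 100 KiB anonymous VMA
that happens to start on a 64K boundary; once it is fully touched:

  baseline-64k:  2 x 64K base pages          = 128 KiB resident
  anonfolio:     1 x 64K + 1 x 32K + 1 x 4K  = 100 KiB resident

The fallback to lower orders at the VMA edge avoids the rounding waste that a
larger base page size forces (this ignores the page cache and every other
difference between the kernels; the numbers are picked for illustration only).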

> 
> We may need to test some workloads with sparse access patterns too.

Yes, I agree that if you have a workload with a pathological memory access pattern
where it writes to addresses with a stride of 64K, all contained in a single
VMA, then you will end up allocating 16x the memory. This is obviously an
unrealistic extreme though.
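
For what it's worth, a minimal userspace sketch of that pathological pattern
looks like the below (the 64K stride and the 16x figure assume a preferred
folio order of 4 on a 4K base page kernel; illustration only):

#include <stddef.h>
#include <sys/mman.h>

#define STRIDE	(64UL * 1024)		/* assumed preferred folio size (order 4) */
#define LEN	(1024 * STRIDE)		/* one 64 MiB anonymous VMA */

int main(void)
{
	char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * One write per 64K region: order-0 faults populate ~4 MiB
	 * (1024 x 4K), whereas taking a full 64K folio on every fault
	 * would populate all 64 MiB -- the 16x amplification above.
	 */
	for (size_t i = 0; i < LEN; i += STRIDE)
		buf[i] = 1;

	return 0;
}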

> 
> Best Regards,
> Huang, Ying
> 
>>>
>>>> I really don't want to end up with user
>>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>>>> anon folios.
>>>>
>>>> I still feel that it would be better for the thp and large anon folio controls
>>>> to be independent though - what's the argument for tying them together?
>>>>
>
  
Huang, Ying July 11, 2023, 12:48 a.m. UTC | #27
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 10:18, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 10/07/2023 04:03, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>>>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>>>>> something like "always madvise never" via
>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>>>>
>>>>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>>>>> That
>>>>>>>>> means that on a thp=madvise system (which is certainly the case for android and
>>>>>>>>> other client systems) we would have to disable large anon folios for VMAs that
>>>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an
>>>>>>>>> invisible performance boost. I think it's important to set the policy for
>>>>>>>>> use of
>>>>>>>>
>>>>>>>> It will never ever be a completely invisible performance boost, just like
>>>>>>>> ordinary THP.
>>>>>>>>
>>>>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>>>>> specify "never" or "madvise", then do exactly that.
>>>>>>>>
>>>>>>>> It might make sense to have more modes or additional toggles, but
>>>>>>>> "madvise=never" means no memory waste.
>>>>>>>
>>>>>>> I hate the existing mechanisms.  They are an abdication of our
>>>>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>>>>> or the programmer) of our code for using it wrongly.  We should not
>>>>>>> replicate this mistake.
>>>>>>
>>>>>> I don't agree regarding the programmer responsibility. In some cases the
>>>>>> programmer really doesn't want to get more memory populated than requested --
>>>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>>>>
>>>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>>>>>> (and nailing down bugs or working around them in customer setups) have been very
>>>>>> good reasons to let the admin have a word.
>>>>>>
>>>>>>>
>>>>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>>>>
>>>>>>
>>>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>>>>>> strikes you know it isn't.
>>>>>>
>>>>>> If people don't feel like using THP, let them have a word. The "madvise" config
>>>>>> option is probably more controversial. But the "always vs. never" absolutely
>>>>>> makes sense to me.
>>>>>>
>>>>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>>>>> additional page tables. So if you have to respect that already, then also
>>>>>>>> respect MADV_HUGEPAGE, simple.
>>>>>>>
>>>>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>>>>
>>>>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>>>>> live migration in QEMU being the famous example). That doesn't fly.
>>>>>>
>>>>>>> I can get behind that.  But the notion that userspace knows what it's
>>>>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>>>>> know what it's doing.
>>>>>>
>>>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>>>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>>>>
>>>>>> I have strong opinions against populating more than required when user space sets
>>>>>> MADV_NOHUGEPAGE.
>>>>>
>>>>> I can see your point about honouring MADV_NOHUGEPAGE, so I think that it is
>>>>> reasonable to fall back to allocating an order-0 page in a VMA that has it set.
>>>>> The app has gone out of its way to explicitly set it, after all.
>>>>>
>>>>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>>>>> is less obvious though. I could get on board with disabling large anon folios
>>>>> globally when thp="never". But for other situations, I would prefer to keep
>>>>> large anon folios enabled (treat "madvise" as "always"),
>>>>
>>>> If we have some mechanism to auto-tune the large folios usage, for
>>>> example, detect the internal fragmentation and split the large folio,
>>>> then we can use thp="always" as the default configuration.  If I remember
>>>> correctly, this is what Johannes and Alexander are working on.
>>>
>>> Could you point me to that work? I'd like to understand what the mechanism is.
>>> The other half of my work aims to use arm64's pte "contiguous bit" to tell the
>>> HW that a span of PTEs shares the same mapping and is therefore coalesced into a
>>> single TLB entry. The side effect of this, however, is that we only have a
>>> single access and dirty bit for the whole contpte extent. So I'd like to avoid
>>> any mechanism that relies on getting access/dirty at the base page granularity
>>> for a large folio.
>> 
>> Please take a look at the THP shrinker patchset,
>> 
>> https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/
>
> Thanks!
>
>> 
>>>>
>>>>> with the argument that
>>>>> their order is much smaller than traditional THP and therefore the internal
>>>>> fragmentation is significantly reduced.
>>>>
>>>> Do you have any data for this?
>>>
>>> Some; it's partly based on intuition that the smaller the allocation unit, the
>>> smaller the internal fragmentation. And partly on peak memory usage data I've
>>> collected for the benchmarks I'm running, comparing baseline-4k kernel with
>>> baseline-16k and baseline-64k kernels along with a 4k kernel that supports large
>>> anon folios (I appreciate that's not exactly what we are talking about here, and
>>> it's not exactly an extensive set of results!):
>>>
>>>
>>> Kernel Compilation with 8 Jobs:
>>> | kernel        |   peak |
>>> |:--------------|-------:|
>>> | baseline-4k   |   0.0% |
>>> | anonfolio     |   0.1% |
>>> | baseline-16k  |   6.3% |
>>> | baseline-64k  |  28.1% |
>>>
>>>
>>> Kernel Compilation with 80 Jobs:
>>> | kernel        |   peak |
>>> |:--------------|-------:|
>>> | baseline-4k   |   0.0% |
>>> | anonfolio     |   1.7% |
>>> | baseline-16k  |   2.6% |
>>> | baseline-64k  |  12.3% |
>>>
>> 
>> Why is anonfolio better than baseline-64k if you always allocate a 64k
>> anonymous folio?  Because page cache uses 64k in baseline-64k?
>
> No, because the VMA boundaries are aligned to 4K and not 64K. Large Anon Folios
> only allocates a 64K folio if it does not breach the bounds of the VMA (and if
> it doesn't overlap other allocated PTEs).

Thanks for the explanation!

We will use more memory for the file cache too with baseline-64k, right?  So,
you observed many more anonymous pages, but not so for file cache pages?

>> 
>> We may need to test some workloads with sparse access patterns too.
>
> Yes, I agree that if you have a workload with a pathological memory access pattern
> where it writes to addresses with a stride of 64K, all contained in a single
> VMA, then you will end up allocating 16x the memory. This is obviously an
> unrealistic extreme though.

I think that there should be some realistic workload which has sparse
access patterns.

Best Regards,
Huang, Ying

>> 
>>>>
>>>>> I really don't want to end up with user
>>>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
>>>>> anon folios.
>>>>>
>>>>> I still feel that it would be better for the thp and large anon folio controls
>>>>> to be independent though - what's the argument for tying them together?
>>>>>
>>
  

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..1c06b2c0a24e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,16 @@  config READ_ONLY_THP_FOR_FS
 	  support of file THPs will be developed in the next few release
 	  cycles.
 
+config FLEXIBLE_THP
+	bool "Flexible order THP"
+	depends on TRANSPARENT_HUGEPAGE
+	default n
+	help
+	  Use large (bigger than order-0) folios to back anonymous memory where
+	  possible, even if the order of the folio is smaller than the PMD
+	  order. This reduces the number of page faults, as well as other
+	  per-page overheads to improve performance for many workloads.
+
 endif # TRANSPARENT_HUGEPAGE
 
 #
diff --git a/mm/memory.c b/mm/memory.c
index fb30f7523550..abe2ea94f3f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3001,6 +3001,116 @@  static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_FLEXIBLE_THP
+/*
+ * Allocates, zeros and returns a folio of the requested order for use as
+ * anonymous memory.
+ */
+static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
+				      unsigned long addr, int order)
+{
+	gfp_t gfp;
+	struct folio *folio;
+
+	if (order == 0)
+		return vma_alloc_zeroed_movable_folio(vma, addr);
+
+	gfp = vma_thp_gfp_mask(vma);
+	folio = vma_alloc_folio(gfp, order, vma, addr, true);
+	if (folio)
+		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
+
+	return folio;
+}
+
+/*
+ * Preferred folio order to allocate for anonymous memory.
+ */
+#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
+#else
+#define alloc_anon_folio(vma, addr, order) \
+				vma_alloc_zeroed_movable_folio(vma, addr)
+#define max_anon_folio_order(vma)	0
+#endif
+
+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static inline int check_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(ptep_get(pte++)))
+			return i;
+	}
+
+	return nr;
+}
+
+static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Order must not be higher than `order` upon entry
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must not overlap any non-none ptes
+	 *
+	 * Additionally, we do not allow order-1 since this breaks assumptions
+	 * elsewhere in the mm; THP pages must be at least order-2 (since they
+	 * store state up to the 3rd struct page subpage), and these pages must
+	 * be THP in order to correctly use pre-existing THP infrastructure such
+	 * as folio_split().
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the result is racy and the user must re-check any overlap
+	 * with non-none ptes under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	order = min(order, PMD_SHIFT - PAGE_SHIFT);
+
+	for (; order > 1; order--) {
+		nr = 1 << order;
+		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
+		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* Check vma bounds. */
+		if (addr < vma->vm_start ||
+		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
+			continue;
+
+		/* Ptes covered by order already known to be none. */
+		if (pte + nr <= first_set)
+			break;
+
+		/* Already found set pte in range covered by order. */
+		if (pte <= first_set)
+			continue;
+
+		/* Need to check if all the ptes are none. */
+		ret = check_ptes_none(pte, nr);
+		if (ret == nr)
+			break;
+
+		first_set = pte + ret;
+	}
+
+	if (order == 1)
+		order = 0;
+
+	return order;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
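
(To make the fallback above easier to follow, here is a small standalone
userspace model of calc_anon_folio_order_alloc(); it is an illustration only
and not part of the patch: the helper name, the toy "present" array and the
page numbers are invented for the example, the first_set shortcut is dropped
for brevity, and the order is assumed to be already capped at the PMD size.)

#include <stdbool.h>
#include <stdio.h>

#define PTRS_PER_PMD	512

/* Toy page table: present[i] is true if pte i within the PMD is populated. */
static int model_calc_order(const bool *present, unsigned long vma_start_pg,
			    unsigned long vma_end_pg, unsigned long fault_pg,
			    int order)
{
	for (; order > 1; order--) {
		unsigned long nr = 1UL << order;
		unsigned long first = fault_pg & ~(nr - 1);	/* natural alignment */
		unsigned long i;

		/* The folio must not breach the VMA bounds. */
		if (first < vma_start_pg || first + nr > vma_end_pg)
			continue;

		/* The folio must not overlap any populated pte. */
		for (i = 0; i < nr; i++)
			if (present[first + i])
				break;
		if (i == nr)
			return order;
	}

	return 0;	/* order-1 is never used; fall back to a single page */
}

int main(void)
{
	bool present[PTRS_PER_PMD] = { false };

	/*
	 * Fault at page 20 in a VMA covering pages [16, 100): the order-4
	 * candidate spans pages [16, 32), which fits and is empty.
	 */
	printf("order %d\n", model_calc_order(present, 16, 100, 20, 4));

	/*
	 * Populate page 26: [16, 32) now overlaps it, but the order-3
	 * candidate [16, 24) does not, so the fallback picks order 3.
	 */
	present[26] = true;
	printf("order %d\n", model_calc_order(present, 16, 100, 20, 4));

	return 0;
}

(With a preferred order of 4 the first call prints 4 and the second prints 3,
matching what the kernel version does when it hits a populated pte.)
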
@@ -3073,7 +3183,7 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;
 
 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+		new_folio = alloc_anon_folio(vma, vmf->address, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4040,6 +4150,9 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	int order;
+	int pgcount;
+	unsigned long addr;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4081,24 +4194,51 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
+	}
+
+	/*
+	 * If allocating a large folio, determine the biggest suitable order for
+	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
+	 * overlap with any populated PTEs, etc). We are not under the ptl here
+	 * so we will need to re-check that we are not overlapping any populated
+	 * PTEs once we have the lock.
+	 */
+	order = uffd_wp ? 0 : max_anon_folio_order(vma);
+	if (order > 0) {
+		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		order = calc_anon_folio_order_alloc(vmf, order);
+		pte_unmap(vmf->pte);
 	}
 
-	/* Allocate our own private page. */
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vma, vmf->address, order);
+	if (!folio && order > 0) {
+		order = 0;
+		folio = alloc_anon_folio(vma, vmf->address, order);
+	}
 	if (!folio)
 		goto oom;
 
+	pgcount = 1 << order;
+	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);
 
@@ -4107,11 +4247,12 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (vmf_pte_changed(vmf)) {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
 		goto release;
+	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
+		goto release;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
@@ -4125,16 +4266,17 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
+
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;