Message ID | 20230703135330.1865927-5-ryan.roberts@arm.com |
---|---|
State | New |
Series | variable-order, large folios for anonymous memory |
Commit Message
Ryan Roberts
July 3, 2023, 1:53 p.m. UTC
Introduce the FLEXIBLE_THP feature, which allows anonymous memory to be
allocated in large folios of a specified order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced, since those ops now become per-folio.
The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig option,
which defaults to disabled for now; there is a long list of todos to make
FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
madvise ops, etc.). These items will be tackled in subsequent patches.
When enabled, the preferred folio order is as returned by
arch_wants_pte_order(), which may be overridden by the arch as it sees
fit. Some architectures (e.g. arm64) can coalesce TLB entries if a
contiguous set of ptes map physically contiguous, naturally aligned
memory, so this mechanism allows the architecture to optimize as
required.
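(As a purely illustrative aside: an architecture-side override of this hook could be as simple as the sketch below. The constant and the trivial body are assumptions made here for illustration; the real arm64 override, if any, lives elsewhere in the series, not in this patch.)

#include <linux/mm.h>

/*
 * Illustrative sketch only, not code from this patch: report the order
 * whose size matches the CPU's contiguous-PTE block (assumed here to be
 * 16 x 4K pages, i.e. order 4, for a 4K granule), so that a naturally
 * aligned, physically contiguous folio can be coalesced into a single
 * TLB entry.
 */
#define ARCH_CONT_PTE_ORDER	4

int arch_wants_pte_order(struct vm_area_struct *vma)
{
	return ARCH_CONT_PTE_ORDER;
}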
If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order.
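(A minimal sketch of that fall-back, using a hypothetical helper name: the patch's actual calc_anon_folio_order_alloc(), quoted in the review below, additionally caps the order at the PMD order, disallows order-1 and rejects orders that overlap populated ptes.)

#include <linux/mm.h>

/*
 * Sketch only: step down from the preferred order until the naturally
 * aligned folio fits entirely inside the vma. Order 0 always fits.
 */
static int anon_folio_order_for(struct vm_area_struct *vma,
				unsigned long addr, int preferred_order)
{
	int order;

	for (order = preferred_order; order > 0; order--) {
		unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE << order);
		unsigned long end = start + (PAGE_SIZE << order);

		if (start >= vma->vm_start && end <= vma->vm_end)
			break;
	}

	return order;
}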
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/Kconfig | 10 ++++
mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 165 insertions(+), 13 deletions(-)
Comments
Hi Ryan, kernel test robot noticed the following build errors: [auto build test ERROR on arm64/for-next/core] [also build test ERROR on v6.4] [cannot apply to akpm-mm/mm-everything linus/master next-20230703] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627 base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core patch link: https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance config: um-allyesconfig (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/config) compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1) reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202307032325.u93xmWbG-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __raw_readb(PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu' #define __le16_to_cpu(x) ((__force __u16)(__le16)(x)) ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu' #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from 
include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writeb(value, PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsb(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsw(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsl(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesb(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesw(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesl(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ >> mm/memory.c:4271:2: error: implicit declaration of function 'set_ptes' is invalid in C99 [-Werror,-Wimplicit-function-declaration] set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); ^ mm/memory.c:4271:2: note: did you mean 'set_pte'? arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here static inline void set_pte(pte_t *pteptr, pte_t pteval) ^ >> mm/memory.c:4274:2: error: implicit declaration of function 'update_mmu_cache_range' is invalid in C99 [-Werror,-Wimplicit-function-declaration] update_mmu_cache_range(vma, addr, vmf->pte, pgcount); ^ 12 warnings and 2 errors generated. vim +/set_ptes +4271 mm/memory.c 4135 4136 /* 4137 * We enter with non-exclusive mmap_lock (to exclude vma changes, 4138 * but allow concurrent faults), and pte mapped but not yet locked. 4139 * We return with mmap_lock still held, but pte unmapped and unlocked. 4140 */ 4141 static vm_fault_t do_anonymous_page(struct vm_fault *vmf) 4142 { 4143 bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); 4144 struct vm_area_struct *vma = vmf->vma; 4145 struct folio *folio; 4146 vm_fault_t ret = 0; 4147 pte_t entry; 4148 int order; 4149 int pgcount; 4150 unsigned long addr; 4151 4152 /* File mapping without ->vm_ops ? */ 4153 if (vma->vm_flags & VM_SHARED) 4154 return VM_FAULT_SIGBUS; 4155 4156 /* 4157 * Use pte_alloc() instead of pte_alloc_map(). We can't run 4158 * pte_offset_map() on pmds where a huge pmd might be created 4159 * from a different thread. 
4160 * 4161 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when 4162 * parallel threads are excluded by other means. 4163 * 4164 * Here we only have mmap_read_lock(mm). 4165 */ 4166 if (pte_alloc(vma->vm_mm, vmf->pmd)) 4167 return VM_FAULT_OOM; 4168 4169 /* See comment in handle_pte_fault() */ 4170 if (unlikely(pmd_trans_unstable(vmf->pmd))) 4171 return 0; 4172 4173 /* Use the zero-page for reads */ 4174 if (!(vmf->flags & FAULT_FLAG_WRITE) && 4175 !mm_forbids_zeropage(vma->vm_mm)) { 4176 entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), 4177 vma->vm_page_prot)); 4178 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 4179 vmf->address, &vmf->ptl); 4180 if (vmf_pte_changed(vmf)) { 4181 update_mmu_tlb(vma, vmf->address, vmf->pte); 4182 goto unlock; 4183 } 4184 ret = check_stable_address_space(vma->vm_mm); 4185 if (ret) 4186 goto unlock; 4187 /* Deliver the page fault to userland, check inside PT lock */ 4188 if (userfaultfd_missing(vma)) { 4189 pte_unmap_unlock(vmf->pte, vmf->ptl); 4190 return handle_userfault(vmf, VM_UFFD_MISSING); 4191 } 4192 if (uffd_wp) 4193 entry = pte_mkuffd_wp(entry); 4194 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); 4195 4196 /* No need to invalidate - it was non-present before */ 4197 update_mmu_cache(vma, vmf->address, vmf->pte); 4198 goto unlock; 4199 } 4200 4201 /* 4202 * If allocating a large folio, determine the biggest suitable order for 4203 * the VMA (e.g. it must not exceed the VMA's bounds, it must not 4204 * overlap with any populated PTEs, etc). We are not under the ptl here 4205 * so we will need to re-check that we are not overlapping any populated 4206 * PTEs once we have the lock. 4207 */ 4208 order = uffd_wp ? 0 : max_anon_folio_order(vma); 4209 if (order > 0) { 4210 vmf->pte = pte_offset_map(vmf->pmd, vmf->address); 4211 order = calc_anon_folio_order_alloc(vmf, order); 4212 pte_unmap(vmf->pte); 4213 } 4214 4215 /* Allocate our own private folio. */ 4216 if (unlikely(anon_vma_prepare(vma))) 4217 goto oom; 4218 folio = alloc_anon_folio(vma, vmf->address, order); 4219 if (!folio && order > 0) { 4220 order = 0; 4221 folio = alloc_anon_folio(vma, vmf->address, order); 4222 } 4223 if (!folio) 4224 goto oom; 4225 4226 pgcount = 1 << order; 4227 addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); 4228 4229 if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) 4230 goto oom_free_page; 4231 folio_throttle_swaprate(folio, GFP_KERNEL); 4232 4233 /* 4234 * The memory barrier inside __folio_mark_uptodate makes sure that 4235 * preceding stores to the folio contents become visible before 4236 * the set_ptes() write. 
4237 */ 4238 __folio_mark_uptodate(folio); 4239 4240 entry = mk_pte(&folio->page, vma->vm_page_prot); 4241 entry = pte_sw_mkyoung(entry); 4242 if (vma->vm_flags & VM_WRITE) 4243 entry = pte_mkwrite(pte_mkdirty(entry)); 4244 4245 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); 4246 if (vmf_pte_changed(vmf)) { 4247 update_mmu_tlb(vma, vmf->address, vmf->pte); 4248 goto release; 4249 } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { 4250 goto release; 4251 } 4252 4253 ret = check_stable_address_space(vma->vm_mm); 4254 if (ret) 4255 goto release; 4256 4257 /* Deliver the page fault to userland, check inside PT lock */ 4258 if (userfaultfd_missing(vma)) { 4259 pte_unmap_unlock(vmf->pte, vmf->ptl); 4260 folio_put(folio); 4261 return handle_userfault(vmf, VM_UFFD_MISSING); 4262 } 4263 4264 folio_ref_add(folio, pgcount - 1); 4265 add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); 4266 folio_add_new_anon_rmap(folio, vma, addr); 4267 folio_add_lru_vma(folio, vma); 4268 4269 if (uffd_wp) 4270 entry = pte_mkuffd_wp(entry); > 4271 set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); 4272 4273 /* No need to invalidate - it was non-present before */ > 4274 update_mmu_cache_range(vma, addr, vmf->pte, pgcount); 4275 unlock: 4276 pte_unmap_unlock(vmf->pte, vmf->ptl); 4277 return ret; 4278 release: 4279 folio_put(folio); 4280 goto unlock; 4281 oom_free_page: 4282 folio_put(folio); 4283 oom: 4284 return VM_FAULT_OOM; 4285 } 4286
Hi Ryan, kernel test robot noticed the following build errors: [auto build test ERROR on arm64/for-next/core] [also build test ERROR on v6.4] [cannot apply to akpm-mm/mm-everything linus/master next-20230703] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627 base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core patch link: https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance config: um-allnoconfig (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/config) compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a) reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202307032330.TguyNttt-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 547 | val = __raw_readb(PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 560 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr)); | ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu' 37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x)) | ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 573 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); | ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu' 35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) | ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from 
arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 584 | __raw_writeb(value, PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 594 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 604 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 692 | readsb(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 700 | readsw(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 708 | readsl(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 717 | writesb(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 726 | writesw(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 735 | writesl(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ >> mm/memory.c:4271:2: error: call to undeclared function 'set_ptes'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 4271 | set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); | ^ mm/memory.c:4271:2: note: did you mean 'set_pte'? arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here 232 | static inline void set_pte(pte_t *pteptr, pte_t pteval) | ^ >> mm/memory.c:4274:2: error: call to undeclared function 'update_mmu_cache_range'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 4274 | update_mmu_cache_range(vma, addr, vmf->pte, pgcount); | ^ 12 warnings and 2 errors generated. vim +/set_ptes +4271 mm/memory.c 4135 4136 /* 4137 * We enter with non-exclusive mmap_lock (to exclude vma changes, 4138 * but allow concurrent faults), and pte mapped but not yet locked. 4139 * We return with mmap_lock still held, but pte unmapped and unlocked. 4140 */ 4141 static vm_fault_t do_anonymous_page(struct vm_fault *vmf) 4142 { 4143 bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); 4144 struct vm_area_struct *vma = vmf->vma; 4145 struct folio *folio; 4146 vm_fault_t ret = 0; 4147 pte_t entry; 4148 int order; 4149 int pgcount; 4150 unsigned long addr; 4151 4152 /* File mapping without ->vm_ops ? */ 4153 if (vma->vm_flags & VM_SHARED) 4154 return VM_FAULT_SIGBUS; 4155 4156 /* 4157 * Use pte_alloc() instead of pte_alloc_map(). 
We can't run 4158 * pte_offset_map() on pmds where a huge pmd might be created 4159 * from a different thread. 4160 * 4161 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when 4162 * parallel threads are excluded by other means. 4163 * 4164 * Here we only have mmap_read_lock(mm). 4165 */ 4166 if (pte_alloc(vma->vm_mm, vmf->pmd)) 4167 return VM_FAULT_OOM; 4168 4169 /* See comment in handle_pte_fault() */ 4170 if (unlikely(pmd_trans_unstable(vmf->pmd))) 4171 return 0; 4172 4173 /* Use the zero-page for reads */ 4174 if (!(vmf->flags & FAULT_FLAG_WRITE) && 4175 !mm_forbids_zeropage(vma->vm_mm)) { 4176 entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), 4177 vma->vm_page_prot)); 4178 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 4179 vmf->address, &vmf->ptl); 4180 if (vmf_pte_changed(vmf)) { 4181 update_mmu_tlb(vma, vmf->address, vmf->pte); 4182 goto unlock; 4183 } 4184 ret = check_stable_address_space(vma->vm_mm); 4185 if (ret) 4186 goto unlock; 4187 /* Deliver the page fault to userland, check inside PT lock */ 4188 if (userfaultfd_missing(vma)) { 4189 pte_unmap_unlock(vmf->pte, vmf->ptl); 4190 return handle_userfault(vmf, VM_UFFD_MISSING); 4191 } 4192 if (uffd_wp) 4193 entry = pte_mkuffd_wp(entry); 4194 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); 4195 4196 /* No need to invalidate - it was non-present before */ 4197 update_mmu_cache(vma, vmf->address, vmf->pte); 4198 goto unlock; 4199 } 4200 4201 /* 4202 * If allocating a large folio, determine the biggest suitable order for 4203 * the VMA (e.g. it must not exceed the VMA's bounds, it must not 4204 * overlap with any populated PTEs, etc). We are not under the ptl here 4205 * so we will need to re-check that we are not overlapping any populated 4206 * PTEs once we have the lock. 4207 */ 4208 order = uffd_wp ? 0 : max_anon_folio_order(vma); 4209 if (order > 0) { 4210 vmf->pte = pte_offset_map(vmf->pmd, vmf->address); 4211 order = calc_anon_folio_order_alloc(vmf, order); 4212 pte_unmap(vmf->pte); 4213 } 4214 4215 /* Allocate our own private folio. */ 4216 if (unlikely(anon_vma_prepare(vma))) 4217 goto oom; 4218 folio = alloc_anon_folio(vma, vmf->address, order); 4219 if (!folio && order > 0) { 4220 order = 0; 4221 folio = alloc_anon_folio(vma, vmf->address, order); 4222 } 4223 if (!folio) 4224 goto oom; 4225 4226 pgcount = 1 << order; 4227 addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); 4228 4229 if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) 4230 goto oom_free_page; 4231 folio_throttle_swaprate(folio, GFP_KERNEL); 4232 4233 /* 4234 * The memory barrier inside __folio_mark_uptodate makes sure that 4235 * preceding stores to the folio contents become visible before 4236 * the set_ptes() write. 
4237 */ 4238 __folio_mark_uptodate(folio); 4239 4240 entry = mk_pte(&folio->page, vma->vm_page_prot); 4241 entry = pte_sw_mkyoung(entry); 4242 if (vma->vm_flags & VM_WRITE) 4243 entry = pte_mkwrite(pte_mkdirty(entry)); 4244 4245 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); 4246 if (vmf_pte_changed(vmf)) { 4247 update_mmu_tlb(vma, vmf->address, vmf->pte); 4248 goto release; 4249 } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { 4250 goto release; 4251 } 4252 4253 ret = check_stable_address_space(vma->vm_mm); 4254 if (ret) 4255 goto release; 4256 4257 /* Deliver the page fault to userland, check inside PT lock */ 4258 if (userfaultfd_missing(vma)) { 4259 pte_unmap_unlock(vmf->pte, vmf->ptl); 4260 folio_put(folio); 4261 return handle_userfault(vmf, VM_UFFD_MISSING); 4262 } 4263 4264 folio_ref_add(folio, pgcount - 1); 4265 add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); 4266 folio_add_new_anon_rmap(folio, vma, addr); 4267 folio_add_lru_vma(folio, vma); 4268 4269 if (uffd_wp) 4270 entry = pte_mkuffd_wp(entry); > 4271 set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); 4272 4273 /* No need to invalidate - it was non-present before */ > 4274 update_mmu_cache_range(vma, addr, vmf->pte, pgcount); 4275 unlock: 4276 pte_unmap_unlock(vmf->pte, vmf->ptl); 4277 return ret; 4278 release: 4279 folio_put(folio); 4280 goto unlock; 4281 oom_free_page: 4282 folio_put(folio); 4283 oom: 4284 return VM_FAULT_OOM; 4285 } 4286
On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > allocated in large folios of a specified order. All pages of the large > folio are pte-mapped during the same page fault, significantly reducing > the number of page faults. The number of per-page operations (e.g. ref > counting, rmap management lru list management) are also significantly > reduced since those ops now become per-folio. > > The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which > defaults to disabled for now; there is a long list of todos to make > FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some > madvise ops, etc). These items will be tackled in subsequent patches. > > When enabled, the preferred folio order is as returned by > arch_wants_pte_order(), which may be overridden by the arch as it sees > fit. Some architectures (e.g. arm64) can coalsece TLB entries if a coalesce > contiguous set of ptes map physically contigious, naturally aligned contiguous > memory, so this mechanism allows the architecture to optimize as > required. > > If the preferred order can't be used (e.g. because the folio would > breach the bounds of the vma, or because ptes in the region are already > mapped) then we fall back to a suitable lower order. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/Kconfig | 10 ++++ > mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- > 2 files changed, 165 insertions(+), 13 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 7672a22647b4..1c06b2c0a24e 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS > support of file THPs will be developed in the next few release > cycles. > > +config FLEXIBLE_THP > + bool "Flexible order THP" > + depends on TRANSPARENT_HUGEPAGE > + default n The default value is already N. > + help > + Use large (bigger than order-0) folios to back anonymous memory where > + possible, even if the order of the folio is smaller than the PMD > + order. This reduces the number of page faults, as well as other > + per-page overheads to improve performance for many workloads. > + > endif # TRANSPARENT_HUGEPAGE > > # > diff --git a/mm/memory.c b/mm/memory.c > index fb30f7523550..abe2ea94f3f5 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) > return 0; > } > > +#ifdef CONFIG_FLEXIBLE_THP > +/* > + * Allocates, zeros and returns a folio of the requested order for use as > + * anonymous memory. > + */ > +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, > + unsigned long addr, int order) > +{ > + gfp_t gfp; > + struct folio *folio; > + > + if (order == 0) > + return vma_alloc_zeroed_movable_folio(vma, addr); > + > + gfp = vma_thp_gfp_mask(vma); > + folio = vma_alloc_folio(gfp, order, vma, addr, true); > + if (folio) > + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); > + > + return folio; > +} > + > +/* > + * Preferred folio order to allocate for anonymous memory. > + */ > +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) > +#else > +#define alloc_anon_folio(vma, addr, order) \ > + vma_alloc_zeroed_movable_folio(vma, addr) > +#define max_anon_folio_order(vma) 0 > +#endif > + > +/* > + * Returns index of first pte that is not none, or nr if all are none. 
> + */ > +static inline int check_ptes_none(pte_t *pte, int nr) > +{ > + int i; > + > + for (i = 0; i < nr; i++) { > + if (!pte_none(ptep_get(pte++))) > + return i; > + } > + > + return nr; > +} > + > +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) > +{ > + /* > + * The aim here is to determine what size of folio we should allocate > + * for this fault. Factors include: > + * - Order must not be higher than `order` upon entry > + * - Folio must be naturally aligned within VA space > + * - Folio must be fully contained inside one pmd entry > + * - Folio must not breach boundaries of vma > + * - Folio must not overlap any non-none ptes > + * > + * Additionally, we do not allow order-1 since this breaks assumptions > + * elsewhere in the mm; THP pages must be at least order-2 (since they > + * store state up to the 3rd struct page subpage), and these pages must > + * be THP in order to correctly use pre-existing THP infrastructure such > + * as folio_split(). > + * > + * Note that the caller may or may not choose to lock the pte. If > + * unlocked, the result is racy and the user must re-check any overlap > + * with non-none ptes under the lock. > + */ > + > + struct vm_area_struct *vma = vmf->vma; > + int nr; > + unsigned long addr; > + pte_t *pte; > + pte_t *first_set = NULL; > + int ret; > + > + order = min(order, PMD_SHIFT - PAGE_SHIFT); > + > + for (; order > 1; order--) { I'm not sure how we can justify this policy. As an initial step, it'd be a lot easier to sell if we only considered the order of arch_wants_pte_order() and the order 0. > + nr = 1 << order; > + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); > + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); > + > + /* Check vma bounds. */ > + if (addr < vma->vm_start || > + addr + (nr << PAGE_SHIFT) > vma->vm_end) > + continue; > + > + /* Ptes covered by order already known to be none. */ > + if (pte + nr <= first_set) > + break; > + > + /* Already found set pte in range covered by order. */ > + if (pte <= first_set) > + continue; > + > + /* Need to check if all the ptes are none. */ > + ret = check_ptes_none(pte, nr); > + if (ret == nr) > + break; > + > + first_set = pte + ret; > + } > + > + if (order == 1) > + order = 0; > + > + return order; > +} Everything above can be simplified into two helpers: vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you prefer). Details below. > /* > * Handle write page faults for pages that can be reused in the current vma > * > @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > goto oom; > > if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { > - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + new_folio = alloc_anon_folio(vma, vmf->address, 0); This seems unnecessary for now. Later on, we could fill in an aligned area with multiple write-protected zero pages during a read fault and then replace them with a large folio here. > if (!new_folio) > goto oom; > } else { > @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > struct folio *folio; > vm_fault_t ret = 0; > pte_t entry; > + int order; > + int pgcount; > + unsigned long addr; > > /* File mapping without ->vm_ops ? 
*/ > if (vma->vm_flags & VM_SHARED) > @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > pte_unmap_unlock(vmf->pte, vmf->ptl); > return handle_userfault(vmf, VM_UFFD_MISSING); > } > - goto setpte; > + if (uffd_wp) > + entry = pte_mkuffd_wp(entry); > + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + > + /* No need to invalidate - it was non-present before */ > + update_mmu_cache(vma, vmf->address, vmf->pte); > + goto unlock; > + } Nor really needed IMO. Details below. === > + /* > + * If allocating a large folio, determine the biggest suitable order for > + * the VMA (e.g. it must not exceed the VMA's bounds, it must not > + * overlap with any populated PTEs, etc). We are not under the ptl here > + * so we will need to re-check that we are not overlapping any populated > + * PTEs once we have the lock. > + */ > + order = uffd_wp ? 0 : max_anon_folio_order(vma); > + if (order > 0) { > + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); > + order = calc_anon_folio_order_alloc(vmf, order); > + pte_unmap(vmf->pte); > } === The section above together with the section below should be wrapped in a helper. > - /* Allocate our own private page. */ > + /* Allocate our own private folio. */ > if (unlikely(anon_vma_prepare(vma))) > goto oom; === > - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + folio = alloc_anon_folio(vma, vmf->address, order); > + if (!folio && order > 0) { > + order = 0; > + folio = alloc_anon_folio(vma, vmf->address, order); > + } === One helper returns a folio of order arch_wants_pte_order(), or order 0 if it fails to allocate that order, e.g., folio = alloc_anon_folio(vmf); And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0 regardless of arch_wants_pte_order(). Upon success, it can update vmf->address, since if we run into a race with another PF, we exit the fault handler and retry anyway. > if (!folio) > goto oom; > > + pgcount = 1 << order; > + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); As shown above, the helper already updates vmf->address. And mm/ never used pgcount before -- the convention is nr_pages = folio_nr_pages(). > if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) > goto oom_free_page; > folio_throttle_swaprate(folio, GFP_KERNEL); > > /* > * The memory barrier inside __folio_mark_uptodate makes sure that > - * preceding stores to the page contents become visible before > - * the set_pte_at() write. > + * preceding stores to the folio contents become visible before > + * the set_ptes() write. We don't have set_ptes() yet. > */ > __folio_mark_uptodate(folio); > > @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > if (vma->vm_flags & VM_WRITE) > entry = pte_mkwrite(pte_mkdirty(entry)); > > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, > - &vmf->ptl); > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); > if (vmf_pte_changed(vmf)) { > update_mmu_tlb(vma, vmf->address, vmf->pte); > goto release; > + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { > + goto release; > } Need new helper: if (vmf_pte_range_changed(vmf, nr_pages)) { for (i = 0; i < nr_pages; i++) update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); goto release; } (It should be fine to call update_mmu_tlb() even if it's not really necessary.) 
> ret = check_stable_address_space(vma->vm_mm); > @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > return handle_userfault(vmf, VM_UFFD_MISSING); > } > > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > - folio_add_new_anon_rmap(folio, vma, vmf->address); > + folio_ref_add(folio, pgcount - 1); > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); > + folio_add_new_anon_rmap(folio, vma, addr); > folio_add_lru_vma(folio, vma); > -setpte: > + > if (uffd_wp) > entry = pte_mkuffd_wp(entry); > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); We would have to do it one by one for now. > /* No need to invalidate - it was non-present before */ > - update_mmu_cache(vma, vmf->address, vmf->pte); > + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); Ditto. How about this (by moving mk_pte() and its friends here): ... folio_add_lru_vma(folio, vma); for (i = 0; i < nr_pages; i++) { entry = mk_pte(folio_page(folio, i), vma->vm_page_prot); entry = pte_sw_mkyoung(entry); if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); setpte: if (uffd_wp) entry = pte_mkuffd_wp(entry); set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, vmf->pte + i, entry); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); } > unlock: > pte_unmap_unlock(vmf->pte, vmf->ptl); > return ret; Attaching a small patch in case anything above is not clear. Please take a look. Thanks.
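(To make the suggestion above easier to follow, here is a minimal sketch of the two helpers as described in the review. Signatures and behaviour are inferred from the comments; the small patch Yu Zhao attached is not reproduced here and may differ, and vma-bounds/overlap checks are elided for brevity.)

/* Sketch: allocate at arch_wants_pte_order(), falling back to order 0. */
static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int order = vmf_orig_pte_uffd_wp(vmf) ? 0 : arch_wants_pte_order(vma);
	struct folio *folio;

	if (order > 0) {
		unsigned long addr = ALIGN_DOWN(vmf->address,
						PAGE_SIZE << order);

		folio = vma_alloc_folio(vma_thp_gfp_mask(vma), order, vma,
					addr, true);
		if (folio) {
			clear_huge_page(&folio->page, addr,
					folio_nr_pages(folio));
			/* On success, retarget the fault at the folio base. */
			vmf->address = addr;
			return folio;
		}
	}

	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
}

/* Sketch: has any pte covering the folio changed under us? */
static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
{
	int i;

	if (nr_pages == 1)
		return vmf_pte_changed(vmf);

	for (i = 0; i < nr_pages; i++) {
		if (!pte_none(ptep_get(vmf->pte + i)))
			return true;
	}

	return false;
}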
On 7/3/2023 9:53 PM, Ryan Roberts wrote: > Introduce FLEXIBLE_THP feature, which allows anonymous memory to be THP is for huge page which is 2M size. We are not huge page here. But I don't have good name either. > allocated in large folios of a specified order. All pages of the large > folio are pte-mapped during the same page fault, significantly reducing > the number of page faults. The number of per-page operations (e.g. ref > counting, rmap management lru list management) are also significantly > reduced since those ops now become per-folio. > > The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which > defaults to disabled for now; there is a long list of todos to make > FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some > madvise ops, etc). These items will be tackled in subsequent patches. > > When enabled, the preferred folio order is as returned by > arch_wants_pte_order(), which may be overridden by the arch as it sees > fit. Some architectures (e.g. arm64) can coalsece TLB entries if a > contiguous set of ptes map physically contigious, naturally aligned > memory, so this mechanism allows the architecture to optimize as > required. > > If the preferred order can't be used (e.g. because the folio would > breach the bounds of the vma, or because ptes in the region are already > mapped) then we fall back to a suitable lower order. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/Kconfig | 10 ++++ > mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- > 2 files changed, 165 insertions(+), 13 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 7672a22647b4..1c06b2c0a24e 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS > support of file THPs will be developed in the next few release > cycles. > > +config FLEXIBLE_THP > + bool "Flexible order THP" > + depends on TRANSPARENT_HUGEPAGE > + default n > + help > + Use large (bigger than order-0) folios to back anonymous memory where > + possible, even if the order of the folio is smaller than the PMD > + order. This reduces the number of page faults, as well as other > + per-page overheads to improve performance for many workloads. > + > endif # TRANSPARENT_HUGEPAGE > > # > diff --git a/mm/memory.c b/mm/memory.c > index fb30f7523550..abe2ea94f3f5 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) > return 0; > } > > +#ifdef CONFIG_FLEXIBLE_THP > +/* > + * Allocates, zeros and returns a folio of the requested order for use as > + * anonymous memory. > + */ > +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, > + unsigned long addr, int order) > +{ > + gfp_t gfp; > + struct folio *folio; > + > + if (order == 0) > + return vma_alloc_zeroed_movable_folio(vma, addr); > + > + gfp = vma_thp_gfp_mask(vma); > + folio = vma_alloc_folio(gfp, order, vma, addr, true); > + if (folio) > + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); > + > + return folio; > +} > + > +/* > + * Preferred folio order to allocate for anonymous memory. > + */ > +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) > +#else > +#define alloc_anon_folio(vma, addr, order) \ > + vma_alloc_zeroed_movable_folio(vma, addr) > +#define max_anon_folio_order(vma) 0 > +#endif > + > +/* > + * Returns index of first pte that is not none, or nr if all are none. 
> + */ > +static inline int check_ptes_none(pte_t *pte, int nr) > +{ > + int i; > + > + for (i = 0; i < nr; i++) { > + if (!pte_none(ptep_get(pte++))) > + return i; > + } > + > + return nr; > +} > + > +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) > +{ > + /* > + * The aim here is to determine what size of folio we should allocate > + * for this fault. Factors include: > + * - Order must not be higher than `order` upon entry > + * - Folio must be naturally aligned within VA space > + * - Folio must be fully contained inside one pmd entry > + * - Folio must not breach boundaries of vma > + * - Folio must not overlap any non-none ptes > + * > + * Additionally, we do not allow order-1 since this breaks assumptions > + * elsewhere in the mm; THP pages must be at least order-2 (since they > + * store state up to the 3rd struct page subpage), and these pages must > + * be THP in order to correctly use pre-existing THP infrastructure such > + * as folio_split(). > + * > + * Note that the caller may or may not choose to lock the pte. If > + * unlocked, the result is racy and the user must re-check any overlap > + * with non-none ptes under the lock. > + */ > + > + struct vm_area_struct *vma = vmf->vma; > + int nr; > + unsigned long addr; > + pte_t *pte; > + pte_t *first_set = NULL; > + int ret; > + > + order = min(order, PMD_SHIFT - PAGE_SHIFT); > + > + for (; order > 1; order--) { > + nr = 1 << order; > + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); > + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); > + > + /* Check vma bounds. */ > + if (addr < vma->vm_start || > + addr + (nr << PAGE_SHIFT) > vma->vm_end) > + continue; > + > + /* Ptes covered by order already known to be none. */ > + if (pte + nr <= first_set) > + break; > + > + /* Already found set pte in range covered by order. */ > + if (pte <= first_set) > + continue; > + > + /* Need to check if all the ptes are none. */ > + ret = check_ptes_none(pte, nr); > + if (ret == nr) > + break; > + > + first_set = pte + ret; > + } > + > + if (order == 1) > + order = 0; > + > + return order; > +} The logic in above function should be kept is whether the order fit in vma range. check_ptes_none() is not accurate here because no page table lock hold and concurrent fault could happen. So may just drop the check here? Check_ptes_none() is done after take the page table lock. We pick the arch prefered order or order 0 now. > + > /* > * Handle write page faults for pages that can be reused in the current vma > * > @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > goto oom; > > if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { > - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + new_folio = alloc_anon_folio(vma, vmf->address, 0); > if (!new_folio) > goto oom; > } else { > @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > struct folio *folio; > vm_fault_t ret = 0; > pte_t entry; > + int order; > + int pgcount; > + unsigned long addr; > > /* File mapping without ->vm_ops ? 
*/ > if (vma->vm_flags & VM_SHARED) > @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > pte_unmap_unlock(vmf->pte, vmf->ptl); > return handle_userfault(vmf, VM_UFFD_MISSING); > } > - goto setpte; > + if (uffd_wp) > + entry = pte_mkuffd_wp(entry); > + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + > + /* No need to invalidate - it was non-present before */ > + update_mmu_cache(vma, vmf->address, vmf->pte); > + goto unlock; > + } > + > + /* > + * If allocating a large folio, determine the biggest suitable order for > + * the VMA (e.g. it must not exceed the VMA's bounds, it must not > + * overlap with any populated PTEs, etc). We are not under the ptl here > + * so we will need to re-check that we are not overlapping any populated > + * PTEs once we have the lock. > + */ > + order = uffd_wp ? 0 : max_anon_folio_order(vma); > + if (order > 0) { > + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); > + order = calc_anon_folio_order_alloc(vmf, order); > + pte_unmap(vmf->pte); > } > > - /* Allocate our own private page. */ > + /* Allocate our own private folio. */ > if (unlikely(anon_vma_prepare(vma))) > goto oom; > - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + folio = alloc_anon_folio(vma, vmf->address, order); > + if (!folio && order > 0) { > + order = 0; > + folio = alloc_anon_folio(vma, vmf->address, order); > + } > if (!folio) > goto oom; > > + pgcount = 1 << order; > + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); > + > if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) > goto oom_free_page; > folio_throttle_swaprate(folio, GFP_KERNEL); > > /* > * The memory barrier inside __folio_mark_uptodate makes sure that > - * preceding stores to the page contents become visible before > - * the set_pte_at() write. > + * preceding stores to the folio contents become visible before > + * the set_ptes() write. > */ > __folio_mark_uptodate(folio); > > @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > if (vma->vm_flags & VM_WRITE) > entry = pte_mkwrite(pte_mkdirty(entry)); > > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, > - &vmf->ptl); > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); > if (vmf_pte_changed(vmf)) { > update_mmu_tlb(vma, vmf->address, vmf->pte); > goto release; > + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { This could be the case that we allocated order 4 page and find a neighbor PTE is filled by concurrent fault. Should we put current folio and fallback to order 0 and try again immedately (goto order 0 allocation instead of return from this function which will go through some page fault path again)? 
Regards Yin, Fengwei > + goto release; > } > > ret = check_stable_address_space(vma->vm_mm); > @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > return handle_userfault(vmf, VM_UFFD_MISSING); > } > > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > - folio_add_new_anon_rmap(folio, vma, vmf->address); > + folio_ref_add(folio, pgcount - 1); > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); > + folio_add_new_anon_rmap(folio, vma, addr); > folio_add_lru_vma(folio, vma); > -setpte: > + > if (uffd_wp) > entry = pte_mkuffd_wp(entry); > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); > > /* No need to invalidate - it was non-present before */ > - update_mmu_cache(vma, vmf->address, vmf->pte); > + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); > unlock: > pte_unmap_unlock(vmf->pte, vmf->ptl); > return ret;
On 04/07/2023 02:35, Yu Zhao wrote: > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >> allocated in large folios of a specified order. All pages of the large >> folio are pte-mapped during the same page fault, significantly reducing >> the number of page faults. The number of per-page operations (e.g. ref >> counting, rmap management lru list management) are also significantly >> reduced since those ops now become per-folio. >> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >> defaults to disabled for now; there is a long list of todos to make >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some >> madvise ops, etc). These items will be tackled in subsequent patches. >> >> When enabled, the preferred folio order is as returned by >> arch_wants_pte_order(), which may be overridden by the arch as it sees >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a > > coalesce ACK > >> contiguous set of ptes map physically contigious, naturally aligned > > contiguous ACK > >> memory, so this mechanism allows the architecture to optimize as >> required. >> >> If the preferred order can't be used (e.g. because the folio would >> breach the bounds of the vma, or because ptes in the region are already >> mapped) then we fall back to a suitable lower order. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/Kconfig | 10 ++++ >> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- >> 2 files changed, 165 insertions(+), 13 deletions(-) >> >> diff --git a/mm/Kconfig b/mm/Kconfig >> index 7672a22647b4..1c06b2c0a24e 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS >> support of file THPs will be developed in the next few release >> cycles. >> >> +config FLEXIBLE_THP >> + bool "Flexible order THP" >> + depends on TRANSPARENT_HUGEPAGE >> + default n > > The default value is already N. Is there a coding standard for this? Personally I prefer to make it explicit. > >> + help >> + Use large (bigger than order-0) folios to back anonymous memory where >> + possible, even if the order of the folio is smaller than the PMD >> + order. This reduces the number of page faults, as well as other >> + per-page overheads to improve performance for many workloads. >> + >> endif # TRANSPARENT_HUGEPAGE >> >> # >> diff --git a/mm/memory.c b/mm/memory.c >> index fb30f7523550..abe2ea94f3f5 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) >> return 0; >> } >> >> +#ifdef CONFIG_FLEXIBLE_THP >> +/* >> + * Allocates, zeros and returns a folio of the requested order for use as >> + * anonymous memory. >> + */ >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, >> + unsigned long addr, int order) >> +{ >> + gfp_t gfp; >> + struct folio *folio; >> + >> + if (order == 0) >> + return vma_alloc_zeroed_movable_folio(vma, addr); >> + >> + gfp = vma_thp_gfp_mask(vma); >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >> + if (folio) >> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); >> + >> + return folio; >> +} >> + >> +/* >> + * Preferred folio order to allocate for anonymous memory. 
>> + */ >> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) >> +#else >> +#define alloc_anon_folio(vma, addr, order) \ >> + vma_alloc_zeroed_movable_folio(vma, addr) >> +#define max_anon_folio_order(vma) 0 >> +#endif >> + >> +/* >> + * Returns index of first pte that is not none, or nr if all are none. >> + */ >> +static inline int check_ptes_none(pte_t *pte, int nr) >> +{ >> + int i; >> + >> + for (i = 0; i < nr; i++) { >> + if (!pte_none(ptep_get(pte++))) >> + return i; >> + } >> + >> + return nr; >> +} >> + >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) >> +{ >> + /* >> + * The aim here is to determine what size of folio we should allocate >> + * for this fault. Factors include: >> + * - Order must not be higher than `order` upon entry >> + * - Folio must be naturally aligned within VA space >> + * - Folio must be fully contained inside one pmd entry >> + * - Folio must not breach boundaries of vma >> + * - Folio must not overlap any non-none ptes >> + * >> + * Additionally, we do not allow order-1 since this breaks assumptions >> + * elsewhere in the mm; THP pages must be at least order-2 (since they >> + * store state up to the 3rd struct page subpage), and these pages must >> + * be THP in order to correctly use pre-existing THP infrastructure such >> + * as folio_split(). >> + * >> + * Note that the caller may or may not choose to lock the pte. If >> + * unlocked, the result is racy and the user must re-check any overlap >> + * with non-none ptes under the lock. >> + */ >> + >> + struct vm_area_struct *vma = vmf->vma; >> + int nr; >> + unsigned long addr; >> + pte_t *pte; >> + pte_t *first_set = NULL; >> + int ret; >> + >> + order = min(order, PMD_SHIFT - PAGE_SHIFT); >> + >> + for (; order > 1; order--) { > > I'm not sure how we can justify this policy. As an initial step, it'd > be a lot easier to sell if we only considered the order of > arch_wants_pte_order() and the order 0. My justification is in the cover letter; I see performance regression (vs the unpatched kernel) when using the policy you suggest. This policy performs much better in my tests. (I'll reply directly to your follow up questions in the cover letter shortly). What are your technical concerns about this approach? It is pretty light weight (I only touch each PTE once, regardless of the number of loops). If we have strong technical reasons for reverting to the less performant approach then fair enough, but I'd like to hear the rational first. > >> + nr = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); >> + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); >> + >> + /* Check vma bounds. */ >> + if (addr < vma->vm_start || >> + addr + (nr << PAGE_SHIFT) > vma->vm_end) >> + continue; >> + >> + /* Ptes covered by order already known to be none. */ >> + if (pte + nr <= first_set) >> + break; >> + >> + /* Already found set pte in range covered by order. */ >> + if (pte <= first_set) >> + continue; >> + >> + /* Need to check if all the ptes are none. */ >> + ret = check_ptes_none(pte, nr); >> + if (ret == nr) >> + break; >> + >> + first_set = pte + ret; >> + } >> + >> + if (order == 1) >> + order = 0; >> + >> + return order; >> +} > > Everything above can be simplified into two helpers: > vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you > prefer). Details below. 
> >> /* >> * Handle write page faults for pages that can be reused in the current vma >> * >> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) >> goto oom; >> >> if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { >> - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + new_folio = alloc_anon_folio(vma, vmf->address, 0); > > This seems unnecessary for now. Later on, we could fill in an aligned > area with multiple write-protected zero pages during a read fault and > then replace them with a large folio here. I don't have a strong opinion. I thought that it would be neater to use the same API everywhere, but happy to revert. > >> if (!new_folio) >> goto oom; >> } else { >> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> struct folio *folio; >> vm_fault_t ret = 0; >> pte_t entry; >> + int order; >> + int pgcount; >> + unsigned long addr; >> >> /* File mapping without ->vm_ops ? */ >> if (vma->vm_flags & VM_SHARED) >> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> - goto setpte; >> + if (uffd_wp) >> + entry = pte_mkuffd_wp(entry); >> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + >> + /* No need to invalidate - it was non-present before */ >> + update_mmu_cache(vma, vmf->address, vmf->pte); >> + goto unlock; >> + } > > Nor really needed IMO. Details below. > > === > >> + /* >> + * If allocating a large folio, determine the biggest suitable order for >> + * the VMA (e.g. it must not exceed the VMA's bounds, it must not >> + * overlap with any populated PTEs, etc). We are not under the ptl here >> + * so we will need to re-check that we are not overlapping any populated >> + * PTEs once we have the lock. >> + */ >> + order = uffd_wp ? 0 : max_anon_folio_order(vma); >> + if (order > 0) { >> + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); >> + order = calc_anon_folio_order_alloc(vmf, order); >> + pte_unmap(vmf->pte); >> } > > === > > The section above together with the section below should be wrapped in a helper. > >> - /* Allocate our own private page. */ >> + /* Allocate our own private folio. */ >> if (unlikely(anon_vma_prepare(vma))) >> goto oom; > > === > >> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + if (!folio && order > 0) { >> + order = 0; >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + } > > === > > One helper returns a folio of order arch_wants_pte_order(), or order 0 > if it fails to allocate that order, e.g., > > folio = alloc_anon_folio(vmf); > > And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0 > regardless of arch_wants_pte_order(). Upon success, it can update > vmf->address, since if we run into a race with another PF, we exit the > fault handler and retry anyway. > >> if (!folio) >> goto oom; >> >> + pgcount = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); > > As shown above, the helper already updates vmf->address. And mm/ never > used pgcount before -- the convention is nr_pages = folio_nr_pages(). ACK > >> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) >> goto oom_free_page; >> folio_throttle_swaprate(folio, GFP_KERNEL); >> >> /* >> * The memory barrier inside __folio_mark_uptodate makes sure that >> - * preceding stores to the page contents become visible before >> - * the set_pte_at() write. 
>> + * preceding stores to the folio contents become visible before >> + * the set_ptes() write. > > We don't have set_ptes() yet. Indeed, that's why I listed the set_ptes() patch set as a hard dependency ;-) > >> */ >> __folio_mark_uptodate(folio); >> >> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> if (vma->vm_flags & VM_WRITE) >> entry = pte_mkwrite(pte_mkdirty(entry)); >> >> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, >> - &vmf->ptl); >> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); >> if (vmf_pte_changed(vmf)) { >> update_mmu_tlb(vma, vmf->address, vmf->pte); >> goto release; >> + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { >> + goto release; >> } > > Need new helper: > > if (vmf_pte_range_changed(vmf, nr_pages)) { > for (i = 0; i < nr_pages; i++) > update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); > goto release; > } > > (It should be fine to call update_mmu_tlb() even if it's not really necessary.) > >> ret = check_stable_address_space(vma->vm_mm); >> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> >> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >> - folio_add_new_anon_rmap(folio, vma, vmf->address); >> + folio_ref_add(folio, pgcount - 1); >> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); >> + folio_add_new_anon_rmap(folio, vma, addr); >> folio_add_lru_vma(folio, vma); >> -setpte: >> + >> if (uffd_wp) >> entry = pte_mkuffd_wp(entry); >> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); > > We would have to do it one by one for now. > >> /* No need to invalidate - it was non-present before */ >> - update_mmu_cache(vma, vmf->address, vmf->pte); >> + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); > > Ditto. > > How about this (by moving mk_pte() and its friends here): > ... > folio_add_lru_vma(folio, vma); > > for (i = 0; i < nr_pages; i++) { > entry = mk_pte(folio_page(folio, i), vma->vm_page_prot); > entry = pte_sw_mkyoung(entry); > if (vma->vm_flags & VM_WRITE) > entry = pte_mkwrite(pte_mkdirty(entry)); > setpte: > if (uffd_wp) > entry = pte_mkuffd_wp(entry); > set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, > vmf->pte + i, entry); > > /* No need to invalidate - it was non-present before */ > update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, > vmf->pte + i); > } > >> unlock: >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return ret; > > Attaching a small patch in case anything above is not clear. Please > take a look. Thanks. OK, I'll take a look and rework for v3.
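For reference, the two helpers suggested above might look roughly like the sketch below. This is reconstructed from the snippets quoted in this thread, not taken from the attached patch; vmf_pte_changed() and vmf_orig_pte_uffd_wp() are existing helpers in mm/memory.c, while arch_wants_pte_order() comes from earlier in this series.

static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
{
	int i;

	/* Single-page case falls back to the existing helper. */
	if (nr_pages == 1)
		return vmf_pte_changed(vmf);

	/* For a large folio we expect the whole range to still be pte_none. */
	for (i = 0; i < nr_pages; i++) {
		if (!pte_none(ptep_get(vmf->pte + i)))
			return true;
	}

	return false;
}

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int order = arch_wants_pte_order(vma);

	/* Only order 0 when uffd-wp is in play, as suggested above. */
	if (order > 0 && !vmf_orig_pte_uffd_wp(vmf)) {
		unsigned long addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		gfp_t gfp = vma_thp_gfp_mask(vma);
		struct folio *folio;

		/* A real version would also check the vma bounds here. */
		folio = vma_alloc_folio(gfp, order, vma, addr, true);
		if (folio) {
			clear_huge_page(&folio->page, addr, 1 << order);
			vmf->address = addr;	/* helper updates the fault address */
			return folio;
		}
	}

	/* Fall back to a single zeroed movable page. */
	return vma_alloc_zeroed_movable_folio(vma, vmf->address);
}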
On 04/07/2023 04:45, Yin, Fengwei wrote: > > On 7/3/2023 9:53 PM, Ryan Roberts wrote: >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > THP is for huge page which is 2M size. We are not huge page here. But > I don't have good name either. Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K base page, they are 512M. So huge pages already have a variable size. And they sometimes get PTE-mapped. So can't we just think of this as an extension of the THP feature? > >> allocated in large folios of a specified order. All pages of the large >> folio are pte-mapped during the same page fault, significantly reducing >> the number of page faults. The number of per-page operations (e.g. ref >> counting, rmap management lru list management) are also significantly >> reduced since those ops now become per-folio. >> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >> defaults to disabled for now; there is a long list of todos to make >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some >> madvise ops, etc). These items will be tackled in subsequent patches. >> >> When enabled, the preferred folio order is as returned by >> arch_wants_pte_order(), which may be overridden by the arch as it sees >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a >> contiguous set of ptes map physically contigious, naturally aligned >> memory, so this mechanism allows the architecture to optimize as >> required. >> >> If the preferred order can't be used (e.g. because the folio would >> breach the bounds of the vma, or because ptes in the region are already >> mapped) then we fall back to a suitable lower order. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/Kconfig | 10 ++++ >> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- >> 2 files changed, 165 insertions(+), 13 deletions(-) >> >> diff --git a/mm/Kconfig b/mm/Kconfig >> index 7672a22647b4..1c06b2c0a24e 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS >> support of file THPs will be developed in the next few release >> cycles. >> >> +config FLEXIBLE_THP >> + bool "Flexible order THP" >> + depends on TRANSPARENT_HUGEPAGE >> + default n >> + help >> + Use large (bigger than order-0) folios to back anonymous memory where >> + possible, even if the order of the folio is smaller than the PMD >> + order. This reduces the number of page faults, as well as other >> + per-page overheads to improve performance for many workloads. >> + >> endif # TRANSPARENT_HUGEPAGE >> >> # >> diff --git a/mm/memory.c b/mm/memory.c >> index fb30f7523550..abe2ea94f3f5 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) >> return 0; >> } >> >> +#ifdef CONFIG_FLEXIBLE_THP >> +/* >> + * Allocates, zeros and returns a folio of the requested order for use as >> + * anonymous memory. >> + */ >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, >> + unsigned long addr, int order) >> +{ >> + gfp_t gfp; >> + struct folio *folio; >> + >> + if (order == 0) >> + return vma_alloc_zeroed_movable_folio(vma, addr); >> + >> + gfp = vma_thp_gfp_mask(vma); >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >> + if (folio) >> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); >> + >> + return folio; >> +} >> + >> +/* >> + * Preferred folio order to allocate for anonymous memory. 
>> + */ >> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) >> +#else >> +#define alloc_anon_folio(vma, addr, order) \ >> + vma_alloc_zeroed_movable_folio(vma, addr) >> +#define max_anon_folio_order(vma) 0 >> +#endif >> + >> +/* >> + * Returns index of first pte that is not none, or nr if all are none. >> + */ >> +static inline int check_ptes_none(pte_t *pte, int nr) >> +{ >> + int i; >> + >> + for (i = 0; i < nr; i++) { >> + if (!pte_none(ptep_get(pte++))) >> + return i; >> + } >> + >> + return nr; >> +} >> + >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) >> +{ >> + /* >> + * The aim here is to determine what size of folio we should allocate >> + * for this fault. Factors include: >> + * - Order must not be higher than `order` upon entry >> + * - Folio must be naturally aligned within VA space >> + * - Folio must be fully contained inside one pmd entry >> + * - Folio must not breach boundaries of vma >> + * - Folio must not overlap any non-none ptes >> + * >> + * Additionally, we do not allow order-1 since this breaks assumptions >> + * elsewhere in the mm; THP pages must be at least order-2 (since they >> + * store state up to the 3rd struct page subpage), and these pages must >> + * be THP in order to correctly use pre-existing THP infrastructure such >> + * as folio_split(). >> + * >> + * Note that the caller may or may not choose to lock the pte. If >> + * unlocked, the result is racy and the user must re-check any overlap >> + * with non-none ptes under the lock. >> + */ >> + >> + struct vm_area_struct *vma = vmf->vma; >> + int nr; >> + unsigned long addr; >> + pte_t *pte; >> + pte_t *first_set = NULL; >> + int ret; >> + >> + order = min(order, PMD_SHIFT - PAGE_SHIFT); >> + >> + for (; order > 1; order--) { >> + nr = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); >> + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); >> + >> + /* Check vma bounds. */ >> + if (addr < vma->vm_start || >> + addr + (nr << PAGE_SHIFT) > vma->vm_end) >> + continue; >> + >> + /* Ptes covered by order already known to be none. */ >> + if (pte + nr <= first_set) >> + break; >> + >> + /* Already found set pte in range covered by order. */ >> + if (pte <= first_set) >> + continue; >> + >> + /* Need to check if all the ptes are none. */ >> + ret = check_ptes_none(pte, nr); >> + if (ret == nr) >> + break; >> + >> + first_set = pte + ret; >> + } >> + >> + if (order == 1) >> + order = 0; >> + >> + return order; >> +} > The logic in above function should be kept is whether the order fit in vma range. > > check_ptes_none() is not accurate here because no page table lock hold and concurrent > fault could happen. So may just drop the check here? Check_ptes_none() is done after > take the page table lock. I agree it is just an estimate given the lock is not held; the comment at the top says the same. But I don't think we can wait until after the lock is taken to measure this. We can't hold the lock while allocating the folio and we need a guess at what to allocate. If we don't guess here, we will allocate the biggest, then take the lock, see that it doesn't fit, and exit. Then the system will re-fault and we will follow the exact same path - ending up in live lock. > > We pick the arch prefered order or order 0 now. 
> >> + >> /* >> * Handle write page faults for pages that can be reused in the current vma >> * >> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) >> goto oom; >> >> if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { >> - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + new_folio = alloc_anon_folio(vma, vmf->address, 0); >> if (!new_folio) >> goto oom; >> } else { >> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> struct folio *folio; >> vm_fault_t ret = 0; >> pte_t entry; >> + int order; >> + int pgcount; >> + unsigned long addr; >> >> /* File mapping without ->vm_ops ? */ >> if (vma->vm_flags & VM_SHARED) >> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> - goto setpte; >> + if (uffd_wp) >> + entry = pte_mkuffd_wp(entry); >> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + >> + /* No need to invalidate - it was non-present before */ >> + update_mmu_cache(vma, vmf->address, vmf->pte); >> + goto unlock; >> + } >> + >> + /* >> + * If allocating a large folio, determine the biggest suitable order for >> + * the VMA (e.g. it must not exceed the VMA's bounds, it must not >> + * overlap with any populated PTEs, etc). We are not under the ptl here >> + * so we will need to re-check that we are not overlapping any populated >> + * PTEs once we have the lock. >> + */ >> + order = uffd_wp ? 0 : max_anon_folio_order(vma); >> + if (order > 0) { >> + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); >> + order = calc_anon_folio_order_alloc(vmf, order); >> + pte_unmap(vmf->pte); >> } >> >> - /* Allocate our own private page. */ >> + /* Allocate our own private folio. */ >> if (unlikely(anon_vma_prepare(vma))) >> goto oom; >> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + if (!folio && order > 0) { >> + order = 0; >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + } >> if (!folio) >> goto oom; >> >> + pgcount = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); >> + >> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) >> goto oom_free_page; >> folio_throttle_swaprate(folio, GFP_KERNEL); >> >> /* >> * The memory barrier inside __folio_mark_uptodate makes sure that >> - * preceding stores to the page contents become visible before >> - * the set_pte_at() write. >> + * preceding stores to the folio contents become visible before >> + * the set_ptes() write. >> */ >> __folio_mark_uptodate(folio); >> >> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> if (vma->vm_flags & VM_WRITE) >> entry = pte_mkwrite(pte_mkdirty(entry)); >> >> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, >> - &vmf->ptl); >> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); >> if (vmf_pte_changed(vmf)) { >> update_mmu_tlb(vma, vmf->address, vmf->pte); >> goto release; >> + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { > This could be the case that we allocated order 4 page and find a neighbor PTE is > filled by concurrent fault. Should we put current folio and fallback to order 0 > and try again immedately (goto order 0 allocation instead of return from this > function which will go through some page fault path again)? 
That's how it worked in v1, but I had review comments from Yang Shi asking me to re-fault instead. This approach is certainly cleaner from a code point of view. And I expect races of that nature will be rare. > > > Regards > Yin, Fengwei > >> + goto release; >> } >> >> ret = check_stable_address_space(vma->vm_mm); >> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> >> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >> - folio_add_new_anon_rmap(folio, vma, vmf->address); >> + folio_ref_add(folio, pgcount - 1); >> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); >> + folio_add_new_anon_rmap(folio, vma, addr); >> folio_add_lru_vma(folio, vma); >> -setpte: >> + >> if (uffd_wp) >> entry = pte_mkuffd_wp(entry); >> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); >> >> /* No need to invalidate - it was non-present before */ >> - update_mmu_cache(vma, vmf->address, vmf->pte); >> + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); >> unlock: >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return ret;
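To make the live-lock argument concrete, the flow being debated boils down to the sketch below (using the helpers from this patch; an illustration of the ordering, not the exact code). If step 1 were skipped and the maximum order always allocated, step 3 would fail for the same reason on every retry, which is the live lock described above.

	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;
	unsigned long addr;
	int order;

	/* 1. Racy estimate, no ptl held: pick an order that currently fits. */
	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
	order = calc_anon_folio_order_alloc(vmf, max_anon_folio_order(vma));
	pte_unmap(vmf->pte);

	/* 2. Allocate the folio for that order (may sleep). */
	folio = alloc_anon_folio(vma, vmf->address, order);

	/* 3. Re-check under the ptl; if we raced, release and let the fault retry. */
	addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
	if (check_ptes_none(vmf->pte, 1 << order) != (1 << order)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		folio_put(folio);
		return 0;	/* hardware retries the access and we re-fault */
	}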
On 7/4/23 22:20, Ryan Roberts wrote: > On 04/07/2023 04:45, Yin, Fengwei wrote: >> >> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >> THP is for huge page which is 2M size. We are not huge page here. But >> I don't have good name either. > > Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K > base page, they are 512M. So huge pages already have a variable size. And they > sometimes get PTE-mapped. So can't we just think of this as an extension of the > THP feature? My understanding is the THP has several fixed size on different arch. The 32K or 16K which could be picked here are not THP size. > >> >>> allocated in large folios of a specified order. All pages of the large >>> folio are pte-mapped during the same page fault, significantly reducing >>> the number of page faults. The number of per-page operations (e.g. ref >>> counting, rmap management lru list management) are also significantly >>> reduced since those ops now become per-folio. >>> >>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >>> defaults to disabled for now; there is a long list of todos to make >>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some >>> madvise ops, etc). These items will be tackled in subsequent patches. >>> >>> When enabled, the preferred folio order is as returned by >>> arch_wants_pte_order(), which may be overridden by the arch as it sees >>> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a >>> contiguous set of ptes map physically contigious, naturally aligned >>> memory, so this mechanism allows the architecture to optimize as >>> required. >>> >>> If the preferred order can't be used (e.g. because the folio would >>> breach the bounds of the vma, or because ptes in the region are already >>> mapped) then we fall back to a suitable lower order. >>> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>> --- >>> mm/Kconfig | 10 ++++ >>> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- >>> 2 files changed, 165 insertions(+), 13 deletions(-) >>> >>> diff --git a/mm/Kconfig b/mm/Kconfig >>> index 7672a22647b4..1c06b2c0a24e 100644 >>> --- a/mm/Kconfig >>> +++ b/mm/Kconfig >>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS >>> support of file THPs will be developed in the next few release >>> cycles. >>> >>> +config FLEXIBLE_THP >>> + bool "Flexible order THP" >>> + depends on TRANSPARENT_HUGEPAGE >>> + default n >>> + help >>> + Use large (bigger than order-0) folios to back anonymous memory where >>> + possible, even if the order of the folio is smaller than the PMD >>> + order. This reduces the number of page faults, as well as other >>> + per-page overheads to improve performance for many workloads. >>> + >>> endif # TRANSPARENT_HUGEPAGE >>> >>> # >>> diff --git a/mm/memory.c b/mm/memory.c >>> index fb30f7523550..abe2ea94f3f5 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) >>> return 0; >>> } >>> >>> +#ifdef CONFIG_FLEXIBLE_THP >>> +/* >>> + * Allocates, zeros and returns a folio of the requested order for use as >>> + * anonymous memory. 
>>> + */ >>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, >>> + unsigned long addr, int order) >>> +{ >>> + gfp_t gfp; >>> + struct folio *folio; >>> + >>> + if (order == 0) >>> + return vma_alloc_zeroed_movable_folio(vma, addr); >>> + >>> + gfp = vma_thp_gfp_mask(vma); >>> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >>> + if (folio) >>> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); >>> + >>> + return folio; >>> +} >>> + >>> +/* >>> + * Preferred folio order to allocate for anonymous memory. >>> + */ >>> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) >>> +#else >>> +#define alloc_anon_folio(vma, addr, order) \ >>> + vma_alloc_zeroed_movable_folio(vma, addr) >>> +#define max_anon_folio_order(vma) 0 >>> +#endif >>> + >>> +/* >>> + * Returns index of first pte that is not none, or nr if all are none. >>> + */ >>> +static inline int check_ptes_none(pte_t *pte, int nr) >>> +{ >>> + int i; >>> + >>> + for (i = 0; i < nr; i++) { >>> + if (!pte_none(ptep_get(pte++))) >>> + return i; >>> + } >>> + >>> + return nr; >>> +} >>> + >>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) >>> +{ >>> + /* >>> + * The aim here is to determine what size of folio we should allocate >>> + * for this fault. Factors include: >>> + * - Order must not be higher than `order` upon entry >>> + * - Folio must be naturally aligned within VA space >>> + * - Folio must be fully contained inside one pmd entry >>> + * - Folio must not breach boundaries of vma >>> + * - Folio must not overlap any non-none ptes >>> + * >>> + * Additionally, we do not allow order-1 since this breaks assumptions >>> + * elsewhere in the mm; THP pages must be at least order-2 (since they >>> + * store state up to the 3rd struct page subpage), and these pages must >>> + * be THP in order to correctly use pre-existing THP infrastructure such >>> + * as folio_split(). >>> + * >>> + * Note that the caller may or may not choose to lock the pte. If >>> + * unlocked, the result is racy and the user must re-check any overlap >>> + * with non-none ptes under the lock. >>> + */ >>> + >>> + struct vm_area_struct *vma = vmf->vma; >>> + int nr; >>> + unsigned long addr; >>> + pte_t *pte; >>> + pte_t *first_set = NULL; >>> + int ret; >>> + >>> + order = min(order, PMD_SHIFT - PAGE_SHIFT); >>> + >>> + for (; order > 1; order--) { >>> + nr = 1 << order; >>> + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); >>> + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); >>> + >>> + /* Check vma bounds. */ >>> + if (addr < vma->vm_start || >>> + addr + (nr << PAGE_SHIFT) > vma->vm_end) >>> + continue; >>> + >>> + /* Ptes covered by order already known to be none. */ >>> + if (pte + nr <= first_set) >>> + break; >>> + >>> + /* Already found set pte in range covered by order. */ >>> + if (pte <= first_set) >>> + continue; >>> + >>> + /* Need to check if all the ptes are none. */ >>> + ret = check_ptes_none(pte, nr); >>> + if (ret == nr) >>> + break; >>> + >>> + first_set = pte + ret; >>> + } >>> + >>> + if (order == 1) >>> + order = 0; >>> + >>> + return order; >>> +} >> The logic in above function should be kept is whether the order fit in vma range. >> >> check_ptes_none() is not accurate here because no page table lock hold and concurrent >> fault could happen. So may just drop the check here? Check_ptes_none() is done after >> take the page table lock. > > I agree it is just an estimate given the lock is not held; the comment at the > top says the same. 
But I don't think we can wait until after the lock is taken > to measure this. We can't hold the lock while allocating the folio and we need a > guess at what to allocate. If we don't guess here, we will allocate the biggest, > then take the lock, see that it doesn't fit, and exit. Then the system will > re-fault and we will follow the exact same path - ending up in live lock. It will not if we try order0 immediately. But see my comments to the refault. > >> >> We pick the arch prefered order or order 0 now. >> >>> + >>> /* >>> * Handle write page faults for pages that can be reused in the current vma >>> * >>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) >>> goto oom; >>> >>> if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { >>> - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >>> + new_folio = alloc_anon_folio(vma, vmf->address, 0); >>> if (!new_folio) >>> goto oom; >>> } else { >>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> struct folio *folio; >>> vm_fault_t ret = 0; >>> pte_t entry; >>> + int order; >>> + int pgcount; >>> + unsigned long addr; >>> >>> /* File mapping without ->vm_ops ? */ >>> if (vma->vm_flags & VM_SHARED) >>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>> return handle_userfault(vmf, VM_UFFD_MISSING); >>> } >>> - goto setpte; >>> + if (uffd_wp) >>> + entry = pte_mkuffd_wp(entry); >>> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >>> + >>> + /* No need to invalidate - it was non-present before */ >>> + update_mmu_cache(vma, vmf->address, vmf->pte); >>> + goto unlock; >>> + } >>> + >>> + /* >>> + * If allocating a large folio, determine the biggest suitable order for >>> + * the VMA (e.g. it must not exceed the VMA's bounds, it must not >>> + * overlap with any populated PTEs, etc). We are not under the ptl here >>> + * so we will need to re-check that we are not overlapping any populated >>> + * PTEs once we have the lock. >>> + */ >>> + order = uffd_wp ? 0 : max_anon_folio_order(vma); >>> + if (order > 0) { >>> + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); >>> + order = calc_anon_folio_order_alloc(vmf, order); >>> + pte_unmap(vmf->pte); >>> } >>> >>> - /* Allocate our own private page. */ >>> + /* Allocate our own private folio. */ >>> if (unlikely(anon_vma_prepare(vma))) >>> goto oom; >>> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >>> + folio = alloc_anon_folio(vma, vmf->address, order); >>> + if (!folio && order > 0) { >>> + order = 0; >>> + folio = alloc_anon_folio(vma, vmf->address, order); >>> + } >>> if (!folio) >>> goto oom; >>> >>> + pgcount = 1 << order; >>> + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); >>> + >>> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) >>> goto oom_free_page; >>> folio_throttle_swaprate(folio, GFP_KERNEL); >>> >>> /* >>> * The memory barrier inside __folio_mark_uptodate makes sure that >>> - * preceding stores to the page contents become visible before >>> - * the set_pte_at() write. >>> + * preceding stores to the folio contents become visible before >>> + * the set_ptes() write. 
>>> */ >>> __folio_mark_uptodate(folio); >>> >>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> if (vma->vm_flags & VM_WRITE) >>> entry = pte_mkwrite(pte_mkdirty(entry)); >>> >>> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, >>> - &vmf->ptl); >>> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); >>> if (vmf_pte_changed(vmf)) { >>> update_mmu_tlb(vma, vmf->address, vmf->pte); >>> goto release; >>> + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { >> This could be the case that we allocated order 4 page and find a neighbor PTE is >> filled by concurrent fault. Should we put current folio and fallback to order 0 >> and try again immedately (goto order 0 allocation instead of return from this >> function which will go through some page fault path again)? > > That's how it worked in v1, but I had review comments from Yang Shi asking me to > re-fault instead. This approach is certainly cleaner from a code point of view. > And I expect races of that nature will be rare. I must miss that discussion in v1. My bad. I should jump in that discussion. So I will drop my comment here even I still think we should avoid refault. I don't want the comment back and forth. Regards Yin, Fengwei > >> >> >> Regards >> Yin, Fengwei >> >>> + goto release; >>> } >>> >>> ret = check_stable_address_space(vma->vm_mm); >>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> return handle_userfault(vmf, VM_UFFD_MISSING); >>> } >>> >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >>> - folio_add_new_anon_rmap(folio, vma, vmf->address); >>> + folio_ref_add(folio, pgcount - 1); >>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); >>> + folio_add_new_anon_rmap(folio, vma, addr); >>> folio_add_lru_vma(folio, vma); >>> -setpte: >>> + >>> if (uffd_wp) >>> entry = pte_mkuffd_wp(entry); >>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >>> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); >>> >>> /* No need to invalidate - it was non-present before */ >>> - update_mmu_cache(vma, vmf->address, vmf->pte); >>> + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); >>> unlock: >>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>> return ret; >
On Tue, Jul 4, 2023 at 8:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/07/2023 02:35, Yu Zhao wrote: > > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > >> allocated in large folios of a specified order. All pages of the large > >> folio are pte-mapped during the same page fault, significantly reducing > >> the number of page faults. The number of per-page operations (e.g. ref > >> counting, rmap management lru list management) are also significantly > >> reduced since those ops now become per-folio. > >> > >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which > >> defaults to disabled for now; there is a long list of todos to make > >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some > >> madvise ops, etc). These items will be tackled in subsequent patches. > >> > >> When enabled, the preferred folio order is as returned by > >> arch_wants_pte_order(), which may be overridden by the arch as it sees > >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a > > > > coalesce > > ACK > > > > >> contiguous set of ptes map physically contigious, naturally aligned > > > > contiguous > > ACK > > > > >> memory, so this mechanism allows the architecture to optimize as > >> required. > >> > >> If the preferred order can't be used (e.g. because the folio would > >> breach the bounds of the vma, or because ptes in the region are already > >> mapped) then we fall back to a suitable lower order. > >> > >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >> --- > >> mm/Kconfig | 10 ++++ > >> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- > >> 2 files changed, 165 insertions(+), 13 deletions(-) > >> > >> diff --git a/mm/Kconfig b/mm/Kconfig > >> index 7672a22647b4..1c06b2c0a24e 100644 > >> --- a/mm/Kconfig > >> +++ b/mm/Kconfig > >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS > >> support of file THPs will be developed in the next few release > >> cycles. > >> > >> +config FLEXIBLE_THP > >> + bool "Flexible order THP" > >> + depends on TRANSPARENT_HUGEPAGE > >> + default n > > > > The default value is already N. > > Is there a coding standard for this? Personally I prefer to make it explicit. > > > > >> + help > >> + Use large (bigger than order-0) folios to back anonymous memory where > >> + possible, even if the order of the folio is smaller than the PMD > >> + order. This reduces the number of page faults, as well as other > >> + per-page overheads to improve performance for many workloads. > >> + > >> endif # TRANSPARENT_HUGEPAGE > >> > >> # > >> diff --git a/mm/memory.c b/mm/memory.c > >> index fb30f7523550..abe2ea94f3f5 100644 > >> --- a/mm/memory.c > >> +++ b/mm/memory.c > >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) > >> return 0; > >> } > >> > >> +#ifdef CONFIG_FLEXIBLE_THP > >> +/* > >> + * Allocates, zeros and returns a folio of the requested order for use as > >> + * anonymous memory. 
> >> + */ > >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, > >> + unsigned long addr, int order) > >> +{ > >> + gfp_t gfp; > >> + struct folio *folio; > >> + > >> + if (order == 0) > >> + return vma_alloc_zeroed_movable_folio(vma, addr); > >> + > >> + gfp = vma_thp_gfp_mask(vma); > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); > >> + if (folio) > >> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); > >> + > >> + return folio; > >> +} > >> + > >> +/* > >> + * Preferred folio order to allocate for anonymous memory. > >> + */ > >> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) > >> +#else > >> +#define alloc_anon_folio(vma, addr, order) \ > >> + vma_alloc_zeroed_movable_folio(vma, addr) > >> +#define max_anon_folio_order(vma) 0 > >> +#endif > >> + > >> +/* > >> + * Returns index of first pte that is not none, or nr if all are none. > >> + */ > >> +static inline int check_ptes_none(pte_t *pte, int nr) > >> +{ > >> + int i; > >> + > >> + for (i = 0; i < nr; i++) { > >> + if (!pte_none(ptep_get(pte++))) > >> + return i; > >> + } > >> + > >> + return nr; > >> +} > >> + > >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) > >> +{ > >> + /* > >> + * The aim here is to determine what size of folio we should allocate > >> + * for this fault. Factors include: > >> + * - Order must not be higher than `order` upon entry > >> + * - Folio must be naturally aligned within VA space > >> + * - Folio must be fully contained inside one pmd entry > >> + * - Folio must not breach boundaries of vma > >> + * - Folio must not overlap any non-none ptes > >> + * > >> + * Additionally, we do not allow order-1 since this breaks assumptions > >> + * elsewhere in the mm; THP pages must be at least order-2 (since they > >> + * store state up to the 3rd struct page subpage), and these pages must > >> + * be THP in order to correctly use pre-existing THP infrastructure such > >> + * as folio_split(). > >> + * > >> + * Note that the caller may or may not choose to lock the pte. If > >> + * unlocked, the result is racy and the user must re-check any overlap > >> + * with non-none ptes under the lock. > >> + */ > >> + > >> + struct vm_area_struct *vma = vmf->vma; > >> + int nr; > >> + unsigned long addr; > >> + pte_t *pte; > >> + pte_t *first_set = NULL; > >> + int ret; > >> + > >> + order = min(order, PMD_SHIFT - PAGE_SHIFT); > >> + > >> + for (; order > 1; order--) { > > > > I'm not sure how we can justify this policy. As an initial step, it'd > > be a lot easier to sell if we only considered the order of > > arch_wants_pte_order() and the order 0. > > My justification is in the cover letter; I see performance regression (vs the > unpatched kernel) when using the policy you suggest. This policy performs much > better in my tests. (I'll reply directly to your follow up questions in the > cover letter shortly). > > What are your technical concerns about this approach? It is pretty light weight > (I only touch each PTE once, regardless of the number of loops). If we have > strong technical reasons for reverting to the less performant approach then fair > enough, but I'd like to hear the rational first. Yes, mainly from three different angles: 1. The engineering principle: we'd want to separate the mechanical part and the policy part when attacking something large. This way it'd be easier to root cause any regressions if they happen. 
In our case, assuming the regression is real, it might actually prove my point here: I really don't think the two checks (if a vma range fits and if it does, which is unlikely according to your description, if all 64 PTEs are none) caused the regression. My theory is that 64KB itself caused the regression, but smaller sizes made an improvement. If this is really the case, I'd say the fallback policy masked the real problem, which is that 64KB is too large to begin with. 2. The benchmark methodology: I appreciate your effort in doing it, but we also need to consider that the setup is an uncommon scenario. The common scenarios are devices that have been running for weeks without reboots, generally having higher external fragmentation. In addition, for client devices, they are often under memory pressure, which makes fragmentation worse. So we should take the result with a grain of salt, and for that matter, results from after refresh reboots. 3. The technical concern: an ideal policy would consider all three major factors: the h/w features, userspace behaviors and the page allocator behavior. So far we only have the first one handy. The second one is too challenging, so let's forget about it for now. The third one is why I really don't like this best-fit policy. By falling back to smaller orders, we can waste a limited number of physically contiguous pages on wrong vmas (small vmas only), leading to failures to serve large vmas which otherwise would have a higher overall ROI. This can only be addressed within the page allocator: we need to enlighten it to return the highest order available, i.e., not breaking up any higher orders. I'm not really saying we should never try this fallback policy. I'm just thinking we can leave it for later, probably after we've addressed all the concerns with basic functionality.
On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote: > On 04/07/2023 04:45, Yin, Fengwei wrote: > > > > On 7/3/2023 9:53 PM, Ryan Roberts wrote: > >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > > THP is for huge page which is 2M size. We are not huge page here. But > > I don't have good name either. > > Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K > base page, they are 512M. So huge pages already have a variable size. And they > sometimes get PTE-mapped. So can't we just think of this as an extension of the > THP feature? The confusing thing is that we have counters for the number of THP allocated (and number of THP mapped), and for those we always use PMD-size folios. If we must have a config option, then this is ANON_LARGE_FOLIOS. But why do we need a config option? We don't have one for the page cache, and we're better off for it. Yes, it depends on CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental heritage, and it'd be great to do away with that dependency eventually. Hardware support isn't needed. Large folios benefit us from a software point of view. if we need a chicken bit, we can edit the source code to not create anon folios larger than order 0.
On 05/07/2023 00:57, Matthew Wilcox wrote: > On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote: >> On 04/07/2023 04:45, Yin, Fengwei wrote: >>> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >>> THP is for huge page which is 2M size. We are not huge page here. But >>> I don't have good name either. >> >> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K >> base page, they are 512M. So huge pages already have a variable size. And they >> sometimes get PTE-mapped. So can't we just think of this as an extension of the >> THP feature? > > The confusing thing is that we have counters for the number of THP > allocated (and number of THP mapped), and for those we always use > PMD-size folios. OK fair point. I really don't have a strong opinion on the name - I changed it from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the concensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep this name for now and focus on converging the actual implementation to something that is agreeable. Once we are there, we can argue about the name. > > If we must have a config option, then this is ANON_LARGE_FOLIOS. > > But why do we need a config option? We don't have one for the > page cache, and we're better off for it. Yes, it depends on > CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental > heritage, and it'd be great to do away with that dependency eventually. > > Hardware support isn't needed. Large folios benefit us from a software > point of view. if we need a chicken bit, we can edit the source code > to not create anon folios larger than order 0. From my PoV it's about managing risk; there are currently parts of the mm that will interact poorly with large pte-mapped folios (madvise, compaction, ...). We want to incrementally fix that stuff, but until it's all fixed, we can't deploy this as always-on. Further down the line when things are more complete and there is more test coverage, we could remove the Kconfig or default it to enabled.
On Wed, Jul 05, 2023 at 10:54:30AM +0100, Ryan Roberts wrote: > On 05/07/2023 00:57, Matthew Wilcox wrote: > > The confusing thing is that we have counters for the number of THP > > allocated (and number of THP mapped), and for those we always use > > PMD-size folios. > > OK fair point. I really don't have a strong opinion on the name - I changed it > from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm > happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the > concensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep > this name for now and focus on converging the actual implementation to something > that is agreeable. Once we are there, we can argue about the name. I didn't see Yu arguing for changing the name of the config options, just having far fewer of them. > > If we must have a config option, then this is ANON_LARGE_FOLIOS. > > > > But why do we need a config option? We don't have one for the > > page cache, and we're better off for it. Yes, it depends on > > CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental > > heritage, and it'd be great to do away with that dependency eventually. > > > > Hardware support isn't needed. Large folios benefit us from a software > > point of view. if we need a chicken bit, we can edit the source code > > to not create anon folios larger than order 0. > > >From my PoV it's about managing risk; there are currently parts of the mm that > will interact poorly with large pte-mapped folios (madvise, compaction, ...). We > want to incrementally fix that stuff, but until it's all fixed, we can't deploy > this as always-on. Further down the line when things are more complete and there > is more test coverage, we could remove the Kconfig or default it to enabled. We have to fix those places with the bad interactions, not merge a Kconfig option that lets you turn it on to experiment. That's how you get a bad reputation and advice to disable a config option. We had that for years with CONFIG_TRANSPARENT_HUGEPAGE; people tried it out early on, found the performance problems, and all these years later we still have articles being published that say to turn it off. By all means, we can have a golden patchset that we all agree is the one to use for finding problems, and we can merge the pre-enabling work "We don't have large anonymous folios yet, but when we do, this will need to iterate over each page in the folio".
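As an illustration of the kind of pre-enabling change being described (example only, not code from this series): a helper that previously assumed one page per anonymous folio grows a per-page loop, which is a no-op while folio_nr_pages() is 1 but becomes necessary once anon folios can be order > 0.

/* Illustrative helper name; the loop pattern is the point. */
static void example_clear_folio(struct folio *folio, unsigned long addr)
{
	long i, nr = folio_nr_pages(folio);	/* 1 today for anon folios */

	for (i = 0; i < nr; i++)
		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
}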
Ryan Roberts <ryan.roberts@arm.com> writes: > Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > allocated in large folios of a specified order. All pages of the large > folio are pte-mapped during the same page fault, significantly reducing > the number of page faults. The number of per-page operations (e.g. ref > counting, rmap management lru list management) are also significantly > reduced since those ops now become per-folio. I likes the idea to share as much code as possible between large (anonymous) folio and THP. Finally, THP becomes just a special kind of large folio. Although we can use smaller page order for FLEXIBLE_THP, it's hard to avoid internal fragmentation completely. So, I think that finally we will need to provide a mechanism for the users to opt out, e.g., something like "always madvise never" via /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's a good idea to reuse the existing interface of THP. Best Regards, Huang, Ying
On 07/07/2023 09:01, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >> allocated in large folios of a specified order. All pages of the large >> folio are pte-mapped during the same page fault, significantly reducing >> the number of page faults. The number of per-page operations (e.g. ref >> counting, rmap management lru list management) are also significantly >> reduced since those ops now become per-folio. > > I likes the idea to share as much code as possible between large > (anonymous) folio and THP. Finally, THP becomes just a special kind of > large folio. > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to > avoid internal fragmentation completely. So, I think that finally we > will need to provide a mechanism for the users to opt out, e.g., > something like "always madvise never" via > /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's > a good idea to reuse the existing interface of THP. I wouldn't want to tie this to the existing interface, simply because that implies that we would want to follow the "always" and "madvise" advice too; That means that on a thp=madvise system (which is certainly the case for android and other client systems) we would have to disable large anon folios for VMAs that haven't explicitly opted in. That breaks the intention that this should be an invisible performance boost. I think it's important to set the policy for use of THP separately to use of large anon folios. I could be persuaded on the merrits of a new runtime enable/disable interface if there is concensus. > > Best Regards, > Huang, Ying
On 07.07.23 11:52, Ryan Roberts wrote: > On 07/07/2023 09:01, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >>> allocated in large folios of a specified order. All pages of the large >>> folio are pte-mapped during the same page fault, significantly reducing >>> the number of page faults. The number of per-page operations (e.g. ref >>> counting, rmap management lru list management) are also significantly >>> reduced since those ops now become per-folio. >> >> I likes the idea to share as much code as possible between large >> (anonymous) folio and THP. Finally, THP becomes just a special kind of >> large folio. >> >> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >> avoid internal fragmentation completely. So, I think that finally we >> will need to provide a mechanism for the users to opt out, e.g., >> something like "always madvise never" via >> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >> a good idea to reuse the existing interface of THP. > > I wouldn't want to tie this to the existing interface, simply because that > implies that we would want to follow the "always" and "madvise" advice too; That > means that on a thp=madvise system (which is certainly the case for android and > other client systems) we would have to disable large anon folios for VMAs that > haven't explicitly opted in. That breaks the intention that this should be an > invisible performance boost. I think it's important to set the policy for use of It will never ever be a completely invisible performance boost, just like ordinary THP. Using the exact same existing toggle is the right thing to do. If someone specify "never" or "madvise", then do exactly that. It might make sense to have more modes or additional toggles, but "madvise=never" means no memory waste. I remember I raised it already in the past, but you *absolutely* have to respect the MADV_NOHUGEPAGE flag. There is user space out there (for example, userfaultfd) that doesn't want the kernel to populate any additional page tables. So if you have to respect that already, then also respect MADV_HUGEPAGE, simple. > THP separately to use of large anon folios. > > I could be persuaded on the merrits of a new runtime enable/disable interface if > there is concensus. There would have to be very good reason for a completely separate control. Bypassing MADV_NOHUGEPAGE or "madvise=never" simply because we add a "flexible" before the THP sounds broken.
On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: > On 07.07.23 11:52, Ryan Roberts wrote: > > On 07/07/2023 09:01, Huang, Ying wrote: > > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to > > > avoid internal fragmentation completely. So, I think that finally we > > > will need to provide a mechanism for the users to opt out, e.g., > > > something like "always madvise never" via > > > /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's > > > a good idea to reuse the existing interface of THP. > > > > I wouldn't want to tie this to the existing interface, simply because that > > implies that we would want to follow the "always" and "madvise" advice too; That > > means that on a thp=madvise system (which is certainly the case for android and > > other client systems) we would have to disable large anon folios for VMAs that > > haven't explicitly opted in. That breaks the intention that this should be an > > invisible performance boost. I think it's important to set the policy for use of > > It will never ever be a completely invisible performance boost, just like > ordinary THP. > > Using the exact same existing toggle is the right thing to do. If someone > specify "never" or "madvise", then do exactly that. > > It might make sense to have more modes or additional toggles, but > "madvise=never" means no memory waste. I hate the existing mechanisms. They are an abdication of our responsibility, and an attempt to blame the user (be it the sysadmin or the programmer) of our code for using it wrongly. We should not replicate this mistake. Our code should be auto-tuning. I posted a long, detailed outline here: https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ > I remember I raised it already in the past, but you *absolutely* have to > respect the MADV_NOHUGEPAGE flag. There is user space out there (for > example, userfaultfd) that doesn't want the kernel to populate any > additional page tables. So if you have to respect that already, then also > respect MADV_HUGEPAGE, simple. Possibly having uffd enabled on a VMA should disable using large folios, I can get behind that. But the notion that userspace knows what it's doing ... hahaha. Just ignore the madvise flags. Userspace doesn't know what it's doing.
On 07.07.23 15:57, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >> On 07.07.23 11:52, Ryan Roberts wrote: >>> On 07/07/2023 09:01, Huang, Ying wrote: >>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>> avoid internal fragmentation completely. So, I think that finally we >>>> will need to provide a mechanism for the users to opt out, e.g., >>>> something like "always madvise never" via >>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>> a good idea to reuse the existing interface of THP. >>> >>> I wouldn't want to tie this to the existing interface, simply because that >>> implies that we would want to follow the "always" and "madvise" advice too; That >>> means that on a thp=madvise system (which is certainly the case for android and >>> other client systems) we would have to disable large anon folios for VMAs that >>> haven't explicitly opted in. That breaks the intention that this should be an >>> invisible performance boost. I think it's important to set the policy for use of >> >> It will never ever be a completely invisible performance boost, just like >> ordinary THP. >> >> Using the exact same existing toggle is the right thing to do. If someone >> specify "never" or "madvise", then do exactly that. >> >> It might make sense to have more modes or additional toggles, but >> "madvise=never" means no memory waste. > > I hate the existing mechanisms. They are an abdication of our > responsibility, and an attempt to blame the user (be it the sysadmin > or the programmer) of our code for using it wrongly. We should not > replicate this mistake. I don't agree regarding the programmer responsibility. In some cases the programmer really doesn't want to get more memory populated than requested -- and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. Regarding the madvise=never/madvise/always (sys admin decision), memory waste (and nailing down bugs or working around them in customer setups) have been very good reasons to let the admin have a word. > > Our code should be auto-tuning. I posted a long, detailed outline here: > https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ > Well, "auto-tuning" also should be perfect for everybody, but once reality strikes you know it isn't. If people don't feel like using THP, let them have a word. The "madvise" config option is probably more controversial. But the "always vs. never" absolutely makes sense to me. >> I remember I raised it already in the past, but you *absolutely* have to >> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >> example, userfaultfd) that doesn't want the kernel to populate any >> additional page tables. So if you have to respect that already, then also >> respect MADV_HUGEPAGE, simple. > > Possibly having uffd enabled on a VMA should disable using large folios, There are cases where we enable uffd *after* already touching memory (postcopy live migration in QEMU being the famous example). That doesn't fly. > I can get behind that. But the notion that userspace knows what it's > doing ... hahaha. Just ignore the madvise flags. Userspace doesn't > know what it's doing. If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in some cases. And these include cases I care about messing with sparse VM memory :) I have strong opinions against populating more than required when user space set MADV_NOHUGEPAGE.
On 07/07/2023 15:07, David Hildenbrand wrote: > On 07.07.23 15:57, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>> On 07.07.23 11:52, Ryan Roberts wrote: >>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>> avoid internal fragmentation completely. So, I think that finally we >>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>> something like "always madvise never" via >>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>> a good idea to reuse the existing interface of THP. >>>> >>>> I wouldn't want to tie this to the existing interface, simply because that >>>> implies that we would want to follow the "always" and "madvise" advice too; >>>> That >>>> means that on a thp=madvise system (which is certainly the case for android and >>>> other client systems) we would have to disable large anon folios for VMAs that >>>> haven't explicitly opted in. That breaks the intention that this should be an >>>> invisible performance boost. I think it's important to set the policy for >>>> use of >>> >>> It will never ever be a completely invisible performance boost, just like >>> ordinary THP. >>> >>> Using the exact same existing toggle is the right thing to do. If someone >>> specify "never" or "madvise", then do exactly that. >>> >>> It might make sense to have more modes or additional toggles, but >>> "madvise=never" means no memory waste. >> >> I hate the existing mechanisms. They are an abdication of our >> responsibility, and an attempt to blame the user (be it the sysadmin >> or the programmer) of our code for using it wrongly. We should not >> replicate this mistake. > > I don't agree regarding the programmer responsibility. In some cases the > programmer really doesn't want to get more memory populated than requested -- > and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. > > Regarding the madvise=never/madvise/always (sys admin decision), memory waste > (and nailing down bugs or working around them in customer setups) have been very > good reasons to let the admin have a word. > >> >> Our code should be auto-tuning. I posted a long, detailed outline here: >> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >> > > Well, "auto-tuning" also should be perfect for everybody, but once reality > strikes you know it isn't. > > If people don't feel like using THP, let them have a word. The "madvise" config > option is probably more controversial. But the "always vs. never" absolutely > makes sense to me. > >>> I remember I raised it already in the past, but you *absolutely* have to >>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>> example, userfaultfd) that doesn't want the kernel to populate any >>> additional page tables. So if you have to respect that already, then also >>> respect MADV_HUGEPAGE, simple. >> >> Possibly having uffd enabled on a VMA should disable using large folios, > > There are cases where we enable uffd *after* already touching memory (postcopy > live migration in QEMU being the famous example). That doesn't fly. > >> I can get behind that. But the notion that userspace knows what it's >> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >> know what it's doing. > > If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in > some cases. 
And these include cases I care about messing with sparse VM memory :) > > I have strong opinions against populating more than required when user space set > MADV_NOHUGEPAGE. I can see your point about honouring MADV_NOHUGEPAGE, so I think it is reasonable to fall back to allocating an order-0 page in a VMA that has it set. The app has gone out of its way to explicitly set it, after all. I think the correct behaviour for the global thp controls (cmdline and sysfs) is less obvious though. I could get on board with disabling large anon folios globally when thp="never". But for other situations, I would prefer to keep large anon folios enabled (treating "madvise" as "always"), with the argument that their order is much smaller than traditional THP and therefore the internal fragmentation is significantly reduced. I really don't want to end up with user space ever having to opt in (with MADV_HUGEPAGE) to see the benefits of large anon folios. I still feel that it would be better for the thp and large anon folio controls to be independent though - what's the argument for tying them together? >
On 07.07.23 17:13, Ryan Roberts wrote: > On 07/07/2023 15:07, David Hildenbrand wrote: >> On 07.07.23 15:57, Matthew Wilcox wrote: >>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>> something like "always madvise never" via >>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>> a good idea to reuse the existing interface of THP. >>>>> >>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>> That >>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>> invisible performance boost. I think it's important to set the policy for >>>>> use of >>>> >>>> It will never ever be a completely invisible performance boost, just like >>>> ordinary THP. >>>> >>>> Using the exact same existing toggle is the right thing to do. If someone >>>> specify "never" or "madvise", then do exactly that. >>>> >>>> It might make sense to have more modes or additional toggles, but >>>> "madvise=never" means no memory waste. >>> >>> I hate the existing mechanisms. They are an abdication of our >>> responsibility, and an attempt to blame the user (be it the sysadmin >>> or the programmer) of our code for using it wrongly. We should not >>> replicate this mistake. >> >> I don't agree regarding the programmer responsibility. In some cases the >> programmer really doesn't want to get more memory populated than requested -- >> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >> >> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >> (and nailing down bugs or working around them in customer setups) have been very >> good reasons to let the admin have a word. >> >>> >>> Our code should be auto-tuning. I posted a long, detailed outline here: >>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>> >> >> Well, "auto-tuning" also should be perfect for everybody, but once reality >> strikes you know it isn't. >> >> If people don't feel like using THP, let them have a word. The "madvise" config >> option is probably more controversial. But the "always vs. never" absolutely >> makes sense to me. >> >>>> I remember I raised it already in the past, but you *absolutely* have to >>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>> example, userfaultfd) that doesn't want the kernel to populate any >>>> additional page tables. So if you have to respect that already, then also >>>> respect MADV_HUGEPAGE, simple. >>> >>> Possibly having uffd enabled on a VMA should disable using large folios, >> >> There are cases where we enable uffd *after* already touching memory (postcopy >> live migration in QEMU being the famous example). That doesn't fly. >> >>> I can get behind that. But the notion that userspace knows what it's >>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>> know what it's doing. 
>> >> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >> some cases. And these include cases I care about messing with sparse VM memory :) >> >> I have strong opinions against populating more than required when user space set >> MADV_NOHUGEPAGE. > > I can see your point about honouring MADV_NOHUGEPAGE, so think that it is > reasonable to fallback to allocating an order-0 page in a VMA that has it set. > The app has gone out of its way to explicitly set it, after all. > > I think the correct behaviour for the global thp controls (cmdline and sysfs) > are less obvious though. I could get on board with disabling large anon folios > globally when thp="never". But for other situations, I would prefer to keep > large anon folios enabled (treat "madvise" as "always"), with the argument that > their order is much smaller than traditional THP and therefore the internal > fragmentation is significantly reduced. I really don't want to end up with user > space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large > anon folios. I was briefly playing with a nasty idea of an additional "madvise-pmd" option (that could be the new default), that would use PMD THP only in madvise'd regions, and ordinary everywhere else. But let's disregard that for now. I think there is a bigger issue (below). > > I still feel that it would be better for the thp and large anon folio controls > to be independent though - what's the argument for tying them together? Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD THP on aarch64 (4k kernel), how are they any different? Just the way they are mapped ... It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, but how is "2MiB vs. 2 MiB" different? Having that said, I think we have to make up our mind how much control we want to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not trivial: memory waste is a real issue on some systems where we limit THP to madvise(). Just throwing it out for discussing: What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE ...) but having an additional config knob that specifies in which cases we *still* allow flexible THP even though the system was configured for "madvise". I can't come up with a good name for that, but something like "max_auto_size=64k" could be something reasonable to set. We could have an arch+hw specific default. (we all hate config options, I know, but I think there are good reasons to have such bare-minimum ones)
On 07/07/2023 17:06, David Hildenbrand wrote: > On 07.07.23 17:13, Ryan Roberts wrote: >> On 07/07/2023 15:07, David Hildenbrand wrote: >>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>> something like "always madvise never" via >>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>> a good idea to reuse the existing interface of THP. >>>>>> >>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>> That >>>>>> means that on a thp=madvise system (which is certainly the case for >>>>>> android and >>>>>> other client systems) we would have to disable large anon folios for VMAs >>>>>> that >>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>> invisible performance boost. I think it's important to set the policy for >>>>>> use of >>>>> >>>>> It will never ever be a completely invisible performance boost, just like >>>>> ordinary THP. >>>>> >>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>> specify "never" or "madvise", then do exactly that. >>>>> >>>>> It might make sense to have more modes or additional toggles, but >>>>> "madvise=never" means no memory waste. >>>> >>>> I hate the existing mechanisms. They are an abdication of our >>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>> or the programmer) of our code for using it wrongly. We should not >>>> replicate this mistake. >>> >>> I don't agree regarding the programmer responsibility. In some cases the >>> programmer really doesn't want to get more memory populated than requested -- >>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>> >>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>> (and nailing down bugs or working around them in customer setups) have been very >>> good reasons to let the admin have a word. >>> >>>> >>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>> >>> >>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>> strikes you know it isn't. >>> >>> If people don't feel like using THP, let them have a word. The "madvise" config >>> option is probably more controversial. But the "always vs. never" absolutely >>> makes sense to me. >>> >>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>> additional page tables. So if you have to respect that already, then also >>>>> respect MADV_HUGEPAGE, simple. >>>> >>>> Possibly having uffd enabled on a VMA should disable using large folios, >>> >>> There are cases where we enable uffd *after* already touching memory (postcopy >>> live migration in QEMU being the famous example). That doesn't fly. >>> >>>> I can get behind that. But the notion that userspace knows what it's >>>> doing ... 
hahaha. Just ignore the madvise flags. Userspace doesn't >>>> know what it's doing. >>> >>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>> some cases. And these include cases I care about messing with sparse VM >>> memory :) >>> >>> I have strong opinions against populating more than required when user space set >>> MADV_NOHUGEPAGE. >> >> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >> The app has gone out of its way to explicitly set it, after all. >> >> I think the correct behaviour for the global thp controls (cmdline and sysfs) >> are less obvious though. I could get on board with disabling large anon folios >> globally when thp="never". But for other situations, I would prefer to keep >> large anon folios enabled (treat "madvise" as "always"), with the argument that >> their order is much smaller than traditional THP and therefore the internal >> fragmentation is significantly reduced. I really don't want to end up with user >> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >> anon folios. > > I was briefly playing with a nasty idea of an additional "madvise-pmd" option > (that could be the new default), that would use PMD THP only in madvise'd > regions, and ordinary everywhere else. But let's disregard that for now. I think > there is a bigger issue (below). > >> >> I still feel that it would be better for the thp and large anon folio controls >> to be independent though - what's the argument for tying them together? > > Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD > THP on aarch64 (4k kernel), how are they any different? Just the way they are > mapped ... The last patch in the series shows my current approach to that: int arch_wants_pte_order(struct vm_area_struct *vma) { if (hugepage_vma_check(vma, vma->vm_flags, false, true, true)) return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size else return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K } But Yu has raised concerns that this type of policy needs to be in the core mm. So we could have the arch blindly return the preferred order from HW perspective (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm could take the min of that value and some determined "acceptable" limit (which in my mind is 64K ;-). > > It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, > but how is "2MiB vs. 2 MiB" different? > > Having that said, I think we have to make up our mind how much control we want > to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not > trivial: memory waste is a real issue on some systems where we limit THP to > madvise(). > > > Just throwing it out for discussing: > > What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE > ...) but having an additional config knob that specifies in which cases we > *still* allow flexible THP even though the system was configured for "madvise". > > I can't come up with a good name for that, but something like > "max_auto_size=64k" could be something reasonable to set. We could have an > arch+hw specific default. Ahha, yes, that's essentially what I have above. I personally also like the idea of the limit being an absolute value rather than an order. Although I know Yu feels differently (see [1]). 
[1] https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m2aff6eebd7f14d0d0620b48497d26eacecf970e6 > > (we all hate config options, I know, but I think there are good reasons to have > such bare-minimum ones) >
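To make the split Ryan describes concrete -- the arch blindly reporting the order the HW prefers, and the core mm clamping it for VMAs that have not opted in -- here is a minimal, self-contained userspace model. It only illustrates the shape of the policy under discussion; the names (arch_preferred_order, anon_folio_order), the contpte orders and the 64K limit are assumptions for illustration, not taken from the posted series.

#include <stdio.h>
#include <stdbool.h>

#define SZ_64K (64UL * 1024)

/* Stand-in for an arch hook like arch_wants_pte_order(): report only what the HW likes. */
static int arch_preferred_order(int page_shift)
{
        /* Rough arm64 contpte sizes: 64K blocks on 4K pages, 2M blocks on 64K pages (assumption). */
        return page_shift == 12 ? 4 : 5;
}

/* Core-mm side: convert an absolute byte limit into an order for this base page size. */
static int clamp_order(int order, int page_shift, unsigned long limit_bytes)
{
        int max_order = 0;

        while ((1UL << (max_order + 1 + page_shift)) <= limit_bytes)
                max_order++;

        return order < max_order ? order : max_order;
}

static int anon_folio_order(int page_shift, bool vma_opted_in)
{
        int order = arch_preferred_order(page_shift);

        /* Clamp unless the VMA opted in (MADV_HUGEPAGE / thp=always style policy). */
        if (!vma_opted_in)
                order = clamp_order(order, page_shift, SZ_64K);

        return order;
}

int main(void)
{
        printf("4K pages, default VMA:        order %d\n", anon_folio_order(12, false));
        printf("64K pages, default VMA:       order %d\n", anon_folio_order(16, false));
        printf("64K pages, MADV_HUGEPAGE VMA: order %d\n", anon_folio_order(16, true));
        return 0;
}

On a 4K kernel the clamp is a no-op here, which matches Ryan's point that the 64K contpte size is already a "reasonable" default; on a 64K kernel it is exactly the "2 MiB vs. 2 MiB" tension David raises.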
>>> I still feel that it would be better for the thp and large anon folio controls >>> to be independent though - what's the argument for tying them together? >> >> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD >> THP on aarch64 (4k kernel), how are they any different? Just the way they are >> mapped ... > > The last patch in the series shows my current approach to that: > > int arch_wants_pte_order(struct vm_area_struct *vma) > { > if (hugepage_vma_check(vma, vma->vm_flags, false, true, true)) > return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size > else > return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K > } > > But Yu has raised concerns that this type of policy needs to be in the core mm. > So we could have the arch blindly return the preferred order from HW perspective > (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm > could take the min of that value and some determined "acceptable" limit (which > in my mind is 64K ;-). Yeah, it's really tricky. Because why should arm64 with 64k base pages *not* return 2MiB (which is one possible cont-pte size IIRC) ? I share the idea that 64k might *currently* on *some platforms* be a reasonable choice. But that's where the "fun" begins. > >> >> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, >> but how is "2MiB vs. 2 MiB" different? >> >> Having that said, I think we have to make up our mind how much control we want >> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not >> trivial: memory waste is a real issue on some systems where we limit THP to >> madvise(). >> >> >> Just throwing it out for discussing: >> >> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE >> ...) but having an additional config knob that specifies in which cases we >> *still* allow flexible THP even though the system was configured for "madvise". >> >> I can't come up with a good name for that, but something like >> "max_auto_size=64k" could be something reasonable to set. We could have an >> arch+hw specific default. > > Ahha, yes, that's essentially what I have above. I personally also like the idea > of the limit being an absolute value rather than an order. Although I know Yu > feels differently (see [1]). Exposed to user space I think it should be a human-readable value. Inside the kernel, I don't particularly care. (Having databases/VMs on arch64 with 64k in mind) I think it might be interesting to have something like the following: thp=madvise max_auto_size=64k/128k/256k So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any flexible THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 MiB THP? sure, give them to my VM. You're barely going to find 512 MiB THP either way in practice .... But for the remainder of my system, just do something reasonable and don't go crazy on the memory waste. I'll try reading all the previous discussions next week.
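Since the knob David sketches ("max_auto_size=64k") would be a human-readable byte value while the allocator thinks in orders, the following small illustration shows the conversion, and how the same value maps to different orders depending on base page size. The parser here is purely illustrative; in-kernel code would more likely lean on memparse() for the k/m/g suffixes.

#include <stdio.h>
#include <stdlib.h>

/* Toy parser for values like "64k"/"128k"/"256k" (illustrative only). */
static unsigned long parse_size(const char *s)
{
        char *end;
        unsigned long val = strtoul(s, &end, 10);

        switch (*end) {
        case 'k': case 'K': return val << 10;
        case 'm': case 'M': return val << 20;
        case 'g': case 'G': return val << 30;
        default:            return val;
        }
}

/* Largest order whose folio size fits within the byte limit. */
static int size_to_order(unsigned long bytes, int page_shift)
{
        int order = 0;

        while ((1UL << (order + 1 + page_shift)) <= bytes)
                order++;
        return order;
}

int main(void)
{
        const char *knob = "64k";       /* hypothetical sysfs value */
        unsigned long bytes = parse_size(knob);

        printf("%s = %lu bytes -> order %d with 4K pages, order %d with 64K pages\n",
               knob, bytes, size_to_order(bytes, 12), size_to_order(bytes, 16));
        return 0;
}

This is also the crux of the later "absolute value vs. order" exchange between Ryan and Yu: "64k" means order-4 on a 4K kernel but order-0 on a 64K kernel.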
Matthew Wilcox <willy@infradead.org> writes: > On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >> On 07.07.23 11:52, Ryan Roberts wrote: >> > On 07/07/2023 09:01, Huang, Ying wrote: >> > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to >> > > avoid internal fragmentation completely. So, I think that finally we >> > > will need to provide a mechanism for the users to opt out, e.g., >> > > something like "always madvise never" via >> > > /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >> > > a good idea to reuse the existing interface of THP. >> > >> > I wouldn't want to tie this to the existing interface, simply because that >> > implies that we would want to follow the "always" and "madvise" advice too; That >> > means that on a thp=madvise system (which is certainly the case for android and >> > other client systems) we would have to disable large anon folios for VMAs that >> > haven't explicitly opted in. That breaks the intention that this should be an >> > invisible performance boost. I think it's important to set the policy for use of >> >> It will never ever be a completely invisible performance boost, just like >> ordinary THP. >> >> Using the exact same existing toggle is the right thing to do. If someone >> specify "never" or "madvise", then do exactly that. >> >> It might make sense to have more modes or additional toggles, but >> "madvise=never" means no memory waste. > > I hate the existing mechanisms. They are an abdication of our > responsibility, and an attempt to blame the user (be it the sysadmin > or the programmer) of our code for using it wrongly. We should not > replicate this mistake. > > Our code should be auto-tuning. I posted a long, detailed outline here: > https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ Yes. Auto-tuning should be more preferable than any configuration mechanisms. Something like THP shrinker could be another way of auto-tuning. https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ That is, allocating the large folios on page fault, then try to detect internal fragmentation. >> I remember I raised it already in the past, but you *absolutely* have to >> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >> example, userfaultfd) that doesn't want the kernel to populate any >> additional page tables. So if you have to respect that already, then also >> respect MADV_HUGEPAGE, simple. > > Possibly having uffd enabled on a VMA should disable using large folios, > I can get behind that. But the notion that userspace knows what it's > doing ... hahaha. Just ignore the madvise flags. Userspace doesn't > know what it's doing. Best Regards, Huang, Ying
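For readers who have not followed the linked THP shrinker thread: as I understand that patchset, the underutilisation signal is roughly "how many 4K subpages of a huge page are still zero-filled", and pages below a usage threshold are split. A toy userspace model of that check, under those assumptions, looks something like this (the real code operates on struct pages inside the kernel):

#include <stdio.h>
#include <stdlib.h>

#define SUBPAGE_SIZE 4096UL
#define FOLIO_SIZE   (512 * SUBPAGE_SIZE)   /* model a 2M PMD-sized THP */

/* Count subpages that are entirely zero, i.e. likely never usefully written. */
static size_t count_zero_subpages(const unsigned char *folio)
{
        size_t i, zero = 0;

        for (i = 0; i < FOLIO_SIZE; i += SUBPAGE_SIZE) {
                size_t j;

                for (j = 0; j < SUBPAGE_SIZE && folio[i + j] == 0; j++)
                        ;
                if (j == SUBPAGE_SIZE)
                        zero++;
        }
        return zero;
}

int main(void)
{
        unsigned char *folio = calloc(1, FOLIO_SIZE);

        if (!folio)
                return 1;

        /* Sparse usage: touch only every 16th subpage (64K stride). */
        for (size_t off = 0; off < FOLIO_SIZE; off += 16 * SUBPAGE_SIZE)
                folio[off] = 1;

        printf("%zu of %lu subpages are zero-filled -> candidate for splitting\n",
               count_zero_subpages(folio), FOLIO_SIZE / SUBPAGE_SIZE);
        free(folio);
        return 0;
}

Note that this particular signal only needs page contents, not per-subpage access/dirty bits, which becomes relevant to the contpte concern Ryan raises further down the thread.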
Ryan Roberts <ryan.roberts@arm.com> writes: > On 07/07/2023 15:07, David Hildenbrand wrote: >> On 07.07.23 15:57, Matthew Wilcox wrote: >>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>> something like "always madvise never" via >>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>> a good idea to reuse the existing interface of THP. >>>>> >>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>> That >>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>> invisible performance boost. I think it's important to set the policy for >>>>> use of >>>> >>>> It will never ever be a completely invisible performance boost, just like >>>> ordinary THP. >>>> >>>> Using the exact same existing toggle is the right thing to do. If someone >>>> specify "never" or "madvise", then do exactly that. >>>> >>>> It might make sense to have more modes or additional toggles, but >>>> "madvise=never" means no memory waste. >>> >>> I hate the existing mechanisms. They are an abdication of our >>> responsibility, and an attempt to blame the user (be it the sysadmin >>> or the programmer) of our code for using it wrongly. We should not >>> replicate this mistake. >> >> I don't agree regarding the programmer responsibility. In some cases the >> programmer really doesn't want to get more memory populated than requested -- >> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >> >> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >> (and nailing down bugs or working around them in customer setups) have been very >> good reasons to let the admin have a word. >> >>> >>> Our code should be auto-tuning. I posted a long, detailed outline here: >>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>> >> >> Well, "auto-tuning" also should be perfect for everybody, but once reality >> strikes you know it isn't. >> >> If people don't feel like using THP, let them have a word. The "madvise" config >> option is probably more controversial. But the "always vs. never" absolutely >> makes sense to me. >> >>>> I remember I raised it already in the past, but you *absolutely* have to >>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>> example, userfaultfd) that doesn't want the kernel to populate any >>>> additional page tables. So if you have to respect that already, then also >>>> respect MADV_HUGEPAGE, simple. >>> >>> Possibly having uffd enabled on a VMA should disable using large folios, >> >> There are cases where we enable uffd *after* already touching memory (postcopy >> live migration in QEMU being the famous example). That doesn't fly. >> >>> I can get behind that. But the notion that userspace knows what it's >>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>> know what it's doing. 
>> >> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >> some cases. And these include cases I care about messing with sparse VM memory :) >> >> I have strong opinions against populating more than required when user space set >> MADV_NOHUGEPAGE. > > I can see your point about honouring MADV_NOHUGEPAGE, so think that it is > reasonable to fallback to allocating an order-0 page in a VMA that has it set. > The app has gone out of its way to explicitly set it, after all. > > I think the correct behaviour for the global thp controls (cmdline and sysfs) > are less obvious though. I could get on board with disabling large anon folios > globally when thp="never". But for other situations, I would prefer to keep > large anon folios enabled (treat "madvise" as "always"), If we have some mechanism to auto-tune the large folios usage, for example, detect the internal fragmentation and split the large folio, then we can use thp="always" as default configuration. If my memory were correct, this is what Johannes and Alexander is working on. > with the argument that > their order is much smaller than traditional THP and therefore the internal > fragmentation is significantly reduced. Do you have any data for this? > I really don't want to end up with user > space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large > anon folios. > > I still feel that it would be better for the thp and large anon folio controls > to be independent though - what's the argument for tying them together? > Best Regards, Huang, Ying
On 07/07/2023 20:06, David Hildenbrand wrote: >>>> I still feel that it would be better for the thp and large anon folio controls >>>> to be independent though - what's the argument for tying them together? >>> >>> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD >>> THP on aarch64 (4k kernel), how are they any different? Just the way they are >>> mapped ... >> >> The last patch in the series shows my current approach to that: >> >> int arch_wants_pte_order(struct vm_area_struct *vma) >> { >> if (hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >> return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size >> else >> return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K >> } >> >> But Yu has raised concerns that this type of policy needs to be in the core mm. >> So we could have the arch blindly return the preferred order from HW perspective >> (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm >> could take the min of that value and some determined "acceptable" limit (which >> in my mind is 64K ;-). > > Yeah, it's really tricky. Because why should arm64 with 64k base pages *not* > return 2MiB (which is one possible cont-pte size IIRC) ? > > I share the idea that 64k might *currently* on *some platforms* be a reasonable > choice. But that's where the "fun" begins. > >> >>> >>> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, >>> but how is "2MiB vs. 2 MiB" different? >>> >>> Having that said, I think we have to make up our mind how much control we want >>> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not >>> trivial: memory waste is a real issue on some systems where we limit THP to >>> madvise(). >>> >>> >>> Just throwing it out for discussing: >>> >>> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE >>> ...) but having an additional config knob that specifies in which cases we >>> *still* allow flexible THP even though the system was configured for "madvise". >>> >>> I can't come up with a good name for that, but something like >>> "max_auto_size=64k" could be something reasonable to set. We could have an >>> arch+hw specific default. >> >> Ahha, yes, that's essentially what I have above. I personally also like the idea >> of the limit being an absolute value rather than an order. Although I know Yu >> feels differently (see [1]). > > Exposed to user space I think it should be a human-readable value. Inside the > kernel, I don't particularly care. My point was less about human-readable vs not. It was about expressing a value that is relative to the base page size vs expressing a value that is independent of base page size. If the concern is about limiting internal fragmentation, I think its the absolute size that matters. > > (Having databases/VMs on arch64 with 64k in mind) I think it might be > interesting to have something like the following: > > thp=madvise > max_auto_size=64k/128k/256k > > > So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any flexible > THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 MiB THP? sure, give > them to my VM. You're barely going to find 512 MiB THP either way in practice .... > > But for the remainder of my system, just do something reasonable and don't go > crazy on the memory waste. Yep, we're on the same page. I've got a v3 that's almost ready to go, based on Yu's prevuous round of review. 
I'm going to incorporate this mechanism into it and then post, hopefully later in the week. Now I just need to figure out a decent name for the max_auto_size control... > > > I'll try reading all the previous discussions next week. >
On 10/07/2023 04:03, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> On 07/07/2023 15:07, David Hildenbrand wrote: >>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>> something like "always madvise never" via >>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>> a good idea to reuse the existing interface of THP. >>>>>> >>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>> That >>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>> invisible performance boost. I think it's important to set the policy for >>>>>> use of >>>>> >>>>> It will never ever be a completely invisible performance boost, just like >>>>> ordinary THP. >>>>> >>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>> specify "never" or "madvise", then do exactly that. >>>>> >>>>> It might make sense to have more modes or additional toggles, but >>>>> "madvise=never" means no memory waste. >>>> >>>> I hate the existing mechanisms. They are an abdication of our >>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>> or the programmer) of our code for using it wrongly. We should not >>>> replicate this mistake. >>> >>> I don't agree regarding the programmer responsibility. In some cases the >>> programmer really doesn't want to get more memory populated than requested -- >>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>> >>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>> (and nailing down bugs or working around them in customer setups) have been very >>> good reasons to let the admin have a word. >>> >>>> >>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>> >>> >>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>> strikes you know it isn't. >>> >>> If people don't feel like using THP, let them have a word. The "madvise" config >>> option is probably more controversial. But the "always vs. never" absolutely >>> makes sense to me. >>> >>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>> additional page tables. So if you have to respect that already, then also >>>>> respect MADV_HUGEPAGE, simple. >>>> >>>> Possibly having uffd enabled on a VMA should disable using large folios, >>> >>> There are cases where we enable uffd *after* already touching memory (postcopy >>> live migration in QEMU being the famous example). That doesn't fly. >>> >>>> I can get behind that. But the notion that userspace knows what it's >>>> doing ... hahaha. 
Just ignore the madvise flags. Userspace doesn't >>>> know what it's doing. >>> >>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>> some cases. And these include cases I care about messing with sparse VM memory :) >>> >>> I have strong opinions against populating more than required when user space set >>> MADV_NOHUGEPAGE. >> >> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >> The app has gone out of its way to explicitly set it, after all. >> >> I think the correct behaviour for the global thp controls (cmdline and sysfs) >> are less obvious though. I could get on board with disabling large anon folios >> globally when thp="never". But for other situations, I would prefer to keep >> large anon folios enabled (treat "madvise" as "always"), > > If we have some mechanism to auto-tune the large folios usage, for > example, detect the internal fragmentation and split the large folio, > then we can use thp="always" as default configuration. If my memory > were correct, this is what Johannes and Alexander is working on. Could you point me to that work? I'd like to understand what the mechanism is. The other half of my work aims to use arm64's pte "contiguous bit" to tell the HW that a span of PTEs share the same mapping and is therefore coalesced into a single TLB entry. The side effect of this, however, is that we only have a single access and dirty bit for the whole contpte extent. So I'd like to avoid any mechanism that relies on getting access/dirty at the base page granularity for a large folio. > >> with the argument that >> their order is much smaller than traditional THP and therefore the internal >> fragmentation is significantly reduced. > > Do you have any data for this? Some; it's partly based on intuition that the smaller the allocation unit, the smaller the internal fragmentation. And partly on peak memory usage data I've collected for the benchmarks I'm running, comparing a baseline-4k kernel with baseline-16k and baseline-64k kernels, along with a 4k kernel that supports large anon folios (I appreciate that's not exactly what we are talking about here, and it's not exactly an extensive set of results!):

Kernel Compilation with 8 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   0.1% |
| baseline-16k  |   6.3% |
| baseline-64k  |  28.1% |

Kernel Compilation with 80 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   1.7% |
| baseline-16k  |   2.6% |
| baseline-64k  |  12.3% |

> >> I really don't want to end up with user >> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >> anon folios. >> >> I still feel that it would be better for the thp and large anon folio controls >> to be independent though - what's the argument for tying them together? >> > > Best Regards, > Huang, Ying >
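To make the contpte concern above concrete: once a block of PTEs is marked contiguous, the hardware effectively gives a single access/dirty indication for the whole block, so any heuristic that wants per-4K-page access/dirty information stops working inside that block. A purely illustrative toy model of why the per-subpage information is lost (not arm64 code):

#include <stdio.h>
#include <stdbool.h>

#define CONT_PTES 16    /* e.g. 16 x 4K = 64K contpte block on a 4K kernel (assumption) */

/* The block shares one young/dirty state across all of its PTEs. */
struct contpte_block {
        bool young;
        bool dirty;
};

static void write_subpage(struct contpte_block *blk, int idx)
{
        (void)idx;      /* which subpage was written is not recorded anywhere */
        blk->young = true;
        blk->dirty = true;
}

int main(void)
{
        struct contpte_block blk = { 0 };

        write_subpage(&blk, 3); /* one 4K write ... */

        /* ... and every subpage in the extent now looks young and dirty. */
        for (int i = 0; i < CONT_PTES; i++)
                printf("subpage %2d: young=%d dirty=%d\n", i, blk.young, blk.dirty);
        return 0;
}

This is why Ryan prefers auto-tuning mechanisms that do not depend on base-page-granularity access/dirty bits; content-based signals like the zero-subpage scan sketched earlier would still work.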
Ryan Roberts <ryan.roberts@arm.com> writes: > On 10/07/2023 04:03, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> On 07/07/2023 15:07, David Hildenbrand wrote: >>>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>>> something like "always madvise never" via >>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>>> a good idea to reuse the existing interface of THP. >>>>>>> >>>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>>> That >>>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>>> invisible performance boost. I think it's important to set the policy for >>>>>>> use of >>>>>> >>>>>> It will never ever be a completely invisible performance boost, just like >>>>>> ordinary THP. >>>>>> >>>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>>> specify "never" or "madvise", then do exactly that. >>>>>> >>>>>> It might make sense to have more modes or additional toggles, but >>>>>> "madvise=never" means no memory waste. >>>>> >>>>> I hate the existing mechanisms. They are an abdication of our >>>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>>> or the programmer) of our code for using it wrongly. We should not >>>>> replicate this mistake. >>>> >>>> I don't agree regarding the programmer responsibility. In some cases the >>>> programmer really doesn't want to get more memory populated than requested -- >>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>>> >>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>>> (and nailing down bugs or working around them in customer setups) have been very >>>> good reasons to let the admin have a word. >>>> >>>>> >>>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>>> >>>> >>>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>>> strikes you know it isn't. >>>> >>>> If people don't feel like using THP, let them have a word. The "madvise" config >>>> option is probably more controversial. But the "always vs. never" absolutely >>>> makes sense to me. >>>> >>>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>>> additional page tables. So if you have to respect that already, then also >>>>>> respect MADV_HUGEPAGE, simple. >>>>> >>>>> Possibly having uffd enabled on a VMA should disable using large folios, >>>> >>>> There are cases where we enable uffd *after* already touching memory (postcopy >>>> live migration in QEMU being the famous example). 
That doesn't fly. >>>> >>>>> I can get behind that. But the notion that userspace knows what it's >>>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>>> know what it's doing. >>>> >>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>>> some cases. And these include cases I care about messing with sparse VM memory :) >>>> >>>> I have strong opinions against populating more than required when user space set >>>> MADV_NOHUGEPAGE. >>> >>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >>> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >>> The app has gone out of its way to explicitly set it, after all. >>> >>> I think the correct behaviour for the global thp controls (cmdline and sysfs) >>> are less obvious though. I could get on board with disabling large anon folios >>> globally when thp="never". But for other situations, I would prefer to keep >>> large anon folios enabled (treat "madvise" as "always"), >> >> If we have some mechanism to auto-tune the large folios usage, for >> example, detect the internal fragmentation and split the large folio, >> then we can use thp="always" as default configuration. If my memory >> were correct, this is what Johannes and Alexander is working on. > > Could you point me to that work? I'd like to understand what the mechanism is. > The other half of my work aims to use arm64's pte "contiguous bit" to tell the > HW that a span of PTEs share the same mapping and is therefore coalesced into a > single TLB entry. The side effect of this, however, is that we only have a > single access and dirty bit for the whole contpte extent. So I'd like to avoid > any mechanism that relies on getting access/dirty at the base page granularity > for a large folio. Please take a look at the THP shrinker patchset, https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ >> >>> with the argument that >>> their order is much smaller than traditional THP and therefore the internal >>> fragmentation is significantly reduced. >> >> Do you have any data for this? > > Some; its partly based on intuition that the smaller the allocation unit, the > smaller the internal fragmentation. And partly on peak memory usage data I've > collected for the benchmarks I'm running, comparing baseline-4k kernel with > baseline-16k and baseline-64 kernels along with a 4k kernel that supports large > anon folios (I appreciate that's not exactly what we are talking about here, and > it's not exactly an extensive set of results!): > > > Kernel Compliation with 8 Jobs: > | kernel | peak | > |:--------------|-------:| > | baseline-4k | 0.0% | > | anonfolio | 0.1% | > | baseline-16k | 6.3% | > | baseline-64k | 28.1% | > > > Kernel Compliation with 80 Jobs: > | kernel | peak | > |:--------------|-------:| > | baseline-4k | 0.0% | > | anonfolio | 1.7% | > | baseline-16k | 2.6% | > | baseline-64k | 12.3% | > Why is anonfolio better than baseline-64k if you always allocate 64k anonymous folio? Because page cache uses 64k in baseline-64k? We may need to test some workloads with sparse access patterns too. Best Regards, Huang, Ying >> >>> I really don't want to end up with user >>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >>> anon folios. >>> >>> I still feel that it would be better for the thp and large anon folio controls >>> to be independent though - what's the argument for tying them together? >>>
On 10/07/2023 10:18, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> On 10/07/2023 04:03, Huang, Ying wrote: >>> Ryan Roberts <ryan.roberts@arm.com> writes: >>> >>>> On 07/07/2023 15:07, David Hildenbrand wrote: >>>>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>>>> something like "always madvise never" via >>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>>>> a good idea to reuse the existing interface of THP. >>>>>>>> >>>>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>>>> That >>>>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>>>> invisible performance boost. I think it's important to set the policy for >>>>>>>> use of >>>>>>> >>>>>>> It will never ever be a completely invisible performance boost, just like >>>>>>> ordinary THP. >>>>>>> >>>>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>>>> specify "never" or "madvise", then do exactly that. >>>>>>> >>>>>>> It might make sense to have more modes or additional toggles, but >>>>>>> "madvise=never" means no memory waste. >>>>>> >>>>>> I hate the existing mechanisms. They are an abdication of our >>>>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>>>> or the programmer) of our code for using it wrongly. We should not >>>>>> replicate this mistake. >>>>> >>>>> I don't agree regarding the programmer responsibility. In some cases the >>>>> programmer really doesn't want to get more memory populated than requested -- >>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>>>> >>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>>>> (and nailing down bugs or working around them in customer setups) have been very >>>>> good reasons to let the admin have a word. >>>>> >>>>>> >>>>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>>>> >>>>> >>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>>>> strikes you know it isn't. >>>>> >>>>> If people don't feel like using THP, let them have a word. The "madvise" config >>>>> option is probably more controversial. But the "always vs. never" absolutely >>>>> makes sense to me. >>>>> >>>>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>>>> additional page tables. So if you have to respect that already, then also >>>>>>> respect MADV_HUGEPAGE, simple. 
>>>>>> >>>>>> Possibly having uffd enabled on a VMA should disable using large folios, >>>>> >>>>> There are cases where we enable uffd *after* already touching memory (postcopy >>>>> live migration in QEMU being the famous example). That doesn't fly. >>>>> >>>>>> I can get behind that. But the notion that userspace knows what it's >>>>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>>>> know what it's doing. >>>>> >>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>>>> some cases. And these include cases I care about messing with sparse VM memory :) >>>>> >>>>> I have strong opinions against populating more than required when user space set >>>>> MADV_NOHUGEPAGE. >>>> >>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >>>> The app has gone out of its way to explicitly set it, after all. >>>> >>>> I think the correct behaviour for the global thp controls (cmdline and sysfs) >>>> are less obvious though. I could get on board with disabling large anon folios >>>> globally when thp="never". But for other situations, I would prefer to keep >>>> large anon folios enabled (treat "madvise" as "always"), >>> >>> If we have some mechanism to auto-tune the large folios usage, for >>> example, detect the internal fragmentation and split the large folio, >>> then we can use thp="always" as default configuration. If my memory >>> were correct, this is what Johannes and Alexander is working on. >> >> Could you point me to that work? I'd like to understand what the mechanism is. >> The other half of my work aims to use arm64's pte "contiguous bit" to tell the >> HW that a span of PTEs share the same mapping and is therefore coalesced into a >> single TLB entry. The side effect of this, however, is that we only have a >> single access and dirty bit for the whole contpte extent. So I'd like to avoid >> any mechanism that relies on getting access/dirty at the base page granularity >> for a large folio. > > Please take a look at the THP shrinker patchset, > > https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ Thanks! > >>> >>>> with the argument that >>>> their order is much smaller than traditional THP and therefore the internal >>>> fragmentation is significantly reduced. >>> >>> Do you have any data for this? >> >> Some; its partly based on intuition that the smaller the allocation unit, the >> smaller the internal fragmentation. And partly on peak memory usage data I've >> collected for the benchmarks I'm running, comparing baseline-4k kernel with >> baseline-16k and baseline-64 kernels along with a 4k kernel that supports large >> anon folios (I appreciate that's not exactly what we are talking about here, and >> it's not exactly an extensive set of results!): >> >> >> Kernel Compliation with 8 Jobs: >> | kernel | peak | >> |:--------------|-------:| >> | baseline-4k | 0.0% | >> | anonfolio | 0.1% | >> | baseline-16k | 6.3% | >> | baseline-64k | 28.1% | >> >> >> Kernel Compliation with 80 Jobs: >> | kernel | peak | >> |:--------------|-------:| >> | baseline-4k | 0.0% | >> | anonfolio | 1.7% | >> | baseline-16k | 2.6% | >> | baseline-64k | 12.3% | >> > > Why is anonfolio better than baseline-64k if you always allocate 64k > anonymous folio? Because page cache uses 64k in baseline-64k? No, because the VMA boundaries are aligned to 4K and not 64K. 
Large Anon Folios only allocates a 64K folio if it does not breach the bounds of the VMA (and if it doesn't overlap other allocated PTEs). > > We may need to test some workloads with sparse access patterns too. Yes, I agree: if you have a workload with a pathological memory access pattern where it writes to addresses with a stride of 64K, all contained in a single VMA, then you will end up allocating 16x the memory. This is obviously an unrealistic extreme though. > > Best Regards, > Huang, Ying > >>> >>>> I really don't want to end up with user >>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >>>> anon folios. >>>> >>>> I still feel that it would be better for the thp and large anon folio controls >>>> to be independent though - what's the argument for tying them together? >>>> >
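The 16x figure comes straight from the arithmetic: with a 64K allocation unit, one write per 64K stride populates the entire slot, whereas order-0 allocation populates only one 4K page per fault. A quick self-contained illustration (the 16M VMA size is arbitrary, and it assumes a 64K folio is allocated whenever the aligned 64K slot is otherwise empty):

#include <stdio.h>

#define PAGE_SIZE  4096UL
#define FOLIO_SIZE (64UL * 1024)
#define VMA_SIZE   (16UL * 1024 * 1024)   /* 16M VMA, purely illustrative */

int main(void)
{
        unsigned long touched = 0, mem_4k = 0, mem_64k = 0;

        /* Pathological pattern from the discussion: one write per 64K stride. */
        for (unsigned long addr = 0; addr < VMA_SIZE; addr += FOLIO_SIZE) {
                touched++;
                mem_4k  += PAGE_SIZE;   /* order-0: one 4K page per fault */
                mem_64k += FOLIO_SIZE;  /* 64K folio covering the whole slot */
        }

        printf("%lu faults: %lu KB with 4K pages vs %lu KB with 64K folios (%lux)\n",
               touched, mem_4k / 1024, mem_64k / 1024, mem_64k / mem_4k);
        return 0;
}

For dense access patterns the two converge, which is consistent with the anonfolio peak-memory numbers above staying close to baseline-4k.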
Ryan Roberts <ryan.roberts@arm.com> writes: > On 10/07/2023 10:18, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> On 10/07/2023 04:03, Huang, Ying wrote: >>>> Ryan Roberts <ryan.roberts@arm.com> writes: >>>> >>>>> On 07/07/2023 15:07, David Hildenbrand wrote: >>>>>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>>>>> something like "always madvise never" via >>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>>>>> a good idea to reuse the existing interface of THP. >>>>>>>>> >>>>>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>>>>> That >>>>>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>>>>> invisible performance boost. I think it's important to set the policy for >>>>>>>>> use of >>>>>>>> >>>>>>>> It will never ever be a completely invisible performance boost, just like >>>>>>>> ordinary THP. >>>>>>>> >>>>>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>>>>> specify "never" or "madvise", then do exactly that. >>>>>>>> >>>>>>>> It might make sense to have more modes or additional toggles, but >>>>>>>> "madvise=never" means no memory waste. >>>>>>> >>>>>>> I hate the existing mechanisms. They are an abdication of our >>>>>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>>>>> or the programmer) of our code for using it wrongly. We should not >>>>>>> replicate this mistake. >>>>>> >>>>>> I don't agree regarding the programmer responsibility. In some cases the >>>>>> programmer really doesn't want to get more memory populated than requested -- >>>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>>>>> >>>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>>>>> (and nailing down bugs or working around them in customer setups) have been very >>>>>> good reasons to let the admin have a word. >>>>>> >>>>>>> >>>>>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>>>>> >>>>>> >>>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>>>>> strikes you know it isn't. >>>>>> >>>>>> If people don't feel like using THP, let them have a word. The "madvise" config >>>>>> option is probably more controversial. But the "always vs. never" absolutely >>>>>> makes sense to me. >>>>>> >>>>>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>>>>> additional page tables. So if you have to respect that already, then also >>>>>>>> respect MADV_HUGEPAGE, simple. 
>>>>>>> >>>>>>> Possibly having uffd enabled on a VMA should disable using large folios, >>>>>> >>>>>> There are cases where we enable uffd *after* already touching memory (postcopy >>>>>> live migration in QEMU being the famous example). That doesn't fly. >>>>>> >>>>>>> I can get behind that. But the notion that userspace knows what it's >>>>>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>>>>> know what it's doing. >>>>>> >>>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>>>>> some cases. And these include cases I care about messing with sparse VM memory :) >>>>>> >>>>>> I have strong opinions against populating more than required when user space set >>>>>> MADV_NOHUGEPAGE. >>>>> >>>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >>>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >>>>> The app has gone out of its way to explicitly set it, after all. >>>>> >>>>> I think the correct behaviour for the global thp controls (cmdline and sysfs) >>>>> are less obvious though. I could get on board with disabling large anon folios >>>>> globally when thp="never". But for other situations, I would prefer to keep >>>>> large anon folios enabled (treat "madvise" as "always"), >>>> >>>> If we have some mechanism to auto-tune the large folios usage, for >>>> example, detect the internal fragmentation and split the large folio, >>>> then we can use thp="always" as default configuration. If my memory >>>> were correct, this is what Johannes and Alexander is working on. >>> >>> Could you point me to that work? I'd like to understand what the mechanism is. >>> The other half of my work aims to use arm64's pte "contiguous bit" to tell the >>> HW that a span of PTEs share the same mapping and is therefore coalesced into a >>> single TLB entry. The side effect of this, however, is that we only have a >>> single access and dirty bit for the whole contpte extent. So I'd like to avoid >>> any mechanism that relies on getting access/dirty at the base page granularity >>> for a large folio. >> >> Please take a look at the THP shrinker patchset, >> >> https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ > > Thanks! > >> >>>> >>>>> with the argument that >>>>> their order is much smaller than traditional THP and therefore the internal >>>>> fragmentation is significantly reduced. >>>> >>>> Do you have any data for this? >>> >>> Some; its partly based on intuition that the smaller the allocation unit, the >>> smaller the internal fragmentation. And partly on peak memory usage data I've >>> collected for the benchmarks I'm running, comparing baseline-4k kernel with >>> baseline-16k and baseline-64 kernels along with a 4k kernel that supports large >>> anon folios (I appreciate that's not exactly what we are talking about here, and >>> it's not exactly an extensive set of results!): >>> >>> >>> Kernel Compliation with 8 Jobs: >>> | kernel | peak | >>> |:--------------|-------:| >>> | baseline-4k | 0.0% | >>> | anonfolio | 0.1% | >>> | baseline-16k | 6.3% | >>> | baseline-64k | 28.1% | >>> >>> >>> Kernel Compliation with 80 Jobs: >>> | kernel | peak | >>> |:--------------|-------:| >>> | baseline-4k | 0.0% | >>> | anonfolio | 1.7% | >>> | baseline-16k | 2.6% | >>> | baseline-64k | 12.3% | >>> >> >> Why is anonfolio better than baseline-64k if you always allocate 64k >> anonymous folio? Because page cache uses 64k in baseline-64k? 
>
> No, because the VMA boundaries are aligned to 4K and not 64K. Large Anon
> Folios only allocates a 64K folio if it does not breach the bounds of the VMA
> (and if it doesn't overlap other allocated PTEs).

Thanks for the explanation! Baseline-64k will use more memory for the file
cache too, right? So you observed much more anonymous memory, but not
correspondingly more file cache?

>>
>> We may need to test some workloads with sparse access patterns too.
>
> Yes, I agree: if you have a workload with a pathological memory access
> pattern where it writes to addresses with a stride of 64K, all contained in
> a single VMA, then you will end up allocating 16x the memory. This is
> obviously an unrealistic extreme though.

I think there should be some realistic workloads which have sparse access
patterns.

Best Regards,
Huang, Ying

>>
>>>>
>>>>> I really don't want to end up with user
>>>>> space ever having to opt in (with MADV_HUGEPAGE) to see the benefits of
>>>>> large anon folios.
>>>>>
>>>>> I still feel that it would be better for the thp and large anon folio
>>>>> controls to be independent though - what's the argument for tying them
>>>>> together?
>>>>>
>>
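To make the VMA-bounds rule discussed above concrete, here is a minimal
sketch, assuming an illustrative helper name that is not taken from the patch
(the real logic lives in calc_anon_folio_order_alloc() in the diff below,
which additionally requires every PTE covered by the candidate folio to be
none):

/*
 * Minimal sketch (illustrative name, not from the patch): shrink the
 * requested order until a naturally aligned block of 1 << order pages
 * around the faulting address fits entirely within the VMA. Order 0
 * always fits, so the loop can only fall back, never fail.
 */
static int fit_order_to_vma(struct vm_area_struct *vma,
			    unsigned long fault_addr, int requested_order)
{
	int order;

	for (order = requested_order; order > 0; order--) {
		unsigned long size = (1UL << order) << PAGE_SHIFT;
		unsigned long start = ALIGN_DOWN(fault_addr, size);

		/* The aligned block must not breach the VMA bounds. */
		if (start >= vma->vm_start && start + size <= vma->vm_end)
			break;
	}

	return order;
}

In the patch itself this scan is done without the PTL held, so
do_anonymous_page() re-checks under the lock that the covered PTEs are still
none, and falls back to an order-0 allocation if the large folio cannot be
allocated.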
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..1c06b2c0a24e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
 	  support of file THPs will be developed in the next few release
 	  cycles.
 
+config FLEXIBLE_THP
+	bool "Flexible order THP"
+	depends on TRANSPARENT_HUGEPAGE
+	default n
+	help
+	  Use large (bigger than order-0) folios to back anonymous memory where
+	  possible, even if the order of the folio is smaller than the PMD
+	  order. This reduces the number of page faults, as well as other
+	  per-page overheads to improve performance for many workloads.
+
 endif # TRANSPARENT_HUGEPAGE
 
 #
diff --git a/mm/memory.c b/mm/memory.c
index fb30f7523550..abe2ea94f3f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_FLEXIBLE_THP
+/*
+ * Allocates, zeros and returns a folio of the requested order for use as
+ * anonymous memory.
+ */
+static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
+				      unsigned long addr, int order)
+{
+	gfp_t gfp;
+	struct folio *folio;
+
+	if (order == 0)
+		return vma_alloc_zeroed_movable_folio(vma, addr);
+
+	gfp = vma_thp_gfp_mask(vma);
+	folio = vma_alloc_folio(gfp, order, vma, addr, true);
+	if (folio)
+		clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
+
+	return folio;
+}
+
+/*
+ * Preferred folio order to allocate for anonymous memory.
+ */
+#define max_anon_folio_order(vma)	arch_wants_pte_order(vma)
+#else
+#define alloc_anon_folio(vma, addr, order) \
+				vma_alloc_zeroed_movable_folio(vma, addr)
+#define max_anon_folio_order(vma)	0
+#endif
+
+/*
+ * Returns index of first pte that is not none, or nr if all are none.
+ */
+static inline int check_ptes_none(pte_t *pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		if (!pte_none(ptep_get(pte++)))
+			return i;
+	}
+
+	return nr;
+}
+
+static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
+{
+	/*
+	 * The aim here is to determine what size of folio we should allocate
+	 * for this fault. Factors include:
+	 * - Order must not be higher than `order` upon entry
+	 * - Folio must be naturally aligned within VA space
+	 * - Folio must be fully contained inside one pmd entry
+	 * - Folio must not breach boundaries of vma
+	 * - Folio must not overlap any non-none ptes
+	 *
+	 * Additionally, we do not allow order-1 since this breaks assumptions
+	 * elsewhere in the mm; THP pages must be at least order-2 (since they
+	 * store state up to the 3rd struct page subpage), and these pages must
+	 * be THP in order to correctly use pre-existing THP infrastructure such
+	 * as folio_split().
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the result is racy and the user must re-check any overlap
+	 * with non-none ptes under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	int nr;
+	unsigned long addr;
+	pte_t *pte;
+	pte_t *first_set = NULL;
+	int ret;
+
+	order = min(order, PMD_SHIFT - PAGE_SHIFT);
+
+	for (; order > 1; order--) {
+		nr = 1 << order;
+		addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
+		pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
+
+		/* Check vma bounds. */
+		if (addr < vma->vm_start ||
+		    addr + (nr << PAGE_SHIFT) > vma->vm_end)
+			continue;
+
+		/* Ptes covered by order already known to be none. */
+		if (pte + nr <= first_set)
+			break;
+
+		/* Already found set pte in range covered by order. */
+		if (pte <= first_set)
+			continue;
+
+		/* Need to check if all the ptes are none. */
+		ret = check_ptes_none(pte, nr);
+		if (ret == nr)
+			break;
+
+		first_set = pte + ret;
+	}
+
+	if (order == 1)
+		order = 0;
+
+	return order;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		goto oom;
 
 	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
-		new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+		new_folio = alloc_anon_folio(vma, vmf->address, 0);
 		if (!new_folio)
 			goto oom;
 	} else {
@@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	int order;
+	int pgcount;
+	unsigned long addr;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return handle_userfault(vmf, VM_UFFD_MISSING);
 		}
-		goto setpte;
+		if (uffd_wp)
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, vmf->address, vmf->pte);
+		goto unlock;
+	}
+
+	/*
+	 * If allocating a large folio, determine the biggest suitable order for
+	 * the VMA (e.g. it must not exceed the VMA's bounds, it must not
+	 * overlap with any populated PTEs, etc). We are not under the ptl here
+	 * so we will need to re-check that we are not overlapping any populated
+	 * PTEs once we have the lock.
+	 */
+	order = uffd_wp ? 0 : max_anon_folio_order(vma);
+	if (order > 0) {
+		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		order = calc_anon_folio_order_alloc(vmf, order);
+		pte_unmap(vmf->pte);
 	}
 
-	/* Allocate our own private page. */
+	/* Allocate our own private folio. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	folio = alloc_anon_folio(vma, vmf->address, order);
+	if (!folio && order > 0) {
+		order = 0;
+		folio = alloc_anon_folio(vma, vmf->address, order);
+	}
 	if (!folio)
 		goto oom;
 
+	pgcount = 1 << order;
+	addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
+
 	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
+	 * preceding stores to the folio contents become visible before
+	 * the set_ptes() write.
 	 */
 	__folio_mark_uptodate(folio);
 
@@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	if (vmf_pte_changed(vmf)) {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
 		goto release;
+	} else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
+		goto release;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
@@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	folio_add_new_anon_rmap(folio, vma, vmf->address);
+	folio_ref_add(folio, pgcount - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
+	folio_add_new_anon_rmap(folio, vma, addr);
 	folio_add_lru_vma(folio, vma);
-setpte:
+
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, vmf->address, vmf->pte);
+	update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
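For reference, max_anon_folio_order() above defers to arch_wants_pte_order(),
which is introduced by an earlier patch in this series and is not shown here.
A minimal sketch of the kind of answer an arm64-style implementation might
give, assuming a 16-entry contiguous-PTE span and an illustrative constant
name (both are assumptions for this example, not taken from the series):

/*
 * Illustrative only: preferred order for PTE-mapped anonymous folios.
 * With 4K base pages, a 16-entry contiguous-PTE block covers 64K,
 * i.e. order 4. The constant name is assumed for this sketch.
 */
#define EXAMPLE_CONTPTE_ORDER	4	/* assumed: 16 contiguous PTEs */

static inline int arch_wants_pte_order(struct vm_area_struct *vma)
{
	return EXAMPLE_CONTPTE_ORDER;
}

An architecture with no PTE-coalescing feature could simply return 0, which
makes alloc_anon_folio() take the existing vma_alloc_zeroed_movable_folio()
path and leaves behaviour unchanged.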