Message ID | 20230810142942.3169679-4-ryan.roberts@arm.com |
---|---|
State | New |
Headers |
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Yin Fengwei <fengwei.yin@intel.com>, David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Anshuman Khandual <anshuman.khandual@arm.com>, Yang Shi <shy828301@gmail.com>, "Huang, Ying" <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>, Luis Chamberlain <mcgrof@kernel.org>, Itaru Kitayama <itaru.kitayama@gmail.com>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org
Subject: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance
Date: Thu, 10 Aug 2023 15:29:40 +0100
Message-Id: <20230810142942.3169679-4-ryan.roberts@arm.com>
In-Reply-To: <20230810142942.3169679-1-ryan.roberts@arm.com>
References: <20230810142942.3169679-1-ryan.roberts@arm.com>
List-ID: <linux-kernel.vger.kernel.org> |
Series |
variable-order, large folios for anonymous memory
|
|
Commit Message
Ryan Roberts
Aug. 10, 2023, 2:29 p.m. UTC
Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
allocated in large folios of a determined order. All pages of the large
folio are pte-mapped during the same page fault, significantly reducing
the number of page faults. The number of per-page operations (e.g. ref
counting, rmap management, lru list management) is also significantly
reduced since those ops now become per-folio.
The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
which defaults to disabled for now; the long term aim is for this to
default to enabled, but there are some risks around internal
fragmentation that need to be better understood first.
Large anonymous folio (LAF) allocation is integrated with the existing
(PMD-order) THP and single (S) page allocation according to this policy,
where fallback (>) is performed for various reasons, such as the
proposed folio order not fitting within the bounds of the VMA, etc:
                | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
                | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
----------------|-----------|-------------|---------------|-------------
no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
MADV_NOHUGEPAGE | S         | S           | S             | S
This approach ensures that we don't violate existing hints to only
allocate single pages - this is required for QEMU's VM live migration
implementation to work correctly - while allowing us to use LAF
independently of THP (when sysfs=never). This makes wide-scale
performance characterization simpler, while avoiding exposing any new
ABI to user space.
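The policy table above can be read as a small decision function. The following is a minimal user-space sketch, not kernel code: the enum names and anon_alloc_policy() are hypothetical and exist only to restate the table.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in the kernel. */
enum thp_sysfs { SYSFS_NEVER, SYSFS_MADVISE, SYSFS_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };
enum alloc_chain {
	ALLOC_S,		/* single pages only */
	ALLOC_LAF_S,		/* LAF, falling back to single pages */
	ALLOC_THP_LAF_S		/* THP, then LAF, then single pages */
};

/* Restates the table: columns are prctl/sysfs state, rows are the vma hint. */
static enum alloc_chain anon_alloc_policy(bool prctl_enabled,
					  enum thp_sysfs sysfs,
					  enum vma_hint hint)
{
	/* prctl=dis column and MADV_NOHUGEPAGE row: always single pages. */
	if (!prctl_enabled || hint == HINT_NOHUGEPAGE)
		return ALLOC_S;

	/* THP is tried first only where THP itself would be eligible. */
	if (sysfs == SYSFS_ALWAYS ||
	    (sysfs == SYSFS_MADVISE && hint == HINT_HUGEPAGE))
		return ALLOC_THP_LAF_S;

	/* Everything else gets LAF with single-page fallback. */
	return ALLOC_LAF_S;
}
```

For example, anon_alloc_policy(true, SYSFS_NEVER, HINT_HUGEPAGE) yields ALLOC_LAF_S, matching the "LAF>S" cell in the sysfs=never column.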
When using LAF for allocation, the folio order is determined as follows:
The return value of arch_wants_pte_order() is used. For vmas that have
not explicitly opted-in to use transparent hugepages (e.g. where
sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
is bigger). This allows for a performance boost without requiring any
explicit opt-in from the workload while limiting internal
fragmentation.
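To make the arithmetic concrete, here is a user-space sketch of the order selection, modelled on the anon_folio_order() logic in the patch. It is an approximation under stated assumptions: ilog2_ul() stands in for the kernel's ilog2(), the thp_eligible flag stands in for the hugepage_vma_check() result, and page_shift is passed as a parameter so several page sizes can be tried.

```c
#include <assert.h>

#define SZ_64K			(64UL * 1024)
#define PAGE_ALLOC_COSTLY_ORDER	3	/* same value as in the kernel */

/* floor(log2(v)) for v > 0; a stand-in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long v)
{
	int r = -1;

	assert(v > 0);
	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Order of the 64K cap (or of one page, if pages are bigger than 64K). */
static int max_order_unhinted(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long cap = page_size > SZ_64K ? page_size : SZ_64K;

	return ilog2_ul(cap) - page_shift;
}

/*
 * arch_order < 0 means the HW has no preference. thp_eligible is an
 * assumed stand-in for the vma having opted in to THP.
 */
static int anon_folio_order_sketch(int arch_order, int thp_eligible,
				   int page_shift)
{
	int order = arch_order > PAGE_ALLOC_COSTLY_ORDER ?
			arch_order : PAGE_ALLOC_COSTLY_ORDER;

	if (!thp_eligible) {
		int cap = max_order_unhinted(page_shift);

		if (order > cap)
			order = cap;
	}
	return order;
}
```

With 4K pages (page_shift 12) the unhinted cap is order-4; with 16K pages it is order-2; with 64K pages it is order-0. A hypothetical arch preference of order-7 (2MB with 16K pages) is therefore honoured only for THP-eligible vmas. Note also that an arch preference of order-0 is raised to PAGE_ALLOC_COSTLY_ORDER by the max() step.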
If the preferred order can't be used (e.g. because the folio would
breach the bounds of the vma, or because ptes in the region are already
mapped) then we fall back to a suitable lower order; first
PAGE_ALLOC_COSTLY_ORDER, then order-0.
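The fallback sequence can be sketched in user space as follows. This only models the vma-bounds check; the check for already-mapped ptes is omitted. The helper names, the fixed PAGE_SHIFT of 12, and the assumption that the folio is naturally aligned around the faulting address are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT		12	/* assumed 4K pages for this sketch */
#define PAGE_ALLOC_COSTLY_ORDER	3

/* Would a naturally aligned folio of this order stay inside the vma? */
static bool folio_fits_vma(unsigned long addr, int order,
			   unsigned long vm_start, unsigned long vm_end)
{
	unsigned long size = 1UL << (order + PAGE_SHIFT);
	unsigned long start = addr & ~(size - 1);

	return start >= vm_start && start + size <= vm_end;
}

/* Fallback step: preferred order, then PAGE_ALLOC_COSTLY_ORDER, then 0. */
static int next_lower_order(int order)
{
	assert(order >= 0);
	return order > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0;
}

/* Walk down the fallback chain until the folio fits; order-0 always fits. */
static int pick_order(unsigned long addr, int preferred,
		      unsigned long vm_start, unsigned long vm_end)
{
	int order = preferred;

	while (order > 0 && !folio_fits_vma(addr, order, vm_start, vm_end))
		order = next_lower_order(order);

	return order;
}
```

For a vma covering [0x14000, 0x30000), a fault at 0x19000 with a preferred order of 4 falls back to order-3, because the aligned 64K block would start below vm_start while the aligned 32K block at 0x18000 fits. A fault at 0x15000 in the same vma falls all the way back to order-0.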
arch_wants_pte_order() can be overridden by the architecture if desired.
Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
set of ptes map physically contiguous, naturally aligned memory, so this
mechanism allows the architecture to optimize as required.
Here we add the default implementation of arch_wants_pte_order(), used
when the architecture does not define it, which returns -1, implying
that the HW has no preference. In this case, mm will choose its own
default order.
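The override mechanism relies on the usual kernel pattern of guarding the generic default with #ifndef. A minimal sketch of how that pattern behaves is shown here; DEMO_ARCH_HAS_PREFERENCE, the order-4 return value and hw_has_preference() are hypothetical and serve only to illustrate the idiom.

```c
/*
 * Hypothetical arch header: an architecture that has a preference would
 * define the function and a macro of the same name, so that the generic
 * fallback below is compiled out. The value 4 is purely illustrative.
 */
#ifdef DEMO_ARCH_HAS_PREFERENCE
static inline int arch_wants_pte_order(void)
{
	return 4;
}
#define arch_wants_pte_order arch_wants_pte_order
#endif

/* Generic default, as added by the patch: negative means no preference. */
#ifndef arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
	return -1;
}
#endif

/* Hypothetical helper: did the architecture express a preference? */
static inline int hw_has_preference(void)
{
	return arch_wants_pte_order() >= 0;
}
```

Built without DEMO_ARCH_HAS_PREFERENCE, arch_wants_pte_order() returns -1 and hw_has_preference() returns 0, which is the "mm chooses its own default order" case described above.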
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/pgtable.h | 13 ++++
mm/Kconfig | 10 +++
mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++---
3 files changed, 158 insertions(+), 9 deletions(-)
Comments
On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> [...]
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 222a33b9600d..4b488cc66ddc 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
> + * to be at least order-2. Negative value implies that the HW has no preference
> + * and mm will choose it's own default order.
> + */
> +static inline int arch_wants_pte_order(void)
> +{
> +	return -1;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				       unsigned long address,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 721dc88423c7..a1e28b8ddc24 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>
>  source "mm/damon/Kconfig"
>
> +config LARGE_ANON_FOLIO
> +	bool "Allocate large folios for anonymous memory"
> +	depends on TRANSPARENT_HUGEPAGE
> +	default n
> +	help
> +	  Use large (bigger than order-0) folios to back anonymous memory where
> +	  possible, even for pte-mapped memory. This reduces the number of page
> +	  faults, as well as other per-page overheads to improve performance for
> +	  many workloads.
> +
>  endmenu
> diff --git a/mm/memory.c b/mm/memory.c
> index d003076b218d..bbc7d4ce84f7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	return ret;
>  }
>
> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
> +{
> +	int i;
> +
> +	if (nr_pages == 1)
> +		return vmf_pte_changed(vmf);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +#ifdef CONFIG_LARGE_ANON_FOLIO
> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> +	(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> +
> +static int anon_folio_order(struct vm_area_struct *vma)
> +{
> +	int order;
> +
> +	/*
> +	 * If the vma is eligible for thp, allocate a large folio of the size
> +	 * preferred by the arch. Or if the arch requested a very small size or
> +	 * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
> +	 * meets the arch's requirements but means we still take advantage of SW
> +	 * optimizations (e.g. fewer page faults).
> +	 *
> +	 * If the vma isn't eligible for thp, take the arch-preferred size and
> +	 * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
> +	 * that have not explicitly opted-in take benefit while capping the
> +	 * potential for internal fragmentation.
> +	 */
> +
> +	order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> +
> +	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +		order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> +
> +	return order;
> +}

I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
1. It's not used, since no archs at the moment implement
arch_wants_pte_order() that returns >64KB.
2. As far as I know, there is no plan for any arch to do so.
3. Again, it seems to me the rationale behind
ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.

Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?

Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
  Thanks: -1 actually is better than 0 (what I suggested) for the
  obvious reason.

I thought we were on the same page, i.e., the "obvious reason" is that
h/w might prefer 0. But here you are not respecting 0. But then why
-1?

[1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
On 10/08/2023 18:01, Yu Zhao wrote:
> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> [...]
>
> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
> 1. It's not used, since no archs at the moment implement
> arch_wants_pte_order() that returns >64KB.
> 2. As far as I know, there is no plan for any arch to do so.

My rationale is that arm64 is planning to use this for contpte mapping 2MB
blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
blocks without the proper THP hinting is a bad plan.

As I see it, arches could add their own arch_wants_pte_order() at any time, and
just because the HW has a preference, doesn't mean the SW shouldn't get a say.
It's a negotiation between HW and SW for the LAF order, embodied in this policy.

> 3. Again, it seems to me the rationale behind
> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>
> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>
> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>   Thanks: -1 actually is better than 0 (what I suggested) for the
>   obvious reason.
>
> I thought we were on the same page, i.e., the "obvious reason" is that
> h/w might prefer 0. But here you are not respecting 0. But then why
> -1?

I agree that the "obvious reason" is that HW might prefer order-0. But the
performance wins don't come solely from the HW. Batching up page faults is a big
win for SW even if the HW doesn't benefit. So I think it is important that a HW
preference of order-0 is possible to express through this API. But that doesn't
mean that we don't listen to SW's preferences either.

I would really rather leave it in; as I've mentioned in the past, we have a
partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
this is the mechanism that means we don't dole out those 2MB blocks unless
explicitly opted-in.

I'm going to be out on holiday for a couple of weeks, so we might have to wait
until I'm back to conclude on this, if you still take issue with the justification.

Thanks,
Ryan

>
> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
On 10 Aug 2023, at 15:12, Ryan Roberts wrote: > On 10/08/2023 18:01, Yu Zhao wrote: >> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> >>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be >>> allocated in large folios of a determined order. All pages of the large >>> folio are pte-mapped during the same page fault, significantly reducing >>> the number of page faults. The number of per-page operations (e.g. ref >>> counting, rmap management lru list management) are also significantly >>> reduced since those ops now become per-folio. >>> >>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig, >>> which defaults to disabled for now; The long term aim is for this to >>> defaut to enabled, but there are some risks around internal >>> fragmentation that need to be better understood first. >>> >>> Large anonymous folio (LAF) allocation is integrated with the existing >>> (PMD-order) THP and single (S) page allocation according to this policy, >>> where fallback (>) is performed for various reasons, such as the >>> proposed folio order not fitting within the bounds of the VMA, etc: >>> >>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>> ----------------|-----------|-------------|---------------|------------- >>> no hint | S | LAF>S | LAF>S | THP>LAF>S >>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S >>> MADV_NOHUGEPAGE | S | S | S | S >>> >>> This approach ensures that we don't violate existing hints to only >>> allocate single pages - this is required for QEMU's VM live migration >>> implementation to work correctly - while allowing us to use LAF >>> independently of THP (when sysfs=never). This makes wide scale >>> performance characterization simpler, while avoiding exposing any new >>> ABI to user space. 
>>> >>> When using LAF for allocation, the folio order is determined as follows: >>> The return value of arch_wants_pte_order() is used. For vmas that have >>> not explicitly opted-in to use transparent hugepages (e.g. where >>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never), >>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever >>> is bigger). This allows for a performance boost without requiring any >>> explicit opt-in from the workload while limitting internal >>> fragmentation. >>> >>> If the preferred order can't be used (e.g. because the folio would >>> breach the bounds of the vma, or because ptes in the region are already >>> mapped) then we fall back to a suitable lower order; first >>> PAGE_ALLOC_COSTLY_ORDER, then order-0. >>> >>> arch_wants_pte_order() can be overridden by the architecture if desired. >>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous >>> set of ptes map physically contigious, naturally aligned memory, so this >>> mechanism allows the architecture to optimize as required. >>> >>> Here we add the default implementation of arch_wants_pte_order(), used >>> when the architecture does not define it, which returns -1, implying >>> that the HW has no preference. In this case, mm will choose it's own >>> default order. >>> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>> --- >>> include/linux/pgtable.h | 13 ++++ >>> mm/Kconfig | 10 +++ >>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++--- >>> 3 files changed, 158 insertions(+), 9 deletions(-) >>> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>> index 222a33b9600d..4b488cc66ddc 100644 >>> --- a/include/linux/pgtable.h >>> +++ b/include/linux/pgtable.h >>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void) >>> } >>> #endif >>> >>> +#ifndef arch_wants_pte_order >>> +/* >>> + * Returns preferred folio order for pte-mapped memory. 
Must be in range [0, >>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>> + * to be at least order-2. Negative value implies that the HW has no preference >>> + * and mm will choose it's own default order. >>> + */ >>> +static inline int arch_wants_pte_order(void) >>> +{ >>> + return -1; >>> +} >>> +#endif >>> + >>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR >>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >>> unsigned long address, >>> diff --git a/mm/Kconfig b/mm/Kconfig >>> index 721dc88423c7..a1e28b8ddc24 100644 >>> --- a/mm/Kconfig >>> +++ b/mm/Kconfig >>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA >>> >>> source "mm/damon/Kconfig" >>> >>> +config LARGE_ANON_FOLIO >>> + bool "Allocate large folios for anonymous memory" >>> + depends on TRANSPARENT_HUGEPAGE >>> + default n >>> + help >>> + Use large (bigger than order-0) folios to back anonymous memory where >>> + possible, even for pte-mapped memory. This reduces the number of page >>> + faults, as well as other per-page overheads to improve performance for >>> + many workloads. 
>>> + >>> endmenu >>> diff --git a/mm/memory.c b/mm/memory.c >>> index d003076b218d..bbc7d4ce84f7 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>> return ret; >>> } >>> >>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) >>> +{ >>> + int i; >>> + >>> + if (nr_pages == 1) >>> + return vmf_pte_changed(vmf); >>> + >>> + for (i = 0; i < nr_pages; i++) { >>> + if (!pte_none(ptep_get_lockless(vmf->pte + i))) >>> + return true; >>> + } >>> + >>> + return false; >>> +} >>> + >>> +#ifdef CONFIG_LARGE_ANON_FOLIO >>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \ >>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT) >>> + >>> +static int anon_folio_order(struct vm_area_struct *vma) >>> +{ >>> + int order; >>> + >>> + /* >>> + * If the vma is eligible for thp, allocate a large folio of the size >>> + * preferred by the arch. Or if the arch requested a very small size or >>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still >>> + * meets the arch's requirements but means we still take advantage of SW >>> + * optimizations (e.g. fewer page faults). >>> + * >>> + * If the vma isn't eligible for thp, take the arch-preferred size and >>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads >>> + * that have not explicitly opted-in take benefit while capping the >>> + * potential for internal fragmentation. >>> + */ >>> + >>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); >>> + >>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED); >>> + >>> + return order; >>> +} >> >> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED. >> 1. It's not used, since no archs at the moment implement >> arch_wants_pte_order() that returns >64KB. >> 2. As far as I know, there is no plan for any arch to do so. 
> > My rationale is that arm64 is planning to use this for contpte mapping 2MB > blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB > blocks without the proper THP hinting is a bad plan. > > As I see it, arches could add their own arch_wants_pte_order() at any time, and > just because the HW has a preference, doesn't mean the SW shouldn't get a say. > Its a negotiation between HW and SW for the LAF order, embodied in this policy. > >> 3. Again, it seems to me the rationale behind >> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all. >> >> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please? >> >> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]: >> Thanks: -1 actually is better than 0 (what I suggested) for the >> obvious reason. >> >> I thought we were on the same page, i.e., the "obvious reason" is that >> h/w might prefer 0. But here you are not respecting 0. But then why >> -1? > > I agree that the "obvious reason" is that HW might prefer order-0. But the > performance wins don't come solely from the HW. Batching up page faults is a big > win for SW even if the HW doesn't benefit. So I think it is important that a HW > preference of order-0 is possible to express through this API. But that doesn't > mean that we don't listen to SW's preferences either. > > I would really rather leave it in; As I've mentioned in the past, we have a > partner who is actively keen to take advantage of 2MB blocks with 64K kernel and > this is the mechanism that means we don't dole out those 2MB blocks unless > explicitly opted-in. > > I'm going to be out on holiday for a couple of weeks, so we might have to wait > until I'm back to conclude on this, if you still take issue with the justification. From my understanding (correct me if I am wrong), Yu seems to want order-0 to be the default order even if LAF is enabled. 
But that does not make sense to me: if LAF is configured to be enabled (it is disabled by default now), users (and distros) must think LAF gives a benefit. Otherwise, they would just disable LAF at compile time or via prctl. Enabling LAF while using order-0 as the default order leaves most of the LAF code unused.

Also, arch_wants_pte_order() might need a better name, like arch_wants_large_folio_order(). The current name sounds like the specified order is wanted by HW in a general setting, but it is not: it is the order HW wants when LAF is enabled. That might cause some confusion.

>>
>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/

--
Best Regards,
Yan, Zi
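To make the disagreement concrete, the order selection in the quoted anon_folio_order() can be modelled in plain userspace C. This is only a sketch under stated assumptions: model_anon_folio_order() and its parameters are hypothetical stand-ins (arch_order for arch_wants_pte_order(), thp_eligible for the hugepage_vma_check() result, unhinted_max for ANON_FOLIO_MAX_ORDER_UNHINTED), and PAGE_ALLOC_COSTLY_ORDER is taken to be 3 as in mainline.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: PAGE_ALLOC_COSTLY_ORDER is 3, as in mainline kernels. */
#define MODEL_COSTLY_ORDER 3

/*
 * Userspace model of the quoted anon_folio_order(); not kernel code.
 *
 * arch_order:   stand-in for arch_wants_pte_order(), -1 means "no preference"
 * thp_eligible: stand-in for the hugepage_vma_check() result
 * unhinted_max: stand-in for ANON_FOLIO_MAX_ORDER_UNHINTED
 */
int model_anon_folio_order(int arch_order, bool thp_eligible, int unhinted_max)
{
	int order;

	assert(unhinted_max >= 0);

	/* order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER) */
	order = arch_order > MODEL_COSTLY_ORDER ? arch_order : MODEL_COSTLY_ORDER;

	/* VMAs without a THP hint are capped: order = min(order, cap) */
	if (!thp_eligible && order > unhinted_max)
		order = unhinted_max;

	return order;
}
```

In this model an arch preference of 0 and "no preference" (-1) both resolve to order-3 on a 4K kernel, which is the behaviour Yu objects to. A 2MB contpte preference on a 64K kernel (order-5) is only honoured when the VMA is THP-eligible, which is the behaviour Ryan wants to keep.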
On 8/11/2023 3:12 AM, Ryan Roberts wrote: > On 10/08/2023 18:01, Yu Zhao wrote: >> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> >>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be >>> allocated in large folios of a determined order. All pages of the large >>> folio are pte-mapped during the same page fault, significantly reducing >>> the number of page faults. The number of per-page operations (e.g. ref >>> counting, rmap management lru list management) are also significantly >>> reduced since those ops now become per-folio. >>> >>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig, >>> which defaults to disabled for now; The long term aim is for this to >>> defaut to enabled, but there are some risks around internal >>> fragmentation that need to be better understood first. >>> >>> Large anonymous folio (LAF) allocation is integrated with the existing >>> (PMD-order) THP and single (S) page allocation according to this policy, >>> where fallback (>) is performed for various reasons, such as the >>> proposed folio order not fitting within the bounds of the VMA, etc: >>> >>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>> ----------------|-----------|-------------|---------------|------------- >>> no hint | S | LAF>S | LAF>S | THP>LAF>S >>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S >>> MADV_NOHUGEPAGE | S | S | S | S >>> >>> This approach ensures that we don't violate existing hints to only >>> allocate single pages - this is required for QEMU's VM live migration >>> implementation to work correctly - while allowing us to use LAF >>> independently of THP (when sysfs=never). This makes wide scale >>> performance characterization simpler, while avoiding exposing any new >>> ABI to user space. 
>>> >>> When using LAF for allocation, the folio order is determined as follows: >>> The return value of arch_wants_pte_order() is used. For vmas that have >>> not explicitly opted-in to use transparent hugepages (e.g. where >>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never), >>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever >>> is bigger). This allows for a performance boost without requiring any >>> explicit opt-in from the workload while limitting internal >>> fragmentation. >>> >>> If the preferred order can't be used (e.g. because the folio would >>> breach the bounds of the vma, or because ptes in the region are already >>> mapped) then we fall back to a suitable lower order; first >>> PAGE_ALLOC_COSTLY_ORDER, then order-0. >>> >>> arch_wants_pte_order() can be overridden by the architecture if desired. >>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous >>> set of ptes map physically contigious, naturally aligned memory, so this >>> mechanism allows the architecture to optimize as required. >>> >>> Here we add the default implementation of arch_wants_pte_order(), used >>> when the architecture does not define it, which returns -1, implying >>> that the HW has no preference. In this case, mm will choose it's own >>> default order. >>> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>> --- >>> include/linux/pgtable.h | 13 ++++ >>> mm/Kconfig | 10 +++ >>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++--- >>> 3 files changed, 158 insertions(+), 9 deletions(-) >>> >>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>> index 222a33b9600d..4b488cc66ddc 100644 >>> --- a/include/linux/pgtable.h >>> +++ b/include/linux/pgtable.h >>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void) >>> } >>> #endif >>> >>> +#ifndef arch_wants_pte_order >>> +/* >>> + * Returns preferred folio order for pte-mapped memory. 
Must be in range [0, >>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>> + * to be at least order-2. Negative value implies that the HW has no preference >>> + * and mm will choose it's own default order. >>> + */ >>> +static inline int arch_wants_pte_order(void) >>> +{ >>> + return -1; >>> +} >>> +#endif >>> + >>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR >>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >>> unsigned long address, >>> diff --git a/mm/Kconfig b/mm/Kconfig >>> index 721dc88423c7..a1e28b8ddc24 100644 >>> --- a/mm/Kconfig >>> +++ b/mm/Kconfig >>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA >>> >>> source "mm/damon/Kconfig" >>> >>> +config LARGE_ANON_FOLIO >>> + bool "Allocate large folios for anonymous memory" >>> + depends on TRANSPARENT_HUGEPAGE >>> + default n >>> + help >>> + Use large (bigger than order-0) folios to back anonymous memory where >>> + possible, even for pte-mapped memory. This reduces the number of page >>> + faults, as well as other per-page overheads to improve performance for >>> + many workloads. 
>>> + >>> endmenu >>> diff --git a/mm/memory.c b/mm/memory.c >>> index d003076b218d..bbc7d4ce84f7 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>> return ret; >>> } >>> >>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) >>> +{ >>> + int i; >>> + >>> + if (nr_pages == 1) >>> + return vmf_pte_changed(vmf); >>> + >>> + for (i = 0; i < nr_pages; i++) { >>> + if (!pte_none(ptep_get_lockless(vmf->pte + i))) >>> + return true; >>> + } >>> + >>> + return false; >>> +} >>> + >>> +#ifdef CONFIG_LARGE_ANON_FOLIO >>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \ >>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT) >>> + >>> +static int anon_folio_order(struct vm_area_struct *vma) >>> +{ >>> + int order; >>> + >>> + /* >>> + * If the vma is eligible for thp, allocate a large folio of the size >>> + * preferred by the arch. Or if the arch requested a very small size or >>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still >>> + * meets the arch's requirements but means we still take advantage of SW >>> + * optimizations (e.g. fewer page faults). >>> + * >>> + * If the vma isn't eligible for thp, take the arch-preferred size and >>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads >>> + * that have not explicitly opted-in take benefit while capping the >>> + * potential for internal fragmentation. >>> + */ >>> + >>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); >>> + >>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED); >>> + >>> + return order; >>> +} >> >> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED. >> 1. It's not used, since no archs at the moment implement >> arch_wants_pte_order() that returns >64KB. >> 2. As far as I know, there is no plan for any arch to do so. 
> > My rationale is that arm64 is planning to use this for contpte mapping 2MB > blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB > blocks without the proper THP hinting is a bad plan. > > As I see it, arches could add their own arch_wants_pte_order() at any time, and > just because the HW has a preference, doesn't mean the SW shouldn't get a say. > Its a negotiation between HW and SW for the LAF order, embodied in this policy. > >> 3. Again, it seems to me the rationale behind >> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all. >> >> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please? >> >> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]: >> Thanks: -1 actually is better than 0 (what I suggested) for the >> obvious reason. >> >> I thought we were on the same page, i.e., the "obvious reason" is that >> h/w might prefer 0. But here you are not respecting 0. But then why >> -1? > > I agree that the "obvious reason" is that HW might prefer order-0. But the > performance wins don't come solely from the HW. Batching up page faults is a big > win for SW even if the HW doesn't benefit. So I think it is important that a HW > preference of order-0 is possible to express through this API. But that doesn't > mean that we don't listen to SW's preferences either. > > I would really rather leave it in; As I've mentioned in the past, we have a > partner who is actively keen to take advantage of 2MB blocks with 64K kernel and > this is the mechanism that means we don't dole out those 2MB blocks unless > explicitly opted-in. Even so, I don't think we want to put the ANON_FOLIO_MAX_ORDER_UNHINTED hardcoded in common mm code as it's useless to other ARCHs. Another drawback is it brings trouble to do performance testing. People needs either change code and recompile the kernel or add another knob to configure it. 
Considering we are still in the phase of doing more testing to understand the impact of LAF, I agree with Yu on this. Thanks.

Regards
Yin, Fengwei

>
> I'm going to be out on holiday for a couple of weeks, so we might have to wait
> until I'm back to conclude on this, if you still take issue with the justification.
>
> Thanks,
> Ryan
>
>
>>
>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>
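For reference, the hard-coded cap under discussion, ilog2(max(SZ_64K, PAGE_SIZE)) - PAGE_SHIFT, can be evaluated per base page size with a small userspace sketch. The helper names model_ilog2() and model_unhinted_max_order() are hypothetical and exist only for this illustration; they are not part of the posted patch.

```c
#include <assert.h>

/* floor(log2(v)) for v != 0; a userspace stand-in for the kernel's ilog2(). */
int model_ilog2(unsigned long v)
{
	int n = 0;

	assert(v != 0);
	while (v >>= 1)
		n++;
	return n;
}

/*
 * Models ANON_FOLIO_MAX_ORDER_UNHINTED from the quoted patch:
 *   ilog2(max(SZ_64K, PAGE_SIZE)) - PAGE_SHIFT
 */
int model_unhinted_max_order(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long sz_64k = 64UL * 1024;
	unsigned long cap = page_size > sz_64k ? page_size : sz_64k;

	return model_ilog2(cap) - page_shift;
}
```

With page shifts of 12, 14 and 16 (4K, 16K and 64K base pages) this gives orders 4, 2 and 0 respectively, so the cap is always 64K worth of memory until the base page itself reaches 64K. Trying a different cap means editing the macro and rebuilding, which is the testing inconvenience raised above.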
On 8/11/2023 3:46 AM, Zi Yan wrote: > On 10 Aug 2023, at 15:12, Ryan Roberts wrote: > >> On 10/08/2023 18:01, Yu Zhao wrote: >>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be >>>> allocated in large folios of a determined order. All pages of the large >>>> folio are pte-mapped during the same page fault, significantly reducing >>>> the number of page faults. The number of per-page operations (e.g. ref >>>> counting, rmap management lru list management) are also significantly >>>> reduced since those ops now become per-folio. >>>> >>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig, >>>> which defaults to disabled for now; The long term aim is for this to >>>> defaut to enabled, but there are some risks around internal >>>> fragmentation that need to be better understood first. >>>> >>>> Large anonymous folio (LAF) allocation is integrated with the existing >>>> (PMD-order) THP and single (S) page allocation according to this policy, >>>> where fallback (>) is performed for various reasons, such as the >>>> proposed folio order not fitting within the bounds of the VMA, etc: >>>> >>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>>> ----------------|-----------|-------------|---------------|------------- >>>> no hint | S | LAF>S | LAF>S | THP>LAF>S >>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S >>>> MADV_NOHUGEPAGE | S | S | S | S >>>> >>>> This approach ensures that we don't violate existing hints to only >>>> allocate single pages - this is required for QEMU's VM live migration >>>> implementation to work correctly - while allowing us to use LAF >>>> independently of THP (when sysfs=never). This makes wide scale >>>> performance characterization simpler, while avoiding exposing any new >>>> ABI to user space. 
>>>> >>>> When using LAF for allocation, the folio order is determined as follows: >>>> The return value of arch_wants_pte_order() is used. For vmas that have >>>> not explicitly opted-in to use transparent hugepages (e.g. where >>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never), >>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever >>>> is bigger). This allows for a performance boost without requiring any >>>> explicit opt-in from the workload while limitting internal >>>> fragmentation. >>>> >>>> If the preferred order can't be used (e.g. because the folio would >>>> breach the bounds of the vma, or because ptes in the region are already >>>> mapped) then we fall back to a suitable lower order; first >>>> PAGE_ALLOC_COSTLY_ORDER, then order-0. >>>> >>>> arch_wants_pte_order() can be overridden by the architecture if desired. >>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous >>>> set of ptes map physically contigious, naturally aligned memory, so this >>>> mechanism allows the architecture to optimize as required. >>>> >>>> Here we add the default implementation of arch_wants_pte_order(), used >>>> when the architecture does not define it, which returns -1, implying >>>> that the HW has no preference. In this case, mm will choose it's own >>>> default order. >>>> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>> --- >>>> include/linux/pgtable.h | 13 ++++ >>>> mm/Kconfig | 10 +++ >>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++--- >>>> 3 files changed, 158 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>> index 222a33b9600d..4b488cc66ddc 100644 >>>> --- a/include/linux/pgtable.h >>>> +++ b/include/linux/pgtable.h >>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void) >>>> } >>>> #endif >>>> >>>> +#ifndef arch_wants_pte_order >>>> +/* >>>> + * Returns preferred folio order for pte-mapped memory. 
Must be in range [0, >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>> + * to be at least order-2. Negative value implies that the HW has no preference >>>> + * and mm will choose it's own default order. >>>> + */ >>>> +static inline int arch_wants_pte_order(void) >>>> +{ >>>> + return -1; >>>> +} >>>> +#endif >>>> + >>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR >>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >>>> unsigned long address, >>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>> index 721dc88423c7..a1e28b8ddc24 100644 >>>> --- a/mm/Kconfig >>>> +++ b/mm/Kconfig >>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA >>>> >>>> source "mm/damon/Kconfig" >>>> >>>> +config LARGE_ANON_FOLIO >>>> + bool "Allocate large folios for anonymous memory" >>>> + depends on TRANSPARENT_HUGEPAGE >>>> + default n >>>> + help >>>> + Use large (bigger than order-0) folios to back anonymous memory where >>>> + possible, even for pte-mapped memory. This reduces the number of page >>>> + faults, as well as other per-page overheads to improve performance for >>>> + many workloads. 
>>>> + >>>> endmenu >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index d003076b218d..bbc7d4ce84f7 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> return ret; >>>> } >>>> >>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) >>>> +{ >>>> + int i; >>>> + >>>> + if (nr_pages == 1) >>>> + return vmf_pte_changed(vmf); >>>> + >>>> + for (i = 0; i < nr_pages; i++) { >>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i))) >>>> + return true; >>>> + } >>>> + >>>> + return false; >>>> +} >>>> + >>>> +#ifdef CONFIG_LARGE_ANON_FOLIO >>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \ >>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT) >>>> + >>>> +static int anon_folio_order(struct vm_area_struct *vma) >>>> +{ >>>> + int order; >>>> + >>>> + /* >>>> + * If the vma is eligible for thp, allocate a large folio of the size >>>> + * preferred by the arch. Or if the arch requested a very small size or >>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still >>>> + * meets the arch's requirements but means we still take advantage of SW >>>> + * optimizations (e.g. fewer page faults). >>>> + * >>>> + * If the vma isn't eligible for thp, take the arch-preferred size and >>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads >>>> + * that have not explicitly opted-in take benefit while capping the >>>> + * potential for internal fragmentation. >>>> + */ >>>> + >>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); >>>> + >>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED); >>>> + >>>> + return order; >>>> +} >>> >>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED. >>> 1. It's not used, since no archs at the moment implement >>> arch_wants_pte_order() that returns >64KB. >>> 2. 
As far as I know, there is no plan for any arch to do so.
>>
>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>> blocks without the proper THP hinting is a bad plan.
>>
>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>> Its a negotiation between HW and SW for the LAF order, embodied in this policy.
>>
>>> 3. Again, it seems to me the rationale behind
>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>
>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>
>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>> obvious reason.
>>>
>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>> -1?
>>
>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>> performance wins don't come solely from the HW. Batching up page faults is a big
>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>> preference of order-0 is possible to express through this API. But that doesn't
>> mean that we don't listen to SW's preferences either.
>>
>> I would really rather leave it in; As I've mentioned in the past, we have a
>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>> this is the mechanism that means we don't dole out those 2MB blocks unless
>> explicitly opted-in.
>>
>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>> until I'm back to conclude on this, if you still take issue with the justification.
>
> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
> the default order even if LAF is enabled. But that does not make sense to me, since
> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
> time or by using prctl. Enabling LAF and using order-0 as the default order makes
> most of LAF code not used.

A device with limited memory may still want LAF for some specific memory ranges. In that case it is possible to enable LAF, keep order-0 as the default order, and use madvise to enable LAF only for those ranges.

So my understanding is that this is a possible case. But it is a separate configuration matter and does not need to be finalized now.

Regards
Yin, Fengwei

>
> Also arch_wants_pte_order() might need a better name like
> arch_wants_large_folio_order(). Since current name sounds like the specified order
> is wanted by HW in a general setting, but it is not. It is an order HW wants
> when LAF is enabled. That might cause some confusion.
>
>>>
>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>
>
> --
> Best Regards,
> Yan, Zi
On 10 Aug 2023, at 20:36, Yin, Fengwei wrote: > On 8/11/2023 3:46 AM, Zi Yan wrote: >> On 10 Aug 2023, at 15:12, Ryan Roberts wrote: >> >>> On 10/08/2023 18:01, Yu Zhao wrote: >>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be >>>>> allocated in large folios of a determined order. All pages of the large >>>>> folio are pte-mapped during the same page fault, significantly reducing >>>>> the number of page faults. The number of per-page operations (e.g. ref >>>>> counting, rmap management lru list management) are also significantly >>>>> reduced since those ops now become per-folio. >>>>> >>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig, >>>>> which defaults to disabled for now; The long term aim is for this to >>>>> defaut to enabled, but there are some risks around internal >>>>> fragmentation that need to be better understood first. >>>>> >>>>> Large anonymous folio (LAF) allocation is integrated with the existing >>>>> (PMD-order) THP and single (S) page allocation according to this policy, >>>>> where fallback (>) is performed for various reasons, such as the >>>>> proposed folio order not fitting within the bounds of the VMA, etc: >>>>> >>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>>>> ----------------|-----------|-------------|---------------|------------- >>>>> no hint | S | LAF>S | LAF>S | THP>LAF>S >>>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S >>>>> MADV_NOHUGEPAGE | S | S | S | S >>>>> >>>>> This approach ensures that we don't violate existing hints to only >>>>> allocate single pages - this is required for QEMU's VM live migration >>>>> implementation to work correctly - while allowing us to use LAF >>>>> independently of THP (when sysfs=never). 
This makes wide scale >>>>> performance characterization simpler, while avoiding exposing any new >>>>> ABI to user space. >>>>> >>>>> When using LAF for allocation, the folio order is determined as follows: >>>>> The return value of arch_wants_pte_order() is used. For vmas that have >>>>> not explicitly opted-in to use transparent hugepages (e.g. where >>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never), >>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever >>>>> is bigger). This allows for a performance boost without requiring any >>>>> explicit opt-in from the workload while limitting internal >>>>> fragmentation. >>>>> >>>>> If the preferred order can't be used (e.g. because the folio would >>>>> breach the bounds of the vma, or because ptes in the region are already >>>>> mapped) then we fall back to a suitable lower order; first >>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0. >>>>> >>>>> arch_wants_pte_order() can be overridden by the architecture if desired. >>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous >>>>> set of ptes map physically contigious, naturally aligned memory, so this >>>>> mechanism allows the architecture to optimize as required. >>>>> >>>>> Here we add the default implementation of arch_wants_pte_order(), used >>>>> when the architecture does not define it, which returns -1, implying >>>>> that the HW has no preference. In this case, mm will choose it's own >>>>> default order. 
>>>>> >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>>> --- >>>>> include/linux/pgtable.h | 13 ++++ >>>>> mm/Kconfig | 10 +++ >>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++--- >>>>> 3 files changed, 158 insertions(+), 9 deletions(-) >>>>> >>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>>> index 222a33b9600d..4b488cc66ddc 100644 >>>>> --- a/include/linux/pgtable.h >>>>> +++ b/include/linux/pgtable.h >>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void) >>>>> } >>>>> #endif >>>>> >>>>> +#ifndef arch_wants_pte_order >>>>> +/* >>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, >>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>>> + * to be at least order-2. Negative value implies that the HW has no preference >>>>> + * and mm will choose it's own default order. >>>>> + */ >>>>> +static inline int arch_wants_pte_order(void) >>>>> +{ >>>>> + return -1; >>>>> +} >>>>> +#endif >>>>> + >>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR >>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >>>>> unsigned long address, >>>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>>> index 721dc88423c7..a1e28b8ddc24 100644 >>>>> --- a/mm/Kconfig >>>>> +++ b/mm/Kconfig >>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA >>>>> >>>>> source "mm/damon/Kconfig" >>>>> >>>>> +config LARGE_ANON_FOLIO >>>>> + bool "Allocate large folios for anonymous memory" >>>>> + depends on TRANSPARENT_HUGEPAGE >>>>> + default n >>>>> + help >>>>> + Use large (bigger than order-0) folios to back anonymous memory where >>>>> + possible, even for pte-mapped memory. This reduces the number of page >>>>> + faults, as well as other per-page overheads to improve performance for >>>>> + many workloads. 
>>>>> + >>>>> endmenu >>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>> index d003076b218d..bbc7d4ce84f7 100644 >>>>> --- a/mm/memory.c >>>>> +++ b/mm/memory.c >>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>> return ret; >>>>> } >>>>> >>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) >>>>> +{ >>>>> + int i; >>>>> + >>>>> + if (nr_pages == 1) >>>>> + return vmf_pte_changed(vmf); >>>>> + >>>>> + for (i = 0; i < nr_pages; i++) { >>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i))) >>>>> + return true; >>>>> + } >>>>> + >>>>> + return false; >>>>> +} >>>>> + >>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO >>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \ >>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT) >>>>> + >>>>> +static int anon_folio_order(struct vm_area_struct *vma) >>>>> +{ >>>>> + int order; >>>>> + >>>>> + /* >>>>> + * If the vma is eligible for thp, allocate a large folio of the size >>>>> + * preferred by the arch. Or if the arch requested a very small size or >>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still >>>>> + * meets the arch's requirements but means we still take advantage of SW >>>>> + * optimizations (e.g. fewer page faults). >>>>> + * >>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and >>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads >>>>> + * that have not explicitly opted-in take benefit while capping the >>>>> + * potential for internal fragmentation. >>>>> + */ >>>>> + >>>>> + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); >>>>> + >>>>> + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >>>>> + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED); >>>>> + >>>>> + return order; >>>>> +} >>>> >>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED. >>>> 1. 
It's not used, since no archs at the moment implement
>>>> arch_wants_pte_order() that returns >64KB.
>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>
>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>> blocks without the proper THP hinting is a bad plan.
>>>
>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>> Its a negotiation between HW and SW for the LAF order, embodied in this policy.
>>>
>>>> 3. Again, it seems to me the rationale behind
>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>
>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>
>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>> obvious reason.
>>>>
>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>> -1?
>>>
>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>> preference of order-0 is possible to express through this API. But that doesn't
>>> mean that we don't listen to SW's preferences either.
>>>
>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>> explicitly opted-in.
>>>
>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>> until I'm back to conclude on this, if you still take issue with the justification.
>>
>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>> the default order even if LAF is enabled. But that does not make sense to me, since
>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>> most of LAF code not used.
> For the device with limited memory size and it still wants LAF enabled for some specific
> memory ranges, it's possible the LAF is enabled, order-0 as default order and use madvise
> to enable LAF for specific memory ranges.

Do you have a use case? Or is it just a possible scenario? IIUC, Ryan has a concrete use case for his choice: for ARM64 with 16KB/64KB base pages, 2MB folios (LAF in this config) would be desirable, since THP is 32MB/512MB and much harder to get.

>
> So my understanding is it's possible case. But it's another configuration thing and not
> necessary to be finalized now.

Basically, we are deciding whether LAF should use order-0 by default once it is compiled into the kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED, your argument is that a code change is needed to test the impact of LAF with different orders. That seems to imply we actually need an extra knob (maybe a sysctl) to control the max LAF order. With that extra knob we can also solve this default-order problem, since we can set it to 0 for devices that want LAF to be opt-in, and to N (like 64KB) for devices that want LAF to be opt-out. So maybe we need the extra knob both for testing and for serving different device configurations.

>>
>> Also arch_wants_pte_order() might need a better name like
>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>> when LAF is enabled. That might cause some confusion.
>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>
>>
>> --
>> Best Regards,
>> Yan, Zi

--
Best Regards,
Yan, Zi
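The "extra knob" idea can be sketched as a variant of the order selection in which the cap for unhinted VMAs is a runtime value rather than a compile-time macro. Everything here is hypothetical: struct laf_policy, its max_order_unhinted field and model_order_with_knob() are invented for illustration, no such sysctl exists in the posted patch, and PAGE_ALLOC_COSTLY_ORDER is assumed to be 3 as in mainline.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: PAGE_ALLOC_COSTLY_ORDER is 3, as in mainline kernels. */
#define KNOB_COSTLY_ORDER 3

/* Hypothetical runtime tunable (e.g. backed by a sysctl); not in the patch. */
struct laf_policy {
	int max_order_unhinted;	/* 0 makes LAF effectively opt-in via madvise */
};

/*
 * Same shape as the quoted anon_folio_order(), but the cap applied to VMAs
 * without a THP hint comes from the policy instead of a hard-coded macro.
 */
int model_order_with_knob(const struct laf_policy *p, int arch_order,
			  bool thp_eligible)
{
	int order;

	assert(p->max_order_unhinted >= 0);

	order = arch_order > KNOB_COSTLY_ORDER ? arch_order : KNOB_COSTLY_ORDER;

	if (!thp_eligible && order > p->max_order_unhinted)
		order = p->max_order_unhinted;

	return order;
}
```

Setting the cap to 0 gives the memory-constrained configuration described above: unhinted VMAs get order-0 and only THP-eligible ranges get large folios. Setting it to 4 on a 4K kernel reproduces the 64K cap from the patch, and other values could be tried for performance testing without a rebuild.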
On 8/11/2023 9:04 AM, Zi Yan wrote:
> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote:
>
>> On 8/11/2023 3:46 AM, Zi Yan wrote:
>>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote:
>>>
>>>> On 10/08/2023 18:01, Yu Zhao wrote:
>>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management, lru list management) is also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; the long term aim is for this to
>>>>>> default to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>>
>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>>>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>>>>> where fallback (>) is performed for various reasons, such as the
>>>>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>>>>
>>>>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>>>>> ----------------|-----------|-------------|---------------|-------------
>>>>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>>>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>>>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>>>>>
>>>>>> This approach ensures that we don't violate existing hints to only
>>>>>> allocate single pages - this is required for QEMU's VM live migration
>>>>>> implementation to work correctly - while allowing us to use LAF
>>>>>> independently of THP (when sysfs=never). This makes wide scale
>>>>>> performance characterization simpler, while avoiding exposing any new
>>>>>> ABI to user space.
>>>>>>
>>>>>> When using LAF for allocation, the folio order is determined as follows:
>>>>>> The return value of arch_wants_pte_order() is used. For vmas that have
>>>>>> not explicitly opted-in to use transparent hugepages (e.g. where
>>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never),
>>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever
>>>>>> is bigger). This allows for a performance boost without requiring any
>>>>>> explicit opt-in from the workload while limiting internal
>>>>>> fragmentation.
>>>>>>
>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>>> mechanism allows the architecture to optimize as required.
>>>>>>
>>>>>> Here we add the default implementation of arch_wants_pte_order(), used
>>>>>> when the architecture does not define it, which returns -1, implying
>>>>>> that the HW has no preference. In this case, mm will choose its own
>>>>>> default order.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>  include/linux/pgtable.h |  13 ++++
>>>>>>  mm/Kconfig              |  10 +++
>>>>>>  mm/memory.c             | 144 +++++++++++++++++++++++++++++++++++++---
>>>>>>  3 files changed, 158 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index 222a33b9600d..4b488cc66ddc 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void)
>>>>>>  }
>>>>>>  #endif
>>>>>>
>>>>>> +#ifndef arch_wants_pte_order
>>>>>> +/*
>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>>>>>> + * to be at least order-2. Negative value implies that the HW has no preference
>>>>>> + * and mm will choose its own default order.
>>>>>> + */
>>>>>> +static inline int arch_wants_pte_order(void)
>>>>>> +{
>>>>>> +	return -1;
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>>>  				       unsigned long address,
>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>>> index 721dc88423c7..a1e28b8ddc24 100644
>>>>>> --- a/mm/Kconfig
>>>>>> +++ b/mm/Kconfig
>>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA
>>>>>>
>>>>>>  source "mm/damon/Kconfig"
>>>>>>
>>>>>> +config LARGE_ANON_FOLIO
>>>>>> +	bool "Allocate large folios for anonymous memory"
>>>>>> +	depends on TRANSPARENT_HUGEPAGE
>>>>>> +	default n
>>>>>> +	help
>>>>>> +	  Use large (bigger than order-0) folios to back anonymous memory where
>>>>>> +	  possible, even for pte-mapped memory. This reduces the number of page
>>>>>> +	  faults, as well as other per-page overheads to improve performance for
>>>>>> +	  many workloads.
>>>>>> +
>>>>>>  endmenu
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index d003076b218d..bbc7d4ce84f7 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>  	return ret;
>>>>>>  }
>>>>>>
>>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
>>>>>> +{
>>>>>> +	int i;
>>>>>> +
>>>>>> +	if (nr_pages == 1)
>>>>>> +		return vmf_pte_changed(vmf);
>>>>>> +
>>>>>> +	for (i = 0; i < nr_pages; i++) {
>>>>>> +		if (!pte_none(ptep_get_lockless(vmf->pte + i)))
>>>>>> +			return true;
>>>>>> +	}
>>>>>> +
>>>>>> +	return false;
>>>>>> +}
>>>>>> +
>>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO
>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>> +		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>> +
>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +	int order;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * If the vma is eligible for thp, allocate a large folio of the size
>>>>>> +	 * preferred by the arch. Or if the arch requested a very small size or
>>>>>> +	 * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still
>>>>>> +	 * meets the arch's requirements but means we still take advantage of SW
>>>>>> +	 * optimizations (e.g. fewer page faults).
>>>>>> +	 *
>>>>>> +	 * If the vma isn't eligible for thp, take the arch-preferred size and
>>>>>> +	 * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads
>>>>>> +	 * that have not explicitly opted-in take benefit while capping the
>>>>>> +	 * potential for internal fragmentation.
>>>>>> +	 */
>>>>>> +
>>>>>> +	order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>> +
>>>>>> +	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>> +		order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>> +
>>>>>> +	return order;
>>>>>> +}
>>>>>
>>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>> 1. It's not used, since no archs at the moment implement
>>>>> arch_wants_pte_order() that returns >64KB.
>>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>>
>>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>>> blocks without the proper THP hinting is a bad plan.
>>>>
>>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>>> It's a negotiation between HW and SW for the LAF order, embodied in this policy.
>>>>
>>>>> 3. Again, it seems to me the rationale behind
>>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>>
>>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>>
>>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>>> obvious reason.
>>>>>
>>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>>> -1?
>>>>
>>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>>> preference of order-0 is possible to express through this API. But that doesn't
>>>> mean that we don't listen to SW's preferences either.
>>>>
>>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>>> explicitly opted-in.
>>>>
>>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>>> until I'm back to conclude on this, if you still take issue with the justification.
>>>
>>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>>> the default order even if LAF is enabled. But that does not make sense to me, since
>>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>>> most of LAF code not used.
>> For the device with limited memory size and it still wants LAF enabled for some specific
>> memory ranges, it's possible the LAF is enabled, order-0 as default order and use madvise
>> to enable LAF for specific memory ranges.
>
> Do you have a use case? Or it is just a possible scenario?
It's a possible scenario. In my experience, it's a valid use case for embedded
systems or low-end Android phones.

>
> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
> base pages, 2MB folios (LAF in this config) would be desirable since THP is
> 32MB/512MB and much harder to get.
>
>>
>> So my understanding is it's possible case. But it's another configuration thing and not
>> necessary to be finalized now.
>
> Basically, we are deciding whether LAF should use order-0 by default once it is
> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
> your argument is that code change is needed to test the impact of LAF with
> different orders. That seems to imply we actually need an extra knob (maybe sysctl)
> to control the max LAF order. And with that extra knob, we can solve this default
> order problem, since we can set it to 0 for devices want to opt in LAF and set
> it N (like 64KB) for other devices want to opt out LAF.
From a performance tuning perspective, it's necessary to have knobs to configure and
check the attributes of LAF. But we must be careful about adding knobs, as they need
to be maintained forever.

Regards
Yin, Fengwei

>
> So maybe we need the extra knob for both testing purpose and serving different
> device configuration purpose.
>
>>>
>>> Also arch_wants_pte_order() might need a better name like
>>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>>> when LAF is enabled. That might cause some confusion.
>>>
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
>
> --
> Best Regards,
> Yan, Zi
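
For anyone following the order-selection argument above, the policy in the quoted anon_folio_order() can be tried out in user space. The following is only a sketch under stated assumptions, not kernel code: every sketch_* name is hypothetical, the base page shift is passed in as a parameter instead of coming from PAGE_SHIFT, and the boolean thp_eligible stands in for the result of hugepage_vma_check(). The constants mirror SZ_64K and PAGE_ALLOC_COSTLY_ORDER (3).

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel's SZ_64K and PAGE_ALLOC_COSTLY_ORDER. */
#define SKETCH_SZ_64K                  (64UL * 1024)
#define SKETCH_PAGE_ALLOC_COSTLY_ORDER 3

/* Integer log2 for a non-zero value (stand-in for the kernel's ilog2()). */
static int sketch_ilog2(unsigned long v)
{
	int n = 0;

	assert(v != 0);
	while (v > 1) {
		v >>= 1;
		n++;
	}
	return n;
}

/*
 * Mirrors ANON_FOLIO_MAX_ORDER_UNHINTED from the quoted patch:
 * ilog2(max(SZ_64K, PAGE_SIZE)) - PAGE_SHIFT, for a given page shift.
 */
static int sketch_max_order_unhinted(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long cap = page_size > SKETCH_SZ_64K ? page_size : SKETCH_SZ_64K;

	return sketch_ilog2(cap) - page_shift;
}

/*
 * Mirrors the quoted anon_folio_order(): take the larger of the arch
 * preference and PAGE_ALLOC_COSTLY_ORDER, then cap it for vmas that are
 * not THP-eligible. thp_eligible is an assumption replacing the call to
 * hugepage_vma_check().
 */
static int sketch_anon_folio_order(int arch_order, bool thp_eligible,
				   int page_shift)
{
	int order = arch_order > SKETCH_PAGE_ALLOC_COSTLY_ORDER ?
			arch_order : SKETCH_PAGE_ALLOC_COSTLY_ORDER;

	if (!thp_eligible) {
		int cap = sketch_max_order_unhinted(page_shift);

		if (order > cap)
			order = cap;
	}
	return order;
}
```

With these stand-ins, a 4K-page configuration (page shift 12) with no arch preference (-1) ends up at order-3, and an arch preference of order-0 is also raised to order-3, which is the behaviour Yu questions. On a 64K-page configuration (page shift 16), an arch preference of order-5 (2MB) is kept for a THP-eligible vma but falls to order-0 for an unhinted one, which is the case Ryan describes.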
On 11 Aug 2023, at 1:34, Yin, Fengwei wrote:

> On 8/11/2023 9:04 AM, Zi Yan wrote:
>> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote:
[...]
>> Do you have a use case? Or it is just a possible scenario?
> It's a possible scenario. Per my experience, it's valid use case for embedded
> system or low end android phone.
>
>>
>> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
>> base pages, 2MB folios (LAF in this config) would be desirable since THP is
>> 32MB/512MB and much harder to get.
[...]
>> Basically, we are deciding whether LAF should use order-0 by default once it is
>> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
>> your argument is that code change is needed to test the impact of LAF with
>> different orders. That seems to imply we actually need an extra knob (maybe sysctl)
>> to control the max LAF order. And with that extra knob, we can solve this default
>> order problem, since we can set it to 0 for devices want to opt in LAF and set
>> it N (like 64KB) for other devices want to opt out LAF.
> From performance tuning perspective, it's necessary to have knobs to configure and
> check the attribute of LAF. But we must be careful to add the knobs as they need
> be maintained for ever.

If we do not want to maintain such a knob (since it may take some time to finalize)
and tweaking LAF order is important for us to explore different LAF configurations
(Ryan thinks 64KB will perform well on ARM64, whereas Yu mentioned 16KB/32KB is
better in his use cases), we probably just put the LAF order knob in debugfs
like Ryan suggested before to move forward.

>>
>> So maybe we need the extra knob for both testing purpose and serving different
>> device configuration purpose.
[...]

--
Best Regards,
Yan, Zi
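
To make the "max LAF order" knob idea above concrete, here is a small user-space model of how such a tunable might clamp the chosen order. Nothing like this exists in the quoted patch: the sketch_* names, the default of order-4, and the validation rule are all assumptions for illustration. The validation borrows the constraint stated in the arch_wants_pte_order() comment (range [0, PMD_SHIFT-PAGE_SHIFT), never order-1), and SKETCH_PMD_ORDER assumes 4K pages with a 2MB PMD. It does not model debugfs or sysctl plumbing.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed PMD order for 4K base pages with a 2MB PMD (21 - 12). */
#define SKETCH_PMD_ORDER 9

/* Hypothetical knob value; 4 corresponds to 64K with 4K base pages. */
static int sketch_laf_max_order = 4;

/*
 * Accept a new maximum order only if it is in [0, SKETCH_PMD_ORDER) and is
 * not order-1. Returns true if the knob was updated.
 */
static bool sketch_set_laf_max_order(int order)
{
	if (order < 0 || order >= SKETCH_PMD_ORDER || order == 1)
		return false;

	sketch_laf_max_order = order;
	return true;
}

/* Clamp a preferred (non-negative) folio order to the current knob value. */
static int sketch_clamp_order(int preferred)
{
	assert(preferred >= 0);
	return preferred < sketch_laf_max_order ? preferred : sketch_laf_max_order;
}
```

In this model, setting the knob to 0 makes every allocation order-0 unless something else raises it, which corresponds to the low-memory device scenario Fengwei describes, while leaving it at a value like 4 keeps the 64K behaviour for everyone else. Whether the knob would live in debugfs or sysctl is exactly the open question in the thread.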
On 8/11/2023 10:33 PM, Zi Yan wrote:
> On 11 Aug 2023, at 1:34, Yin, Fengwei wrote:
[...]
>> From performance tuning perspective, it's necessary to have knobs to configure and
>> check the attribute of LAF. But we must be careful to add the knobs as they need
>> be maintained for ever.
>
> If we do not want to maintain such a knob (since it may take some time to finalize)
> and tweaking LAF order is important for us to explore different LAF configurations
> (Ryan thinks 64KB will perform well on ARM64, whereas Yu mentioned 16KB/32KB is
> better in his use cases), we probably just put the LAF order knob in debugfs
> like Ryan suggested before to move forward.
Works for me.
Sorry for the delay in responding (I've been out on holiday). Questions for Yu, Zi and Yin below... On 12/08/2023 01:23, Yin, Fengwei wrote: > > > On 8/11/2023 10:33 PM, Zi Yan wrote: >> On 11 Aug 2023, at 1:34, Yin, Fengwei wrote: >> >>> On 8/11/2023 9:04 AM, Zi Yan wrote: >>>> On 10 Aug 2023, at 20:36, Yin, Fengwei wrote: >>>> >>>>> On 8/11/2023 3:46 AM, Zi Yan wrote: >>>>>> On 10 Aug 2023, at 15:12, Ryan Roberts wrote: >>>>>> >>>>>>> On 10/08/2023 18:01, Yu Zhao wrote: >>>>>>>> On Thu, Aug 10, 2023 at 8:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>>>> >>>>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be >>>>>>>>> allocated in large folios of a determined order. All pages of the large >>>>>>>>> folio are pte-mapped during the same page fault, significantly reducing >>>>>>>>> the number of page faults. The number of per-page operations (e.g. ref >>>>>>>>> counting, rmap management lru list management) are also significantly >>>>>>>>> reduced since those ops now become per-folio. >>>>>>>>> >>>>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig, >>>>>>>>> which defaults to disabled for now; The long term aim is for this to >>>>>>>>> defaut to enabled, but there are some risks around internal >>>>>>>>> fragmentation that need to be better understood first. 
>>>>>>>>> >>>>>>>>> Large anonymous folio (LAF) allocation is integrated with the existing >>>>>>>>> (PMD-order) THP and single (S) page allocation according to this policy, >>>>>>>>> where fallback (>) is performed for various reasons, such as the >>>>>>>>> proposed folio order not fitting within the bounds of the VMA, etc: >>>>>>>>> >>>>>>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>>>>>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>>>>>>>> ----------------|-----------|-------------|---------------|------------- >>>>>>>>> no hint | S | LAF>S | LAF>S | THP>LAF>S >>>>>>>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S >>>>>>>>> MADV_NOHUGEPAGE | S | S | S | S >>>>>>>>> >>>>>>>>> This approach ensures that we don't violate existing hints to only >>>>>>>>> allocate single pages - this is required for QEMU's VM live migration >>>>>>>>> implementation to work correctly - while allowing us to use LAF >>>>>>>>> independently of THP (when sysfs=never). This makes wide scale >>>>>>>>> performance characterization simpler, while avoiding exposing any new >>>>>>>>> ABI to user space. >>>>>>>>> >>>>>>>>> When using LAF for allocation, the folio order is determined as follows: >>>>>>>>> The return value of arch_wants_pte_order() is used. For vmas that have >>>>>>>>> not explicitly opted-in to use transparent hugepages (e.g. where >>>>>>>>> sysfs=madvise and the vma does not have MADV_HUGEPAGE or sysfs=never), >>>>>>>>> then arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever >>>>>>>>> is bigger). This allows for a performance boost without requiring any >>>>>>>>> explicit opt-in from the workload while limitting internal >>>>>>>>> fragmentation. >>>>>>>>> >>>>>>>>> If the preferred order can't be used (e.g. because the folio would >>>>>>>>> breach the bounds of the vma, or because ptes in the region are already >>>>>>>>> mapped) then we fall back to a suitable lower order; first >>>>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0. 
>>>>>>>>> >>>>>>>>> arch_wants_pte_order() can be overridden by the architecture if desired. >>>>>>>>> Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous >>>>>>>>> set of ptes map physically contigious, naturally aligned memory, so this >>>>>>>>> mechanism allows the architecture to optimize as required. >>>>>>>>> >>>>>>>>> Here we add the default implementation of arch_wants_pte_order(), used >>>>>>>>> when the architecture does not define it, which returns -1, implying >>>>>>>>> that the HW has no preference. In this case, mm will choose it's own >>>>>>>>> default order. >>>>>>>>> >>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>>>>>>> --- >>>>>>>>> include/linux/pgtable.h | 13 ++++ >>>>>>>>> mm/Kconfig | 10 +++ >>>>>>>>> mm/memory.c | 144 +++++++++++++++++++++++++++++++++++++--- >>>>>>>>> 3 files changed, 158 insertions(+), 9 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>>>>>>> index 222a33b9600d..4b488cc66ddc 100644 >>>>>>>>> --- a/include/linux/pgtable.h >>>>>>>>> +++ b/include/linux/pgtable.h >>>>>>>>> @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void) >>>>>>>>> } >>>>>>>>> #endif >>>>>>>>> >>>>>>>>> +#ifndef arch_wants_pte_order >>>>>>>>> +/* >>>>>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, >>>>>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>>>>>>> + * to be at least order-2. Negative value implies that the HW has no preference >>>>>>>>> + * and mm will choose it's own default order. 
>>>>>>>>> + */ >>>>>>>>> +static inline int arch_wants_pte_order(void) >>>>>>>>> +{ >>>>>>>>> + return -1; >>>>>>>>> +} >>>>>>>>> +#endif >>>>>>>>> + >>>>>>>>> #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR >>>>>>>>> static inline pte_t ptep_get_and_clear(struct mm_struct *mm, >>>>>>>>> unsigned long address, >>>>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>>>>>>> index 721dc88423c7..a1e28b8ddc24 100644 >>>>>>>>> --- a/mm/Kconfig >>>>>>>>> +++ b/mm/Kconfig >>>>>>>>> @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA >>>>>>>>> >>>>>>>>> source "mm/damon/Kconfig" >>>>>>>>> >>>>>>>>> +config LARGE_ANON_FOLIO >>>>>>>>> + bool "Allocate large folios for anonymous memory" >>>>>>>>> + depends on TRANSPARENT_HUGEPAGE >>>>>>>>> + default n >>>>>>>>> + help >>>>>>>>> + Use large (bigger than order-0) folios to back anonymous memory where >>>>>>>>> + possible, even for pte-mapped memory. This reduces the number of page >>>>>>>>> + faults, as well as other per-page overheads to improve performance for >>>>>>>>> + many workloads. 
>>>>>>>>> + >>>>>>>>> endmenu >>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>>>>> index d003076b218d..bbc7d4ce84f7 100644 >>>>>>>>> --- a/mm/memory.c >>>>>>>>> +++ b/mm/memory.c >>>>>>>>> @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>>>>> return ret; >>>>>>>>> } >>>>>>>>> >>>>>>>>> +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) >>>>>>>>> +{ >>>>>>>>> + int i; >>>>>>>>> + >>>>>>>>> + if (nr_pages == 1) >>>>>>>>> + return vmf_pte_changed(vmf); >>>>>>>>> + >>>>>>>>> + for (i = 0; i < nr_pages; i++) { >>>>>>>>> + if (!pte_none(ptep_get_lockless(vmf->pte + i))) >>>>>>>>> + return true; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + return false; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> +#ifdef CONFIG_LARGE_ANON_FOLIO >>>>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \ >>>>>>>>> + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT) >>>>>>>>> + >>>>>>>>> +static int anon_folio_order(struct vm_area_struct *vma) >>>>>>>>> +{ >>>>>>>>> + int order; >>>>>>>>> + >>>>>>>>> + /* >>>>>>>>> + * If the vma is eligible for thp, allocate a large folio of the size >>>>>>>>> + * preferred by the arch. Or if the arch requested a very small size or >>>>>>>>> + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still >>>>>>>>> + * meets the arch's requirements but means we still take advantage of SW >>>>>>>>> + * optimizations (e.g. fewer page faults). >>>>>>>>> + * >>>>>>>>> + * If the vma isn't eligible for thp, take the arch-preferred size and >>>>>>>>> + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads >>>>>>>>> + * that have not explicitly opted-in take benefit while capping the >>>>>>>>> + * potential for internal fragmentation. 
>>>>>>>>> +	 */
>>>>>>>>> +
>>>>>>>>> +	order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>>>>> +
>>>>>>>>> +	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>>>>> +		order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>>>>> +
>>>>>>>>> +	return order;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> I don't understand why we still want to keep ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>>>> 1. It's not used, since no archs at the moment implement
>>>>>>>> arch_wants_pte_order() that returns >64KB.
>>>>>>>> 2. As far as I know, there is no plan for any arch to do so.
>>>>>>>
>>>>>>> My rationale is that arm64 is planning to use this for contpte mapping 2MB
>>>>>>> blocks for 16K and 64K kernels. But I think we will all agree that allowing 2MB
>>>>>>> blocks without the proper THP hinting is a bad plan.
>>>>>>>
>>>>>>> As I see it, arches could add their own arch_wants_pte_order() at any time, and
>>>>>>> just because the HW has a preference, doesn't mean the SW shouldn't get a say.
>>>>>>> Its a negotiation between HW and SW for the LAF order, embodied in this policy.

Yu, I never saw a reply to this. Have I managed to convince you? I'm willing to
put the vma param back into arch_wants_pte_order() and handle the policy in the
arch, if you consider that a less bad solution.

>>>>>>>
>>>>>>>> 3. Again, it seems to me the rationale behind
>>>>>>>> ANON_FOLIO_MAX_ORDER_UNHINTED isn't convincing at all.
>>>>>>>>
>>>>>>>> Can we introduce ANON_FOLIO_MAX_ORDER_UNHINTED if/when needed please?
>>>>>>>>
>>>>>>>> Also you made arch_wants_pte_order() return -1, and I acknowledged [1]:
>>>>>>>> Thanks: -1 actually is better than 0 (what I suggested) for the
>>>>>>>> obvious reason.
>>>>>>>>
>>>>>>>> I thought we were on the same page, i.e., the "obvious reason" is that
>>>>>>>> h/w might prefer 0. But here you are not respecting 0. But then why
>>>>>>>> -1?
>>>>>>>
>>>>>>> I agree that the "obvious reason" is that HW might prefer order-0. But the
>>>>>>> performance wins don't come solely from the HW. Batching up page faults is a big
>>>>>>> win for SW even if the HW doesn't benefit. So I think it is important that a HW
>>>>>>> preference of order-0 is possible to express through this API. But that doesn't
>>>>>>> mean that we don't listen to SW's preferences either.
>>>>>>>
>>>>>>> I would really rather leave it in; As I've mentioned in the past, we have a
>>>>>>> partner who is actively keen to take advantage of 2MB blocks with 64K kernel and
>>>>>>> this is the mechanism that means we don't dole out those 2MB blocks unless
>>>>>>> explicitly opted-in.

Yu, would appreciate any comments here.

>>>>>>>
>>>>>>> I'm going to be out on holiday for a couple of weeks, so we might have to wait
>>>>>>> until I'm back to conclude on this, if you still take issue with the justification.
>>>>>>
>>>>>> From my understanding (correct me if I am wrong), Yu seems to want order-0 to be
>>>>>> the default order even if LAF is enabled.

Zi, I think you are incorrect; Yu does not want order-0 to be the default. He's
just pointing out that the original "default return value that actually means
PAGE_ALLOC_COSTLY_ORDER" was 0, and that was not an ideal choice because 0
_could_ be a legitimate preference from the HW. So -1 is preferred for this
purpose. Yu, correct me if I'm wrong!

>>>>>> But that does not make sense to me, since
>>>>>> if LAF is configured to be enabled (it is disabled by default now), user (and distros)
>>>>>> must think LAF is giving benefit. Otherwise, they will just disable LAF at compilation
>>>>>> time or by using prctl. Enabling LAF and using order-0 as the default order makes
>>>>>> most of LAF code not used.
>>>>> For the device with limited memory size and it still wants LAF enabled for some specific
>>>>> memory ranges, it's possible the LAF is enabled, order-0 as default order and use madvise
>>>>> to enable LAF for specific memory ranges.
>>>>
>>>> Do you have a use case? Or it is just a possible scenario?
>>> It's a possible scenario. Per my experience, it's valid use case for embedded
>>> system or low end android phone.
>>>
>>>>
>>>> IIUC, Ryan has a concrete use case for his choice. For ARM64 with 16KB/64KB
>>>> base pages, 2MB folios (LAF in this config) would be desirable since THP is
>>>> 32MB/512MB and much harder to get.

Yes, I have a real use case for my choice. But as I said above, I'm willing to
move that policy into the arch impl of arch_wants_pte_order() if it's acceptable
to pass the vma in (this is how I was doing it in the original version, but the
preference was to remove the parameter).

>>>>
>>>>>
>>>>> So my understanding is it's possible case. But it's another configuration thing and not
>>>>> necessary to be finalized now.
>>>>
>>>> Basically, we are deciding whether LAF should use order-0 by default once it is
>>>> compiled in to kernel. From your other email on ANON_FOLIO_MAX_ORDER_UNHINTED,
>>>> your argument is that code change is needed to test the impact of LAF with
>>>> different orders. That seems to imply we actually need an extra knob (maybe sysctl)
>>>> to control the max LAF order. And with that extra knob, we can solve this default
>>>> order problem, since we can set it to 0 for devices want to opt in LAF and set
>>>> it N (like 64KB) for other devices want to opt out LAF.
>>> From performance tuning perspective, it's necessary to have knobs to configure and
>>> check the attribute of LAF. But we must be careful to add the knobs as they need
>>> be maintained for ever.
>>
>> If we do not want to maintain such a knob (since it may take some time to finalize)
>> and tweaking LAF order is important for us to explore different LAF configurations
>> (Ryan thinks 64KB will perform well on ARM64, whereas Yu mentioned 16KB/32KB is
>> better in his use cases), we probably just put the LAF order knob in debugfs
>> like Ryan suggested before to move forward.
> Works for me.

I would really rather avoid adding any knob for now if we possibly can. We have
discussed this in the past and concluded we should avoid one. It was also raised
that if we do add a knob, then debugfs is not sufficient because you can't
access it in some environments.

>
>>
>>
>>>>
>>>> So maybe we need the extra knob for both testing purpose and serving different
>>>> device configuration purpose.
>>>>
>>>>>>
>>>>>> Also arch_wants_pte_order() might need a better name like
>>>>>> arch_wants_large_folio_order(). Since current name sounds like the specified order
>>>>>> is wanted by HW in a general setting, but it is not. It is an order HW wants
>>>>>> when LAF is enabled. That might cause some confusion.

Personally I don't think it makes much difference. "large folio" does not make
it clear that it's for pte-mapped memory only. How about
arch_prefers_pte_order(), if it really must be changed?

>>>>>>
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-mm/CAOUHufZ7HJZW8Srwatyudf=FbwTGQtyq4DyL2SHwSg37N_Bo_A@mail.gmail.com/
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yan, Zi
>>
>>
>> --
>> Best Regards,
>> Yan, Zi
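
For anyone following the disagreement over ANON_FOLIO_MAX_ORDER_UNHINTED, the order negotiation in anon_folio_order() can be modelled in user space. The block below is only a sketch under stated assumptions, not kernel code: model_anon_folio_order(), max_order_unhinted() and ilog2_ul() are hypothetical names, arch_order stands in for the return value of arch_wants_pte_order(), thp_eligible stands in for the hugepage_vma_check() result, and PAGE_SHIFT is passed as a parameter so that 4K, 16K and 64K kernels can be compared.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the mainline definition in include/linux/mmzone.h. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Integer log2 of a power-of-two size; stand-in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long v)
{
	int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/* Model of ANON_FOLIO_MAX_ORDER_UNHINTED for a given PAGE_SHIFT. */
static int max_order_unhinted(int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long sz_64k = 64UL * 1024;

	return ilog2_ul(page_size > sz_64k ? page_size : sz_64k) - page_shift;
}

/*
 * Hypothetical user-space model of anon_folio_order() from the patch.
 * arch_order: what arch_wants_pte_order() would return (-1 = no preference).
 * thp_eligible: what hugepage_vma_check() would return for the vma.
 */
static int model_anon_folio_order(int arch_order, bool thp_eligible,
				  int page_shift)
{
	int order;
	int cap;

	assert(arch_order >= -1 && page_shift >= 12);

	/* Small or absent HW preferences are raised to the SW default. */
	order = arch_order > PAGE_ALLOC_COSTLY_ORDER ?
		arch_order : PAGE_ALLOC_COSTLY_ORDER;

	/* Without explicit opt-in, cap at 64K (or PAGE_SIZE if larger). */
	if (!thp_eligible) {
		cap = max_order_unhinted(page_shift);
		if (order > cap)
			order = cap;
	}

	return order;
}
```

In this model a 4K kernel with no arch preference gets order-3 whether or not the vma is hinted, while a 64K kernel with an assumed arch preference of order-5 (2MB) gets order-5 only for THP-eligible vmas and order-0 otherwise. An arch preference of 0 is raised to order-3, which is the behaviour Yu questions above.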
On 15/08/2023 22:32, Huang, Ying wrote:
> Hi, Ryan,
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>> allocated in large folios of a determined order. All pages of the large
>> folio are pte-mapped during the same page fault, significantly reducing
>> the number of page faults. The number of per-page operations (e.g. ref
>> counting, rmap management lru list management) are also significantly
>> reduced since those ops now become per-folio.
>>
>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>> which defaults to disabled for now; The long term aim is for this to
>> defaut to enabled, but there are some risks around internal
>> fragmentation that need to be better understood first.
>>
>> Large anonymous folio (LAF) allocation is integrated with the existing
>> (PMD-order) THP and single (S) page allocation according to this policy,
>> where fallback (>) is performed for various reasons, such as the
>> proposed folio order not fitting within the bounds of the VMA, etc:
>>
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>
> IMHO, we should use the following semantics as you have suggested
> before.
>
>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> ----------------|-----------|-------------|---------------|-------------
> no hint         | S         | S           | LAF>S         | THP>LAF>S
> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S           | S             | S
>
> Or even,
>
>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> ----------------|-----------|-------------|---------------|-------------
> no hint         | S         | S           | S             | THP>LAF>S
> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> MADV_NOHUGEPAGE | S         | S           | S             | S
>
> From the implementation point of view, PTE mapped PMD-sized THP has
> almost no difference with LAF (just some small sized THP). It will be
> confusing to distinguish them from the interface point of view.
>
> So, IMHO, the real difference is the policy. For example, prefer
> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
> interface is used to specify system global policy. In the long term, it
> can be something like below,
>
> never:      S               # disable all THP
> madvise:                    # never by default, control via madvise()
> always:     THP>LAF>S       # prefer PMD-sized THP in fact
> small:      LAF>S           # prefer small sized THP
> auto:                       # use in-kernel heuristics for THP size
>
> But it may be not ready to add new policies now. So, before the new
> policies are ready, we can add a debugfs interface to override the
> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
> we have tuned enough workloads, collected enough data, we can add new
> policies to the sysfs interface.

I think we can all imagine many policy options. But we don't really have much
evidence yet for what is best. The policy I'm currently using is intended to
give some flexibility for testing (use LAF without THP by setting sysfs=never,
use THP without LAF by compiling without LAF) without adding any new knobs at
all. Given that, surely we can defer these decisions until we have more data?

In the absence of data, your proposed solution sounds very sensible to me. But
for the purposes of scaling up perf testing, I don't think it's essential given
the current policy will also produce the same options.

If we were going to add a debugfs knob, I think the higher priority would be a
knob to specify the folio order. (But again, I would rather avoid it if
possible.)

Thanks,
Ryan
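
To make the difference between the two policy tables concrete, here is a minimal user-space sketch that encodes the table from the commit message and the first alternative proposed by Ying. It is a model only, not kernel code: the enums and the functions patch_policy() and alt_policy() are hypothetical names introduced for illustration, and the "prctl" input is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encodings of the table inputs and outputs. */
enum thp_sysfs { SYSFS_NEVER, SYSFS_MADVISE, SYSFS_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };
enum alloc_policy { POLICY_S, POLICY_LAF_S, POLICY_THP_LAF_S };

/* Model of the policy table in the commit message. */
static enum alloc_policy patch_policy(bool prctl_enabled,
				      enum thp_sysfs sysfs,
				      enum vma_hint hint)
{
	assert(sysfs <= SYSFS_ALWAYS && hint <= HINT_NOHUGEPAGE);

	/* Explicit opt-out (prctl or MADV_NOHUGEPAGE) always wins. */
	if (!prctl_enabled || hint == HINT_NOHUGEPAGE)
		return POLICY_S;

	/* THP is tried first only where THP is already permitted today. */
	if (sysfs == SYSFS_ALWAYS ||
	    (sysfs == SYSFS_MADVISE && hint == HINT_HUGEPAGE))
		return POLICY_THP_LAF_S;

	/* Everything else, including sysfs=never, still gets LAF. */
	return POLICY_LAF_S;
}

/* Model of the first alternative table: sysfs=never disables LAF too. */
static enum alloc_policy alt_policy(bool prctl_enabled,
				    enum thp_sysfs sysfs,
				    enum vma_hint hint)
{
	if (sysfs == SYSFS_NEVER)
		return POLICY_S;

	return patch_policy(prctl_enabled, sysfs, hint);
}
```

The two functions differ only in the sysfs=never column, which is why the patch's policy allows LAF to be tested without THP while the alternative gives a way to switch LAF off at runtime.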
On 31.08.23 10:02, Yin, Fengwei wrote: > > > On 8/31/2023 3:57 PM, David Hildenbrand wrote: >> On 31.08.23 03:40, Huang, Ying wrote: >>> Ryan Roberts <ryan.roberts@arm.com> writes: >>> >>>> On 15/08/2023 22:32, Huang, Ying wrote: >>>>> Hi, Ryan, >>>>> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes: >>>>> >>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be >>>>>> allocated in large folios of a determined order. All pages of the large >>>>>> folio are pte-mapped during the same page fault, significantly reducing >>>>>> the number of page faults. The number of per-page operations (e.g. ref >>>>>> counting, rmap management lru list management) are also significantly >>>>>> reduced since those ops now become per-folio. >>>>>> >>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig, >>>>>> which defaults to disabled for now; The long term aim is for this to >>>>>> defaut to enabled, but there are some risks around internal >>>>>> fragmentation that need to be better understood first. >>>>>> >>>>>> Large anonymous folio (LAF) allocation is integrated with the existing >>>>>> (PMD-order) THP and single (S) page allocation according to this policy, >>>>>> where fallback (>) is performed for various reasons, such as the >>>>>> proposed folio order not fitting within the bounds of the VMA, etc: >>>>>> >>>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>>>>> ----------------|-----------|-------------|---------------|------------- >>>>>> no hint | S | LAF>S | LAF>S | THP>LAF>S >>>>>> MADV_HUGEPAGE | S | LAF>S | THP>LAF>S | THP>LAF>S >>>>>> MADV_NOHUGEPAGE | S | S | S | S >>>>> >>>>> IMHO, we should use the following semantics as you have suggested >>>>> before. 
>>>>> >>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>>>> ----------------|-----------|-------------|---------------|------------- >>>>> no hint | S | S | LAF>S | THP>LAF>S >>>>> MADV_HUGEPAGE | S | S | THP>LAF>S | THP>LAF>S >>>>> MADV_NOHUGEPAGE | S | S | S | S >>>>> >>>>> Or even, >>>>> >>>>> | prctl=dis | prctl=ena | prctl=ena | prctl=ena >>>>> | sysfs=X | sysfs=never | sysfs=madvise | sysfs=always >>>>> ----------------|-----------|-------------|---------------|------------- >>>>> no hint | S | S | S | THP>LAF>S >>>>> MADV_HUGEPAGE | S | S | THP>LAF>S | THP>LAF>S >>>>> MADV_NOHUGEPAGE | S | S | S | S >>>>> >>>>> From the implementation point of view, PTE mapped PMD-sized THP has >>>>> almost no difference with LAF (just some small sized THP). It will be >>>>> confusing to distinguish them from the interface point of view. >>>>> >>>>> So, IMHO, the real difference is the policy. For example, prefer >>>>> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs >>>>> interface is used to specify system global policy. In the long term, it >>>>> can be something like below, >>>>> >>>>> never: S # disable all THP >>>>> madvise: # never by default, control via madvise() >>>>> always: THP>LAF>S # prefer PMD-sized THP in fact >>>>> small: LAF>S # prefer small sized THP >>>>> auto: # use in-kernel heuristics for THP size >>>>> >>>>> But it may be not ready to add new policies now. So, before the new >>>>> policies are ready, we can add a debugfs interface to override the >>>>> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After >>>>> we have tuned enough workloads, collected enough data, we can add new >>>>> policies to the sysfs interface. >>>> >>>> I think we can all imagine many policy options. But we don't really have much >>>> evidence yet for what it best. 
The policy I'm currently using is intended to >>>> give some flexibility for testing (use LAF without THP by setting sysfs=never, >>>> use THP without LAF by compiling without LAF) without adding any new knobs at >>>> all. Given that, surely we can defer these decisions until we have more data? >>>> >>>> In the absence of data, your proposed solution sounds very sensible to me. But >>>> for the purposes of scaling up perf testing, I don't think its essential given >>>> the current policy will also produce the same options. >>>> >>>> If we were going to add a debugfs knob, I think the higher priority would be a >>>> knob to specify the folio order. (but again, I would rather avoid if possible). >>> >>> I totally understand we need some way to control PMD-sized THP and LAF >>> to tune the workload, and nobody likes debugfs knob. >>> >>> My concern about interface is that we have no way to disable LAF >>> system-wise without rebuilding the kernel. In the future, should we add >>> a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be >>> stricter than "never"? "really_never"? >> >> Let's talk about that in a bi-weekly MM session. (I proposed it as a topic for next week). > > The time slot of the meeting is not friendly to our timezone. Like > it's 1 or 2 AM. Yes. I know it's very hard to find a good time slot > for US, EU and Asia. :(. :/ Yeah, even for me in Germany it's usually already around 6-7pm. > > So maybe we still need to discuss it through mail? I don't think we'll be done discussing that in one session. One of the main goals is to get some input from the wider MM community.
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 222a33b9600d..4b488cc66ddc 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -369,6 +369,19 @@ static inline bool arch_has_hw_pte_young(void) } #endif +#ifndef arch_wants_pte_order +/* + * Returns preferred folio order for pte-mapped memory. Must be in range [0, + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios + * to be at least order-2. Negative value implies that the HW has no preference + * and mm will choose it's own default order. + */ +static inline int arch_wants_pte_order(void) +{ + return -1; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/Kconfig b/mm/Kconfig index 721dc88423c7..a1e28b8ddc24 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1243,4 +1243,14 @@ config LOCK_MM_AND_FIND_VMA source "mm/damon/Kconfig" +config LARGE_ANON_FOLIO + bool "Allocate large folios for anonymous memory" + depends on TRANSPARENT_HUGEPAGE + default n + help + Use large (bigger than order-0) folios to back anonymous memory where + possible, even for pte-mapped memory. This reduces the number of page + faults, as well as other per-page overheads to improve performance for + many workloads. 
+ endmenu diff --git a/mm/memory.c b/mm/memory.c index d003076b218d..bbc7d4ce84f7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4073,6 +4073,123 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) return ret; } +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) +{ + int i; + + if (nr_pages == 1) + return vmf_pte_changed(vmf); + + for (i = 0; i < nr_pages; i++) { + if (!pte_none(ptep_get_lockless(vmf->pte + i))) + return true; + } + + return false; +} + +#ifdef CONFIG_LARGE_ANON_FOLIO +#define ANON_FOLIO_MAX_ORDER_UNHINTED \ + (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT) + +static int anon_folio_order(struct vm_area_struct *vma) +{ + int order; + + /* + * If the vma is eligible for thp, allocate a large folio of the size + * preferred by the arch. Or if the arch requested a very small size or + * didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER, which still + * meets the arch's requirements but means we still take advantage of SW + * optimizations (e.g. fewer page faults). + * + * If the vma isn't eligible for thp, take the arch-preferred size and + * limit it to ANON_FOLIO_MAX_ORDER_UNHINTED. This ensures workloads + * that have not explicitly opted-in take benefit while capping the + * potential for internal fragmentation. + */ + + order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); + + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true)) + order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED); + + return order; +} + +static struct folio *alloc_anon_folio(struct vm_fault *vmf) +{ + int i; + gfp_t gfp; + pte_t *pte; + unsigned long addr; + struct folio *folio; + struct vm_area_struct *vma = vmf->vma; + int prefer = anon_folio_order(vma); + int orders[] = { + prefer, + prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0, + 0, + }; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. 
+ */ + if (userfaultfd_armed(vma)) + goto fallback; + + /* + * If hugepages are explicitly disabled for the vma (either + * MADV_NOHUGEPAGE or prctl) fallback to order-0. Failure to do this + * breaks correctness for user space. We ignore the sysfs global knob. + */ + if (!hugepage_vma_check(vma, vma->vm_flags, false, true, false)) + goto fallback; + + for (i = 0; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + if (addr >= vma->vm_start && + addr + (PAGE_SIZE << orders[i]) <= vma->vm_end) + break; + } + + if (!orders[i]) + goto fallback; + + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); + if (!pte) + return ERR_PTR(-EAGAIN); + + for (; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + vmf->pte = pte + pte_index(addr); + if (!vmf_pte_range_changed(vmf, 1 << orders[i])) + break; + } + + vmf->pte = NULL; + pte_unmap(pte); + + gfp = vma_thp_gfp_mask(vma); + + for (; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + folio = vma_alloc_folio(gfp, orders[i], vma, addr, true); + if (folio) { + clear_huge_page(&folio->page, addr, 1 << orders[i]); + return folio; + } + } + +fallback: + return vma_alloc_zeroed_movable_folio(vma, vmf->address); +} +#else +#define alloc_anon_folio(vmf) \ + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address) +#endif + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4080,6 +4197,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + int i; + int nr_pages = 1; + unsigned long addr = vmf->address; bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; @@ -4124,10 +4244,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Allocate our own private page. 
*/ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vmf); + if (IS_ERR(folio)) + return 0; if (!folio) goto oom; + nr_pages = folio_nr_pages(folio); + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); @@ -4144,12 +4269,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) goto release; - if (vmf_pte_changed(vmf)) { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (vmf_pte_range_changed(vmf, nr_pages)) { + for (i = 0; i < nr_pages; i++) + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i); goto release; } @@ -4164,16 +4289,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); - folio_add_new_anon_rmap(folio, vma, vmf->address); + folio_ref_add(folio, nr_pages - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); + folio_add_new_anon_rmap(folio, vma, addr); folio_add_lru_vma(folio, vma); setpte: if (uffd_wp) entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages); /* No need to invalidate - it was non-present before */ - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages); unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl);
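
The first fallback loop in alloc_anon_folio() above can be illustrated with a small user-space model. This is a sketch under stated assumptions rather than the kernel implementation: model_first_fitting_order() is a hypothetical name, the vma is reduced to its [vm_start, vm_end) bounds, PAGE_SHIFT is passed as a parameter, and only the "does the naturally aligned range fit inside the vma" check is modelled (the pte_none() scan and the allocation retry loop are left out).

```c
#include <assert.h>

/* Same value as the mainline definition in include/linux/mmzone.h. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical model of the vma bounds check in alloc_anon_folio().
 * Walks the same candidate list as the patch: the preferred order, then
 * PAGE_ALLOC_COSTLY_ORDER (only if the preferred order was larger), then
 * order-0. Returns the first order whose naturally aligned range lies
 * fully inside [vm_start, vm_end), or 0 if none does.
 */
static int model_first_fitting_order(unsigned long fault_addr,
				     unsigned long vm_start,
				     unsigned long vm_end,
				     int page_shift, int prefer)
{
	int orders[3];
	int i;

	assert(vm_start <= fault_addr && fault_addr < vm_end);

	orders[0] = prefer;
	orders[1] = prefer > PAGE_ALLOC_COSTLY_ORDER ?
		    PAGE_ALLOC_COSTLY_ORDER : 0;
	orders[2] = 0;

	/* As in the patch, an order of 0 terminates the candidate list. */
	for (i = 0; i < 3 && orders[i]; i++) {
		unsigned long size = 1UL << (page_shift + orders[i]);
		/* Equivalent of ALIGN_DOWN(fault_addr, size). */
		unsigned long addr = fault_addr & ~(size - 1);

		if (addr >= vm_start && addr + size <= vm_end)
			return orders[i];
	}

	return 0;
}
```

Note that the candidate list has at most two non-zero entries, so intermediate orders are never tried: if neither the preferred order nor PAGE_ALLOC_COSTLY_ORDER fits, the model (like the patch) drops straight to order-0.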