Message ID | 20230626171430.3167004-4-ryan.roberts@arm.com |
---|---|
State | New |
Headers |
From: Ryan Roberts <ryan.roberts@arm.com> To: Andrew Morton <akpm@linux-foundation.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Yin Fengwei <fengwei.yin@intel.com>, David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Geert Uytterhoeven <geert@linux-m68k.org>, Christian Borntraeger <borntraeger@linux.ibm.com>, Sven Schnelle <svens@linux.ibm.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, "H. Peter Anvin" <hpa@zytor.com> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org, linux-s390@vger.kernel.org Subject: [PATCH v1 03/10] mm: Introduce try_vma_alloc_movable_folio() Date: Mon, 26 Jun 2023 18:14:23 +0100 Message-Id: <20230626171430.3167004-4-ryan.roberts@arm.com> In-Reply-To: <20230626171430.3167004-1-ryan.roberts@arm.com> References: <20230626171430.3167004-1-ryan.roberts@arm.com> |
Series | variable-order, large folios for anonymous memory |
Commit Message
Ryan Roberts
June 26, 2023, 5:14 p.m. UTC
Opportunistically attempt to allocate high-order folios in highmem,
optionally zeroed. Retry with lower orders all the way to order-0, until
success. Of note, order-1 allocations are skipped, since a large folio
must be at least order-2 to work with the THP machinery. The caller must
check what order they actually got with folio_order().
This will be used to opportunistically allocate large folios for
anonymous memory, with a sensible fallback under memory pressure.
For attempts to allocate non-zero orders, we set __GFP_NORETRY to prevent
high latency due to reclaim, preferring instead to simply try a lower
order. The same approach is used by the readahead code when allocating
large folios.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/memory.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
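To make the "check what order they actually got" requirement concrete, a caller might use the helper roughly as in the sketch below. This is an illustration only, not part of the patch: example_alloc() and preferred_order are invented names, and the PTE-mapping step is elided to a comment.

/*
 * Illustrative caller only (not from this series): request a zeroed folio of
 * the preferred order and then check what order was actually returned, since
 * try_vma_alloc_movable_folio() may have fallen back to a lower order.
 */
static vm_fault_t example_alloc(struct vm_area_struct *vma,
                                unsigned long vaddr, int preferred_order)
{
        struct folio *folio;
        int order;

        folio = try_vma_alloc_movable_folio(vma, vaddr, preferred_order, true);
        if (!folio)
                return VM_FAULT_OOM;

        /* Never assume the requested order was granted. */
        order = folio_order(folio);

        /* ... map 1 << order PTEs starting at the folio-aligned address ... */

        return 0;
}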
Comments
On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Opportunistically attempt to allocate high-order folios in highmem,
> optionally zeroed. Retry with lower orders all the way to order-0, until
> success.
>
> [...]
>
> +static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
> +                               unsigned long vaddr, int order, bool zeroed)
> +{
> +        struct folio *folio;
> +
> +        for (; order > 1; order--) {
> +                folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
> +                if (folio)
> +                        return folio;
> +        }
> +
> +        return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
> +}

I'd drop this patch. Instead, in do_anonymous_page():

    if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
        folio = vma_alloc_zeroed_movable_folio(vma, addr,
                        CONFIG_ARCH_WANTS_PTE_ORDER))

    if (!folio)
        folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > [...]
>
> I'd drop this patch. Instead, in do_anonymous_page():
>
>     if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>         folio = vma_alloc_zeroed_movable_folio(vma, addr,
>                         CONFIG_ARCH_WANTS_PTE_ORDER))
>
>     if (!folio)
>         folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);

I meant a runtime function arch_wants_pte_order() (Its default
implementation would return 0.)
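As a rough illustration of the suggestion (an assumption about the API shape, not code from this thread), the runtime hook could be a trivial, per-arch-overridable default:

/*
 * Hypothetical sketch of the suggested runtime hook: the generic default
 * returns 0 (order-0, i.e. no large-folio preference); an architecture such
 * as arm64 would provide its own definition returning a larger order.
 */
#ifndef arch_wants_pte_order
static inline int arch_wants_pte_order(void)
{
        return 0;
}
#endif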
On 27/06/2023 06:29, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 8:34 PM Yu Zhao <yuzhao@google.com> wrote:
>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> [...]
>>
>> I'd drop this patch. Instead, in do_anonymous_page():
>>
>>     if (IS_ENABLED(CONFIG_ARCH_WANTS_PTE_ORDER))
>>         folio = vma_alloc_zeroed_movable_folio(vma, addr,
>>                         CONFIG_ARCH_WANTS_PTE_ORDER))
>>
>>     if (!folio)
>>         folio = vma_alloc_zeroed_movable_folio(vma, addr, 0);
>
> I meant a runtime function arch_wants_pte_order() (Its default
> implementation would return 0.)

There are a bunch of things which you are implying here which I'll try to make
explicit:

I think you are implying that we shouldn't retry allocation with intermediate
orders, but only try the order requested by the arch (arch_wants_pte_order())
and 0. Correct? For arm64 at least, I would like the VMA's THP hint to be a
factor in determining the preferred order (see patches 8 and 9). So I would add
a vma parameter to arch_wants_pte_order() to allow for this.

For the case where the THP hint is present, the arch will request 2M (if the
page size is 16K or 64K). If that fails to allocate, there is still value in
allocating a 64K folio (which is order 2 in the 16K case). Without the retry
with intermediate orders logic, we would not get this.

We can't just blindly allocate a folio of arch_wants_pte_order() size because it
might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
implying that if there is any kind of issue like this, then we should go
directly to order-0? I can kind of see the argument from a minimizing
fragmentation perspective, but for best possible performance I think we are
better off "packing the bin" with intermediate orders.

You're also implying that a runtime arch_wants_pte_order() function is better
than the Kconfig stuff I did in patch 8. On reflection, I agree with you here. I
think you mentioned that AMD supports coalescing 8 pages on some CPUs - so you
would probably want runtime logic to determine if you are on an appropriate AMD
CPU as part of the decision in that function?

The real reason for the existence of try_vma_alloc_movable_folio() is that I'm
reusing it on the other fault paths (which are no longer part of this series).
But I guess that's not a good reason to keep this until we get to those patches.
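To make the "will it fit" point concrete: the checks referred to here (calc_anon_folio_order_alloc() in patch 10) have to confirm at least that a naturally aligned folio of the candidate order stays inside the VMA and does not overlap already-populated PTEs. The helper below is a simplified, hypothetical sketch of the VMA-bounds part only; the name anon_folio_fits_vma() is invented here and is not the code from patch 10.

/*
 * Hypothetical, simplified check: does a naturally aligned folio of the
 * given order, covering 'addr', lie entirely within the VMA? The real
 * calc_anon_folio_order_alloc() additionally checks for already-populated
 * PTEs and other constraints.
 */
static bool anon_folio_fits_vma(struct vm_area_struct *vma,
                                unsigned long addr, int order)
{
        unsigned long size = PAGE_SIZE << order;
        unsigned long start = ALIGN_DOWN(addr, size);

        return start >= vma->vm_start && start + size <= vma->vm_end;
}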
On 6/27/23 15:56, Ryan Roberts wrote:
> On 27/06/2023 06:29, Yu Zhao wrote:
>> I meant a runtime function arch_wants_pte_order() (Its default
>> implementation would return 0.)
>
> [...]
>
> We can't just blindly allocate a folio of arch_wants_pte_order() size because it
> might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
> number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
> implying that if there is any kind of issue like this, then we should go
> directly to order-0? I can kind of see the argument from a minimizing
> fragmentation perspective, but for best possible performance I think we are
> better off "packing the bin" with intermediate orders.

One drawback of the retry is that it could introduce large tail latency (by
memory zeroing, memory reclaiming or existing populated PTEs). That may not
be appreciated by some applications. Thanks.

Regards
Yin, Fengwei

> You're also implying that a runtime arch_wants_pte_order() function is better
> than the Kconfig stuff I did in patch 8. On reflection, I agree with you here. I
> think you mentioned that AMD supports coalescing 8 pages on some CPUs - so you
> would probably want runtime logic to determine if you are on an appropriate AMD
> CPU as part of the decision in that function?
>
> The real reason for the existence of try_vma_alloc_movable_folio() is that I'm
> reusing it on the other fault paths (which are no longer part of this series).
> But I guess that's not a good reason to keep this until we get to those patches.
On 28/06/2023 03:32, Yin Fengwei wrote:
> On 6/27/23 15:56, Ryan Roberts wrote:
>>
>> [...]
>>
>> We can't just blindly allocate a folio of arch_wants_pte_order() size because it
>> might overlap with existing populated PTEs, or cross the bounds of the VMA (or a
>> number of other things - see calc_anon_folio_order_alloc() in patch 10). Are you
>> implying that if there is any kind of issue like this, then we should go
>> directly to order-0? I can kind of see the argument from a minimizing
>> fragmentation perspective, but for best possible performance I think we are
>> better off "packing the bin" with intermediate orders.
>
> One drawback of the retry is that it could introduce large tail latency (by
> memory zeroing, memory reclaiming or existing populated PTEs). That may not
> be appreciated by some applications. Thanks.

Good point. Based on all the discussion, I think the conclusion is:

 - ask the arch for the preferred folio order with a runtime function
 - check that the folio will fit (racy)
   - if it does not fit, fall back to order-0
 - allocate the folio
 - take the PTL
 - check that the folio still fits (not racy)
   - if it does not fit, fall back to order-0

So in the worst case the latency will be allocating and zeroing a large folio,
then allocating and zeroing an order-0 folio, which is obviously better than
iterating through every order from preferred to 0. I'll work this flow into a
v2.

> Regards
> Yin, Fengwei
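The agreed flow maps onto something like the sketch below. This is only an approximation of what a v2 might look like, not the eventual code: arch_wants_pte_order() is assumed here to take the vma (per the earlier suggestion in the thread), anon_folio_fits_vma() is the hypothetical pre-check sketched above, and the PTL re-check is only indicated by a comment.

/*
 * Rough sketch of the flow described above, for do_anonymous_page();
 * helper names are placeholders from the discussion, not final code.
 */
static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        int order = arch_wants_pte_order(vma);  /* arch's preferred order */
        struct folio *folio = NULL;

        /* Racy pre-check; fall straight back to order-0 rather than retrying. */
        if (order && !anon_folio_fits_vma(vma, vmf->address, order))
                order = 0;

        if (order)
                folio = vma_alloc_zeroed_movable_folio(vma, vmf->address,
                                __GFP_NORETRY | __GFP_NOWARN, order);
        if (!folio)
                folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);

        /*
         * The caller then takes the PTL and re-checks (now reliably) that the
         * covered PTEs are still none; if not, it frees a large folio and
         * falls back to order-0 instead of trying intermediate orders.
         */
        return folio;
}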
diff --git a/mm/memory.c b/mm/memory.c
index 367bbbb29d91..53896d46e686 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3001,6 +3001,39 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
         return 0;
 }
 
+static inline struct folio *vma_alloc_movable_folio(struct vm_area_struct *vma,
+                                unsigned long vaddr, int order, bool zeroed)
+{
+        gfp_t gfp = order > 0 ? __GFP_NORETRY | __GFP_NOWARN : 0;
+
+        if (zeroed)
+                return vma_alloc_zeroed_movable_folio(vma, vaddr, gfp, order);
+        else
+                return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | gfp, order, vma,
+                                vaddr, false);
+}
+
+/*
+ * Opportunistically attempt to allocate high-order folios, retrying with lower
+ * orders all the way to order-0, until success. order-1 allocations are skipped
+ * since a folio must be at least order-2 to work with the THP machinery. The
+ * user must check what they got with folio_order(). vaddr can be any virtual
+ * address that will be mapped by the allocated folio.
+ */
+static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
+                                unsigned long vaddr, int order, bool zeroed)
+{
+        struct folio *folio;
+
+        for (; order > 1; order--) {
+                folio = vma_alloc_movable_folio(vma, vaddr, order, zeroed);
+                if (folio)
+                        return folio;
+        }
+
+        return vma_alloc_movable_folio(vma, vaddr, 0, zeroed);
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *