Message ID: <20230626171430.3167004-1-ryan.roberts@arm.com>
Headers:
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Yin Fengwei <fengwei.yin@intel.com>, David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Geert Uytterhoeven <geert@linux-m68k.org>, Christian Borntraeger <borntraeger@linux.ibm.com>, Sven Schnelle <svens@linux.ibm.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, "H. Peter Anvin" <hpa@zytor.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org, linux-s390@vger.kernel.org
Subject: [PATCH v1 00/10] variable-order, large folios for anonymous memory
Date: Mon, 26 Jun 2023 18:14:20 +0100
Message-Id: <20230626171430.3167004-1-ryan.roberts@arm.com>
Series: variable-order, large folios for anonymous memory
Message
Ryan Roberts
June 26, 2023, 5:14 p.m. UTC
Hi All,

Following on from the previous RFCv2 [1], this series implements variable order,
large folios for anonymous memory. The objective of this is to improve
performance by allocating larger chunks of memory during anonymous page faults:

- Since SW (the kernel) is dealing with larger chunks of memory than base
  pages, there are efficiency savings to be had; fewer page faults, batched PTE
  and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
  overhead. This should benefit all architectures.
- Since we are now mapping physically contiguous chunks of memory, we can take
  advantage of HW TLB compression techniques. A reduction in TLB pressure
  speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
  TLB entries; "the contiguous bit" (architectural) and HPA (uarch).

This patch set deals with the SW side of things only and, based on feedback from
the RFC, aims to be the most minimal initial change, upon which future
incremental changes can be added. For this reason, the new behaviour is hidden
behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
default. Although the code has been refactored to parameterize the desired order
of the allocation, when the feature is disabled (by forcing the order to always
be 0) my performance tests measure no regression. So I'm hoping this will be a
suitable mechanism to allow incremental submissions to the kernel without
affecting the rest of the world.

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
[2], which is a hard dependency. I'm not sure of Matthew's exact plans for
getting that series into the kernel, but I'm hoping we can start the review
process on this patch set independently. I have a branch at [3].

I've posted a separate series concerning the HW part (contpte mapping) for arm64
at [4].


Performance
-----------

Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
javascript benchmark running in Chromium). Both cases are running on Ampere
Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
is repeated 15 times over 5 reboots and averaged.

All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
'anonfolio' is the full patch set similar to the RFC with the additional changes
to the extra 3 fault paths. The rest of the configs are described at [4].

Kernel Compilation (smaller is better):

| kernel          |   real-time |   kern-time |   user-time |
|:----------------|------------:|------------:|------------:|
| baseline-4k     |        0.0% |        0.0% |        0.0% |
| anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
| anonfolio       |       -5.4% |      -46.0% |       -0.3% |
| contpte         |       -6.8% |      -45.7% |       -2.1% |
| exefolio        |       -8.4% |      -46.4% |       -3.7% |
| baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
| baseline-64k    |      -10.5% |      -66.0% |       -3.5% |

Speedometer 2.0 (bigger is better):

| kernel          |   runs_per_min |
|:----------------|---------------:|
| baseline-4k     |           0.0% |
| anonfolio-basic |           0.7% |
| anonfolio       |           1.2% |
| contpte         |           3.1% |
| exefolio        |           4.2% |
| baseline-16k    |           5.3% |


Changes since RFCv2
-------------------

- Simplified series to bare minimum (on David Hildenbrand's advice)
- Removed changes to 3 fault paths:
  - write fault on zero page: wp_page_copy()
  - write fault on non-exclusive CoW page: wp_page_copy()
  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
- Only 1 fault path change remains:
  - write fault on unallocated address: do_anonymous_page()
- Removed support patches that are no longer needed
- Added Kconfig CONFIG_LARGE_ANON_FOLIO and friends
  - Whole feature defaults to off
  - Arch opts-in to allowing feature and provides max allocation order


Future Work
-----------

Once this series is in, there are some more incremental changes I plan to
follow up with:

- Add the other 3 fault path changes back in
- Properly support pte-mapped folios for:
  - numa balancing (do_numa_page())
  - fix assumptions about exclusivity for large folios in madvise()
  - compaction (although I think this is already a problem for large folios in
    the file cache so perhaps someone is working on it?)

[1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v1
[4] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/

Thanks,
Ryan


Ryan Roberts (10):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Implement folio_remove_rmap_range()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Batch-zap large anonymous folio PTE mappings
  mm: Kconfig hooks to determine max anon folio allocation order
  arm64: mm: Declare support for large anonymous folios
  mm: Allocate large folios for anonymous memory

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/Kconfig              |  13 ++
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 ++-
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   4 +
 mm/Kconfig                      |  39 ++++
 mm/memory.c                     | 324 ++++++++++++++++++++++++++++++--
 mm/rmap.c                       | 107 ++++++++++-
 14 files changed, 506 insertions(+), 44 deletions(-)

-- 
2.25.1
Comments
On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> Following on from the previous RFCv2 [1], this series implements variable
> order, large folios for anonymous memory. [...]

Thanks for pushing this forward!

> Changes since RFCv2
> -------------------
>
> - Simplified series to bare minimum (on David Hildenbrand's advice)

My impression is that this series still includes many pieces that can
be split out and discussed separately with followup series.

(I skipped 04/10 and will look at it tomorrow.)
On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > Hi All,
> >
> > Following on from the previous RFCv2 [1], this series implements variable
> > order, large folios for anonymous memory. [...]
>
> Thanks for pushing this forward!
>
> > Changes since RFCv2
> > -------------------
> >
> > - Simplified series to bare minimum (on David Hildenbrand's advice)
>
> My impression is that this series still includes many pieces that can
> be split out and discussed separately with followup series.
>
> (I skipped 04/10 and will look at it tomorrow.)

I went through the series twice. Here is what I think a bare minimum
series (easier to review/debug/land) would look like:
1. a new arch specific function providing a preferred order within
   (0, PMD_ORDER).
2. an extended anon folio alloc API taking that order (02/10, partially).
3. an updated folio_add_new_anon_rmap() covering the large() &&
   !pmd_mappable() case (similar to 04/10).
4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
   (06/10, reviewed-by provided).
5. finally, use the extended anon folio alloc API with the arch
   preferred order in do_anonymous_page() (10/10, partially).

The rest can be split out into separate series and move forward in
parallel with probably a long list of things we need/want to do.
On 27/06/2023 08:49, Yu Zhao wrote:
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
>> [...]
>>
>> My impression is that this series still includes many pieces that can
>> be split out and discussed separately with followup series.
>>
>> (I skipped 04/10 and will look at it tomorrow.)
>
> I went through the series twice. Here is what I think a bare minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch specific function providing a preferred order within (0,
> PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
> !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
> preferred order in do_anonymous_page() (10/10, partially).
>
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Thanks for the fast review - I really appreciate it!

I've responded to many of your comments. I'd appreciate it if we can close
those points; then I will work up a v2.

Thanks,
Ryan
On Tue, Jun 27, 2023 at 3:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/06/2023 08:49, Yu Zhao wrote:
> > [...]
> >
> > I went through the series twice. Here is what I think a bare minimum
> > series (easier to review/debug/land) would look like:
===
> > 1. a new arch specific function providing a preferred order within (0,
> > PMD_ORDER).
> > 2. an extended anon folio alloc API taking that order (02/10, partially).
> > 3. an updated folio_add_new_anon_rmap() covering the large() &&
> > !pmd_mappable() case (similar to 04/10).
> > 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
> > (06/10, reviewed-by provided).
> > 5. finally, use the extended anon folio alloc API with the arch
> > preferred order in do_anonymous_page() (10/10, partially).
===
> > The rest can be split out into separate series and move forward in
> > parallel with probably a long list of things we need/want to do.
>
> Thanks for the fast review - I really appreciate it!
>
> I've responded to many of your comments. I'd appreciate it if we can close
> those points; then I will work up a v2.

Thanks! Based on the latest discussion here [1], my original list above
can be optionally reduced to 4 patches: item 2 can be quashed into item 5.

Also please make sure we have only one global (applies to all archs)
Kconfig option, and it should be added in item 5:

  if TRANSPARENT_HUGEPAGE
  config FLEXIBLE/VARIABLE_THP # or whatever name you see fit
  endif

(How many new Kconfig options are added within arch/arm64/ is not a
concern of MM.)

And please make sure it's disabled by default, because we are still
missing many important functions, e.g., I don't think we can mlock()
when large() && !pmd_mappable(); see mlock_pte_range() and
mlock_vma_folio(). We can fix it along with many things later, but we
need to present a plan and a schedule now. Otherwise, there would be
pushback if we try to land the series without supporting mlock().

Do you or Fengwei plan to take on it? (I personally don't.) If not,
I'll try to find someone from our team to look at it. (It'd be more
scalable if we have a coordinated group of people individually solving
different problems.)

[1] https://lore.kernel.org/r/b2c81404-67df-f841-ef02-919e841f49f2@arm.com/
Hi Yu,

On 6/29/23 02:22, Yu Zhao wrote:
> And please make sure it's disabled by default, because we are still
> missing many important functions, e.g., I don't think we can mlock()
> when large() && !pmd_mappable(), see mlock_pte_range() and
> mlock_vma_folio(). We can fix it along with many things later, but we
> need to present a plan and a schedule now. Otherwise, there would be
> pushback if we try to land the series without supporting mlock().
>
> Do you or Fengwei plan to take on it? (I personally don't.) If not,
Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.

Regards
Yin, Fengwei

> I'll try to find someone from our team to look at it. (It'd be more
> scalable if we have a coordinated group of people individually solving
> different problems.)
On Wed, Jun 28, 2023 at 5:59 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> Hi Yu,
>
> On 6/29/23 02:22, Yu Zhao wrote:
> > [...]
> >
> > Do you or Fengwei plan to take on it? (I personally don't.) If not,
> Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.

Great. Thanks!

Other places that have a similar problem but are probably easier to
fix than the mlock() case:
* madvise_cold_or_pageout_pte_range()
* shrink_folio_list()
On 6/29/23 08:27, Yu Zhao wrote:
> On Wed, Jun 28, 2023 at 5:59 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>> [...]
>> Do you mean the mlock() with large folio? Yes. I can work on it. Thanks.
>
> Great. Thanks!
>
> Other places that have a similar problem but are probably easier to
> fix than the mlock() case:
> * madvise_cold_or_pageout_pte_range()
This one was on my radar. :)

Regards
Yin, Fengwei

> * shrink_folio_list()
On Tue, Jun 27, 2023 at 12:49 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > > Hi All,
> > >
> > > Following on from the previous RFCv2 [1], this series implements variable order,
> > > large folios for anonymous memory. The objective of this is to improve
> > > performance by allocating larger chunks of memory during anonymous page faults:
> > >
> > > - Since SW (the kernel) is dealing with larger chunks of memory than base
> > >   pages, there are efficiency savings to be had; fewer page faults, batched PTE
> > >   and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
> > >   overhead. This should benefit all architectures.
> > > - Since we are now mapping physically contiguous chunks of memory, we can take
> > >   advantage of HW TLB compression techniques. A reduction in TLB pressure
> > >   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> > >   TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
> > >
> > > This patch set deals with the SW side of things only and, based on feedback from
> > > the RFC, aims to be the most minimal initial change, upon which future
> > > incremental changes can be added. For this reason, the new behaviour is hidden
> > > behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
> > > default. Although the code has been refactored to parameterize the desired order
> > > of the allocation, when the feature is disabled (by forcing the order to be
> > > always 0) my performance tests measure no regression. So I'm hoping this will be
> > > a suitable mechanism to allow incremental submissions to the kernel without
> > > affecting the rest of the world.
> > >
> > > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> > > [2], which is a hard dependency.
> > > I'm not sure of Matthew's exact plans for
> > > getting that series into the kernel, but I'm hoping we can start the review
> > > process on this patch set independently. I have a branch at [3].
> > >
> > > I've posted a separate series concerning the HW part (contpte mapping) for arm64
> > > at [4].
> > >
> > >
> > > Performance
> > > -----------
> > >
> > > Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
> > > javascript benchmark running in Chromium). Both cases are running on Ampere
> > > Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
> > > is repeated 15 times over 5 reboots and averaged.
> > >
> > > All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
> > > 'anonfolio' is the full patch set similar to the RFC with the additional changes
> > > to the extra 3 fault paths. The rest of the configs are described at [4].
> > >
> > > Kernel Compilation (smaller is better):
> > >
> > > | kernel          | real-time | kern-time | user-time |
> > > |:----------------|----------:|----------:|----------:|
> > > | baseline-4k     |      0.0% |      0.0% |      0.0% |
> > > | anonfolio-basic |     -5.3% |    -42.9% |     -0.6% |
> > > | anonfolio       |     -5.4% |    -46.0% |     -0.3% |
> > > | contpte         |     -6.8% |    -45.7% |     -2.1% |
> > > | exefolio        |     -8.4% |    -46.4% |     -3.7% |
> > > | baseline-16k    |     -8.7% |    -49.2% |     -3.7% |
> > > | baseline-64k    |    -10.5% |    -66.0% |     -3.5% |
> > >
> > > Speedometer 2.0 (bigger is better):
> > >
> > > | kernel          | runs_per_min |
> > > |:----------------|-------------:|
> > > | baseline-4k     |         0.0% |
> > > | anonfolio-basic |         0.7% |
> > > | anonfolio       |         1.2% |
> > > | contpte         |         3.1% |
> > > | exefolio        |         4.2% |
> > > | baseline-16k    |         5.3% |
> >
> > Thanks for pushing this forward!
> >
> > > Changes since RFCv2
> > > -------------------
> > >
> > > - Simplified series to bare minimum (on David Hildenbrand's advice)
> >
> > My impression is that this series still includes many pieces that can
> > be split out and discussed separately with followup series.
> >
> > (I skipped 04/10 and will look at it tomorrow.)
>
> I went through the series twice. Here is what I think a bare minimum
> series (easier to review/debug/land) would look like:
> 1. a new arch specific function providing a preferred order within (0,
>    PMD_ORDER).
> 2. an extended anon folio alloc API taking that order (02/10, partially).
> 3. an updated folio_add_new_anon_rmap() covering the large() &&
>    !pmd_mappable() case (similar to 04/10).
> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
>    (06/10, reviewed-by provided).
> 5. finally, use the extended anon folio alloc API with the arch
>    preferred order in do_anonymous_page() (10/10, partially).
>
> The rest can be split out into separate series and move forward in
> parallel with probably a long list of things we need/want to do.

Yeah, the suggestion makes sense to me. And I'd like to go with the
simplest way unless there is strong justification for extra
optimization for the time being, IMHO.

>
On 28/06/2023 19:22, Yu Zhao wrote:
> On Tue, Jun 27, 2023 at 3:59 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/06/2023 08:49, Yu Zhao wrote:
>>> On Mon, Jun 26, 2023 at 9:30 PM Yu Zhao <yuzhao@google.com> wrote:
>>>>
>>>> On Mon, Jun 26, 2023 at 11:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> Following on from the previous RFCv2 [1], this series implements variable order,
>>>>> large folios for anonymous memory. The objective of this is to improve
>>>>> performance by allocating larger chunks of memory during anonymous page faults:
>>>>>
>>>>> - Since SW (the kernel) is dealing with larger chunks of memory than base
>>>>>   pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>>>   and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>>>>   overhead. This should benefit all architectures.
>>>>> - Since we are now mapping physically contiguous chunks of memory, we can take
>>>>>   advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>>>   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>>>   TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>>>
>>>>> This patch set deals with the SW side of things only and, based on feedback from
>>>>> the RFC, aims to be the most minimal initial change, upon which future
>>>>> incremental changes can be added. For this reason, the new behaviour is hidden
>>>>> behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
>>>>> default. Although the code has been refactored to parameterize the desired order
>>>>> of the allocation, when the feature is disabled (by forcing the order to be
>>>>> always 0) my performance tests measure no regression. So I'm hoping this will be
>>>>> a suitable mechanism to allow incremental submissions to the kernel without
>>>>> affecting the rest of the world.
>>>>>
>>>>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
>>>>> [2], which is a hard dependency. I'm not sure of Matthew's exact plans for
>>>>> getting that series into the kernel, but I'm hoping we can start the review
>>>>> process on this patch set independently. I have a branch at [3].
>>>>>
>>>>> I've posted a separate series concerning the HW part (contpte mapping) for arm64
>>>>> at [4].
>>>>>
>>>>>
>>>>> Performance
>>>>> -----------
>>>>>
>>>>> Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
>>>>> javascript benchmark running in Chromium). Both cases are running on Ampere
>>>>> Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
>>>>> is repeated 15 times over 5 reboots and averaged.
>>>>>
>>>>> All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
>>>>> 'anonfolio' is the full patch set similar to the RFC with the additional changes
>>>>> to the extra 3 fault paths. The rest of the configs are described at [4].
>>>>>
>>>>> Kernel Compilation (smaller is better):
>>>>>
>>>>> | kernel          | real-time | kern-time | user-time |
>>>>> |:----------------|----------:|----------:|----------:|
>>>>> | baseline-4k     |      0.0% |      0.0% |      0.0% |
>>>>> | anonfolio-basic |     -5.3% |    -42.9% |     -0.6% |
>>>>> | anonfolio       |     -5.4% |    -46.0% |     -0.3% |
>>>>> | contpte         |     -6.8% |    -45.7% |     -2.1% |
>>>>> | exefolio        |     -8.4% |    -46.4% |     -3.7% |
>>>>> | baseline-16k    |     -8.7% |    -49.2% |     -3.7% |
>>>>> | baseline-64k    |    -10.5% |    -66.0% |     -3.5% |
>>>>>
>>>>> Speedometer 2.0 (bigger is better):
>>>>>
>>>>> | kernel          | runs_per_min |
>>>>> |:----------------|-------------:|
>>>>> | baseline-4k     |         0.0% |
>>>>> | anonfolio-basic |         0.7% |
>>>>> | anonfolio       |         1.2% |
>>>>> | contpte         |         3.1% |
>>>>> | exefolio        |         4.2% |
>>>>> | baseline-16k    |         5.3% |
>>>>
>>>> Thanks for pushing this forward!
>>>>
>>>>> Changes since RFCv2
>>>>> -------------------
>>>>>
>>>>> - Simplified series to bare minimum (on David Hildenbrand's advice)
>>>>
>>>> My impression is that this series still includes many pieces that can
>>>> be split out and discussed separately with followup series.
>>>>
>>>> (I skipped 04/10 and will look at it tomorrow.)
>>>
>>> I went through the series twice. Here is what I think a bare minimum
>>> series (easier to review/debug/land) would look like:
>
> ===
>
>>> 1. a new arch specific function providing a preferred order within (0,
>>>    PMD_ORDER).
>>> 2. an extended anon folio alloc API taking that order (02/10, partially).
>>> 3. an updated folio_add_new_anon_rmap() covering the large() &&
>>>    !pmd_mappable() case (similar to 04/10).
>>> 4. s/folio_test_pmd_mappable/folio_test_large/ in page_remove_rmap()
>>>    (06/10, reviewed-by provided).
>>> 5. finally, use the extended anon folio alloc API with the arch
>>>    preferred order in do_anonymous_page() (10/10, partially).
>
> ===
>
>>> The rest can be split out into separate series and move forward in
>>> parallel with probably a long list of things we need/want to do.
>>
>> Thanks for the fast review - I really appreciate it!
>>
>> I've responded to many of your comments. I'd appreciate it if we can
>> close those points; then I will work up a v2.
>
> Thanks!
>
> Based on the latest discussion here [1], my original list above can be
> optionally reduced to 4 patches: item 2 can be squashed into item 5.
>
> Also please make sure we have only one global (applies to all archs)
> Kconfig option, and it should be added in item 5:
>
> if TRANSPARENT_HUGEPAGE
> config FLEXIBLE/VARIABLE_THP # or whatever name you see fit
> endif

Naming is always the hardest part. I've been calling it LARGE_ANON_FOLIO
up until now. But I think you are right that we should show that it is
related to THP, so I'll go with FLEXIBLE_THP for v2, and let people
shout if they hate it.
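[Editorial note: fleshed out into valid Kconfig syntax, the option Yu sketches might look like the fragment below. The name, prompt, and help text are placeholders pending the naming discussion in this thread; only the if/endif structure and the disabled-by-default requirement come from the messages above.]

```kconfig
if TRANSPARENT_HUGEPAGE

config FLEXIBLE_THP
	bool "Flexible order transparent hugepages (large anonymous folios)"
	default n
	help
	  Allocate large folios of a variable, arch-preferred order for
	  anonymous memory, reducing page faults and per-page kernel
	  overhead. Disabled by default while paths such as mlock() do
	  not yet handle large but not PMD-mappable folios.

endif
```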
If we are not letting the arch declare that it supports FLEXIBLE_THP,
then I think we need the default version of arch_wants_pte_order() to
return a value higher than 0 (which is what I have it returning at the
moment). Because otherwise, for an arch that hasn't defined its own
version of arch_wants_pte_order(), FLEXIBLE_THP on vs off will give the
same result. So I propose to set the default to
ilog2(SZ_64K >> PAGE_SHIFT). Shout if you have any concerns.

>
> (How many new Kconfig options added within arch/arm64/ is not a concern of MM.)
>
> And please make sure it's disabled by default,

Done.

> because we are still
> missing many important functions, e.g., I don't think we can mlock()
> when large() && !pmd_mappable(), see mlock_pte_range() and
> mlock_vma_folio(). We can fix it along with many things later, but we
> need to present a plan and a schedule now. Otherwise, there would be
> pushback if we try to land the series without supporting mlock().

There are other areas that I'm aware of. I'll put together a table and
send it out once I have v2 out the door (hopefully tomorrow or Monday).
Hopefully we can work together to fill it in and figure out who can do
what? I'm certainly planning to continue to push this work forwards
beyond this initial patch set.

Thanks,
Ryan

> Do you or Fengwei plan to take on it? (I personally don't.) If not,
> I'll try to find someone from our team to look at it. (It'd be more
> scalable if we have a coordinated group of people individually solving
> different problems.)
>
> [1] https://lore.kernel.org/r/b2c81404-67df-f841-ef02-919e841f49f2@arm.com/