Message ID | 20231123065708.91345-1-luxu.kernel@bytedance.com |
---|---|
Headers | From: Xu Lu <luxu.kernel@bytedance.com> To: paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu, ardb@kernel.org, anup@brainfault.org, atishp@atishpatra.org Cc: dengliang.1214@bytedance.com, xieyongji@bytedance.com, lihangjing@bytedance.com, songmuchun@bytedance.com, punit.agrawal@bytedance.com, linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org, Xu Lu <luxu.kernel@bytedance.com> Subject: [RFC PATCH V1 00/11] riscv: Introduce 64K base page Date: Thu, 23 Nov 2023 14:56:57 +0800 Message-Id: <20231123065708.91345-1-luxu.kernel@bytedance.com> X-Mailer: git-send-email 2.39.3 (Apple Git-145) MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series | riscv: Introduce 64K base page |
Message
Xu Lu
Nov. 23, 2023, 6:56 a.m. UTC
Some existing architectures like ARM support base pages larger than 4K
because their MMUs support more page sizes. Thus, besides hugetlb pages
and transparent huge pages, these architectures have another way to
enjoy the benefits of fewer TLB misses without worrying about the cost
of splitting and merging huge pages. However, on architectures whose MMU
only supports 4K pages, a larger base page is currently unavailable.

This patch series attempts to break through this MMU limitation and
support a larger base page on RISC-V, which only supports a 4K page size
now.

The key idea for implementing a larger base page on top of a 4K MMU is
to decouple the MMU page from the base page as seen by the kernel mm,
which we call the software page. In contrast to the software page, we
call the MMU page the hardware page. The two kinds of pages differ as
follows:

1. The kernel memory management module manages, allocates and maps
   memory at the granularity of the software page, which should not be
   restricted by the MMU and can be larger than the hardware page.

2. Architecture page table operations should be carried out from the
   MMU's perspective, and page table entries are encoded at the
   granularity of the hardware page, which is 4K on the RISC-V MMU now.

The main work of decoupling these two kinds of pages lies in
architecture code. For example, we turn the pte_t struct into an array
of page table entries so that it matches the software page, which can be
larger than the hardware page, and adapt the page table operations
accordingly. For a 64K software base page, the pte_t struct now contains
16 contiguous page table entries which point to 16 contiguous 4K
hardware pages.

To achieve the benefits of a large base page, we apply Svnapot to each
base page's mapping. The Svnapot extension on RISC-V is like the
contiguous PTE on ARM64. It allows the ptes of a naturally aligned
power-of-2 sized memory range to be encoded in the same format to save
TLB space.

This patch series is the first version and is based on v6.7-rc1. This
version supports both bare metal and virtualization scenarios.

In the next versions, we will continue with the following work:

1. Reduce the memory usage of page table pages, as each one only uses
   4K of space while costing a whole base page.

2. When the IMSIC interrupt file is smaller than 64K, extra isolation
   measures for the interrupt file are needed. (S)PMP and IOPMP may be
   good choices.

3. More consideration is needed to make this patch series work better
   with folios.

4. Support the 64K base page on the IOMMU.

5. A performance test is scheduled to verify the actual performance
   improvement and the decrease in TLB miss rate.

Thanks in advance for comments.

Xu Lu (11):
  mm: Fix misused APIs on huge pte
  riscv: Introduce concept of hardware base page
  riscv: Adapt pte struct to gap between hw page and sw page
  riscv: Adapt pte operations to gap between hw page and sw page
  riscv: Decouple pmd operations and pte operations
  riscv: Distinguish pmd huge pte and napot huge pte
  riscv: Adapt satp operations to gap between hw page and sw page
  riscv: Apply Svnapot for base page mapping
  riscv: Adjust fix_btmap slots number to match variable page size
  riscv: kvm: Adapt kvm to gap between hw page and sw page
  riscv: Introduce 64K page size

 arch/Kconfig                        |   1 +
 arch/riscv/Kconfig                  |  28 +++
 arch/riscv/include/asm/fixmap.h     |   3 +-
 arch/riscv/include/asm/hugetlb.h    |  71 ++++++-
 arch/riscv/include/asm/page.h       |  16 +-
 arch/riscv/include/asm/pgalloc.h    |  21 ++-
 arch/riscv/include/asm/pgtable-32.h |   2 +-
 arch/riscv/include/asm/pgtable-64.h |  45 +++--
 arch/riscv/include/asm/pgtable.h    | 282 +++++++++++++++++++++++-----
 arch/riscv/kernel/efi.c             |   2 +-
 arch/riscv/kernel/head.S            |   4 +-
 arch/riscv/kernel/hibernate.c       |   3 +-
 arch/riscv/kvm/mmu.c                | 198 +++++++++++++------
 arch/riscv/mm/context.c             |   7 +-
 arch/riscv/mm/fault.c               |   1 +
 arch/riscv/mm/hugetlbpage.c         |  42 +++--
 arch/riscv/mm/init.c                |  25 +--
 arch/riscv/mm/kasan_init.c          |   7 +-
 arch/riscv/mm/pageattr.c            |   2 +-
 fs/proc/task_mmu.c                  |   2 +-
 include/asm-generic/hugetlb.h       |   7 +
 include/asm-generic/pgtable-nopmd.h |   1 +
 include/linux/pgtable.h             |   6 +
 mm/hugetlb.c                        |   2 +-
 mm/migrate.c                        |   5 +-
 mm/mprotect.c                       |   2 +-
 mm/rmap.c                           |  10 +-
 mm/vmalloc.c                        |   3 +-
 28 files changed, 616 insertions(+), 182 deletions(-)
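The software/hardware page split described in the cover letter can be sketched roughly as follows. This is a user-space illustration only, not code from the series: the names sw_pte_t, sw_pte_make, hw_pte_mknapot64k and hw_pte_napot_base_pfn are hypothetical. It assumes the Svnapot encoding from the RISC-V privileged specification, where the N bit is PTE bit 63 and a 64K range is marked by setting the low four PPN bits to 0b1000, so that all 16 hardware entries of one software page end up holding the same value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4K hardware page, 64K software page. */
#define HW_PAGE_SHIFT 12
#define SW_PAGE_SHIFT 16
#define NAPOT_ORDER (SW_PAGE_SHIFT - HW_PAGE_SHIFT)    /* 4 */
#define HW_PAGES_PER_SW_PAGE (1u << NAPOT_ORDER)        /* 16 */

/* RISC-V Sv39/Sv48 PTE layout bits used by this sketch. */
#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_PPN_SHIFT 10
#define PTE_PPN_MASK ((1ULL << 44) - 1)                 /* PPN is PTE bits 53:10 */
#define PTE_NAPOT (1ULL << 63)                          /* Svnapot N bit */
#define NAPOT_LOW_MASK ((1ULL << NAPOT_ORDER) - 1)      /* low 4 PPN bits */
#define NAPOT_SIZE_BIT (1ULL << (NAPOT_ORDER - 1))      /* 0b1000 means 64K */

static_assert(HW_PAGES_PER_SW_PAGE == 16, "64K / 4K must be 16");

/* Hypothetical software pte: one entry per 4K hardware page it covers. */
typedef struct {
    uint64_t ptes[HW_PAGES_PER_SW_PAGE];
} sw_pte_t;

/* Encode one hardware PTE as part of a 64K NAPOT range. */
static uint64_t hw_pte_mknapot64k(uint64_t hw_pfn, uint64_t prot)
{
    /* Replace the low 4 PPN bits with the 64K size marker. */
    uint64_t ppn = (hw_pfn & ~NAPOT_LOW_MASK) | NAPOT_SIZE_BIT;

    return (ppn << PTE_PPN_SHIFT) | prot | PTE_NAPOT;
}

/* Build the 16 hardware entries that map one 64K software page. */
static sw_pte_t sw_pte_make(uint64_t sw_pfn, uint64_t prot)
{
    sw_pte_t pte;
    uint64_t first_hw_pfn = sw_pfn << NAPOT_ORDER;

    for (unsigned int i = 0; i < HW_PAGES_PER_SW_PAGE; i++)
        pte.ptes[i] = hw_pte_mknapot64k(first_hw_pfn + i, prot);

    return pte;
}

/* Recover the first 4K frame number of the 64K range from a NAPOT PTE. */
static uint64_t hw_pte_napot_base_pfn(uint64_t hw_pte)
{
    uint64_t ppn = (hw_pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK;

    return ppn & ~NAPOT_LOW_MASK;
}
```

Because every entry in the array carries the same encoded value, a TLB that implements Svnapot may cache the whole 64K range in a single entry, which is where the expected TLB saving comes from. How the actual patches lay out and update pte_t may differ from this sketch.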
Comments
On Thu, Nov 23, 2023, at 07:56, Xu Lu wrote:
> Some existing architectures like ARM supports base page larger than 4K
> as their MMU supports more page sizes. Thus, besides hugetlb page and
> transparent huge page, there is another way for these architectures to
> enjoy the benefits of fewer TLB misses without worrying about cost of
> splitting and merging huge pages. However, on architectures with only
> 4K MMU, larger base page is unavailable now.
>
> This patch series attempts to break through the limitation of MMU and
> supports larger base page on RISC-V, which only supports 4K page size
> now.
>
> The key idea to implement larger base page based on 4K MMU is to
> decouple the MMU page from the base page in view of kernel mm, which we
> denote as software page. In contrary to software page, we denote the MMU
> page as hardware page. Below is the difference between these two kinds
> of pages.

We have played with this on arm32, but the conclusion is that it's
almost never worth the memory overhead, as most workloads end up
using several times the amount of physical RAM after each small
file in the page cache and any sparsely populated anonymous memory
area explodes to up to 16 times the size.

On ppc64, using 64KB pages was a way to get around limitations in
their hashed MMU design, which had a much bigger performance impact
because any page table access ends up being a cache miss. On arm64,
there are some CPUs like the Fujitsu A64FX that are really bad at
4KB pages and don't support 16KB pages, so this is the only real
option.

You will see a notable performance benefit in synthetic benchmarks
like speccpu with 64KB pages, or on specific computational
workloads that have large densely packed memory chunks, but for
real workloads, the usual answer is to just use transparent
hugepages for larger mappings and a page size of no more than
16KB for the page cache.

With the work going into using folios in the kernel (see e.g.
https://lwn.net/Articles/932386/), even the workloads that
benefit from 64KB base pages should be better off with 4KB
pages and just using the TLB hints for large folios.

     Arnd
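The "up to 16 times" figure above follows from rounding every allocation up to whole base pages. A minimal arithmetic sketch, with hypothetical helper names and assuming each small file or isolated touch occupies its own base page:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of page cache needed to hold a file of file_size bytes. */
static uint64_t page_cache_footprint(uint64_t file_size, uint64_t page_size)
{
    assert(page_size != 0);
    if (file_size == 0)
        return 0;
    /* Round up to a whole number of base pages. */
    return ((file_size + page_size - 1) / page_size) * page_size;
}

/*
 * Bytes of anonymous memory faulted in when a program touches one byte
 * at offsets 0, stride, 2 * stride, ... (n_touches touches in total).
 */
static uint64_t sparse_touch_footprint(uint64_t n_touches, uint64_t stride,
                                       uint64_t page_size)
{
    uint64_t pages;

    assert(page_size != 0);
    if (n_touches == 0)
        return 0;
    if (stride >= page_size)
        pages = n_touches;      /* every touch lands on a new page */
    else
        pages = ((n_touches - 1) * stride) / page_size + 1;
    return pages * page_size;
}
```

For example, a 1-byte file costs 4096 bytes of page cache with 4KB pages and 65536 bytes with 64KB pages, a factor of 16. The gap shrinks as files grow: a 100000-byte file needs 102400 bytes versus 131072 bytes. Likewise, 100 touches spaced 64KB apart fault in 409600 bytes with 4KB pages and 6553600 bytes with 64KB pages.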
Thanks a lot for your reply! And sorry for replying so late.

On Thu, Nov 23, 2023 at 5:30 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Nov 23, 2023, at 07:56, Xu Lu wrote:
> > Some existing architectures like ARM supports base page larger than 4K
> > as their MMU supports more page sizes. Thus, besides hugetlb page and
> > transparent huge page, there is another way for these architectures to
> > enjoy the benefits of fewer TLB misses without worrying about cost of
> > splitting and merging huge pages. However, on architectures with only
> > 4K MMU, larger base page is unavailable now.
> >
> > This patch series attempts to break through the limitation of MMU and
> > supports larger base page on RISC-V, which only supports 4K page size
> > now.
> >
> > The key idea to implement larger base page based on 4K MMU is to
> > decouple the MMU page from the base page in view of kernel mm, which we
> > denote as software page. In contrary to software page, we denote the MMU
> > page as hardware page. Below is the difference between these two kinds
> > of pages.
>
> We have played with this on arm32, but the conclusion is that it's
> almost never worth the memory overhead, as most workloads end up
> using several times the amount of physical RAM after each small
> file in the page cache and any sparse populated anonymous memory
> area explodes to up to 16 times the size.
>
> On ppc64, using 64KB pages was way to get around limitations in
> their hashed MMU design, which had a much bigger performance impact
> because any page table access ends up being a cache miss. On arm64,
> there are some CPUs like the Fujitsu A64FX that are really bad at
> 4KB pages and don't support 16KB pages, so this is the only real
> option.
>
> You will see a notable performance benefit in synthetic benchmarks
> like speccpu with 64KB pages, or on specific computational
> workloads that have large densely packed memory chunks, but for
> real workloads, the usual answer is to just use transparent
> hugepages for larger mappings and a page size of no more than
> 16KB for the page cache.

Actually, we did find real performance benefits from a 64K page size in
production business scenarios. On an Ampere ARM server, when applying a
64K base page size, we saw an improvement of 2.5x for both QPS and
latency on Redis, a performance improvement of 10~20% on our own NewSQL
database, and 50% on object storage. For MySQL, QPS increases by about
14%, 17.5% and 20% for read-only, write-only and random read/write
workloads respectively, and latency drops by about 13.7%, 15.8% and
14.5% on average.

This is also why we chose to implement a similar feature on RISC-V in
the first place.

>
> With the work going into using folios in the kernel (see e.g.
> https://lwn.net/Articles/932386/), even the workloads that
> benefit from 64KB base pages should be better off with 4KB
> pages and just using the TLB hints for large folios.

Maybe a 64K page size combined with large folios can achieve even more
benefit. As mentioned in this patch[1], a 64K page size kernel combined
with large folios and THPs via cont pte can achieve a speedup of 10.5x
on some memory-intensive workloads on an arm64 SBSA server.

[1] https://lore.kernel.org/all/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/

>
>      Arnd
A gentle ping.

On Thu, Nov 23, 2023 at 2:57 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
>
> Some existing architectures like ARM supports base page larger than 4K
> as their MMU supports more page sizes. Thus, besides hugetlb page and
> transparent huge page, there is another way for these architectures to
> enjoy the benefits of fewer TLB misses without worrying about cost of
> splitting and merging huge pages. However, on architectures with only
> 4K MMU, larger base page is unavailable now.
>
> [...]