Message ID | 20231220054123.1266001-1-maskray@google.com |
---|---|
State | New |
Headers |
Date: Tue, 19 Dec 2023 21:41:23 -0800
Message-ID: <20231220054123.1266001-1-maskray@google.com>
Subject: [PATCH] mm: remove VM_EXEC requirement for THP eligibility
From: Fangrui Song <maskray@google.com>
To: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org
Cc: Song Liu <songliubraving@fb.com>, Yang Shi <shy828301@gmail.com>, Miaohe Lin <linmiaohe@huawei.com>, linux-kernel@vger.kernel.org, Fangrui Song <maskray@google.com>
List-Id: <linux-kernel.vger.kernel.org> |
Series | mm: remove VM_EXEC requirement for THP eligibility |
Commit Message
Fangrui Song
Dec. 20, 2023, 5:41 a.m. UTC
Commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
checking in transparent_hugepage_enabled()") introduced the VM_EXEC
requirement, which is not strictly needed.

lld's default --rosegment option and GNU ld's -z separate-code option
(default on Linux/x86 since binutils 2.31) create a read-only PT_LOAD
segment without the PF_X flag, which should be eligible for THP.

Certain architectures support medium and large code models, where
.lrodata may be placed in a separate read-only PT_LOAD segment, which
should be eligible for THP as well.

Signed-off-by: Fangrui Song <maskray@google.com>
---
include/linux/huge_mm.h | 1 -
1 file changed, 1 deletion(-)
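
To see whether a given binary actually has the kind of segment the commit message describes, one can walk the program headers of the running executable. The following is a minimal sketch, assuming glibc's `dl_iterate_phdr` from `<link.h>`; the `load_stats` struct and the function names are hypothetical and are not part of the patch.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

/* Illustrative tally of PT_LOAD segments in the main executable. */
struct load_stats {
    int total;       /* all PT_LOAD segments */
    int ro_nonexec;  /* readable, neither writable nor executable */
};

/* dl_iterate_phdr callback. glibc visits the main program first, so we
 * inspect that object and return non-zero to stop the iteration. */
static int visit_main_program(struct dl_phdr_info *info, size_t size, void *data)
{
    struct load_stats *st = data;
    (void)size;

    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

        if (ph->p_type != PT_LOAD)
            continue;
        st->total++;
        /* PF_R set, PF_W and PF_X clear: the r--p case from the commit message. */
        if ((ph->p_flags & PF_R) && !(ph->p_flags & (PF_W | PF_X)))
            st->ro_nonexec++;
    }
    return 1;
}

/* Count PT_LOAD segments of the running executable. */
struct load_stats count_load_segments(void)
{
    struct load_stats st = {0, 0};

    dl_iterate_phdr(visit_main_program, &st);
    return st;
}
```

A binary linked with lld's default --rosegment or GNU ld's -z separate-code would be expected to report a non-zero `ro_nonexec`; the exact count depends on the linker and its options.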
Comments
On Tue, Dec 19, 2023 at 9:41 PM Fangrui Song <maskray@google.com> wrote:
>
> Commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
> checking in transparent_hugepage_enabled()") introduced the VM_EXEC
> requirement, which is not strictly needed.
>
> lld's default --rosegment option and GNU ld's -z separate-code option
> (default on Linux/x86 since binutils 2.31) create a read-only PT_LOAD
> segment without the PF_X flag, which should be eligible for THP.
>
> Certain architectures support medium and large code models, where
> .lrodata may be placed in a separate read-only PT_LOAD segment, which
> should be eligible for THP as well.

Yeah, it doesn't have to be VM_EXEC. The original implementation was
restricted to VM_EXEC to minimize the blast radius, and the target use
case is large text segments. Out of curiosity, did you see any
noticeable improvement with this change?

>
> Signed-off-by: Fangrui Song <maskray@google.com>
> ---
>  include/linux/huge_mm.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..4c9e67e9000f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -126,7 +126,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>  	inode = vma->vm_file->f_inode;
>
>  	return (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) &&
> -	       (vma->vm_flags & VM_EXEC) &&
>  	       !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>  }
>
> --
> 2.43.0.472.g3155946c3a-goog
>
On Wed, Dec 20, 2023 at 3:42 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Yeah, it doesn't have to be VM_EXEC. The original implementation was
> restricted to VM_EXEC to minimize the blast radius and the target use
> case is for large text segments. Out of curiosity, did you see any
> noticeable improvement with this change?

Hi Yang,

Thanks for the comment. Frankly, I am not familiar with huge pages...
I noticed this VM_EXEC condition when I was writing this
hugepage-related section in
https://maskray.me/blog/2023-12-17-exploring-the-section-layout-in-linker-output#transparent-huge-pages-for-mapped-files
(Thanks to Alexander Monakov's comment about
CONFIG_READ_ONLY_THP_FOR_FS in
https://mazzo.li/posts/check-huge-page.html).

As dTLB for read-only data is also an important optimization of
file-backed THP, it seems straightforward that we should drop the
VM_EXEC condition :)

On my Arch Linux machine, the r--p page gets split if I invoke
madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE). I haven't figured out
why it behaves so in the presence of the VM_EXEC check.

% g++ test.cc -o ~/tmp/test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152 && sudo ~/tmp/test
__ehdr_start: 0x55f3b1c00000
55f3b1c00000-55f3b1e00000 r--p 00000000 103:03 555277119 /home/ray/tmp/test
55f3b1e00000-55f3b1e01000 r--p 00200000 103:03 555277119 /home/ray/tmp/test
55f3b2000000-55f3b2002000 r-xp 00200000 103:03 555277119 /home/ray/tmp/test
55f3b2201000-55f3b2202000 r--p 00201000 103:03 555277119 /home/ray/tmp/test
55f3b2401000-55f3b2402000 rw-p 00201000 103:03 555277119 /home/ray/tmp/test
55f3b3a9a000-55f3b3abb000 rw-p 00000000 00:00 0 [heap]

It'd be greatly appreciated if someone familiar with
CONFIG_READ_ONLY_THP_FOR_FS could provide some notes on how to use
this feature :)
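
The thread does not include test.cc itself. The following is only a guess at the experiment described above, written in C: it advises the first huge page of the image and copies /proc/self/maps. `HPAGE_SIZE` is assumed to be 2 MiB (the PMD size on x86-64), `__ehdr_start` is the linker-defined symbol provided by GNU ld and lld, and the function names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20) /* assumption: 2 MiB huge pages, as on x86-64 */

/* Linker-defined symbol marking the start of the ELF header. */
extern const char __ehdr_start[] __attribute__((visibility("hidden")));

/* Return non-zero if addr lies on a huge page boundary. */
int is_hpage_aligned(uintptr_t addr)
{
    return (addr & (HPAGE_SIZE - 1)) == 0;
}

/* Request THP for the first huge page of the image. Returns -1 without
 * calling madvise if the image does not start on a huge page boundary
 * (i.e. it was not linked with -z max-page-size=2097152); otherwise
 * returns madvise's result. */
int advise_image_start(void)
{
    if (!is_hpage_aligned((uintptr_t)__ehdr_start))
        return -1;
    return madvise((void *)__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE);
}

/* Copy /proc/self/maps to out; returns 0 on success, -1 on failure. */
int dump_maps(FILE *out)
{
    char line[512];
    FILE *in = fopen("/proc/self/maps", "r");

    if (!in)
        return -1;
    while (fgets(line, sizeof line, in))
        fputs(line, out);
    fclose(in);
    return 0;
}
```

Calling `advise_image_start()` followed by `dump_maps(stdout)` from `main` should produce a listing in the same format as the one quoted above; whether the first r--p mapping appears as one entry or two depends on the kernel's handling of the advice.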
On Wed, Dec 20, 2023 at 8:53 PM Fangrui Song <maskray@google.com> wrote:
>
> Hi Yang,
>
> Thanks for the comment. Frankly, I am not familiar with huge pages...
> I noticed this VM_EXEC condition when I was writing this
> hugepage-related section in
> https://maskray.me/blog/2023-12-17-exploring-the-section-layout-in-linker-output#transparent-huge-pages-for-mapped-files
> (Thanks to Alexander Monakov's comment about
> CONFIG_READ_ONLY_THP_FOR_FS in
> https://mazzo.li/posts/check-huge-page.html).

Thanks for sharing the article. I learnt something about the linker and
loader.

>
> As dTLB for read-only data is also an important optimization of
> file-backed THP, it seems straightforward that we should drop the
> VM_EXEC condition :)

Yeah, as long as the use case is valid, it is definitely fine to lift
the restriction.

>
> On my Arch linux machine, the r--p page gets split if I invoke
> madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE) I haven't figured out
> why it behaves so in the presence of the VM_EXEC check.

What do you mean by "split"? Did the THP get split into small pages? It
depends on the address of __ehdr_start. If it is in the middle of a
VMA, the VMA is going to be split due to the different huge page
attributes.

>
> % g++ test.cc -o ~/tmp/test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152 && sudo ~/tmp/test
> __ehdr_start: 0x55f3b1c00000
> 55f3b1c00000-55f3b1e00000 r--p 00000000 103:03 555277119 /home/ray/tmp/test
> 55f3b1e00000-55f3b1e01000 r--p 00200000 103:03 555277119 /home/ray/tmp/test
> 55f3b2000000-55f3b2002000 r-xp 00200000 103:03 555277119 /home/ray/tmp/test
> 55f3b2201000-55f3b2202000 r--p 00201000 103:03 555277119 /home/ray/tmp/test
> 55f3b2401000-55f3b2402000 rw-p 00201000 103:03 555277119 /home/ray/tmp/test
> 55f3b3a9a000-55f3b3abb000 rw-p 00000000 00:00 0 [heap]
>
>
> It'd be greatly appreciated if someone familiar with
> CONFIG_READ_ONLY_THP_FOR_FS could provide some notes on how to use
> this feature:)

I think your blog covered all the points. If you don't mind, you could
add some notes in Documentation/admin-guide/mm/transhuge.rst.
On Thu, Dec 21, 2023 at 11:31 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Dec 20, 2023 at 8:53 PM Fangrui Song <maskray@google.com> wrote:
> >
> > Thanks for the comment. Frankly, I am not familiar with huge pages...
> > I noticed this VM_EXEC condition when I was writing this
> > hugepage-related section in
> > https://maskray.me/blog/2023-12-17-exploring-the-section-layout-in-linker-output#transparent-huge-pages-for-mapped-files
> > (Thanks to Alexander Monakov's comment about
> > CONFIG_READ_ONLY_THP_FOR_FS in
> > https://mazzo.li/posts/check-huge-page.html).
>
> Thanks for sharing the article, learnt something about linker and loader.

BTW, since v5.18 the kernel should try to map segments (the size has to
be >= 2M) at a 2M-aligned address on ext4/xfs/btrfs, even if the
loading address is not 2M aligned. See commit 1854bc6e2420
("mm/readahead: Align file mappings for non-DAX"). Did you see this
behavior?
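
Whether a mapping can hold a huge page at all can be checked from its address range in /proc/<pid>/maps. The following is a small sketch, assuming 2 MiB huge pages; the helper names are hypothetical, and the check only looks at virtual addresses, not at file offsets or filesystem support.

```c
#include <stdio.h>

#define HPAGE_SIZE (2UL << 20) /* assumption: 2 MiB huge pages */

/* Parse the "start-end" range at the beginning of one maps line.
 * Returns 0 on success, -1 if the line does not start with a range. */
int parse_maps_range(const char *line, unsigned long *start, unsigned long *end)
{
    return sscanf(line, "%lx-%lx", start, end) == 2 ? 0 : -1;
}

/* Number of whole, naturally aligned 2 MiB chunks inside [start, end). */
unsigned long aligned_hpages_in(unsigned long start, unsigned long end)
{
    unsigned long first = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    unsigned long last = end & ~(HPAGE_SIZE - 1);

    return last > first ? (last - first) / HPAGE_SIZE : 0;
}
```

Applied to the listing quoted earlier in the thread, only the first r--p mapping (55f3b1c00000-55f3b1e00000) has room for an aligned 2 MiB chunk; the remaining file-backed mappings are a page or two long.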
On Wed, Dec 20, 2023 at 08:53:38PM -0800, Fangrui Song wrote:
> Thanks for the comment. Frankly, I am not familiar with huge pages...
> I noticed this VM_EXEC condition when I was writing this
> hugepage-related section in
> https://maskray.me/blog/2023-12-17-exploring-the-section-layout-in-linker-output#transparent-huge-pages-for-mapped-files
> (Thanks to Alexander Monakov's comment about
> CONFIG_READ_ONLY_THP_FOR_FS in
> https://mazzo.li/posts/check-huge-page.html).

CONFIG_READ_ONLY_THP_FOR_FS is a preliminary hack which solves some
problems. The real solution is using large folios, which at the moment
means that you should test on XFS or AFS; filesystem authors have not
been enthusiastic about adding support to their filesystems so far.

In your blog, you write:

: In -z noseparate-code layouts, the file content starts somewhere at
: the first page, potentially wasting half a huge page on unrelated
: content. Switching to -z separate-code allows reclaiming the benefits
: of the half huge page but increases the file size. Balancing
: these aspects poses a challenge. One potential solution is using
: fallocate(FALLOC_FL_PUNCH_HOLE), which introduces complexity into the
: linker. However, this approach feels like a workaround to address a
: kernel limitation. It would be preferable if a file-backed huge page
: didn't necessitate a file offset aligned to a huge page boundary.

You should distinguish between file size (i.e. st_size in stat(3)) and
amount of space occupied on storage (st_blocks). The linker should be
fine with creating a sparse file. If it doesn't, cp --sparse will do
the trick.

Yes, it's a kernel limitation that folios have to be aligned within
the file as well as in both virtual and physical address space. It's a
huge complexity win to do that; I don't think we'd be able to tile the
page cache effectively if we allowed folios to be placed at arbitrary
offsets (I think it turns into a knapsack problem at that point).

> As dTLB for read-only data is also an important optimization of
> file-backed THP, it seems straightforward that we should drop the
> VM_EXEC condition :)

I'm not particularly enthusiastic about making
CONFIG_READ_ONLY_THP_FOR_FS better. Large folios are the future.
Indeed, I'd like to see CONFIG_READ_ONLY_THP_FOR_FS go away in the next
year or two once btrfs and ext4 have support for large folios.
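
The st_size versus st_blocks distinction can be observed with a few lines of C. The sketch below creates a file that consists entirely of a hole and reports both numbers; the function names are hypothetical, and st_blocks is assumed to count 512-byte units, as it does on Linux.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) path and extend it to size bytes without writing
 * any data, so that the file is one large hole on filesystems that
 * support sparse files. Returns 0 on success, -1 on failure. */
int make_sparse_file(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int rc;

    if (fd < 0)
        return -1;
    rc = ftruncate(fd, size);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Report the logical size (st_size) and the allocated bytes
 * (st_blocks * 512, assuming Linux's 512-byte units). */
int file_usage(const char *path, off_t *size, off_t *allocated)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return -1;
    *size = st.st_size;
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

For a linker output padded to a huge page boundary, the same two numbers show how much of the padding actually occupies storage; comparing `ls -l` with `du` on the file gives the same information from the shell.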
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..4c9e67e9000f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -126,7 +126,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	inode = vma->vm_file->f_inode;
 
 	return (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) &&
-	       (vma->vm_flags & VM_EXEC) &&
 	       !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
 }
 