Message ID: 20231116012908.392077-7-peterx@redhat.com
State: New
Headers:
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
Date: Wed, 15 Nov 2023 20:29:02 -0500
Message-ID: <20231116012908.392077-7-peterx@redhat.com>
In-Reply-To: <20231116012908.392077-1-peterx@redhat.com>
Series: mm/gup: Unify hugetlb, part 2
Commit Message
Peter Xu
Nov. 16, 2023, 1:29 a.m. UTC
Hugepd format is only used in PowerPC with hugetlbfs. In commit
a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings"), we added a check to fail gup-fast if there's
potential risk of violating GUP over writeback file systems. That should
never apply to hugepd.
Drop that check, not only because it'll never be true for hugepd, but also
because it paves the way for reusing the function outside fast-gup.
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)
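The diff body is not reproduced in this archive; going by the diffstat and the discussion below, the five deleted lines are presumably the folio_fast_pin_allowed() bail-out in gup_hugepte() (a reconstruction, not the literal hunk):

	-	if (!folio_fast_pin_allowed(folio, flags)) {
	-		gup_put_folio(folio, refs, flags);
	-		return 0;
	-	}
	-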
Comments
On Wed, Nov 15, 2023 at 08:29:02PM -0500, Peter Xu wrote:
> Hugepd format is only used in PowerPC with hugetlbfs. In commit
> a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings"), we added a check to fail gup-fast if there's
> potential risk of violating GUP over writeback file systems. That should
> never apply to hugepd.
>
> Drop that check, not only because it'll never be true for hugepd, but also
> it paves way for reusing the function outside fast-gup.

What prevents us from ever using hugepd with file mappings? I think
it would naturally fit in with how large folios for the pagecache work.

So keeping this check and generalizing it seems like the better idea to
me.
On Mon, Nov 20, 2023 at 12:26:24AM -0800, Christoph Hellwig wrote: > On Wed, Nov 15, 2023 at 08:29:02PM -0500, Peter Xu wrote: > > Hugepd format is only used in PowerPC with hugetlbfs. In commit > > a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to > > file-backed mappings"), we added a check to fail gup-fast if there's > > potential risk of violating GUP over writeback file systems. That should > > never apply to hugepd. > > > > Drop that check, not only because it'll never be true for hugepd, but also > > it paves way for reusing the function outside fast-gup. > > What prevents us from ever using hugepd with file mappings? I think > it would naturally fit in with how large folios for the pagecache work. > > So keeping this check and generalizing it seems like the better idea to > me. But then it means we're still keeping that dead code for fast-gup even if we know that fact.. Or do we have a plan to add that support very soon, so this code will be destined to add back? The other option is I can always add a comment above gup_huge_pd() explaining this special bit, so that when someone is adding hugepd support to file large folios we'll hopefully not forget it? But then that generalization work will only happen when the code will be needed. Thanks,
On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
> > What prevents us from ever using hugepd with file mappings? I think
> > it would naturally fit in with how large folios for the pagecache work.
> >
> > So keeping this check and generalizing it seems like the better idea to
> > me.
>
> But then it means we're still keeping that dead code for fast-gup even if
> we know that fact.. Or do we have a plan to add that support very soon, so
> this code will be destined to add back?

The question wasn't meant to be rhetorical - we support arbitrary
power-of-two sized folios for the pagecache, what prevents us from using
hugepd with them right now?

> The other option is I can always add a comment above gup_huge_pd()
> explaining this special bit, so that when someone is adding hugepd support
> to file large folios we'll hopefully not forget it? But then that
> generalization work will only happen when the code will be needed.

If dropping the check is the right thing for now (and I think the ppc
maintainers and willy as the large folio guy might have more useful
opinions than I do), leaving a comment in would be very useful.
On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote: > On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote: > > > What prevents us from ever using hugepd with file mappings? I think > > > it would naturally fit in with how large folios for the pagecache work. > > > > > > So keeping this check and generalizing it seems like the better idea to > > > me. > > > > But then it means we're still keeping that dead code for fast-gup even if > > we know that fact.. Or do we have a plan to add that support very soon, so > > this code will be destined to add back? > > The question wasn't mean retorical - we support arbitrary power of two > sized folios for the pagepage, what prevents us from using hugepd with > them right now? Ah, didn't catch that point previously. Hugepd is just not used outside hugetlb right now, afaiu. For example, __hugepte_alloc() (and that's the only one calls hugepd_populate()) should be the function to allocate a hugepd (ppc only), and it's only called in huge_pte_alloc(), which is part of the current arch-specific hugetlb api. And generic mm paths don't normally have hugepd handling, afaics. For example, page_vma_mapped_walk() doesn't handle hugepd at all unless in hugetlb specific path. There're actually (only) two generic mm paths that can handle hugepd, namely: - fast-gup - walk_page_*() apis (aka, __walk_page_range()) For fast-gup I think the hugepd code is in use, however for walk_page_* apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit a hugepd can be dead code to me (but note that this "dead code" is good stuff to me, if one would like to merge hugetlb instead into generic mm). This series tries to add slow gup into that list too, so the 3rd one to support it. I plan to look more into this area (e.g., __walk_page_range() can be another good candidate soon). I'm not sure whether we should teach the whole mm to understand hugepd yet, but slow gup and __walk_page_range() does look like good candidates to already remove the hugetlb specific code paths - slow-gup has average ~add/~del LOCs (which this series does), and __walk_page_range() can remove some code logically, no harm I yet see. Indeed above are based on only my code observations, so I'll be more than happy to be corrected otherwise, as early as possible. > > > The other option is I can always add a comment above gup_huge_pd() > > explaining this special bit, so that when someone is adding hugepd support > > to file large folios we'll hopefully not forget it? But then that > > generalization work will only happen when the code will be needed. > > If dropping the check is the right thing for now (and I think the ppc > maintainers and willy as the large folio guy might have a more useful > opinions than I do), leaving a comment in would be very useful. Willy is in the loop, and I just notice I didn't really copy ppc list, even I planned to.. I am adding the list (linuxppc-dev@lists.ozlabs.org) into this reply. I'll remember to do so as long as there's a new version. The other reason I feel like hugepd may or may not be further developed for new features like large folio is that I saw Power9 started to shift to radix pgtables, and afaics hugepd is only supported in hash tables (hugepd_ok()). But again, I confess I know nothing about Power at all. Thanks,
On Wed, Nov 22, 2023 at 10:22:11AM -0500, Peter Xu wrote:
> The other reason I feel like hugepd may or may not be further developed for
> new features like large folio is that I saw Power9 started to shift to
> radix pgtables, and afaics hugepd is only supported in hash tables
> (hugepd_ok()). But again, I confess I know nothing about Power at all.

That alone sounds like a good reason to not bother. So unless more
qualified people have a different opinion I'm fine with this patch
as long as you leave a comment in place, and amend the commit message
with some of the very useful information from your mail.
On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote: > > The other option is I can always add a comment above gup_huge_pd() > > explaining this special bit, so that when someone is adding hugepd support > > to file large folios we'll hopefully not forget it? But then that > > generalization work will only happen when the code will be needed. > > If dropping the check is the right thing for now (and I think the ppc > maintainers and willy as the large folio guy might have a more useful > opinions than I do), leaving a comment in would be very useful. It looks like ARM (in the person of Ryan) are going to add support for something equivalent to hugepd. Insofar as I understand hugepd, anyway. I've done my best to set up the generic code so that the arch code can use whatever size TLB entries it supports. I haven't been looking to the hugetlb code as a reference for it, since it can assume natural alignment and generic THP/large folio must be able to handle arbitrary alignment. If powerpc want to join in on the fun, they're more than welcome, but I get the feeling that investment in Linux-on-PPC is somewhat smaller than Linux-on-ARM these days. Even if we restrict that to the server space.
On Wed, Nov 22, 2023 at 11:21:50PM -0800, Christoph Hellwig wrote:
> That alone sounds like a good reason to not bother. So unless more
> qualified people have a different opinion I'm fine with this patch
> as long as you leave a comment in place, and amend the commit message
> with some of the very useful information from your mail.

Will do, thanks. This is what I will squash into the same patch in the new
version, as the current plan:

+/*
+ * NOTE: currently hugepd is only used in hugetlbfs file systems on Power,
+ * which does not have issue with folio writeback against GUP updates.
+ * When hugepd will be extended to support non-hugetlbfs or even anonymous
+ * memory, we need to do extra check as what we do with most of the other
+ * folios. See writable_file_mapping_allowed() and folio_fast_pin_allowed()
+ * for more information.
+ */
 static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
 		unsigned int pdshift, unsigned long end, unsigned int flags,
 		struct page **pages, int *nr)
On Thu, Nov 23, 2023 at 03:47:49PM +0000, Matthew Wilcox wrote:
> It looks like ARM (in the person of Ryan) are going to add support for
> something equivalent to hugepd.

If it's about arm's cont_pte, then it looks ideal because this series
didn't yet touch cont_pte, assuming it'll just work. From that aspect, his
work may help mine, and there's no immediate conflict either.

There can be a slight performance difference which I need to measure for
arm's cont_pte already for hugetlb, but I didn't worry much on that;
quoting my commit message in the last patch:

  There may be a slight difference of how the loops run when processing
  GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV
  (mostly its Svnapot extension on 64K huge pages): each loop of
  __get_user_pages() will resolve one pgtable entry with the patch
  applied, rather than relying on the size of hugetlb hstate, the latter
  may cover multiple entries in one loop.

  However, the performance difference should hopefully not be a major
  concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup:
  accelerate thp gup even for "pages != NULL""), and that's not part of a
  performance analysis but a side dish. If the performance will be a
  concern, we can consider handling CONT_PTE in follow_page(), for example.

So IMHO it can be slightly different comparing to e.g. page fault, because
each fault is still pretty slow as a whole if one fault for each small pte
(of a large folio / cont_pte), while the loop in GUP is still relatively
tight and short, comparing to a fault. I'd boldly guess there are more low
hanging fruits out there for large folio outside GUP areas.

In all cases, it'll be interesting to know if Ryan has worked on cont_pte
support for gup on large folios, and whether there's any performance number
to share. It's definitely good news to me because it means Ryan's work can
also then benefit hugetlb if this series will be merged, I just don't know
how much difference there will be.

Thanks,
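A schematic of the loop-granularity difference quoted above (an editorial sketch: the two helpers are real, but the stride bookkeeping in __get_user_pages() is more involved than shown, and "nr" here just stands for however many base pages one lookup resolved):

	/* hugetlb-specific path, removed by this series: stride is the huge page size */
	for (addr = start; addr < end; addr += huge_page_size(h))
		page = hugetlb_follow_page_mask(vma, addr, flags, &page_mask);

	/* generic path: stride is whatever one lookup resolves; a CONT_PTE range
	 * is thus walked entry by entry until follow_page_mask() learns to
	 * report a larger span for it */
	for (addr = start; addr < end; addr += nr * PAGE_SIZE)
		page = follow_page_mask(vma, addr, flags, &ctx);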
Le 22/11/2023 à 16:22, Peter Xu a écrit : > On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote: >> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote: >>>> What prevents us from ever using hugepd with file mappings? I think >>>> it would naturally fit in with how large folios for the pagecache work. >>>> >>>> So keeping this check and generalizing it seems like the better idea to >>>> me. >>> >>> But then it means we're still keeping that dead code for fast-gup even if >>> we know that fact.. Or do we have a plan to add that support very soon, so >>> this code will be destined to add back? >> >> The question wasn't mean retorical - we support arbitrary power of two >> sized folios for the pagepage, what prevents us from using hugepd with >> them right now? > > Ah, didn't catch that point previously. Hugepd is just not used outside > hugetlb right now, afaiu. > > For example, __hugepte_alloc() (and that's the only one calls > hugepd_populate()) should be the function to allocate a hugepd (ppc only), > and it's only called in huge_pte_alloc(), which is part of the current > arch-specific hugetlb api. > > And generic mm paths don't normally have hugepd handling, afaics. For > example, page_vma_mapped_walk() doesn't handle hugepd at all unless in > hugetlb specific path. > > There're actually (only) two generic mm paths that can handle hugepd, > namely: > > - fast-gup > - walk_page_*() apis (aka, __walk_page_range()) > > For fast-gup I think the hugepd code is in use, however for walk_page_* > apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific > handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit > a hugepd can be dead code to me (but note that this "dead code" is good > stuff to me, if one would like to merge hugetlb instead into generic mm). Not sure what you mean here. What do you mean by "dead code" ? A hugepage directory can be plugged at any page level, from PGD to PMD. So the following bit in walk_pgd_range() is valid and not dead: if (is_hugepd(__hugepd(pgd_val(*pgd)))) err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT); > > This series tries to add slow gup into that list too, so the 3rd one to > support it. I plan to look more into this area (e.g., __walk_page_range() > can be another good candidate soon). I'm not sure whether we should teach > the whole mm to understand hugepd yet, but slow gup and __walk_page_range() > does look like good candidates to already remove the hugetlb specific code > paths - slow-gup has average ~add/~del LOCs (which this series does), and > __walk_page_range() can remove some code logically, no harm I yet see. > > Indeed above are based on only my code observations, so I'll be more than > happy to be corrected otherwise, as early as possible. > >> >>> The other option is I can always add a comment above gup_huge_pd() >>> explaining this special bit, so that when someone is adding hugepd support >>> to file large folios we'll hopefully not forget it? But then that >>> generalization work will only happen when the code will be needed. >> >> If dropping the check is the right thing for now (and I think the ppc >> maintainers and willy as the large folio guy might have a more useful >> opinions than I do), leaving a comment in would be very useful. > > Willy is in the loop, and I just notice I didn't really copy ppc list, even > I planned to.. I am adding the list (linuxppc-dev@lists.ozlabs.org) into > this reply. 
> I'll remember to do so as long as there's a new version. > > The other reason I feel like hugepd may or may not be further developed for > new features like large folio is that I saw Power9 started to shift to > radix pgtables, and afaics hugepd is only supported in hash tables > (hugepd_ok()). But again, I confess I know nothing about Power at all. > > Thanks, >
On 23/11/2023 17:22, Peter Xu wrote: > On Thu, Nov 23, 2023 at 03:47:49PM +0000, Matthew Wilcox wrote: >> It looks like ARM (in the person of Ryan) are going to add support for >> something equivalent to hugepd. > > If it's about arm's cont_pte, then it looks ideal because this series > didn't yet touch cont_pte, assuming it'll just work. From that aspect, his > work may help mine, and no immediately collapsing either. Hi, I'm not sure I've 100% understood the crossover between this series and my work to support arm64's contpte mappings generally for anonymous and file-backed memory. My approach is to transparently use contpte mappings when core-mm request pte mappings that meet the requirements; and its all based around intercepting the normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is no semantic change to the core-mm. See [1]. It relies on 1) the page cache using large folios and 2) my "small-sized THP" series which starts using arbitrary sized large folios for anonymous memory [2]. If I've understood this conversation correctly there is an object called hugepd, which today is only supported by powerpc, but which could allow the core-mm to control the mapping granularity? I can see some value in exposing that control to core-mm in the (very) long term. [1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.roberts@arm.com/ Thanks, Ryan > > There can be a slight performance difference which I need to measure for > arm's cont_pte already for hugetlb, but I didn't worry much on that; > quotting my commit message in the last patch: > > There may be a slight difference of how the loops run when processing > GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV > (mostly its Svnapot extension on 64K huge pages): each loop of > __get_user_pages() will resolve one pgtable entry with the patch > applied, rather than relying on the size of hugetlb hstate, the latter > may cover multiple entries in one loop. > > However, the performance difference should hopefully not be a major > concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup: > accelerate thp gup even for "pages != NULL""), and that's not part of a > performance analysis but a side dish. If the performance will be a > concern, we can consider handle CONT_PTE in follow_page(), for example. > > So IMHO it can be slightly different comparing to e.g. page fault, because > each fault is still pretty slow as a whole if one fault for each small pte > (of a large folio / cont_pte), while the loop in GUP is still relatively > tight and short, comparing to a fault. I'd boldly guess more low hanging > fruits out there for large folio outside GUP areas. > > In all cases, it'll be interesting to know if Ryan has worked on cont_pte > support for gup on large folios, and whether there's any performance number > to share. It's definitely good news to me because it means Ryan's work can > also then benefit hugetlb if this series will be merged, I just don't know > how much difference there will be. > > Thanks, >
On Thu, Nov 23, 2023 at 06:22:33PM +0000, Christophe Leroy wrote:
> > For fast-gup I think the hugepd code is in use, however for walk_page_*
> > apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
> > handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
> > a hugepd can be dead code to me (but note that this "dead code" is good
> > stuff to me, if one would like to merge hugetlb instead into generic mm).
>
> Not sure what you mean here. What do you mean by "dead code" ?
> A hugepage directory can be plugged at any page level, from PGD to PMD.
> So the following bit in walk_pgd_range() is valid and not dead:
>
> 	if (is_hugepd(__hugepd(pgd_val(*pgd))))
> 		err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);

IMHO it boils down to the question on whether hugepd is only used in
hugetlbfs. I think I already mentioned that above, but I can be more
explicit; what I see is that from higher stack in __walk_page_range():

	if (is_vm_hugetlb_page(vma)) {
		if (ops->hugetlb_entry)
			err = walk_hugetlb_range(start, end, walk);
	} else
		err = walk_pgd_range(start, end, walk);

It means to me as long as the vma is hugetlb, it'll not trigger any code in
walk_pgd_range(), but only walk_hugetlb_range(). Do you perhaps mean
hugepd is used outside hugetlbfs?

Thanks,
On Thu, Nov 23, 2023 at 07:11:19PM +0000, Ryan Roberts wrote:
> Hi,
>
> I'm not sure I've 100% understood the crossover between this series and my work
> to support arm64's contpte mappings generally for anonymous and file-backed memory.

No worry, there's no conflict. If you work on that it'd only be something
nice on top. Also, I'm curious if you have performance numbers, because
I'm going to do some tests for hugetlb cont_ptes (which is only the
current plan), and if you got those it'll be a great baseline for me,
because it should be similar in your case even though the goal is slightly
different.

> My approach is to transparently use contpte mappings when core-mm request pte
> mappings that meet the requirements; and its all based around intercepting the
> normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is
> no semantic change to the core-mm. See [1]. It relies on 1) the page cache using
> large folios and 2) my "small-sized THP" series which starts using arbitrary
> sized large folios for anonymous memory [2].
>
> If I've understood this conversation correctly there is an object called hugepd,
> which today is only supported by powerpc, but which could allow the core-mm to
> control the mapping granularity? I can see some value in exposing that control
> to core-mm in the (very) long term.

For me it's needed immediately, because hugetlb_follow_page_mask() will be
gone after the last patch.

> [1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.roberts@arm.com/

AFAICT you haven't yet worked on gup then, after I glimpsed the above
series. It's a matter of whether one follow_page_mask() call can fetch
more than one page* for a cont_pte entry on aarch64 for a large
non-hugetlb folio (and if this series lands, it'll be the same to hugetlb
or non-hugetlb). Now the current code can only fetch one page I think.

Thanks,
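For background on the "fetch more than one page" point, this is the existing contract between __get_user_pages() and follow_page_mask() (the page_increm line is the real computation from mm/gup.c; the comment is editorial):

	page = follow_page_mask(vma, addr, flags, &ctx);
	/* a pmd-mapped THP reports ctx.page_mask = HPAGE_PMD_NR - 1, so the
	 * caller consumes many base pages from a single lookup; a cont_pte
	 * mapping today leaves page_mask at 0, i.e. one page per call */
	page_increm = 1 + (~(addr >> PAGE_SHIFT) & ctx.page_mask);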
Peter Xu <peterx@redhat.com> writes: > On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote: >> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote: ... >> >> If dropping the check is the right thing for now (and I think the ppc >> maintainers and willy as the large folio guy might have a more useful >> opinions than I do), leaving a comment in would be very useful. > > Willy is in the loop, and I just notice I didn't really copy ppc list, even > I planned to.. I am adding the list (linuxppc-dev@lists.ozlabs.org) into > this reply. I'll remember to do so as long as there's a new version. Thanks. > The other reason I feel like hugepd may or may not be further developed for > new features like large folio is that I saw Power9 started to shift to > radix pgtables, and afaics hugepd is only supported in hash tables > (hugepd_ok()). Because it's powerpc it's not quite that simple :} Power9 uses the Radix MMU by default, but the hash page table MMU is still supported. However although hugepd is used with the hash page table MMU, that's only when PAGE_SIZE=4K. These days none of the major distros build with 4K pages. But some of the non-server CPU platforms also use hugepd. 32-bit 8xx does, which is actively maintained by Christophe. And I believe Freescale e6500 can use it, but that is basically orphaned, and although I boot test it I don't run any hugetlb tests. (I guess I should do that). cheers
Le 23/11/2023 à 20:37, Peter Xu a écrit : > On Thu, Nov 23, 2023 at 06:22:33PM +0000, Christophe Leroy wrote: >>> For fast-gup I think the hugepd code is in use, however for walk_page_* >>> apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific >>> handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit >>> a hugepd can be dead code to me (but note that this "dead code" is good >>> stuff to me, if one would like to merge hugetlb instead into generic mm). >> >> Not sure what you mean here. What do you mean by "dead code" ? >> A hugepage directory can be plugged at any page level, from PGD to PMD. >> So the following bit in walk_pgd_range() is valid and not dead: >> >> if (is_hugepd(__hugepd(pgd_val(*pgd)))) >> err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT); > > IMHO it boils down to the question on whether hugepd is only used in > hugetlbfs. I think I already mentioned that above, but I can be more > explicit; what I see is that from higher stack in __walk_page_range(): > > if (is_vm_hugetlb_page(vma)) { > if (ops->hugetlb_entry) > err = walk_hugetlb_range(start, end, walk); > } else > err = walk_pgd_range(start, end, walk); > > It means to me as long as the vma is hugetlb, it'll not trigger any code in > walk_pgd_range(), but only walk_hugetlb_range(). Do you perhaps mean > hugepd is used outside hugetlbfs? I added that code with commit e17eae2b8399 ("mm: pagewalk: fix walk for hugepage tables") because I was getting crazy displays when dumping /sys/kernel/debug/pagetables Huge pages can be used for many thing. On powerpc 8xx, there are 4 possible page size: 4k, 16k, 512k and 8M. Each PGD entry addresses 4M areas, so hugepd is used for anything using 8M pages. Could have used regular page tables instead, but it is not worth allocating a 4k table when the HW will only read first entry. At the time being, linear memory mapping is performed with 8M pages, so ptdump_walk_pgd() will walk into huge page directories. Also, huge pages can be used in vmalloc() and in vmap(). At the time being we support 512k pages there on the 8xx. 8M pages will be supported once vmalloc() and vmap() support hugepd, as explained in commit a6a8f7c4aa7e ("powerpc/8xx: add support for huge pages on VMAP and VMALLOC") So yes as a conclusion hugepd is used outside hugetlbfs, hope it clarifies things. Christophe
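For readers new to the object under discussion, a hugepd looks roughly like this (abridged from the powerpc headers; hugepd_ok() is platform-specific and the details vary per MMU family):

	/* an entry, at any level from PGD to PMD, that points at a huge page
	 * directory instead of a normal page table */
	typedef struct { unsigned long pd; } hugepd_t;
	#define is_hugepd(hpd)	(hugepd_ok(hpd))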
On 23/11/2023 19:46, Peter Xu wrote: > On Thu, Nov 23, 2023 at 07:11:19PM +0000, Ryan Roberts wrote: >> Hi, >> >> I'm not sure I've 100% understood the crossover between this series and my work >> to support arm64's contpte mappings generally for anonymous and file-backed memory. > > No worry, there's no confliction. If you worked on that it's only be > something nice on top. Also, I'm curious if you have performance numbers, I have perf numbers for high level use cases (kernel compilation and Speedometer Java Script benchmarks) at https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/ I don't have any micro-benchmarks for GUP though, if that's your question. Is there an easy-to-use test I can run to get some numbers? I'd be happy to try it out. > because I'm going to do some test for hugetlb cont_ptes (which is only the > current plan), and if you got those it'll be a great baseline for me, > because it should be similar in you case even though the goal is slightly > different. > >> >> My approach is to transparently use contpte mappings when core-mm request pte >> mappings that meet the requirements; and its all based around intercepting the >> normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There is >> no semantic change to the core-mm. See [1]. It relies on 1) the page cache using >> large folios and 2) my "small-sized THP" series which starts using arbitrary >> sized large folios for anonymous memory [2]. >> >> If I've understood this conversation correctly there is an object called hugepd, >> which today is only supported by powerpc, but which could allow the core-mm to >> control the mapping granularity? I can see some value in exposing that control >> to core-mm in the (very) long term. > > For me it's needed immediately, because hugetlb_follow_page_mask() will be > gone after the last patch. > >> >> [1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.roberts@arm.com/ >> [2] https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.roberts@arm.com/ > > AFAICT you haven't yet worked on gup then, after I glimpsed the above > series. No, I haven't touched GUP at all. The approach is fully inside the arm64 arch code (except 1 patch to core-mm which enables an optimization). So as far as GUP and the rest of the core-mm is concerned, there are still only page-sized ptes and they can all be iterated over and accessed as normal. > > It's a matter of whether one follow_page_mask() call can fetch more than > one page* for a cont_pte entry on aarch64 for a large non-hugetlb folio > (and if this series lands, it'll be the same to hugetlb or non-hugetlb). > Now the current code can only fetch one page I think. > > Thanks, >
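A minimal sketch of the interception Ryan describes, assuming arm64 (contpte_should_fold() is a hypothetical name for the eligibility check; the real series may split this differently):

	void set_ptes(struct mm_struct *mm, unsigned long addr,
		      pte_t *ptep, pte_t pte, unsigned int nr)
	{
		/* hypothetical: fold to a contpte mapping when the run covers a
		 * whole, naturally aligned contiguous block */
		if (contpte_should_fold(addr, pte, nr))
			pte = pte_mkcont(pte);
		__set_ptes(mm, addr, ptep, pte, nr);	/* underlying per-pte helper */
	}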
On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote: > I don't have any micro-benchmarks for GUP though, if that's your question. Is > there an easy-to-use test I can run to get some numbers? I'd be happy to try it out. Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched from your side, afaict. I'll see whether I can provide some rough numbers instead in the next post (I'll probably only be able to test it in a VM, though, but hopefully that should still reflect mostly the truth).
Hi, Christophe, Michael, Aneesh,

[I'll reply altogether here]

On Fri, Nov 24, 2023 at 07:03:11AM +0000, Christophe Leroy wrote:
> I added that code with commit e17eae2b8399 ("mm: pagewalk: fix walk for
> hugepage tables") because I was getting crazy displays when dumping
> /sys/kernel/debug/pagetables
>
> Huge pages can be used for many thing.
>
> On powerpc 8xx, there are 4 possible page size: 4k, 16k, 512k and 8M.
> Each PGD entry addresses 4M areas, so hugepd is used for anything using
> 8M pages. Could have used regular page tables instead, but it is not
> worth allocating a 4k table when the HW will only read first entry.
>
> At the time being, linear memory mapping is performed with 8M pages, so
> ptdump_walk_pgd() will walk into huge page directories.
>
> Also, huge pages can be used in vmalloc() and in vmap(). At the time
> being we support 512k pages there on the 8xx. 8M pages will be supported
> once vmalloc() and vmap() support hugepd, as explained in commit
> a6a8f7c4aa7e ("powerpc/8xx: add support for huge pages on VMAP and VMALLOC")
>
> So yes as a conclusion hugepd is used outside hugetlbfs, hope it
> clarifies things.

Yes it does, thanks a lot for all of your replies. So I think this is what
I missed: on Freescale ppc 8xx there's a special hugepd_populate_kernel()
defined to install kernel pgtables for hugepd. Obviously I didn't check
further than hugepd_populate() when I first looked, and stopped at the
first instance of hugepd_populate() definition on the 64 bits ppc.

For this specific patch: I suppose the change is still all fine to reuse
the fast-gup function, because it is still true when there's a VMA present
(GUP applies only to user mappings, nothing like KASAN should ever pop
up). So AFAIU it's still true that hugepd is only used in hugetlb pages in
this case even for Freescale 8xx, and nothing should yet explode. So
maybe I can still keep the code changes.

However the comment at least definitely needs fixing (that I'm going to
add some, which hch requested and I agree), that is not yet in the patch I
posted here but I'll refine them locally.

For the whole work: the purpose of it is to start merging hugetlb pgtable
processing with generic mm. That is my take of previous lsfmm discussions
in the community on how we should move forward with hugetlb in the future,
to avoid code duplications against generic mm. Hugetlb is kind of blocked
on adding new (especially, large) features in general because of such
complexity. This is all about that, but a small step towards it.

I see that it seems a trend to make hugepd more general. Christophe's fix
on dump pgtable is exactly what I would also look for if I keep going. I
hope that's the right way to go.

I'll also need to think more on how this will affect my plan, currently it
seems all fine: I won't ever try to change any kernel mapping specific
code. I suppose any hugetlbfs based test should still cover all codes I
will touch on hugepd. Then it should just work for kernel mappings on
Freescales; it'll be great if e.g. Christophe can help me double check
that once the series stabilizes in a few versions.

If any of you have any hint on testing it'll be more than welcomed, either
specific test case or hints; currently I'm still at a phase looking for a
valid ppc system - QEMU tcg ppc64 emulation on x86 is slow enough to let
me give up already.

Considering hugepd's specialty in ppc and the possibility that I'll break
it, there's yet another option, which is to only apply the new logic to
archs with !ARCH_HAS_HUGEPD.
It'll make my life easier, but that also means even if my attempt would work out anything new will by default rule ppc out. And we'll have a bunch of "#ifdef ARCH_HAS_HUGEPD" in generic code, which is not preferred either. For gup, it might be relatively easy when comparing to the rest. I'm still hesitating for the long term plan. Please let me know if you have any thoughts on any of above. Thanks!
On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote: > On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote: > > I don't have any micro-benchmarks for GUP though, if that's your question. Is > > there an easy-to-use test I can run to get some numbers? I'd be happy to try it out. > > Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched > from your side, afaict. I'll see whether I can provide some rough numbers > instead in the next post (I'll probably only be able to test it in a VM, > though, but hopefully that should still reflect mostly the truth). An update: I finished a round of 64K cont_pte test, in the slow gup micro benchmark I see ~15% perf degrade with this patchset applied on a VM on top of Apple M1. Frankly that's even less than I expected, considering not only how slow gup THP used to be, but also on the fact that that's a tight loop over slow gup, which in normal cases shouldn't happen: "present" ptes normally goes to fast-gup, while !present goes into a fault following it. I assume that's why nobody cared slow gup for THP before. I think adding cont_pte support shouldn't be very hard, but that will include making cont_pte idea global just for arm64 and riscv Svnapot. The current plan is I'll add that performance number into my commit message only, as I don't ever expect any real workload will regress with it. Maybe a global cont_pte api will be needed at some point, but perhaps not yet feel strongly for this use case. Please feel free to raise any concerns otherwise. Thanks,
Le 30/11/2023 à 22:30, Peter Xu a écrit :
> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>
>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>> from your side, afaict. I'll see whether I can provide some rough numbers
>> instead in the next post (I'll probably only be able to test it in a VM,
>> though, but hopefully that should still reflect mostly the truth).
>
> An update: I finished a round of 64K cont_pte test, in the slow gup micro
> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
> of Apple M1.
>
> Frankly that's even less than I expected, considering not only how slow gup
> THP used to be, but also on the fact that that's a tight loop over slow
> gup, which in normal cases shouldn't happen: "present" ptes normally goes
> to fast-gup, while !present goes into a fault following it. I assume
> that's why nobody cared slow gup for THP before. I think adding cont_pte
> support shouldn't be very hard, but that will include making cont_pte idea
> global just for arm64 and riscv Svnapot.

Is there any documentation on what cont_pte is? I have always wondered
if it could also fit powerpc 8xx need?

On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
PTEs are flagged with the SPS bit telling it's a 16k page, but for TLB
misses the HW needs one entry for each 4k fragment.

There is also a similar approach for 512k pages, we have 128 contiguous
identical PTEs for them.

And whatever PAGE_SIZE is (either 4k or 16k), the HW needs one 'unsigned
long' pte for each 4k fragment. So at the time being when we define
PAGE_SIZE as 16k, we need a special pte_t which is a table of 4x
unsigned long.

Wondering if the cont_pte concept is similar and whether it could help.

Thanks
Christophe
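A sketch of the layout Christophe describes, assuming the 8xx with 16k base pages (simplified; the kernel's real definition uses pte_basic_t and is conditional on the 8xx/16k configuration):

	/* one software pte_t spans four 4k hardware slots; the fault path writes
	 * the same value, SPS bit included, into all four */
	typedef struct { unsigned long pte, pte1, pte2, pte3; } pte_t;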
On 03/12/2023 13:33, Christophe Leroy wrote: > > > Le 30/11/2023 à 22:30, Peter Xu a écrit : >> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote: >>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote: >>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is >>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out. >>> >>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched >>> from your side, afaict. I'll see whether I can provide some rough numbers >>> instead in the next post (I'll probably only be able to test it in a VM, >>> though, but hopefully that should still reflect mostly the truth). >> >> An update: I finished a round of 64K cont_pte test, in the slow gup micro >> benchmark I see ~15% perf degrade with this patchset applied on a VM on top >> of Apple M1. >> >> Frankly that's even less than I expected, considering not only how slow gup >> THP used to be, but also on the fact that that's a tight loop over slow >> gup, which in normal cases shouldn't happen: "present" ptes normally goes >> to fast-gup, while !present goes into a fault following it. I assume >> that's why nobody cared slow gup for THP before. I think adding cont_pte >> support shouldn't be very hard, but that will include making cont_pte idea >> global just for arm64 and riscv Svnapot. > > Is there any documentation on what cont_pte is ? I have always wondered > if it could also fit powerpc 8xx need ? pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs are mapping a physically contiguous and naturally aligned piece of memory. The HW can use this to coalesce entries in the TLB. When using 4K base pages, the contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K base pages, its 2M (32 PTEs). > > On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4 > PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB > misses the HW needs one entrie for each 4k fragment. From that description, it sounds like the SPS bit might be similar to arm64 contiguous bit? Although sounds like you are currently using it in a slightly different way - telling kernel that the base page is 16K but mapping each 16K page with 4x 4K entries (plus the SPS bit set)? > > There is also a similar approach for 512k pages, we have 128 contiguous > identical PTEs for them. > > And whatever PAGE_SIZE is (either 4k or 16k), the HW needs one 'unsigned > long' pte for each 4k fragment. So at the time being when we define > PAGE_SIZE as 16k, we need a special pte_t which is a table of 4x > unsigned long. > > Wondering if the cont_pte concept is similar and whether it could help. To be honest, while I understand pte_cont() and friends, I don't understand their relevance (or at least potential future relevance) to GUP? > > Thanks > Christophe
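For concreteness, the arm64 helpers mentioned above look roughly like this (abridged from the arm64 pgtable headers; the real pte_mkcont() also adjusts the descriptor type bits):

	#define PTE_CONT		(_AT(pteval_t, 1) << 52)	/* contiguous-range hint bit */
	#define pte_cont(pte)		(!!(pte_val(pte) & PTE_CONT))
	#define pte_mkcont(pte)		set_pte_bit(pte, __pgprot(PTE_CONT))
	#define pte_mknoncont(pte)	clear_pte_bit(pte, __pgprot(PTE_CONT))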
Le 04/12/2023 à 12:11, Ryan Roberts a écrit :
> On 03/12/2023 13:33, Christophe Leroy wrote:
>> Le 30/11/2023 à 22:30, Peter Xu a écrit :
>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is
>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
>>>>
>>>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched
>>>> from your side, afaict. I'll see whether I can provide some rough numbers
>>>> instead in the next post (I'll probably only be able to test it in a VM,
>>>> though, but hopefully that should still reflect mostly the truth).
>>>
>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro
>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top
>>> of Apple M1.
>>>
>>> Frankly that's even less than I expected, considering not only how slow gup
>>> THP used to be, but also on the fact that that's a tight loop over slow
>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes
>>> to fast-gup, while !present goes into a fault following it. I assume
>>> that's why nobody cared slow gup for THP before. I think adding cont_pte
>>> support shouldn't be very hard, but that will include making cont_pte idea
>>> global just for arm64 and riscv Svnapot.
>>
>> Is there any documentation on what cont_pte is ? I have always wondered
>> if it could also fit powerpc 8xx need ?
>
> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the
> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
> are mapping a physically contiguous and naturally aligned piece of memory. The
> HW can use this to coalesce entries in the TLB. When using 4K base pages, the
> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K
> base pages, its 2M (32 PTEs).
>
>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB
>> misses the HW needs one entrie for each 4k fragment.
>
> From that description, it sounds like the SPS bit might be similar to arm64
> contiguous bit? Although sounds like you are currently using it in a slightly
> different way - telling kernel that the base page is 16K but mapping each 16K
> page with 4x 4K entries (plus the SPS bit set)?

Yes it's both.

When the base page is 16k, there are 4x 4k entries (with SPS bit set) in
the page table, and pte_t is a table of 4 'unsigned long'

When the base page is 4k, there is a 16k hugepage size, which is the
same 4x 4k entries with SPS bit set.

So it looks similar to the contiguous bit.

And by extension, the same principle is used for 512k hugepages, the bit
_PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS,
PS being as follows:
- 00 Small (4 Kbyte or 16 Kbyte)
- 01 512 Kbyte
- 10 Reserved
- 11 8 Mbyte

So as PMD size is 4M, 512k pages are 128 identical consecutive entries
in the page table.

I wish I could have THP with 16k or 512k pages.

Christophe
On 04/12/2023 11:25, Christophe Leroy wrote: > > > Le 04/12/2023 à 12:11, Ryan Roberts a écrit : >> On 03/12/2023 13:33, Christophe Leroy wrote: >>> >>> >>> Le 30/11/2023 à 22:30, Peter Xu a écrit : >>>> On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote: >>>>> On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote: >>>>>> I don't have any micro-benchmarks for GUP though, if that's your question. Is >>>>>> there an easy-to-use test I can run to get some numbers? I'd be happy to try it out. >>>>> >>>>> Thanks Ryan. Then nothing is needed to be tested if gup is not yet touched >>>>> from your side, afaict. I'll see whether I can provide some rough numbers >>>>> instead in the next post (I'll probably only be able to test it in a VM, >>>>> though, but hopefully that should still reflect mostly the truth). >>>> >>>> An update: I finished a round of 64K cont_pte test, in the slow gup micro >>>> benchmark I see ~15% perf degrade with this patchset applied on a VM on top >>>> of Apple M1. >>>> >>>> Frankly that's even less than I expected, considering not only how slow gup >>>> THP used to be, but also on the fact that that's a tight loop over slow >>>> gup, which in normal cases shouldn't happen: "present" ptes normally goes >>>> to fast-gup, while !present goes into a fault following it. I assume >>>> that's why nobody cared slow gup for THP before. I think adding cont_pte >>>> support shouldn't be very hard, but that will include making cont_pte idea >>>> global just for arm64 and riscv Svnapot. >>> >>> Is there any documentation on what cont_pte is ? I have always wondered >>> if it could also fit powerpc 8xx need ? >> >> pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulte the >> "contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific >> (AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs >> are mapping a physically contiguous and naturally aligned piece of memory. The >> HW can use this to coalesce entries in the TLB. When using 4K base pages, the >> contpte size is 64K (16 PTEs). For 16K base pages, its 2M (128 PTEs) and for 64K >> base pages, its 2M (32 PTEs). >> >>> >>> On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4 >>> PTE are flagged with the SPS bit telling it's a 16k pages, but for TLB >>> misses the HW needs one entrie for each 4k fragment. >> >> From that description, it sounds like the SPS bit might be similar to arm64 >> contiguous bit? Although sounds like you are currently using it in a slightly >> different way - telling kernel that the base page is 16K but mapping each 16K >> page with 4x 4K entries (plus the SPS bit set)? > > Yes it's both. > > When the base page is 16k, there are 4x 4k entries (with SPS bit set) in > the page table, and pte_t is a table of 4 'unsigned long' > > When the base page is 4k, there is a 16k hugepage size, which is the > same 4x 4k entries with SPS bit set. > > So it looks similar to the contiguous bit. > > > And by extension, the same principle is used for 512k hugepages, the bit > _PAGE_HUGE is copied by the TLB miss handler into the lower bit of PS, > PS being as follows: > - 00 Small (4 Kbyte or 16 Kbyte) > - 01 512 Kbyte > - 10 Reserved > - 11 8 Mbyte > > So as PMD size is 4M, 512k pages are 128 identical consecutive entries > in the page table. > > I which I could have THP with 16k or 512k pages. Then you have come to the right place! :) https://lore.kernel.org/linux-mm/20231204102027.57185-1-ryan.roberts@arm.com/ > > Christophe
On 04/12/2023 12:46, Ryan Roberts wrote:
> On 04/12/2023 11:25, Christophe Leroy wrote:

[...]

>> I wish I could have THP with 16k or 512k pages.
>
> Then you have come to the right place! :)
>
> https://lore.kernel.org/linux-mm/20231204102027.57185-1-ryan.roberts@arm.com/

That looks great. That series only modifies core mm/.
No changes needed in arch? Will it work on powerpc without any
changes/additions to arch code?

Well, I'll try it soon.

Christophe
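For context on what 8xx arch code would have to manage, here is a sketch of
the 16k scheme Christophe described earlier. This is illustrative only:
set_sps_ptes() is a made-up helper name for this example, while _PAGE_SPS
is the real 8xx PTE bit:

/* Sketch: a 16k page on 8xx is 4 consecutive 4k PTEs, each carrying
 * the SPS bit; the HW takes one TLB entry per 4k fragment. */
#define _PAGE_SPS	0x0008

static void set_sps_ptes(unsigned long *ptep, unsigned long pte)
{
	int i;

	/* All four entries must be present and identical, with the
	 * HW resolving the 4k fragment from the faulting address. */
	for (i = 0; i < 4; i++)
		ptep[i] = pte | _PAGE_SPS;
}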
On 04/12/2023 11:57, Christophe Leroy wrote:
> On 04/12/2023 12:46, Ryan Roberts wrote:

[...]

>> Then you have come to the right place! :)
>>
>> https://lore.kernel.org/linux-mm/20231204102027.57185-1-ryan.roberts@arm.com/
>
> That looks great. That series only modifies core mm/.
> No changes needed in arch? Will it work on powerpc without any
> changes/additions to arch code?

Yes, there are also changes needed in arch; I have a separate series for
arm64, which transparently manages the contiguous bit when it sees
appropriate PTEs:

https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@arm.com/

> Well, I'll try it soon.
>
> Christophe
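The rough idea behind that arm64 series, sketched below with illustrative
helper names (not the series' exact API — see the link above for the real
implementation): a block of PTEs can be folded into a contpte mapping when
every entry is valid and maps consecutive pfns (attribute comparison
elided here). CONT_PTES is 16 with 4K base pages:

/* Sketch only: the real fold logic is more careful about attributes
 * and racing updates. */
static bool contpte_is_foldable(pte_t *first, unsigned long first_pfn)
{
	int i;

	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = ptep_get(first + i);

		if (!pte_valid(pte) || pte_pfn(pte) != first_pfn + i)
			return false;
	}
	return true;
}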
On Mon, Dec 04, 2023 at 11:11:26AM +0000, Ryan Roberts wrote:
> To be honest, while I understand pte_cont() and friends, I don't
> understand their relevance (or at least potential future relevance) to
> GUP?

GUP in general can be smarter, recognizing when a pte/pmd is a cont_pte
and fetching the whole pte/pmd range if the caller asked for it. Right
now it loops over each pte/pmd.

Fast-gup is better, as it at least doesn't take the pgtable lock: for
cont_pte it loops inside gup_pte_range(), which is good enough, but it'll
still do the folio checks for each sub-pte, even though the 2nd+ folio
checks should be mostly the same (ignoring races where the folio changes
while the cont_pte chunk is being processed).

Slow-gup (which is what this series is about so far) doesn't do that
either: for each whole cont_pte entry it'll loop N times, frequently
taking and releasing the pgtable lock. A smarter slow-gup could
fundamentally set up follow_page_context.page_mask when it sees a
cont_pte. There might be a challenge in whether holding the head page's
refcount would stabilize the whole folio, but that may be another
question to ask.

I also overlooked that PPC_8XX has cont_pte support as well, so we
actually have three users, not counting potential future archs adding
support to get the same TLB benefit.

Thanks,
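To make that last optimization concrete, here is a hypothetical sketch of
its shape. struct follow_page_context and its page_mask field are existing
slow-gup internals in mm/gup.c (follow_pmd_mask() already sets
ctx->page_mask = HPAGE_PMD_NR - 1 for THPs); the pte_cont() handling and
the note_cont_pte() helper are the speculative part, not kernel code:

/* Hypothetical: report a cont_pte block through the existing
 * page_mask mechanism so __get_user_pages() can fill CONT_PTES pages
 * per pgtable-lock hold instead of retaking the lock N times. */
static void note_cont_pte(struct follow_page_context *ctx, pte_t pte)
{
	if (pte_cont(pte)) {
		/* Assumes the head page's refcount stabilizes the
		 * whole folio -- the open question noted above. */
		ctx->page_mask = CONT_PTES - 1;
	}
}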
diff --git a/mm/gup.c b/mm/gup.c
index 0e00204761d2..424d45e1afb3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2816,11 +2816,6 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	if (!folio_fast_pin_allowed(folio, flags)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
 	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
 		gup_put_folio(folio, refs, flags);
 		return 0;