Message ID | 20231204142146.91437-1-david@redhat.com |
---|---|
Headers | From: David Hildenbrand <david@redhat.com>; To: linux-kernel@vger.kernel.org; Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Hugh Dickins <hughd@google.com>, Ryan Roberts <ryan.roberts@arm.com>, Yin Fengwei <fengwei.yin@intel.com>, Mike Kravetz <mike.kravetz@oracle.com>, Muchun Song <muchun.song@linux.dev>, Peter Xu <peterx@redhat.com>; Subject: [PATCH RFC 00/39] mm/rmap: interface overhaul; Date: Mon, 4 Dec 2023 15:21:07 +0100; Message-ID: <20231204142146.91437-1-david@redhat.com> |
Series | mm/rmap: interface overhaul |
Message
David Hildenbrand
Dec. 4, 2023, 2:21 p.m. UTC
Based on mm-stable from a couple of days ago.

This series proposes an overhaul of our rmap interface, getting rid of the "bool compound" / RMAP_COMPOUND parameter with the goal of making the interface less error-prone, more future-proof, and more natural to extend to "batching". Also, this converts the interface to always consume folio+subpage, which speeds up operations on large folios.

Further, this series adds PTE-batching variants for 4 rmap functions, whereby only folio_add_anon_rmap_ptes() is used for batching in this series, when PTE-remapping a PMD-mapped THP.

Ryan has series where we would make use of folio_remove_rmap_ptes() [1] -- he carries his own batching variant right now -- and folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2]. There is some overlap with both series (and some other work, like multi-size THP [3]), so that will need some coordination, and likely a stepwise inclusion. I got that started [4], but it made sense to show the whole picture. The patches of [4] are contained in here, with one additional patch added ("mm/rmap: introduce and use hugetlb_try_share_anon_rmap()") and some slight patch description changes.

In general, RMAP batching is an important optimization for PTE-mapped THP, especially once we want to move towards a total mapcount or further, as shown with my WIP patches on "mapped shared vs. mapped exclusively" [5]. The rmap batching part of [5] is also contained here in a slightly reworked form [and I found a bug due to the "compound" parameter handling in these patches that should be fixed here :) ].

This series performs a lot of folio conversion that could be separated out if there is a good reason. Most of the added LOC in the diff are only due to documentation.
As we're moving to a pte/pmd interface where we clearly express the mapping granularity we are dealing with, we first get the remainder of hugetlb out of the way, as it is special and expected to remain special: it treats everything as a "single logical PTE" and currently only allows entire mappings. Even if we'd ever support partial mappings, I strongly assume the interface and implementation will still differ heavily: hopefully we can avoid working on subpages/subpage mapcounts completely and only add a "count" parameter for them to enable batching.

New (extended) hugetlb interface that operates on the entire folio:
* hugetlb_add_new_anon_rmap() -> Already existed
* hugetlb_add_anon_rmap() -> Already existed
* hugetlb_try_dup_anon_rmap()
* hugetlb_try_share_anon_rmap()
* hugetlb_add_file_rmap()
* hugetlb_remove_rmap()

New "ordinary" interface for small folios / THP:
* folio_add_new_anon_rmap() -> Already existed
* folio_add_anon_rmap_[pte|ptes|pmd]()
* folio_try_dup_anon_rmap_[pte|ptes|pmd]()
* folio_try_share_anon_rmap_[pte|pmd]()
* folio_add_file_rmap_[pte|ptes|pmd]()
* folio_dup_file_rmap_[pte|ptes|pmd]()
* folio_remove_rmap_[pte|ptes|pmd]()

folio_add_new_anon_rmap() will always map at the biggest granularity possible (currently, a single PMD to cover a PMD-sized THP). Could be extended if ever required.

In the future, we might want "_pud" variants and eventually "_pmds" variants for batching. Further, if hugepd is ever a thing outside hugetlb code, we might want some variants for that. All stuff for the distant future.

I ran some simple microbenchmarks from [5] on an Intel(R) Xeon(R) Silver 4210R: munmap(), fork(), cow, MADV_DONTNEED on each PTE ... and PTE-remapping PMD-mapped THPs on 1 GiB of memory.

For small folios, there is barely a change (< 1% performance improvement), whereby fork() still stands out with 0.74% performance improvement, but it might be just some noise. Folio optimizations don't help that much with small folios.
For PTE-mapped THP:
* PTE-remapping a PMD-mapped THP is more than 10% faster.
  -> RMAP batching
* fork() is more than 4% faster.
  -> folio conversion
* MADV_DONTNEED is 2% faster.
  -> folio conversion
* COW by writing only a single byte on a COW-shared PTE.
  -> folio conversion
* munmap() is only slightly faster (< 1%).

[1] https://lkml.kernel.org/r/20230810103332.3062143-1-ryan.roberts@arm.com
[2] https://lkml.kernel.org/r/20231204105440.61448-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20231204102027.57185-1-ryan.roberts@arm.com
[4] https://lkml.kernel.org/r/20231128145205.215026-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>

David Hildenbrand (39):
  mm/rmap: rename hugepage_add* to hugetlb_add*
  mm/rmap: introduce and use hugetlb_remove_rmap()
  mm/rmap: introduce and use hugetlb_add_file_rmap()
  mm/rmap: introduce and use hugetlb_try_dup_anon_rmap()
  mm/rmap: introduce and use hugetlb_try_share_anon_rmap()
  mm/rmap: add hugetlb sanity checks
  mm/rmap: convert folio_add_file_rmap_range() into folio_add_file_rmap_[pte|ptes|pmd]()
  mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()
  mm/huge_memory: page_add_file_rmap() -> folio_add_file_rmap_pmd()
  mm/migrate: page_add_file_rmap() -> folio_add_file_rmap_pte()
  mm/userfaultfd: page_add_file_rmap() -> folio_add_file_rmap_pte()
  mm/rmap: remove page_add_file_rmap()
  mm/rmap: factor out adding folio mappings into __folio_add_rmap()
  mm/rmap: introduce folio_add_anon_rmap_[pte|ptes|pmd]()
  mm/huge_memory: batch rmap operations in __split_huge_pmd_locked()
  mm/huge_memory: page_add_anon_rmap() -> folio_add_anon_rmap_pmd()
  mm/migrate: page_add_anon_rmap() -> folio_add_anon_rmap_pte()
  mm/ksm: page_add_anon_rmap() -> folio_add_anon_rmap_pte()
  mm/swapfile: page_add_anon_rmap() -> folio_add_anon_rmap_pte()
  mm/memory: page_add_anon_rmap() -> folio_add_anon_rmap_pte()
  mm/rmap: remove page_add_anon_rmap()
  mm/rmap: remove RMAP_COMPOUND
  mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()
  kernel/events/uprobes: page_remove_rmap() -> folio_remove_rmap_pte()
  mm/huge_memory: page_remove_rmap() -> folio_remove_rmap_pmd()
  mm/khugepaged: page_remove_rmap() -> folio_remove_rmap_pte()
  mm/ksm: page_remove_rmap() -> folio_remove_rmap_pte()
  mm/memory: page_remove_rmap() -> folio_remove_rmap_pte()
  mm/migrate_device: page_remove_rmap() -> folio_remove_rmap_pte()
  mm/rmap: page_remove_rmap() -> folio_remove_rmap_pte()
  Documentation: stop referring to page_remove_rmap()
  mm/rmap: remove page_remove_rmap()
  mm/rmap: convert page_dup_file_rmap() to folio_dup_file_rmap_[pte|ptes|pmd]()
  mm/rmap: introduce folio_try_dup_anon_rmap_[pte|ptes|pmd]()
  mm/huge_memory: page_try_dup_anon_rmap() -> folio_try_dup_anon_rmap_pmd()
  mm/memory: page_try_dup_anon_rmap() -> folio_try_dup_anon_rmap_pte()
  mm/rmap: remove page_try_dup_anon_rmap()
  mm: convert page_try_share_anon_rmap() to folio_try_share_anon_rmap_[pte|pmd]()
  mm/rmap: rename COMPOUND_MAPPED to ENTIRELY_MAPPED

 Documentation/mm/transhuge.rst       |   4 +-
 Documentation/mm/unevictable-lru.rst |   4 +-
 include/linux/mm.h                   |   6 +-
 include/linux/rmap.h                 | 380 +++++++++++++++++++-----
 kernel/events/uprobes.c              |   2 +-
 mm/gup.c                             |   2 +-
 mm/huge_memory.c                     |  85 +++---
 mm/hugetlb.c                         |  21 +-
 mm/internal.h                        |  12 +-
 mm/khugepaged.c                      |  17 +-
 mm/ksm.c                             |  15 +-
 mm/memory-failure.c                  |   4 +-
 mm/memory.c                          |  60 ++--
 mm/migrate.c                         |  12 +-
 mm/migrate_device.c                  |  41 +--
 mm/mmu_gather.c                      |   2 +-
 mm/rmap.c                            | 422 ++++++++++++++++-----------
 mm/swapfile.c                        |   2 +-
 mm/userfaultfd.c                     |   2 +-
 19 files changed, 709 insertions(+), 384 deletions(-)
Comments
On 04/12/2023 14:21, David Hildenbrand wrote:
> Based on mm-stable from a couple of days ago.
>
> This series proposes an overhaul of our rmap interface, getting rid of the
> "bool compound" / RMAP_COMPOUND parameter with the goal of making the
> interface less error-prone, more future-proof, and more natural to extend
> to "batching". Also, this converts the interface to always consume
> folio+subpage, which speeds up operations on large folios.
>
> Further, this series adds PTE-batching variants for 4 rmap functions,
> whereby only folio_add_anon_rmap_ptes() is used for batching in this series
> when PTE-remapping a PMD-mapped THP.

I certainly support the objective you have here; making the interfaces clearer, more consistent and more amenable to batching. I'll try to find some time this week to review.

> Ryan has series where we would make use of folio_remove_rmap_ptes() [1]
> -- he carries his own batching variant right now -- and
> folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2].

Note that the contpte series at [2] has a new patch in v3 (patch 2), which could benefit from folio_remove_rmap_ptes() or equivalent. My plan was to revive [1] on top of [2] once it is merged.

> There is some overlap with both series (and some other work, like
> multi-size THP [3]), so that will need some coordination, and likely a
> stepwise inclusion.

Selfishly, I'd really like to get my stuff merged as soon as there is no technical reason not to. I'd prefer not to add this as a dependency if we can help it.

> New (extended) hugetlb interface that operates on the entire folio:
> * hugetlb_add_new_anon_rmap() -> Already existed
> * hugetlb_add_anon_rmap() -> Already existed
> * hugetlb_try_dup_anon_rmap()
> * hugetlb_try_share_anon_rmap()
> * hugetlb_add_file_rmap()
> * hugetlb_remove_rmap()
>
> New "ordinary" interface for small folios / THP:
> * folio_add_new_anon_rmap() -> Already existed
> * folio_add_anon_rmap_[pte|ptes|pmd]()
> * folio_try_dup_anon_rmap_[pte|ptes|pmd]()
> * folio_try_share_anon_rmap_[pte|pmd]()
> * folio_add_file_rmap_[pte|ptes|pmd]()
> * folio_dup_file_rmap_[pte|ptes|pmd]()
> * folio_remove_rmap_[pte|ptes|pmd]()

I'm not sure if there are official guidelines, but personally if we are reworking the API, I'd take the opportunity to move "rmap" to the front of the name, rather than having it buried in the middle as it is for some of these:

rmap_hugetlb_*()
rmap_folio_*()

I guess reading the patches will tell me, but what's the point of "ptes"? Surely you're either mapping at pte or pmd level, and the number of pages is determined by the folio size? (or presumably nr param passed in)

Thanks,
Ryan

> [...]
>> Ryan has series where we would make use of folio_remove_rmap_ptes() [1]
>> -- he carries his own batching variant right now -- and
>> folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2].
>
> Note that the contpte series at [2] has a new patch in v3 (patch 2), which could
> benefit from folio_remove_rmap_ptes() or equivalent. My plan was to revive [1]
> on top of [2] once it is merged.
>
>> There is some overlap with both series (and some other work, like
>> multi-size THP [3]), so that will need some coordination, and likely a
>> stepwise inclusion.
>
> Selfishly, I'd really like to get my stuff merged as soon as there is no
> technical reason not to. I'd prefer not to add this as a dependency if we can
> help it.

It's easy to rework either series on top of each other. The mTHP series has highest priority, no question: that will go in first.

Regarding the contpte series, I think it needs more work. Especially, as raised, to not degrade order-0 performance. Maybe we won't make the next merge window (and you already predicted that in some cover letter :P ). Let's see.

But again, the conflicts are all trivial, so I'll happily rebase on top of whatever is in mm-unstable. Or move the relevant rework to the front so you can just carry them/base on them. (the batched variants for dup do make the contpte code much easier)

[...]
>> New (extended) hugetlb interface that operates on the entire folio:
>> * hugetlb_add_new_anon_rmap() -> Already existed
>> * hugetlb_add_anon_rmap() -> Already existed
>> * hugetlb_try_dup_anon_rmap()
>> * hugetlb_try_share_anon_rmap()
>> * hugetlb_add_file_rmap()
>> * hugetlb_remove_rmap()
>>
>> New "ordinary" interface for small folios / THP:
>> * folio_add_new_anon_rmap() -> Already existed
>> * folio_add_anon_rmap_[pte|ptes|pmd]()
>> * folio_try_dup_anon_rmap_[pte|ptes|pmd]()
>> * folio_try_share_anon_rmap_[pte|pmd]()
>> * folio_add_file_rmap_[pte|ptes|pmd]()
>> * folio_dup_file_rmap_[pte|ptes|pmd]()
>> * folio_remove_rmap_[pte|ptes|pmd]()
>
> I'm not sure if there are official guidelines, but personally if we are
> reworking the API, I'd take the opportunity to move "rmap" to the front of the
> name, rather than having it buried in the middle as it is for some of these:
>
> rmap_hugetlb_*()
> rmap_folio_*()

No strong opinion. But we might want slightly different names then. For example, it's "bio_add_folio" and not "bio_folio_add":

	rmap_add_new_anon_hugetlb()
	rmap_add_anon_hugetlb()
	...
	rmap_remove_hugetlb()

	rmap_add_new_anon_folio()
	rmap_add_anon_folio_[pte|ptes|pmd]()
	...
	rmap_dup_file_folio_[pte|ptes|pmd]()
	rmap_remove_folio_[pte|ptes|pmd]()

Thoughts?

> I guess reading the patches will tell me, but what's the point of "ptes"?
> Surely you're either mapping at pte or pmd level, and the number of pages is
> determined by the folio size? (or presumably nr param passed in)

It's really (currently) one function to handle 1 vs. multiple PTEs. For example:

	void folio_remove_rmap_ptes(struct folio *, struct page *,
			unsigned int nr, struct vm_area_struct *);

	#define folio_remove_rmap_pte(folio, page, vma) \
		folio_remove_rmap_ptes(folio, page, 1, vma)

	void folio_remove_rmap_pmd(struct folio *, struct page *,
			struct vm_area_struct *);

One could let the compiler generate specialized variants for the single-pte versions to make the order-0 case faster. For now it's just a helper macro.
On 04/12/2023 19:53, Ryan Roberts wrote:
> On 04/12/2023 14:21, David Hildenbrand wrote:
> [...]
>> New (extended) hugetlb interface that operates on the entire folio:
>> * hugetlb_add_new_anon_rmap() -> Already existed
>> * hugetlb_add_anon_rmap() -> Already existed
>> * hugetlb_try_dup_anon_rmap()
>> * hugetlb_try_share_anon_rmap()
>> * hugetlb_add_file_rmap()
>> * hugetlb_remove_rmap()
>>
>> New "ordinary" interface for small folios / THP:
>> * folio_add_new_anon_rmap() -> Already existed
>> * folio_add_anon_rmap_[pte|ptes|pmd]()
>> * folio_try_dup_anon_rmap_[pte|ptes|pmd]()
>> * folio_try_share_anon_rmap_[pte|pmd]()
>> * folio_add_file_rmap_[pte|ptes|pmd]()
>> * folio_dup_file_rmap_[pte|ptes|pmd]()
>> * folio_remove_rmap_[pte|ptes|pmd]()
>
> I'm not sure if there are official guidelines, but personally if we are
> reworking the API, I'd take the opportunity to move "rmap" to the front of the
> name, rather than having it buried in the middle as it is for some of these:
>
> rmap_hugetlb_*()
> rmap_folio_*()

In fact, I'd be inclined to drop the "folio" to shorten the name; everything is a folio, so it's not telling us much. e.g.:

New (extended) hugetlb interface that operates on the entire folio:
* rmap_hugetlb_add_new_anon() -> Already existed
* rmap_hugetlb_add_anon() -> Already existed
* rmap_hugetlb_try_dup_anon()
* rmap_hugetlb_try_share_anon()
* rmap_hugetlb_add_file()
* rmap_hugetlb_remove()

New "ordinary" interface for small folios / THP:
* rmap_add_new_anon() -> Already existed
* rmap_add_anon_[pte|ptes|pmd]()
* rmap_try_dup_anon_[pte|ptes|pmd]()
* rmap_try_share_anon_[pte|pmd]()
* rmap_add_file_[pte|ptes|pmd]()
* rmap_dup_file_[pte|ptes|pmd]()
* rmap_remove_[pte|ptes|pmd]()

> I guess reading the patches will tell me, but what's the point of "ptes"?
> Surely you're either mapping at pte or pmd level, and the number of pages is
> determined by the folio size?
> (or presumably nr param passed in)
>
> Thanks,
> Ryan

[...]
On 05/12/2023 09:56, David Hildenbrand wrote:
>>>
>>> Ryan has series where we would make use of folio_remove_rmap_ptes() [1]
>>> -- he carries his own batching variant right now -- and
>>> folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2].
>>
>> Note that the contpte series at [2] has a new patch in v3 (patch 2), which could
>> benefit from folio_remove_rmap_ptes() or equivalent. My plan was to revive [1]
>> on top of [2] once it is merged.
>>
>>>
>>> There is some overlap with both series (and some other work, like
>>> multi-size THP [3]), so that will need some coordination, and likely a
>>> stepwise inclusion.
>>
>> Selfishly, I'd really like to get my stuff merged as soon as there is no
>> technical reason not to. I'd prefer not to add this as a dependency if we can
>> help it.
>
> It's easy to rework either series on top of each other. The mTHP series has
> highest priority, no question; that will go in first.

Music to my ears! It would be great to either get a reviewed-by, or feedback on
why not, for the two key patches in that series (3 & 4), and also your opinion
on whether we need to wait for compaction to land (see cover letter). It would
be great to get this into linux-next ASAP IMHO.

>
> Regarding the contpte, I think it needs more work. Especially, as raised, to
> not degrade order-0 performance. Maybe we won't make the next merge window
> (and you already predicted that in some cover letter :P ). Let's see.

Yeah, that's ok. I'll do the work to fix the order-0 perf. And also do the same
for patch 2 in that series - it would also be really helpful if you had a chance
to look at patch 2 - it's new for v3.

>
> But again, the conflicts are all trivial, so I'll happily rebase on top of
> whatever is in mm-unstable. Or move the relevant rework to the front so you
> can just carry them/base on them.
(the batched variants for dup do make the contpte code much > easier) So perhaps we should aim for mTHP, then this, then contpte last, benefiting from the batching. > > [...] > >>> >>> >>> New (extended) hugetlb interface that operate on entire folio: >>> * hugetlb_add_new_anon_rmap() -> Already existed >>> * hugetlb_add_anon_rmap() -> Already existed >>> * hugetlb_try_dup_anon_rmap() >>> * hugetlb_try_share_anon_rmap() >>> * hugetlb_add_file_rmap() >>> * hugetlb_remove_rmap() >>> >>> New "ordinary" interface for small folios / THP:: >>> * folio_add_new_anon_rmap() -> Already existed >>> * folio_add_anon_rmap_[pte|ptes|pmd]() >>> * folio_try_dup_anon_rmap_[pte|ptes|pmd]() >>> * folio_try_share_anon_rmap_[pte|pmd]() >>> * folio_add_file_rmap_[pte|ptes|pmd]() >>> * folio_dup_file_rmap_[pte|ptes|pmd]() >>> * folio_remove_rmap_[pte|ptes|pmd]() >> >> I'm not sure if there are official guidelines, but personally if we are >> reworking the API, I'd take the opportunity to move "rmap" to the front of the >> name, rather than having it burried in the middle as it is for some of these: >> >> rmap_hugetlb_*() >> >> rmap_folio_*() > > No strong opinion. But we might want slightly different names then. For example, > it's "bio_add_folio" and not "bio_folio_add": > > > rmap_add_new_anon_hugetlb() > rmap_add_anon_hugetlb() > ... > rmap_remove_hugetlb() > > > rmap_add_new_anon_folio() > rmap_add_anon_folio_[pte|ptes|pmd]() > ... > rmap_dup_file_folio_[pte|ptes|pmd]() > rmap_remove_folio_[pte|ptes|pmd]() > > Thoughts? Having now reviewed your series, I have a less strong opinion, perhaps it's actually best with your original names; "folio" is actually the subject after all; it's the thing being operated on. > >> >> I guess reading the patches will tell me, but what's the point of "ptes"? Surely >> you're either mapping at pte or pmd level, and the number of pages is determined >> by the folio size? 
(or presumably nr param passed in)
>
> It's really (currently) one function to handle 1 vs. multiple PTEs. For example:
>
> void folio_remove_rmap_ptes(struct folio *, struct page *, unsigned int nr,
> 		struct vm_area_struct *);
> #define folio_remove_rmap_pte(folio, page, vma) \
> 	folio_remove_rmap_ptes(folio, page, 1, vma)
> void folio_remove_rmap_pmd(struct folio *, struct page *,
> 		struct vm_area_struct *);

Yeah, now that I've looked at the series, this makes sense. "ptes" was
originally making me think of contpte, but I suspect I'll be the only one with
that association :)

>
> One could let the compiler generate specialized variants for the single-pte
> versions to make the order-0 case faster. For now it's just a helper macro.
>
On 05.12.23 14:31, Ryan Roberts wrote: > On 05/12/2023 09:56, David Hildenbrand wrote: >>>> >>>> Ryan has series where we would make use of folio_remove_rmap_ptes() [1] >>>> -- he carries his own batching variant right now -- and >>>> folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2]. >>> >>> Note that the contpte series at [2] has a new patch in v3 (patch 2), which could >>> benefit from folio_remove_rmap_ptes() or equivalent. My plan was to revive [1] >>> on top of [2] once it is merged. >>> >>>> >>>> There is some overlap with both series (and some other work, like >>>> multi-size THP [3]), so that will need some coordination, and likely a >>>> stepwise inclusion. >>> >>> Selfishly, I'd really like to get my stuff merged as soon as there is no >>> technical reason not to. I'd prefer not to add this as a dependency if we can >>> help it. >> >> It's easy to rework either series on top of each other. The mTHP series has >> highest priority, >> no question, that will go in first. > > Music to my ears! It would be great to either get a reviewed-by or feedback on > why not, for the key 2 patches in that series (3 & 4) and also your opinion on > whether we need to wait for compaction to land (see cover letter). It would be > great to get this into linux-next ASAP IMHO. On it :) > >> >> Regarding the contpte, I think it needs more work. Especially, as raised, to not >> degrade >> order-0 performance. Maybe we won't make the next merge window (and you already >> predicated >> that in some cover letter :P ). Let's see. > > Yeah that's ok. I'll do the work to fix the order-0 perf. And also do the same > for patch 2 in that series - would also be really helpful if you had a chance to > look at patch 2 - its new for v3. I only skimmed over it, but it seems to go into the direction we'll need. Keeping order-0 performance unharmed should have highest priority. Hopefully my microbenchmarks are helpful. 
> >> >> But again, the conflicts are all trivial, so I'll happily rebase on top of >> whatever is >> in mm-unstable. Or move the relevant rework to the front so you can just carry >> them/base on them. (the batched variants for dup do make the contpte code much >> easier) > > So perhaps we should aim for mTHP, then this, then contpte last, benefiting from > the batching. Yeah. And again, I don't care too much if I have to rebase on top of your work if this here takes longer. It's all a fairly trivial conversion. >> >> [...] >> >>>> >>>> >>>> New (extended) hugetlb interface that operate on entire folio: >>>> * hugetlb_add_new_anon_rmap() -> Already existed >>>> * hugetlb_add_anon_rmap() -> Already existed >>>> * hugetlb_try_dup_anon_rmap() >>>> * hugetlb_try_share_anon_rmap() >>>> * hugetlb_add_file_rmap() >>>> * hugetlb_remove_rmap() >>>> >>>> New "ordinary" interface for small folios / THP:: >>>> * folio_add_new_anon_rmap() -> Already existed >>>> * folio_add_anon_rmap_[pte|ptes|pmd]() >>>> * folio_try_dup_anon_rmap_[pte|ptes|pmd]() >>>> * folio_try_share_anon_rmap_[pte|pmd]() >>>> * folio_add_file_rmap_[pte|ptes|pmd]() >>>> * folio_dup_file_rmap_[pte|ptes|pmd]() >>>> * folio_remove_rmap_[pte|ptes|pmd]() >>> >>> I'm not sure if there are official guidelines, but personally if we are >>> reworking the API, I'd take the opportunity to move "rmap" to the front of the >>> name, rather than having it burried in the middle as it is for some of these: >>> >>> rmap_hugetlb_*() >>> >>> rmap_folio_*() >> >> No strong opinion. But we might want slightly different names then. For example, >> it's "bio_add_folio" and not "bio_folio_add": >> >> >> rmap_add_new_anon_hugetlb() >> rmap_add_anon_hugetlb() >> ... >> rmap_remove_hugetlb() >> >> >> rmap_add_new_anon_folio() >> rmap_add_anon_folio_[pte|ptes|pmd]() >> ... >> rmap_dup_file_folio_[pte|ptes|pmd]() >> rmap_remove_folio_[pte|ptes|pmd]() >> >> Thoughts? 
>
> Having now reviewed your series, I have a less strong opinion, perhaps it's
> actually best with your original names; "folio" is actually the subject after
> all; it's the thing being operated on.
>

I think having "folio" in there looks cleaner and more consistent with other
functions.

I tend to like "rmap_dup_file_folio_[pte|ptes|pmd]()", because then we have
"file folio" and "anon folio" as one word.

But then I wonder about the hugetlb part. Maybe simply
"hugetlb_rmap_remove_folio()" etc.

Having the "hugetlb_" prefix at the beginning feels like the right thing to do,
looking at other hugetlb special-handlings.

But I'll wait a bit until I go crazy on renaming :)

>
>>
>>>
>>> I guess reading the patches will tell me, but what's the point of "ptes"? Surely
>>> you're either mapping at pte or pmd level, and the number of pages is determined
>>> by the folio size? (or presumably nr param passed in)
>>
>> It's really (currently) one function to handle 1 vs. multiple PTEs. For example:
>>
>> void folio_remove_rmap_ptes(struct folio *, struct page *, unsigned int nr,
>> 		struct vm_area_struct *);
>> #define folio_remove_rmap_pte(folio, page, vma) \
>> 	folio_remove_rmap_ptes(folio, page, 1, vma)
>> void folio_remove_rmap_pmd(struct folio *, struct page *,
>> 		struct vm_area_struct *);
>
> Yeah now that I've looked at the series, this makes sense. "ptes" was originally
> making me think of contpte, but I suspect I'll be the only one with that
> association :)

Ah, yes :)
On 05/12/2023 13:39, David Hildenbrand wrote: > On 05.12.23 14:31, Ryan Roberts wrote: >> On 05/12/2023 09:56, David Hildenbrand wrote: >>>>> >>>>> Ryan has series where we would make use of folio_remove_rmap_ptes() [1] >>>>> -- he carries his own batching variant right now -- and >>>>> folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2]. >>>> >>>> Note that the contpte series at [2] has a new patch in v3 (patch 2), which >>>> could >>>> benefit from folio_remove_rmap_ptes() or equivalent. My plan was to revive [1] >>>> on top of [2] once it is merged. >>>> >>>>> >>>>> There is some overlap with both series (and some other work, like >>>>> multi-size THP [3]), so that will need some coordination, and likely a >>>>> stepwise inclusion. >>>> >>>> Selfishly, I'd really like to get my stuff merged as soon as there is no >>>> technical reason not to. I'd prefer not to add this as a dependency if we can >>>> help it. >>> >>> It's easy to rework either series on top of each other. The mTHP series has >>> highest priority, >>> no question, that will go in first. >> >> Music to my ears! It would be great to either get a reviewed-by or feedback on >> why not, for the key 2 patches in that series (3 & 4) and also your opinion on >> whether we need to wait for compaction to land (see cover letter). It would be >> great to get this into linux-next ASAP IMHO. > > On it :) > >> >>> >>> Regarding the contpte, I think it needs more work. Especially, as raised, to not >>> degrade >>> order-0 performance. Maybe we won't make the next merge window (and you already >>> predicated >>> that in some cover letter :P ). Let's see. >> >> Yeah that's ok. I'll do the work to fix the order-0 perf. And also do the same >> for patch 2 in that series - would also be really helpful if you had a chance to >> look at patch 2 - its new for v3. > > I only skimmed over it, but it seems to go into the direction we'll need. > Keeping order-0 performance unharmed should have highest priority. 
Hopefully my > microbenchmarks are helpful. Yes absolutely - are you able to share them?? > >> >>> >>> But again, the conflicts are all trivial, so I'll happily rebase on top of >>> whatever is >>> in mm-unstable. Or move the relevant rework to the front so you can just carry >>> them/base on them. (the batched variants for dup do make the contpte code much >>> easier) >> >> So perhaps we should aim for mTHP, then this, then contpte last, benefiting from >> the batching. > > Yeah. And again, I don't care too much if I have to rebase on top of your work > if this here takes longer. It's all a fairly trivial conversion. > >>> >>> [...] >>> >>>>> >>>>> >>>>> New (extended) hugetlb interface that operate on entire folio: >>>>> * hugetlb_add_new_anon_rmap() -> Already existed >>>>> * hugetlb_add_anon_rmap() -> Already existed >>>>> * hugetlb_try_dup_anon_rmap() >>>>> * hugetlb_try_share_anon_rmap() >>>>> * hugetlb_add_file_rmap() >>>>> * hugetlb_remove_rmap() >>>>> >>>>> New "ordinary" interface for small folios / THP:: >>>>> * folio_add_new_anon_rmap() -> Already existed >>>>> * folio_add_anon_rmap_[pte|ptes|pmd]() >>>>> * folio_try_dup_anon_rmap_[pte|ptes|pmd]() >>>>> * folio_try_share_anon_rmap_[pte|pmd]() >>>>> * folio_add_file_rmap_[pte|ptes|pmd]() >>>>> * folio_dup_file_rmap_[pte|ptes|pmd]() >>>>> * folio_remove_rmap_[pte|ptes|pmd]() >>>> >>>> I'm not sure if there are official guidelines, but personally if we are >>>> reworking the API, I'd take the opportunity to move "rmap" to the front of the >>>> name, rather than having it burried in the middle as it is for some of these: >>>> >>>> rmap_hugetlb_*() >>>> >>>> rmap_folio_*() >>> >>> No strong opinion. But we might want slightly different names then. For example, >>> it's "bio_add_folio" and not "bio_folio_add": >>> >>> >>> rmap_add_new_anon_hugetlb() >>> rmap_add_anon_hugetlb() >>> ... >>> rmap_remove_hugetlb() >>> >>> >>> rmap_add_new_anon_folio() >>> rmap_add_anon_folio_[pte|ptes|pmd]() >>> ... 
>>> rmap_dup_file_folio_[pte|ptes|pmd]() >>> rmap_remove_folio_[pte|ptes|pmd]() >>> >>> Thoughts? >> >> Having now reviewed your series, I have a less strong opinion, perhaps it's >> actually best with your original names; "folio" is actually the subject after >> all; it's the thing being operated on. >> > > I think having "folio" in there looks cleaner and more consistent to other > functions. > > I tend to like "rmap_dup_file_folio_[pte|ptes|pmd]()", because then we have > "file folio" and "anon folio" as one word. > > But then I wonder about the hugetlb part. Maybe simply > "hugetlb_rmap_remove_folio()" etc. > > Having the "hugetlb_" prefix at the beginning feels like the right thing to do, > looking at orher hugetlb special-handlings. > > But I'll wait a bit until I go crazy on renaming :) I suspect we could argue in multiple directions for hours :) Let's see if others have opinions. FWIW, I've looked through all the patches; I like what I see! This is a really nice clean up and will definitely help with the various patch sets I've been working on. Apart from the comments I've already raised, looks in pretty good shape to me. > >> >>> >>>> >>>> I guess reading the patches will tell me, but what's the point of "ptes"? >>>> Surely >>>> you're either mapping at pte or pmd level, and the number of pages is >>>> determined >>>> by the folio size? (or presumably nr param passed in) >>> >>> It's really (currently) one function to handle 1 vs. multiple PTEs. For example: >>> >>> void folio_remove_rmap_ptes(struct folio *, struct page *, unsigned int nr, >>> struct vm_area_struct *); >>> #define folio_remove_rmap_pte(folio, page, vma) \ >>> folio_remove_rmap_ptes(folio, page, 1, vma) >>> void folio_remove_rmap_pmd(struct folio *, struct page *, >>> struct vm_area_struct *); >> >> Yeah now that I've looked at the series, this makes sense. "ptes" was originally >> making me think of contpte, but I suspect I'll be the only one with that >> association :) > > Ah, yes :) >
>>>>>>> >>>> Regarding the contpte, I think it needs more work. Especially, as raised, to not >>>> degrade >>>> order-0 performance. Maybe we won't make the next merge window (and you already >>>> predicated >>>> that in some cover letter :P ). Let's see. >>> >>> Yeah that's ok. I'll do the work to fix the order-0 perf. And also do the same >>> for patch 2 in that series - would also be really helpful if you had a chance to >>> look at patch 2 - its new for v3. >> >> I only skimmed over it, but it seems to go into the direction we'll need. >> Keeping order-0 performance unharmed should have highest priority. Hopefully my >> microbenchmarks are helpful. > > Yes absolutely - are you able to share them?? I shared them in the reply to your patchset. Let me know if you can't find them. [...] >>> Having now reviewed your series, I have a less strong opinion, perhaps it's >>> actually best with your original names; "folio" is actually the subject after >>> all; it's the thing being operated on. >>> >> >> I think having "folio" in there looks cleaner and more consistent to other >> functions. >> >> I tend to like "rmap_dup_file_folio_[pte|ptes|pmd]()", because then we have >> "file folio" and "anon folio" as one word. >> >> But then I wonder about the hugetlb part. Maybe simply >> "hugetlb_rmap_remove_folio()" etc. >> >> Having the "hugetlb_" prefix at the beginning feels like the right thing to do, >> looking at orher hugetlb special-handlings. >> >> But I'll wait a bit until I go crazy on renaming :) > > I suspect we could argue in multiple directions for hours :) :) > > Let's see if others have opinions. > > FWIW, I've looked through all the patches; I like what I see! This is a really > nice clean up and will definitely help with the various patch sets I've been > working on. Apart from the comments I've already raised, looks in pretty good > shape to me. Thanks!
On 12/4/2023 10:21 PM, David Hildenbrand wrote:
> Let's convert remove_migration_pmd() and while at it, perform some folio
> conversion.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

> ---
>  mm/huge_memory.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4f542444a91f2..cb33c6e0404cf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3276,6 +3276,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  
>  void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  {
> +	struct folio *folio = page_folio(new);
>  	struct vm_area_struct *vma = pvmw->vma;
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long address = pvmw->address;
> @@ -3287,7 +3288,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  		return;
>  
>  	entry = pmd_to_swp_entry(*pvmw->pmd);
> -	get_page(new);
> +	folio_get(folio);
>  	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
>  	if (pmd_swp_soft_dirty(*pvmw->pmd))
>  		pmde = pmd_mksoft_dirty(pmde);
> @@ -3298,10 +3299,10 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  	if (!is_migration_entry_young(entry))
>  		pmde = pmd_mkold(pmde);
>  	/* NOTE: this may contain setting soft-dirty on some archs */
> -	if (PageDirty(new) && is_migration_entry_dirty(entry))
> +	if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
>  		pmde = pmd_mkdirty(pmde);
>  
> -	if (PageAnon(new)) {
> +	if (folio_test_anon(folio)) {
>  		rmap_t rmap_flags = RMAP_COMPOUND;
>  
>  		if (!is_readable_migration_entry(entry))
> @@ -3309,9 +3310,9 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  
>  		page_add_anon_rmap(new, vma, haddr, rmap_flags);
>  	} else {
> -		page_add_file_rmap(new, vma, true);
> +		folio_add_file_rmap_pmd(folio, new, vma);
>  	}
> -	VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new));
> +	VM_BUG_ON(pmd_write(pmde) && folio_test_anon(folio) && !PageAnonExclusive(new));
>  	set_pmd_at(mm, haddr, pvmw->pmd, pmde);
>  
>  	/* No need to invalidate - it was non-present before */
On 05.12.23 14:49, Ryan Roberts wrote: > On 05/12/2023 13:39, David Hildenbrand wrote: >> On 05.12.23 14:31, Ryan Roberts wrote: >>> On 05/12/2023 09:56, David Hildenbrand wrote: >>>>>> >>>>>> Ryan has series where we would make use of folio_remove_rmap_ptes() [1] >>>>>> -- he carries his own batching variant right now -- and >>>>>> folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2]. >>>>> >>>>> Note that the contpte series at [2] has a new patch in v3 (patch 2), which >>>>> could >>>>> benefit from folio_remove_rmap_ptes() or equivalent. My plan was to revive [1] >>>>> on top of [2] once it is merged. >>>>> >>>>>> >>>>>> There is some overlap with both series (and some other work, like >>>>>> multi-size THP [3]), so that will need some coordination, and likely a >>>>>> stepwise inclusion. >>>>> >>>>> Selfishly, I'd really like to get my stuff merged as soon as there is no >>>>> technical reason not to. I'd prefer not to add this as a dependency if we can >>>>> help it. >>>> >>>> It's easy to rework either series on top of each other. The mTHP series has >>>> highest priority, >>>> no question, that will go in first. >>> >>> Music to my ears! It would be great to either get a reviewed-by or feedback on >>> why not, for the key 2 patches in that series (3 & 4) and also your opinion on >>> whether we need to wait for compaction to land (see cover letter). It would be >>> great to get this into linux-next ASAP IMHO. >> >> On it :) >> >>> >>>> >>>> Regarding the contpte, I think it needs more work. Especially, as raised, to not >>>> degrade >>>> order-0 performance. Maybe we won't make the next merge window (and you already >>>> predicated >>>> that in some cover letter :P ). Let's see. >>> >>> Yeah that's ok. I'll do the work to fix the order-0 perf. And also do the same >>> for patch 2 in that series - would also be really helpful if you had a chance to >>> look at patch 2 - its new for v3. 
>> >> I only skimmed over it, but it seems to go into the direction we'll need. >> Keeping order-0 performance unharmed should have highest priority. Hopefully my >> microbenchmarks are helpful. > > Yes absolutely - are you able to share them?? > >> >>> >>>> >>>> But again, the conflicts are all trivial, so I'll happily rebase on top of >>>> whatever is >>>> in mm-unstable. Or move the relevant rework to the front so you can just carry >>>> them/base on them. (the batched variants for dup do make the contpte code much >>>> easier) >>> >>> So perhaps we should aim for mTHP, then this, then contpte last, benefiting from >>> the batching. >> >> Yeah. And again, I don't care too much if I have to rebase on top of your work >> if this here takes longer. It's all a fairly trivial conversion. >> >>>> >>>> [...] >>>> >>>>>> >>>>>> >>>>>> New (extended) hugetlb interface that operate on entire folio: >>>>>> * hugetlb_add_new_anon_rmap() -> Already existed >>>>>> * hugetlb_add_anon_rmap() -> Already existed >>>>>> * hugetlb_try_dup_anon_rmap() >>>>>> * hugetlb_try_share_anon_rmap() >>>>>> * hugetlb_add_file_rmap() >>>>>> * hugetlb_remove_rmap() >>>>>> >>>>>> New "ordinary" interface for small folios / THP:: >>>>>> * folio_add_new_anon_rmap() -> Already existed >>>>>> * folio_add_anon_rmap_[pte|ptes|pmd]() >>>>>> * folio_try_dup_anon_rmap_[pte|ptes|pmd]() >>>>>> * folio_try_share_anon_rmap_[pte|pmd]() >>>>>> * folio_add_file_rmap_[pte|ptes|pmd]() >>>>>> * folio_dup_file_rmap_[pte|ptes|pmd]() >>>>>> * folio_remove_rmap_[pte|ptes|pmd]() >>>>> >>>>> I'm not sure if there are official guidelines, but personally if we are >>>>> reworking the API, I'd take the opportunity to move "rmap" to the front of the >>>>> name, rather than having it burried in the middle as it is for some of these: >>>>> >>>>> rmap_hugetlb_*() >>>>> >>>>> rmap_folio_*() >>>> >>>> No strong opinion. But we might want slightly different names then. 
For example,
>>>> it's "bio_add_folio" and not "bio_folio_add":
>>>>
>>>>
>>>> rmap_add_new_anon_hugetlb()
>>>> rmap_add_anon_hugetlb()
>>>> ...
>>>> rmap_remove_hugetlb()
>>>>
>>>>
>>>> rmap_add_new_anon_folio()
>>>> rmap_add_anon_folio_[pte|ptes|pmd]()
>>>> ...
>>>> rmap_dup_file_folio_[pte|ptes|pmd]()
>>>> rmap_remove_folio_[pte|ptes|pmd]()
>>>>
>>>> Thoughts?
>>>
>>> Having now reviewed your series, I have a less strong opinion, perhaps it's
>>> actually best with your original names; "folio" is actually the subject after
>>> all; it's the thing being operated on.

So far I've stuck with the original names used in this RFC. I'm testing a new
series that is based on current mm/unstable (especially, mTHP) and contains all
changes discussed here.

If I don't hear anything else, I'll send that out as v1 on Monday.

Thanks!
On 08/12/2023 11:24, David Hildenbrand wrote: > On 05.12.23 14:49, Ryan Roberts wrote: >> On 05/12/2023 13:39, David Hildenbrand wrote: >>> On 05.12.23 14:31, Ryan Roberts wrote: >>>> On 05/12/2023 09:56, David Hildenbrand wrote: >>>>>>> >>>>>>> Ryan has series where we would make use of folio_remove_rmap_ptes() [1] >>>>>>> -- he carries his own batching variant right now -- and >>>>>>> folio_try_dup_anon_rmap_ptes()/folio_dup_file_rmap_ptes() [2]. >>>>>> >>>>>> Note that the contpte series at [2] has a new patch in v3 (patch 2), which >>>>>> could >>>>>> benefit from folio_remove_rmap_ptes() or equivalent. My plan was to revive >>>>>> [1] >>>>>> on top of [2] once it is merged. >>>>>> >>>>>>> >>>>>>> There is some overlap with both series (and some other work, like >>>>>>> multi-size THP [3]), so that will need some coordination, and likely a >>>>>>> stepwise inclusion. >>>>>> >>>>>> Selfishly, I'd really like to get my stuff merged as soon as there is no >>>>>> technical reason not to. I'd prefer not to add this as a dependency if we can >>>>>> help it. >>>>> >>>>> It's easy to rework either series on top of each other. The mTHP series has >>>>> highest priority, >>>>> no question, that will go in first. >>>> >>>> Music to my ears! It would be great to either get a reviewed-by or feedback on >>>> why not, for the key 2 patches in that series (3 & 4) and also your opinion on >>>> whether we need to wait for compaction to land (see cover letter). It would be >>>> great to get this into linux-next ASAP IMHO. >>> >>> On it :) >>> >>>> >>>>> >>>>> Regarding the contpte, I think it needs more work. Especially, as raised, >>>>> to not >>>>> degrade >>>>> order-0 performance. Maybe we won't make the next merge window (and you >>>>> already >>>>> predicated >>>>> that in some cover letter :P ). Let's see. >>>> >>>> Yeah that's ok. I'll do the work to fix the order-0 perf. 
And also do the same >>>> for patch 2 in that series - would also be really helpful if you had a >>>> chance to >>>> look at patch 2 - its new for v3. >>> >>> I only skimmed over it, but it seems to go into the direction we'll need. >>> Keeping order-0 performance unharmed should have highest priority. Hopefully my >>> microbenchmarks are helpful. >> >> Yes absolutely - are you able to share them?? >> >>> >>>> >>>>> >>>>> But again, the conflicts are all trivial, so I'll happily rebase on top of >>>>> whatever is >>>>> in mm-unstable. Or move the relevant rework to the front so you can just carry >>>>> them/base on them. (the batched variants for dup do make the contpte code much >>>>> easier) >>>> >>>> So perhaps we should aim for mTHP, then this, then contpte last, benefiting >>>> from >>>> the batching. >>> >>> Yeah. And again, I don't care too much if I have to rebase on top of your work >>> if this here takes longer. It's all a fairly trivial conversion. >>> >>>>> >>>>> [...] >>>>> >>>>>>> >>>>>>> >>>>>>> New (extended) hugetlb interface that operate on entire folio: >>>>>>> * hugetlb_add_new_anon_rmap() -> Already existed >>>>>>> * hugetlb_add_anon_rmap() -> Already existed >>>>>>> * hugetlb_try_dup_anon_rmap() >>>>>>> * hugetlb_try_share_anon_rmap() >>>>>>> * hugetlb_add_file_rmap() >>>>>>> * hugetlb_remove_rmap() >>>>>>> >>>>>>> New "ordinary" interface for small folios / THP:: >>>>>>> * folio_add_new_anon_rmap() -> Already existed >>>>>>> * folio_add_anon_rmap_[pte|ptes|pmd]() >>>>>>> * folio_try_dup_anon_rmap_[pte|ptes|pmd]() >>>>>>> * folio_try_share_anon_rmap_[pte|pmd]() >>>>>>> * folio_add_file_rmap_[pte|ptes|pmd]() >>>>>>> * folio_dup_file_rmap_[pte|ptes|pmd]() >>>>>>> * folio_remove_rmap_[pte|ptes|pmd]() >>>>>> >>>>>> I'm not sure if there are official guidelines, but personally if we are >>>>>> reworking the API, I'd take the opportunity to move "rmap" to the front of >>>>>> the >>>>>> name, rather than having it burried in the middle as it is for 
some of these:
>>>>>>
>>>>>> rmap_hugetlb_*()
>>>>>>
>>>>>> rmap_folio_*()
>>>>>
>>>>> No strong opinion. But we might want slightly different names then. For
>>>>> example, it's "bio_add_folio" and not "bio_folio_add":
>>>>>
>>>>>
>>>>> rmap_add_new_anon_hugetlb()
>>>>> rmap_add_anon_hugetlb()
>>>>> ...
>>>>> rmap_remove_hugetlb()
>>>>>
>>>>>
>>>>> rmap_add_new_anon_folio()
>>>>> rmap_add_anon_folio_[pte|ptes|pmd]()
>>>>> ...
>>>>> rmap_dup_file_folio_[pte|ptes|pmd]()
>>>>> rmap_remove_folio_[pte|ptes|pmd]()
>>>>>
>>>>> Thoughts?
>>>>
>>>> Having now reviewed your series, I have a less strong opinion, perhaps it's
>>>> actually best with your original names; "folio" is actually the subject after
>>>> all; it's the thing being operated on.
>
> So far I've stuck with the original names used in this RFC. I'm testing a new
> series that is based on current mm/unstable (especially, mTHP) and contains all
> changes discussed here.
>
> If I don't hear anything else, I'll send that out as v1 on Monday.

Gets my vote!

>
> Thanks!
>