Message ID: 20230912162815.440749-1-zi.yan@sent.com
Series: Enable >0 order folio memory compaction
Message
Zi Yan
Sept. 12, 2023, 4:28 p.m. UTC
From: Zi Yan <ziy@nvidia.com>
Hi all,
This patchset enables >0 order folio memory compaction, which is one of
the prerequisites for large folio support[1]. It is on top of
mm-everything-2023-09-11-22-56.
Overview
===
To support >0 order folio compaction, the patchset changes how free pages used
for migration are kept during compaction. Free pages used to be split into
order-0 pages and post-allocation processed immediately (i.e., the PageBuddy
flag cleared, the page order stored in page->private zeroed, and the page
reference set to 1). Now all free pages are kept in a MAX_ORDER+1 array of
page lists based on their order, without post-allocation processing. When
migrate_pages() asks for a new page, one of the free pages, matched to the
requested page order, is then processed and given out.
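As a rough illustration of the mechanism (a sketch only; the field layout and
the exact shape of compaction_alloc() here are assumptions for exposition,
not the patch code itself):

/* One free page list per order, instead of a single order-0 list. */
struct compact_control {
	struct list_head freepages[MAX_ORDER + 1];
	/* ... other existing fields ... */
};

/* Hand out a free page of the same order as the source folio;
 * post-allocation processing is deferred to this point. */
static struct folio *compaction_alloc(struct folio *src, unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;
	int order = folio_order(src);
	struct page *page;

	if (list_empty(&cc->freepages[order]))
		return NULL;	/* or split a larger free page, see below */

	page = list_first_entry(&cc->freepages[order], struct page, lru);
	list_del(&page->lru);
	post_alloc_hook(page, order, __GFP_MOVABLE);
	return page_folio(page);
}

In this scheme, compaction_free() would correspondingly put an unused page
back on the list for its order rather than returning it to the buddy
allocator.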
Optimizations
===
1. Free page split is added to increase migration success rate in case
a source page does not have a matched free page in the free page lists.
Free page merge is possible but not implemented, since existing
PFN-based buddy page merge algorithm requires the identification of
buddy pages, but free pages kept for memory compaction cannot have
PageBuddy set to avoid confusing other PFN scanners.
2. Sorting source pages in ascending order before migration is added to
reduce free page splits (a sketch follows this list). Otherwise, high order
free pages might be prematurely split, causing undesired high order folio
migration failures.
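For optimization 2, the sorting step could look roughly like this (a sketch
only; as a review below notes, the effective sort is by descending folio
order, and cc->migratepages is assumed to be the existing list of isolated
source pages):

#include <linux/list_sort.h>

static int src_order_cmp(void *priv, const struct list_head *a,
			 const struct list_head *b)
{
	struct folio *fa = list_entry(a, struct folio, lru);
	struct folio *fb = list_entry(b, struct folio, lru);

	/* larger folios first, so high order free pages are consumed
	 * before any of them has to be split */
	return folio_order(fb) - folio_order(fa);
}

/* before calling migrate_pages(&cc->migratepages, ...): */
list_sort(NULL, &cc->migratepages, src_order_cmp);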
TODOs
===
1. Refactor free page post allocation and free page preparation code so
that compaction_alloc() and compaction_free() can call functions instead
of hard coding.
2. One possible optimization is to allow migrate_pages() to continue
even if get_new_folio() returns NULL. In general, that means there is
not enough memory. But in the >0 order folio compaction case, it means
there is no suitable free page at the source page order. It might be better
to skip that page and finish the rest of the migration to achieve a better
compaction result.
3. Another possible optimization is to enable free page merge. It is
possible that a to-be-migrated page causes a free page split and then fails
to migrate eventually. Without a free page merge function, we would lose a
high order free page. But a way of identifying free pages kept for memory
compaction is needed to reuse the existing PFN-based buddy page merge.
4. The implemented >0 order folio compaction algorithm is quite naive
and does not consider all possible situations. A better algorithm can
improve compaction success rate.
Feel free to give comments and ask questions.
Thanks.
[1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
Zi Yan (4):
mm/compaction: add support for >0 order folio memory compaction.
mm/compaction: optimize >0 order folio compaction with free page
split.
mm/compaction: optimize >0 order folio compaction by sorting source
pages.
mm/compaction: enable compacting >0 order folios.
mm/compaction.c | 205 +++++++++++++++++++++++++++++++++++++++---------
mm/internal.h | 7 +-
2 files changed, 176 insertions(+), 36 deletions(-)
Comments
On Tue, Sep 12, 2023 at 12:28:11PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> Feel free to give comments and ask questions.

How about testing? I'm looking with an eye towards creating a pathological
situation which can be automated for fragmentation and seeing how things go.
Mel Gorman's original artificial fragmentation, taken from his first patches
to help with fragmentation avoidance from 2018, suggested he tried [0]:

------ From 2018

a) Create an XFS filesystem

b) Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not created
in advance (fio parameter create_on_open=1) and fallocate is not used
(fallocate=none). With multiple IO issuers this creates a mix of slab and
page cache allocations over time. The total size of the files is 150% of
physical memory so that the slabs and page cache pages get mixed.

c) Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it took to
create the files. It'll fault back in old data and further interleave slab
and page cache allocations. As it's now low on memory due to step 2,
fragmentation occurs as pageblocks get stolen. While step 3 is still
running, start a process that tries to allocate 75% of memory as huge pages
with a number of threads. The number of threads is based on
(NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with
fio, any other threads, or forcing cross-NUMA scheduling. Note that the
test has not been used on a machine with less than 8 cores. The benchmark
records whether huge pages were allocated and what the fault latency was in
microseconds.

d) Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.

------- end of extract

These days we can probably do a bit more damage. There have been concerns
that LBS support (block size > page size) could worsen fragmentation; one of
the reasons is that any file created, regardless of its size, will require
at least the block size, and if using a 64k block size that means a 64k
allocation for each new file on that 64k block size filesystem, so clearly
you may run out of lower order allocations pretty quickly. You can also
create different large block filesystems too, one for 64k, another for 32k.

Although LBS is new and we're still ironing out the kinks, if you wanna give
it a go we've rebased the patches onto Linus' tree [1], and if you wanted to
ramp up fast you could use kdevops [2], which lets you pick that branch and
also a series of NVMe drives (by enabling
CONFIG_LIBVIRT_EXTRA_STORAGE_DRIVE_NVME) for large IO experimentation (by
enabling CONFIG_VAGRANT_ENABLE_LARGEIO).

Creating different filesystems with large block sizes (64k, 32k, 16k) on a
4k sector size drive (mkfs.xfs -f -b size=64k -s size=4k) should let you
easily do tons of crazy pathological things.

Are there other known recipes to help test this stuff?

How do we measure the success of your patches for fragmentation, exactly?

[0] https://lwn.net/Articles/770235/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev
[2] https://github.com/linux-kdevops/kdevops

Luis
On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
> Are there other known recipes to help test this stuff?

You know, it got me wondering... how fragmented a system's memory
might get by just running fstests, because, well, we already have
that automated in kdevops, and it also has LBS support for all the
different large block sizes on a 4k sector size. So if we just had a
way to "measure" or "quantify" memory fragmentation with a score,
we could just tally up how we did after 4 hours of testing for each
block size with a set amount of memory on the guest / target node /
cloud system.
Luis
On 9/20/23 18:16, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 05:55:51PM -0700, Luis Chamberlain wrote:
>> Are there other known recipes to help test this stuff?
>
> You know, it got me wondering... how fragmented a system's memory
> might get by just running fstests, because, well, we already have
> that automated in kdevops, and it also has LBS support for all the
> different large block sizes on a 4k sector size. So if we just had a
> way to "measure" or "quantify" memory fragmentation with a score,
> we could just tally up how we did after 4 hours of testing for each
> block size with a set amount of memory on the guest / target node /
> cloud system.
>
> Luis

I thought about it, and here is one possible way to quantify
fragmentation with just a single number. Take this with some
skepticism because it is a first draft sort of thing:

a) Let BLOCKS be the number of 4KB pages (or more generally, the number
of smallest sized objects allowed) in the area.

b) Let FRAGS be the number of free *or* allocated chunks (no need to
consider the size of each, as that is automatically taken into
consideration).

Then:

    fragmentation percentage = (FRAGS / BLOCKS) * 100%

This has some nice properties. For one thing, it's easy to calculate.
For another, it can discern between these cases:

Assume a 12-page area:

Case 1) 6 pages allocated unevenly:

    1 page allocated | 1 page free | 1 page allocated | 5 pages free | 4 pages allocated

    fragmentation = (5 FRAGS / 12 BLOCKS) * 100% = 41.7%

Case 2) 6 pages allocated evenly: every other page is allocated:

    fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%

thanks,
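A standalone userspace sketch of the metric (the bitmap representation of
the area is an assumption for illustration): FRAGS is counted as the number
of maximal runs of same-state pages, which reproduces both cases above.

#include <stdio.h>
#include <stdbool.h>

/* fragmentation percentage = (FRAGS / BLOCKS) * 100%, where FRAGS is
 * the number of free-or-allocated chunks in an allocation bitmap */
static double frag_pct(const bool *alloc, int blocks)
{
	int frags = 1;

	if (blocks <= 0)
		return 0.0;

	for (int i = 1; i < blocks; i++)
		if (alloc[i] != alloc[i - 1])
			frags++;	/* state flips: a new chunk starts */

	return 100.0 * frags / blocks;
}

int main(void)
{
	/* Case 1: 1 alloc | 1 free | 1 alloc | 5 free | 4 alloc */
	const bool case1[12] = {1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1};
	/* Case 2: every other page allocated */
	const bool case2[12] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0};

	printf("case1: %.1f%%\n", frag_pct(case1, 12));	/* 41.7% */
	printf("case2: %.1f%%\n", frag_pct(case2, 12));	/* 100.0% */
	return 0;
}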
On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
> I thought about it, and here is one possible way to quantify
> fragmentation with just a single number. Take this with some
> skepticism because it is a first draft sort of thing:
>
> [...]
>
> Case 2) 6 pages allocated evenly: every other page is allocated:
>
> fragmentation = (12 FRAGS / 12 BLOCKS) * 100% = 100%

Thanks! Will try this!

BTW stress-ng might also be a nice way to do other pathological things here.

Luis
On 20 Sep 2023, at 23:14, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 07:05:25PM -0700, John Hubbard wrote:
>> I thought about it, and here is one possible way to quantify
>> fragmentation with just a single number. [...]
>
> Thanks! Will try this!
>
> BTW stress-ng might also be a nice way to do other pathological things here.
>
> Luis

Thanks. These are all good performance tests and a good fragmentation
metric. I would like to get it working properly first. As I mentioned in
another email, there will be tons of exploration to do to improve >0 order
folio memory compaction with the consideration of:

1. the distribution of free pages,
2. the goal of compaction, e.g., to allocate a single order folio or to
reduce the overall fragmentation level,
3. the runtime cost of compaction, and more.

My patchset aims to provide a reasonably working compaction functionality.

In terms of correctness testing, what I have done locally is to:

1. have an XFS partition,
2. create files with various sizes from 4KB to 2MB,
3. mmap each of these files to use one folio at the file size,
4. get the physical addresses of these folios,
5. trigger global memory compaction via sysctl,
6. read the physical addresses of these folios again.

--
Best Regards,
Yan, Zi
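For steps 4-6 of that recipe, a userspace helper along these lines can be
used. This is a sketch, not code from the patchset: it reads the PFN out of
/proc/self/pagemap (which requires appropriate privileges on modern
kernels), and step 5 corresponds to "echo 1 > /proc/sys/vm/compact_memory".

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

/* Translate a virtual address to a PFN via /proc/self/pagemap.
 * Returns 0 on error or if the page is not present. */
static uint64_t vaddr_to_pfn(const void *vaddr)
{
	uint64_t entry = 0;
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t offset = ((uintptr_t)vaddr / pagesize) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry))
		entry = 0;
	close(fd);

	if (!(entry & (1ULL << 63)))	/* bit 63: page present */
		return 0;
	return entry & ((1ULL << 55) - 1);	/* bits 0-54: PFN */
}

Comparing each folio's PFNs before and after triggering compaction shows
which folios were migrated and whether their pages stayed physically
contiguous.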
Hi Zi,

On 12/09/2023 17:28, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisites for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.

I've taken a quick look at these and realize I'm not well equipped to
provide much in the way of meaningful review comments; all I can say is
thanks for putting this together, and yes, I think it will become even more
important for my work on anonymous large folios.

> 2. Sort source pages in ascending order before migration is added to
> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.

Not knowing much about how compaction actually works, naively I would
imagine that if you are just trying to free up a known amount of contiguous
physical space, then working through the pages in PFN order is more likely
to yield the result quicker? Unless all of the pages in the set must be
successfully migrated in order to free up the required amount of space...

Thanks,
Ryan
Hi, Zi,

Thanks for your patch!

Zi Yan <zi.yan@sent.com> writes:
> From: Zi Yan <ziy@nvidia.com>
>
> This patchset enables >0 order folio memory compaction, which is one of
> the prerequisites for large folio support[1]. It is on top of
> mm-everything-2023-09-11-22-56.
>
> Overview
> ===
>
> To support >0 order folio compaction, the patchset changes how free pages
> used for migration are kept during compaction.

migrate_pages() can split a large folio on allocation failure. So the
minimal implementation could be

- allow migrating large folios in compaction
- return -ENOMEM for order > 0 in compaction_alloc()

The performance may not be desirable. But that may be a baseline for
further optimization. And, if we can measure the performance for each step
of optimization, that will be even better.

> 2. Sort source pages in ascending order before migration is added to

Trivial.

s/ascending/descending/

> reduce free page split. Otherwise, high order free pages might be
> prematurely split, causing undesired high order folio migration failures.

[...]

> 2. One possible optimization is to allow migrate_pages() to continue
> even if get_new_folio() returns NULL. In general, that means there is
> not enough memory. But in the >0 order folio compaction case, it means
> there is no suitable free page at the source page order. It might be
> better to skip that page and finish the rest of the migration to achieve
> a better compaction result.

We can split the source folio if get_new_folio() returns NULL. So, do we
really need this?

In general, we may reconsider all further optimizations given splitting is
available already.

--
Best Regards,
Huang, Ying
On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
> Hi Zi,
>
> I've taken a quick look at these and realize I'm not well equipped to
> provide much in the way of meaningful review comments; all I can say is
> thanks for putting this together [...]
>
>> 2. Sort source pages in ascending order before migration is added to
>> reduce free page split. Otherwise, high order free pages might be
>> prematurely split, causing undesired high order folio migration failures.
>
> Not knowing much about how compaction actually works, naively I would
> imagine that if you are just trying to free up a known amount of contiguous
> physical space, then working through the pages in PFN order is more likely
> to yield the result quicker? Unless all of the pages in the set must be
> successfully migrated in order to free up the required amount of space...

During compaction, pages are not freed, since that is the job of page
reclaim. The goal of compaction is to get a high order free page without
freeing existing pages, to avoid potentially high cost IO operations. If
compaction does not work, page reclaim would free pages to get us there
(and potentially a follow-up compaction). So during compaction, pages are
either migrated or stay where they are.

BTW, compaction works by scanning in-use pages from lower PFN to higher
PFN, and free pages from higher PFN to lower PFN, until the two scanners
meet in the middle.

--
Best Regards,
Yan, Zi
On 9 Oct 2023, at 3:12, Huang, Ying wrote:
> migrate_pages() can split a large folio on allocation failure. So the
> minimal implementation could be
>
> - allow migrating large folios in compaction
> - return -ENOMEM for order > 0 in compaction_alloc()
>
> The performance may not be desirable. But that may be a baseline for
> further optimization.

I would imagine it might cause a regression, since compaction might
gradually split high order folios in the system. But I can move Patch 4
first to make this the baseline and see how system performance changes.

> And, if we can measure the performance for each step of optimization,
> that will be even better.

Do you have any benchmark in mind for the performance tests?
vm-scalability?

>> 2. One possible optimization is to allow migrate_pages() to continue
>> even if get_new_folio() returns NULL. [...] It might be better to skip
>> that page and finish the rest of the migration to achieve a better
>> compaction result.
>
> We can split the source folio if get_new_folio() returns NULL. So, do we
> really need this?

It depends. The situation it can benefit is when the system is about to
allocate a high order free page and triggers a compaction: it may be
possible to get the high order free page by migrating a bunch of base pages
instead of splitting an existing high order folio.

> In general, we may reconsider all further optimizations given splitting is
> available already.

In my mind, split should be avoided as much as possible. But it really
depends on the actual situation, e.g., how much effort and cost the
compaction wants to pay to get memory defragmented. If the system really
wants to get a high order free page at any cost, split can be used without
any issue. But applications might lose performance when existing large
folios are split just to form a new one.

Like I said in the email, there are tons of optimizations and policies for
us to explore. We can start with the bare minimum support (if no
performance regression is observed, we can even start with splitting all
high order folios like you suggested) and add optimizations one by one.

--
Best Regards,
Yan, Zi
On 09/10/2023 14:24, Zi Yan wrote:
> On 2 Oct 2023, at 8:32, Ryan Roberts wrote:
>> Not knowing much about how compaction actually works, naively I would
>> imagine that if you are just trying to free up a known amount of
>> contiguous physical space, then working through the pages in PFN order is
>> more likely to yield the result quicker? [...]
>
> During compaction, pages are not freed, since that is the job of page
> reclaim. [...]

Sorry yes - my fault for using sloppy language. When I said "free up a
known amount of contiguous physical space", I really meant "move pages in
order to recover an amount of contiguous physical space". But I still think
the rest of what I said applies; wouldn't you be more likely to reach your
goal quicker if you sort by PFN?
(resent as plain text)

On 9 Oct 2023, at 10:10, Ryan Roberts wrote:
> Sorry yes - my fault for using sloppy language. When I said "free up a
> known amount of contiguous physical space", I really meant "move pages in
> order to recover an amount of contiguous physical space". But I still
> think the rest of what I said applies; wouldn't you be more likely to
> reach your goal quicker if you sort by PFN?

Not always. Suppose the in-use folios on the left are order-2, order-2,
order-4 (all contiguous in one pageblock) and the free pages on the right
are order-4 (pageblock N), order-2, order-2 (pageblock N-1), not a single
order-8, since there are in-use folios in the middle. Going in PFN order
will not get you an order-8 free page, since the first order-4 free page
will be split into two order-2 pages for the first two order-2 in-use
folios. But if you migrate in descending order of in-use page orders, you
can get an order-8 free page at the end.

The patchset minimizes free page splits to avoid the situation described
above, since once a high order free page is split, the opportunity to
migrate a high order in-use folio into it is gone and hardly recoverable.

Best Regards,
Yan, Zi
Something went wrong with my mailbox. Sorry if you received duplicated mail.

Zi Yan <ziy@nvidia.com> writes:
>> migrate_pages() can split a large folio on allocation failure. So the
>> minimal implementation could be [...]
>
> I would imagine it might cause a regression, since compaction might
> gradually split high order folios in the system.

I may not call it a pure regression, since large folios can be migrated
during compaction with that, but it's possible that this hurts performance.
Anyway, this can be a not-so-good minimal baseline.

> But I can move Patch 4 first to make this the baseline and see how system
> performance changes.

Thanks!

> Do you have any benchmark in mind for the performance tests?
> vm-scalability?

I remember Mel Gorman has done some tests for defragmentation before. But
that's for order-0 pages.

>> We can split the source folio if get_new_folio() returns NULL. So, do we
>> really need this?
>
> It depends. The situation it can benefit is when the system is about to
> allocate a high order free page and triggers a compaction [...]
>
> In my mind, split should be avoided as much as possible.

If so, should we use "nosplit" logic in migrate_pages_batch() in some
situations?

> But it really depends on the actual situation, e.g., how much effort and
> cost the compaction wants to pay to get memory defragmented. [...] But
> applications might lose performance when existing large folios are split
> just to form a new one.

Is it possible that splitting is desirable in some situations? For example,
allocating some large DMA buffers at the cost of large anonymous folios?

> Like I said in the email, there are tons of optimizations and policies for
> us to explore. We can start with the bare minimum support [...] and add
> optimizations one by one.

Sounds good to me! Thanks!

--
Best Regards,
Huang, Ying
On 09/10/2023 16:52, Zi Yan wrote:
> (resent as plain text)
>
> Not always. Suppose the in-use folios on the left are order-2, order-2,
> order-4 (all contiguous in one pageblock) and the free pages on the right
> are order-4 (pageblock N), order-2, order-2 (pageblock N-1) [...] Going in
> PFN order will not get you an order-8 free page, since the first order-4
> free page will be split into two order-2 pages for the first two order-2
> in-use folios. But if you migrate in descending order of in-use page
> orders, you can get an order-8 free page at the end.
>
> The patchset minimizes free page splits to avoid the situation described
> above, since once a high order free page is split, the opportunity to
> migrate a high order in-use folio into it is gone and hardly recoverable.

OK I get it now - thanks!
On 10 Oct 2023, at 2:08, Huang, Ying wrote:
> I may not call it a pure regression, since large folios can be migrated
> during compaction with that, but it's possible that this hurts
> performance. Anyway, this can be a not-so-good minimal baseline.
>
> [...]
>
> I remember Mel Gorman has done some tests for defragmentation before. But
> that's for order-0 pages.

OK, I will try to find that.

> If so, should we use "nosplit" logic in migrate_pages_batch() in some
> situations?

A possible future optimization.

> Is it possible that splitting is desirable in some situations? For
> example, allocating some large DMA buffers at the cost of large anonymous
> folios?

Sure. There are definitely cases where split is better than non-split. But
let's leave that until large anonymous folios are deployed.

> Sounds good to me! Thanks!

--
Best Regards,
Yan, Zi