Message ID: 20231010142111.3997780-1-ryan.roberts@arm.com
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>, David Hildenbrand <david@redhat.com>, Matthew Wilcox <willy@infradead.org>, Huang Ying <ying.huang@intel.com>, Gao Xiang <xiang@kernel.org>, Yu Zhao <yuzhao@google.com>, Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
Date: Tue, 10 Oct 2023 15:21:09 +0100
Series: Swap-out small-sized THP without splitting
Message
Ryan Roberts
Oct. 10, 2023, 2:21 p.m. UTC
Hi All,

This is an RFC for a small series to add support for swapping out small-sized THP without needing to first split the large folio via __split_huge_page(). It closely follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements by allocating large folios for anonymous memory, where the large folio size is smaller than the traditional PMD-size. See [1].

In some circumstances I've observed a performance regression (see patch 2 for details), and this series is an attempt to fix the regression in advance of merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this approach is only employed when the swap is backed by a non-rotating block device (just as PMD-sized THP is supported today). However, I have a few questions on whether we should consider relaxing those requirements in certain circumstances:

1) block-backed vs file-backed
==============================

The code only attempts to allocate a contiguous set of entries if swap is backed by a block device (i.e. not file-backed). The original commit, f0eea189e8e9 ("mm, THP, swap: don't allocate huge cluster for file backed swap device"), stated "It's hard to write a whole transparent huge page (THP) to a file backed swap device", but didn't state why. Does this imply there is a size limit at which it becomes hard? And does that therefore imply that for "small enough" sizes we should now allow use with file-backed swap?

That original commit was subsequently fixed with commit 41663430588c ("mm, THP, swap: fix allocating cluster for swapfile by mistake"), which said the original commit was using the wrong flag to determine if it was a block device, and therefore in some cases was actually doing large allocations for a file-backed swap device, and this was causing file-system corruption. But that implies some sort of correctness issue to me, rather than the performance issue I inferred from the original commit.

If anyone can offer an explanation, that would be helpful in determining whether we should allow some large sizes for file-backed swap.

2) rotating vs non-rotating
===========================

I notice that the clustered approach is only used for non-rotating swap. That implies that for rotating media, we will always fail a large allocation and fall back to splitting THPs to single pages, which in turn implies that the regression I'm fixing here may still be present on rotating media. Or perhaps rotating disk is so slow that the cost of writing the data out dominates the cost of splitting?

I considered that the free swap entry search algorithm used in this case could be modified to look for (small) contiguous runs of entries; up to ~16 pages (order-4) could be covered by doing 2x 64-bit reads from the map instead of single-byte reads. I haven't looked into this idea in detail, but wonder if anybody thinks it is worth the effort? Or perhaps it would end up causing bad fragmentation.

Finally, on testing: I've run the mm selftests and see no regressions, but I don't think there is anything in there specifically aimed at swap? Are there any functional or performance tests that I should run? It would certainly be good to confirm I haven't regressed PMD-sized THP swap performance.

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/

Ryan Roberts (2):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: Swap-out small-sized THP without splitting

 include/linux/swap.h |  17 +++----
 mm/huge_memory.c     |   3 --
 mm/swapfile.c        | 105 ++++++++++++++++++++++---------------------
 mm/vmscan.c          |  10 +++--
 4 files changed, 66 insertions(+), 69 deletions(-)

--
2.25.1
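As a rough illustration of the search idea in (2) - this is not the actual swapfile.c code, and the function and variable names are made up - a free run of 2^order entries in the byte-per-entry swap map shows up as a run of zero bytes, which can be checked one 64-bit word at a time rather than byte by byte (the sketch assumes order >= 3, so runs are a whole number of 64-bit words, and order-4 needs exactly the 2x 64-bit reads mentioned above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: the real swap map is an unsigned char array where 0
 * means the entry is free. A run of 2^order free entries is a run of zero
 * bytes, so for order >= 3 it can be tested 8 entries at a time. Returns
 * the offset of the first naturally-aligned free run, or -1 if none. */
static long find_free_run(const unsigned char *map, size_t nr, int order)
{
    size_t run_len = (size_t)1 << order;   /* entries per run, >= 8 */

    for (size_t off = 0; off + run_len <= nr; off += run_len) {
        int free = 1;

        for (size_t i = 0; i < run_len; i += 8) {
            uint64_t w;

            memcpy(&w, map + off + i, 8);  /* one 64-bit read per 8 entries */
            if (w) {                       /* any non-zero byte => in use */
                free = 0;
                break;
            }
        }
        if (free)
            return (long)off;
    }
    return -1;
}
```

For order-4 (16 entries) the inner loop performs exactly two 64-bit reads per candidate position, matching the proposal; whether this survives fragmentation over time is the open question raised above.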
Comments
Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi All,
>
> This is an RFC for a small series to add support for swapping out small-sized
> THP without needing to first split the large folio via __split_huge_page().
>
> [...]
>
> 1) block-backed vs file-backed
> ==============================
>
> The code only attempts to allocate a contiguous set of entries if swap is backed
> by a block device (i.e. not file-backed).
>
> [...]
>
> If anyone can offer an explanation, that would be helpful in determining whether
> we should allow some large sizes for file-backed swap.

Swap uses 'swap extents' (swap_info_struct.swap_extent_root) to map from a swap offset to a storage block number. For block-backed swap, the mapping is purely linear, so you can use an arbitrarily large page size. But for file-backed swap, only PAGE_SIZE alignment is guaranteed.

> 2) rotating vs non-rotating
> ===========================
>
> I notice that the clustered approach is only used for non-rotating swap.
>
> [...]
>
> I haven't looked into this idea in detail, but wonder if anybody thinks it is
> worth the effort? Or perhaps it would end up causing bad fragmentation.

I doubt anybody will use rotating storage to back swap now.

> Finally, on testing: I've run the mm selftests and see no regressions, but I
> don't think there is anything in there specifically aimed at swap? Are there
> any functional or performance tests that I should run? It would certainly be
> good to confirm I haven't regressed PMD-sized THP swap performance.

I have used the swap sub test case of vm-scalability to test:

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

--
Best Regards,
Huang, Ying
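To illustrate Huang Ying's point with a hypothetical model (not the kernel's actual data structures or field names): a swap extent maps a run of swap page offsets to a run of on-disk blocks, and a multi-page folio can only be written as one contiguous I/O if its whole offset range lands inside a single extent. A block-backed swap device is one big linear extent; a swapfile may be fragmented into many small extents, so only single-page (PAGE_SIZE) alignment is guaranteed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a swap extent: swap page offsets
 * [start, start + nr) map to contiguous disk blocks from start_block. */
struct extent {
    unsigned long start, nr, start_block;
};

/* A folio of nr_pages at 'offset' can be written as one contiguous I/O
 * only if the whole range falls within a single extent. */
static int range_is_contiguous(const struct extent *ext, size_t n,
                               unsigned long offset, unsigned long nr_pages)
{
    for (size_t i = 0; i < n; i++) {
        if (offset >= ext[i].start &&
            offset + nr_pages <= ext[i].start + ext[i].nr)
            return 1;
    }
    return 0;
}
```

With a single linear extent (the block-device case) any offset and size passes; with a fragmented swapfile, a large-folio write can straddle an extent boundary, which is why writing the folio as one unit is only safe for block-backed swap.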
On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> [...]
>
>> If anyone can offer an explanation, that would be helpful in determining whether
>> we should allow some large sizes for file-backed swap.
>
> Swap uses 'swap extents' (swap_info_struct.swap_extent_root) to map from a
> swap offset to a storage block number. For block-backed swap, the mapping
> is purely linear, so you can use an arbitrarily large page size. But for
> file-backed swap, only PAGE_SIZE alignment is guaranteed.

Ahh, I see, so it's a correctness issue then. Thanks!

>> 2) rotating vs non-rotating
>> ===========================
>>
>> [...]
>>
>> I haven't looked into this idea in detail, but wonder if anybody thinks it is
>> worth the effort? Or perhaps it would end up causing bad fragmentation.
>
> I doubt anybody will use rotating storage to back swap now.

I'm often using a QEMU VM for testing with an Ubuntu install. The disk enumerates as rotating storage and the swap device is file-backed. But I guess the former issue, at least, is down to me setting up QEMU with the wrong options.

>> Finally, on testing: I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed at swap?
>>
>> [...]
>
> I have used the swap sub test case of vm-scalability to test:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

Great - I shall take a look!

> --
> Best Regards,
> Huang, Ying
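For reference, whether the kernel treats a given disk as rotating (which is what gates the clustered swap allocator) can be inspected and even overridden from userspace via sysfs; a quick sketch, where "sda" is just an example device name:

```shell
# 1 = rotational (clustered swap allocation disabled), 0 = non-rotating.
cat /sys/block/sda/queue/rotational

# The attribute is writable, so a wrongly-enumerated virtual disk can be
# forced to non-rotating before running swapon (requires root):
echo 0 > /sys/block/sda/queue/rotational
```

Note the override only affects devices that swapon hasn't already registered; re-running swapon after changing it picks up the new flag.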
On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> [...]
>
>> Finally, on testing: I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed at swap? Are there
>> any functional or performance tests that I should run? It would certainly be
>> good to confirm I haven't regressed PMD-sized THP swap performance.
>
> I have used the swap sub test case of vm-scalability to test:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

I ended up using `usemem`, which is the core of this test suite, but deviated from the pre-canned test case so that I could use anonymous memory and get numbers for small-sized THP (this is a very useful tool - thanks for pointing it out!).

I've run the tests on Ampere Altra, set up with a 35G block ram device as the swap device and from inside a memcg limited to 40G memory. I've then run `usemem` with 70 processes (each on its own core), each allocating and writing 1G of memory. I've repeated everything 5 times and taken the mean and stdev:

Mean Performance Improvement vs 4K/baseline

| alloc size | baseline            | remove-huge-flag    | swap-file-small-thp |
|            | v6.6-rc4+anonfolio  | + patch 1           | + patch 2           |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                0.0% |                2.3% |                9.1% |
| 64K THP    |              -44.1% |              -46.3% |               30.6% |
| 2M THP     |               56.0% |               54.2% |               60.1% |

Standard Deviation as Percentage of Mean

| alloc size | baseline            | remove-huge-flag    | swap-file-small-thp |
|            | v6.6-rc4+anonfolio  | + patch 1           | + patch 2           |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                3.4% |                7.1% |                1.7% |
| 64K THP    |                1.9% |                5.6% |                7.7% |
| 2M THP     |                1.9% |                2.1% |                3.2% |

I don't see any meaningful performance cost to removing the HUGE flag, so hopefully this gives us confidence to move forward with patch 1.

You can indeed see the performance regression in the baseline when THP is configured to allocate small-sized THP only (in this case 64K), and you can see that the regression is fixed by patch 2, which avoids splitting the THP and thus avoids the extra TLBIs. This correlates with what I saw in the kernel compilation workload.

Huang Ying: based on these results, do you still want me to pursue a per-cpu solution to avoid potential contention on the swap info lock? If so, I proposed in the thread against patch 2 to do this in the swap_slots layer rather than in swapfile.c directly (I'm not sure how your original proposal would actually work?). But based on these results, it's not obvious to me that there is a definite problem here, and it might be simpler to avoid the complexity.

Thanks,
Ryan

> --
> Best Regards,
> Huang, Ying
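For anyone wanting to reproduce a setup like the one described above, here is a rough sketch. All of the device sizes, paths, cgroup layout and `usemem` flags are assumptions reconstructed from the prose description, not taken from the actual test scripts, so treat it as a starting point and check against your own system (cgroup v2 and root privileges assumed):

```shell
# 35G ram-backed block device (brd) as the swap target; rd_size is in KiB.
modprobe brd rd_nr=1 rd_size=$((35 * 1024 * 1024))
mkswap /dev/ram0
swapon /dev/ram0

# memcg limited to 40G so that a 70G working set is forced to swap.
mkdir /sys/fs/cgroup/swaptest
echo $((40 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/swaptest/memory.max
echo $$ > /sys/fs/cgroup/swaptest/cgroup.procs

# 70 processes, each allocating and writing 1G of anonymous memory.
# usemem is built from vm-scalability; the exact flag spelling may differ
# between versions, so consult `usemem --help` on your checkout.
./usemem -n 70 $((1024 * 1024 * 1024))
```

Cleanup afterwards is `swapoff /dev/ram0` and `rmmod brd`; repeating the run 5 times and averaging, as above, helps smooth out swap-placement variance between runs.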