Message ID | 20231019110543.3284654-1-ryan.roberts@arm.com |
State | New |
Headers | From: Ryan Roberts <ryan.roberts@arm.com>; To: Seth Jennings <sjenning@redhat.com>, Dan Streetman <ddstreet@ieee.org>, Vitaly Wool <vitaly.wool@konsulko.com>, Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, Domenico Cerasuolo <cerasuolodomenico@gmail.com>; Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org; Date: Thu, 19 Oct 2023 12:05:43 +0100
Series | [RFC,v1] mm: zswap: Store large folios without splitting
Commit Message
Ryan Roberts
Oct. 19, 2023, 11:05 a.m. UTC
Previously, zswap would refuse to store any folio bigger than order-0, so
all such folios were sent directly to the swap file. This is only a minor
inconvenience today, since swap currently supports just order-0 and
PMD-sized THP; but with the pending introduction of "small-sized THP",
and the corresponding swapfile changes to support folios of any order,
large folios will become far more prevalent, and without this change
zswap would become unusable. Independently of the "small-sized THP"
feature, this change also makes it possible to store existing PMD-sized
THPs in zswap.
Modify zswap_store() to allow storing large folios. The function is
split into two parts: zswap_store() performs the per-folio operations
(checking that there is enough space, etc.), then calls a new helper,
zswap_store_page(), for each page in the folio; each page is stored as
its own entry in the zswap pool. (These entries continue to be loaded
back individually, as single pages.) If the store fails for any single
page, all of the folio's previously stored pages are invalidated.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
I've tested this on arm64 (M2) with zswap enabled, running
vm-scalability's `usemem` across multiple cores from within a
memory-constrained memcg to force high volumes of swap. I've also run the
mm selftests and observed no regressions (although there is nothing
[z]swap specific there) - does zswap have any specific tests I should run?
This is based on mm-stable, since mm-unstable contains a zswap patch
known to be buggy [1]. I thought it would be best to get comments on the
shape, then do the rebase after that patch has been fixed.
For context, small-sized THP is being discussed here [2], and I'm
working on changes to swapfile to support non-PMD-sized large folios
here [3].
[1] https://lore.kernel.org/linux-mm/21606fe5-fb9b-4d37-98ab-38c96819893b@arm.com/
[2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
Thanks,
Ryan
mm/zswap.c | 155 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 98 insertions(+), 57 deletions(-)
base-commit: 158978945f3173b8c1a88f8c5684a629736a57ac
--
2.25.1
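In outline, the flow the commit message describes amounts to the condensed sketch below: the per-folio checks run once, then each page is handed to zswap_store_page(), and a failure part-way through unwinds the entries already stored. It is illustrative only - reference puts, duplicate removal, the shrink path and statistics are elided - and the real code is in the diff quoted in the replies that follow.

/*
 * Condensed sketch of the flow this patch introduces; all error paths,
 * reference puts, duplicate removal, shrinking and statistics are elided
 * (see the full diff quoted in the replies below for the real thing).
 */
bool zswap_store(struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);
	pgoff_t offset = swp_offset(folio->swap);
	struct zswap_tree *tree = zswap_trees[swp_type(folio->swap)];
	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
	struct zswap_pool *pool = zswap_pool_current_get();
	long i;

	/* per-folio checks (zswap enabled, cgroup and pool limits) go here */

	/* each page of the folio becomes its own entry at offset + i */
	for (i = 0; i < nr_pages; i++) {
		if (!zswap_store_page(folio, i, objcg, pool))
			goto unwind;
	}
	return true;

unwind:
	/* a page failed to store: invalidate the entries already stored */
	spin_lock(&tree->lock);
	for (i--; i >= 0; i--) {
		struct zswap_entry *entry =
			zswap_rb_search(&tree->rbroot, offset + i);

		if (entry)
			zswap_invalidate_entry(tree, entry);
	}
	spin_unlock(&tree->lock);
	return false;
}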
Comments
Just a polite nudge on this if anyone has any feedback? Alternatively, happy to repost against -rc1 if that's preferred? On 19/10/2023 12:05, Ryan Roberts wrote: > Previously zswap would refuse to store any folio bigger than order-0, > and therefore all of those folios would be sent directly to the swap > file. This is a minor inconvenience since swap can currently only > support order-0 and PMD-sized THP, but with the pending introduction of > "small-sized THP", and corresponding changes to swapfile to support any > order of folio, these large folios will become more prevalent and > without this zswap change, zswap will become unusable. Independently of > the "small-sized THP" feature, this change makes it possible to store > existing PMD-sized THPs in zswap. > > Modify zswap_store() to allow storing large folios. The function is > split into 2 parts; zswap_store() does all the per-folio operations > (i.e. checking there is enough space, etc). Then it calls a new helper, > zswap_store_page(), for each page in the folio, which are stored as > their own entries in the zswap pool. (These entries continue to be > loaded back individually as single pages). If a store fails for any > single page, then all previously successfully stored folio pages are > invalidated. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > I've tested this on arm64 (m2) with zswap enabled, and running > vm-scalability's `usemem` across multiple cores from within a > memory-constrained memcg to force high volumes of swap. I've also run mm > selftests and observe no regressions (although there is nothing [z]swap > specific there) - does zswap have any specific tests I should run? > > This is based on mm-stable, since mm-unstable contains a zswap patch > known to be buggy [1]. I thought it would be best to get comments on the > shape, then do the rebase after that patch has been fixed. > > For context, small-sized THP is being discussed here [2], and I'm > working on changes to swapfile to support non-PMD-sized large folios > here [3]. 
> > [1] https://lore.kernel.org/linux-mm/21606fe5-fb9b-4d37-98ab-38c96819893b@arm.com/ > [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ > [3] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ > > Thanks, > Ryan > > > mm/zswap.c | 155 +++++++++++++++++++++++++++++++++-------------------- > 1 file changed, 98 insertions(+), 57 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 37d2b1cb2ecb..51cbfc4e1ef8 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1188,18 +1188,17 @@ static void zswap_fill_page(void *ptr, unsigned long value) > memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); > } > > -bool zswap_store(struct folio *folio) > +static bool zswap_store_page(struct folio *folio, long index, > + struct obj_cgroup *objcg, struct zswap_pool *pool) > { > swp_entry_t swp = folio->swap; > int type = swp_type(swp); > - pgoff_t offset = swp_offset(swp); > - struct page *page = &folio->page; > + pgoff_t offset = swp_offset(swp) + index; > + struct page *page = folio_page(folio, index); > struct zswap_tree *tree = zswap_trees[type]; > struct zswap_entry *entry, *dupentry; > struct scatterlist input, output; > struct crypto_acomp_ctx *acomp_ctx; > - struct obj_cgroup *objcg = NULL; > - struct zswap_pool *pool; > struct zpool *zpool; > unsigned int dlen = PAGE_SIZE; > unsigned long handle, value; > @@ -1208,51 +1207,11 @@ bool zswap_store(struct folio *folio) > gfp_t gfp; > int ret; > > - VM_WARN_ON_ONCE(!folio_test_locked(folio)); > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > - > - /* Large folios aren't supported */ > - if (folio_test_large(folio)) > - return false; > - > - if (!zswap_enabled || !tree) > + /* entry keeps the references if successfully stored. */ > + if (!zswap_pool_get(pool)) > return false; > - > - /* > - * If this is a duplicate, it must be removed before attempting to store > - * it, otherwise, if the store fails the old page won't be removed from > - * the tree, and it might be written back overriding the new data. > - */ > - spin_lock(&tree->lock); > - dupentry = zswap_rb_search(&tree->rbroot, offset); > - if (dupentry) { > - zswap_duplicate_entry++; > - zswap_invalidate_entry(tree, dupentry); > - } > - spin_unlock(&tree->lock); > - > - /* > - * XXX: zswap reclaim does not work with cgroups yet. Without a > - * cgroup-aware entry LRU, we will push out entries system-wide based on > - * local cgroup limits. 
> - */ > - objcg = get_obj_cgroup_from_folio(folio); > - if (objcg && !obj_cgroup_may_zswap(objcg)) > - goto reject; > - > - /* reclaim space if needed */ > - if (zswap_is_full()) { > - zswap_pool_limit_hit++; > - zswap_pool_reached_full = true; > - goto shrink; > - } > - > - if (zswap_pool_reached_full) { > - if (!zswap_can_accept()) > - goto shrink; > - else > - zswap_pool_reached_full = false; > - } > + if (objcg) > + obj_cgroup_get(objcg); > > /* allocate entry */ > entry = zswap_entry_cache_alloc(GFP_KERNEL); > @@ -1260,6 +1219,8 @@ bool zswap_store(struct folio *folio) > zswap_reject_kmemcache_fail++; > goto reject; > } > + entry->objcg = objcg; > + entry->pool = pool; > > if (zswap_same_filled_pages_enabled) { > src = kmap_atomic(page); > @@ -1277,11 +1238,6 @@ bool zswap_store(struct folio *folio) > if (!zswap_non_same_filled_pages_enabled) > goto freepage; > > - /* if entry is successfully added, it keeps the reference */ > - entry->pool = zswap_pool_current_get(); > - if (!entry->pool) > - goto freepage; > - > /* compress */ > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > @@ -1337,7 +1293,6 @@ bool zswap_store(struct folio *folio) > entry->length = dlen; > > insert_entry: > - entry->objcg = objcg; > if (objcg) { > obj_cgroup_charge_zswap(objcg, entry->length); > /* Account before objcg ref is moved to tree */ > @@ -1373,19 +1328,105 @@ bool zswap_store(struct folio *folio) > > put_dstmem: > mutex_unlock(acomp_ctx->mutex); > - zswap_pool_put(entry->pool); > freepage: > zswap_entry_cache_free(entry); > reject: > if (objcg) > obj_cgroup_put(objcg); > + zswap_pool_put(pool); > return false; > +} > > +bool zswap_store(struct folio *folio) > +{ > + long nr_pages = folio_nr_pages(folio); > + swp_entry_t swp = folio->swap; > + int type = swp_type(swp); > + pgoff_t offset = swp_offset(swp); > + struct zswap_tree *tree = zswap_trees[type]; > + struct zswap_entry *entry; > + struct obj_cgroup *objcg = NULL; > + struct zswap_pool *pool; > + bool ret = false; > + long i; > + > + VM_WARN_ON_ONCE(!folio_test_locked(folio)); > + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > + > + if (!zswap_enabled || !tree) > + return false; > + > + /* > + * If this is a duplicate, it must be removed before attempting to store > + * it, otherwise, if the store fails the old page won't be removed from > + * the tree, and it might be written back overriding the new data. > + */ > + spin_lock(&tree->lock); > + for (i = 0; i < nr_pages; i++) { > + entry = zswap_rb_search(&tree->rbroot, offset + i); > + if (entry) { > + zswap_duplicate_entry++; > + zswap_invalidate_entry(tree, entry); > + } > + } > + spin_unlock(&tree->lock); > + > + /* > + * XXX: zswap reclaim does not work with cgroups yet. Without a > + * cgroup-aware entry LRU, we will push out entries system-wide based on > + * local cgroup limits. > + */ > + objcg = get_obj_cgroup_from_folio(folio); > + if (objcg && !obj_cgroup_may_zswap(objcg)) > + goto put_objcg; > + > + /* reclaim space if needed */ > + if (zswap_is_full()) { > + zswap_pool_limit_hit++; > + zswap_pool_reached_full = true; > + goto shrink; > + } > + > + if (zswap_pool_reached_full) { > + if (!zswap_can_accept()) > + goto shrink; > + else > + zswap_pool_reached_full = false; > + } > + > + pool = zswap_pool_current_get(); > + if (!pool) > + goto put_objcg; > + > + /* > + * Store each page of the folio as a separate entry. If we fail to store > + * a page, unwind by removing all the previous pages we stored. 
> + */ > + for (i = 0; i < nr_pages; i++) { > + if (!zswap_store_page(folio, i, objcg, pool)) { > + spin_lock(&tree->lock); > + for (i--; i >= 0; i--) { > + entry = zswap_rb_search(&tree->rbroot, offset + i); > + if (entry) > + zswap_invalidate_entry(tree, entry); > + } > + spin_unlock(&tree->lock); > + goto put_pool; > + } > + } > + > + ret = true; > +put_pool: > + zswap_pool_put(pool); > +put_objcg: > + if (objcg) > + obj_cgroup_put(objcg); > + return ret; > shrink: > pool = zswap_pool_last_get(); > if (pool && !queue_work(shrink_wq, &pool->shrink_work)) > zswap_pool_put(pool); > - goto reject; > + goto put_objcg; > } > > bool zswap_load(struct folio *folio) > > base-commit: 158978945f3173b8c1a88f8c5684a629736a57ac > -- > 2.25.1 >
On Thu, Oct 19, 2023 at 4:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Previously zswap would refuse to store any folio bigger than order-0, > and therefore all of those folios would be sent directly to the swap > file. This is a minor inconvenience since swap can currently only > support order-0 and PMD-sized THP, but with the pending introduction of > "small-sized THP", and corresponding changes to swapfile to support any > order of folio, these large folios will become more prevalent and > without this zswap change, zswap will become unusable. Independently of > the "small-sized THP" feature, this change makes it possible to store > existing PMD-sized THPs in zswap. > > Modify zswap_store() to allow storing large folios. The function is > split into 2 parts; zswap_store() does all the per-folio operations > (i.e. checking there is enough space, etc). Then it calls a new helper, > zswap_store_page(), for each page in the folio, which are stored as > their own entries in the zswap pool. (These entries continue to be > loaded back individually as single pages). If a store fails for any > single page, then all previously successfully stored folio pages are > invalidated. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > I've tested this on arm64 (m2) with zswap enabled, and running > vm-scalability's `usemem` across multiple cores from within a > memory-constrained memcg to force high volumes of swap. I've also run mm > selftests and observe no regressions (although there is nothing [z]swap > specific there) - does zswap have any specific tests I should run? There is a zswap kselftest in the cgroup suite: https://lore.kernel.org/all/20230621153548.428093-1-cerasuolodomenico@gmail.com/ Regardless, I feel like this kind of change is probably better to be tested via stress tests anyway - allocating a bunch of anon memory with a certain pattern, making sure they're not corrupted after being zswapped out etc. > > This is based on mm-stable, since mm-unstable contains a zswap patch > known to be buggy [1]. I thought it would be best to get comments on the > shape, then do the rebase after that patch has been fixed. > > For context, small-sized THP is being discussed here [2], and I'm > working on changes to swapfile to support non-PMD-sized large folios > here [3]. 
> > [1] https://lore.kernel.org/linux-mm/21606fe5-fb9b-4d37-98ab-38c96819893b@arm.com/ > [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ > [3] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ > > Thanks, > Ryan > > > mm/zswap.c | 155 +++++++++++++++++++++++++++++++++-------------------- > 1 file changed, 98 insertions(+), 57 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 37d2b1cb2ecb..51cbfc4e1ef8 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1188,18 +1188,17 @@ static void zswap_fill_page(void *ptr, unsigned long value) > memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); > } > > -bool zswap_store(struct folio *folio) > +static bool zswap_store_page(struct folio *folio, long index, > + struct obj_cgroup *objcg, struct zswap_pool *pool) > { > swp_entry_t swp = folio->swap; > int type = swp_type(swp); > - pgoff_t offset = swp_offset(swp); > - struct page *page = &folio->page; > + pgoff_t offset = swp_offset(swp) + index; > + struct page *page = folio_page(folio, index); > struct zswap_tree *tree = zswap_trees[type]; > struct zswap_entry *entry, *dupentry; > struct scatterlist input, output; > struct crypto_acomp_ctx *acomp_ctx; > - struct obj_cgroup *objcg = NULL; > - struct zswap_pool *pool; > struct zpool *zpool; > unsigned int dlen = PAGE_SIZE; > unsigned long handle, value; > @@ -1208,51 +1207,11 @@ bool zswap_store(struct folio *folio) > gfp_t gfp; > int ret; > > - VM_WARN_ON_ONCE(!folio_test_locked(folio)); > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > - > - /* Large folios aren't supported */ > - if (folio_test_large(folio)) > - return false; > - > - if (!zswap_enabled || !tree) > + /* entry keeps the references if successfully stored. */ > + if (!zswap_pool_get(pool)) > return false; > - > - /* > - * If this is a duplicate, it must be removed before attempting to store > - * it, otherwise, if the store fails the old page won't be removed from > - * the tree, and it might be written back overriding the new data. > - */ > - spin_lock(&tree->lock); > - dupentry = zswap_rb_search(&tree->rbroot, offset); > - if (dupentry) { > - zswap_duplicate_entry++; > - zswap_invalidate_entry(tree, dupentry); > - } > - spin_unlock(&tree->lock); > - > - /* > - * XXX: zswap reclaim does not work with cgroups yet. Without a > - * cgroup-aware entry LRU, we will push out entries system-wide based on > - * local cgroup limits. 
> - */ > - objcg = get_obj_cgroup_from_folio(folio); > - if (objcg && !obj_cgroup_may_zswap(objcg)) > - goto reject; > - > - /* reclaim space if needed */ > - if (zswap_is_full()) { > - zswap_pool_limit_hit++; > - zswap_pool_reached_full = true; > - goto shrink; > - } > - > - if (zswap_pool_reached_full) { > - if (!zswap_can_accept()) > - goto shrink; > - else > - zswap_pool_reached_full = false; > - } > + if (objcg) > + obj_cgroup_get(objcg); > > /* allocate entry */ > entry = zswap_entry_cache_alloc(GFP_KERNEL); > @@ -1260,6 +1219,8 @@ bool zswap_store(struct folio *folio) > zswap_reject_kmemcache_fail++; > goto reject; > } > + entry->objcg = objcg; > + entry->pool = pool; > > if (zswap_same_filled_pages_enabled) { > src = kmap_atomic(page); > @@ -1277,11 +1238,6 @@ bool zswap_store(struct folio *folio) > if (!zswap_non_same_filled_pages_enabled) > goto freepage; > > - /* if entry is successfully added, it keeps the reference */ > - entry->pool = zswap_pool_current_get(); > - if (!entry->pool) > - goto freepage; > - > /* compress */ > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > @@ -1337,7 +1293,6 @@ bool zswap_store(struct folio *folio) > entry->length = dlen; > > insert_entry: > - entry->objcg = objcg; > if (objcg) { > obj_cgroup_charge_zswap(objcg, entry->length); > /* Account before objcg ref is moved to tree */ > @@ -1373,19 +1328,105 @@ bool zswap_store(struct folio *folio) > > put_dstmem: > mutex_unlock(acomp_ctx->mutex); > - zswap_pool_put(entry->pool); > freepage: > zswap_entry_cache_free(entry); > reject: > if (objcg) > obj_cgroup_put(objcg); > + zswap_pool_put(pool); > return false; > +} > > +bool zswap_store(struct folio *folio) > +{ > + long nr_pages = folio_nr_pages(folio); > + swp_entry_t swp = folio->swap; > + int type = swp_type(swp); > + pgoff_t offset = swp_offset(swp); > + struct zswap_tree *tree = zswap_trees[type]; > + struct zswap_entry *entry; > + struct obj_cgroup *objcg = NULL; > + struct zswap_pool *pool; > + bool ret = false; > + long i; > + > + VM_WARN_ON_ONCE(!folio_test_locked(folio)); > + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > + > + if (!zswap_enabled || !tree) > + return false; > + > + /* > + * If this is a duplicate, it must be removed before attempting to store > + * it, otherwise, if the store fails the old page won't be removed from > + * the tree, and it might be written back overriding the new data. > + */ > + spin_lock(&tree->lock); > + for (i = 0; i < nr_pages; i++) { > + entry = zswap_rb_search(&tree->rbroot, offset + i); > + if (entry) { > + zswap_duplicate_entry++; > + zswap_invalidate_entry(tree, entry); > + } > + } > + spin_unlock(&tree->lock); > + > + /* > + * XXX: zswap reclaim does not work with cgroups yet. Without a > + * cgroup-aware entry LRU, we will push out entries system-wide based on > + * local cgroup limits. > + */ > + objcg = get_obj_cgroup_from_folio(folio); > + if (objcg && !obj_cgroup_may_zswap(objcg)) > + goto put_objcg; Hmm would this make more sense to check these limits at a higher order page level or at a backing page (4 KB) level? What if the cgroup still has some space before the new folio comes in, but the new folio would drive the pool size beyond the limit? Ditto for global zswap pool limit. Previously, the object size is capped by the size of a page (since we don't accept bigger size pages into zswap). If zswap limit is exceded, it will not be exceeded by more than 4 KB. No big deal. But I'm not sure this will be OK as we move to bigger and bigger sizes for the pages... 
If we do decide to check the cgroup's zswap limit for each backing page, I'm not sure how the reclaim logic (which will be introduced in the cgroup-aware zswap LRU patch) will interact with this though. > + > + /* reclaim space if needed */ > + if (zswap_is_full()) { > + zswap_pool_limit_hit++; > + zswap_pool_reached_full = true; > + goto shrink; > + } > + > + if (zswap_pool_reached_full) { > + if (!zswap_can_accept()) > + goto shrink; > + else > + zswap_pool_reached_full = false; > + } > + > + pool = zswap_pool_current_get(); > + if (!pool) > + goto put_objcg; > + > + /* > + * Store each page of the folio as a separate entry. If we fail to store > + * a page, unwind by removing all the previous pages we stored. > + */ > + for (i = 0; i < nr_pages; i++) { > + if (!zswap_store_page(folio, i, objcg, pool)) { > + spin_lock(&tree->lock); > + for (i--; i >= 0; i--) { > + entry = zswap_rb_search(&tree->rbroot, offset + i); > + if (entry) > + zswap_invalidate_entry(tree, entry); > + } > + spin_unlock(&tree->lock); > + goto put_pool; > + } > + } > + > + ret = true; > +put_pool: > + zswap_pool_put(pool); > +put_objcg: > + if (objcg) > + obj_cgroup_put(objcg); > + return ret; > shrink: > pool = zswap_pool_last_get(); > if (pool && !queue_work(shrink_wq, &pool->shrink_work)) > zswap_pool_put(pool); > - goto reject; > + goto put_objcg; > } > > bool zswap_load(struct folio *folio) > > base-commit: 158978945f3173b8c1a88f8c5684a629736a57ac > -- > 2.25.1 > > I don't see anything else that is obviously wrong with this. Seems straightforward to me. But I'm not too super familiar with THP swapping logic either :) So maybe someone with exposure to both should take a look too. And would it make sense to introduce a gate that guard this so that users can opt-in/opt-out of this new feature, at least as we experiment more with it to get more data?
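The kind of pattern check described here can be approximated from userspace with a harness along the lines of the sketch below: fill anonymous memory with a deterministic pattern, hint reclaim with MADV_PAGEOUT so the pages go out through [z]swap, then fault them back in and verify. This is only an illustration (not part of the patch or of any existing selftest); it assumes a kernel and libc recent enough to expose MADV_PAGEOUT, and it needs swap (and zswap) enabled so the pages actually have somewhere to go.

/*
 * Illustrative userspace harness only -- not from the patch or the
 * kselftests. Fill anonymous memory with a deterministic pattern, hint
 * reclaim with MADV_PAGEOUT so the pages go out through [z]swap, then
 * fault everything back in and verify the pattern survived.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define NR_BYTES (256UL << 20)		/* 256 MiB of anonymous memory */

int main(void)
{
	unsigned char *buf = mmap(NULL, NR_BYTES, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (i = 0; i < NR_BYTES; i++)
		buf[i] = (unsigned char)(i * 2654435761UL);

	/* ask the kernel to reclaim; with zswap enabled these pages should
	 * be compressed rather than hitting the swap device */
	if (madvise(buf, NR_BYTES, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	for (i = 0; i < NR_BYTES; i++) {
		if (buf[i] != (unsigned char)(i * 2654435761UL)) {
			fprintf(stderr, "corruption at offset %zu\n", i);
			return 1;
		}
	}

	puts("pattern intact after reclaim");
	return 0;
}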
Hi Nhat - thanks for the review! On 27/10/2023 19:49, Nhat Pham wrote: > On Thu, Oct 19, 2023 at 4:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Previously zswap would refuse to store any folio bigger than order-0, >> and therefore all of those folios would be sent directly to the swap >> file. This is a minor inconvenience since swap can currently only >> support order-0 and PMD-sized THP, but with the pending introduction of >> "small-sized THP", and corresponding changes to swapfile to support any >> order of folio, these large folios will become more prevalent and >> without this zswap change, zswap will become unusable. Independently of >> the "small-sized THP" feature, this change makes it possible to store >> existing PMD-sized THPs in zswap. >> >> Modify zswap_store() to allow storing large folios. The function is >> split into 2 parts; zswap_store() does all the per-folio operations >> (i.e. checking there is enough space, etc). Then it calls a new helper, >> zswap_store_page(), for each page in the folio, which are stored as >> their own entries in the zswap pool. (These entries continue to be >> loaded back individually as single pages). If a store fails for any >> single page, then all previously successfully stored folio pages are >> invalidated. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> I've tested this on arm64 (m2) with zswap enabled, and running >> vm-scalability's `usemem` across multiple cores from within a >> memory-constrained memcg to force high volumes of swap. I've also run mm >> selftests and observe no regressions (although there is nothing [z]swap >> specific there) - does zswap have any specific tests I should run? > > There is a zswap kselftest in the cgroup suite: > https://lore.kernel.org/all/20230621153548.428093-1-cerasuolodomenico@gmail.com/ ahh great - I'll run that against future versions. > > Regardless, I feel like this kind of change is probably better to be tested > via stress tests anyway - allocating a bunch of anon memory with a certain > pattern, making sure they're not corrupted after being zswapped out etc. Yes - that's exactly what the `usemem` test I described above is doing (and its not reporting any corruption). > > >> >> This is based on mm-stable, since mm-unstable contains a zswap patch >> known to be buggy [1]. I thought it would be best to get comments on the >> shape, then do the rebase after that patch has been fixed. >> >> For context, small-sized THP is being discussed here [2], and I'm >> working on changes to swapfile to support non-PMD-sized large folios >> here [3]. 
>> >> [1] https://lore.kernel.org/linux-mm/21606fe5-fb9b-4d37-98ab-38c96819893b@arm.com/ >> [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ >> [3] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ >> >> Thanks, >> Ryan >> >> >> mm/zswap.c | 155 +++++++++++++++++++++++++++++++++-------------------- >> 1 file changed, 98 insertions(+), 57 deletions(-) >> >> diff --git a/mm/zswap.c b/mm/zswap.c >> index 37d2b1cb2ecb..51cbfc4e1ef8 100644 >> --- a/mm/zswap.c >> +++ b/mm/zswap.c >> @@ -1188,18 +1188,17 @@ static void zswap_fill_page(void *ptr, unsigned long value) >> memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); >> } >> >> -bool zswap_store(struct folio *folio) >> +static bool zswap_store_page(struct folio *folio, long index, >> + struct obj_cgroup *objcg, struct zswap_pool *pool) >> { >> swp_entry_t swp = folio->swap; >> int type = swp_type(swp); >> - pgoff_t offset = swp_offset(swp); >> - struct page *page = &folio->page; >> + pgoff_t offset = swp_offset(swp) + index; >> + struct page *page = folio_page(folio, index); >> struct zswap_tree *tree = zswap_trees[type]; >> struct zswap_entry *entry, *dupentry; >> struct scatterlist input, output; >> struct crypto_acomp_ctx *acomp_ctx; >> - struct obj_cgroup *objcg = NULL; >> - struct zswap_pool *pool; >> struct zpool *zpool; >> unsigned int dlen = PAGE_SIZE; >> unsigned long handle, value; >> @@ -1208,51 +1207,11 @@ bool zswap_store(struct folio *folio) >> gfp_t gfp; >> int ret; >> >> - VM_WARN_ON_ONCE(!folio_test_locked(folio)); >> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); >> - >> - /* Large folios aren't supported */ >> - if (folio_test_large(folio)) >> - return false; >> - >> - if (!zswap_enabled || !tree) >> + /* entry keeps the references if successfully stored. */ >> + if (!zswap_pool_get(pool)) >> return false; >> - >> - /* >> - * If this is a duplicate, it must be removed before attempting to store >> - * it, otherwise, if the store fails the old page won't be removed from >> - * the tree, and it might be written back overriding the new data. >> - */ >> - spin_lock(&tree->lock); >> - dupentry = zswap_rb_search(&tree->rbroot, offset); >> - if (dupentry) { >> - zswap_duplicate_entry++; >> - zswap_invalidate_entry(tree, dupentry); >> - } >> - spin_unlock(&tree->lock); >> - >> - /* >> - * XXX: zswap reclaim does not work with cgroups yet. Without a >> - * cgroup-aware entry LRU, we will push out entries system-wide based on >> - * local cgroup limits. 
>> - */ >> - objcg = get_obj_cgroup_from_folio(folio); >> - if (objcg && !obj_cgroup_may_zswap(objcg)) >> - goto reject; >> - >> - /* reclaim space if needed */ >> - if (zswap_is_full()) { >> - zswap_pool_limit_hit++; >> - zswap_pool_reached_full = true; >> - goto shrink; >> - } >> - >> - if (zswap_pool_reached_full) { >> - if (!zswap_can_accept()) >> - goto shrink; >> - else >> - zswap_pool_reached_full = false; >> - } >> + if (objcg) >> + obj_cgroup_get(objcg); >> >> /* allocate entry */ >> entry = zswap_entry_cache_alloc(GFP_KERNEL); >> @@ -1260,6 +1219,8 @@ bool zswap_store(struct folio *folio) >> zswap_reject_kmemcache_fail++; >> goto reject; >> } >> + entry->objcg = objcg; >> + entry->pool = pool; >> >> if (zswap_same_filled_pages_enabled) { >> src = kmap_atomic(page); >> @@ -1277,11 +1238,6 @@ bool zswap_store(struct folio *folio) >> if (!zswap_non_same_filled_pages_enabled) >> goto freepage; >> >> - /* if entry is successfully added, it keeps the reference */ >> - entry->pool = zswap_pool_current_get(); >> - if (!entry->pool) >> - goto freepage; >> - >> /* compress */ >> acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); >> >> @@ -1337,7 +1293,6 @@ bool zswap_store(struct folio *folio) >> entry->length = dlen; >> >> insert_entry: >> - entry->objcg = objcg; >> if (objcg) { >> obj_cgroup_charge_zswap(objcg, entry->length); >> /* Account before objcg ref is moved to tree */ >> @@ -1373,19 +1328,105 @@ bool zswap_store(struct folio *folio) >> >> put_dstmem: >> mutex_unlock(acomp_ctx->mutex); >> - zswap_pool_put(entry->pool); >> freepage: >> zswap_entry_cache_free(entry); >> reject: >> if (objcg) >> obj_cgroup_put(objcg); >> + zswap_pool_put(pool); >> return false; >> +} >> >> +bool zswap_store(struct folio *folio) >> +{ >> + long nr_pages = folio_nr_pages(folio); >> + swp_entry_t swp = folio->swap; >> + int type = swp_type(swp); >> + pgoff_t offset = swp_offset(swp); >> + struct zswap_tree *tree = zswap_trees[type]; >> + struct zswap_entry *entry; >> + struct obj_cgroup *objcg = NULL; >> + struct zswap_pool *pool; >> + bool ret = false; >> + long i; >> + >> + VM_WARN_ON_ONCE(!folio_test_locked(folio)); >> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); >> + >> + if (!zswap_enabled || !tree) >> + return false; >> + >> + /* >> + * If this is a duplicate, it must be removed before attempting to store >> + * it, otherwise, if the store fails the old page won't be removed from >> + * the tree, and it might be written back overriding the new data. >> + */ >> + spin_lock(&tree->lock); >> + for (i = 0; i < nr_pages; i++) { >> + entry = zswap_rb_search(&tree->rbroot, offset + i); >> + if (entry) { >> + zswap_duplicate_entry++; >> + zswap_invalidate_entry(tree, entry); >> + } >> + } >> + spin_unlock(&tree->lock); >> + >> + /* >> + * XXX: zswap reclaim does not work with cgroups yet. Without a >> + * cgroup-aware entry LRU, we will push out entries system-wide based on >> + * local cgroup limits. >> + */ >> + objcg = get_obj_cgroup_from_folio(folio); >> + if (objcg && !obj_cgroup_may_zswap(objcg)) >> + goto put_objcg; > > Hmm would this make more sense to check these limits > at a higher order page level or at a backing page (4 KB) > level? > > What if the cgroup still has some space before the new > folio comes in, but the new folio would drive the pool size > beyond the limit? Ditto for global zswap pool limit. I did consider this, but I thought I would keep it simple for the RFC and accept that we may go over the limits by (HPAGE_PMD_NR - 1) pages. 
It sounds like you don't think that will be acceptable. I see 2 options: - move this checking logic to be per-page in zswap_store_page() - pass a size (or nr_pages or order) to obj_cgroup_may_zswap(), zswap_is_full() and zswap_can_accept() so they know how much we are proposing to add and can make a correct decision. Personally I prefer the second option for efficiency. > > Previously, the object size is capped by the size of > a page (since we don't accept bigger size pages into > zswap). If zswap limit is exceded, it will not be exceeded > by more than 4 KB. No big deal. But I'm not sure > this will be OK as we move to bigger and bigger > sizes for the pages... > > If we do decide to check the cgroup's zswap limit for > each backing page, I'm not sure how the reclaim logic > (which will be introduced in the cgroup-aware zswap > LRU patch) will interact with this though. Ok so this points us in the direction of my option 2 above as well? > >> + >> + /* reclaim space if needed */ >> + if (zswap_is_full()) { >> + zswap_pool_limit_hit++; >> + zswap_pool_reached_full = true; >> + goto shrink; >> + } >> + >> + if (zswap_pool_reached_full) { >> + if (!zswap_can_accept()) >> + goto shrink; >> + else >> + zswap_pool_reached_full = false; >> + } >> + >> + pool = zswap_pool_current_get(); >> + if (!pool) >> + goto put_objcg; >> + >> + /* >> + * Store each page of the folio as a separate entry. If we fail to store >> + * a page, unwind by removing all the previous pages we stored. >> + */ >> + for (i = 0; i < nr_pages; i++) { >> + if (!zswap_store_page(folio, i, objcg, pool)) { >> + spin_lock(&tree->lock); >> + for (i--; i >= 0; i--) { >> + entry = zswap_rb_search(&tree->rbroot, offset + i); >> + if (entry) >> + zswap_invalidate_entry(tree, entry); >> + } >> + spin_unlock(&tree->lock); >> + goto put_pool; >> + } >> + } >> + >> + ret = true; >> +put_pool: >> + zswap_pool_put(pool); >> +put_objcg: >> + if (objcg) >> + obj_cgroup_put(objcg); >> + return ret; >> shrink: >> pool = zswap_pool_last_get(); >> if (pool && !queue_work(shrink_wq, &pool->shrink_work)) >> zswap_pool_put(pool); >> - goto reject; >> + goto put_objcg; >> } >> >> bool zswap_load(struct folio *folio) >> >> base-commit: 158978945f3173b8c1a88f8c5684a629736a57ac >> -- >> 2.25.1 >> >> > > I don't see anything else that is obviously wrong with this. > Seems straightforward to me. > > But I'm not too super familiar with THP swapping > logic either :) So maybe someone with exposure to both > should take a look too. > > And would it make sense to introduce a gate that guard > this so that users can opt-in/opt-out of this new feature, > at least as we experiment more with it to get more > data? There is already a gate at a higher level; CONFIG_THP_SWAP. If that is not enabled, then all large folios will be split to single pages before a swap entry is even allocated. I considered adding a separate gate for this, but given the above control, I didn't think the zswap-specific control was really neccessary. What do you think? Thanks, Ryan
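To make the second option concrete, a size-aware acceptance check might look roughly like the sketch below, treating the uncompressed size of the folio as a worst-case bound on its compressed size (the conservative estimate Ryan suggests later in the thread). The signatures here are hypothetical: neither zswap_can_accept() nor obj_cgroup_may_zswap() takes a size argument today.

/*
 * Hypothetical sketch of "option 2" above -- none of these signatures
 * exist in zswap today. The idea is to tell the acceptance checks how
 * much data the caller proposes to add, using the uncompressed size of
 * the folio as a conservative worst case for the compressed size.
 */
static bool zswap_can_accept_pages(long nr_pages)
{
	unsigned long used = zswap_pool_total_size >> PAGE_SHIFT;
	unsigned long limit = totalram_pages() * zswap_max_pool_percent / 100;

	/* only accept if the whole folio fits under the pool limit */
	return used + nr_pages <= limit;
}

/*
 * The per-folio checks in zswap_store() would then become something like:
 *
 *	if (objcg && !obj_cgroup_may_zswap(objcg, nr_pages * PAGE_SIZE))
 *		goto put_objcg;
 *	if (!zswap_can_accept_pages(nr_pages))
 *		goto shrink;
 */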
On Mon, Oct 30, 2023 at 4:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi Nhat - thanks for the review! > > > On 27/10/2023 19:49, Nhat Pham wrote: > > On Thu, Oct 19, 2023 at 4:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Previously zswap would refuse to store any folio bigger than order-0, > >> and therefore all of those folios would be sent directly to the swap > >> file. This is a minor inconvenience since swap can currently only > >> support order-0 and PMD-sized THP, but with the pending introduction of > >> "small-sized THP", and corresponding changes to swapfile to support any > >> order of folio, these large folios will become more prevalent and > >> without this zswap change, zswap will become unusable. Independently of > >> the "small-sized THP" feature, this change makes it possible to store > >> existing PMD-sized THPs in zswap. > >> > >> Modify zswap_store() to allow storing large folios. The function is > >> split into 2 parts; zswap_store() does all the per-folio operations > >> (i.e. checking there is enough space, etc). Then it calls a new helper, > >> zswap_store_page(), for each page in the folio, which are stored as > >> their own entries in the zswap pool. (These entries continue to be > >> loaded back individually as single pages). If a store fails for any > >> single page, then all previously successfully stored folio pages are > >> invalidated. > >> > >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >> --- > >> I've tested this on arm64 (m2) with zswap enabled, and running > >> vm-scalability's `usemem` across multiple cores from within a > >> memory-constrained memcg to force high volumes of swap. I've also run mm > >> selftests and observe no regressions (although there is nothing [z]swap > >> specific there) - does zswap have any specific tests I should run? > > > > There is a zswap kselftest in the cgroup suite: > > https://lore.kernel.org/all/20230621153548.428093-1-cerasuolodomenico@gmail.com/ > > ahh great - I'll run that against future versions. > > > > > Regardless, I feel like this kind of change is probably better to be tested > > via stress tests anyway - allocating a bunch of anon memory with a certain > > pattern, making sure they're not corrupted after being zswapped out etc. > > Yes - that's exactly what the `usemem` test I described above is doing (and its > not reporting any corruption). Nice, and I assume no regression (in runtime for e.g) in the usemem test as well? > > > > > > >> > >> This is based on mm-stable, since mm-unstable contains a zswap patch > >> known to be buggy [1]. I thought it would be best to get comments on the > >> shape, then do the rebase after that patch has been fixed. > >> > >> For context, small-sized THP is being discussed here [2], and I'm > >> working on changes to swapfile to support non-PMD-sized large folios > >> here [3]. 
> >> > >> [1] https://lore.kernel.org/linux-mm/21606fe5-fb9b-4d37-98ab-38c96819893b@arm.com/ > >> [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ > >> [3] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ > >> > >> Thanks, > >> Ryan > >> > >> > >> mm/zswap.c | 155 +++++++++++++++++++++++++++++++++-------------------- > >> 1 file changed, 98 insertions(+), 57 deletions(-) > >> > >> diff --git a/mm/zswap.c b/mm/zswap.c > >> index 37d2b1cb2ecb..51cbfc4e1ef8 100644 > >> --- a/mm/zswap.c > >> +++ b/mm/zswap.c > >> @@ -1188,18 +1188,17 @@ static void zswap_fill_page(void *ptr, unsigned long value) > >> memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); > >> } > >> > >> -bool zswap_store(struct folio *folio) > >> +static bool zswap_store_page(struct folio *folio, long index, > >> + struct obj_cgroup *objcg, struct zswap_pool *pool) > >> { > >> swp_entry_t swp = folio->swap; > >> int type = swp_type(swp); > >> - pgoff_t offset = swp_offset(swp); > >> - struct page *page = &folio->page; > >> + pgoff_t offset = swp_offset(swp) + index; > >> + struct page *page = folio_page(folio, index); > >> struct zswap_tree *tree = zswap_trees[type]; > >> struct zswap_entry *entry, *dupentry; > >> struct scatterlist input, output; > >> struct crypto_acomp_ctx *acomp_ctx; > >> - struct obj_cgroup *objcg = NULL; > >> - struct zswap_pool *pool; > >> struct zpool *zpool; > >> unsigned int dlen = PAGE_SIZE; > >> unsigned long handle, value; > >> @@ -1208,51 +1207,11 @@ bool zswap_store(struct folio *folio) > >> gfp_t gfp; > >> int ret; > >> > >> - VM_WARN_ON_ONCE(!folio_test_locked(folio)); > >> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > >> - > >> - /* Large folios aren't supported */ > >> - if (folio_test_large(folio)) > >> - return false; > >> - > >> - if (!zswap_enabled || !tree) > >> + /* entry keeps the references if successfully stored. */ > >> + if (!zswap_pool_get(pool)) > >> return false; > >> - > >> - /* > >> - * If this is a duplicate, it must be removed before attempting to store > >> - * it, otherwise, if the store fails the old page won't be removed from > >> - * the tree, and it might be written back overriding the new data. > >> - */ > >> - spin_lock(&tree->lock); > >> - dupentry = zswap_rb_search(&tree->rbroot, offset); > >> - if (dupentry) { > >> - zswap_duplicate_entry++; > >> - zswap_invalidate_entry(tree, dupentry); > >> - } > >> - spin_unlock(&tree->lock); > >> - > >> - /* > >> - * XXX: zswap reclaim does not work with cgroups yet. Without a > >> - * cgroup-aware entry LRU, we will push out entries system-wide based on > >> - * local cgroup limits. 
> >> - */ > >> - objcg = get_obj_cgroup_from_folio(folio); > >> - if (objcg && !obj_cgroup_may_zswap(objcg)) > >> - goto reject; > >> - > >> - /* reclaim space if needed */ > >> - if (zswap_is_full()) { > >> - zswap_pool_limit_hit++; > >> - zswap_pool_reached_full = true; > >> - goto shrink; > >> - } > >> - > >> - if (zswap_pool_reached_full) { > >> - if (!zswap_can_accept()) > >> - goto shrink; > >> - else > >> - zswap_pool_reached_full = false; > >> - } > >> + if (objcg) > >> + obj_cgroup_get(objcg); > >> > >> /* allocate entry */ > >> entry = zswap_entry_cache_alloc(GFP_KERNEL); > >> @@ -1260,6 +1219,8 @@ bool zswap_store(struct folio *folio) > >> zswap_reject_kmemcache_fail++; > >> goto reject; > >> } > >> + entry->objcg = objcg; > >> + entry->pool = pool; > >> > >> if (zswap_same_filled_pages_enabled) { > >> src = kmap_atomic(page); > >> @@ -1277,11 +1238,6 @@ bool zswap_store(struct folio *folio) > >> if (!zswap_non_same_filled_pages_enabled) > >> goto freepage; > >> > >> - /* if entry is successfully added, it keeps the reference */ > >> - entry->pool = zswap_pool_current_get(); > >> - if (!entry->pool) > >> - goto freepage; > >> - > >> /* compress */ > >> acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > >> > >> @@ -1337,7 +1293,6 @@ bool zswap_store(struct folio *folio) > >> entry->length = dlen; > >> > >> insert_entry: > >> - entry->objcg = objcg; > >> if (objcg) { > >> obj_cgroup_charge_zswap(objcg, entry->length); > >> /* Account before objcg ref is moved to tree */ > >> @@ -1373,19 +1328,105 @@ bool zswap_store(struct folio *folio) > >> > >> put_dstmem: > >> mutex_unlock(acomp_ctx->mutex); > >> - zswap_pool_put(entry->pool); > >> freepage: > >> zswap_entry_cache_free(entry); > >> reject: > >> if (objcg) > >> obj_cgroup_put(objcg); > >> + zswap_pool_put(pool); > >> return false; > >> +} > >> > >> +bool zswap_store(struct folio *folio) > >> +{ > >> + long nr_pages = folio_nr_pages(folio); > >> + swp_entry_t swp = folio->swap; > >> + int type = swp_type(swp); > >> + pgoff_t offset = swp_offset(swp); > >> + struct zswap_tree *tree = zswap_trees[type]; > >> + struct zswap_entry *entry; > >> + struct obj_cgroup *objcg = NULL; > >> + struct zswap_pool *pool; > >> + bool ret = false; > >> + long i; > >> + > >> + VM_WARN_ON_ONCE(!folio_test_locked(folio)); > >> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > >> + > >> + if (!zswap_enabled || !tree) > >> + return false; > >> + > >> + /* > >> + * If this is a duplicate, it must be removed before attempting to store > >> + * it, otherwise, if the store fails the old page won't be removed from > >> + * the tree, and it might be written back overriding the new data. > >> + */ > >> + spin_lock(&tree->lock); > >> + for (i = 0; i < nr_pages; i++) { > >> + entry = zswap_rb_search(&tree->rbroot, offset + i); > >> + if (entry) { > >> + zswap_duplicate_entry++; > >> + zswap_invalidate_entry(tree, entry); > >> + } > >> + } > >> + spin_unlock(&tree->lock); > >> + > >> + /* > >> + * XXX: zswap reclaim does not work with cgroups yet. Without a > >> + * cgroup-aware entry LRU, we will push out entries system-wide based on > >> + * local cgroup limits. > >> + */ > >> + objcg = get_obj_cgroup_from_folio(folio); > >> + if (objcg && !obj_cgroup_may_zswap(objcg)) > >> + goto put_objcg; > > > > Hmm would this make more sense to check these limits > > at a higher order page level or at a backing page (4 KB) > > level? 
> > > > What if the cgroup still has some space before the new > > folio comes in, but the new folio would drive the pool size > > beyond the limit? Ditto for global zswap pool limit. > > I did consider this, but I thought I would keep it simple for the RFC and accept > that we may go over the limits by (HPAGE_PMD_NR - 1) pages. It sounds like you > don't think that will be acceptable. Not necessarily unacceptable - I just wanted to point out a potential change in our assumption with zswap that might go unnoticed. I think there should be at least some sort of documentation or comments spelling this out. > > I see 2 options: > > - move this checking logic to be per-page in zswap_store_page() > - pass a size (or nr_pages or order) to obj_cgroup_may_zswap(), > zswap_is_full() and zswap_can_accept() so they know how much we are > proposing to add and can make a correct decision. > > Personally I prefer the second option for efficiency. Hmm if we go with the second option, how should this decision be made? It's impossible to predict the post-compression size of the huge page, even if we know its order. I don't have a better suggestion than the first option, so if anyone else has a better idea please chime in :) Personally, I would not flat out NACK this patch because of this limitation though, as long as it is at least spelled out somewhere. > > > > > Previously, the object size is capped by the size of > > a page (since we don't accept bigger size pages into > > zswap). If zswap limit is exceded, it will not be exceeded > > by more than 4 KB. No big deal. But I'm not sure > > this will be OK as we move to bigger and bigger > > sizes for the pages... > > > > If we do decide to check the cgroup's zswap limit for > > each backing page, I'm not sure how the reclaim logic > > (which will be introduced in the cgroup-aware zswap > > LRU patch) will interact with this though. > > Ok so this points us in the direction of my option 2 above as well? > > > > >> + > >> + /* reclaim space if needed */ > >> + if (zswap_is_full()) { > >> + zswap_pool_limit_hit++; > >> + zswap_pool_reached_full = true; > >> + goto shrink; > >> + } > >> + > >> + if (zswap_pool_reached_full) { > >> + if (!zswap_can_accept()) > >> + goto shrink; > >> + else > >> + zswap_pool_reached_full = false; > >> + } > >> + > >> + pool = zswap_pool_current_get(); > >> + if (!pool) > >> + goto put_objcg; > >> + > >> + /* > >> + * Store each page of the folio as a separate entry. If we fail to store > >> + * a page, unwind by removing all the previous pages we stored. > >> + */ > >> + for (i = 0; i < nr_pages; i++) { > >> + if (!zswap_store_page(folio, i, objcg, pool)) { > >> + spin_lock(&tree->lock); > >> + for (i--; i >= 0; i--) { > >> + entry = zswap_rb_search(&tree->rbroot, offset + i); > >> + if (entry) > >> + zswap_invalidate_entry(tree, entry); > >> + } > >> + spin_unlock(&tree->lock); > >> + goto put_pool; > >> + } > >> + } > >> + > >> + ret = true; > >> +put_pool: > >> + zswap_pool_put(pool); > >> +put_objcg: > >> + if (objcg) > >> + obj_cgroup_put(objcg); > >> + return ret; > >> shrink: > >> pool = zswap_pool_last_get(); > >> if (pool && !queue_work(shrink_wq, &pool->shrink_work)) > >> zswap_pool_put(pool); > >> - goto reject; > >> + goto put_objcg; > >> } > >> > >> bool zswap_load(struct folio *folio) > >> > >> base-commit: 158978945f3173b8c1a88f8c5684a629736a57ac > >> -- > >> 2.25.1 > >> > >> > > > > I don't see anything else that is obviously wrong with this. > > Seems straightforward to me. 
> > > > But I'm not too super familiar with THP swapping > > logic either :) So maybe someone with exposure to both > > should take a look too. > > > > And would it make sense to introduce a gate that guard > > this so that users can opt-in/opt-out of this new feature, > > at least as we experiment more with it to get more > > data? > > There is already a gate at a higher level; CONFIG_THP_SWAP. If that is not > enabled, then all large folios will be split to single pages before a swap entry > is even allocated. Hmm is there a runtime option as well, or does it not make sense to have it? But yeah, in principle I'm not against having a single knob controlling this behavior for both swap and zswap (as opposed to a zswap-specific knob). I'll defer any further objections to this aspect to other zswap contributors and maintainers :) > > I considered adding a separate gate for this, but given the above control, I > didn't think the zswap-specific control was really neccessary. What do you think? > > Thanks, > Ryan > >
On 02/11/2023 01:37, Nhat Pham wrote: > On Mon, Oct 30, 2023 at 4:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi Nhat - thanks for the review! >> >> >> On 27/10/2023 19:49, Nhat Pham wrote: >>> On Thu, Oct 19, 2023 at 4:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> Previously zswap would refuse to store any folio bigger than order-0, >>>> and therefore all of those folios would be sent directly to the swap >>>> file. This is a minor inconvenience since swap can currently only >>>> support order-0 and PMD-sized THP, but with the pending introduction of >>>> "small-sized THP", and corresponding changes to swapfile to support any >>>> order of folio, these large folios will become more prevalent and >>>> without this zswap change, zswap will become unusable. Independently of >>>> the "small-sized THP" feature, this change makes it possible to store >>>> existing PMD-sized THPs in zswap. >>>> >>>> Modify zswap_store() to allow storing large folios. The function is >>>> split into 2 parts; zswap_store() does all the per-folio operations >>>> (i.e. checking there is enough space, etc). Then it calls a new helper, >>>> zswap_store_page(), for each page in the folio, which are stored as >>>> their own entries in the zswap pool. (These entries continue to be >>>> loaded back individually as single pages). If a store fails for any >>>> single page, then all previously successfully stored folio pages are >>>> invalidated. >>>> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>> --- >>>> I've tested this on arm64 (m2) with zswap enabled, and running >>>> vm-scalability's `usemem` across multiple cores from within a >>>> memory-constrained memcg to force high volumes of swap. I've also run mm >>>> selftests and observe no regressions (although there is nothing [z]swap >>>> specific there) - does zswap have any specific tests I should run? >>> >>> There is a zswap kselftest in the cgroup suite: >>> https://lore.kernel.org/all/20230621153548.428093-1-cerasuolodomenico@gmail.com/ >> >> ahh great - I'll run that against future versions. >> >>> >>> Regardless, I feel like this kind of change is probably better to be tested >>> via stress tests anyway - allocating a bunch of anon memory with a certain >>> pattern, making sure they're not corrupted after being zswapped out etc. >> >> Yes - that's exactly what the `usemem` test I described above is doing (and its >> not reporting any corruption). > > Nice, and I assume no regression (in runtime for e.g) in the usemem test > as well? I was mainly focussing on correctness. I don't recall seeing any perf regression, but don't have the data to check right now. I'll post numbers with the next version (although I'm likely not going to have bandwidth until early December to pick this up again). > >> >>> >>> >>>> >>>> This is based on mm-stable, since mm-unstable contains a zswap patch >>>> known to be buggy [1]. I thought it would be best to get comments on the >>>> shape, then do the rebase after that patch has been fixed. >>>> >>>> For context, small-sized THP is being discussed here [2], and I'm >>>> working on changes to swapfile to support non-PMD-sized large folios >>>> here [3]. 
>>>> >>>> [1] https://lore.kernel.org/linux-mm/21606fe5-fb9b-4d37-98ab-38c96819893b@arm.com/ >>>> [2] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/ >>>> [3] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ >>>> >>>> Thanks, >>>> Ryan >>>> >>>> >>>> mm/zswap.c | 155 +++++++++++++++++++++++++++++++++-------------------- >>>> 1 file changed, 98 insertions(+), 57 deletions(-) >>>> >>>> diff --git a/mm/zswap.c b/mm/zswap.c >>>> index 37d2b1cb2ecb..51cbfc4e1ef8 100644 >>>> --- a/mm/zswap.c >>>> +++ b/mm/zswap.c >>>> @@ -1188,18 +1188,17 @@ static void zswap_fill_page(void *ptr, unsigned long value) >>>> memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); >>>> } >>>> >>>> -bool zswap_store(struct folio *folio) >>>> +static bool zswap_store_page(struct folio *folio, long index, >>>> + struct obj_cgroup *objcg, struct zswap_pool *pool) >>>> { >>>> swp_entry_t swp = folio->swap; >>>> int type = swp_type(swp); >>>> - pgoff_t offset = swp_offset(swp); >>>> - struct page *page = &folio->page; >>>> + pgoff_t offset = swp_offset(swp) + index; >>>> + struct page *page = folio_page(folio, index); >>>> struct zswap_tree *tree = zswap_trees[type]; >>>> struct zswap_entry *entry, *dupentry; >>>> struct scatterlist input, output; >>>> struct crypto_acomp_ctx *acomp_ctx; >>>> - struct obj_cgroup *objcg = NULL; >>>> - struct zswap_pool *pool; >>>> struct zpool *zpool; >>>> unsigned int dlen = PAGE_SIZE; >>>> unsigned long handle, value; >>>> @@ -1208,51 +1207,11 @@ bool zswap_store(struct folio *folio) >>>> gfp_t gfp; >>>> int ret; >>>> >>>> - VM_WARN_ON_ONCE(!folio_test_locked(folio)); >>>> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); >>>> - >>>> - /* Large folios aren't supported */ >>>> - if (folio_test_large(folio)) >>>> - return false; >>>> - >>>> - if (!zswap_enabled || !tree) >>>> + /* entry keeps the references if successfully stored. */ >>>> + if (!zswap_pool_get(pool)) >>>> return false; >>>> - >>>> - /* >>>> - * If this is a duplicate, it must be removed before attempting to store >>>> - * it, otherwise, if the store fails the old page won't be removed from >>>> - * the tree, and it might be written back overriding the new data. >>>> - */ >>>> - spin_lock(&tree->lock); >>>> - dupentry = zswap_rb_search(&tree->rbroot, offset); >>>> - if (dupentry) { >>>> - zswap_duplicate_entry++; >>>> - zswap_invalidate_entry(tree, dupentry); >>>> - } >>>> - spin_unlock(&tree->lock); >>>> - >>>> - /* >>>> - * XXX: zswap reclaim does not work with cgroups yet. Without a >>>> - * cgroup-aware entry LRU, we will push out entries system-wide based on >>>> - * local cgroup limits. 
>>>> - */ >>>> - objcg = get_obj_cgroup_from_folio(folio); >>>> - if (objcg && !obj_cgroup_may_zswap(objcg)) >>>> - goto reject; >>>> - >>>> - /* reclaim space if needed */ >>>> - if (zswap_is_full()) { >>>> - zswap_pool_limit_hit++; >>>> - zswap_pool_reached_full = true; >>>> - goto shrink; >>>> - } >>>> - >>>> - if (zswap_pool_reached_full) { >>>> - if (!zswap_can_accept()) >>>> - goto shrink; >>>> - else >>>> - zswap_pool_reached_full = false; >>>> - } >>>> + if (objcg) >>>> + obj_cgroup_get(objcg); >>>> >>>> /* allocate entry */ >>>> entry = zswap_entry_cache_alloc(GFP_KERNEL); >>>> @@ -1260,6 +1219,8 @@ bool zswap_store(struct folio *folio) >>>> zswap_reject_kmemcache_fail++; >>>> goto reject; >>>> } >>>> + entry->objcg = objcg; >>>> + entry->pool = pool; >>>> >>>> if (zswap_same_filled_pages_enabled) { >>>> src = kmap_atomic(page); >>>> @@ -1277,11 +1238,6 @@ bool zswap_store(struct folio *folio) >>>> if (!zswap_non_same_filled_pages_enabled) >>>> goto freepage; >>>> >>>> - /* if entry is successfully added, it keeps the reference */ >>>> - entry->pool = zswap_pool_current_get(); >>>> - if (!entry->pool) >>>> - goto freepage; >>>> - >>>> /* compress */ >>>> acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); >>>> >>>> @@ -1337,7 +1293,6 @@ bool zswap_store(struct folio *folio) >>>> entry->length = dlen; >>>> >>>> insert_entry: >>>> - entry->objcg = objcg; >>>> if (objcg) { >>>> obj_cgroup_charge_zswap(objcg, entry->length); >>>> /* Account before objcg ref is moved to tree */ >>>> @@ -1373,19 +1328,105 @@ bool zswap_store(struct folio *folio) >>>> >>>> put_dstmem: >>>> mutex_unlock(acomp_ctx->mutex); >>>> - zswap_pool_put(entry->pool); >>>> freepage: >>>> zswap_entry_cache_free(entry); >>>> reject: >>>> if (objcg) >>>> obj_cgroup_put(objcg); >>>> + zswap_pool_put(pool); >>>> return false; >>>> +} >>>> >>>> +bool zswap_store(struct folio *folio) >>>> +{ >>>> + long nr_pages = folio_nr_pages(folio); >>>> + swp_entry_t swp = folio->swap; >>>> + int type = swp_type(swp); >>>> + pgoff_t offset = swp_offset(swp); >>>> + struct zswap_tree *tree = zswap_trees[type]; >>>> + struct zswap_entry *entry; >>>> + struct obj_cgroup *objcg = NULL; >>>> + struct zswap_pool *pool; >>>> + bool ret = false; >>>> + long i; >>>> + >>>> + VM_WARN_ON_ONCE(!folio_test_locked(folio)); >>>> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); >>>> + >>>> + if (!zswap_enabled || !tree) >>>> + return false; >>>> + >>>> + /* >>>> + * If this is a duplicate, it must be removed before attempting to store >>>> + * it, otherwise, if the store fails the old page won't be removed from >>>> + * the tree, and it might be written back overriding the new data. >>>> + */ >>>> + spin_lock(&tree->lock); >>>> + for (i = 0; i < nr_pages; i++) { >>>> + entry = zswap_rb_search(&tree->rbroot, offset + i); >>>> + if (entry) { >>>> + zswap_duplicate_entry++; >>>> + zswap_invalidate_entry(tree, entry); >>>> + } >>>> + } >>>> + spin_unlock(&tree->lock); >>>> + >>>> + /* >>>> + * XXX: zswap reclaim does not work with cgroups yet. Without a >>>> + * cgroup-aware entry LRU, we will push out entries system-wide based on >>>> + * local cgroup limits. >>>> + */ >>>> + objcg = get_obj_cgroup_from_folio(folio); >>>> + if (objcg && !obj_cgroup_may_zswap(objcg)) >>>> + goto put_objcg; >>> >>> Hmm would this make more sense to check these limits >>> at a higher order page level or at a backing page (4 KB) >>> level? 
>>> >>> What if the cgroup still has some space before the new >>> folio comes in, but the new folio would drive the pool size >>> beyond the limit? Ditto for global zswap pool limit. >> >> I did consider this, but I thought I would keep it simple for the RFC and accept >> that we may go over the limits by (HPAGE_PMD_NR - 1) pages. It sounds like you >> don't think that will be acceptable. > > Not necessarily unacceptable - I just wanted to point out a potential > change in our assumption with zswap that might go unnoticed. > > I think there should be at least some sort of documentation or > comments spelling this out. > >> >> I see 2 options: >> >> - move this checking logic to be per-page in zswap_store_page() >> - pass a size (or nr_pages or order) to obj_cgroup_may_zswap(), >> zswap_is_full() and zswap_can_accept() so they know how much we are >> proposing to add and can make a correct decision. >> >> Personally I prefer the second option for efficiency. > > Hmm if we go with the second option, how should this decision > be made? It's impossible to predict the post-compression size > of the huge page, even if we know its order. But we have a worst case size (the uncompressed size). So we could use that as a conservative estimate? It means we might under-utilize the quota a bit, but I guess that's better than going over? > > I don't have a better suggestion than the first option, so if > anyone else has a better idea please chime in :) > > Personally, I would not flat out NACK this patch because of this > limitation though, as long as it is at least spelled out somewhere. OK, that sounds like the easiest approach :). I'm happy to just add a comment for v2. > >> >>> >>> Previously, the object size is capped by the size of >>> a page (since we don't accept bigger size pages into >>> zswap). If zswap limit is exceded, it will not be exceeded >>> by more than 4 KB. No big deal. But I'm not sure >>> this will be OK as we move to bigger and bigger >>> sizes for the pages... >>> >>> If we do decide to check the cgroup's zswap limit for >>> each backing page, I'm not sure how the reclaim logic >>> (which will be introduced in the cgroup-aware zswap >>> LRU patch) will interact with this though. >> >> Ok so this points us in the direction of my option 2 above as well? >> >>> >>>> + >>>> + /* reclaim space if needed */ >>>> + if (zswap_is_full()) { >>>> + zswap_pool_limit_hit++; >>>> + zswap_pool_reached_full = true; >>>> + goto shrink; >>>> + } >>>> + >>>> + if (zswap_pool_reached_full) { >>>> + if (!zswap_can_accept()) >>>> + goto shrink; >>>> + else >>>> + zswap_pool_reached_full = false; >>>> + } >>>> + >>>> + pool = zswap_pool_current_get(); >>>> + if (!pool) >>>> + goto put_objcg; >>>> + >>>> + /* >>>> + * Store each page of the folio as a separate entry. If we fail to store >>>> + * a page, unwind by removing all the previous pages we stored. 
>>>> + */ >>>> + for (i = 0; i < nr_pages; i++) { >>>> + if (!zswap_store_page(folio, i, objcg, pool)) { >>>> + spin_lock(&tree->lock); >>>> + for (i--; i >= 0; i--) { >>>> + entry = zswap_rb_search(&tree->rbroot, offset + i); >>>> + if (entry) >>>> + zswap_invalidate_entry(tree, entry); >>>> + } >>>> + spin_unlock(&tree->lock); >>>> + goto put_pool; >>>> + } >>>> + } >>>> + >>>> + ret = true; >>>> +put_pool: >>>> + zswap_pool_put(pool); >>>> +put_objcg: >>>> + if (objcg) >>>> + obj_cgroup_put(objcg); >>>> + return ret; >>>> shrink: >>>> pool = zswap_pool_last_get(); >>>> if (pool && !queue_work(shrink_wq, &pool->shrink_work)) >>>> zswap_pool_put(pool); >>>> - goto reject; >>>> + goto put_objcg; >>>> } >>>> >>>> bool zswap_load(struct folio *folio) >>>> >>>> base-commit: 158978945f3173b8c1a88f8c5684a629736a57ac >>>> -- >>>> 2.25.1 >>>> >>>> >>> >>> I don't see anything else that is obviously wrong with this. >>> Seems straightforward to me. >>> >>> But I'm not too super familiar with THP swapping >>> logic either :) So maybe someone with exposure to both >>> should take a look too. >>> >>> And would it make sense to introduce a gate that guard >>> this so that users can opt-in/opt-out of this new feature, >>> at least as we experiment more with it to get more >>> data? >> >> There is already a gate at a higher level; CONFIG_THP_SWAP. If that is not >> enabled, then all large folios will be split to single pages before a swap entry >> is even allocated. > > Hmm is there a runtime option as well, or does it not make sense to have > it? But yeah, in principle I'm not against having a single knob controlling > this behavior for both swap and zswap (as opposed to a zswap-specific > knob). There is no runtime knob equivalent to CONFIG_THP_SWAP at the moment, so I guess that proves that it is not needed at the swap level. And I don't really see the benefit for explicitly disabling at the zswap level. So I would prefer to keep it simple and not add any new runtime knob. The closest existing knob is the runtime THP sysfs interface - if THP is disabled (from boot) then there are no THPs to swap. (of course if disabled after boot, there may still be some THPs that were created earlier). > > I'll defer any further objections to this aspect to other zswap contributors > and maintainers :) >> >> I considered adding a separate gate for this, but given the above control, I >> didn't think the zswap-specific control was really neccessary. What do you think? >> >> Thanks, >> Ryan >> >>
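For concreteness, "option 2" discussed above (telling the acceptance checks how many pages the incoming folio would add, and assuming the worst case of no compression) could look roughly like the sketch below. This is an illustrative sketch only, not part of the posted patch: zswap_can_accept_nr_pages() is a hypothetical variant of the existing zswap_can_accept(), reusing the same threshold calculation based on the existing zswap_accept_thr_percent, zswap_max_pool_percent and zswap_pool_total_size globals in mm/zswap.c.

/*
 * Hypothetical variant of zswap_can_accept() that accounts for the number
 * of pages the incoming folio would add. Every page is treated as
 * PAGE_SIZE, i.e. the uncompressed worst case, so the pool may be
 * under-utilized slightly but the limit is never knowingly exceeded.
 */
static bool zswap_can_accept_nr_pages(long nr_pages)
{
	unsigned long max_pages = zswap_accept_thr_percent *
				  zswap_max_pool_percent *
				  totalram_pages() / 100 / 100;
	unsigned long cur_pages = DIV_ROUND_UP(zswap_pool_total_size,
					       PAGE_SIZE);

	return cur_pages + nr_pages <= max_pages;
}

zswap_store() would then call zswap_can_accept_nr_pages(folio_nr_pages(folio)) once per folio before the per-page store loop (with an analogous size-aware obj_cgroup_may_zswap() variant for the cgroup check), instead of the argument-less checks in the diff above. The trade-off, as noted in the thread, is some under-utilization of the quota in exchange for not overshooting the limits by up to (HPAGE_PMD_NR - 1) pages as the per-folio check in the patch can.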