Message ID | 20231113130601.3350915-1-hezhongkun.hzk@bytedance.com |
---|---|
State | New |
Headers |
From: Zhongkun He <hezhongkun.hzk@bytedance.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Zhongkun He <hezhongkun.hzk@bytedance.com>
Subject: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios
Date: Mon, 13 Nov 2023 21:06:01 +0800
Message-Id: <20231113130601.3350915-1-hezhongkun.hzk@bytedance.com>
X-Mailer: git-send-email 2.25.1
MIME-Version: 1.0 |
Series | mm:zswap: fix zswap entry reclamation failure in two scenarios |
Commit Message
Zhongkun He
Nov. 13, 2023, 1:06 p.m. UTC
I recently found two scenarios in which a zswap entry cannot be
released, which causes shrink_worker and active recycling to fail:
1) The swap entry has been freed, but it is still cached in
swap_slots_cache, so there is no swap cache page and swapcount = 0.
2) The option zswap_exclusive_loads_enabled is disabled and zswap_load
has completed (the page is in the swap cache and swapcount = 0).
Both cases can be detected by checking for swapcount = 0; fix it.
Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
mm/zswap.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)
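The core of the change, condensed from the diff (quoted in full in the comments below): when the swap-cache lookup in zswap_writeback_entry() fails or races, the entry is released anyway if its swap count is zero.

    /*
     * Condensed from the diff: zswap_writeback_entry() after the
     * __read_swap_cache_async() call. If the lookup failed (-ENOMEM) or an
     * existing page was found (-EEXIST), a swap count of zero means the
     * zswap entry can be released without doing any writeback I/O.
     */
    if (ret) {
        struct swap_info_struct *si = get_swap_device(swpentry);

        if (!si)
            goto out;
        if (!swap_swapcount(si, swpentry))
            ret = 0;    /* treat as reclaimed: free the zswap entry */
        put_swap_device(si);
        goto out;
    }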
Comments
On Mon, Nov 13, 2023 at 8:06 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > I recently found two scenarios where zswap entry could not be > released, which will cause shrink_worker and active recycling > to fail. > 1)The swap entry has been freed, but cached in swap_slots_cache, > no swap cache and swapcount=0. > 2)When the option zswap_exclusive_loads_enabled disabled and > zswap_load completed(page in swap_cache and swapcount = 0). > > The above two cases need to be determined by swapcount=0, > fix it. > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> > --- > mm/zswap.c | 35 +++++++++++++++++++++++++---------- > 1 file changed, 25 insertions(+), 10 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 74411dfdad92..db95491bcdd5 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1063,11 +1063,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > struct mempolicy *mpol; > struct scatterlist input, output; > struct crypto_acomp_ctx *acomp_ctx; > + struct swap_info_struct *si; > struct zpool *pool = zswap_find_zpool(entry); > bool page_was_allocated; > u8 *src, *tmp = NULL; > unsigned int dlen; > - int ret; > + int ret = 0; > struct writeback_control wbc = { > .sync_mode = WB_SYNC_NONE, > }; > @@ -1082,16 +1083,30 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > mpol = get_task_policy(current); > page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol, > NO_INTERLEAVE_INDEX, &page_was_allocated); > - if (!page) { > + if (!page) > ret = -ENOMEM; > - goto fail; > - } > - > - /* Found an existing page, we raced with load/swapin */ > - if (!page_was_allocated) { > + else if (!page_was_allocated) { > + /* Found an existing page, we raced with load/swapin */ > put_page(page); > ret = -EEXIST; > - goto fail; > + } > + > + if (ret) { > + si = get_swap_device(swpentry); > + if (!si) > + goto out; > + > + /* Two cases to directly release zswap_entry. > + * 1) -ENOMEM,if the swpentry has been freed, but cached in > + * swap_slots_cache(no page and swapcount = 0). > + * 2) -EEXIST, option zswap_exclusive_loads_enabled disabled and > + * zswap_load completed(page in swap_cache and swapcount = 0). > + */ These two cases should not count as "successful writeback" right? I'm slightly biased of course, since my zswap shrinker depends on this as one of the potential signals for over-shrinking - but that aside, I think that this constitutes a failed writeback (i.e should not increment writeback counter, and the limit-based reclaim should try again etc.). If anything, it will make it incredibly confusing for users. For instance, we were trying to estimate the number of zswap store fails by subtracting the writeback count from the overall pswpout, and this could throw us off by inflating the writeback count, and deflating the zswap store failure count as a result. Regarding the second case specifically, I thought that was the point of having zswap_exclusive_loads_enabled disabled - i.e still keeps a copy around in the zswap pool even after a completed zswap_load? Based on the Kconfig documentation: "This avoids having two copies of the same page in memory (compressed and uncompressed) after faulting in a page from zswap. The cost is that if the page was never dirtied and needs to be swapped out again, it will be re-compressed." 
> + if (!swap_swapcount(si, swpentry)) > + ret = 0; > + > + put_swap_device(si); > + goto out; > } > > /* > @@ -1106,7 +1121,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > spin_unlock(&tree->lock); > delete_from_swap_cache(page_folio(page)); > ret = -ENOMEM; > - goto fail; > + goto out; > } > spin_unlock(&tree->lock); > > @@ -1151,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > return ret; > > -fail: > +out: > if (!zpool_can_sleep_mapped(pool)) > kfree(tmp); > > -- > 2.25.1 >
Thanks for your time, Nhat. > > These two cases should not count as "successful writeback" right? > This is true from the perspective of the writeback itself, but should it also be considered successful from the purpose of the writeback, i.e. whether the compressed memory and zswap_entry can be reclaimed? > I'm slightly biased of course, since my zswap shrinker depends on this > as one of the potential signals for over-shrinking - but that aside, I think > that this constitutes a failed writeback (i.e should not increment writeback > counter, and the limit-based reclaim should try again etc.). If anything, > it will make it incredibly confusing for users. This patch will skip the writeback step,so the writeback counter will not be incremented. Currently MAX_RECLAIM_RETRIES is 14, shrink_worker will often fail if writeback fails. > > For instance, we were trying to estimate the number of zswap store > fails by subtracting the writeback count from the overall pswpout, and > this could throw us off by inflating the writeback count, and deflating > the zswap store failure count as a result. As mentioned above, writeback counter will not be incremented. > > Regarding the second case specifically, I thought that was the point of > having zswap_exclusive_loads_enabled disabled - i.e still keeps a copy > around in the zswap pool even after a completed zswap_load? Based > on the Kconfig documentation: > > "This avoids having two copies of the same page in memory > (compressed and uncompressed) after faulting in a page from zswap. > The cost is that if the page was never dirtied and needs to be > swapped out again, it will be re-compressed." > Yes,i know the point,in the case of reading, there is no data update, so the next swapout does not need to be compressed again. Consider this scenario,there is a lot of data cached in memory and zswap, hit the limit,and shrink_worker will fail. The new coming data be written directly to swap due to zswap_store failure. Should we free the last one to store the latest one in zswap.
On Tue, Nov 14, 2023 at 12:21 AM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > > Thanks for your time, Nhat. > > > > > These two cases should not count as "successful writeback" right? > > > > This is true from the perspective of the writeback itself, but should it > also be considered successful from the purpose of the writeback, > i.e. whether the compressed memory and zswap_entry can be reclaimed? > > > I'm slightly biased of course, since my zswap shrinker depends on this > > as one of the potential signals for over-shrinking - but that aside, I think > > that this constitutes a failed writeback (i.e should not increment writeback > > counter, and the limit-based reclaim should try again etc.). If anything, > > it will make it incredibly confusing for users. > > This patch will skip the writeback step,so the writeback counter will not > be incremented. Currently MAX_RECLAIM_RETRIES is 14, shrink_worker > will often fail if writeback fails. Ah my bad, I should have been clearer. I was looking at the zswap shrinker patch series (specifically the cgroup-aware LRU patch), which moves the counter update out of zswap_writeback_entry. If we apply that patch on top of that series, we will screw up the counter. Should be easily fixable anyway though. > > > > > For instance, we were trying to estimate the number of zswap store > > fails by subtracting the writeback count from the overall pswpout, and > > this could throw us off by inflating the writeback count, and deflating > > the zswap store failure count as a result. > > As mentioned above, writeback counter will not be incremented. > > > > > Regarding the second case specifically, I thought that was the point of > > having zswap_exclusive_loads_enabled disabled - i.e still keeps a copy > > around in the zswap pool even after a completed zswap_load? Based > > on the Kconfig documentation: > > > > "This avoids having two copies of the same page in memory > > (compressed and uncompressed) after faulting in a page from zswap. > > The cost is that if the page was never dirtied and needs to be > > swapped out again, it will be re-compressed." > > > > Yes,i know the point,in the case of reading, there is no data update, > so the next swapout does not need to be compressed again. > Consider this scenario,there is a lot of data cached in memory and zswap, > hit the limit,and shrink_worker will fail. The new coming data be written > directly to swap due to zswap_store failure. Should we free the last one > to store the latest one in zswap. Ah I think I understand the point of the patch a bit better now. Essentially, we're invalidating these entries, which does reclaim the memory used for these compressed objects, but there is no IO involved. Writeback-less shrinking, if you will. This will still screw up one of the heuristics I'm using for the zswap shrinker a bit, but that should be easily fixable with some plumbing. Same goes for the writeback counter - but depending on the order in which Andrew apply the patches, you might have to resolve the conflicts there :) Other than this objection, I think this optimization makes sense to me: In the first case, we already freed the swap entry. Might as well also dropped the zswap entry. In the second case, we already have another copy in memory, so dropping the compressed copy to make space for warmer objects coming into zswap makes sense to me. 
Might be worth doing a micro-benchmark to verify this intuition, but I agree that it's more important to maintain the LRU ordering than any CPU saving from skipping re-compression. I would suggest that you should expand on this on the commit log to make clearer the motivation behind this optimization, if you were to re-submit this patch for some reason (for e.g to resolve the aforementioned conflicts with the zswap shrinker series). But otherwise, LGTM! Feel free to add the following tag: Reviewed-by: Nhat Pham <nphamcs@gmail.com>
+Ying On Mon, Nov 13, 2023 at 5:06 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > I recently found two scenarios where zswap entry could not be > released, which will cause shrink_worker and active recycling > to fail. > 1)The swap entry has been freed, but cached in swap_slots_cache, > no swap cache and swapcount=0. > 2)When the option zswap_exclusive_loads_enabled disabled and > zswap_load completed(page in swap_cache and swapcount = 0). For case (1), I think a cleaner solution would be to move the zswap_invalidate() call from swap_range_free() (which is called after the cached slots are freed) to __swap_entry_free_locked() if the usage goes to 0. I actually think conceptually this makes not just for zswap_invalidate(), but also for the arch call, memcg uncharging, etc. Slots caching is a swapfile optimization that should be internal to swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND not in the swap cache), all the hooks should be called (memcg, zswap, arch, ..) as the swap entry is effectively freed. The fact that swapfile code internally batches and caches slots should be transparent to other parts of MM. I am not sure if the calls can just be moved or if there are underlying assumptions in the implementation that would be broken, but it feels like the right thing to do. For case (2), I don't think zswap can just decide to free the entry. In that case, the page is in the swap cache pointing to a swp_entry which has a corresponding zswap entry, and the page is clean because it is already in swap/zswap, so we don't need to write it out again unless it is redirtied. If zswap just drops the entry, and reclaim tries to reclaim the page in the swap cache, it will drop the page assuming that there is a copy in swap/zswap (because it is clean). Now we lost all copies of the page. Am I missing something? > > The above two cases need to be determined by swapcount=0, > fix it. > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> > --- > mm/zswap.c | 35 +++++++++++++++++++++++++---------- > 1 file changed, 25 insertions(+), 10 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 74411dfdad92..db95491bcdd5 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1063,11 +1063,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > struct mempolicy *mpol; > struct scatterlist input, output; > struct crypto_acomp_ctx *acomp_ctx; > + struct swap_info_struct *si; > struct zpool *pool = zswap_find_zpool(entry); > bool page_was_allocated; > u8 *src, *tmp = NULL; > unsigned int dlen; > - int ret; > + int ret = 0; > struct writeback_control wbc = { > .sync_mode = WB_SYNC_NONE, > }; > @@ -1082,16 +1083,30 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > mpol = get_task_policy(current); > page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol, > NO_INTERLEAVE_INDEX, &page_was_allocated); > - if (!page) { > + if (!page) > ret = -ENOMEM; > - goto fail; > - } > - > - /* Found an existing page, we raced with load/swapin */ > - if (!page_was_allocated) { > + else if (!page_was_allocated) { > + /* Found an existing page, we raced with load/swapin */ > put_page(page); > ret = -EEXIST; > - goto fail; > + } > + > + if (ret) { > + si = get_swap_device(swpentry); > + if (!si) > + goto out; > + > + /* Two cases to directly release zswap_entry. > + * 1) -ENOMEM,if the swpentry has been freed, but cached in > + * swap_slots_cache(no page and swapcount = 0). 
> + * 2) -EEXIST, option zswap_exclusive_loads_enabled disabled and > + * zswap_load completed(page in swap_cache and swapcount = 0). > + */ > + if (!swap_swapcount(si, swpentry)) > + ret = 0; > + > + put_swap_device(si); > + goto out; > } > > /* > @@ -1106,7 +1121,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > spin_unlock(&tree->lock); > delete_from_swap_cache(page_folio(page)); > ret = -ENOMEM; > - goto fail; > + goto out; > } > spin_unlock(&tree->lock); > > @@ -1151,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > return ret; > > -fail: > +out: > if (!zpool_can_sleep_mapped(pool)) > kfree(tmp); > > -- > 2.25.1 >
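For illustration, a rough sketch of the case (1) suggestion above: invoke the zswap hook from __swap_entry_free_locked() when the usage drops to zero, instead of from swap_range_free(). This is not the patch under review; the surrounding lines are paraphrased from mm/swapfile.c around v6.6 and may differ in detail.

    /* mm/swapfile.c, tail of __swap_entry_free_locked() -- sketch only */
    usage = count | has_cache;
    if (usage)
        WRITE_ONCE(p->swap_map[offset], usage);
    else {
        WRITE_ONCE(p->swap_map[offset], SWAP_HAS_CACHE);
        /*
         * Swap count reached zero and the entry is not in the swap
         * cache: it is effectively freed, so the compressed copy can
         * be dropped here rather than when the slot cache is drained.
         */
        zswap_invalidate(p->type, offset);
    }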
> > Ah my bad, I should have been clearer. > > I was looking at the zswap shrinker patch series (specifically the > cgroup-aware LRU patch), which moves the counter update out of > zswap_writeback_entry. If we apply that patch on top of that series, we will > screw up the counter. Should be easily fixable anyway though. Got it. > > Ah I think I understand the point of the patch a bit better now. > > Essentially, we're invalidating these entries, which does reclaim the > memory used for these compressed objects, but there is no IO involved. > Writeback-less shrinking, if you will. > > This will still screw up one of the heuristics I'm using for the zswap > shrinker a bit, but that should be easily fixable with some plumbing. > Same goes for the writeback counter - but depending on the order in > which Andrew apply the patches, you might have to resolve the conflicts > there :) OK, I will fix it. > > Other than this objection, I think this optimization makes sense to me: > > In the first case, we already freed the swap entry. Might as well also > dropped the zswap entry. > > In the second case, we already have another copy in memory, so > dropping the compressed copy to make space for warmer objects > coming into zswap makes sense to me. Might be worth doing a > micro-benchmark to verify this intuition, but I agree that it's more > important to maintain the LRU ordering than any CPU saving from > skipping re-compression. > > I would suggest that you should expand on this on the commit log > to make clearer the motivation behind this optimization, if you were > to re-submit this patch for some reason (for e.g to resolve the > aforementioned conflicts with the zswap shrinker series). > > But otherwise, LGTM! > > Feel free to add the following tag: > Reviewed-by: Nhat Pham <nphamcs@gmail.com> Thanks, there are still some commits from Yosry, after that, I will send it again.
> For case (1), I think a cleaner solution would be to move the
> zswap_invalidate() call from swap_range_free() (which is called after
> the cached slots are freed) to __swap_entry_free_locked() if the usage
> goes to 0. I actually think conceptually this makes not just for
> zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
> Slots caching is a swapfile optimization that should be internal to
> swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND
> not in the swap cache), all the hooks should be called (memcg, zswap,
> arch, ..) as the swap entry is effectively freed. The fact that
> swapfile code internally batches and caches slots should be
> transparent to other parts of MM. I am not sure if the calls can just
> be moved or if there are underlying assumptions in the implementation
> that would be broken, but it feels like the right thing to do.

Good idea, this is indeed a clear solution. I'll try it in another patch later.

> For case (2), I don't think zswap can just decide to free the entry.
>
> In that case, the page is in the swap cache pointing to a swp_entry
> which has a corresponding zswap entry, and the page is clean because
> it is already in swap/zswap, so we don't need to write it out again
> unless it is redirtied. If zswap just drops the entry, and reclaim
> tries to reclaim the page in the swap cache, it will drop the page
> assuming that there is a copy in swap/zswap (because it is clean). Now
> we lost all copies of the page.
>
> Am I missing something?

Ah, my bad. I missed the step of marking the page as dirty. Please have a look: just like zswap_exclusive_loads_enabled, set the page dirty so that it can be paged out again.

    if (!page_was_allocated) {
        if (!count) {
            set_page_dirty(page);
            ret = 0;
        } else
            ret = -EEXIST;
        put_page(page);
    }

Thanks for your feedback, Yosry.
On Wed, Nov 15, 2023 at 4:53 AM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > > > For case (1), I think a cleaner solution would be to move the > > zswap_invalidate() call from swap_range_free() (which is called after > > the cached slots are freed) to __swap_entry_free_locked() if the usage > > goes to 0. I actually think conceptually this makes not just for > > zswap_invalidate(), but also for the arch call, memcg uncharging, etc. > > Slots caching is a swapfile optimization that should be internal to > > swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND > > not in the swap cache), all the hooks should be called (memcg, zswap, > > arch, ..) as the swap entry is effectively freed. The fact that > > swapfile code internally batches and caches slots should be > > transparent to other parts of MM. I am not sure if the calls can just > > be moved or if there are underlying assumptions in the implementation > > that would be broken, but it feels like the right thing to do. > > Good idea, This is indeed a clear solution. I'll try it in another > patch later. > > > > > For case (2), I don't think zswap can just decide to free the entry. > > > > In that case, the page is in the swap cache pointing to a swp_entry > > which has a corresponding zswap entry, and the page is clean because > > it is already in swap/zswap, so we don't need to write it out again > > unless it is redirtied. If zswap just drops the entry, and reclaim > > tries to reclaim the page in the swap cache, it will drop the page > > assuming that there is a copy in swap/zswap (because it is clean). Now > > we lost all copies of the page. > > > > Am I missing something? > > > > Ah, my bad. Missed the step of marking the page as dirty. > Please have a look, just like zswap_exclusive_loads_enabled, > set page dity so that it can be pageout again. > if (!page_was_allocated) { > if (!count) { > set_page_dirty(page); > ret = 0; > } else > ret = -EEXIST; > put_page(page); > } I think we may need to try to lock the folio. Otherwise we may race with reclaim reading the dirty bit before we set it. Taking a step back, this seems like we are going behind exclusive loads. We "should" keep the page in zswap as exclusive loads are disabled and the page is not yet invalidated from zswap (the swap entry is still in use). What you are trying to do here is sneakily drop the page from zswap as if we wrote it back, but we didn't. We just know that it was already loaded from zswap. We are essentially making the previous load exclusive retroactively. Is there a reason why exclusive loads cannot be enabled to achieve the same result in the (arguably) correct way? > Thanks for your feedback, Yosry.
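To make the discussion concrete, here is one hedged sketch of what the case (2) branch could look like with the dirty marking done under the page lock, combining Zhongkun's snippet above with Yosry's locking concern. This is illustrative only: si is assumed to have been obtained earlier via get_swap_device(swpentry), and the race reasoning has not been validated.

    /* zswap_writeback_entry(), after __read_swap_cache_async() -- sketch */
    } else if (!page_was_allocated) {
        /* Raced with load/swapin: an uncompressed copy already exists. */
        if (!swap_swapcount(si, swpentry) && trylock_page(page)) {
            /*
             * The swap count is zero, so the swap cache page is the
             * only remaining copy once the zswap entry is dropped.
             * Mark it dirty under the page lock so reclaim cannot
             * discard it as clean in the meantime.
             */
            set_page_dirty(page);
            unlock_page(page);
            ret = 0;
        } else {
            ret = -EEXIST;
        }
        put_page(page);
    }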
On Thu, Nov 16, 2023 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > I think we may need to try to lock the folio. Otherwise we may race > with reclaim reading the dirty bit before we set it. > > Taking a step back, this seems like we are going behind exclusive > loads. We "should" keep the page in zswap as exclusive loads are > disabled and the page is not yet invalidated from zswap (the swap > entry is still in use). What you are trying to do here is sneakily > drop the page from zswap as if we wrote it back, but we didn't. If we want to reclaim the cached zswap_entry, Writing back might be the easy way. > We just know that it was already loaded from zswap. We are essentially > making the previous load exclusive retroactively. > > Is there a reason why exclusive loads cannot be enabled to achieve the > same result in the (arguably) correct way? > zswap_exclusive_loads is not enabled by default, so the shrink_worker may fail if there are many cached zswap_entries on the zswap_pool->lru. Is it possible to make zswap_exclusive_loads the only way in zswap_load? It only makes sense when the page is read and no longer dirty. If the page is read frequently, it should stay in cache rather than zswap. The benefit of doing this is very small, two copies of the same page in memory. Thanks.
On Wed, Nov 15, 2023 at 7:33 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > > On Thu, Nov 16, 2023 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > I think we may need to try to lock the folio. Otherwise we may race > > with reclaim reading the dirty bit before we set it. > > > > Taking a step back, this seems like we are going behind exclusive > > loads. We "should" keep the page in zswap as exclusive loads are > > disabled and the page is not yet invalidated from zswap (the swap > > entry is still in use). What you are trying to do here is sneakily > > drop the page from zswap as if we wrote it back, but we didn't. > > If we want to reclaim the cached zswap_entry, Writing back might > be the easy way. > > > We just know that it was already loaded from zswap. We are essentially > > making the previous load exclusive retroactively. > > > > Is there a reason why exclusive loads cannot be enabled to achieve the > > same result in the (arguably) correct way? > > > > zswap_exclusive_loads is not enabled by default, so the shrink_worker > may fail if there are many cached zswap_entries on the zswap_pool->lru. It can be enabled at runtime, or enabled by default by using CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON. > > > Is it possible to make zswap_exclusive_loads the only way in zswap_load? > It only makes sense when the page is read and no longer dirty. > If the page is read frequently, it should stay in cache rather than zswap. > The benefit of doing this is very small, two copies of the same page > in memory. The reason I added it behind runtime and config knobs is to preserve the existing behavior in case someone depends on it. At Google, we have been using exclusive loads for a long time. If other users of zswap agree to make this the default behavior or make it the only way to do zswap loads I don't have a problem with it. > > > Thanks.
On Thu, Nov 16, 2023 at 12:10 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > It can be enabled at runtime, or enabled by default by using > CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON. > Yes, I see it in the doc. Thanks. > > > The reason I added it behind runtime and config knobs is to preserve > the existing behavior in case someone depends on it. At Google, we > have been using exclusive loads for a long time. If other users of > zswap agree to make this the default behavior or make it the only way > to do zswap loads I don't have a problem with it. > Got it. Thanks for your feedback.
Yosry Ahmed <yosryahmed@google.com> writes: > +Ying > > On Mon, Nov 13, 2023 at 5:06 AM Zhongkun He > <hezhongkun.hzk@bytedance.com> wrote: >> >> I recently found two scenarios where zswap entry could not be >> released, which will cause shrink_worker and active recycling >> to fail. >> 1)The swap entry has been freed, but cached in swap_slots_cache, >> no swap cache and swapcount=0. >> 2)When the option zswap_exclusive_loads_enabled disabled and >> zswap_load completed(page in swap_cache and swapcount = 0). > > For case (1), I think a cleaner solution would be to move the > zswap_invalidate() call from swap_range_free() (which is called after > the cached slots are freed) to __swap_entry_free_locked() if the usage > goes to 0. I actually think conceptually this makes not just for > zswap_invalidate(), but also for the arch call, memcg uncharging, etc. > Slots caching is a swapfile optimization that should be internal to > swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND > not in the swap cache), all the hooks should be called (memcg, zswap, > arch, ..) as the swap entry is effectively freed. The fact that > swapfile code internally batches and caches slots should be > transparent to other parts of MM. I am not sure if the calls can just > be moved or if there are underlying assumptions in the implementation > that would be broken, but it feels like the right thing to do. This sounds reasonable for me. Just don't forget to change other swap_range_free() callers too. -- Best Regards, Huang, Ying
> This sounds reasonable for me. Just don't forget to change other
> swap_range_free() callers too.

Got it, thanks.

> --
> Best Regards,
> Huang, Ying
Hi Yosry, On Tue, Nov 14, 2023 at 9:16 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > 1)The swap entry has been freed, but cached in swap_slots_cache, > > no swap cache and swapcount=0. > > 2)When the option zswap_exclusive_loads_enabled disabled and > > zswap_load completed(page in swap_cache and swapcount = 0). > > For case (1), I think a cleaner solution would be to move the > zswap_invalidate() call from swap_range_free() (which is called after > the cached slots are freed) to __swap_entry_free_locked() if the usage > goes to 0. I actually think conceptually this makes not just for > zswap_invalidate(), but also for the arch call, memcg uncharging, etc. > Slots caching is a swapfile optimization that should be internal to > swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND Do you mean moving all swap slots free to bypass the swap slot cache, even it is not from zswap? That might have unwanted side effects. The swap slot cache is not just for swap files on disk. The batching has the effect that on average lower cost of freeing per entry. > not in the swap cache), all the hooks should be called (memcg, zswap, > arch, ..) as the swap entry is effectively freed. The fact that > swapfile code internally batches and caches slots should be > transparent to other parts of MM. I am not sure if the calls can just > be moved or if there are underlying assumptions in the implementation > that would be broken, but it feels like the right thing to do. There is also the behavior that if the page gets swapped in but hasn't changed, when swap out again, it is possible to avoid writing the page again to the disk. For disk there is no overhead keeping the old date on the disk not to touch it. For zpool it might have memory overhead holding the compressed pool. The trade off might be different. Chris
On Thu, Nov 16, 2023 at 12:12 PM Chris Li <chrisl@kernel.org> wrote: > > Hi Yosry, > > On Tue, Nov 14, 2023 at 9:16 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > 1)The swap entry has been freed, but cached in swap_slots_cache, > > > no swap cache and swapcount=0. > > > 2)When the option zswap_exclusive_loads_enabled disabled and > > > zswap_load completed(page in swap_cache and swapcount = 0). > > > > For case (1), I think a cleaner solution would be to move the > > zswap_invalidate() call from swap_range_free() (which is called after > > the cached slots are freed) to __swap_entry_free_locked() if the usage > > goes to 0. I actually think conceptually this makes not just for > > zswap_invalidate(), but also for the arch call, memcg uncharging, etc. > > Slots caching is a swapfile optimization that should be internal to > > swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND > > Do you mean moving all swap slots free to bypass the swap slot cache, even it > is not from zswap? That might have unwanted side effects. The swap > slot cache is not just for swap files on disk. The batching has the > effect that on average lower cost of freeing per entry. Not bypassing the swap slot cache, just make the callbacks to invalidate the zswap entry, do memg uncharging, etc when the slot is no longer used and is entering the swap slot cache (i.e. when free_swap_slot() is called), instead of when draining the swap slot cache (i.e. when swap_range_free() is called). For all parts of MM outside of swap, the swap entry is freed when free_swap_slot() is called. We don't free it immediately because of caching, but this should be transparent to other parts of MM (e.g. zswap, memcg, etc). > > > not in the swap cache), all the hooks should be called (memcg, zswap, > > arch, ..) as the swap entry is effectively freed. The fact that > > swapfile code internally batches and caches slots should be > > transparent to other parts of MM. I am not sure if the calls can just > > be moved or if there are underlying assumptions in the implementation > > that would be broken, but it feels like the right thing to do. > > There is also the behavior that if the page gets swapped in but hasn't > changed, when swap out again, it is possible to avoid writing the > page again to the disk. For disk there is no overhead keeping the old > date on the disk not to touch it. For zpool it might have memory > overhead holding the compressed pool. The trade off might be > different. > > Chris
On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > Not bypassing the swap slot cache, just make the callbacks to > invalidate the zswap entry, do memg uncharging, etc when the slot is > no longer used and is entering the swap slot cache (i.e. when > free_swap_slot() is called), instead of when draining the swap slot > cache (i.e. when swap_range_free() is called). For all parts of MM > outside of swap, the swap entry is freed when free_swap_slot() is > called. We don't free it immediately because of caching, but this > should be transparent to other parts of MM (e.g. zswap, memcg, etc). That will cancel the batching effect on the swap slot free, making the common case for swapping faults take longer to complete, righ? If I recall correctly, the uncharge is the expensive part of the swap slot free operation. I just want to figure out what we are trading off against. This is not one side wins all situations. Chris
On Thu, Nov 16, 2023 at 12:30 PM Chris Li <chriscli@google.com> wrote: > > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > Not bypassing the swap slot cache, just make the callbacks to > > invalidate the zswap entry, do memg uncharging, etc when the slot is > > no longer used and is entering the swap slot cache (i.e. when > > free_swap_slot() is called), instead of when draining the swap slot > > cache (i.e. when swap_range_free() is called). For all parts of MM > > outside of swap, the swap entry is freed when free_swap_slot() is > > called. We don't free it immediately because of caching, but this > > should be transparent to other parts of MM (e.g. zswap, memcg, etc). > > That will cancel the batching effect on the swap slot free, making the > common case for swapping faults take longer to complete, righ? > If I recall correctly, the uncharge is the expensive part of the swap > slot free operation. > I just want to figure out what we are trading off against. This is not > one side wins all situations. Interesting. Maybe we can just move the zswap_invalidate() call to save memory early, and leave the memcg uncharge call to be batched? IIUC we already do some version of this internally at Google. > > > Chris
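A minimal sketch of that split, assuming free_swap_slot() in mm/swap_slots.c as the hand-off point (names taken from the kernel sources discussed in this thread; the elided body is the existing slot-cache logic): the compressed copy is dropped as soon as the slot is handed back, while the memcg uncharge keeps riding the swapcache_free_entries() batch.

    /* mm/swap_slots.c -- sketch of "invalidate zswap early, uncharge in batch" */
    int free_swap_slot(swp_entry_t entry)
    {
        /*
         * The rest of MM considers the entry freed at this point, so
         * release the compressed object now instead of waiting for the
         * per-CPU slot cache to be drained via swap_range_free().
         */
        zswap_invalidate(swp_type(entry), swp_offset(entry));

        /*
         * Existing body unchanged: the slot still enters the per-CPU
         * cache, and mem_cgroup_uncharge_swap() still runs batched when
         * swapcache_free_entries() drains that cache.
         */
        return 0;
    }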
> That will cancel the batching effect on the swap slot free, making the
> common case for swapping faults take longer to complete, right?
> If I recall correctly, the uncharge is the expensive part of the swap
> slot free operation.
> I just want to figure out what we are trading off against. This is not
> one side wins all situations.

Hi Chris, thanks for your feedback. I have the same concerns; maybe we should just move the zswap_invalidate() out of batches, as Yosry mentioned above.
On Thu, Nov 16, 2023 at 12:46 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Nov 16, 2023 at 12:30 PM Chris Li <chriscli@google.com> wrote: > > > > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > Not bypassing the swap slot cache, just make the callbacks to > > > invalidate the zswap entry, do memg uncharging, etc when the slot is > > > no longer used and is entering the swap slot cache (i.e. when > > > free_swap_slot() is called), instead of when draining the swap slot > > > cache (i.e. when swap_range_free() is called). For all parts of MM > > > outside of swap, the swap entry is freed when free_swap_slot() is > > > called. We don't free it immediately because of caching, but this > > > should be transparent to other parts of MM (e.g. zswap, memcg, etc). > > > > That will cancel the batching effect on the swap slot free, making the > > common case for swapping faults take longer to complete, righ? > > If I recall correctly, the uncharge is the expensive part of the swap > > slot free operation. > > I just want to figure out what we are trading off against. This is not > > one side wins all situations. > > Interesting. Maybe we can just move the zswap_invalidate() call to > save memory early, and leave the memcg uncharge call to be batched? > IIUC we already do some version of this internally at Google. I would like to see the patch so I can reason about it better. Do you have the link for the patch you are talking about? I can also look up the Google internal one if you have a commit hash. One thing I would like to find out is whether the behavior of reusing swap slots without page writing has been changed. e.g. Previously if the swap slot can be page out again without writing/compression again, if the page is not dirty. If we change that behavior, I would like to understand the trade off better. Chris
On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > Hi Chris, thanks for your feedback. I have the same concerns, > maybe we should just move the zswap_invalidate() out of batches, > as Yosry mentioned above. As I replied in the previous email, I just want to understand the other side effects of the change better. To me, this patching is actually freeing the memory that does not require actual page IO write from zswap. Which means the memory is from some kind of cache. It would be interesting if we can not complicate the write back path further. Instead, we can drop those memories from the different cache if needed. I assume those caches are doing something useful in the common case. If not, we should have a patch to remove these caches instead. Not sure how big a mess it will be to implement separate the write and drop caches. While you are here, I have some questions for you. Can you help me understand how much memory you can free from this patch? For example, are we talking about a few pages or a few GB? Where does the freed memory come from? If the memory comes from zswap entry struct. Due to the slab allocator fragmentation. It would take a lot of zswap entries to have meaningful memory reclaimed from the slab allocator. If the memory comes from the swap cached pages, that would be much more meaningful. But that is not what this patch is doing, right? Chris
Hi Chris, thanks for your time. > > On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He > <hezhongkun.hzk@bytedance.com> wrote: > > Hi Chris, thanks for your feedback. I have the same concerns, > > maybe we should just move the zswap_invalidate() out of batches, > > as Yosry mentioned above. > > As I replied in the previous email, I just want to understand the > other side effects of the change better. > > To me, this patching is actually freeing the memory that does not > require actual page IO write from zswap. Which means the memory is > from some kind of cache. It would be interesting if we can not > complicate the write back path further. Instead, we can drop those > memories from the different cache if needed. I assume those caches are > doing something useful in the common case. If not, we should have a > patch to remove these caches instead. Not sure how big a mess it will > be to implement separate the write and drop caches. > > While you are here, I have some questions for you. > > Can you help me understand how much memory you can free from this > patch? For example, are we talking about a few pages or a few GB? > > Where does the freed memory come from? > If the memory comes from zswap entry struct. Due to the slab allocator > fragmentation. It would take a lot of zswap entries to have meaningful > memory reclaimed from the slab allocator. > > If the memory comes from the swap cached pages, that would be much > more meaningful. But that is not what this patch is doing, right? > > Chris It's my bad for putting two cases together. The memory released in both cases comes from zswap entry struct and zswap compressed page. The original intention of this patch is to solve the problem that shrink_work() fails to reclaim memory in two situations. For case (1), the zswap_writeback_entry() will failed for the __read_swap_cache_async return NULL because the swap has been freed but cached in swap_slots_cache, so the memory come from the zswap entry struct and compressed page. Count = SWAP_BATCH * ncpu. Solution: move the zswap_invalidate() out of batches, free it once the swap count equal to 0. For case (2), the zswap_writeback_entry() will failed for !page_was_allocated because zswap_load will have two copies of the same page in memory (compressed and uncompressed) after faulting in a page from zswap when zswap_exclusive_loads disabled. The amount of memory is greater but depends on the usage. Why do we need to release them? Consider this scenario,there is a lot of data cached in memory and zswap, hit the limit,and shrink_worker will fail. The new coming data will be written directly to swap due to zswap_store failure. Should we free the last one to store the latest one in zswap. According to the previous discussion, the writeback is inevitable. So I want to make zswap_exclusive_loads_enabled the default behavior or make it the only way to do zswap loads. It only makes sense when the page is read and no longer dirty. If the page is read frequently, it should stay in cache rather than zswap. The benefit of doing this is very small, i.e. two copies of the same page in memory. Thanks again.
On Fri, Nov 17, 2023 at 8:46 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > Hi Chris, thanks for your time. > > > > > On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He > > <hezhongkun.hzk@bytedance.com> wrote: > > > Hi Chris, thanks for your feedback. I have the same concerns, > > > maybe we should just move the zswap_invalidate() out of batches, > > > as Yosry mentioned above. > > > > As I replied in the previous email, I just want to understand the > > other side effects of the change better. > > > > To me, this patching is actually freeing the memory that does not > > require actual page IO write from zswap. Which means the memory is > > from some kind of cache. It would be interesting if we can not > > complicate the write back path further. Instead, we can drop those > > memories from the different cache if needed. I assume those caches are > > doing something useful in the common case. If not, we should have a > > patch to remove these caches instead. Not sure how big a mess it will > > be to implement separate the write and drop caches. > > > > While you are here, I have some questions for you. > > > > Can you help me understand how much memory you can free from this > > patch? For example, are we talking about a few pages or a few GB? > > > > Where does the freed memory come from? > > If the memory comes from zswap entry struct. Due to the slab allocator > > fragmentation. It would take a lot of zswap entries to have meaningful > > memory reclaimed from the slab allocator. > > > > If the memory comes from the swap cached pages, that would be much > > more meaningful. But that is not what this patch is doing, right? > > > > Chris > > It's my bad for putting two cases together. The memory released in both > cases comes from zswap entry struct and zswap compressed page. > > The original intention of this patch is to solve the problem that > shrink_work() fails to reclaim memory in two situations. > > For case (1), the zswap_writeback_entry() will failed for the > __read_swap_cache_async return NULL because the swap has been > freed but cached in swap_slots_cache, so the memory come from > the zswap entry struct and compressed page. > Count = SWAP_BATCH * ncpu. > Solution: move the zswap_invalidate() out of batches, free it once the swap > count equal to 0. > > For case (2), the zswap_writeback_entry() will failed for !page_was_allocated > because zswap_load will have two copies of the same page in memory > (compressed and uncompressed) after faulting in a page from zswap when > zswap_exclusive_loads disabled. The amount of memory is greater but depends > on the usage. > > Why do we need to release them? > Consider this scenario,there is a lot of data cached in memory and zswap, > hit the limit,and shrink_worker will fail. The new coming data will be written > directly to swap due to zswap_store failure. Should we free the last one > to store the latest one in zswap. Shameless plug: zswap will much less likely hit the limit (global or cgroup) with the shrinker enabled ;) It will proactively reclaim the objects way ahead of the limit. It comes with its own can of worms, of course - it's unlikely to work for all workloads in its current form, but perhaps worth experimenting with/improved upon? > > According to the previous discussion, the writeback is inevitable. > So I want to make zswap_exclusive_loads_enabled the default behavior > or make it the only way to do zswap loads. It only makes sense when > the page is read and no longer dirty. 
If the page is read frequently, it > should stay in cache rather than zswap. The benefit of doing this is > very small, i.e. two copies of the same page in memory. > > Thanks again.
Hi Zhongkun, On Fri, Nov 17, 2023 at 5:46 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > Can you help me understand how much memory you can free from this > > patch? For example, are we talking about a few pages or a few GB? > > > > Where does the freed memory come from? > > If the memory comes from zswap entry struct. Due to the slab allocator > > fragmentation. It would take a lot of zswap entries to have meaningful > > memory reclaimed from the slab allocator. > > > > If the memory comes from the swap cached pages, that would be much > > more meaningful. But that is not what this patch is doing, right? > > > > Chris > > It's my bad for putting two cases together. The memory released in both > cases comes from zswap entry struct and zswap compressed page. Thanks for the clarification. Keep in mind that memory freeing from and zswap entry and zpool does not directly translate into page free. If the page has other none freed zswap entry or zsmalloc usage, those pages will not be free to the system. That is the fragmentation cost I was talking about. With this consideration, do you know many extra pages it can release back to the system by this patch in your usage case? If the difference is very small, it might not be worth the extra complexity to release those. > The original intention of this patch is to solve the problem that > shrink_work() fails to reclaim memory in two situations. > > For case (1), the zswap_writeback_entry() will failed for the > __read_swap_cache_async return NULL because the swap has been > freed but cached in swap_slots_cache, so the memory come from > the zswap entry struct and compressed page. In those cases, if we drop the swap_slots_cache, it will also free those zswap entries and compressed pages (zpool), right? > Count = SWAP_BATCH * ncpu. That is the upper limit. Not all CPUs have swap batches fully loaded. > Solution: move the zswap_invalidate() out of batches, free it once the swap > count equal to 0. Per previous discussion, this will have an impact on the swap_slot_cache behavior. We need some data points for cost benefit analysis. > For case (2), the zswap_writeback_entry() will failed for !page_was_allocated > because zswap_load will have two copies of the same page in memory > (compressed and uncompressed) after faulting in a page from zswap when > zswap_exclusive_loads disabled. The amount of memory is greater but depends > on the usage. That is basically disable the future swap out page IO write optimization that skip the write if the page hasn't changed. If the system are low on memory, that is justifiable. Again, it seems we can have a pass to drop the compressed memory if the swap count is zero (and mark page dirty). > > Why do we need to release them? > Consider this scenario,there is a lot of data cached in memory and zswap, > hit the limit,and shrink_worker will fail. The new coming data will be written Yes, the shrink_worker will need to allocate a page to store uncompressed data for write back. > directly to swap due to zswap_store failure. Should we free the last one > to store the latest one in zswap. The "last one" you mean the most recent zswap entry written into zswap? Well, you need to allocate a page to write it out, that is an async process. Shrink the zpool now is kind of too late already. > According to the previous discussion, the writeback is inevitable. > So I want to make zswap_exclusive_loads_enabled the default behavior > or make it the only way to do zswap loads. 
It only makes sense when We need some data point for how often we swap it out to zswap again, where the zswap out write can be saved by using the existing compressed data. It is totally possible this page IO write out optimization is not worthwhile for zswap. We need some data to support that claim. > the page is read and no longer dirty. If the page is read frequently, it > should stay in cache rather than zswap. The benefit of doing this is > very small, i.e. two copies of the same page in memory. If the benefit of doing this is very small, that seems to be the argument against this patch? Again we need some data points for cost and benefit analysis. Chris
On Sat, Nov 18, 2023 at 10:44 AM Nhat Pham <nphamcs@gmail.com> wrote: > > Why do we need to release them? > > Consider this scenario,there is a lot of data cached in memory and zswap, > > hit the limit,and shrink_worker will fail. The new coming data will be written > > directly to swap due to zswap_store failure. Should we free the last one > > to store the latest one in zswap. > > Shameless plug: zswap will much less likely hit the limit (global or > cgroup) with the shrinker enabled ;) It will proactively reclaim the > objects way ahead of the limit. I think that is actually the proper path, by the time it hits the limit of zpool. That is already too late to shrink zpool to make room. The shrinker should have guaranteed the amount of pages for write back purposes to make forward progress. > It comes with its own can of worms, of course - it's unlikely to work > for all workloads in its current form, but perhaps worth experimenting > with/improved upon? Agree. Chris
> Shameless plug: zswap will much less likely hit the limit (global or
> cgroup) with the shrinker enabled ;) It will proactively reclaim the
> objects way ahead of the limit.

Hi Nhat, glad to hear from you.

Back to the beginning, the original intention of this patch is to solve the problem that shrink_work() fails to reclaim memory in two situations.

zswap_writeback_entry() will fail with !page_was_allocated because zswap_load() leaves two copies of the same page in memory (compressed and uncompressed) after faulting in a page from zswap when zswap_exclusive_loads is disabled.

A simple test:
1) Turn off zswap_exclusive_loads_enabled.
2) Run a read-only program and allocate more memory than the limit, so the limit will be reached and the shrinker will fail.

> It comes with its own can of worms, of course - it's unlikely to work
> for all workloads in its current form, but perhaps worth experimenting
> with/improved upon?
> > Thanks for the clarification. Keep in mind that memory freeing from > and zswap entry and zpool does not directly translate into page free. > If the page has other none freed zswap entry or zsmalloc usage, those > pages will not be free to the system. That is the fragmentation cost I > was talking about. Yes, it may need to be compacted. > > With this consideration, do you know many extra pages it can release > back to the system by this patch in your usage case? If the difference > is very small, it might not be worth the extra complexity to release > those. > The original intention of this patch is to make shrink work properly, not to release cache and related memory. > > The original intention of this patch is to solve the problem that > > shrink_work() fails to reclaim memory in two situations. > > > > For case (1), the zswap_writeback_entry() will failed for the > > __read_swap_cache_async return NULL because the swap has been > > freed but cached in swap_slots_cache, so the memory come from > > the zswap entry struct and compressed page. > In those cases, if we drop the swap_slots_cache, it will also free > those zswap entries and compressed pages (zpool), right? > > > Count = SWAP_BATCH * ncpu. > > That is the upper limit. Not all CPUs have swap batches fully loaded. Yes. > > > Solution: move the zswap_invalidate() out of batches, free it once the swap > > count equal to 0. > Per previous discussion, this will have an impact on the > swap_slot_cache behavior. > We need some data points for cost benefit analysis. > > > For case (2), the zswap_writeback_entry() will failed for !page_was_allocated > > because zswap_load will have two copies of the same page in memory > > (compressed and uncompressed) after faulting in a page from zswap when > > zswap_exclusive_loads disabled. The amount of memory is greater but depends > > on the usage. > > That is basically disable the future swap out page IO write > optimization that skip the write if the page hasn't changed. If the > system are low on memory, that is justifiable. Again, it seems we can > have a pass to drop the compressed memory if the swap count is zero > (and mark page dirty). > OK. > > > > Why do we need to release them? > > Consider this scenario,there is a lot of data cached in memory and zswap, > > hit the limit,and shrink_worker will fail. The new coming data will be written > > Yes, the shrink_worker will need to allocate a page to store > uncompressed data for write back. > > > directly to swap due to zswap_store failure. Should we free the last one > > to store the latest one in zswap. > > The "last one" you mean the most recent zswap entry written into zswap? > Well, you need to allocate a page to write it out, that is an async process. > Shrink the zpool now is kind of too late already. > The last zswap_entry in zswap_pool->lru. > > According to the previous discussion, the writeback is inevitable. > > So I want to make zswap_exclusive_loads_enabled the default behavior > > or make it the only way to do zswap loads. It only makes sense when > > We need some data point for how often we swap it out to zswap again, > where the zswap out write can be saved by using the existing compressed data. > > It is totally possible this page IO write out optimization is not > worthwhile for zswap. > We need some data to support that claim. > Got it. I will find it. > > the page is read and no longer dirty. If the page is read frequently, it > > should stay in cache rather than zswap. 
The benefit of doing this is > > very small, i.e. two copies of the same page in memory. > > If the benefit of doing this is very small, that seems to be the > argument against this patch? > Again we need some data points for cost and benefit analysis. > Yes, that is the new idea: make zswap_exclusive_loads_enabled the default behavior, or make it the only way to do zswap loads. > Chris
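(For reference, the gist of the exclusive-loads option under discussion is roughly the following; this is a simplified sketch of the tail of the zswap load path, not verbatim kernel code. With the option on, the compressed copy is dropped right after a successful load and the folio is marked dirty, so only one copy of the page stays in memory; the trade-off is that an unchanged page has to be compressed and written out again if it is reclaimed later.)

	/* simplified sketch of the exclusive-load case in zswap_load() */
	if (zswap_exclusive_loads_enabled) {
		zswap_invalidate_entry(tree, entry);	/* free the compressed copy */
		folio_mark_dirty(folio);		/* page must be written again on reclaim */
	}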
Chris Li <chriscli@google.com> writes: > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> Not bypassing the swap slot cache, just make the callbacks to >> invalidate the zswap entry, do memg uncharging, etc when the slot is >> no longer used and is entering the swap slot cache (i.e. when >> free_swap_slot() is called), instead of when draining the swap slot >> cache (i.e. when swap_range_free() is called). For all parts of MM >> outside of swap, the swap entry is freed when free_swap_slot() is >> called. We don't free it immediately because of caching, but this >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). > > That will cancel the batching effect on the swap slot free, making the > common case for swapping faults take longer to complete, righ? > If I recall correctly, the uncharge is the expensive part of the swap > slot free operation. > I just want to figure out what we are trading off against. This is not > one side wins all situations. Per my understanding, we don't batch memcg uncharging in swap_entry_free() now. Although it's possible and may improve performance. -- Best Regards, Huang, Ying
On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chriscli@google.com> writes: > > > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > That will cancel the batching effect on the swap slot free, making the > > common case for swapping faults take longer to complete, righ? > > If I recall correctly, the uncharge is the expensive part of the swap > > slot free operation. > > I just want to figure out what we are trading off against. This is not > > one side wins all situations. > > Per my understanding, we don't batch memcg uncharging in > swap_entry_free() now. Although it's possible and may improve > performance. swap_entry_free() does not do batching; the batching is at the caller level. I just checked: free_swap_slot() is still using the swap slot cache and batching. It uses swapcache_free_entries() to batch-free the swap slots. That is where the uncharge happens, per my understanding. Chris
Chris Li <chrisl@kernel.org> writes: > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chriscli@google.com> writes: >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: >> > That will cancel the batching effect on the swap slot free, making the >> > common case for swapping faults take longer to complete, righ? >> > If I recall correctly, the uncharge is the expensive part of the swap >> > slot free operation. >> > I just want to figure out what we are trading off against. This is not >> > one side wins all situations. >> >> Per my understanding, we don't batch memcg uncharging in >> swap_entry_free() now. Although it's possible and may improve >> performance. > > swap_entry_free() does not do batching, it is at the caller level. > > I just checked. The batching is done in free_swap_slot() is still > using swap slot cache and batching. > It uses swapcache_free_entries() to batch free the swap_slots. That is > where the uncharge happens per my understanding.

Per my understanding, memcg uncharging happens in

  swapcache_free_entries()
    swap_entry_free()
      mem_cgroup_uncharge_swap()

The swap entries are uncharged one-by-one, not

  acquire lock in memcg
  uncharge all entries
  release lock in memcg

-- Best Regards, Huang, Ying
On Sun, Nov 19, 2023 at 9:41 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: > > Per my understanding, memcg uncharging happens in > > swapcache_free_entries() > swap_entry_free() > mem_cgroup_uncharge_swap() > > The swap entries are uncharged one-by-one, not Yes. That matches my understanding as well. I think I am using the term "batching" very loosely. My bad, and thanks for the clarification. I am referring to the fact that, in most cases, free_swap_slot() does not perform the uncharge itself; the entry is grouped together with other entries and uncharged later inside swapcache_free_entries(). Yes, the uncharge itself is done in a for loop, one entry at a time; there is no batching in the for loop. BTW, not all uncharges can be batched, because they can come from different memcgs. Chris
On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chriscli@google.com> writes: > > > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > >> > >> Not bypassing the swap slot cache, just make the callbacks to > >> invalidate the zswap entry, do memg uncharging, etc when the slot is > >> no longer used and is entering the swap slot cache (i.e. when > >> free_swap_slot() is called), instead of when draining the swap slot > >> cache (i.e. when swap_range_free() is called). For all parts of MM > >> outside of swap, the swap entry is freed when free_swap_slot() is > >> called. We don't free it immediately because of caching, but this > >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). > > > > That will cancel the batching effect on the swap slot free, making the > > common case for swapping faults take longer to complete, righ? > > If I recall correctly, the uncharge is the expensive part of the swap > > slot free operation. > > I just want to figure out what we are trading off against. This is not > > one side wins all situations. > > Per my understanding, we don't batch memcg uncharging in > swap_entry_free() now. Although it's possible and may improve > performance. Yes. It actually causes a long tail in swapin fault latency as Chris discovered in our prod. I am wondering if doing the memcg uncharging outside the slots cache will actually amortize the cost instead. Regardless of memcg charging, which is more complicated, I think we should at least move the call to zswap_invalidate() before the slots cache. I would prefer that we move everything non-swapfile specific outside the slots cache layer (zswap_invalidate(), arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), mem_cgroup_uncharge_swap(), ..). However, if some of those are controversial, we can move some of them for now. When draining free swap slots from the cache, swap_range_free() is called with nr_entries == 1 anyway, so I can't see how any batching is going on. If anything it should help amortize the cost. > > -- > Best Regards, > Huang, Ying
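(For illustration, the direction described above could look roughly like the sketch below: one helper, called from free_swap_slot(), that runs the non-swapfile-specific teardown as soon as the entry is logically freed, leaving only the slot bookkeeping to the batched swap_range_free() path. The helper name is hypothetical and the locking questions raised later in the thread are ignored here.)

	/* Hypothetical sketch, not an existing kernel function. */
	static void swap_entry_teardown(swp_entry_t entry)
	{
		int type = swp_type(entry);
		pgoff_t offset = swp_offset(entry);

		zswap_invalidate(type, offset);			/* drop the compressed copy */
		arch_swap_invalidate_page(type, offset);	/* e.g. arm64 MTE tags */
		clear_shadow_from_swap_cache(type, offset, offset);
		mem_cgroup_uncharge_swap(entry, 1);
	}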
Yosry Ahmed <yosryahmed@google.com> writes: > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chriscli@google.com> writes: >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> >> >> Not bypassing the swap slot cache, just make the callbacks to >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is >> >> no longer used and is entering the swap slot cache (i.e. when >> >> free_swap_slot() is called), instead of when draining the swap slot >> >> cache (i.e. when swap_range_free() is called). For all parts of MM >> >> outside of swap, the swap entry is freed when free_swap_slot() is >> >> called. We don't free it immediately because of caching, but this >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). >> > >> > That will cancel the batching effect on the swap slot free, making the >> > common case for swapping faults take longer to complete, righ? >> > If I recall correctly, the uncharge is the expensive part of the swap >> > slot free operation. >> > I just want to figure out what we are trading off against. This is not >> > one side wins all situations. >> >> Per my understanding, we don't batch memcg uncharging in >> swap_entry_free() now. Although it's possible and may improve >> performance. > > Yes. It actually causes a long tail in swapin fault latency as Chris > discovered in our prod. I am wondering if doing the memcg uncharging > outside the slots cache will actually amortize the cost instead. > > Regardless of memcg charging, which is more complicated, I think we > should at least move the call to zswap_invalidate() before the slots > cache. I would prefer that we move everything non-swapfile specific > outside the slots cache layer (zswap_invalidate(), > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), > mem_cgroup_uncharge_swap(), ..). However, if some of those are > controversial, we can move some of them for now. That makes sense for me. > When draining free swap slots from the cache, swap_range_free() is > called with nr_entries == 1 anyway, so I can't see how any batching is > going on. If anything it should help amortize the cost. In swapcache_free_entries(), the sis->lock will be held to free multiple swap slots via swap_info_get_cont() if possible. This can reduce sis->lock contention. -- Best Regards, Huang, Ying
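(For reference, the batching being described is roughly the following, from mm/swapfile.c of that era with the original comment shortened: swap_info_get_cont() only drops and re-takes the device lock when the next entry belongs to a different swap device, so a whole per-CPU batch of slots is freed under a single sis->lock acquisition.)

	void swapcache_free_entries(swp_entry_t *entries, int n)
	{
		struct swap_info_struct *p, *prev;
		int i;

		if (n <= 0)
			return;

		prev = NULL;
		p = NULL;

		/* Sort by swap device so each device lock is taken only once. */
		if (nr_swapfiles > 1)
			sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
		for (i = 0; i < n; ++i) {
			p = swap_info_get_cont(entries[i], prev);
			if (p)
				swap_entry_free(p, entries[i]);
			prev = p;
		}
		if (p)
			spin_unlock(&p->lock);
	}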
On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yosry Ahmed <yosryahmed@google.com> writes: > > > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chriscli@google.com> writes: > >> > >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > >> >> > >> >> Not bypassing the swap slot cache, just make the callbacks to > >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is > >> >> no longer used and is entering the swap slot cache (i.e. when > >> >> free_swap_slot() is called), instead of when draining the swap slot > >> >> cache (i.e. when swap_range_free() is called). For all parts of MM > >> >> outside of swap, the swap entry is freed when free_swap_slot() is > >> >> called. We don't free it immediately because of caching, but this > >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). > >> > > >> > That will cancel the batching effect on the swap slot free, making the > >> > common case for swapping faults take longer to complete, righ? > >> > If I recall correctly, the uncharge is the expensive part of the swap > >> > slot free operation. > >> > I just want to figure out what we are trading off against. This is not > >> > one side wins all situations. > >> > >> Per my understanding, we don't batch memcg uncharging in > >> swap_entry_free() now. Although it's possible and may improve > >> performance. > > > > Yes. It actually causes a long tail in swapin fault latency as Chris > > discovered in our prod. I am wondering if doing the memcg uncharging > > outside the slots cache will actually amortize the cost instead. > > > > Regardless of memcg charging, which is more complicated, I think we > > should at least move the call to zswap_invalidate() before the slots > > cache. I would prefer that we move everything non-swapfile specific > > outside the slots cache layer (zswap_invalidate(), > > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), > > mem_cgroup_uncharge_swap(), ..). However, if some of those are > > controversial, we can move some of them for now. > > That makes sense for me. > > > When draining free swap slots from the cache, swap_range_free() is > > called with nr_entries == 1 anyway, so I can't see how any batching is > > going on. If anything it should help amortize the cost. > > In swapcache_free_entries(), the sis->lock will be held to free multiple > swap slots via swap_info_get_cont() if possible. This can reduce > sis->lock contention. Ah yes that's a good point. Since most of these callbacks don't actually access sis, but use the swap entry value itself, I am guessing the reason we need to hold the lock for all these callbacks is to prevent swapoff and swapon reusing the same swap entry on a different swap device, right?
Yosry Ahmed <yosryahmed@google.com> writes: > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yosry Ahmed <yosryahmed@google.com> writes: >> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chriscli@google.com> writes: >> >> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> >> >> >> >> Not bypassing the swap slot cache, just make the callbacks to >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is >> >> >> no longer used and is entering the swap slot cache (i.e. when >> >> >> free_swap_slot() is called), instead of when draining the swap slot >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is >> >> >> called. We don't free it immediately because of caching, but this >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). >> >> > >> >> > That will cancel the batching effect on the swap slot free, making the >> >> > common case for swapping faults take longer to complete, righ? >> >> > If I recall correctly, the uncharge is the expensive part of the swap >> >> > slot free operation. >> >> > I just want to figure out what we are trading off against. This is not >> >> > one side wins all situations. >> >> >> >> Per my understanding, we don't batch memcg uncharging in >> >> swap_entry_free() now. Although it's possible and may improve >> >> performance. >> > >> > Yes. It actually causes a long tail in swapin fault latency as Chris >> > discovered in our prod. I am wondering if doing the memcg uncharging >> > outside the slots cache will actually amortize the cost instead. >> > >> > Regardless of memcg charging, which is more complicated, I think we >> > should at least move the call to zswap_invalidate() before the slots >> > cache. I would prefer that we move everything non-swapfile specific >> > outside the slots cache layer (zswap_invalidate(), >> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), >> > mem_cgroup_uncharge_swap(), ..). However, if some of those are >> > controversial, we can move some of them for now. >> >> That makes sense for me. >> >> > When draining free swap slots from the cache, swap_range_free() is >> > called with nr_entries == 1 anyway, so I can't see how any batching is >> > going on. If anything it should help amortize the cost. >> >> In swapcache_free_entries(), the sis->lock will be held to free multiple >> swap slots via swap_info_get_cont() if possible. This can reduce >> sis->lock contention. > > Ah yes that's a good point. Since most of these callbacks don't > actually access sis, but use the swap entry value itself, I am > guessing the reason we need to hold the lock for all these callbacks > is to prevent swapoff and swapon reusing the same swap entry on a > different swap device, right? In, swapcache_free_entries() swap_entry_free() swap_range_free() Quite some sis fields will be accessed. -- Best Regards, Huang, Ying
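(For reference, a trimmed sketch of swap_range_free() from the same era, with the available-list and block-device notification details omitted: the bookkeeping at the top is what touches the sis fields, while the per-offset callbacks in the loop, plus clear_shadow_from_swap_cache(), only consume (type, offset) values.)

	static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
				    unsigned int nr_entries)
	{
		unsigned long begin = offset;
		unsigned long end = offset + nr_entries - 1;

		/* sis bookkeeping: needs si and its lock */
		if (offset < si->lowest_bit)
			si->lowest_bit = offset;
		if (end > si->highest_bit)
			WRITE_ONCE(si->highest_bit, end);
		atomic_long_add(nr_entries, &nr_swap_pages);
		WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);

		/* per-entry callbacks: only use the (type, offset) pair */
		while (offset <= end) {
			arch_swap_invalidate_page(si->type, offset);
			zswap_invalidate(si->type, offset);
			offset++;
		}
		clear_shadow_from_swap_cache(si->type, begin, end);
	}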
On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yosry Ahmed <yosryahmed@google.com> writes: > > > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yosry Ahmed <yosryahmed@google.com> writes: > >> > >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Chris Li <chriscli@google.com> writes: > >> >> > >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > >> >> >> > >> >> >> Not bypassing the swap slot cache, just make the callbacks to > >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is > >> >> >> no longer used and is entering the swap slot cache (i.e. when > >> >> >> free_swap_slot() is called), instead of when draining the swap slot > >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM > >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is > >> >> >> called. We don't free it immediately because of caching, but this > >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). > >> >> > > >> >> > That will cancel the batching effect on the swap slot free, making the > >> >> > common case for swapping faults take longer to complete, righ? > >> >> > If I recall correctly, the uncharge is the expensive part of the swap > >> >> > slot free operation. > >> >> > I just want to figure out what we are trading off against. This is not > >> >> > one side wins all situations. > >> >> > >> >> Per my understanding, we don't batch memcg uncharging in > >> >> swap_entry_free() now. Although it's possible and may improve > >> >> performance. > >> > > >> > Yes. It actually causes a long tail in swapin fault latency as Chris > >> > discovered in our prod. I am wondering if doing the memcg uncharging > >> > outside the slots cache will actually amortize the cost instead. > >> > > >> > Regardless of memcg charging, which is more complicated, I think we > >> > should at least move the call to zswap_invalidate() before the slots > >> > cache. I would prefer that we move everything non-swapfile specific > >> > outside the slots cache layer (zswap_invalidate(), > >> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), > >> > mem_cgroup_uncharge_swap(), ..). However, if some of those are > >> > controversial, we can move some of them for now. > >> > >> That makes sense for me. > >> > >> > When draining free swap slots from the cache, swap_range_free() is > >> > called with nr_entries == 1 anyway, so I can't see how any batching is > >> > going on. If anything it should help amortize the cost. > >> > >> In swapcache_free_entries(), the sis->lock will be held to free multiple > >> swap slots via swap_info_get_cont() if possible. This can reduce > >> sis->lock contention. > > > > Ah yes that's a good point. Since most of these callbacks don't > > actually access sis, but use the swap entry value itself, I am > > guessing the reason we need to hold the lock for all these callbacks > > is to prevent swapoff and swapon reusing the same swap entry on a > > different swap device, right? > > In, > > swapcache_free_entries() > swap_entry_free() > swap_range_free() > > Quite some sis fields will be accessed. I wasn't referring to this code. I was what's preventing us from moving the callbacks I mentioned outside the lock (zswap_invalidate(), arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), mem_cgroup_uncharge_swap(), ..). 
I think most or all of them don't really access sis, but perhaps they need the lock to ensure the swap entry value does not get reused? > > -- > Best Regards, > Huang, Ying
Yosry Ahmed <yosryahmed@google.com> writes: > On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yosry Ahmed <yosryahmed@google.com> writes: >> >> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes: >> >> >> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Chris Li <chriscli@google.com> writes: >> >> >> >> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> >> >> >> >> >> >> Not bypassing the swap slot cache, just make the callbacks to >> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is >> >> >> >> no longer used and is entering the swap slot cache (i.e. when >> >> >> >> free_swap_slot() is called), instead of when draining the swap slot >> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM >> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is >> >> >> >> called. We don't free it immediately because of caching, but this >> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). >> >> >> > >> >> >> > That will cancel the batching effect on the swap slot free, making the >> >> >> > common case for swapping faults take longer to complete, righ? >> >> >> > If I recall correctly, the uncharge is the expensive part of the swap >> >> >> > slot free operation. >> >> >> > I just want to figure out what we are trading off against. This is not >> >> >> > one side wins all situations. >> >> >> >> >> >> Per my understanding, we don't batch memcg uncharging in >> >> >> swap_entry_free() now. Although it's possible and may improve >> >> >> performance. >> >> > >> >> > Yes. It actually causes a long tail in swapin fault latency as Chris >> >> > discovered in our prod. I am wondering if doing the memcg uncharging >> >> > outside the slots cache will actually amortize the cost instead. >> >> > >> >> > Regardless of memcg charging, which is more complicated, I think we >> >> > should at least move the call to zswap_invalidate() before the slots >> >> > cache. I would prefer that we move everything non-swapfile specific >> >> > outside the slots cache layer (zswap_invalidate(), >> >> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), >> >> > mem_cgroup_uncharge_swap(), ..). However, if some of those are >> >> > controversial, we can move some of them for now. >> >> >> >> That makes sense for me. >> >> >> >> > When draining free swap slots from the cache, swap_range_free() is >> >> > called with nr_entries == 1 anyway, so I can't see how any batching is >> >> > going on. If anything it should help amortize the cost. >> >> >> >> In swapcache_free_entries(), the sis->lock will be held to free multiple >> >> swap slots via swap_info_get_cont() if possible. This can reduce >> >> sis->lock contention. >> > >> > Ah yes that's a good point. Since most of these callbacks don't >> > actually access sis, but use the swap entry value itself, I am >> > guessing the reason we need to hold the lock for all these callbacks >> > is to prevent swapoff and swapon reusing the same swap entry on a >> > different swap device, right? >> >> In, >> >> swapcache_free_entries() >> swap_entry_free() >> swap_range_free() >> >> Quite some sis fields will be accessed. > > I wasn't referring to this code. 
I was what's preventing us > moving the callbacks I mentioned outside the lock (zswap_invalidate(), > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), > mem_cgroup_uncharge_swap(), ..). I think most or all of them don't > really access sis, but perhaps they need the lock to ensure the swap > entry value does not get reused? In fact, the swap entries to be freed by swapcache_free_entries() are in a state in which they cannot be freed by any other path (including swapoff()). Their swap_map value is SWAP_HAS_CACHE, but no folio can be found in swap_address_space(). To be honest, I don't know whether there are dependencies on sis->lock in these callbacks. You need to investigate them one by one. -- Best Regards, Huang, Ying
On Mon, Nov 20, 2023 at 7:35 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yosry Ahmed <yosryahmed@google.com> writes: > > > On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yosry Ahmed <yosryahmed@google.com> writes: > >> > >> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yosry Ahmed <yosryahmed@google.com> writes: > >> >> > >> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Chris Li <chriscli@google.com> writes: > >> >> >> > >> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote: > >> >> >> >> > >> >> >> >> Not bypassing the swap slot cache, just make the callbacks to > >> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is > >> >> >> >> no longer used and is entering the swap slot cache (i.e. when > >> >> >> >> free_swap_slot() is called), instead of when draining the swap slot > >> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM > >> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is > >> >> >> >> called. We don't free it immediately because of caching, but this > >> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc). > >> >> >> > > >> >> >> > That will cancel the batching effect on the swap slot free, making the > >> >> >> > common case for swapping faults take longer to complete, righ? > >> >> >> > If I recall correctly, the uncharge is the expensive part of the swap > >> >> >> > slot free operation. > >> >> >> > I just want to figure out what we are trading off against. This is not > >> >> >> > one side wins all situations. > >> >> >> > >> >> >> Per my understanding, we don't batch memcg uncharging in > >> >> >> swap_entry_free() now. Although it's possible and may improve > >> >> >> performance. > >> >> > > >> >> > Yes. It actually causes a long tail in swapin fault latency as Chris > >> >> > discovered in our prod. I am wondering if doing the memcg uncharging > >> >> > outside the slots cache will actually amortize the cost instead. > >> >> > > >> >> > Regardless of memcg charging, which is more complicated, I think we > >> >> > should at least move the call to zswap_invalidate() before the slots > >> >> > cache. I would prefer that we move everything non-swapfile specific > >> >> > outside the slots cache layer (zswap_invalidate(), > >> >> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), > >> >> > mem_cgroup_uncharge_swap(), ..). However, if some of those are > >> >> > controversial, we can move some of them for now. > >> >> > >> >> That makes sense for me. > >> >> > >> >> > When draining free swap slots from the cache, swap_range_free() is > >> >> > called with nr_entries == 1 anyway, so I can't see how any batching is > >> >> > going on. If anything it should help amortize the cost. > >> >> > >> >> In swapcache_free_entries(), the sis->lock will be held to free multiple > >> >> swap slots via swap_info_get_cont() if possible. This can reduce > >> >> sis->lock contention. > >> > > >> > Ah yes that's a good point. Since most of these callbacks don't > >> > actually access sis, but use the swap entry value itself, I am > >> > guessing the reason we need to hold the lock for all these callbacks > >> > is to prevent swapoff and swapon reusing the same swap entry on a > >> > different swap device, right? 
> >> > >> In, > >> > >> swapcache_free_entries() > >> swap_entry_free() > >> swap_range_free() > >> > >> Quite some sis fields will be accessed. > > > > I wasn't referring to this code. I was what's preventing us from > > moving the callbacks I mentioned outside the lock (zswap_invalidate(), > > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(), > > mem_cgroup_uncharge_swap(), ..). I think most or all of them don't > > really access sis, but perhaps they need the lock to ensure the swap > > entry value does not get reused? > > In fact, the swap entries to be freed by swapcache_free_entries() is in > a state that can not be freed by other path (including swapoff()). It's > swap_map value is SWAP_HAS_CACHE, but we can not find folio in > swap_address_space(). Interesting, it would be even nicer if we can move them outside the lock. > > To be honest, I don't know whether there are dependencies on sis->lock > in these callbacks. You need to investigate them one by one. Yeah moving these callbacks outside batching and the lock is very intriguing but needs to be done carefully. We don't need to do it all at once, we can start with zswap_invalidate() and move them as we see fit. It would be nice if the code is refactored such that it's clear what callbacks are made immediately when the entry is no longer used and what callbacks are made when the swap slot is being freed. > > -- > Best Regards, > Huang, Ying
diff --git a/mm/zswap.c b/mm/zswap.c
index 74411dfdad92..db95491bcdd5 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1063,11 +1063,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct mempolicy *mpol;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
+	struct swap_info_struct *si;
 	struct zpool *pool = zswap_find_zpool(entry);
 	bool page_was_allocated;
 	u8 *src, *tmp = NULL;
 	unsigned int dlen;
-	int ret;
+	int ret = 0;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
@@ -1082,16 +1083,30 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &page_was_allocated);
-	if (!page) {
+	if (!page)
 		ret = -ENOMEM;
-		goto fail;
-	}
-
-	/* Found an existing page, we raced with load/swapin */
-	if (!page_was_allocated) {
+	else if (!page_was_allocated) {
+		/* Found an existing page, we raced with load/swapin */
 		put_page(page);
 		ret = -EEXIST;
-		goto fail;
+	}
+
+	if (ret) {
+		si = get_swap_device(swpentry);
+		if (!si)
+			goto out;
+
+		/* Two cases to directly release zswap_entry.
+		 * 1) -ENOMEM,if the swpentry has been freed, but cached in
+		 *    swap_slots_cache(no page and swapcount = 0).
+		 * 2) -EEXIST, option zswap_exclusive_loads_enabled disabled and
+		 *    zswap_load completed(page in swap_cache and swapcount = 0).
+		 */
+		if (!swap_swapcount(si, swpentry))
+			ret = 0;
+
+		put_swap_device(si);
+		goto out;
 	}
 
 	/*
@@ -1106,7 +1121,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 		spin_unlock(&tree->lock);
 		delete_from_swap_cache(page_folio(page));
 		ret = -ENOMEM;
-		goto fail;
+		goto out;
 	}
 	spin_unlock(&tree->lock);
 
@@ -1151,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 	return ret;
 
-fail:
+out:
 	if (!zpool_can_sleep_mapped(pool))
 		kfree(tmp);