Message ID | 20230524065051.6328-1-cerasuolodomenico@gmail.com |
---|---|
State | New |
Headers |
From: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, yosryahmed@google.com, hannes@cmpxchg.org, kernel-team@fb.com, Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Subject: [PATCH] mm: zswap: shrink until can accept
Date: Wed, 24 May 2023 08:50:51 +0200
Message-Id: <20230524065051.6328-1-cerasuolodomenico@gmail.com>
X-Mailer: git-send-email 2.34.1 |
Series | mm: zswap: shrink until can accept |
Commit Message
Domenico Cerasuolo
May 24, 2023, 6:50 a.m. UTC
This update addresses an issue with the zswap reclaim mechanism that
hinders the efficient offloading of cold pages to disk, compromising the
preservation of the LRU order and consequently diminishing, if not
inverting, zswap's performance benefits.
The functioning of the zswap shrink worker was found to be inadequate,
as shown by a basic benchmark test. For the test, a kernel build was
used as a reference, with its memory confined to 1G via a cgroup and
a 5G swap file provided. The results are presented below; these are
averages of three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit: 1478
Besides the evident regression, one thing to notice from this data is
the extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular
scenario: once zswap hits its limit, zswap_pool_reached_full is set to
true; with this flag on, zswap_frontswap_store rejects pages if zswap is
still above the acceptance threshold. Once we include the rejections due
to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
1478 to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because
it's above the acceptance threshold, yet fails to attempt memory
reclamation. This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk,
while cold ones remain in memory, waiting only to be invalidated. The LRU
order is completely broken and zswap ends up being just an overhead
without providing any benefits.
This commit applies two changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met, and b) the work is also
queued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
---
mm/zswap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
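For reference, here is a condensed sketch of the patched shrink_worker(), reconstructed from the diff quoted in the replies below. The worker now keeps calling zpool_shrink() until the pool drops back under the acceptance threshold (or the zpool returns -EINVAL); the second hunk, not shown here, makes zswap_frontswap_store() jump to the shrink label instead of rejecting outright when the pool is above the threshold:

static void shrink_worker(struct work_struct *w)
{
	struct zswap_pool *pool = container_of(w, typeof(*pool),
						shrink_work);
	int ret;

	do {
		/* Reclaim one page at a time until the pool can accept again. */
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret)
			zswap_reject_reclaim_fail++;
	} while (!zswap_can_accept() && ret != -EINVAL);
	zswap_pool_put(pool);
}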
Comments
Hi Domenico, On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo <cerasuolodomenico@gmail.com> wrote: > > This update addresses an issue with the zswap reclaim mechanism, which > hinders the efficient offloading of cold pages to disk, thereby > compromising the preservation of the LRU order and consequently > diminishing, if not inverting, its performance benefits. > > The functioning of the zswap shrink worker was found to be inadequate, > as shown by basic benchmark test. For the test, a kernel build was > utilized as a reference, with its memory confined to 1G via a cgroup and > a 5G swap file provided. The results are presented below, these are > averages of three runs without the use of zswap: > > real 46m26s > user 35m4s > sys 7m37s > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > system), the results changed to: > > real 56m4s > user 35m13s > sys 8m43s > > written_back_pages: 18 > reject_reclaim_fail: 0 > pool_limit_hit:1478 > > Besides the evident regression, one thing to notice from this data is > the extremely low number of written_back_pages and pool_limit_hit. > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > when zswap is completely full, doesn't account for a particular > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > still above the acceptance threshold. Once we include the rejections due > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > 1478 to a significant 21578266. > > Zswap is stuck in an undesirable state where it rejects pages because > it's above the acceptance threshold, yet fails to attempt memory > reclaimation. This happens because the shrink work is only queued when > zswap_frontswap_store detects that it's full and the work itself only > reclaims one page per run. > > This state results in hot pages getting written directly to disk, > while cold ones remain memory, waiting only to be invalidated. The LRU > order is completely broken and zswap ends up being just an overhead > without providing any benefits. > > This commit applies 2 changes: a) the shrink worker is set to reclaim > pages until the acceptance threshold is met and b) the task is also > enqueued when zswap is not full but still above the threshold. > > Testing this suggested update showed much better numbers: > > real 36m37s > user 35m8s > sys 9m32s > > written_back_pages: 10459423 > reject_reclaim_fail: 12896 > pool_limit_hit: 75653 Impressive numbers, and great find in general! I wonder how other workloads benefit/regress from this change. Anything else that you happened to run? :) > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > --- > mm/zswap.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 59da2a415fbb..2ee0775d8213 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > { > struct zswap_pool *pool = container_of(w, typeof(*pool), > shrink_work); > + int ret; > > - if (zpool_shrink(pool->zpool, 1, NULL)) > - zswap_reject_reclaim_fail++; > + do { > + ret = zpool_shrink(pool->zpool, 1, NULL); > + if (ret) > + zswap_reject_reclaim_fail++; > + } while (!zswap_can_accept() && ret != -EINVAL); One question/comment here about the retry logic. 
So I looked through the awfully convoluted writeback code, and there are multiple layers, and some of them tend to overwrite the return value of the layer underneath :( For zsmalloc (as an example): zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry(). First of all, that zpool_ops is an unnecessarily confusing indirection, but anyway. - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru is empty, and -EAGAIN on other errors. - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the return value of zs_reclaim_page()/zbud_reclaim_page(). - zpool_shrink() will return -EINVAL if the driver does not support shrinking, otherwise it will propagate the return value from the driver. So it looks like we will get -EINVAL only if the driver lru is empty or the driver does not support writeback, so rightfully we should not retry. If zswap_writeback_entry() returns -EEXIST, it probably means that we raced with another task decompressing the page, so rightfully we should retry to writeback another page instead. If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily mean that we couldn't allocate memory (unfortunately), it looks like we will return -ENOMEM in other cases as well. Arguably, we can retry in all cases, because even if we were actually out of memory, we are trying to make an allocation that will eventually free more memory. In all cases, I think it would be nicer if we retry if ret == -EAGAIN instead. It is semantically more sane. In this specific case it is functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly return -EAGAIN anyway, but in case someone tries to use saner errnos in the future, this will age better. Also, do we intentionally want to keep retrying forever on failure? Do we want to add a max number of retries? > zswap_pool_put(pool); > } > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > if (zswap_pool_reached_full) { > if (!zswap_can_accept()) { > ret = -ENOMEM; > - goto reject; > + goto shrink; > } else > zswap_pool_reached_full = false; > } > -- > 2.34.1 >
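A minimal sketch of the retry condition suggested above (illustrative only, not the posted patch): keep looping on success or -EAGAIN, and bail out on any other error such as -EINVAL:

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret)
			zswap_reject_reclaim_fail++;
		/* Continue on success; retry only when the zpool reports -EAGAIN. */
	} while (!zswap_can_accept() && (!ret || ret == -EAGAIN));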
On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > Hi Domenico, > > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo > <cerasuolodomenico@gmail.com> wrote: > > > > This update addresses an issue with the zswap reclaim mechanism, which > > hinders the efficient offloading of cold pages to disk, thereby > > compromising the preservation of the LRU order and consequently > > diminishing, if not inverting, its performance benefits. > > > > The functioning of the zswap shrink worker was found to be inadequate, > > as shown by basic benchmark test. For the test, a kernel build was > > utilized as a reference, with its memory confined to 1G via a cgroup and > > a 5G swap file provided. The results are presented below, these are > > averages of three runs without the use of zswap: > > > > real 46m26s > > user 35m4s > > sys 7m37s > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > system), the results changed to: > > > > real 56m4s > > user 35m13s > > sys 8m43s > > > > written_back_pages: 18 > > reject_reclaim_fail: 0 > > pool_limit_hit:1478 > > > > Besides the evident regression, one thing to notice from this data is > > the extremely low number of written_back_pages and pool_limit_hit. > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > when zswap is completely full, doesn't account for a particular > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > still above the acceptance threshold. Once we include the rejections due > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > 1478 to a significant 21578266. > > > > Zswap is stuck in an undesirable state where it rejects pages because > > it's above the acceptance threshold, yet fails to attempt memory > > reclaimation. This happens because the shrink work is only queued when > > zswap_frontswap_store detects that it's full and the work itself only > > reclaims one page per run. > > > > This state results in hot pages getting written directly to disk, > > while cold ones remain memory, waiting only to be invalidated. The LRU > > order is completely broken and zswap ends up being just an overhead > > without providing any benefits. > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > pages until the acceptance threshold is met and b) the task is also > > enqueued when zswap is not full but still above the threshold. > > > > Testing this suggested update showed much better numbers: > > > > real 36m37s > > user 35m8s > > sys 9m32s > > > > written_back_pages: 10459423 > > reject_reclaim_fail: 12896 > > pool_limit_hit: 75653 > > Impressive numbers, and great find in general! > > I wonder how other workloads benefit/regress from this change. > Anything else that you happened to run? :) > Hi Yosry, thanks for the quick feedback! Besides stressers, I haven't tested any other actual workload, we don't have writeback in production yet so I can't provide any data from there. 
I was wondering what kind of actual workload I could use to test this on my desktop, but I couldn't think of anything relevant, I'm open to suggestions though :) > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > --- > > mm/zswap.c | 10 +++++++--- > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 59da2a415fbb..2ee0775d8213 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > { > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > shrink_work); > > + int ret; > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > - zswap_reject_reclaim_fail++; > > + do { > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > + if (ret) > > + zswap_reject_reclaim_fail++; > > + } while (!zswap_can_accept() && ret != -EINVAL); > > One question/comment here about the retry logic. > > So I looked through the awfully convoluted writeback code, and there > are multiple layers, and some of them tend to overwrite the return > value of the layer underneath :( > > For zsmalloc (as an example): > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry(). > > First of all, that zpool_ops is an unnecessarily confusing > indirection, but anyway. > > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru > is empty, and -EAGAIN on other errors. > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the > return value of zs_reclaim_page()/zbud_reclaim_page(). > - zpool_shrink() will return -EINVAL if the driver does not support > shrinking, otherwise it will propagate the return value from the > driver. > > So it looks like we will get -EINVAL only if the driver lru is empty > or the driver does not support writeback, so rightfully we should not > retry. > > If zswap_writeback_entry() returns -EEXIST, it probably means that we > raced with another task decompressing the page, so rightfully we > should retry to writeback another page instead. > > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily > mean that we couldn't allocate memory (unfortunately), it looks like > we will return -ENOMEM in other cases as well. Arguably, we can retry > in all cases, because even if we were actually out of memory, we are > trying to make an allocation that will eventually free more memory. > > In all cases, I think it would be nicer if we retry if ret == -EAGAIN > instead. It is semantically more sane. In this specific case it is > functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly > return -EAGAIN anyway, but in case someone tries to use saner errnos > in the future, this will age better. Retrying if ret == -EAGAIN seems much nicer indeed, will change it. > > Also, do we intentionally want to keep retrying forever on failure? Do > we want to add a max number of retries? If the drivers guaranteed that zpool_shrink will remove at least an entry from their LRU, the retry wouldn't be needed because the LRU will eventually be emptied. But this is an assumption on the implementation of the zpool, so yes, we could use a max retries. I think that being a sanity check, it should overshoot the required number of iterations in order to avoid a premature break, what about retrying a max of zswap_stored_pages times? 
> > > zswap_pool_put(pool); > > } > > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > if (zswap_pool_reached_full) { > > if (!zswap_can_accept()) { > > ret = -ENOMEM; > > - goto reject; > > + goto shrink; > > } else > > zswap_pool_reached_full = false; > > } > > -- > > 2.34.1 > >
On Thu, May 25, 2023 at 9:53 AM Domenico Cerasuolo <cerasuolodomenico@gmail.com> wrote: > > On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > Hi Domenico, > > > > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo > > <cerasuolodomenico@gmail.com> wrote: > > > > > > This update addresses an issue with the zswap reclaim mechanism, which > > > hinders the efficient offloading of cold pages to disk, thereby > > > compromising the preservation of the LRU order and consequently > > > diminishing, if not inverting, its performance benefits. > > > > > > The functioning of the zswap shrink worker was found to be inadequate, > > > as shown by basic benchmark test. For the test, a kernel build was > > > utilized as a reference, with its memory confined to 1G via a cgroup and > > > a 5G swap file provided. The results are presented below, these are > > > averages of three runs without the use of zswap: > > > > > > real 46m26s > > > user 35m4s > > > sys 7m37s > > > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > > system), the results changed to: > > > > > > real 56m4s > > > user 35m13s > > > sys 8m43s > > > > > > written_back_pages: 18 > > > reject_reclaim_fail: 0 > > > pool_limit_hit:1478 > > > > > > Besides the evident regression, one thing to notice from this data is > > > the extremely low number of written_back_pages and pool_limit_hit. > > > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > > when zswap is completely full, doesn't account for a particular > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > > still above the acceptance threshold. Once we include the rejections due > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > > 1478 to a significant 21578266. > > > > > > Zswap is stuck in an undesirable state where it rejects pages because > > > it's above the acceptance threshold, yet fails to attempt memory > > > reclaimation. This happens because the shrink work is only queued when > > > zswap_frontswap_store detects that it's full and the work itself only > > > reclaims one page per run. > > > > > > This state results in hot pages getting written directly to disk, > > > while cold ones remain memory, waiting only to be invalidated. The LRU > > > order is completely broken and zswap ends up being just an overhead > > > without providing any benefits. > > > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > > pages until the acceptance threshold is met and b) the task is also > > > enqueued when zswap is not full but still above the threshold. > > > > > > Testing this suggested update showed much better numbers: > > > > > > real 36m37s > > > user 35m8s > > > sys 9m32s > > > > > > written_back_pages: 10459423 > > > reject_reclaim_fail: 12896 > > > pool_limit_hit: 75653 > > > > Impressive numbers, and great find in general! > > > > I wonder how other workloads benefit/regress from this change. > > Anything else that you happened to run? :) > > > Hi Yosry, > > thanks for the quick feedback! > > Besides stressers, I haven't tested any other actual workload, we don't have > writeback in production yet so I can't provide any data from there. 
I was > wondering what kind of actual workload I could use to test this on my desktop, > but I couldn't think of anything relevant, I'm open to suggestions though :) Nothing in mind in particular as well. Perhaps others have ideas. > > > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > > --- > > > mm/zswap.c | 10 +++++++--- > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index 59da2a415fbb..2ee0775d8213 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > > { > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > shrink_work); > > > + int ret; > > > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > > - zswap_reject_reclaim_fail++; > > > + do { > > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > > + if (ret) > > > + zswap_reject_reclaim_fail++; > > > + } while (!zswap_can_accept() && ret != -EINVAL); > > > > One question/comment here about the retry logic. > > > > So I looked through the awfully convoluted writeback code, and there > > are multiple layers, and some of them tend to overwrite the return > > value of the layer underneath :( > > > > For zsmalloc (as an example): > > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry(). > > > > First of all, that zpool_ops is an unnecessarily confusing > > indirection, but anyway. > > > > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error > > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru > > is empty, and -EAGAIN on other errors. > > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the > > return value of zs_reclaim_page()/zbud_reclaim_page(). > > - zpool_shrink() will return -EINVAL if the driver does not support > > shrinking, otherwise it will propagate the return value from the > > driver. > > > > So it looks like we will get -EINVAL only if the driver lru is empty > > or the driver does not support writeback, so rightfully we should not > > retry. > > > > If zswap_writeback_entry() returns -EEXIST, it probably means that we > > raced with another task decompressing the page, so rightfully we > > should retry to writeback another page instead. > > > > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily > > mean that we couldn't allocate memory (unfortunately), it looks like > > we will return -ENOMEM in other cases as well. Arguably, we can retry > > in all cases, because even if we were actually out of memory, we are > > trying to make an allocation that will eventually free more memory. > > > > In all cases, I think it would be nicer if we retry if ret == -EAGAIN > > instead. It is semantically more sane. In this specific case it is > > functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly > > return -EAGAIN anyway, but in case someone tries to use saner errnos > > in the future, this will age better. > Retrying if ret == -EAGAIN seems much nicer indeed, will change it. > > > > Also, do we intentionally want to keep retrying forever on failure? Do > > we want to add a max number of retries? > If the drivers guaranteed that zpool_shrink will remove at least an entry > from their LRU, the retry wouldn't be needed because the LRU will > eventually be emptied. 
But this is an assumption on the implementation of I don't think any zpool driver can guarantee to writeback at least one page. It can fail for reasons beyond their control (e.g. cannot allocate a page to decompress to). > the zpool, so yes, we could use a max retries. I think that being a sanity > check, it should overshoot the required number of iterations in order to > avoid a premature break, what about retrying a max of > zswap_stored_pages times? Why is it just a sanity check? Consider a case where the system is under extreme memory pressure that the drivers cannot allocate pages to decompress to. The drivers would be continuously competing with all other allocations on the machine. I think we should give up at some point. In any case, you changed the zswap_frontswap_store() to goto shrink if !zswap_can_accept(), so next time we try to compress a page to zswap we will invoke try again anyway -- perhaps under better circumstances. I think zswap_stored_pages is too much, keep in mind that it also includes same-filled pages which are not even stored in the zpool drivers. Maybe we should allow a fixed number of failures. If zpool_shrink() is successful, keep going until zswap_can_accept(). Otherwise allow a fixed number of failures before giving up. Maybe we can improve the error codes propagated through the writeback code as well to improve the retry logic. For example, if zswap_writeback_entry() returns EEXIST then retrying should be harmless, but ENOMEM might be harmful. Both of which are propagated as EAGAIN from zs_zpool_shrink()/zbud_zpool_shrink(). > > > > > zswap_pool_put(pool); > > > } > > > > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > > if (zswap_pool_reached_full) { > > > if (!zswap_can_accept()) { > > > ret = -ENOMEM; > > > - goto reject; > > > + goto shrink; > > > } else > > > zswap_pool_reached_full = false; > > > } > > > -- > > > 2.34.1 > > >
On Thu, May 25, 2023 at 9:10 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, May 25, 2023 at 9:53 AM Domenico Cerasuolo > <cerasuolodomenico@gmail.com> wrote: > > > > On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > Hi Domenico, > > > > > > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo > > > <cerasuolodomenico@gmail.com> wrote: > > > > > > > > This update addresses an issue with the zswap reclaim mechanism, which > > > > hinders the efficient offloading of cold pages to disk, thereby > > > > compromising the preservation of the LRU order and consequently > > > > diminishing, if not inverting, its performance benefits. > > > > > > > > The functioning of the zswap shrink worker was found to be inadequate, > > > > as shown by basic benchmark test. For the test, a kernel build was > > > > utilized as a reference, with its memory confined to 1G via a cgroup and > > > > a 5G swap file provided. The results are presented below, these are > > > > averages of three runs without the use of zswap: > > > > > > > > real 46m26s > > > > user 35m4s > > > > sys 7m37s > > > > > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > > > system), the results changed to: > > > > > > > > real 56m4s > > > > user 35m13s > > > > sys 8m43s > > > > > > > > written_back_pages: 18 > > > > reject_reclaim_fail: 0 > > > > pool_limit_hit:1478 > > > > > > > > Besides the evident regression, one thing to notice from this data is > > > > the extremely low number of written_back_pages and pool_limit_hit. > > > > > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > > > when zswap is completely full, doesn't account for a particular > > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > > > still above the acceptance threshold. Once we include the rejections due > > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > > > 1478 to a significant 21578266. > > > > > > > > Zswap is stuck in an undesirable state where it rejects pages because > > > > it's above the acceptance threshold, yet fails to attempt memory > > > > reclaimation. This happens because the shrink work is only queued when > > > > zswap_frontswap_store detects that it's full and the work itself only > > > > reclaims one page per run. > > > > > > > > This state results in hot pages getting written directly to disk, > > > > while cold ones remain memory, waiting only to be invalidated. The LRU > > > > order is completely broken and zswap ends up being just an overhead > > > > without providing any benefits. > > > > > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > > > pages until the acceptance threshold is met and b) the task is also > > > > enqueued when zswap is not full but still above the threshold. > > > > > > > > Testing this suggested update showed much better numbers: > > > > > > > > real 36m37s > > > > user 35m8s > > > > sys 9m32s > > > > > > > > written_back_pages: 10459423 > > > > reject_reclaim_fail: 12896 > > > > pool_limit_hit: 75653 > > > > > > Impressive numbers, and great find in general! > > > > > > I wonder how other workloads benefit/regress from this change. > > > Anything else that you happened to run? :) > > > > > Hi Yosry, > > > > thanks for the quick feedback! 
> > > > Besides stressers, I haven't tested any other actual workload, we don't have > > writeback in production yet so I can't provide any data from there. I was > > wondering what kind of actual workload I could use to test this on my desktop, > > but I couldn't think of anything relevant, I'm open to suggestions though :) > > Nothing in mind in particular as well. Perhaps others have ideas. > > > > > > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > > > --- > > > > mm/zswap.c | 10 +++++++--- > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index 59da2a415fbb..2ee0775d8213 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > > > { > > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > > shrink_work); > > > > + int ret; > > > > > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > > > - zswap_reject_reclaim_fail++; > > > > + do { > > > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > > > + if (ret) > > > > + zswap_reject_reclaim_fail++; > > > > + } while (!zswap_can_accept() && ret != -EINVAL); > > > > > > One question/comment here about the retry logic. > > > > > > So I looked through the awfully convoluted writeback code, and there > > > are multiple layers, and some of them tend to overwrite the return > > > value of the layer underneath :( > > > > > > For zsmalloc (as an example): > > > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry(). > > > > > > First of all, that zpool_ops is an unnecessarily confusing > > > indirection, but anyway. > > > > > > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error > > > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru > > > is empty, and -EAGAIN on other errors. > > > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the > > > return value of zs_reclaim_page()/zbud_reclaim_page(). > > > - zpool_shrink() will return -EINVAL if the driver does not support > > > shrinking, otherwise it will propagate the return value from the > > > driver. > > > > > > So it looks like we will get -EINVAL only if the driver lru is empty > > > or the driver does not support writeback, so rightfully we should not > > > retry. > > > > > > If zswap_writeback_entry() returns -EEXIST, it probably means that we > > > raced with another task decompressing the page, so rightfully we > > > should retry to writeback another page instead. > > > > > > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily > > > mean that we couldn't allocate memory (unfortunately), it looks like > > > we will return -ENOMEM in other cases as well. Arguably, we can retry > > > in all cases, because even if we were actually out of memory, we are > > > trying to make an allocation that will eventually free more memory. > > > > > > In all cases, I think it would be nicer if we retry if ret == -EAGAIN > > > instead. It is semantically more sane. In this specific case it is > > > functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly > > > return -EAGAIN anyway, but in case someone tries to use saner errnos > > > in the future, this will age better. > > Retrying if ret == -EAGAIN seems much nicer indeed, will change it. 
> > > > > > Also, do we intentionally want to keep retrying forever on failure? Do > > > we want to add a max number of retries? > > If the drivers guaranteed that zpool_shrink will remove at least an entry > > from their LRU, the retry wouldn't be needed because the LRU will > > eventually be emptied. But this is an assumption on the implementation of > > I don't think any zpool driver can guarantee to writeback at least one > page. It can fail for reasons beyond their control (e.g. cannot > allocate a page to decompress to). > > > the zpool, so yes, we could use a max retries. I think that being a sanity > > check, it should overshoot the required number of iterations in order to > > avoid a premature break, what about retrying a max of > > zswap_stored_pages times? > > Why is it just a sanity check? Consider a case where the system is > under extreme memory pressure that the drivers cannot allocate pages > to decompress to. The drivers would be continuously competing with all > other allocations on the machine. I think we should give up at some > point. In any case, you changed the zswap_frontswap_store() to goto > shrink if !zswap_can_accept(), so next time we try to compress a page > to zswap we will invoke try again anyway -- perhaps under better > circumstances. > > I think zswap_stored_pages is too much, keep in mind that it also > includes same-filled pages which are not even stored in the zpool > drivers. Maybe we should allow a fixed number of failures. If > zpool_shrink() is successful, keep going until zswap_can_accept(). > Otherwise allow a fixed number of failures before giving up. > Yes, I think I got carried away by a writeback frenzy, while testing the LRU refactor I had to fight to get a decent amount of writebacks, and ended up being too afraid the shrink work would stop. What about using MAX_RECLAIM_RETRIES? I like the idea of putting a limit but I wouldn't know how to pick a fixed number. > Maybe we can improve the error codes propagated through the writeback > code as well to improve the retry logic. For example, if > zswap_writeback_entry() returns EEXIST then retrying should be > harmless, but ENOMEM might be harmful. Both of which are propagated as > EAGAIN from zs_zpool_shrink()/zbud_zpool_shrink(). > That would be a nice improvement indeed, I'd queue it after the LRU refactor though. > > > > > > > zswap_pool_put(pool); > > > > } > > > > > > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > > > if (zswap_pool_reached_full) { > > > > if (!zswap_can_accept()) { > > > > ret = -ENOMEM; > > > > - goto reject; > > > > + goto shrink; > > > > } else > > > > zswap_pool_reached_full = false; > > > > } > > > > -- > > > > 2.34.1 > > > >
Hi Domenico, On Wed, May 24, 2023 at 8:50 AM Domenico Cerasuolo <cerasuolodomenico@gmail.com> wrote: > > This update addresses an issue with the zswap reclaim mechanism, which > hinders the efficient offloading of cold pages to disk, thereby > compromising the preservation of the LRU order and consequently > diminishing, if not inverting, its performance benefits. > > The functioning of the zswap shrink worker was found to be inadequate, > as shown by basic benchmark test. For the test, a kernel build was > utilized as a reference, with its memory confined to 1G via a cgroup and > a 5G swap file provided. The results are presented below, these are > averages of three runs without the use of zswap: > > real 46m26s > user 35m4s > sys 7m37s > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > system), the results changed to: > > real 56m4s > user 35m13s > sys 8m43s > > written_back_pages: 18 > reject_reclaim_fail: 0 > pool_limit_hit:1478 > > Besides the evident regression, one thing to notice from this data is > the extremely low number of written_back_pages and pool_limit_hit. > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > when zswap is completely full, doesn't account for a particular > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > still above the acceptance threshold. Once we include the rejections due > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > 1478 to a significant 21578266. > > Zswap is stuck in an undesirable state where it rejects pages because > it's above the acceptance threshold, yet fails to attempt memory > reclaimation. This happens because the shrink work is only queued when > zswap_frontswap_store detects that it's full and the work itself only > reclaims one page per run. > > This state results in hot pages getting written directly to disk, > while cold ones remain memory, waiting only to be invalidated. The LRU > order is completely broken and zswap ends up being just an overhead > without providing any benefits. > > This commit applies 2 changes: a) the shrink worker is set to reclaim > pages until the acceptance threshold is met and b) the task is also > enqueued when zswap is not full but still above the threshold. > > Testing this suggested update showed much better numbers: > > real 36m37s > user 35m8s > sys 9m32s > > written_back_pages: 10459423 > reject_reclaim_fail: 12896 > pool_limit_hit: 75653 > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > --- > mm/zswap.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 59da2a415fbb..2ee0775d8213 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > { > struct zswap_pool *pool = container_of(w, typeof(*pool), > shrink_work); > + int ret; > > - if (zpool_shrink(pool->zpool, 1, NULL)) > - zswap_reject_reclaim_fail++; > + do { > + ret = zpool_shrink(pool->zpool, 1, NULL); > + if (ret) > + zswap_reject_reclaim_fail++; > + } while (!zswap_can_accept() && ret != -EINVAL); > zswap_pool_put(pool); > } while I do agree with your points, I have a concern about this shrinker logic change. The reason for not doing this as you do was possible real time/responsiveness characteristics degrade. 
Have you checked that it's not really the case? Thanks, Vitaly > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > if (zswap_pool_reached_full) { > if (!zswap_can_accept()) { > ret = -ENOMEM; > - goto reject; > + goto shrink; > } else > zswap_pool_reached_full = false; > } > -- > 2.34.1 >
On Fri, May 26, 2023 at 12:18:21PM +0200, Vitaly Wool wrote: > On Wed, May 24, 2023 at 8:50 AM Domenico Cerasuolo > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > { > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > shrink_work); > > + int ret; > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > - zswap_reject_reclaim_fail++; > > + do { > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > + if (ret) > > + zswap_reject_reclaim_fail++; > > + } while (!zswap_can_accept() && ret != -EINVAL); > > zswap_pool_put(pool); > > } > > while I do agree with your points, I have a concern about this > shrinker logic change. The reason for not doing this as you do was > possible real time/responsiveness characteristics degrade. Have you > checked that it's not really the case? Good point. Adding a cond_resched() to the loop would be a good idea.
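A sketch of where the suggested cond_resched() could sit in the loop, to bound the latency impact Vitaly raises above (illustrative only):

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret)
			zswap_reject_reclaim_fail++;
		/* Yield between reclaim attempts so the worker doesn't hog the CPU. */
		cond_resched();
	} while (!zswap_can_accept() && ret != -EINVAL);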
On Fri, May 26, 2023 at 1:52 AM Domenico Cerasuolo <cerasuolodomenico@gmail.com> wrote: > > On Thu, May 25, 2023 at 9:10 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, May 25, 2023 at 9:53 AM Domenico Cerasuolo > > <cerasuolodomenico@gmail.com> wrote: > > > > > > On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > Hi Domenico, > > > > > > > > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo > > > > <cerasuolodomenico@gmail.com> wrote: > > > > > > > > > > This update addresses an issue with the zswap reclaim mechanism, which > > > > > hinders the efficient offloading of cold pages to disk, thereby > > > > > compromising the preservation of the LRU order and consequently > > > > > diminishing, if not inverting, its performance benefits. > > > > > > > > > > The functioning of the zswap shrink worker was found to be inadequate, > > > > > as shown by basic benchmark test. For the test, a kernel build was > > > > > utilized as a reference, with its memory confined to 1G via a cgroup and > > > > > a 5G swap file provided. The results are presented below, these are > > > > > averages of three runs without the use of zswap: > > > > > > > > > > real 46m26s > > > > > user 35m4s > > > > > sys 7m37s > > > > > > > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > > > > system), the results changed to: > > > > > > > > > > real 56m4s > > > > > user 35m13s > > > > > sys 8m43s > > > > > > > > > > written_back_pages: 18 > > > > > reject_reclaim_fail: 0 > > > > > pool_limit_hit:1478 > > > > > > > > > > Besides the evident regression, one thing to notice from this data is > > > > > the extremely low number of written_back_pages and pool_limit_hit. > > > > > > > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > > > > when zswap is completely full, doesn't account for a particular > > > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > > > > still above the acceptance threshold. Once we include the rejections due > > > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > > > > 1478 to a significant 21578266. > > > > > > > > > > Zswap is stuck in an undesirable state where it rejects pages because > > > > > it's above the acceptance threshold, yet fails to attempt memory > > > > > reclaimation. This happens because the shrink work is only queued when > > > > > zswap_frontswap_store detects that it's full and the work itself only > > > > > reclaims one page per run. > > > > > > > > > > This state results in hot pages getting written directly to disk, > > > > > while cold ones remain memory, waiting only to be invalidated. The LRU > > > > > order is completely broken and zswap ends up being just an overhead > > > > > without providing any benefits. > > > > > > > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > > > > pages until the acceptance threshold is met and b) the task is also > > > > > enqueued when zswap is not full but still above the threshold. > > > > > > > > > > Testing this suggested update showed much better numbers: > > > > > > > > > > real 36m37s > > > > > user 35m8s > > > > > sys 9m32s > > > > > > > > > > written_back_pages: 10459423 > > > > > reject_reclaim_fail: 12896 > > > > > pool_limit_hit: 75653 > > > > > > > > Impressive numbers, and great find in general! 
> > > > > > > > I wonder how other workloads benefit/regress from this change. > > > > Anything else that you happened to run? :) > > > > > > > Hi Yosry, > > > > > > thanks for the quick feedback! > > > > > > Besides stressers, I haven't tested any other actual workload, we don't have > > > writeback in production yet so I can't provide any data from there. I was > > > wondering what kind of actual workload I could use to test this on my desktop, > > > but I couldn't think of anything relevant, I'm open to suggestions though :) > > > > Nothing in mind in particular as well. Perhaps others have ideas. > > > > > > > > > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > > > > --- > > > > > mm/zswap.c | 10 +++++++--- > > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > > index 59da2a415fbb..2ee0775d8213 100644 > > > > > --- a/mm/zswap.c > > > > > +++ b/mm/zswap.c > > > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > > > > { > > > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > > > shrink_work); > > > > > + int ret; > > > > > > > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > > > > - zswap_reject_reclaim_fail++; > > > > > + do { > > > > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > > > > + if (ret) > > > > > + zswap_reject_reclaim_fail++; > > > > > + } while (!zswap_can_accept() && ret != -EINVAL); > > > > > > > > One question/comment here about the retry logic. > > > > > > > > So I looked through the awfully convoluted writeback code, and there > > > > are multiple layers, and some of them tend to overwrite the return > > > > value of the layer underneath :( > > > > > > > > For zsmalloc (as an example): > > > > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry(). > > > > > > > > First of all, that zpool_ops is an unnecessarily confusing > > > > indirection, but anyway. > > > > > > > > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error > > > > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru > > > > is empty, and -EAGAIN on other errors. > > > > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the > > > > return value of zs_reclaim_page()/zbud_reclaim_page(). > > > > - zpool_shrink() will return -EINVAL if the driver does not support > > > > shrinking, otherwise it will propagate the return value from the > > > > driver. > > > > > > > > So it looks like we will get -EINVAL only if the driver lru is empty > > > > or the driver does not support writeback, so rightfully we should not > > > > retry. > > > > > > > > If zswap_writeback_entry() returns -EEXIST, it probably means that we > > > > raced with another task decompressing the page, so rightfully we > > > > should retry to writeback another page instead. > > > > > > > > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily > > > > mean that we couldn't allocate memory (unfortunately), it looks like > > > > we will return -ENOMEM in other cases as well. Arguably, we can retry > > > > in all cases, because even if we were actually out of memory, we are > > > > trying to make an allocation that will eventually free more memory. > > > > > > > > In all cases, I think it would be nicer if we retry if ret == -EAGAIN > > > > instead. It is semantically more sane. 
In this specific case it is > > > > functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly > > > > return -EAGAIN anyway, but in case someone tries to use saner errnos > > > > in the future, this will age better. > > > Retrying if ret == -EAGAIN seems much nicer indeed, will change it. > > > > > > > > Also, do we intentionally want to keep retrying forever on failure? Do > > > > we want to add a max number of retries? > > > If the drivers guaranteed that zpool_shrink will remove at least an entry > > > from their LRU, the retry wouldn't be needed because the LRU will > > > eventually be emptied. But this is an assumption on the implementation of > > > > I don't think any zpool driver can guarantee to writeback at least one > > page. It can fail for reasons beyond their control (e.g. cannot > > allocate a page to decompress to). > > > > > the zpool, so yes, we could use a max retries. I think that being a sanity > > > check, it should overshoot the required number of iterations in order to > > > avoid a premature break, what about retrying a max of > > > zswap_stored_pages times? > > > > Why is it just a sanity check? Consider a case where the system is > > under extreme memory pressure that the drivers cannot allocate pages > > to decompress to. The drivers would be continuously competing with all > > other allocations on the machine. I think we should give up at some > > point. In any case, you changed the zswap_frontswap_store() to goto > > shrink if !zswap_can_accept(), so next time we try to compress a page > > to zswap we will invoke try again anyway -- perhaps under better > > circumstances. > > > > I think zswap_stored_pages is too much, keep in mind that it also > > includes same-filled pages which are not even stored in the zpool > > drivers. Maybe we should allow a fixed number of failures. If > > zpool_shrink() is successful, keep going until zswap_can_accept(). > > Otherwise allow a fixed number of failures before giving up. > > > > Yes, I think I got carried away by a writeback frenzy, while testing the LRU > refactor I had to fight to get a decent amount of writebacks, and ended up > being too afraid the shrink work would stop. > What about using MAX_RECLAIM_RETRIES? I like the idea of putting a limit but > I wouldn't know how to pick a fixed number. In a sense, shrinking zswap is a form of reclaim, and is triggered by reclaim, so MAX_RECLAIM_RETRIES sounds reasonable to me. We can always revisit if needed. > > > Maybe we can improve the error codes propagated through the writeback > > code as well to improve the retry logic. For example, if > > zswap_writeback_entry() returns EEXIST then retrying should be > > harmless, but ENOMEM might be harmful. Both of which are propagated as > > EAGAIN from zs_zpool_shrink()/zbud_zpool_shrink(). > > > > That would be a nice improvement indeed, I'd queue it after the LRU refactor > though. Great! > > > > > > > > > > zswap_pool_put(pool); > > > > > } > > > > > > > > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > > > > if (zswap_pool_reached_full) { > > > > > if (!zswap_can_accept()) { > > > > > ret = -ENOMEM; > > > > > - goto reject; > > > > > + goto shrink; > > > > > } else > > > > > zswap_pool_reached_full = false; > > > > > } > > > > > -- > > > > > 2.34.1 > > > > >
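Putting the points from this sub-thread together, a follow-up revision could look roughly like the sketch below: retry only on -EAGAIN, give up after MAX_RECLAIM_RETRIES failures, keep reclaiming on success until the pool is back under the acceptance threshold, and yield between iterations. This is only a sketch of the direction discussed here, not the final patch:

static void shrink_worker(struct work_struct *w)
{
	struct zswap_pool *pool = container_of(w, typeof(*pool),
						shrink_work);
	int ret, failures = 0;

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret) {
			zswap_reject_reclaim_fail++;
			/* Stop on errors other than -EAGAIN, or after too many failures. */
			if (ret != -EAGAIN || ++failures == MAX_RECLAIM_RETRIES)
				break;
		}
		cond_resched();
	} while (!zswap_can_accept());
	zswap_pool_put(pool);
}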
On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote: > This update addresses an issue with the zswap reclaim mechanism, which > hinders the efficient offloading of cold pages to disk, thereby > compromising the preservation of the LRU order and consequently > diminishing, if not inverting, its performance benefits. > > The functioning of the zswap shrink worker was found to be inadequate, > as shown by basic benchmark test. For the test, a kernel build was > utilized as a reference, with its memory confined to 1G via a cgroup and > a 5G swap file provided. The results are presented below, these are > averages of three runs without the use of zswap: > > real 46m26s > user 35m4s > sys 7m37s > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > system), the results changed to: > > real 56m4s > user 35m13s > sys 8m43s > > written_back_pages: 18 > reject_reclaim_fail: 0 > pool_limit_hit:1478 > > Besides the evident regression, one thing to notice from this data is > the extremely low number of written_back_pages and pool_limit_hit. > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > when zswap is completely full, doesn't account for a particular > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > still above the acceptance threshold. Once we include the rejections due > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > 1478 to a significant 21578266. > > Zswap is stuck in an undesirable state where it rejects pages because > it's above the acceptance threshold, yet fails to attempt memory > reclaimation. This happens because the shrink work is only queued when > zswap_frontswap_store detects that it's full and the work itself only > reclaims one page per run. > > This state results in hot pages getting written directly to disk, > while cold ones remain memory, waiting only to be invalidated. The LRU > order is completely broken and zswap ends up being just an overhead > without providing any benefits. > > This commit applies 2 changes: a) the shrink worker is set to reclaim > pages until the acceptance threshold is met and b) the task is also > enqueued when zswap is not full but still above the threshold. > > Testing this suggested update showed much better numbers: > > real 36m37s > user 35m8s > sys 9m32s > > written_back_pages: 10459423 > reject_reclaim_fail: 12896 > pool_limit_hit: 75653 > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > --- > mm/zswap.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 59da2a415fbb..2ee0775d8213 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > { > struct zswap_pool *pool = container_of(w, typeof(*pool), > shrink_work); > + int ret; Very minor nit pick, you can move the declare inside the do statement where it get used. > > - if (zpool_shrink(pool->zpool, 1, NULL)) > - zswap_reject_reclaim_fail++; > + do { > + ret = zpool_shrink(pool->zpool, 1, NULL); > + if (ret) > + zswap_reject_reclaim_fail++; > + } while (!zswap_can_accept() && ret != -EINVAL); As others point out, this while loop can be problematic. Have you find out what was the common reason causing the reclaim fail? 
Inside the shrink function there is a while loop that would be the place to implement try-harder conditions. For example, if all the pages in the LRU have already been tried once, there's no reason to keep calling the shrink function. The outer loop doesn't have this kind of visibility. Chris > zswap_pool_put(pool); > } > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > if (zswap_pool_reached_full) { > if (!zswap_can_accept()) { > ret = -ENOMEM; > - goto reject; > + goto shrink; > } else > zswap_pool_reached_full = false; > } > -- > 2.34.1 >
On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote: > > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote: > > This update addresses an issue with the zswap reclaim mechanism, which > > hinders the efficient offloading of cold pages to disk, thereby > > compromising the preservation of the LRU order and consequently > > diminishing, if not inverting, its performance benefits. > > > > The functioning of the zswap shrink worker was found to be inadequate, > > as shown by basic benchmark test. For the test, a kernel build was > > utilized as a reference, with its memory confined to 1G via a cgroup and > > a 5G swap file provided. The results are presented below, these are > > averages of three runs without the use of zswap: > > > > real 46m26s > > user 35m4s > > sys 7m37s > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > system), the results changed to: > > > > real 56m4s > > user 35m13s > > sys 8m43s > > > > written_back_pages: 18 > > reject_reclaim_fail: 0 > > pool_limit_hit:1478 > > > > Besides the evident regression, one thing to notice from this data is > > the extremely low number of written_back_pages and pool_limit_hit. > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > when zswap is completely full, doesn't account for a particular > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > still above the acceptance threshold. Once we include the rejections due > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > 1478 to a significant 21578266. > > > > Zswap is stuck in an undesirable state where it rejects pages because > > it's above the acceptance threshold, yet fails to attempt memory > > reclaimation. This happens because the shrink work is only queued when > > zswap_frontswap_store detects that it's full and the work itself only > > reclaims one page per run. > > > > This state results in hot pages getting written directly to disk, > > while cold ones remain memory, waiting only to be invalidated. The LRU > > order is completely broken and zswap ends up being just an overhead > > without providing any benefits. > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > pages until the acceptance threshold is met and b) the task is also > > enqueued when zswap is not full but still above the threshold. > > > > Testing this suggested update showed much better numbers: > > > > real 36m37s > > user 35m8s > > sys 9m32s > > > > written_back_pages: 10459423 > > reject_reclaim_fail: 12896 > > pool_limit_hit: 75653 > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > --- > > mm/zswap.c | 10 +++++++--- > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 59da2a415fbb..2ee0775d8213 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > { > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > shrink_work); > > + int ret; > Very minor nit pick, you can move the declare inside the do > statement where it get used. 
> > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > - zswap_reject_reclaim_fail++; > > + do { > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > + if (ret) > > + zswap_reject_reclaim_fail++; > > + } while (!zswap_can_accept() && ret != -EINVAL); > > As others point out, this while loop can be problematic. Do you have some specific concern that hasn't already been addressed following the other reviewers' suggestions? > > Have you find out what was the common reason causing the > reclaim fail? Inside the shrink function there is a while > loop that would be the place to perform try harder conditions. > For example, if all the page in the LRU are already try once > there's no reason to keep on calling the shrink function. > The outer loop actually doesn't have this kind of visibilities. > The most common cause I saw during testing was concurrent operations on the swap entry: if an entry is being loaded/invalidated at the same time as the zswap writeback, then errors will be returned. This scenario doesn't seem harmful at all because the failure doesn't indicate that memory cannot be allocated, just that that particular page should not be written back. As far as I understood the voiced concerns, the problem could arise if the writeback fails due to an inability to allocate memory; that could indicate that the system is under extremely high memory pressure, and this loop could aggravate the situation by adding more contention on the already scarce available memory. Since both these cases are treated equally with the retries limit, we're adopting a conservative approach by treating non-harmful errors as if they were harmful. This could certainly be improved, but I don't see it as an issue because a differentiation of the errors would actually make the loop run longer than it would without the differentiation. As I was writing to Yosry, the differentiation would be a great improvement here; I have a patch set in the queue that moves the inner reclaim loop from the zpool driver up to zswap. With that, updating the error handling would be more convenient as it would be done in one place instead of three. > > Chris > > > zswap_pool_put(pool); > > } > > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > if (zswap_pool_reached_full) { > > if (!zswap_can_accept()) { > > ret = -ENOMEM; > > - goto reject; > > + goto shrink; > > } else > > zswap_pool_reached_full = false; > > } > > -- > > 2.34.1 > >
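To make the error differentiation mentioned here concrete, a classification along the lines being discussed could look like the sketch below. The helper name is hypothetical, and it assumes the caller can see the per-entry error code rather than the -EAGAIN the zpool drivers currently fold everything into:

/*
 * Hypothetical helper: decide whether a writeback failure is a
 * transient per-entry condition (worth moving on to the next LRU
 * entry) or a sign of system-wide memory pressure (worth backing
 * off from).
 */
static bool zswap_writeback_err_is_transient(int err)
{
	switch (err) {
	case -EEXIST:	/* entry raced with a swapin or invalidation */
	case -EAGAIN:	/* generic "try the next entry" */
		return true;
	case -ENOMEM:	/* could not allocate a page to decompress into */
	default:
		return false;
	}
}

A retry loop could then keep going over transient failures and stop early on everything else.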
On Sun, May 28, 2023 at 10:53:49AM +0200, Domenico Cerasuolo wrote: > On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote: > > > > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote: > > > This update addresses an issue with the zswap reclaim mechanism, which > > > hinders the efficient offloading of cold pages to disk, thereby > > > compromising the preservation of the LRU order and consequently > > > diminishing, if not inverting, its performance benefits. > > > > > > The functioning of the zswap shrink worker was found to be inadequate, > > > as shown by basic benchmark test. For the test, a kernel build was > > > utilized as a reference, with its memory confined to 1G via a cgroup and > > > a 5G swap file provided. The results are presented below, these are > > > averages of three runs without the use of zswap: > > > > > > real 46m26s > > > user 35m4s > > > sys 7m37s > > > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > > system), the results changed to: > > > > > > real 56m4s > > > user 35m13s > > > sys 8m43s > > > > > > written_back_pages: 18 > > > reject_reclaim_fail: 0 > > > pool_limit_hit:1478 > > > > > > Besides the evident regression, one thing to notice from this data is > > > the extremely low number of written_back_pages and pool_limit_hit. > > > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > > when zswap is completely full, doesn't account for a particular > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > > still above the acceptance threshold. Once we include the rejections due > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > > 1478 to a significant 21578266. > > > > > > Zswap is stuck in an undesirable state where it rejects pages because > > > it's above the acceptance threshold, yet fails to attempt memory > > > reclaimation. This happens because the shrink work is only queued when > > > zswap_frontswap_store detects that it's full and the work itself only > > > reclaims one page per run. > > > > > > This state results in hot pages getting written directly to disk, > > > while cold ones remain memory, waiting only to be invalidated. The LRU > > > order is completely broken and zswap ends up being just an overhead > > > without providing any benefits. > > > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > > pages until the acceptance threshold is met and b) the task is also > > > enqueued when zswap is not full but still above the threshold. 
> > > > > > Testing this suggested update showed much better numbers: > > > > > > real 36m37s > > > user 35m8s > > > sys 9m32s > > > > > > written_back_pages: 10459423 > > > reject_reclaim_fail: 12896 > > > pool_limit_hit: 75653 > > > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > > --- > > > mm/zswap.c | 10 +++++++--- > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index 59da2a415fbb..2ee0775d8213 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > > { > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > shrink_work); > > > + int ret; > > Very minor nit pick, you can move the declare inside the do > > statement where it get used. > > > > > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > > - zswap_reject_reclaim_fail++; > > > + do { > > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > > + if (ret) > > > + zswap_reject_reclaim_fail++; > > > + } while (!zswap_can_accept() && ret != -EINVAL); > > > > As others point out, this while loop can be problematic. > > Do you have some specific concern that's not been already addressed following > other reviewers' suggestions? Mostly as it is, the outer loop seems the wrong place to do the retry because of lacking the error context and properly continuing the iteration. > > > > > Have you find out what was the common reason causing the > > reclaim fail? Inside the shrink function there is a while > > loop that would be the place to perform try harder conditions. > > For example, if all the page in the LRU are already try once > > there's no reason to keep on calling the shrink function. > > The outer loop actually doesn't have this kind of visibilities. > > > > The most common cause I saw during testing was concurrent operations > on the swap entry, if an entry is being loaded/invalidated at the same time > as the zswap writeback, then errors will be returned. This scenario doesn't > seem harmful at all because the failure doesn't indicate that memory > cannot be allocated, just that that particular page should not be written back. > > As far as I understood the voiced concerns, the problem could arise if the > writeback fails due to an impossibility to allocate memory, that could indicate > that the system is in extremely high memory pressure and this loop could > aggravate the situation by adding more contention on the already scarce > available memory. It seems to me the inner loop should continue if page writeback fails due to loaded/invalidate, abort due to memory pressure. The key difference is that one is transient while others might be more permanent You definitely don't want to retry if it is already ENOMEM. > > Since both these cases are treated equally with the retries limit, we're > adopting a conservative approach in considering non-harmful errors as if > they were harmful. This could certainly be improved, but I don't see it as an > issue because a differentiation of the errors would actually make the loop run > longer than it would without the differentiation. We don't want to run longer if error is persistent. > > As I was writing to Yosry, the differentiation would be a great improvement > here, I just have a patch set in the queue that moves the inner reclaim loop > from the zpool driver up to zswap. 
> With that, updating the error handling > would be more convenient as it would be done in one place instead of three.i This has tricky complications as well. The current shrink interface doesn't support continuing from the previous error position. If you want to avoid a repeat attempt if the page has a writeback error, you kind of need a way to skip that page. Chris > > > > > Chris > > > zswap_pool_put(pool); > > > } > > > > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > > > if (zswap_pool_reached_full) { > > > if (!zswap_can_accept()) { > > > ret = -ENOMEM; > > > - goto reject; > > > + goto shrink; > > > } else > > > zswap_pool_reached_full = false; > > > } > > > -- > > > 2.34.1 > > >
On Mon, May 29, 2023 at 02:00:46PM -0700, Chris Li wrote: > On Sun, May 28, 2023 at 10:53:49AM +0200, Domenico Cerasuolo wrote: > > On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote: > > > > This update addresses an issue with the zswap reclaim mechanism, which > > > > hinders the efficient offloading of cold pages to disk, thereby > > > > compromising the preservation of the LRU order and consequently > > > > diminishing, if not inverting, its performance benefits. > > > > > > > > The functioning of the zswap shrink worker was found to be inadequate, > > > > as shown by basic benchmark test. For the test, a kernel build was > > > > utilized as a reference, with its memory confined to 1G via a cgroup and > > > > a 5G swap file provided. The results are presented below, these are > > > > averages of three runs without the use of zswap: > > > > > > > > real 46m26s > > > > user 35m4s > > > > sys 7m37s > > > > > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > > > system), the results changed to: > > > > > > > > real 56m4s > > > > user 35m13s > > > > sys 8m43s > > > > > > > > written_back_pages: 18 > > > > reject_reclaim_fail: 0 > > > > pool_limit_hit:1478 > > > > > > > > Besides the evident regression, one thing to notice from this data is > > > > the extremely low number of written_back_pages and pool_limit_hit. > > > > > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > > > when zswap is completely full, doesn't account for a particular > > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > > > still above the acceptance threshold. Once we include the rejections due > > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > > > 1478 to a significant 21578266. > > > > > > > > Zswap is stuck in an undesirable state where it rejects pages because > > > > it's above the acceptance threshold, yet fails to attempt memory > > > > reclaimation. This happens because the shrink work is only queued when > > > > zswap_frontswap_store detects that it's full and the work itself only > > > > reclaims one page per run. > > > > > > > > This state results in hot pages getting written directly to disk, > > > > while cold ones remain memory, waiting only to be invalidated. The LRU > > > > order is completely broken and zswap ends up being just an overhead > > > > without providing any benefits. > > > > > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > > > pages until the acceptance threshold is met and b) the task is also > > > > enqueued when zswap is not full but still above the threshold. 
> > > > > > > > Testing this suggested update showed much better numbers: > > > > > > > > real 36m37s > > > > user 35m8s > > > > sys 9m32s > > > > > > > > written_back_pages: 10459423 > > > > reject_reclaim_fail: 12896 > > > > pool_limit_hit: 75653 > > > > > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > > > --- > > > > mm/zswap.c | 10 +++++++--- > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index 59da2a415fbb..2ee0775d8213 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > > > { > > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > > shrink_work); > > > > + int ret; > > > Very minor nit pick, you can move the declare inside the do > > > statement where it get used. > > > > > > > > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > > > - zswap_reject_reclaim_fail++; > > > > + do { > > > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > > > + if (ret) > > > > + zswap_reject_reclaim_fail++; > > > > + } while (!zswap_can_accept() && ret != -EINVAL); > > > > > > As others point out, this while loop can be problematic. > > > > Do you have some specific concern that's not been already addressed following > > other reviewers' suggestions? > > Mostly as it is, the outer loop seems the wrong > place to do the retry because of lacking the error > context and properly continuing the iteration. > > > > > > > > > > Have you find out what was the common reason causing the > > > reclaim fail? Inside the shrink function there is a while > > > loop that would be the place to perform try harder conditions. > > > For example, if all the page in the LRU are already try once > > > there's no reason to keep on calling the shrink function. > > > The outer loop actually doesn't have this kind of visibilities. > > > > > > > The most common cause I saw during testing was concurrent operations > > on the swap entry, if an entry is being loaded/invalidated at the same time > > as the zswap writeback, then errors will be returned. This scenario doesn't > > seem harmful at all because the failure doesn't indicate that memory > > cannot be allocated, just that that particular page should not be written back. > > > > As far as I understood the voiced concerns, the problem could arise if the > > writeback fails due to an impossibility to allocate memory, that could indicate > > that the system is in extremely high memory pressure and this loop could > > aggravate the situation by adding more contention on the already scarce > > available memory. > > It seems to me the inner loop should continue if page writeback > fails due to loaded/invalidate, abort due to memory pressure. > > The key difference is that one is transient while others might > be more permanent > > You definitely don't want to retry if it is already ENOMEM. This is actually a separate problem that Domenico was preparing a patch for. Writeback is a reclaim operation, so it should have access to the memory reserves (PF_MEMALLOC). Persistent -ENOMEM shouldn't exist. It actually used have reserve access when writeback happened directly from inside the store (reclaim -> swapout path). When writeback was moved out to the async worker, it lost reserve access as an unintended side effect. This should be fixed, so we don't OOM while we try to free memory. 
Should it be fixed before merging this patch? I don't think the ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't really persistent either. Retrying a few times in that case certainly doesn't seem to make things worse. > > As I was writing to Yosry, the differentiation would be a great improvement > > here, I just have a patch set in the queue that moves the inner reclaim loop > > from the zpool driver up to zswap. With that, updating the error handling > > would be more convenient as it would be done in one place instead of three.i > > This has tricky complications as well. The current shrink interface > doesn't support continuing from the previous error position. If you want > to avoid a repeat attempt if the page has a writeback error, you kinda > of need a way to skip that page. A page that fails to reclaim is put back to the tail of the LRU, so for all intents and purposes it will be skipped. In the rare and extreme case where it's the only page left on the list, I again don't see how retrying a few times will make the situation worse. In practice, IMO there is little upside in trying to be more discerning about the error codes. Simple seems better here.
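For context, one way the lost reserve access could be restored to the async worker is sketched below, assuming memalloc_noreclaim_save()/memalloc_noreclaim_restore() (which set and clear PF_MEMALLOC) are the chosen mechanism; it is shown on the pre-patch single-shot body for brevity and is only an illustration of the idea, not the patch Domenico was preparing:

#include <linux/sched/mm.h>

static void shrink_worker(struct work_struct *w)
{
	struct zswap_pool *pool = container_of(w, typeof(*pool),
						shrink_work);
	unsigned int noreclaim_flag;

	/*
	 * Writeback is reclaim: let allocations made while freeing
	 * zswap memory dip into the reserves instead of recursing
	 * into reclaim or failing outright.
	 */
	noreclaim_flag = memalloc_noreclaim_save();
	if (zpool_shrink(pool->zpool, 1, NULL))
		zswap_reject_reclaim_fail++;
	memalloc_noreclaim_restore(noreclaim_flag);

	zswap_pool_put(pool);
}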
On Tue, May 30, 2023 at 12:13:41AM -0400, Johannes Weiner wrote: > On Mon, May 29, 2023 at 02:00:46PM -0700, Chris Li wrote: > > On Sun, May 28, 2023 at 10:53:49AM +0200, Domenico Cerasuolo wrote: > > > On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote: > > > > > > > > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote: > > > > > This update addresses an issue with the zswap reclaim mechanism, which > > > > > hinders the efficient offloading of cold pages to disk, thereby > > > > > compromising the preservation of the LRU order and consequently > > > > > diminishing, if not inverting, its performance benefits. > > > > > > > > > > The functioning of the zswap shrink worker was found to be inadequate, > > > > > as shown by basic benchmark test. For the test, a kernel build was > > > > > utilized as a reference, with its memory confined to 1G via a cgroup and > > > > > a 5G swap file provided. The results are presented below, these are > > > > > averages of three runs without the use of zswap: > > > > > > > > > > real 46m26s > > > > > user 35m4s > > > > > sys 7m37s > > > > > > > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G > > > > > system), the results changed to: > > > > > > > > > > real 56m4s > > > > > user 35m13s > > > > > sys 8m43s > > > > > > > > > > written_back_pages: 18 > > > > > reject_reclaim_fail: 0 > > > > > pool_limit_hit:1478 > > > > > > > > > > Besides the evident regression, one thing to notice from this data is > > > > > the extremely low number of written_back_pages and pool_limit_hit. > > > > > > > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store > > > > > when zswap is completely full, doesn't account for a particular > > > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to > > > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is > > > > > still above the acceptance threshold. Once we include the rejections due > > > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from > > > > > 1478 to a significant 21578266. > > > > > > > > > > Zswap is stuck in an undesirable state where it rejects pages because > > > > > it's above the acceptance threshold, yet fails to attempt memory > > > > > reclaimation. This happens because the shrink work is only queued when > > > > > zswap_frontswap_store detects that it's full and the work itself only > > > > > reclaims one page per run. > > > > > > > > > > This state results in hot pages getting written directly to disk, > > > > > while cold ones remain memory, waiting only to be invalidated. The LRU > > > > > order is completely broken and zswap ends up being just an overhead > > > > > without providing any benefits. > > > > > > > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim > > > > > pages until the acceptance threshold is met and b) the task is also > > > > > enqueued when zswap is not full but still above the threshold. 
> > > > > > > > > > Testing this suggested update showed much better numbers: > > > > > > > > > > real 36m37s > > > > > user 35m8s > > > > > sys 9m32s > > > > > > > > > > written_back_pages: 10459423 > > > > > reject_reclaim_fail: 12896 > > > > > pool_limit_hit: 75653 > > > > > > > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit") > > > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> > > > > > --- > > > > > mm/zswap.c | 10 +++++++--- > > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > > index 59da2a415fbb..2ee0775d8213 100644 > > > > > --- a/mm/zswap.c > > > > > +++ b/mm/zswap.c > > > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w) > > > > > { > > > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > > > shrink_work); > > > > > + int ret; > > > > Very minor nit pick, you can move the declare inside the do > > > > statement where it get used. > > > > > > > > > > > > > > - if (zpool_shrink(pool->zpool, 1, NULL)) > > > > > - zswap_reject_reclaim_fail++; > > > > > + do { > > > > > + ret = zpool_shrink(pool->zpool, 1, NULL); > > > > > + if (ret) > > > > > + zswap_reject_reclaim_fail++; > > > > > + } while (!zswap_can_accept() && ret != -EINVAL); > > > > > > > > As others point out, this while loop can be problematic. > > > > > > Do you have some specific concern that's not been already addressed following > > > other reviewers' suggestions? > > > > Mostly as it is, the outer loop seems the wrong > > place to do the retry because of lacking the error > > context and properly continuing the iteration. > > > > > > > > > > > > > > > Have you find out what was the common reason causing the > > > > reclaim fail? Inside the shrink function there is a while > > > > loop that would be the place to perform try harder conditions. > > > > For example, if all the page in the LRU are already try once > > > > there's no reason to keep on calling the shrink function. > > > > The outer loop actually doesn't have this kind of visibilities. > > > > > > > > > > The most common cause I saw during testing was concurrent operations > > > on the swap entry, if an entry is being loaded/invalidated at the same time > > > as the zswap writeback, then errors will be returned. This scenario doesn't > > > seem harmful at all because the failure doesn't indicate that memory > > > cannot be allocated, just that that particular page should not be written back. > > > > > > As far as I understood the voiced concerns, the problem could arise if the > > > writeback fails due to an impossibility to allocate memory, that could indicate > > > that the system is in extremely high memory pressure and this loop could > > > aggravate the situation by adding more contention on the already scarce > > > available memory. > > > > It seems to me the inner loop should continue if page writeback > > fails due to loaded/invalidate, abort due to memory pressure. > > > > The key difference is that one is transient while others might > > be more permanent > > > > You definitely don't want to retry if it is already ENOMEM. > > This is actually a separate problem that Domenico was preparing a > patch for. Writeback is a reclaim operation, so it should have access > to the memory reserves (PF_MEMALLOC). Persistent -ENOMEM shouldn't > exist. It actually used have reserve access when writeback happened > directly from inside the store (reclaim -> swapout path). 
> When writeback was moved out to the async worker, it lost reserve access as > an unintended side effect. This should be fixed, so we don't OOM while > we try to free memory. Thanks for pointing out -ENOMEM shouldn't be persistent. Points taken. The original point of not retrying the persistent error still holds. > > Should it be fixed before merging this patch? I don't think the > > ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't > > really persistent either. Retrying a few times in that case certainly > > doesn't seem to make things worse. If you already know the error is persistent, retrying is wasting CPU. It can potentially hold locks during the retry, which can slow someone else down. > > > As I was writing to Yosry, the differentiation would be a great improvement > > > here, I just have a patch set in the queue that moves the inner reclaim loop > > > from the zpool driver up to zswap. With that, updating the error handling > > > would be more convenient as it would be done in one place instead of three.i > > > > This has tricky complications as well. The current shrink interface > > doesn't support continuing from the previous error position. If you want > > to avoid a repeat attempt if the page has a writeback error, you kinda > > of need a way to skip that page. > > A page that fails to reclaim is put back to the tail of the LRU, so > for all intents and purposes it will be skipped. In the rare and Do you mean the page is treated as hot again? Wouldn't that be undesirable from the app's point of view? > extreme case where it's the only page left on the list, I again don't > see how retrying a few times will make the situation worse. > > In practice, IMO there is little upside in trying to be more > discerning about the error codes. Simple seems better here. Just trying to think about what should be the precise loop termination condition here. I still feel blindly trying a few times is a very imprecise condition. Chris
On Tue, May 30, 2023 at 07:51:23AM -0700, Chris Li wrote: > Thanks for pointing out -ENOMEM shouldn't be persistent. > Points taken. > > The original point of not retrying the persistent error > still holds. Okay, but what persistent errors are you referring to? Aside from -ENOMEM, writeback_entry will fail on concurrent swap invalidation or a racing swapin fault. In both cases we should absolutely keep trying other entries until the goal is met. > > Should it be fixed before merging this patch? I don't think the > > ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't > > really persistent either. Retrying a few times in that case certainly > > doesn't seem to make things worse. > > If you already know the error is persistent, retrying is wasting > CPU. It can pertancial hold locks during the retry, which can > slow someone else down. That's a bit of a truism. How does this pertain to the zswap reclaim situation? > > > > As I was writing to Yosry, the differentiation would be a great improvement > > > > here, I just have a patch set in the queue that moves the inner reclaim loop > > > > from the zpool driver up to zswap. With that, updating the error handling > > > > would be more convenient as it would be done in one place instead of three.i > > > > > > This has tricky complications as well. The current shrink interface > > > doesn't support continuing from the previous error position. If you want > > > to avoid a repeat attempt if the page has a writeback error, you kinda > > > of need a way to skip that page. > > > > A page that fails to reclaim is put back to the tail of the LRU, so > > for all intents and purposes it will be skipped. In the rare and > > Do you mean the page is treated as hot again? > > Wouldn't that be undesirable from the app's point of view? That's current backend LRU behavior. Is it optimal? That's certainly debatable. But it's tangential to this patch. The point is that capping retries to a fixed number of failures works correctly as a safety precaution and introduces no (new) undesirable behavior. It's entirely moot once we refactor the backend page LRU to the zswap entry LRU. The only time we'll fail to reclaim an entry is if we race with something already freeing it, so it doesn't really matter where we put it. > > extreme case where it's the only page left on the list, I again don't > > see how retrying a few times will make the situation worse. > > > > In practice, IMO there is little upside in trying to be more > > discerning about the error codes. Simple seems better here. > > Just trying to think about what should be the precise loop termination > condition here. > > I still feel blindly trying a few times is a very imprecise condition. The precise termination condition is when can_accept() returns true again. The safety cap is only added as precaution to avoid infinite loops if something goes wrong or unexpected, now or in the future.
On Tue, May 30, 2023 at 11:55:19AM -0400, Johannes Weiner wrote: > On Tue, May 30, 2023 at 07:51:23AM -0700, Chris Li wrote: > > Thanks for pointing out -ENOMEM shouldn't be persistent. > > Points taken. > > > > The original point of not retrying the persistent error > > still holds. > > Okay, but what persistent errors are you referring to? Maybe ENOMEM is a bad example. How about if the swap device just went bad and can't complete new IO writes? > Aside from -ENOMEM, writeback_entry will fail on concurrent swap > invalidation or a racing swapin fault. In both cases we should > absolutely keep trying other entries until the goal is met. How about a narrower fix recognizing those error cases and making the inner loop continue in those errors? > > > Should it be fixed before merging this patch? I don't think the > > > ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't > > > really persistent either. Retrying a few times in that case certainly > > > doesn't seem to make things worse. > > > > If you already know the error is persistent, retrying is wasting > > CPU. It can pertancial hold locks during the retry, which can > > slow someone else down. > > That's a bit of a truism. How does this pertain to the zswap reclaim > situation? See the above narrower fix alternative. > > > > > > As I was writing to Yosry, the differentiation would be a great improvement > > > > > here, I just have a patch set in the queue that moves the inner reclaim loop > > > > > from the zpool driver up to zswap. With that, updating the error handling > > > > > would be more convenient as it would be done in one place instead of three.i > > > > > > > > This has tricky complications as well. The current shrink interface > > > > doesn't support continuing from the previous error position. If you want > > > > to avoid a repeat attempt if the page has a writeback error, you kinda > > > > of need a way to skip that page. > > > > > > A page that fails to reclaim is put back to the tail of the LRU, so > > > for all intents and purposes it will be skipped. In the rare and > > > > Do you mean the page is treated as hot again? > > > > Wouldn't that be undesirable from the app's point of view? > > That's current backend LRU behavior. Is it optimal? That's certainly > debatable. But it's tangential to this patch. The point is that > capping retries to a fixed number of failures works correctly as a > safety precaution and introduces no (new) undesirable behavior. > > It's entirely moot once we refactor the backend page LRU to the zswap > entry LRU. The only time we'll fail to reclaim an entry is if we race > with something already freeing it, so it doesn't really matter where > we put it. Agree with you there. A bit side tracked. > > > extreme case where it's the only page left on the list, I again don't > > > see how retrying a few times will make the situation worse. > > > > > > In practice, IMO there is little upside in trying to be more > > > discerning about the error codes. Simple seems better here. > > > > Just trying to think about what should be the precise loop termination > > condition here. > > > > I still feel blindly trying a few times is a very imprecise condition. > > The precise termination condition is when can_accept() returns true > again. The safety cap is only added as precaution to avoid infinite > loops if something goes wrong or unexpected, now or in the future. In my mind, that statement already suggests can_accept() is not *precise*, considering the avoid infinite loop. e.g. 
Do we know what the optimal cap value is, and why that value is optimal? Putting the definition of precise aside, I do see that the unconditional retry can have unwanted effects. Chris
On Tue, May 30, 2023 at 11:18:51AM -0700, Chris Li wrote: > On Tue, May 30, 2023 at 11:55:19AM -0400, Johannes Weiner wrote: > > On Tue, May 30, 2023 at 07:51:23AM -0700, Chris Li wrote: > > > Thanks for pointing out -ENOMEM shouldn't be persistent. > > > Points taken. > > > > > > The original point of not retrying the persistent error > > > still holds. > > > > Okay, but what persistent errors are you referring to? > > Maybe ENOMEM is a bad example. How about if the swap device > just went bad and can't complete new IO writes? This is actually outside the scope of zswap, and handled by the swapcache (end_swap_bio_write). Once the IO is submitted, zswap will ax its copy and leave the rest to the swapcache. It behaves the same way as if zswap had never been involved to begin with when the swap out fails on IO errors. From a zswap perspective, there are no persistent errors in moving a zswap entry back into the swapcache. Not just currently, but generally. > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap > > invalidation or a racing swapin fault. In both cases we should > > absolutely keep trying other entries until the goal is met. > > How about a narrower fix recognizing those error cases and making > the inner loop continue in those errors? Right, I just don't really see the value proposition of this complication, and I see some downsides (see below). No single entry error should ever cause us to stop the wider reclaim loop. > > > > extreme case where it's the only page left on the list, I again don't > > > > see how retrying a few times will make the situation worse. > > > > > > > > In practice, IMO there is little upside in trying to be more > > > > discerning about the error codes. Simple seems better here. > > > > > > Just trying to think about what should be the precise loop termination > > > condition here. > > > > > > I still feel blindly trying a few times is a very imprecise condition. > > > > The precise termination condition is when can_accept() returns true > > again. The safety cap is only added as precaution to avoid infinite > > loops if something goes wrong or unexpected, now or in the future. > > In my mind, that statement already suggests can_accept() is not > *precise*, considering the avoid infinite loop. > e.g. Do we know what is the optimal cap value and why that value > is optical? Oh but it is precise. That's the goal we want to accomplish. The cap is just so that in case something unexpectedly goes wrong (a bug), we fail gracefully and don't lock up the machine. The same reason we prefer WARN_ONs over BUG_ONs if we can, to avoid crashes. That's really all there is to it, and it strikes me as a reasonable and robust design choice. It's fine to limp along or be suboptimal after such a bug happens; the bar is avoiding an infinite loop, nothing else. Your suggestion of whitelisting certain errors is more complicated, but also less robust: in case an entry error does by some accident become persistent for the whole LRU, we're locking up the host. We'd rather catch a bug like this by seeing spikes in the reclaim failure rate than by losing production machines. > Putting the definition of precise aside, I do see the unconditional > retry can have unwanted effects. I hope I could address this above. But if not, please share your concerns. Thanks!
On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote: > > Maybe ENOMEM is a bad example. How about if the swap device > > just went bad and can't complete new IO writes? > > This is actually outside the scope of zswap, and handled by the > swapcache (end_swap_bio_write). > > Once the IO is submitted, zswap will ax its copy and leave the rest to > the swapcache. It behaves the same way as if zswap had never been > involved to begin with when the swap out fails on IO errors. > > From a zswap perspective, there are no persistent errors in moving a > zswap entry back into the swapcache. Not just currently, but generally. Again, you are right that this zswap writeback is async. So the writeback error is NOT going to propagate to the shrink function. With the current three pool backends that I looked at {zbud, z3fold, zsmalloc}, they all retry internally 8 times. Adding more retries does not fundamentally change the existing behavior. I looked at all the possible error codes generated inside the reclaim code; the only noticeable errors are ENOMEM and concurrent swap invalidation or a racing swapin fault. BTW, zswap reclaim consumes memory. Keeping on looping on ENOMEM might cause more OOM. But that can exist in current code as well. > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap > > > invalidation or a racing swapin fault. In both cases we should > > > absolutely keep trying other entries until the goal is met. > > > > How about a narrower fix recognizing those error cases and making > > the inner loop continue in those errors? > > Right, I just I don't really see the value proposition of this > complication, and I see some downsides (see below). No single entry > error should ever cause us to stop the wider reclaim loop. That is, until the current LRU list has been gone through once. I expect repeating the same list yields fewer reclaimed pages. > > > > > > extreme case where it's the only page left on the list, I again don't > > > > > see how retrying a few times will make the situation worse. > > > > > > > > > > In practice, IMO there is little upside in trying to be more > > > > > discerning about the error codes. Simple seems better here. > > > > > > > > Just trying to think about what should be the precise loop termination > > > > condition here. > > > > > > > > I still feel blindly trying a few times is a very imprecise condition. > > > > > > The precise termination condition is when can_accept() returns true > > > again. The safety cap is only added as precaution to avoid infinite > > > loops if something goes wrong or unexpected, now or in the future. > > In my mind, that statement already suggests can_accept() is not > > *precise*, considering the avoid infinite loop. > > e.g. Do we know what is the optimal cap value and why that value > > is optical? > > Oh but it is precise. That's the goal we want to accomplish. I understand it is the goal; the precise condition I am talking about is the loop termination condition, and can_accept() is not the only one. Anyway, let's move on. > The cap is just so that in case something unexpectedly goes wrong (a > bug), we fail gracefully and don't lock up the machine. The same > reason we prefer WARN_ONs over BUG_ONs if we can, to avoid > crashes.
> > Your suggestion of whitelisting certain errors is more complicated, > but also less robust: in case an entry error does by some accident > become persistent for the whole LRU, we're locking up the host. We'd > rather catch a bug like this by seeing spikes in the reclaim failure > rate than by losing production machines. > > > Putting the definition of precise aside, I do see the unconditional > > retry can have unwanted effects. > > I hope I could address this above. But if not, please share your > concerns. Thanks for the discussion. I am less concerned about the retry now. Retry on EAGAIN might be the simplest way to proceed. Outside of the scope of this patch, I am still surprised to see such a high number of retries caused by race conditions. There are 8 inner-loop retries already, so the actual number of pages retried needs to be multiplied by 8. If there is a reproducer script, I want to reproduce this locally to understand it better. I wish there were ways to reduce the retries. Another idea is that we can start the shrinking once the pool max is reached, and try to reduce to the threshold earlier. Chris
On Tue, May 30, 2023 at 06:06:38PM -0700, Chris Li wrote: > On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote: > > > Maybe ENOMEM is a bad example. How about if the swap device > > > just went bad and can't complete new IO writes? > > > > This is actually outside the scope of zswap, and handled by the > > swapcache (end_swap_bio_write). > > > > Once the IO is submitted, zswap will ax its copy and leave the rest to > > the swapcache. It behaves the same way as if zswap had never been > > involved to begin with when the swap out fails on IO errors. > > > > From a zswap perspective, there are no persistent errors in moving a > > zswap entry back into the swapcache. Not just currently, but generally. > Again, you are right that this zswap writeback is async. > So the writeback error is NOT going to propagate to the > shrink function. > > With the current three pool backends that I looked at{zbud, > z3fold,zsmalloc} they all have internal retry 8 times. > Adding more retry did not fundamentally change the existing > behavior. Ah, but they're looping over different things. The internal loop in the zs_reclaim_page() function is about walking the LRU list until at least one backing page is freed. Then there is zs_zpool_shrink() which calls zs_reclaim_page() until the requested number of pages are freed. Finally, there is shrink_worker(), which calls zs_zpool_shrink(). It currently calls it for a single page when woken up during a store that hits the zswap pool limit. This is the problematic one, because zswap is very unlikely to go back to accepting stores after one page freed. Domenico's patch isn't adding more retries for error conditions. It ensures the pool is shrunk back down to where it accepts stores again. The reason that it now looks at errors as well isn't to retry over them (that's zs_reclaim_page()'s job). It's to avoid an infinite loop in case there is an unexpectedly high rate of errors across a whole series of pages (suggesting there is a bug of some kind). > I look at all the possible error codes generated inside > the reclaim code, the only noticeable errors are ENOMEM > and concurrent swap invalidation or a racing swapin fault. Right. > BTW, zswap reclaim consumes memory. Keep on looping ENOMEM > might cause more OOM. But that can exist in current code > as well. Right. And this is temporary. Zswap will allocate a page to decompress in, add it to the swapcache and kick off the IO. Once the page is written out, it'll be reclaimed again. So while the consumption increases temporarily, the end result is a net reduction by the amount of compressed data that was written back from zswap. This is typical for other types of reclaim as well, e.g. allocating entries in the swapcache tree, allocating bios and IO requests... > > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap > > > > invalidation or a racing swapin fault. In both cases we should > > > > absolutely keep trying other entries until the goal is met. > > > > > > How about a narrower fix recognizing those error cases and making > > > the inner loop continue in those errors? > > > > Right, I just I don't really see the value proposition of this > > complication, and I see some downsides (see below). No single entry > > error should ever cause us to stop the wider reclaim loop. > > That is until the current LRU list has been through once. > I expect repeating the same list yields less reclaimed pages. 
If we see failures due to invalidation races, the entries will be freed and the pool will shrink anyway, thus getting us closer to the can_accept() condition. We'll stop looping in the shrinker once enough entries have been freed - whether we reclaimed them ourselves, or somebody else invalidated them.
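Schematically, the three layers described above nest as in the sketch below (pseudocode-level, using the zsmalloc names from this thread rather than the exact upstream code; the 8-entry retry is the per-backend value mentioned earlier):

/* Innermost: walk the backend LRU until one backing page is freed,
 * trying up to 'retries' entries before giving up. */
static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries);

/* Middle: free the requested number of backing pages. */
static int zs_zpool_shrink(void *pool, unsigned int pages,
			   unsigned int *reclaimed)
{
	unsigned int freed = 0;
	int ret = 0;

	while (freed < pages) {
		ret = zs_reclaim_page(pool, 8);
		if (ret)
			break;
		freed++;
	}
	if (reclaimed)
		*reclaimed = freed;
	return ret;
}

/* Outermost: zswap's shrink_worker() used to request a single page per
 * wakeup; with the patch it keeps requesting pages until
 * zswap_can_accept() returns true (or the failure cap is hit). */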
On Wed, May 31, 2023 at 3:06 AM Chris Li <chrisl@kernel.org> wrote: > > On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote: > > > Maybe ENOMEM is a bad example. How about if the swap device > > > just went bad and can't complete new IO writes? > > > > This is actually outside the scope of zswap, and handled by the > > swapcache (end_swap_bio_write). > > > > Once the IO is submitted, zswap will ax its copy and leave the rest to > > the swapcache. It behaves the same way as if zswap had never been > > involved to begin with when the swap out fails on IO errors. > > > > From a zswap perspective, there are no persistent errors in moving a > > zswap entry back into the swapcache. Not just currently, but generally. > Again, you are right that this zswap writeback is async. > So the writeback error is NOT going to propagate to the > shrink function. > > With the current three pool backends that I looked at{zbud, > z3fold,zsmalloc} they all have internal retry 8 times. > Adding more retry did not fundamentally change the existing > behavior. > > I look at all the possible error codes generated inside > the reclaim code, the only noticeable errors are ENOMEM > and concurrent swap invalidation or a racing swapin fault. > > BTW, zswap reclaim consumes memory. Keep on looping ENOMEM > might cause more OOM. But that can exist in current code > as well. > > > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap > > > > invalidation or a racing swapin fault. In both cases we should > > > > absolutely keep trying other entries until the goal is met. > > > > > > How about a narrower fix recognizing those error cases and making > > > the inner loop continue in those errors? > > > > Right, I just I don't really see the value proposition of this > > complication, and I see some downsides (see below). No single entry > > error should ever cause us to stop the wider reclaim loop. > > That is until the current LRU list has been through once. > I expect repeating the same list yields less reclaimed pages. > > > > > > > > > extreme case where it's the only page left on the list, I again don't > > > > > > see how retrying a few times will make the situation worse. > > > > > > > > > > > > In practice, IMO there is little upside in trying to be more > > > > > > discerning about the error codes. Simple seems better here. > > > > > > > > > > Just trying to think about what should be the precise loop termination > > > > > condition here. > > > > > > > > > > I still feel blindly trying a few times is a very imprecise condition. > > > > > > > > The precise termination condition is when can_accept() returns true > > > > again. The safety cap is only added as precaution to avoid infinite > > > > loops if something goes wrong or unexpected, now or in the future. > > > > > > In my mind, that statement already suggests can_accept() is not > > > *precise*, considering the avoid infinite loop. > > > e.g. Do we know what is the optimal cap value and why that value > > > is optical? > > > > Oh but it is precise. That's the goal we want to accomplish. > > I understand it is the goal, the precise condition I am > talking about is the loop termination condition, can_accept() > is not the only one. Anyway, let's move on. > > > > The cap is just so that in case something unexpectedly goes wrong (a > > bug), we fail gracefully and don't lock up the machine. The same > > reason we prefer WARN_ONs over BUG_ONs if we can, to avoid > > crashes. 
> > That's really all there is too it, and it strikes me as > > reasonable and robust design choice. It's fine to limp along or be > > suboptimal after such a bug happens; the bar is avoiding an infinite > > loop, nothing else. > > > > Your suggestion of whitelisting certain errors is more complicated, > > but also less robust: in case an entry error does by some accident > > become persistent for the whole LRU, we're locking up the host. We'd > > rather catch a bug like this by seeing spikes in the reclaim failure > > rate than by losing production machines. > > > > > Putting the definition of precise aside, I do see the unconditional > > > retry can have unwanted effects. > > > > I hope I could address this above. But if not, please share your > > concerns. > Thanks for the discussion. I am less concerned about the retry now. > Retry on EAGAIN might be the simplest way to proceed. > > Outside of the scope of this patch, I am still surprised to see > such a high number of retries caused by race conditions. There are > 8 inner loop retry already. The actual pages retried need to > times 8. > > If there is a reproducer script, I want to local repo this > to understand better. Wish there are ways to reduce the retry. With `stress` you can reproduce the errors quite easily. Since you're not going to see many (if any at all) writebacks _without_ this patch, you can let the zpool fill while keeping the default max_pool_percent value of 20, and then, once it is full, switch max_pool_percent to 1. > > Another idea is that we can start the shrinking once the > pool max was reached. Try to reduce to the threshold earlier. The shrinking already starts when pool max is reached, with and without the proposed patch. If you're referring to a new threshold that would kick off the shrinking, that could be a nice improvement to the overall mechanism, and it would fit nicely on top of this patch. > > Chris
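As a rough sketch of the "new threshold that would kick off the shrinking" idea (both zswap_shrink_start_percent and zswap_should_start_shrink() below are hypothetical names, introduced only to illustrate the mechanism on top of the existing shrink_wq/shrink_work infrastructure):

/* Hypothetical knob: start background shrinking once the pool crosses
 * this percentage of its configured maximum, instead of waiting for
 * the hard limit to be hit. */
static unsigned int zswap_shrink_start_percent = 90;

static bool zswap_should_start_shrink(void)
{
	unsigned long max_pages, used_pages;

	max_pages = totalram_pages() * zswap_max_pool_percent / 100;
	used_pages = DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);

	return used_pages > max_pages * zswap_shrink_start_percent / 100;
}

/* In zswap_frontswap_store(), after a successful store, the work could
 * then be queued proactively: */
	if (zswap_should_start_shrink())
		queue_work(shrink_wq, &pool->shrink_work);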
diff --git a/mm/zswap.c b/mm/zswap.c
index 59da2a415fbb..2ee0775d8213 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
 {
 	struct zswap_pool *pool = container_of(w, typeof(*pool),
 						shrink_work);
+	int ret;
 
-	if (zpool_shrink(pool->zpool, 1, NULL))
-		zswap_reject_reclaim_fail++;
+	do {
+		ret = zpool_shrink(pool->zpool, 1, NULL);
+		if (ret)
+			zswap_reject_reclaim_fail++;
+	} while (!zswap_can_accept() && ret != -EINVAL);
 	zswap_pool_put(pool);
 }
 
@@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	if (zswap_pool_reached_full) {
 		if (!zswap_can_accept()) {
 			ret = -ENOMEM;
-			goto reject;
+			goto shrink;
 		} else
 			zswap_pool_reached_full = false;
 	}