Message ID | 20231011051117.2289518-1-hezhongkun.hzk@bytedance.com |
---|---|
State | New |
Headers |
From: Zhongkun He <hezhongkun.hzk@bytedance.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Zhongkun He <hezhongkun.hzk@bytedance.com>
Subject: [RFC PATCH] zswap: add writeback_time_threshold interface to shrink zswap pool
Date: Wed, 11 Oct 2023 13:11:17 +0800
Message-Id: <20231011051117.2289518-1-hezhongkun.hzk@bytedance.com>
X-Mailer: git-send-email 2.25.1 |
Series | [RFC] zswap: add writeback_time_threshold interface to shrink zswap pool |
Commit Message
Zhongkun He
Oct. 11, 2023, 5:11 a.m. UTC
zswap does not have a suitable method for selecting objects that have not
been accessed for a long time; it only shrinks the pool when the limit is
hit. If the limit is set too high, there is a high probability of wasting
memory in zswap.
This patch adds a new interface, writeback_time_threshold, to shrink the
zswap pool proactively based on a time threshold in seconds, e.g.::

    echo 600 > /sys/module/zswap/parameters/writeback_time_threshold

If zswap entries have not been accessed for more than 600 seconds, they
will be written back to swap. If the threshold is set to 0, all of them
will be written back.
Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
Documentation/admin-guide/mm/zswap.rst | 9 +++
mm/zswap.c | 76 ++++++++++++++++++++++++++
2 files changed, 85 insertions(+)
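For illustration, here is a minimal user-space sketch of how an agent might drive this knob on a timer; per the patch, every write to the parameter triggers an immediate scan of the pool LRU. The 600-second threshold and the 900-second interval are assumptions, not values from the patch, and would need tuning per workload.

/*
 * Illustrative user-space agent (not part of the patch): periodically
 * re-write the threshold so the module re-scans its LRU tail.
 */
#include <stdio.h>
#include <unistd.h>

#define PARAM "/sys/module/zswap/parameters/writeback_time_threshold"

int main(void)
{
	for (;;) {
		FILE *f = fopen(PARAM, "w");

		if (!f) {
			perror("fopen " PARAM);
			return 1;
		}
		fprintf(f, "600\n");	/* entries idle for >600s get written back */
		fclose(f);
		sleep(900);		/* poll interval chosen larger than the threshold */
	}
}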
Comments
On Tue, Oct 10, 2023 at 10:11 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > zswap does not have a suitable method to select objects that have not > been accessed for a long time, and just shrink the pool when the limit > is hit. There is a high probability of wasting memory in zswap if the > limit is too high. We're currently trying to solve this exact problem. Our approach is to add a shrinker that automatically shrinks the size of the zswap pool: https://lore.kernel.org/lkml/20230919171447.2712746-1-nphamcs@gmail.com/ It is triggered on memory-pressure, and can perform reclaim in a workload-specific manner. I'm currently working on v3 of this patch series, but in the meantime, could you take a look and see if it will address your issues as well? Comments and suggestions are always welcome, of course :) > > This patch add a new interface writeback_time_threshold to shrink zswap > pool proactively based on the time threshold in second, e.g.:: > > echo 600 > /sys/module/zswap/parameters/writeback_time_threshold My concern with this approach is that this value seems rather arbitrary. I imagine that it is workload- and memory access pattern- dependent, and will have to be tuned. Other than a couple of big users, no one will have the resources to do this. And since this is a one-off knob, there's another parameter users will have to decide - frequency, i.e how often should the userspace agent trigger this reclaim action. This is again very hard to determine a priori, and most likely has to be tuned as well. > > If zswap_entrys have not been accessed for more than 600 seconds, they > will be swapout to swap. if set to 0, all of them will be swapout. > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> > --- > Documentation/admin-guide/mm/zswap.rst | 9 +++ > mm/zswap.c | 76 ++++++++++++++++++++++++++ > 2 files changed, 85 insertions(+) > > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst > index 45b98390e938..9ffaed26c3c0 100644 > --- a/Documentation/admin-guide/mm/zswap.rst > +++ b/Documentation/admin-guide/mm/zswap.rst > @@ -153,6 +153,15 @@ attribute, e. g.:: > > Setting this parameter to 100 will disable the hysteresis. > > +When there is a lot of cold memory according to the store time in the zswap, > +it can be swapout and save memory in userspace proactively. User can write > +writeback time threshold in second to enable it, e.g.:: > + > + echo 600 > /sys/module/zswap/parameters/writeback_time_threshold > + > +If zswap_entrys have not been accessed for more than 600 seconds, they will be > +swapout. if set to 0, all of them will be swapout. > + > A debugfs interface is provided for various statistic about pool size, number > of pages stored, same-value filled pages and various counters for the reasons > pages are rejected. 
> diff --git a/mm/zswap.c b/mm/zswap.c > index 083c693602b8..c3a19b56a29b 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -141,6 +141,16 @@ static bool zswap_exclusive_loads_enabled = IS_ENABLED( > CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON); > module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644); > > +/* zswap writeback time threshold in second */ > +static unsigned int zswap_writeback_time_thr; > +static int zswap_writeback_time_thr_param_set(const char *, const struct kernel_param *); > +static const struct kernel_param_ops zswap_writeback_param_ops = { > + .set = zswap_writeback_time_thr_param_set, > + .get = param_get_uint, > +}; > +module_param_cb(writeback_time_threshold, &zswap_writeback_param_ops, > + &zswap_writeback_time_thr, 0644); > + > /* Number of zpools in zswap_pool (empirically determined for scalability) */ > #define ZSWAP_NR_ZPOOLS 32 > > @@ -197,6 +207,7 @@ struct zswap_pool { > * value - value of the same-value filled pages which have same content > * objcg - the obj_cgroup that the compressed memory is charged to > * lru - handle to the pool's lru used to evict pages. > + * sto_time - the store time of zswap_entry. > */ > struct zswap_entry { > struct rb_node rbnode; > @@ -210,6 +221,7 @@ struct zswap_entry { > }; > struct obj_cgroup *objcg; > struct list_head lru; > + ktime_t sto_time; > }; > > /* > @@ -288,6 +300,31 @@ static void zswap_update_total_size(void) > zswap_pool_total_size = total; > } > > +static void zswap_reclaim_entry_by_timethr(void); > + > +static bool zswap_reach_timethr(struct zswap_pool *pool) > +{ > + struct zswap_entry *entry; > + ktime_t expire_time = 0; > + bool ret = false; > + > + spin_lock(&pool->lru_lock); > + > + if (list_empty(&pool->lru)) > + goto out; > + > + entry = list_last_entry(&pool->lru, struct zswap_entry, lru); > + expire_time = ktime_add(entry->sto_time, > + ns_to_ktime(zswap_writeback_time_thr * NSEC_PER_SEC)); > + > + if (ktime_after(ktime_get_boottime(), expire_time)) > + ret = true; > +out: > + spin_unlock(&pool->lru_lock); > + return ret; > +} > + > + > /********************************* > * zswap entry functions > **********************************/ > @@ -395,6 +432,7 @@ static void zswap_free_entry(struct zswap_entry *entry) > else { > spin_lock(&entry->pool->lru_lock); > list_del(&entry->lru); > + entry->sto_time = 0; > spin_unlock(&entry->pool->lru_lock); > zpool_free(zswap_find_zpool(entry), entry->handle); > zswap_pool_put(entry->pool); > @@ -709,6 +747,28 @@ static void shrink_worker(struct work_struct *w) > zswap_pool_put(pool); > } > > +static void zswap_reclaim_entry_by_timethr(void) > +{ > + struct zswap_pool *pool = zswap_pool_current_get(); > + int ret, failures = 0; > + > + if (!pool) > + return; > + > + while (zswap_reach_timethr(pool)) { > + ret = zswap_reclaim_entry(pool); > + if (ret) { > + zswap_reject_reclaim_fail++; > + if (ret != -EAGAIN) > + break; > + if (++failures == MAX_RECLAIM_RETRIES) > + break; > + } > + cond_resched(); > + } > + zswap_pool_put(pool); > +} > + > static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > { > int i; > @@ -1037,6 +1097,21 @@ static int zswap_enabled_param_set(const char *val, > return ret; > } > > +static int zswap_writeback_time_thr_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + int ret = -ENODEV; > + > + /* if this is load-time (pre-init) param setting, just return. 
*/ > + if (system_state != SYSTEM_RUNNING) > + return ret; > + > + ret = param_set_uint(val, kp); > + if (!ret) > + zswap_reclaim_entry_by_timethr(); > + return ret; > +} > + > /********************************* > * writeback code > **********************************/ > @@ -1360,6 +1435,7 @@ bool zswap_store(struct folio *folio) > if (entry->length) { > spin_lock(&entry->pool->lru_lock); > list_add(&entry->lru, &entry->pool->lru); > + entry->sto_time = ktime_get_boottime(); > spin_unlock(&entry->pool->lru_lock); > } I think there might be some issues with just storing the store time here as well. IIUC, there might be cases where the zswap entry is accessed and brought into memory, but that entry (with the associated compressed memory) still hangs around. For e.g and more context, see this patch that enables exclusive loads: https://lore.kernel.org/lkml/20230607195143.1473802-1-yosryahmed@google.com/ If that happens, this sto_time field does not tell the full story, right? For instance, if an object is stored a long time ago, but has been accessed since, it shouldn't be considered a cold object that should be a candidate for reclaim. But the old sto_time would indicate otherwise. > spin_unlock(&tree->lock); > -- > 2.25.1 >
Hi Nhat, thanks for your detailed reply.

> We're currently trying to solve this exact problem. Our approach is to
> add a shrinker that automatically shrinks the size of the zswap pool:
>
> https://lore.kernel.org/lkml/20230919171447.2712746-1-nphamcs@gmail.com/
>
> It is triggered on memory-pressure, and can perform reclaim in a
> workload-specific manner.
>
> I'm currently working on v3 of this patch series, but in the meantime,
> could you take a look and see if it will address your issues as well?
>
> Comments and suggestions are always welcome, of course :)

Thanks, I've seen both patches. But we hope to be able to reclaim memory
in advance, regardless of memory pressure, like memory.reclaim in memcg,
so we can offload memory in different tiers.

> My concern with this approach is that this value seems rather arbitrary.
> I imagine that it is workload- and memory access pattern- dependent,
> and will have to be tuned. Other than a couple of big users, no one
> will have the resources to do this.
>
> And since this is a one-off knob, there's another parameter users
> will have to decide - frequency, i.e how often should the userspace
> agent trigger this reclaim action. This is again very hard to determine
> a priori, and most likely has to be tuned as well.

I totally agree with you, this is the key point of this approach. It depends
on how we define cold pages, which are usually measured in time, such as
not being accessed for 600 seconds, etc. So the frequency should be greater
than 600 seconds.

> I think there might be some issues with just storing the store time here
> as well. IIUC, there might be cases where the zswap entry
> is accessed and brought into memory, but that entry (with the associated
> compressed memory) still hangs around. For e.g and more context,
> see this patch that enables exclusive loads:
>
> https://lore.kernel.org/lkml/20230607195143.1473802-1-yosryahmed@google.com/
>
> If that happens, this sto_time field does not tell the full story, right?
> For instance, if an object is stored a long time ago, but has been
> accessed since, it shouldn't be considered a cold object that should be
> a candidate for reclaim. But the old sto_time would indicate otherwise.

Thanks for your review, we should update the store time when it is loaded.
But it confuses me: there are two copies of the same page in memory
(compressed and uncompressed) after faulting in a page from zswap if
'zswap_exclusive_loads_enabled' is disabled. I didn't notice any difference
when turning that option on or off, because frontswap_ops has been removed
and there is no frontswap_map anymore. Sorry, am I missing something?
On Thu, Oct 12, 2023 at 10:13:16PM +0800, 贺中坤 wrote:
> Hi Nhat, thanks for your detailed reply.
>
> > We're currently trying to solve this exact problem. Our approach is to
> > add a shrinker that automatically shrinks the size of the zswap pool:
> >
> > https://lore.kernel.org/lkml/20230919171447.2712746-1-nphamcs@gmail.com/
> >
> > It is triggered on memory-pressure, and can perform reclaim in a
> > workload-specific manner.
> >
> > I'm currently working on v3 of this patch series, but in the meantime,
> > could you take a look and see if it will address your issues as well?
> >
> > Comments and suggestions are always welcome, of course :)
>
> Thanks, I've seen both patches. But we hope to be able to reclaim memory
> in advance, regardless of memory pressure, like memory.reclaim in memcg,
> so we can offload memory in different tiers.

Can you use memory.reclaim itself for that? With Nhat's shrinker, it
should move the whole pipeline (LRU -> zswap -> swap).

> Thanks for your review,we should update the store time when it was loaded.
> But it confused me, there are two copies of the same page in memory
> (compressed and uncompressed) after faulting in a page from zswap if
> 'zswap_exclusive_loads_enabled' was disabled. I didn't notice any difference
> when turning that option on or off because the frontswap_ops has been removed
> and there is no frontswap_map anymore. Sorry, am I missing something?

In many instances, swapins already free the swap slot through the
generic swap code (see should_try_to_free_swap()). It matters for
shared pages, or for swapcaching read-only data when swap isn't full -
it could be that isn't the case in your tests.
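For context, a rough paraphrase of the should_try_to_free_swap() check referenced above (it lives in mm/memory.c); the exact conditions vary across kernel versions, so treat this as a sketch rather than the authoritative code. A freshly swapped-in folio gives up its swap slot (and hence its zswap copy) when any of these conditions hold:

/* Sketch only -- paraphrased, not copied from a specific kernel version. */
static inline bool should_try_to_free_swap(struct folio *folio,
					   struct vm_area_struct *vma,
					   unsigned int fault_flags)
{
	if (!folio_test_swapcache(folio))
		return false;
	/* swap is nearly full, or the page cannot be reclaimed again anyway */
	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
	    folio_test_mlocked(folio))
		return true;
	/* exclusive write fault: keeping a stale swap copy has no value */
	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
	       folio_ref_count(folio) == 2;
}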
On Thu, Oct 12, 2023 at 7:13 AM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > > Hi Nhat, thanks for your detailed reply. > > > We're currently trying to solve this exact problem. Our approach is to > > add a shrinker that automatically shrinks the size of the zswap pool: > > > > https://lore.kernel.org/lkml/20230919171447.2712746-1-nphamcs@gmail.com/ > > > > It is triggered on memory-pressure, and can perform reclaim in a > > workload-specific manner. > > > > I'm currently working on v3 of this patch series, but in the meantime, > > could you take a look and see if it will address your issues as well? > > > > Comments and suggestions are always welcome, of course :) > > > > Thanks, I've seen both patches. But we hope to be able to reclaim memory > in advance, regardless of memory pressure, like memory.reclaim in memcg, > so we can offload memory in different tiers. As Johannes pointed out, with a zswap shrinker, we can just push on the memory.reclaim knob, and it'll automatically get pushed down the pipeline: memory -> swap -> zswap That seems to be a bit more natural and user-friendly to me than making the users manually decide to push zswap out to swap. My ideal vision of how all of this should go is that users provide an abstract declaration of requirement, and the specific decision of what to be done is left to the kernel to perform, as transparently to the user as possible. This philosophy extends to multi-tier memory management in general, not just the above 3-tier model. > > > > > My concern with this approach is that this value seems rather arbitrary. > > I imagine that it is workload- and memory access pattern- dependent, > > and will have to be tuned. Other than a couple of big users, no one > > will have the resources to do this. > > > > And since this is a one-off knob, there's another parameter users > > will have to decide - frequency, i.e how often should the userspace > > agent trigger this reclaim action. This is again very hard to determine > > a priori, and most likely has to be tuned as well. > > > > I totally agree with you, this is the key point of this approach.It depends > on how we define cold pages, which are usually measured in time, > such as not being accessed for 600 seconds, etc. So the frequency > should be greater than 600 seconds. I guess my main concern here is - how do you determine the value 600 seconds in the first place? And yes, the frequency should be greater than the oldness cutoff, but how much greater? We can run experiments to decide what cutoff will hurt performance the least (or improve the performance the most), but that value will be specific to our workload and memory access patterns. Other users might need a different value entirely, and they might not have the resources to find out. If it's just a binary decision (on or off), then at least it could be one A/B experiment (per workload/service). But the range here could vary wildly. Is there at least a default value that works decently well across workload/service, in your experience? > > > I think there might be some issues with just storing the store time here > > as well. IIUC, there might be cases where the zswap entry > > is accessed and brought into memory, but that entry (with the associated > > compressed memory) still hangs around. For e.g and more context, > > see this patch that enables exclusive loads: > > > > https://lore.kernel.org/lkml/20230607195143.1473802-1-yosryahmed@google.com/ > > > > If that happens, this sto_time field does not tell the full story, right? 
> > For instance, if an object is stored a long time ago, but has been > > accessed since, it shouldn't be considered a cold object that should be > > a candidate for reclaim. But the old sto_time would indicate otherwise. > > > > Thanks for your review,we should update the store time when it was loaded. > But it confused me, there are two copies of the same page in memory > (compressed and uncompressed) after faulting in a page from zswap if > 'zswap_exclusive_loads_enabled' was disabled. I didn't notice any difference > when turning that option on or off because the frontswap_ops has been removed > and there is no frontswap_map anymore. Sorry, am I missing something? I believe Johannes has explained the case where this could happen. But yeah, this should be fixable with by updating the stored time field on access (maybe rename it to something a bit more fitting as well - last_accessed_time?) Regardless, it is incredibly validating to see that other parties share the same problems as us :) It's not a super invasive change as well. I just don't think it solves the issue that well for every zswap user.
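A minimal sketch of that suggestion, assuming the field is renamed to last_access_time (both the helper and the field name are hypothetical, not part of the posted patch): on a non-exclusive load, refresh the timestamp and rotate the entry back to the MRU end of the pool LRU, so age reflects last access rather than store time.

/* Hypothetical helper -- would be called from zswap_load() when the entry
 * is kept around (i.e. the load is not exclusive). */
static void zswap_lru_touch(struct zswap_entry *entry)
{
	spin_lock(&entry->pool->lru_lock);
	if (!list_empty(&entry->lru)) {
		/* the head of pool->lru is the most-recently-used end */
		list_move(&entry->lru, &entry->pool->lru);
		entry->last_access_time = ktime_get_boottime();	/* renamed from sto_time */
	}
	spin_unlock(&entry->pool->lru_lock);
}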
On Tue, Oct 10, 2023 at 10:11 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > zswap does not have a suitable method to select objects that have not > been accessed for a long time, and just shrink the pool when the limit > is hit. There is a high probability of wasting memory in zswap if the > limit is too high. > > This patch add a new interface writeback_time_threshold to shrink zswap > pool proactively based on the time threshold in second, e.g.:: > > echo 600 > /sys/module/zswap/parameters/writeback_time_threshold > > If zswap_entrys have not been accessed for more than 600 seconds, they > will be swapout to swap. if set to 0, all of them will be swapout. > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> I prefer if this can be done through memory.reclaim when the zswap shrinker is in place, as others have suggested. I understand that this provides more control by specifying the time at which to start writing pages out, which is similar to zram writeback AFAICT, but it is also difficult to determine the right value to write here. I am also not sure how you decide that it is better to writeback cold pages in zswap or compress cold pages in the LRUs. The pages in zswap are obviously colder, but accessing them after they are written back is much more expensive, to the point that it could be better to compress more cold memory from the LRUs. This is obviously not straightforward and requires a fair amount of tuning to do more good than harm. That being said, if we decide to move forward with this I have a couple of comments: - I think you should check out how zram implements idle writeback and try to make things consistent. Zswap and zram don't really see eye to eye, but some consistency would be nice. If you looked at zram's implementation you would realize that you also need to update the access time when a page is read (unless the load is exclusive). - This should be behind a config option. Every word that we add to struct zswap_entry reduces the zswap savings by roughly 0.2%. Maybe this doesn't sound like much but it adds up. Let's not opt everyone in unless they ask for it. > --- > Documentation/admin-guide/mm/zswap.rst | 9 +++ > mm/zswap.c | 76 ++++++++++++++++++++++++++ > 2 files changed, 85 insertions(+) > > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst > index 45b98390e938..9ffaed26c3c0 100644 > --- a/Documentation/admin-guide/mm/zswap.rst > +++ b/Documentation/admin-guide/mm/zswap.rst > @@ -153,6 +153,15 @@ attribute, e. g.:: > > Setting this parameter to 100 will disable the hysteresis. > > +When there is a lot of cold memory according to the store time in the zswap, > +it can be swapout and save memory in userspace proactively. User can write > +writeback time threshold in second to enable it, e.g.:: > + > + echo 600 > /sys/module/zswap/parameters/writeback_time_threshold > + > +If zswap_entrys have not been accessed for more than 600 seconds, they will be > +swapout. if set to 0, all of them will be swapout. > + > A debugfs interface is provided for various statistic about pool size, number > of pages stored, same-value filled pages and various counters for the reasons > pages are rejected. 
> diff --git a/mm/zswap.c b/mm/zswap.c > index 083c693602b8..c3a19b56a29b 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -141,6 +141,16 @@ static bool zswap_exclusive_loads_enabled = IS_ENABLED( > CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON); > module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644); > > +/* zswap writeback time threshold in second */ > +static unsigned int zswap_writeback_time_thr; > +static int zswap_writeback_time_thr_param_set(const char *, const struct kernel_param *); > +static const struct kernel_param_ops zswap_writeback_param_ops = { > + .set = zswap_writeback_time_thr_param_set, > + .get = param_get_uint, > +}; > +module_param_cb(writeback_time_threshold, &zswap_writeback_param_ops, > + &zswap_writeback_time_thr, 0644); > + > /* Number of zpools in zswap_pool (empirically determined for scalability) */ > #define ZSWAP_NR_ZPOOLS 32 > > @@ -197,6 +207,7 @@ struct zswap_pool { > * value - value of the same-value filled pages which have same content > * objcg - the obj_cgroup that the compressed memory is charged to > * lru - handle to the pool's lru used to evict pages. > + * sto_time - the store time of zswap_entry. > */ > struct zswap_entry { > struct rb_node rbnode; > @@ -210,6 +221,7 @@ struct zswap_entry { > }; > struct obj_cgroup *objcg; > struct list_head lru; > + ktime_t sto_time; > }; > > /* > @@ -288,6 +300,31 @@ static void zswap_update_total_size(void) > zswap_pool_total_size = total; > } > > +static void zswap_reclaim_entry_by_timethr(void); > + > +static bool zswap_reach_timethr(struct zswap_pool *pool) > +{ > + struct zswap_entry *entry; > + ktime_t expire_time = 0; > + bool ret = false; > + > + spin_lock(&pool->lru_lock); > + > + if (list_empty(&pool->lru)) > + goto out; > + > + entry = list_last_entry(&pool->lru, struct zswap_entry, lru); > + expire_time = ktime_add(entry->sto_time, > + ns_to_ktime(zswap_writeback_time_thr * NSEC_PER_SEC)); > + > + if (ktime_after(ktime_get_boottime(), expire_time)) > + ret = true; > +out: > + spin_unlock(&pool->lru_lock); > + return ret; > +} > + > + > /********************************* > * zswap entry functions > **********************************/ > @@ -395,6 +432,7 @@ static void zswap_free_entry(struct zswap_entry *entry) > else { > spin_lock(&entry->pool->lru_lock); > list_del(&entry->lru); > + entry->sto_time = 0; > spin_unlock(&entry->pool->lru_lock); > zpool_free(zswap_find_zpool(entry), entry->handle); > zswap_pool_put(entry->pool); > @@ -709,6 +747,28 @@ static void shrink_worker(struct work_struct *w) > zswap_pool_put(pool); > } > > +static void zswap_reclaim_entry_by_timethr(void) > +{ > + struct zswap_pool *pool = zswap_pool_current_get(); > + int ret, failures = 0; > + > + if (!pool) > + return; > + > + while (zswap_reach_timethr(pool)) { > + ret = zswap_reclaim_entry(pool); > + if (ret) { > + zswap_reject_reclaim_fail++; > + if (ret != -EAGAIN) > + break; > + if (++failures == MAX_RECLAIM_RETRIES) > + break; > + } > + cond_resched(); > + } > + zswap_pool_put(pool); > +} > + > static struct zswap_pool *zswap_pool_create(char *type, char *compressor) > { > int i; > @@ -1037,6 +1097,21 @@ static int zswap_enabled_param_set(const char *val, > return ret; > } > > +static int zswap_writeback_time_thr_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + int ret = -ENODEV; > + > + /* if this is load-time (pre-init) param setting, just return. 
*/ > + if (system_state != SYSTEM_RUNNING) > + return ret; > + > + ret = param_set_uint(val, kp); > + if (!ret) > + zswap_reclaim_entry_by_timethr(); > + return ret; > +} > + > /********************************* > * writeback code > **********************************/ > @@ -1360,6 +1435,7 @@ bool zswap_store(struct folio *folio) > if (entry->length) { > spin_lock(&entry->pool->lru_lock); > list_add(&entry->lru, &entry->pool->lru); > + entry->sto_time = ktime_get_boottime(); > spin_unlock(&entry->pool->lru_lock); > } > spin_unlock(&tree->lock); > -- > 2.25.1 >
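A sketch of the config-option suggestion raised in this review; CONFIG_ZSWAP_WRITEBACK_TIME_THRESHOLD and last_access_time are illustrative names (not existing kernel symbols), and the unrelated fields of the struct are elided:

/* Illustrative only: grow struct zswap_entry only when the feature is
 * selected, so users who leave it off keep the current per-entry footprint. */
struct zswap_entry {
	struct rb_node rbnode;
	/* ... existing fields (pool, handle/value, objcg, lru, ...) ... */
#ifdef CONFIG_ZSWAP_WRITEBACK_TIME_THRESHOLD
	ktime_t last_access_time;
#endif
};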
> Can you use memory.reclaim itself for that? With Nhat's shrinker, it
> should move the whole pipeline (LRU -> zswap -> swap).

Thanks, I will backport it and have a try.

> In many instances, swapins already free the swap slot through the
> generic swap code (see should_try_to_free_swap()). It matters for
> shared pages, or for swapcaching read-only data when swap isn't full -
> it could be that isn't the case in your tests.

Got it. Thanks for your reply.
> As Johannes pointed out, with a zswap shrinker, we can just push on
> the memory.reclaim knob, and it'll automatically get pushed down the
> pipeline:
>
> memory -> swap -> zswap
>
> That seems to be a bit more natural and user-friendly to me than
> making the users manually decide to push zswap out to swap.
>
> My ideal vision of how all of this should go is that users provide an
> abstract declaration of requirement, and the specific decision of what
> to be done is left to the kernel to perform, as transparently to the user
> as possible. This philosophy extends to multi-tier memory management
> in general, not just the above 3-tier model.

That sounds great, I will backport it and have a try.

> I guess my main concern here is - how do you determine the value
> 600 seconds in the first place?

I will test based on different applications and their corresponding memory
access patterns. Usually we run similar programs on the same machine. First,
we can use memory.reclaim to swap out pages to zswap, and with this patch I
can find the distribution of times that pages reside in zswap, and then
choose an appropriate threshold.

> And yes, the frequency should be greater than the oldness cutoff,
> but how much greater?

This depends on the user's memory needs. If you want to reclaim memory
faster, you can set it to 1.5 times the threshold. On the contrary, you can
set it to one hour, two hours, etc.

> We can run experiments to decide what cutoff will hurt performance
> the least (or improve the performance the most), but that value will
> be specific to our workload and memory access patterns. Other
> users might need a different value entirely, and they might not have
> the resources to find out.
>
> If it's just a binary decision (on or off), then at least it could be
> one A/B experiment (per workload/service). But the range here
> could vary wildly.
>
> Is there at least a default value that works decently well across
> workload/service, in your experience?

Yes, I agree it's difficult to set a perfect value, but even a reasonable
default such as 600 seconds is beneficial: zswap would write back entries
that have not been accessed within 600 seconds, offloading them to swap.

> I believe Johannes has explained the case where this could happen.
> But yeah, this should be fixable with by updating the stored time
> field on access (maybe rename it to something a bit more fitting as
> well - last_accessed_time?)

Thanks, I agree.

> Regardless, it is incredibly validating to see that other parties share the
> same problems as us :) It's not a super invasive change as well.
> I just don't think it solves the issue that well for every zswap user.

I noticed this problem before and thought about some solutions, but only saw
your patch recently. I can also try it and discuss it together. At the same
time, I will think about how to improve this patch.
Thanks for your reply.

> I prefer if this can be done through memory.reclaim when the zswap
> shrinker is in place, as others have suggested. I understand that this
> provides more control by specifying the time at which to start writing
> pages out, which is similar to zram writeback AFAICT, but it is also
> difficult to determine the right value to write here.
>
> I am also not sure how you decide that it is better to writeback cold
> pages in zswap or compress cold pages in the LRUs. The pages in zswap
> are obviously colder, but accessing them after they are written back
> is much more expensive, to the point that it could be better to
> compress more cold memory from the LRUs. This is obviously not
> straightforward and requires a fair amount of tuning to do more good
> than harm.

I do agree. For some applications, a common value will work, such as 600s.
Besides, this patch provides a more flexible way to offload compressed pages.

> That being said, if we decide to move forward with this I have a
> couple of comments:
>
> - I think you should check out how zram implements idle writeback and
> try to make things consistent. Zswap and zram don't really see eye to
> eye, but some consistency would be nice. If you looked at zram's
> implementation you would realize that you also need to update the
> access time when a page is read (unless the load is exclusive).

Thanks for your suggestion, I will fix it and check it again.

> - This should be behind a config option. Every word that we add to
> struct zswap_entry reduces the zswap savings by roughly 0.2%. Maybe
> this doesn't sound like much but it adds up. Let's not opt everyone in
> unless they ask for it.

Good idea, thanks.
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 45b98390e938..9ffaed26c3c0 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -153,6 +153,15 @@ attribute, e. g.::
 
 Setting this parameter to 100 will disable the hysteresis.
 
+When there is a lot of cold memory according to the store time in the zswap,
+it can be swapout and save memory in userspace proactively. User can write
+writeback time threshold in second to enable it, e.g.::
+
+	echo 600 > /sys/module/zswap/parameters/writeback_time_threshold
+
+If zswap_entrys have not been accessed for more than 600 seconds, they will be
+swapout. if set to 0, all of them will be swapout.
+
 A debugfs interface is provided for various statistic about pool size, number
 of pages stored, same-value filled pages and various counters for the reasons
 pages are rejected.
diff --git a/mm/zswap.c b/mm/zswap.c
index 083c693602b8..c3a19b56a29b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -141,6 +141,16 @@ static bool zswap_exclusive_loads_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON);
 module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
 
+/* zswap writeback time threshold in second */
+static unsigned int zswap_writeback_time_thr;
+static int zswap_writeback_time_thr_param_set(const char *, const struct kernel_param *);
+static const struct kernel_param_ops zswap_writeback_param_ops = {
+	.set = zswap_writeback_time_thr_param_set,
+	.get = param_get_uint,
+};
+module_param_cb(writeback_time_threshold, &zswap_writeback_param_ops,
+		&zswap_writeback_time_thr, 0644);
+
 /* Number of zpools in zswap_pool (empirically determined for scalability) */
 #define ZSWAP_NR_ZPOOLS 32
 
@@ -197,6 +207,7 @@ struct zswap_pool {
  * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
+ * sto_time - the store time of zswap_entry.
  */
 struct zswap_entry {
 	struct rb_node rbnode;
@@ -210,6 +221,7 @@ struct zswap_entry {
 	};
 	struct obj_cgroup *objcg;
 	struct list_head lru;
+	ktime_t sto_time;
 };
 
 /*
@@ -288,6 +300,31 @@ static void zswap_update_total_size(void)
 	zswap_pool_total_size = total;
 }
 
+static void zswap_reclaim_entry_by_timethr(void);
+
+static bool zswap_reach_timethr(struct zswap_pool *pool)
+{
+	struct zswap_entry *entry;
+	ktime_t expire_time = 0;
+	bool ret = false;
+
+	spin_lock(&pool->lru_lock);
+
+	if (list_empty(&pool->lru))
+		goto out;
+
+	entry = list_last_entry(&pool->lru, struct zswap_entry, lru);
+	expire_time = ktime_add(entry->sto_time,
+				ns_to_ktime(zswap_writeback_time_thr * NSEC_PER_SEC));
+
+	if (ktime_after(ktime_get_boottime(), expire_time))
+		ret = true;
+out:
+	spin_unlock(&pool->lru_lock);
+	return ret;
+}
+
+
 /*********************************
 * zswap entry functions
 **********************************/
@@ -395,6 +432,7 @@ static void zswap_free_entry(struct zswap_entry *entry)
 	else {
 		spin_lock(&entry->pool->lru_lock);
 		list_del(&entry->lru);
+		entry->sto_time = 0;
 		spin_unlock(&entry->pool->lru_lock);
 		zpool_free(zswap_find_zpool(entry), entry->handle);
 		zswap_pool_put(entry->pool);
@@ -709,6 +747,28 @@ static void shrink_worker(struct work_struct *w)
 	zswap_pool_put(pool);
 }
 
+static void zswap_reclaim_entry_by_timethr(void)
+{
+	struct zswap_pool *pool = zswap_pool_current_get();
+	int ret, failures = 0;
+
+	if (!pool)
+		return;
+
+	while (zswap_reach_timethr(pool)) {
+		ret = zswap_reclaim_entry(pool);
+		if (ret) {
+			zswap_reject_reclaim_fail++;
+			if (ret != -EAGAIN)
+				break;
+			if (++failures == MAX_RECLAIM_RETRIES)
+				break;
+		}
+		cond_resched();
+	}
+	zswap_pool_put(pool);
+}
+
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	int i;
@@ -1037,6 +1097,21 @@ static int zswap_enabled_param_set(const char *val,
 	return ret;
 }
 
+static int zswap_writeback_time_thr_param_set(const char *val,
+				const struct kernel_param *kp)
+{
+	int ret = -ENODEV;
+
+	/* if this is load-time (pre-init) param setting, just return. */
+	if (system_state != SYSTEM_RUNNING)
+		return ret;
+
+	ret = param_set_uint(val, kp);
+	if (!ret)
+		zswap_reclaim_entry_by_timethr();
+	return ret;
+}
+
 /*********************************
 * writeback code
 **********************************/
@@ -1360,6 +1435,7 @@ bool zswap_store(struct folio *folio)
 	if (entry->length) {
 		spin_lock(&entry->pool->lru_lock);
 		list_add(&entry->lru, &entry->pool->lru);
+		entry->sto_time = ktime_get_boottime();
 		spin_unlock(&entry->pool->lru_lock);
 	}
 	spin_unlock(&tree->lock);