Message ID | 20231106183159.3562879-1-nphamcs@gmail.com |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:8f47:0:b0:403:3b70:6f57 with SMTP id j7csp2850514vqu; Mon, 6 Nov 2023 10:33:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IGZlM1A7Z03wQ6+coMqrgmfvw01R/Iu9Cpu4sN6M+k+VtDF4a4w8pF9JbPYcyA8IiFw53Pj X-Received: by 2002:a05:6808:13d6:b0:3b2:ef9a:4339 with SMTP id d22-20020a05680813d600b003b2ef9a4339mr37798876oiw.49.1699295584765; Mon, 06 Nov 2023 10:33:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1699295584; cv=none; d=google.com; s=arc-20160816; b=pWSunlaHuc5ROGspJfRDCeP66fvlJUQb7Y7QxytBbhsQDNKAD26e8vSVoQuycgGRHs /7RLIp0M5gt3y82YvO46GyQpfeNb8ES8CJB4+YArWd5/FstWWHRRjz7EWttIR5kImAyO vPWEKbPSLEOPlYS4QUmWPhidZdYAvIJH15a1U44x8cp7j8WZ94ojf6niQPqF9kiSkNfW 99ZM/QahNPqKIWwq+Ta90WF2QD6ws2xwN7gdj7DDsUcT9+cV/U9Yn2ScL33wYszUgS+U uvKwAHRrtyAilyaLMCm2XNOnPJIGqX/uxEebmAC6BQMdl5xNHAPpR8qm7xwZqBfNKyi/ FO3Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=RolW8kmfT6ZCz66bTNAjEmvRmOGj5ZqsnmWESYphJzc=; fh=5ynFD2G6LA0fdRuAOSZxbBxHoIIG4U3xrwbnkupZs28=; b=x3fLRVLO2uPlGIEwZxK5nJdjEdjUj4z6UFu5l+s/xBXejLMYSpR2WRYRkX9xdgqlZ9 5r9QRDJM7XqGU5hPrUoVzJ+Ksyvj1u3MinXRfNmhQFCqlBhqsCXAI3ofxFWVN6xG/cM+ U/Asz9TM8z/HB+2i2XbEHud6c8VV6sXoKXTHTrQheaVShB0ZTw2cUHnQ9pjBMe6TWFNs XS+Go3CYxdaR/I7GnnRGg6VtuZ8shgu+klo4Gzto4sPMoUNVJt04WB5AEa3HKvZFX41a UVgd6t7MBmXkO+ryw8QGtUEEGRwKJIDoIYoN418F3rsAKv2kGXVEbIN7mWcxDUIwWQKd e2uQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=aW3nFXg7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from groat.vger.email (groat.vger.email. [2620:137:e000::3:5]) by mx.google.com with ESMTPS id bx11-20020a056a00428b00b006b59f4771c5si8553298pfb.277.2023.11.06.10.33.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Nov 2023 10:33:04 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) client-ip=2620:137:e000::3:5; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=aW3nFXg7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:5 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by groat.vger.email (Postfix) with ESMTP id F281280A9AB5; Mon, 6 Nov 2023 10:32:47 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at groat.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232132AbjKFScF (ORCPT <rfc822;lhua1029@gmail.com> + 34 others); Mon, 6 Nov 2023 13:32:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48244 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231773AbjKFScE (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 6 Nov 2023 13:32:04 -0500 Received: from mail-pl1-x629.google.com (mail-pl1-x629.google.com [IPv6:2607:f8b0:4864:20::629]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5124FD49; Mon, 6 Nov 2023 10:32:01 -0800 (PST) Received: by mail-pl1-x629.google.com with SMTP id d9443c01a7336-1cc330e8f58so33580215ad.3; Mon, 06 Nov 2023 10:32:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1699295521; x=1699900321; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=RolW8kmfT6ZCz66bTNAjEmvRmOGj5ZqsnmWESYphJzc=; b=aW3nFXg7W2ZCtG249zntt53EABbUFmLyOWq1xU3YFouZ+/cjAdLhr3/TTUPVBx5/QE vFG7MiqiBvCrCmGhOP7d/e1oee+2R9nIXMAGq8KrsbdXGWeUQ5OWBQqGooMocH3ewtOq pa1COWQj1ODDIzu8V8enTiFwh9cFHi0jwlHoamhT2icH7jwyvZ1lw72i0wuSe+oC7VEH 5Ui3ip+lF23Y1Kxm3L597x3atJq8yucY9lBWceTDWVDHkVT0dOqh16oM97a5AOAsHIjN +I8O7UVcaOq/yTnRDjm0Ec+64+AQKAqtTHRqiRA350kPj8fgjieu7H6wjnPICBMLcidu 55Gw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699295521; x=1699900321; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=RolW8kmfT6ZCz66bTNAjEmvRmOGj5ZqsnmWESYphJzc=; b=Cpgl65TuxkHH+srSegdn4W5l8ycTCaEdUy1xqdklQeQ9wrTZ7g5Dix9ox6eY1Ak4+M to7GO07XTQhWFocXHvdPrGdhndu7jo2Kz0RfysQmxufukfvC9lCOI91J8IbynGVq4YdD VbRKTTQPNO6d6Q17YUjLFGhHS5pV2tkAQwC/9vMbURMQ1E8eLL2ffhyjcfZm9VHIuLf7 zKOHbZHxyVcbHZErbaN01aa0sHbcxZqXX0ypcvierhtgV11hnVAgCIDOobuPElbKyG5d wEbFnGUJEd/v9lN6bobD/cwRD6mwm2MKsayt+f1zMjUxyu9hzgYDJHRKAPUKUDZdY1jy xrlQ== X-Gm-Message-State: AOJu0YwRal6Q7FKoLcb8NGpW9LH4Nu+OwU28al+WVp11rHYBpMEX+AU5 gKWjo3tedJhUZuz0gtAuOOc= X-Received: by 2002:a17:903:2013:b0:1cc:251c:c381 with SMTP id s19-20020a170903201300b001cc251cc381mr19680064pla.29.1699295520329; Mon, 06 Nov 2023 10:32:00 -0800 (PST) Received: from localhost (fwdproxy-prn-020.fbsv.net. [2a03:2880:ff:14::face:b00c]) by smtp.gmail.com with ESMTPSA id u14-20020a170902e5ce00b001c9b35287aesm6213048plf.88.2023.11.06.10.31.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Nov 2023 10:31:59 -0800 (PST) From: Nhat Pham <nphamcs@gmail.com> To: akpm@linux-foundation.org Cc: hannes@cmpxchg.org, cerasuolodomenico@gmail.com, yosryahmed@google.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, shuah@kernel.org Subject: [PATCH v5 0/6] workload-specific and memory pressure-driven zswap writeback Date: Mon, 6 Nov 2023 10:31:53 -0800 Message-Id: <20231106183159.3562879-1-nphamcs@gmail.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.6 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on groat.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (groat.vger.email [0.0.0.0]); Mon, 06 Nov 2023 10:32:48 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1781840566957020524 X-GMAIL-MSGID: 1781840566957020524 |
Series |
workload-specific and memory pressure-driven zswap writeback
|
|
Message
Nhat Pham
Nov. 6, 2023, 6:31 p.m. UTC
Changelog: v5: * Replace reference getting with an rcu_read_lock() section for zswap lru modifications (suggested by Yosry) * Add a new prep patch that allows mem_cgroup_iter() to return online cgroup. * Add a callback that updates pool->next_shrink when the cgroup is offlined (suggested by Yosry Ahmed, Johannes Weiner) v4: * Rename list_lru_add to list_lru_add_obj and __list_lru_add to list_lru_add (patch 1) (suggested by Johannes Weiner and Yosry Ahmed) * Some cleanups on the memcg aware LRU patch (patch 2) (suggested by Yosry Ahmed) * Use event interface for the new per-cgroup writeback counters. (patch 3) (suggested by Yosry Ahmed) * Abstract zswap's lruvec states and handling into zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed) v3: * Add a patch to export per-cgroup zswap writeback counters * Add a patch to update zswap's kselftest * Separate the new list_lru functions into its own prep patch * Do not start from the top of the hierarchy when encounter a memcg that is not online for the global limit zswap writeback (patch 2) (suggested by Yosry Ahmed) * Do not remove the swap entry from list_lru in __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed) * Removed a redundant zswap pool getting (patch 2) (reported by Ryan Roberts) * Use atomic for the nr_zswap_protected (instead of lruvec's lock) (patch 5) (suggested by Yosry Ahmed) * Remove the per-cgroup zswap shrinker knob (patch 5) (suggested by Yosry Ahmed) v2: * Fix loongarch compiler errors * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM There are currently several issues with zswap writeback: 1. There is only a single global LRU for zswap, making it impossible to perform worload-specific shrinking - an memcg under memory pressure cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u But this solution leaves a lot to be desired, as we still do not have an avenue for an memcg to free up its own memory locked up in the zswap pool. 2. We only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch series solves these issues by separating the global zswap LRU into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The new shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. Domenico Cerasuolo (3): zswap: make shrinking memcg-aware mm: memcg: add per-memcg zswap writeback stat selftests: cgroup: update per-memcg zswap writeback selftest Nhat Pham (3): list_lru: allows explicit memcg and NUMA node selection memcontrol: allows mem_cgroup_iter() to check for onlineness zswap: shrinks zswap pool based on memory pressure Documentation/admin-guide/mm/zswap.rst | 7 + drivers/android/binder_alloc.c | 5 +- fs/dcache.c | 8 +- fs/gfs2/quota.c | 6 +- fs/inode.c | 4 +- fs/nfs/nfs42xattr.c | 8 +- fs/nfsd/filecache.c | 4 +- fs/xfs/xfs_buf.c | 6 +- fs/xfs/xfs_dquot.c | 2 +- fs/xfs/xfs_qm.c | 2 +- include/linux/list_lru.h | 46 ++- include/linux/memcontrol.h | 9 +- include/linux/mmzone.h | 2 + include/linux/vm_event_item.h | 1 + include/linux/zswap.h | 27 +- mm/list_lru.c | 48 ++- mm/memcontrol.c | 20 +- mm/mmzone.c | 1 + mm/shrinker.c | 4 +- mm/swap.h | 3 +- mm/swap_state.c | 26 +- mm/vmscan.c | 26 +- mm/vmstat.c | 1 + mm/workingset.c | 4 +- mm/zswap.c | 430 +++++++++++++++++--- tools/testing/selftests/cgroup/test_zswap.c | 74 ++-- 26 files changed, 625 insertions(+), 149 deletions(-)
Comments
Hi Nhat, Sorry for being late to the party. I want to take a look at your patches series. However I wasn't able to "git am" your patches series cleanly on current mm-stable, mm-unstable or linux tip. $ git am patches/v5_20231106_nphamcs_workload_specific_and_memory_pressure_driven_zswap_writeback.mbx Applying: list_lru: allows explicit memcg and NUMA node selection Applying: memcontrol: allows mem_cgroup_iter() to check for onlineness Applying: zswap: make shrinking memcg-aware (fix) error: patch failed: mm/zswap.c:174 error: mm/zswap.c: patch does not apply Patch failed at 0003 zswap: make shrinking memcg-aware (fix) What is the base of your patches? A git hash or a branch I can pull from would be nice. Thanks Chris On Mon, Nov 6, 2023 at 10:32 AM Nhat Pham <nphamcs@gmail.com> wrote: > > Changelog: > v5: > * Replace reference getting with an rcu_read_lock() section for > zswap lru modifications (suggested by Yosry) > * Add a new prep patch that allows mem_cgroup_iter() to return > online cgroup. > * Add a callback that updates pool->next_shrink when the cgroup is > offlined (suggested by Yosry Ahmed, Johannes Weiner) > v4: > * Rename list_lru_add to list_lru_add_obj and __list_lru_add to > list_lru_add (patch 1) (suggested by Johannes Weiner and > Yosry Ahmed) > * Some cleanups on the memcg aware LRU patch (patch 2) > (suggested by Yosry Ahmed) > * Use event interface for the new per-cgroup writeback counters. > (patch 3) (suggested by Yosry Ahmed) > * Abstract zswap's lruvec states and handling into > zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed) > v3: > * Add a patch to export per-cgroup zswap writeback counters > * Add a patch to update zswap's kselftest > * Separate the new list_lru functions into its own prep patch > * Do not start from the top of the hierarchy when encounter a memcg > that is not online for the global limit zswap writeback (patch 2) > (suggested by Yosry Ahmed) > * Do not remove the swap entry from list_lru in > __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed) > * Removed a redundant zswap pool getting (patch 2) > (reported by Ryan Roberts) > * Use atomic for the nr_zswap_protected (instead of lruvec's lock) > (patch 5) (suggested by Yosry Ahmed) > * Remove the per-cgroup zswap shrinker knob (patch 5) > (suggested by Yosry Ahmed) > v2: > * Fix loongarch compiler errors > * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM > > There are currently several issues with zswap writeback: > > 1. There is only a single global LRU for zswap, making it impossible to > perform worload-specific shrinking - an memcg under memory pressure > cannot determine which pages in the pool it owns, and often ends up > writing pages from other memcgs. This issue has been previously > observed in practice and mitigated by simply disabling > memcg-initiated shrinking: > > https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u > > But this solution leaves a lot to be desired, as we still do not > have an avenue for an memcg to free up its own memory locked up in > the zswap pool. > > 2. We only shrink the zswap pool when the user-defined limit is hit. > This means that if we set the limit too high, cold data that are > unlikely to be used again will reside in the pool, wasting precious > memory. It is hard to predict how much zswap space will be needed > ahead of time, as this depends on the workload (specifically, on > factors such as memory access patterns and compressibility of the > memory pages). > > This patch series solves these issues by separating the global zswap > LRU into per-memcg and per-NUMA LRUs, and performs workload-specific > (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The > new shrinker does not have any parameter that must be tuned by the > user, and can be opted in or out on a per-memcg basis. > > As a proof of concept, we ran the following synthetic benchmark: > build the linux kernel in a memory-limited cgroup, and allocate some > cold data in tmpfs to see if the shrinker could write them out and > improved the overall performance. Depending on the amount of cold data > generated, we observe from 14% to 35% reduction in kernel CPU time used > in the kernel builds. > > Domenico Cerasuolo (3): > zswap: make shrinking memcg-aware > mm: memcg: add per-memcg zswap writeback stat > selftests: cgroup: update per-memcg zswap writeback selftest > > Nhat Pham (3): > list_lru: allows explicit memcg and NUMA node selection > memcontrol: allows mem_cgroup_iter() to check for onlineness > zswap: shrinks zswap pool based on memory pressure > > Documentation/admin-guide/mm/zswap.rst | 7 + > drivers/android/binder_alloc.c | 5 +- > fs/dcache.c | 8 +- > fs/gfs2/quota.c | 6 +- > fs/inode.c | 4 +- > fs/nfs/nfs42xattr.c | 8 +- > fs/nfsd/filecache.c | 4 +- > fs/xfs/xfs_buf.c | 6 +- > fs/xfs/xfs_dquot.c | 2 +- > fs/xfs/xfs_qm.c | 2 +- > include/linux/list_lru.h | 46 ++- > include/linux/memcontrol.h | 9 +- > include/linux/mmzone.h | 2 + > include/linux/vm_event_item.h | 1 + > include/linux/zswap.h | 27 +- > mm/list_lru.c | 48 ++- > mm/memcontrol.c | 20 +- > mm/mmzone.c | 1 + > mm/shrinker.c | 4 +- > mm/swap.h | 3 +- > mm/swap_state.c | 26 +- > mm/vmscan.c | 26 +- > mm/vmstat.c | 1 + > mm/workingset.c | 4 +- > mm/zswap.c | 430 +++++++++++++++++--- > tools/testing/selftests/cgroup/test_zswap.c | 74 ++-- > 26 files changed, 625 insertions(+), 149 deletions(-) > > -- > 2.34.1 >
On Wed, Nov 8, 2023 at 11:46 AM Chris Li <chrisl@kernel.org> wrote: > > Hi Nhat, > > Sorry for being late to the party. I want to take a look at your patches series. > However I wasn't able to "git am" your patches series cleanly on current > mm-stable, mm-unstable or linux tip. > > $ git am patches/v5_20231106_nphamcs_workload_specific_and_memory_pressure_driven_zswap_writeback.mbx > Applying: list_lru: allows explicit memcg and NUMA node selection > Applying: memcontrol: allows mem_cgroup_iter() to check for onlineness > Applying: zswap: make shrinking memcg-aware (fix) > error: patch failed: mm/zswap.c:174 > error: mm/zswap.c: patch does not apply > Patch failed at 0003 zswap: make shrinking memcg-aware (fix) Ah that was meant to be a fixlet - so that on top of the original "zswap: make shrinking memcg-aware" patch. The intention was to eventually squash it... But this is getting a bit annoyingly confusing, I admit. I just rebased to mm-unstable + squashed it all again, then sent one single replacement patch: [PATCH v5 3/6 REPLACE] zswap: make shrinking memcg-aware Let me know if this still fails to apply. If not, I'll send the whole thing again as v6! My sincerest apologies for the troubles and confusion :( > > What is the base of your patches? A git hash or a branch I can pull > from would be > nice. > > Thanks > > Chris > > On Mon, Nov 6, 2023 at 10:32 AM Nhat Pham <nphamcs@gmail.com> wrote: > > > > Changelog: > > v5: > > * Replace reference getting with an rcu_read_lock() section for > > zswap lru modifications (suggested by Yosry) > > * Add a new prep patch that allows mem_cgroup_iter() to return > > online cgroup. > > * Add a callback that updates pool->next_shrink when the cgroup is > > offlined (suggested by Yosry Ahmed, Johannes Weiner) > > v4: > > * Rename list_lru_add to list_lru_add_obj and __list_lru_add to > > list_lru_add (patch 1) (suggested by Johannes Weiner and > > Yosry Ahmed) > > * Some cleanups on the memcg aware LRU patch (patch 2) > > (suggested by Yosry Ahmed) > > * Use event interface for the new per-cgroup writeback counters. > > (patch 3) (suggested by Yosry Ahmed) > > * Abstract zswap's lruvec states and handling into > > zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed) > > v3: > > * Add a patch to export per-cgroup zswap writeback counters > > * Add a patch to update zswap's kselftest > > * Separate the new list_lru functions into its own prep patch > > * Do not start from the top of the hierarchy when encounter a memcg > > that is not online for the global limit zswap writeback (patch 2) > > (suggested by Yosry Ahmed) > > * Do not remove the swap entry from list_lru in > > __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed) > > * Removed a redundant zswap pool getting (patch 2) > > (reported by Ryan Roberts) > > * Use atomic for the nr_zswap_protected (instead of lruvec's lock) > > (patch 5) (suggested by Yosry Ahmed) > > * Remove the per-cgroup zswap shrinker knob (patch 5) > > (suggested by Yosry Ahmed) > > v2: > > * Fix loongarch compiler errors > > * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM > > > > There are currently several issues with zswap writeback: > > > > 1. There is only a single global LRU for zswap, making it impossible to > > perform worload-specific shrinking - an memcg under memory pressure > > cannot determine which pages in the pool it owns, and often ends up > > writing pages from other memcgs. This issue has been previously > > observed in practice and mitigated by simply disabling > > memcg-initiated shrinking: > > > > https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u > > > > But this solution leaves a lot to be desired, as we still do not > > have an avenue for an memcg to free up its own memory locked up in > > the zswap pool. > > > > 2. We only shrink the zswap pool when the user-defined limit is hit. > > This means that if we set the limit too high, cold data that are > > unlikely to be used again will reside in the pool, wasting precious > > memory. It is hard to predict how much zswap space will be needed > > ahead of time, as this depends on the workload (specifically, on > > factors such as memory access patterns and compressibility of the > > memory pages). > > > > This patch series solves these issues by separating the global zswap > > LRU into per-memcg and per-NUMA LRUs, and performs workload-specific > > (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The > > new shrinker does not have any parameter that must be tuned by the > > user, and can be opted in or out on a per-memcg basis. > > > > As a proof of concept, we ran the following synthetic benchmark: > > build the linux kernel in a memory-limited cgroup, and allocate some > > cold data in tmpfs to see if the shrinker could write them out and > > improved the overall performance. Depending on the amount of cold data > > generated, we observe from 14% to 35% reduction in kernel CPU time used > > in the kernel builds. > > > > Domenico Cerasuolo (3): > > zswap: make shrinking memcg-aware > > mm: memcg: add per-memcg zswap writeback stat > > selftests: cgroup: update per-memcg zswap writeback selftest > > > > Nhat Pham (3): > > list_lru: allows explicit memcg and NUMA node selection > > memcontrol: allows mem_cgroup_iter() to check for onlineness > > zswap: shrinks zswap pool based on memory pressure > > > > Documentation/admin-guide/mm/zswap.rst | 7 + > > drivers/android/binder_alloc.c | 5 +- > > fs/dcache.c | 8 +- > > fs/gfs2/quota.c | 6 +- > > fs/inode.c | 4 +- > > fs/nfs/nfs42xattr.c | 8 +- > > fs/nfsd/filecache.c | 4 +- > > fs/xfs/xfs_buf.c | 6 +- > > fs/xfs/xfs_dquot.c | 2 +- > > fs/xfs/xfs_qm.c | 2 +- > > include/linux/list_lru.h | 46 ++- > > include/linux/memcontrol.h | 9 +- > > include/linux/mmzone.h | 2 + > > include/linux/vm_event_item.h | 1 + > > include/linux/zswap.h | 27 +- > > mm/list_lru.c | 48 ++- > > mm/memcontrol.c | 20 +- > > mm/mmzone.c | 1 + > > mm/shrinker.c | 4 +- > > mm/swap.h | 3 +- > > mm/swap_state.c | 26 +- > > mm/vmscan.c | 26 +- > > mm/vmstat.c | 1 + > > mm/workingset.c | 4 +- > > mm/zswap.c | 430 +++++++++++++++++--- > > tools/testing/selftests/cgroup/test_zswap.c | 74 ++-- > > 26 files changed, 625 insertions(+), 149 deletions(-) > > > > -- > > 2.34.1 > >
Hi Nhat, On Wed, Nov 8, 2023 at 1:15 PM Nhat Pham <nphamcs@gmail.com> wrote: > > Ah that was meant to be a fixlet - so that on top of the original > "zswap: make shrinking memcg-aware" patch. The intention was > to eventually squash it... > > But this is getting a bit annoyingly confusing, I admit. I just rebased to > mm-unstable + squashed it all again, then sent one single replacement > patch: > > [PATCH v5 3/6 REPLACE] zswap: make shrinking memcg-aware Thank you for the quick response. Yes, I am able to download your replacement version of patch 3. Just FYI, I am using "git mailsplit" to split up the mbox into 6 separate patch files. On mm-unstable, I am able to apply your replacement patch 3 cleanly. I also need some help on the patch 0005, it does not apply cleanly either. $ git mailsplit -ozswap-pool-lru v5_20231106_nphamcs_workload_specific_and_memory_pressure_driven_zswap_writeback.mbx $ git am patches/zswap-pool-lru/0001 Applying: list_lru: allows explicit memcg and NUMA node selection $ git am patches/zswap-pool-lru/0002 Applying: memcontrol: allows mem_cgroup_iter() to check for onlineness $ git am patches/zswap-pool-lru/3.replace Applying: zswap: make shrinking memcg-aware $ git am patches/zswap-pool-lru/0004 Applying: mm: memcg: add per-memcg zswap writeback stat $ git am patches/zswap-pool-lru/0005 Applying: selftests: cgroup: update per-memcg zswap writeback selftest error: patch failed: tools/testing/selftests/cgroup/test_zswap.c:50 error: tools/testing/selftests/cgroup/test_zswap.c: patch does not apply Patch failed at 0001 selftests: cgroup: update per-memcg zswap writeback selftest hint: Use 'git am --show-current-patch=diff' to see the failed patch When you have resolved this problem, run "git am --continue". If you prefer to skip this patch, run "git am --skip" instead. To restore the original branch and stop patching, run "git am --abort". > > Let me know if this still fails to apply. If not, I'll send the whole thing > again as v6! My sincerest apologies for the troubles and confusion :( No problem at all. Thanks for your help on patch 3. Chris
On Wed, Nov 8, 2023 at 3:12 PM Chris Li <chrisl@kernel.org> wrote: > > Hi Nhat, > > On Wed, Nov 8, 2023 at 1:15 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > Ah that was meant to be a fixlet - so that on top of the original > > "zswap: make shrinking memcg-aware" patch. The intention was > > to eventually squash it... > > > > But this is getting a bit annoyingly confusing, I admit. I just rebased to > > mm-unstable + squashed it all again, then sent one single replacement > > patch: > > > > [PATCH v5 3/6 REPLACE] zswap: make shrinking memcg-aware > > Thank you for the quick response. > > Yes, I am able to download your replacement version of patch 3. > Just FYI, I am using "git mailsplit" to split up the mbox into 6 > separate patch files. > On mm-unstable, I am able to apply your replacement patch 3 cleanly. > I also need some help on the patch 0005, it does not apply cleanly either. > > $ git mailsplit -ozswap-pool-lru > v5_20231106_nphamcs_workload_specific_and_memory_pressure_driven_zswap_writeback.mbx > $ git am patches/zswap-pool-lru/0001 > Applying: list_lru: allows explicit memcg and NUMA node selection > $ git am patches/zswap-pool-lru/0002 > Applying: memcontrol: allows mem_cgroup_iter() to check for onlineness > $ git am patches/zswap-pool-lru/3.replace > Applying: zswap: make shrinking memcg-aware > $ git am patches/zswap-pool-lru/0004 > Applying: mm: memcg: add per-memcg zswap writeback stat > $ git am patches/zswap-pool-lru/0005 > Applying: selftests: cgroup: update per-memcg zswap writeback selftest > error: patch failed: tools/testing/selftests/cgroup/test_zswap.c:50 > error: tools/testing/selftests/cgroup/test_zswap.c: patch does not apply > Patch failed at 0001 selftests: cgroup: update per-memcg zswap > writeback selftest > hint: Use 'git am --show-current-patch=diff' to see the failed patch > When you have resolved this problem, run "git am --continue". > If you prefer to skip this patch, run "git am --skip" instead. > To restore the original branch and stop patching, run "git am --abort". > > > > > Let me know if this still fails to apply. If not, I'll send the whole thing > > again as v6! My sincerest apologies for the troubles and confusion :( > > No problem at all. Thanks for your help on patch 3. > > Chris Hmm my guess is that I probably sent this out based on an outdated mm-unstable. There has since been a new zswap selftest merged to mm-unstable (written by no other than myself - oh the irony), so maybe it does not apply cleanly anymore with git am. I was, however, able to apply the patch with the -3 argument, i.e: git am -3 ../mbox/0005 This will fall back to the 3-way merge if direct application fails. And, FWIW, the kselftest still seems to build. I think you'll have to do the same with the 6th patch as well. My guess is that on my latest rebase attempt, this was done silent and automatically, so I did not notice this. Let me know if this works. Worst case scenario, I can still rebase 'n resend the patch series :)
On Wed, Nov 8, 2023 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote: > > Hmm my guess is that I probably sent this out based on an outdated > mm-unstable. There has since been a new zswap selftest merged > to mm-unstable (written by no other than myself - oh the irony), so > maybe it does not apply cleanly anymore with git am. $ git am -3 patches/zswap-pool-lru/0005 Applying: selftests: cgroup: update per-memcg zswap writeback selftest Using index info to reconstruct a base tree... M tools/testing/selftests/cgroup/test_zswap.c Falling back to patching base and 3-way merge... Auto-merging tools/testing/selftests/cgroup/test_zswap.c $ git am -3 patches/zswap-pool-lru/0006 Applying: zswap: shrinks zswap pool based on memory pressure error: sha1 information is lacking or useless (mm/zswap.c). error: could not build fake ancestor Patch failed at 0001 zswap: shrinks zswap pool based on memory pressure hint: Use 'git am --show-current-patch=diff' to see the failed patch When you have resolved this problem, run "git am --continue". If you prefer to skip this patch, run "git am --skip" instead. To restore the original branch and stop patching, run "git am --abort". I was able to resolve the conflict on patch 6 by hand though. So I am good now. Thanks Chris
Hi Nhat, I want want to share the high level feedback we discussed here in the mailing list as well. It is my observation that each memcg LRU list can't compare the page time order with other memcg. It works great when the leaf level memcg hits the memory limit and you want to reclaim from that memcg. It works less well on the global memory pressure you need to reclaim from all memcg. You kind of have to scan each all child memcg to find out the best page to shrink from. It is less effective to get to the most desirable page quickly. This can benefit from a design similar to MGLRU. This idea is suggested by Yu Zhao, credit goes to him not me. In other words, the current patch is similar to the memcg page list pre MGLRU world. We can have a MRLRU like per memcg zswap shrink list. Chris On Wed, Nov 8, 2023 at 6:10 PM Chris Li <chrisl@kernel.org> wrote: > > On Wed, Nov 8, 2023 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > Hmm my guess is that I probably sent this out based on an outdated > > mm-unstable. There has since been a new zswap selftest merged > > to mm-unstable (written by no other than myself - oh the irony), so > > maybe it does not apply cleanly anymore with git am. > > $ git am -3 patches/zswap-pool-lru/0005 > Applying: selftests: cgroup: update per-memcg zswap writeback selftest > Using index info to reconstruct a base tree... > M tools/testing/selftests/cgroup/test_zswap.c > Falling back to patching base and 3-way merge... > Auto-merging tools/testing/selftests/cgroup/test_zswap.c > $ git am -3 patches/zswap-pool-lru/0006 > Applying: zswap: shrinks zswap pool based on memory pressure > error: sha1 information is lacking or useless (mm/zswap.c). > error: could not build fake ancestor > Patch failed at 0001 zswap: shrinks zswap pool based on memory pressure > hint: Use 'git am --show-current-patch=diff' to see the failed patch > When you have resolved this problem, run "git am --continue". > If you prefer to skip this patch, run "git am --skip" instead. > To restore the original branch and stop patching, run "git am --abort". > > I was able to resolve the conflict on patch 6 by hand though. So I am good now. > > Thanks > > Chris
On Thu, Nov 16, 2023 at 4:57 PM Chris Li <chrisl@kernel.org> wrote: > > Hi Nhat, > > I want want to share the high level feedback we discussed here in the > mailing list as well. > > It is my observation that each memcg LRU list can't compare the page > time order with other memcg. > It works great when the leaf level memcg hits the memory limit and you > want to reclaim from that memcg. > It works less well on the global memory pressure you need to reclaim > from all memcg. You kind of have to > scan each all child memcg to find out the best page to shrink from. It > is less effective to get to the most desirable page quickly. > > This can benefit from a design similar to MGLRU. This idea is > suggested by Yu Zhao, credit goes to him not me. > In other words, the current patch is similar to the memcg page list > pre MGLRU world. We can have a MRLRU > like per memcg zswap shrink list. I was gonna summarize the points myself :P But thanks for doing this. It's your idea so you're more qualified to explain this anyway ;) I absolutely agree that having a generation-aware cgroup-aware NUMA-aware LRU is the future way to go. Currently, IIUC, the reclaim logic selects cgroups in a round-robin-ish manner. It's "fair" in this perspective, but I also think it's not ideal. As we have discussed, the current list_lru infrastructure only take into account intra-cgroup relative recency, not inter-cgroup relative recency. The recently proposed time-based zswap reclaim mechanism will provide us with a source of information, but the overhead of using this might be too high - and it's very zswap-specific. Maybe after this, we should improve zswap reclaim (and perhaps all list_lru users) by adding generations to list_lru then take generations into account in the vmscan code. This patch series could be merged as-is, and once we make list_lru generation-aware, zswap shrinker will automagically be improved (along with all other list_lru/shrinker users). I don't know enough about the current design of MGLRU to comment too much further, but let me know if this makes sense, and if you have objections/other ideas. And if you have other documentations for MGLRU than its code, could you please let me know? I'm struggling to find more details about this. > > > Chris > > On Wed, Nov 8, 2023 at 6:10 PM Chris Li <chrisl@kernel.org> wrote: > > > > On Wed, Nov 8, 2023 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > > > Hmm my guess is that I probably sent this out based on an outdated > > > mm-unstable. There has since been a new zswap selftest merged > > > to mm-unstable (written by no other than myself - oh the irony), so > > > maybe it does not apply cleanly anymore with git am. > > > > $ git am -3 patches/zswap-pool-lru/0005 > > Applying: selftests: cgroup: update per-memcg zswap writeback selftest > > Using index info to reconstruct a base tree... > > M tools/testing/selftests/cgroup/test_zswap.c > > Falling back to patching base and 3-way merge... > > Auto-merging tools/testing/selftests/cgroup/test_zswap.c > > $ git am -3 patches/zswap-pool-lru/0006 > > Applying: zswap: shrinks zswap pool based on memory pressure > > error: sha1 information is lacking or useless (mm/zswap.c). > > error: could not build fake ancestor > > Patch failed at 0001 zswap: shrinks zswap pool based on memory pressure > > hint: Use 'git am --show-current-patch=diff' to see the failed patch > > When you have resolved this problem, run "git am --continue". > > If you prefer to skip this patch, run "git am --skip" instead. > > To restore the original branch and stop patching, run "git am --abort". > > > > I was able to resolve the conflict on patch 6 by hand though. So I am good now. > > > > Thanks > > > > Chris
On Fri, Nov 17, 2023 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Nov 16, 2023 at 4:57 PM Chris Li <chrisl@kernel.org> wrote: > > > > Hi Nhat, > > > > I want want to share the high level feedback we discussed here in the > > mailing list as well. > > > > It is my observation that each memcg LRU list can't compare the page > > time order with other memcg. > > It works great when the leaf level memcg hits the memory limit and you > > want to reclaim from that memcg. > > It works less well on the global memory pressure you need to reclaim > > from all memcg. You kind of have to > > scan each all child memcg to find out the best page to shrink from. It > > is less effective to get to the most desirable page quickly. > > > > This can benefit from a design similar to MGLRU. This idea is > > suggested by Yu Zhao, credit goes to him not me. > > In other words, the current patch is similar to the memcg page list > > pre MGLRU world. We can have a MRLRU > > like per memcg zswap shrink list. > > I was gonna summarize the points myself :P But thanks for doing this. > It's your idea so you're more qualified to explain this anyway ;) > > I absolutely agree that having a generation-aware cgroup-aware > NUMA-aware LRU is the future way to go. Currently, IIUC, the reclaim logic > selects cgroups in a round-robin-ish manner. It's "fair" in this perspective, > but I also think it's not ideal. As we have discussed, the current list_lru > infrastructure only take into account intra-cgroup relative recency, not > inter-cgroup relative recency. The recently proposed time-based zswap > reclaim mechanism will provide us with a source of information, but the > overhead of using this might be too high - and it's very zswap-specific. > > Maybe after this, we should improve zswap reclaim (and perhaps all > list_lru users) by adding generations to list_lru then take generations > into account in the vmscan code. This patch series could be merged > as-is, and once we make list_lru generation-aware, zswap shrinker > will automagically be improved (along with all other list_lru/shrinker > users). > > I don't know enough about the current design of MGLRU to comment > too much further, but let me know if this makes sense, and if you have > objections/other ideas. > > And if you have other documentations for MGLRU than its code, could > you please let me know? I'm struggling to find more details about this. > This could be a good place to start: https://www.youtube.com/watch?v=9HvJfN21H9Y
On Fri, Nov 17, 2023 at 8:23 AM Nhat Pham <nphamcs@gmail.com> wrote: > > On Thu, Nov 16, 2023 at 4:57 PM Chris Li <chrisl@kernel.org> wrote: > > > > Hi Nhat, > > > > I want want to share the high level feedback we discussed here in the > > mailing list as well. > > > > It is my observation that each memcg LRU list can't compare the page > > time order with other memcg. > > It works great when the leaf level memcg hits the memory limit and you > > want to reclaim from that memcg. > > It works less well on the global memory pressure you need to reclaim > > from all memcg. You kind of have to > > scan each all child memcg to find out the best page to shrink from. It > > is less effective to get to the most desirable page quickly. > > > > This can benefit from a design similar to MGLRU. This idea is > > suggested by Yu Zhao, credit goes to him not me. > > In other words, the current patch is similar to the memcg page list > > pre MGLRU world. We can have a MRLRU > > like per memcg zswap shrink list. > > I was gonna summarize the points myself :P But thanks for doing this. > It's your idea so you're more qualified to explain this anyway ;) The MGLRU like shrinker was Zhao Yu's idea. I just observe the problem. > > I absolutely agree that having a generation-aware cgroup-aware > NUMA-aware LRU is the future way to go. Currently, IIUC, the reclaim logic > selects cgroups in a round-robin-ish manner. It's "fair" in this perspective, > but I also think it's not ideal. As we have discussed, the current list_lru > infrastructure only take into account intra-cgroup relative recency, not > inter-cgroup relative recency. The recently proposed time-based zswap > reclaim mechanism will provide us with a source of information, but the > overhead of using this might be too high - and it's very zswap-specific. I don't mind it is zswap-specific, as long as it is effective. The overhead has two folds: 1) memory overhead on storing timestamps on per compressed page. 2) cpu overhead for reading timestamps. Using MGLRU likely have advantage over time stamps on both memory and cpu. The generation can use fewer bits and doesn't require reading time on every page. > Maybe after this, we should improve zswap reclaim (and perhaps all > list_lru users) by adding generations to list_lru then take generations > into account in the vmscan code. This patch series could be merged One high level idea is that we can get the page generation in the MGLRU before it gets into zswap. Just retain the generation into the zpool LRU somehow. > as-is, and once we make list_lru generation-aware, zswap shrinker > will automagically be improved (along with all other list_lru/shrinker > users). I don't think it will automatically improve, you will need to rewrite a lot of code in the shrinker as well to best use MGLRU zpool. > > I don't know enough about the current design of MGLRU to comment > too much further, but let me know if this makes sense, and if you have > objections/other ideas. Taking the step by step approach is fine by me as long as we are making steady progress towards the better end goal. > > And if you have other documentations for MGLRU than its code, could > you please let me know? I'm struggling to find more details about this. I would need to learn MGLRU myself. We can share and compare notes when we get to it. Chris