[v7,0/6] workload-specific and memory pressure-driven zswap writeback

Message ID	20231127234600.2971029-1-nphamcs@gmail.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) client-ip=23.128.96.31; From: Nhat Pham <nphamcs@gmail.com> To: akpm@linux-foundation.org Cc: hannes@cmpxchg.org, cerasuolodomenico@gmail.com, yosryahmed@google.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, shuah@kernel.org Subject: [PATCH v7 0/6] workload-specific and memory pressure-driven zswap writeback Date: Mon, 27 Nov 2023 15:45:54 -0800 Message-Id: <20231127234600.2971029-1-nphamcs@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	workload-specific and memory pressure-driven zswap writeback \| [v7,0/6] workload-specific and memory pressure-driven zswap writeback [v7,1/6] list_lru: allows explicit memcg and NUMA node selection [v7,2/6] memcontrol: add a new function to traverse online-only memcg hierarchy [v7,3/6] zswap: make shrinking memcg-aware [v7,4/6] mm: memcg: add per-memcg zswap writeback stat [v7,5/6] selftests: cgroup: update per-memcg zswap writeback selftest

Message ID

20231127234600.2971029-1-nphamcs@gmail.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.31 as permitted sender) client-ip=23.128.96.31;
From: Nhat Pham <nphamcs@gmail.com>
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
        yosryahmed@google.com, sjenning@redhat.com, ddstreet@ieee.org,
        vitaly.wool@konsulko.com, mhocko@kernel.org,
        roman.gushchin@linux.dev, shakeelb@google.com,
        muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org,
        kernel-team@meta.com, linux-kernel@vger.kernel.org,
        cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
        linux-kselftest@vger.kernel.org, shuah@kernel.org
Subject: [PATCH v7 0/6] workload-specific and memory pressure-driven zswap
 writeback
Date: Mon, 27 Nov 2023 15:45:54 -0800
Message-Id: <20231127234600.2971029-1-nphamcs@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

workload-specific and memory pressure-driven zswap writeback |

Message

Nhat Pham Nov. 27, 2023, 11:45 p.m. UTC

  Changelog:
v7:
   * Added the mem_cgroup_iter_online() function to the API for the new
     behavior (suggested by Andrew Morton) (patch 2)
   * Fixed a missing list_lru_del -> list_lru_del_obj (patch 1)
v6:
   * Rebase on top of latest mm-unstable.
   * Fix/improve the in-code documentation of the new list_lru
     manipulation functions (patch 1)
v5:
   * Replace reference getting with an rcu_read_lock() section for
     zswap lru modifications (suggested by Yosry)
   * Add a new prep patch that allows mem_cgroup_iter() to return
     online cgroup.
   * Add a callback that updates pool->next_shrink when the cgroup is
     offlined (suggested by Yosry Ahmed, Johannes Weiner)
v4:
   * Rename list_lru_add to list_lru_add_obj and __list_lru_add to
     list_lru_add (patch 1) (suggested by Johannes Weiner and
	 Yosry Ahmed)
   * Some cleanups on the memcg aware LRU patch (patch 2)
     (suggested by Yosry Ahmed)
   * Use event interface for the new per-cgroup writeback counters.
     (patch 3) (suggested by Yosry Ahmed)
   * Abstract zswap's lruvec states and handling into 
     zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
v3:
   * Add a patch to export per-cgroup zswap writeback counters
   * Add a patch to update zswap's kselftest
   * Separate the new list_lru functions into its own prep patch
   * Do not start from the top of the hierarchy when encounter a memcg
     that is not online for the global limit zswap writeback (patch 2)
     (suggested by Yosry Ahmed)
   * Do not remove the swap entry from list_lru in
     __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
   * Removed a redundant zswap pool getting (patch 2)
     (reported by Ryan Roberts)
   * Use atomic for the nr_zswap_protected (instead of lruvec's lock)
     (patch 5) (suggested by Yosry Ahmed)
   * Remove the per-cgroup zswap shrinker knob (patch 5)
     (suggested by Yosry Ahmed)
v2:
   * Fix loongarch compiler errors
   * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform worload-specific shrinking - an memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs. This issue has been previously
   observed in practice and mitigated by simply disabling
   memcg-initiated shrinking:

   https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

   But this solution leaves a lot to be desired, as we still do not
   have an avenue for an memcg to free up its own memory locked up in
   the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark:
build the linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write them out and
improved the overall performance. Depending on the amount of cold data
generated, we observe from 14% to 35% reduction in kernel CPU time used
in the kernel builds.

Domenico Cerasuolo (3):
  zswap: make shrinking memcg-aware
  mm: memcg: add per-memcg zswap writeback stat
  selftests: cgroup: update per-memcg zswap writeback selftest

Nhat Pham (3):
  list_lru: allows explicit memcg and NUMA node selection
  memcontrol: add a new function to traverse online-only memcg hierarchy
  zswap: shrinks zswap pool based on memory pressure

 Documentation/admin-guide/mm/zswap.rst      |   7 +
 drivers/android/binder_alloc.c              |   7 +-
 fs/dcache.c                                 |   8 +-
 fs/gfs2/quota.c                             |   6 +-
 fs/inode.c                                  |   4 +-
 fs/nfs/nfs42xattr.c                         |   8 +-
 fs/nfsd/filecache.c                         |   4 +-
 fs/xfs/xfs_buf.c                            |   6 +-
 fs/xfs/xfs_dquot.c                          |   2 +-
 fs/xfs/xfs_qm.c                             |   2 +-
 include/linux/list_lru.h                    |  54 ++-
 include/linux/memcontrol.h                  |  18 +
 include/linux/mmzone.h                      |   2 +
 include/linux/vm_event_item.h               |   1 +
 include/linux/zswap.h                       |  27 +-
 mm/list_lru.c                               |  48 ++-
 mm/memcontrol.c                             |  32 +-
 mm/mmzone.c                                 |   1 +
 mm/swap.h                                   |   3 +-
 mm/swap_state.c                             |  26 +-
 mm/vmstat.c                                 |   1 +
 mm/workingset.c                             |   4 +-
 mm/zswap.c                                  | 426 +++++++++++++++++---
 tools/testing/selftests/cgroup/test_zswap.c |  74 ++--
 24 files changed, 641 insertions(+), 130 deletions(-)


base-commit: 5cdba94229e58a39ca389ad99763af29e6b0c5a5

Comments

Johannes Weiner Nov. 29, 2023, 4:21 p.m. UTC | #1

On Mon, Nov 27, 2023 at 03:46:00PM -0800, Nhat Pham wrote:
> Currently, we only shrink the zswap pool when the user-defined limit is
> hit. This means that if we set the limit too high, cold data that are
> unlikely to be used again will reside in the pool, wasting precious
> memory. It is hard to predict how much zswap space will be needed ahead
> of time, as this depends on the workload (specifically, on factors such
> as memory access patterns and compressibility of the memory pages).
> 
> This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> is initiated when there is memory pressure. The shrinker does not
> have any parameter that must be tuned by the user, and can be opted in
> or out on a per-memcg basis.
> 
> Furthermore, to make it more robust for many workloads and prevent
> overshrinking (i.e evicting warm pages that might be refaulted into
> memory), we build in the following heuristics:
> 
> * Estimate the number of warm pages residing in zswap, and attempt to
>   protect this region of the zswap LRU.
> * Scale the number of freeable objects by an estimate of the memory
>   saving factor. The better zswap compresses the data, the fewer pages
>   we will evict to swap (as we will otherwise incur IO for relatively
>   small memory saving).
> * During reclaim, if the shrinker encounters a page that is also being
>   brought into memory, the shrinker will cautiously terminate its
>   shrinking action, as this is a sign that it is touching the warmer
>   region of the zswap LRU.
> 
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improved the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.

I think this is a really important piece of functionality for zswap.

Memory compression has been around and in use for a long time, but the
question of zswap vs swap sizing has been in the room since the hybrid
mode was first proposed. Because depending on the reuse patterns,
putting zswap with a static size limit in front of an existing swap
file can be a net negative for performance as it consumes more memory.

It's great to finally see a solution to this which makes zswap *much*
more general purpose. And something that distributions might want to
turn on per default when swap is configured.

Actually to the point where I think there should be a config option to
enable the shrinker per default. Maybe not right away, but in a few
releases when this feature has racked up some more production time.

> @@ -687,6 +687,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  					&page_allocated, false);
>  	if (unlikely(page_allocated))
>  		swap_readpage(page, false, NULL);
> +	zswap_lruvec_swapin(page);

The "lruvec" in the name vs the page parameter is a bit odd.
zswap_page_swapin() would be slightly better, but it still also sounds
like it would cause an actual swapin of some sort.

zswap_record_swapin()?

> @@ -520,6 +575,95 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>  	return entry;
>  }
>  
> +/*********************************
> +* shrinker functions
> +**********************************/
> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> +				       spinlock_t *lock, void *arg);
> +
> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> +		struct shrink_control *sc)
> +{
> +	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> +	unsigned long shrink_ret, nr_protected, lru_size;
> +	struct zswap_pool *pool = shrinker->private_data;
> +	bool encountered_page_in_swapcache = false;
> +
> +	nr_protected =
> +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> +	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> +
> +	/*
> +	 * Abort if the shrinker is disabled or if we are shrinking into the
> +	 * protected region.
> +	 */
> +	if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> +		sc->nr_scanned = 0;
> +		return SHRINK_STOP;
> +	}

I'm scratching my head at the protection check. zswap_shrinker_count()
already factors protection into account, so sc->nr_to_scan should only
be what is left on the list after excluding the protected area.

Do we even get here if the whole list is protected? Is this to protect
against concurrent shrinking of the list through multiple shrinkers or
swapins? If so, a comment would be nice :)

Otherwise, this looks great to me!

Just nitpicks, no show stoppers:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Bagas Sanjaya Dec. 2, 2023, 4:44 a.m. UTC | #2

On Mon, Nov 27, 2023 at 03:45:54PM -0800, Nhat Pham wrote:
> Changelog:
> v7:
>    * Added the mem_cgroup_iter_online() function to the API for the new
>      behavior (suggested by Andrew Morton) (patch 2)
>    * Fixed a missing list_lru_del -> list_lru_del_obj (patch 1)
> v6:
>    * Rebase on top of latest mm-unstable.
>    * Fix/improve the in-code documentation of the new list_lru
>      manipulation functions (patch 1)
> v5:
>    * Replace reference getting with an rcu_read_lock() section for
>      zswap lru modifications (suggested by Yosry)
>    * Add a new prep patch that allows mem_cgroup_iter() to return
>      online cgroup.
>    * Add a callback that updates pool->next_shrink when the cgroup is
>      offlined (suggested by Yosry Ahmed, Johannes Weiner)
> v4:
>    * Rename list_lru_add to list_lru_add_obj and __list_lru_add to
>      list_lru_add (patch 1) (suggested by Johannes Weiner and
> 	 Yosry Ahmed)
>    * Some cleanups on the memcg aware LRU patch (patch 2)
>      (suggested by Yosry Ahmed)
>    * Use event interface for the new per-cgroup writeback counters.
>      (patch 3) (suggested by Yosry Ahmed)
>    * Abstract zswap's lruvec states and handling into 
>      zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
> v3:
>    * Add a patch to export per-cgroup zswap writeback counters
>    * Add a patch to update zswap's kselftest
>    * Separate the new list_lru functions into its own prep patch
>    * Do not start from the top of the hierarchy when encounter a memcg
>      that is not online for the global limit zswap writeback (patch 2)
>      (suggested by Yosry Ahmed)
>    * Do not remove the swap entry from list_lru in
>      __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
>    * Removed a redundant zswap pool getting (patch 2)
>      (reported by Ryan Roberts)
>    * Use atomic for the nr_zswap_protected (instead of lruvec's lock)
>      (patch 5) (suggested by Yosry Ahmed)
>    * Remove the per-cgroup zswap shrinker knob (patch 5)
>      (suggested by Yosry Ahmed)
> v2:
>    * Fix loongarch compiler errors
>    * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM
> 
> There are currently several issues with zswap writeback:
> 
> 1. There is only a single global LRU for zswap, making it impossible to
>    perform worload-specific shrinking - an memcg under memory pressure
>    cannot determine which pages in the pool it owns, and often ends up
>    writing pages from other memcgs. This issue has been previously
>    observed in practice and mitigated by simply disabling
>    memcg-initiated shrinking:
> 
>    https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
> 
>    But this solution leaves a lot to be desired, as we still do not
>    have an avenue for an memcg to free up its own memory locked up in
>    the zswap pool.
> 
> 2. We only shrink the zswap pool when the user-defined limit is hit.
>    This means that if we set the limit too high, cold data that are
>    unlikely to be used again will reside in the pool, wasting precious
>    memory. It is hard to predict how much zswap space will be needed
>    ahead of time, as this depends on the workload (specifically, on
>    factors such as memory access patterns and compressibility of the
>    memory pages).
> 
> This patch series solves these issues by separating the global zswap
> LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
> (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
> new shrinker does not have any parameter that must be tuned by the
> user, and can be opted in or out on a per-memcg basis.
> 
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improved the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
> 
> Domenico Cerasuolo (3):
>   zswap: make shrinking memcg-aware
>   mm: memcg: add per-memcg zswap writeback stat
>   selftests: cgroup: update per-memcg zswap writeback selftest
> 
> Nhat Pham (3):
>   list_lru: allows explicit memcg and NUMA node selection
>   memcontrol: add a new function to traverse online-only memcg hierarchy
>   zswap: shrinks zswap pool based on memory pressure
> 
>  Documentation/admin-guide/mm/zswap.rst      |   7 +
>  drivers/android/binder_alloc.c              |   7 +-
>  fs/dcache.c                                 |   8 +-
>  fs/gfs2/quota.c                             |   6 +-
>  fs/inode.c                                  |   4 +-
>  fs/nfs/nfs42xattr.c                         |   8 +-
>  fs/nfsd/filecache.c                         |   4 +-
>  fs/xfs/xfs_buf.c                            |   6 +-
>  fs/xfs/xfs_dquot.c                          |   2 +-
>  fs/xfs/xfs_qm.c                             |   2 +-
>  include/linux/list_lru.h                    |  54 ++-
>  include/linux/memcontrol.h                  |  18 +
>  include/linux/mmzone.h                      |   2 +
>  include/linux/vm_event_item.h               |   1 +
>  include/linux/zswap.h                       |  27 +-
>  mm/list_lru.c                               |  48 ++-
>  mm/memcontrol.c                             |  32 +-
>  mm/mmzone.c                                 |   1 +
>  mm/swap.h                                   |   3 +-
>  mm/swap_state.c                             |  26 +-
>  mm/vmstat.c                                 |   1 +
>  mm/workingset.c                             |   4 +-
>  mm/zswap.c                                  | 426 +++++++++++++++++---
>  tools/testing/selftests/cgroup/test_zswap.c |  74 ++--
>  24 files changed, 641 insertions(+), 130 deletions(-)
> 
> 
> base-commit: 5cdba94229e58a39ca389ad99763af29e6b0c5a5

No regressions when booting kernel with series applied.

Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>