[v6] zswap: memcontrol: implement zswap writeback disabling

Message ID 20231207192406.3809579-1-nphamcs@gmail.com
State New
Series [v6] zswap: memcontrol: implement zswap writeback disabling

Commit Message

Nhat Pham Dec. 7, 2023, 7:24 p.m. UTC
  During our experiments with zswap, we sometimes observe swap IO due to
occasional zswap store failures and writebacks to swap. This swap IO
prevents many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.

This patch adds the option to disable this behavior entirely: do not
write back to the backing swap device when a zswap store attempt fails,
and do not write pages in the zswap pool back to the backing swap
device (both when the pool is full, and when the new zswap shrinker is
called).

This new behavior can be opted in/out on a per-cgroup basis via a new
cgroup file. By default, writeback to the swap device is enabled, which
is the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of
its parent.

Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
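
For illustration, the knob is used like this (a usage sketch; <cgroup> is a
placeholder for the target cgroup under a cgroup v2 mount):

    # check the current setting (1 = writeback enabled, the default)
    cat /sys/fs/cgroup/<cgroup>/memory.zswap.writeback

    # disable writeback: zswap store failures and pool evictions no
    # longer fall back to the backing swap device for this cgroup
    echo 0 > /sys/fs/cgroup/<cgroup>/memory.zswap.writeback

    # re-enable writeback
    echo 1 > /sys/fs/cgroup/<cgroup>/memory.zswap.writeback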

This patch should be applied on top of the zswap shrinker series:

https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/

as it also disables the zswap shrinker, a major source of zswap
writebacks.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 12 ++++++++
 Documentation/admin-guide/mm/zswap.rst  |  6 ++++
 include/linux/memcontrol.h              | 12 ++++++++
 include/linux/zswap.h                   |  6 ++++
 mm/memcontrol.c                         | 38 +++++++++++++++++++++++++
 mm/page_io.c                            |  6 ++++
 mm/shmem.c                              |  3 +-
 mm/zswap.c                              | 13 +++++++--
 8 files changed, 92 insertions(+), 4 deletions(-)


base-commit: cdcab2d34f129f593c0afbb2493bcaf41f4acd61
  

Comments

Yosry Ahmed Dec. 7, 2023, 7:26 p.m. UTC | #1
On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> During our experiment with zswap, we sometimes observe swap IOs due to
> occasional zswap store failures and writebacks-to-swap. These swapping
> IOs prevent many users who cannot tolerate swapping from adopting zswap
> to save memory and improve performance where possible.
>
> This patch adds the option to disable this behavior entirely: do not
> writeback to backing swapping device when a zswap store attempt fail,
> and do not write pages in the zswap pool back to the backing swap
> device (both when the pool is full, and when the new zswap shrinker is
> called).
>
> This new behavior can be opted-in/out on a per-cgroup basis via a new
> cgroup file. By default, writebacks to swap device is enabled, which is
> the previous behavior. Initially, writeback is enabled for the root
> cgroup, and a newly created cgroup will inherit the current setting of
> its parent.
>
> Note that this is subtly different from setting memory.swap.max to 0, as
> it still allows for pages to be stored in the zswap pool (which itself
> consumes swap space in its current form).
>
> This patch should be applied on top of the zswap shrinker series:
>
> https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
>
> as it also disables the zswap shrinker, a major source of zswap
> writebacks.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

FTR I still prefer a more generic and future-proof interface (e.g.
memory.swap.tiers), but I am not opposed to this. It will just be
annoying to have different interfaces with overlapping functionalities
in the future if/when the need for a generic interface arises.
  
Andrew Morton Dec. 7, 2023, 10:11 p.m. UTC | #2
On Thu,  7 Dec 2023 11:24:06 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> During our experiment with zswap, we sometimes observe swap IOs due to
> occasional zswap store failures and writebacks-to-swap. These swapping
> IOs prevent many users who cannot tolerate swapping from adopting zswap
> to save memory and improve performance where possible.
> 
> This patch adds the option to disable this behavior entirely: do not
> writeback to backing swapping device when a zswap store attempt fail,
> and do not write pages in the zswap pool back to the backing swap
> device (both when the pool is full, and when the new zswap shrinker is
> called).
> 
> This new behavior can be opted-in/out on a per-cgroup basis via a new
> cgroup file. By default, writebacks to swap device is enabled, which is
> the previous behavior. Initially, writeback is enabled for the root
> cgroup, and a newly created cgroup will inherit the current setting of
> its parent.
> 
> Note that this is subtly different from setting memory.swap.max to 0, as
> it still allows for pages to be stored in the zswap pool (which itself
> consumes swap space in its current form).
> 
> This patch should be applied on top of the zswap shrinker series:
> 
> https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> 
> as it also disables the zswap shrinker, a major source of zswap
> writebacks.
> 
> ...
>
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,12 @@ attribute, e. g.::
>  
>  Setting this parameter to 100 will disable the hysteresis.
>  
> +Some users cannot tolerate the swapping that comes with zswap store failures
> +and zswap writebacks. Swapping can be disabled entirely (without disabling
> +zswap itself) on a cgroup-basis as follows:
> +
> +	echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
> +

This does seem to be getting down into the weeds.  How would a user
know (or even suspect) that these things are happening to them?  Perhaps
it would be helpful to tell people where to go look to determine this.

Also, it would be quite helpful if the changelog were to give us some
idea of how important this tunable is.  What sort of throughput
differences might it cause, and under what circumstances?
  
Chris Li Dec. 8, 2023, 12:19 a.m. UTC | #3
Hi Nhat,


On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> During our experiment with zswap, we sometimes observe swap IOs due to
> occasional zswap store failures and writebacks-to-swap. These swapping
> IOs prevent many users who cannot tolerate swapping from adopting zswap
> to save memory and improve performance where possible.
>
> This patch adds the option to disable this behavior entirely: do not
> writeback to backing swapping device when a zswap store attempt fail,
> and do not write pages in the zswap pool back to the backing swap
> device (both when the pool is full, and when the new zswap shrinker is
> called).
>
> This new behavior can be opted-in/out on a per-cgroup basis via a new
> cgroup file. By default, writebacks to swap device is enabled, which is
> the previous behavior. Initially, writeback is enabled for the root
> cgroup, and a newly created cgroup will inherit the current setting of
> its parent.
>
> Note that this is subtly different from setting memory.swap.max to 0, as
> it still allows for pages to be stored in the zswap pool (which itself
> consumes swap space in its current form).
>
> This patch should be applied on top of the zswap shrinker series:
>
> https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
>
> as it also disables the zswap shrinker, a major source of zswap
> writebacks.

I am wondering about the status of the "memory.swap.tiers" proof-of-concept patch.
Are we still on board to have these two patches merged together somehow, so
that "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
memory.zswap.writeback == 1 and memory.zswap.writeback == 0 cases?

Thanks

Chris

>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 12 ++++++++
>  Documentation/admin-guide/mm/zswap.rst  |  6 ++++
>  include/linux/memcontrol.h              | 12 ++++++++
>  include/linux/zswap.h                   |  6 ++++
>  mm/memcontrol.c                         | 38 +++++++++++++++++++++++++
>  mm/page_io.c                            |  6 ++++
>  mm/shmem.c                              |  3 +-
>  mm/zswap.c                              | 13 +++++++--
>  8 files changed, 92 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..2b4ac43efdc8 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1679,6 +1679,18 @@ PAGE_SIZE multiple when read back.
>         limit, it will refuse to take any more stores before existing
>         entries fault back in or are written out to disk.
>
> +  memory.zswap.writeback
> +       A read-write single value file. The default value is "1". The
> +       initial value of the root cgroup is 1, and when a new cgroup is
> +       created, it inherits the current value of its parent.
> +
> +       When this is set to 0, all swapping attempts to swapping devices
> +       are disabled. This included both zswap writebacks, and swapping due
> +       to zswap store failure.
> +
> +       Note that this is subtly different from setting memory.swap.max to
> +       0, as it still allows for pages to be written to the zswap pool.
> +
>    memory.pressure
>         A read-only nested-keyed file.
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 62fc244ec702..cfa653130346 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,12 @@ attribute, e. g.::
>
>  Setting this parameter to 100 will disable the hysteresis.
>
> +Some users cannot tolerate the swapping that comes with zswap store failures
> +and zswap writebacks. Swapping can be disabled entirely (without disabling
> +zswap itself) on a cgroup-basis as follows:
> +
> +       echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
> +
>  When there is a sizable amount of cold memory residing in the zswap pool, it
>  can be advantageous to proactively write these cold pages to swap and reclaim
>  the memory for other use cases. By default, the zswap shrinker is disabled.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 43b77363ab8e..5de775e6cdd9 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -219,6 +219,12 @@ struct mem_cgroup {
>
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         unsigned long zswap_max;
> +
> +       /*
> +        * Prevent pages from this memcg from being written back from zswap to
> +        * swap, and from being swapped out on zswap store failures.
> +        */
> +       bool zswap_writeback;
>  #endif
>
>         unsigned long soft_limit;
> @@ -1941,6 +1947,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
>  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
>  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
>  #else
>  static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  {
> @@ -1954,6 +1961,11 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
>                                              size_t size)
>  {
>  }
> +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       /* if zswap is disabled, do not block pages going to the swapping device */
> +       return true;
> +}
>  #endif
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 08c240e16a01..a78ceaf3a65e 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_swapoff(int type);
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
>  void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_page_swapin(struct page *page);
> +bool is_zswap_enabled(void);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -55,6 +56,11 @@ static inline void zswap_swapoff(int type) {}
>  static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
>  static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
>  static inline void zswap_page_swapin(struct page *page) {}
> +
> +static inline bool is_zswap_enabled(void)
> +{
> +       return false;
> +}
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7bc47316acb..ae8c62c7aa53 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5538,6 +5538,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>         WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         memcg->zswap_max = PAGE_COUNTER_MAX;
> +       WRITE_ONCE(memcg->zswap_writeback,
> +               !parent || READ_ONCE(parent->zswap_writeback));
>  #endif
>         page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
>         if (parent) {
> @@ -8174,6 +8176,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>         rcu_read_unlock();
>  }
>
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       /* if zswap is disabled, do not block pages going to the swapping device */
> +       return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback);
> +}
> +
>  static u64 zswap_current_read(struct cgroup_subsys_state *css,
>                               struct cftype *cft)
>  {
> @@ -8206,6 +8214,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +static int zswap_writeback_show(struct seq_file *m, void *v)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +       seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
> +       return 0;
> +}
> +
> +static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
> +                               char *buf, size_t nbytes, loff_t off)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       int zswap_writeback;
> +       ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
> +
> +       if (parse_ret)
> +               return parse_ret;
> +
> +       if (zswap_writeback != 0 && zswap_writeback != 1)
> +               return -EINVAL;
> +
> +       WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
> +       return nbytes;
> +}
> +
>  static struct cftype zswap_files[] = {
>         {
>                 .name = "zswap.current",
> @@ -8218,6 +8251,11 @@ static struct cftype zswap_files[] = {
>                 .seq_show = zswap_max_show,
>                 .write = zswap_max_write,
>         },
> +       {
> +               .name = "zswap.writeback",
> +               .seq_show = zswap_writeback_show,
> +               .write = zswap_writeback_write,
> +       },
>         { }     /* terminate */
>  };
>  #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..5e606f1aa2f6 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 folio_end_writeback(folio);
>                 return 0;
>         }
> +
> +       if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
> +               folio_mark_dirty(folio);
> +               return AOP_WRITEPAGE_ACTIVATE;
> +       }
> +
>         __swap_writepage(&folio->page, wbc);
>         return 0;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c62f904ba1ca..dd084fbafcf1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>
>                 mutex_unlock(&shmem_swaplist_mutex);
>                 BUG_ON(folio_mapped(folio));
> -               swap_writepage(&folio->page, wbc);
> -               return 0;
> +               return swap_writepage(&folio->page, wbc);
>         }
>
>         mutex_unlock(&shmem_swaplist_mutex);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index daaa949837f2..7ee54a3d8281 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -153,6 +153,11 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +bool is_zswap_enabled(void)
> +{
> +       return zswap_enabled;
> +}
> +
>  /*********************************
>  * data structures
>  **********************************/
> @@ -596,7 +601,8 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>         struct zswap_pool *pool = shrinker->private_data;
>         bool encountered_page_in_swapcache = false;
>
> -       if (!zswap_shrinker_enabled) {
> +       if (!zswap_shrinker_enabled ||
> +                       !mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
>                 sc->nr_scanned = 0;
>                 return SHRINK_STOP;
>         }
> @@ -637,7 +643,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>         unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>
> -       if (!zswap_shrinker_enabled)
> +       if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
>                 return 0;
>
>  #ifdef CONFIG_MEMCG_KMEM
> @@ -956,6 +962,9 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>         struct zswap_pool *pool;
>         int nid, shrunk = 0;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +               return -EINVAL;
> +
>         /*
>          * Skip zombies because their LRUs are reparented and we would be
>          * reclaiming from the parent instead of the dead memcg.
>
> base-commit: cdcab2d34f129f593c0afbb2493bcaf41f4acd61
> --
> 2.34.1
>
  
Nhat Pham Dec. 8, 2023, 12:42 a.m. UTC | #4
>
> This does seem to be getting down into the weeds.  How would a user
> know (or even suspect) that these things are happening to them?  Perhaps
> it would be helpful to tell people where to go look to determine this.

When I test this feature during its development, I primarily just look
at the swapin/major fault counters to see if I'm experiencing swapping
IO and, when writeback is disabled, whether the IO is still there. We can
also poll these counters over time and plot them/compute their rate of
change. I just assumed this is standard practice, and not
very zswap-specific in general, so I did not spell it out in the zswap
documentation.
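
For example, a sketch of this kind of monitoring (the counter names are the
standard /proc/vmstat and cgroup v2 memory.stat fields; <cgroup-name> is a
placeholder):

    # system-wide swap IO to the backing device
    grep -E '^(pswpin|pswpout)' /proc/vmstat

    # per-cgroup major faults and zswap load/store activity
    grep -E '^(pgmajfault|zswpin|zswpout)' \
        /sys/fs/cgroup/<cgroup-name>/memory.stat

    # sample over time to estimate the rate of change
    while sleep 10; do date; grep -E '^(pswpin|pswpout)' /proc/vmstat; done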

>
> Also, it would be quite helpful of the changelog were to give us some
> idea of how important this tunable is.  What sort of throughput
> differences might it cause and under what circumstances?

For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who
are using servers with really slow disk performance (for instance,
massive but slow HDDs). For these folks, it's impossible to convince
them to even entertain zswap if swapping also comes as a package
deal. Writeback disabling is quite a useful feature in these
situations - on a mixed-workload deployment, they can disable
writeback for the more IO-sensitive workloads, and enable writeback
for other background workloads.

(Maybe we should include the paragraph above as part of the changelog?)

I don't have any concrete numbers though - any numbers I can pull out
are from highly artificial tasks that only serve to test the
correctness of the implementation. zswap.writeback disablement
would of course be faster in these situations (up to 33%!!!!) - but
that's basically just saying the HDD is slow, which is not very
informative or surprising, so I did not include it in the changelog.
  
Nhat Pham Dec. 8, 2023, 1:03 a.m. UTC | #5
On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Nhat,
>
>
> On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > During our experiment with zswap, we sometimes observe swap IOs due to
> > occasional zswap store failures and writebacks-to-swap. These swapping
> > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > to save memory and improve performance where possible.
> >
> > This patch adds the option to disable this behavior entirely: do not
> > writeback to backing swapping device when a zswap store attempt fail,
> > and do not write pages in the zswap pool back to the backing swap
> > device (both when the pool is full, and when the new zswap shrinker is
> > called).
> >
> > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > cgroup file. By default, writebacks to swap device is enabled, which is
> > the previous behavior. Initially, writeback is enabled for the root
> > cgroup, and a newly created cgroup will inherit the current setting of
> > its parent.
> >
> > Note that this is subtly different from setting memory.swap.max to 0, as
> > it still allows for pages to be stored in the zswap pool (which itself
> > consumes swap space in its current form).
> >
> > This patch should be applied on top of the zswap shrinker series:
> >
> > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> >
> > as it also disables the zswap shrinker, a major source of zswap
> > writebacks.
>
> I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> Are we still on board to have this two patch merge together somehow so
> we can have
> "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
>
> Thanks
>
> Chris
>

Hi Chris,

I briefly summarized my recent discussion with Johannes here:

https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/

TL;DR is we acknowledge the potential usefulness of a swap.tiers
interface, but the use case is not quite there yet, so it does not
make too much sense to build up that heavy machinery now.
zswap.writeback is a more urgent need, and does not prevent swap.tiers
if we do decide to implement it.
  
Yosry Ahmed Dec. 8, 2023, 1:12 a.m. UTC | #6
On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Nhat,
> >
> >
> > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > During our experiment with zswap, we sometimes observe swap IOs due to
> > > occasional zswap store failures and writebacks-to-swap. These swapping
> > > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > > to save memory and improve performance where possible.
> > >
> > > This patch adds the option to disable this behavior entirely: do not
> > > writeback to backing swapping device when a zswap store attempt fail,
> > > and do not write pages in the zswap pool back to the backing swap
> > > device (both when the pool is full, and when the new zswap shrinker is
> > > called).
> > >
> > > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > > cgroup file. By default, writebacks to swap device is enabled, which is
> > > the previous behavior. Initially, writeback is enabled for the root
> > > cgroup, and a newly created cgroup will inherit the current setting of
> > > its parent.
> > >
> > > Note that this is subtly different from setting memory.swap.max to 0, as
> > > it still allows for pages to be stored in the zswap pool (which itself
> > > consumes swap space in its current form).
> > >
> > > This patch should be applied on top of the zswap shrinker series:
> > >
> > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> > >
> > > as it also disables the zswap shrinker, a major source of zswap
> > > writebacks.
> >
> > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > Are we still on board to have this two patch merge together somehow so
> > we can have
> > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> >
> > Thanks
> >
> > Chris
> >
>
> Hi Chris,
>
> I briefly summarized my recent discussion with Johannes here:
>
> https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/
>
> TL;DR is we acknowledge the potential usefulness of swap.tiers
> interface, but the use case is not quite there yet, so it does not
> make too much sense to build up that heavy machinery now.
> zswap.writeback is a more urgent need, and does not prevent swap.tiers
> if we do decide to implement it.

I am honestly not convinced by this. There is no heavy machinery here.
The interface is more generic and extensible, but the implementation
is roughly the same. Unless we have a reason to think a swap.tiers
interface may make it difficult to extend this later or will not
support some use cases, I think we should go ahead with it. If we are
worried that "tiers" may not accurately describe future use cases, we
can be more generic and call it swap.types or something.
  
Nhat Pham Dec. 8, 2023, 1:14 a.m. UTC | #7
On Thu, Dec 7, 2023 at 4:42 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
[..]
>
> I don't have any concrete numbers though - any numbers I can pull out
> are from highly artificial tasks that only serve to test the
> correctness aspect of the implementation. zswap.writeback disablement
> would of course be faster in these situations (up to 33%!!!!) - but
> that's basically just saying HDD is slow. Which is not very
> informative or surprising, so I did not include it in the changelog.

For instance, on a server with an HDD, I allocate memory and populate
it with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memory and just read through it a couple of times
(doing silly things like computing the values' average etc.):

zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out

zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out

(the last two lines are from vmstat -s).
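
A rough sketch of how a test along these lines could be set up (not the exact
commands used; the cgroup name "test", the sizes, and the "memory-hog" program
are made up for illustration):

    # cgroup with memory.high low enough to trigger reclaim
    mkdir /sys/fs/cgroup/test
    echo 1G > /sys/fs/cgroup/test/memory.high

    # toggle writeback for the run under test (0 or 1)
    echo 0 > /sys/fs/cgroup/test/memory.zswap.writeback

    # run the workload in the cgroup: allocate more than memory.high,
    # fill it with random (incompressible) data, read it back a few
    # times (memory-hog is a stand-in for whatever allocator is used)
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    time ./memory-hog --size 4G --random --read-passes 3

    # swap IO counters, as reported above
    vmstat -s | grep 'pages swapped'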
  
Johannes Weiner Dec. 8, 2023, 4:34 p.m. UTC | #8
On Thu, Dec 07, 2023 at 05:12:13PM -0800, Yosry Ahmed wrote:
> On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote:
> > > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > > Are we still on board to have this two patch merge together somehow so
> > > we can have
> > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> > >
> > > Thanks
> > >
> > > Chris
> > >
> >
> > Hi Chris,
> >
> > I briefly summarized my recent discussion with Johannes here:
> >
> > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/
> >
> > TL;DR is we acknowledge the potential usefulness of swap.tiers
> > interface, but the use case is not quite there yet, so it does not
> > make too much sense to build up that heavy machinery now.
> > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > if we do decide to implement it.
> 
> I am honestly not convinced by this. There is no heavy machinery here.
> The interface is more generic and extensible, but the implementation
> is roughly the same. Unless we have a reason to think a swap.tiers
> interface may make it difficult to extend this later or will not
> support some use cases, I think we should go ahead with it. If we are
> worried that "tiers" may not accurately describe future use cases, we
> can be more generic and call it swap.types or something.

I have to disagree. The generic swap types or tiers ideas actually
look pretty far-fetched to me, and there is a lack of convincing
explanation for why this is even a probable direction for swap.

For example,

1. What are the other backends? Where you seem to see a multitude of
   backends and arbitrary hierarchies of them, I see compression and
   flash, and really not much else. And there is only one reasonable
   direction in which to combine those two.

   The IOPs and latencies of HDDs and network compared to modern
   memory sizes and compute speeds make them for the most part
   impractical as paging backends.

   So I don't see a common third swap backend, let alone a fourth or a
   fifth, or a multitude of meaningful ways of combining them...

2. Even if the usecases were there, enabling this would be a ton of
   work and open interface questions:

  1) There is no generic code to transfer pages between arbitrary
     backends.

  2) There is no accepted indirection model where a swap pte can refer
     to backends dynamically, in a way that makes migration feasible
     at scale.

  3) Arbitrary global strings are somewhat unlikely to be accepted as
     a way to configure these hierarchies.

  4) Backend file paths in a global sysfs file don't work well with
     namespacing. The swapfile could be in a container
     namespace. Containers are not guaranteed to see /sys.

  5) Fixed keywords like "zswap" might not be good enough - what about
     compression and backend parameters?

None of these are insurmountable. My point is that this would be a
huge amount of prerequisite code and effort for what would seem to be
a fringe usecase at best right now.

And there could be a lot of curve balls in both the software design as
well as the hardware development between now and then that could make
your proposals moot. Is a per-cgroup string file really going to be
the right way to configure arbitrary hierarchies if they materialize?

This strikes me as premature and speculative, for what could be, some
day.

We don't even do it for *internal API*. There is a review rule to
introduce a function in the same patch as its first caller, to make
sure it's the right abstraction and a good fit for the usecase. There
is no way we can have a lower bar than that for permanent ABI.

The patch here integrates with what zswap is NOW and always has been:
a compressing writeback cache for swap.

Should multiple swap tiers overcome all the above and actually become
real, this knob here would be the least of our worries. It would be
easy to just ignore, automatically override, or deprecate.

So I don't think you made a reasonable proposal for an alternative, or
gave convincing reasons to hold off this one.
  
Andrew Morton Dec. 8, 2023, 7:57 p.m. UTC | #9
On Thu, 7 Dec 2023 16:42:59 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> >
> > Also, it would be quite helpful of the changelog were to give us some
> > idea of how important this tunable is.  What sort of throughput
> > differences might it cause and under what circumstances?
> 
> For the most part, this feature is motivated by internal parties who
> have already established their opinions regarding swapping - the
> workloads that are highly sensitive to IO, and especially those who
> are using servers with really slow disk performance (for instance,
> massive but slow HDDs). For these folks, it's impossible to convince
> them to even entertain zswap if swapping also comes as a packaged
> deal. Writeback disabling is quite a useful feature in these
> situations - on a mixed workloads deployment, they can disable
> writeback for the more IO-sensitive workloads, and enable writeback
> for other background workloads.
> 
> (Maybe we should include the paragraph above as part of the changelog?)

I pasted it in, thanks.
  
Andrew Morton Dec. 8, 2023, 7:58 p.m. UTC | #10
On Thu, 7 Dec 2023 17:14:22 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> > I don't have any concrete numbers though - any numbers I can pull out
> > are from highly artificial tasks that only serve to test the
> > correctness aspect of the implementation. zswap.writeback disablement
> > would of course be faster in these situations (up to 33%!!!!) - but
> > that's basically just saying HDD is slow. Which is not very
> > informative or surprising, so I did not include it in the changelog.
> 
> For instance, on a server with HDD, I allocate memories and populate
> them with random values (so that zswap store will always fail), and
> specify memory.high low enough to trigger reclaim. The time it takes
> to allocate the memories and just read through it a couple of times
> (doing silly things like computing the values' average etc.):
> 
> zswap.writeback disabled:
> real 0m30.537s
> user 0m23.687s
> sys 0m6.637s
> 0 pages swapped in
> 0 pages swapped out
> 
> zswap.writeback enabled:
> real 0m45.061s
> user 0m24.310s
> sys 0m8.892s
> 712686 pages swapped in
> 461093 pages swapped out
> 
> (the last two lines are from vmstat -s).

I pasted that also.
  
Yosry Ahmed Dec. 8, 2023, 8:08 p.m. UTC | #11
On Fri, Dec 8, 2023 at 8:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 07, 2023 at 05:12:13PM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote:
> > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > > > Are we still on board to have this two patch merge together somehow so
> > > > we can have
> > > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> > > >
> > > > Thanks
> > > >
> > > > Chris
> > > >
> > >
> > > Hi Chris,
> > >
> > > I briefly summarized my recent discussion with Johannes here:
> > >
> > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/
> > >
> > > TL;DR is we acknowledge the potential usefulness of swap.tiers
> > > interface, but the use case is not quite there yet, so it does not
> > > make too much sense to build up that heavy machinery now.
> > > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > > if we do decide to implement it.
> >
> > I am honestly not convinced by this. There is no heavy machinery here.
> > The interface is more generic and extensible, but the implementation
> > is roughly the same. Unless we have a reason to think a swap.tiers
> > interface may make it difficult to extend this later or will not
> > support some use cases, I think we should go ahead with it. If we are
> > worried that "tiers" may not accurately describe future use cases, we
> > can be more generic and call it swap.types or something.
>
> I have to disagree. The generic swap types or tiers ideas actually
> look pretty far-fetched to me, and there is a lack of convincing
> explanation for why this is even a probable direction for swap.
>
> For example,
>
> 1. What are the other backends? Where you seem to see a multitude of
>    backends and arbitrary hierarchies of them, I see compression and
>    flash, and really not much else. And there is only one reasonable
>    direction in which to combine those two.
>
>    The IOPs and latencies of HDDs and network compared to modern
>    memory sizes and compute speeds make them for the most part
>    impractical as paging backends.
>
>    So I don't see a common third swap backend, let alone a fourth or a
>    fifth, or a multitude of meaningful ways of combining them...
>
> 2. Even if the usecases were there, enabling this would be a ton of
>    work and open interface questions:
>
>   1) There is no generic code to transfer pages between arbitrary
>      backends.
>
>   2) There is no accepted indirection model where a swap pte can refer
>      to backends dynamically, in a way that makes migration feasible
>      at scale.
>
>   3) Arbitrary global strings are somewhat unlikely to be accepted as
>      a way to configure these hierarchies.
>
>   4) Backend file paths in a global sysfs file don't work well with
>      namespacing. The swapfile could be in a container
>      namespace. Containers are not guaranteed to see /sys.
>
>   5) Fixed keywords like "zswap" might not be good enough - what about
>      compression and backend parameters?
>
> None of these are insurmountable. My point is that this would be a
> huge amount of prerequisite code and effort for what seems would be a
> fringe usecase at best right now.
>
> And there could be a lot of curve balls in both the software design as
> well as the hardware development between now and then that could make
> your proposals moot. Is a per-cgroup string file really going to be
> the right way to configure arbitrary hierarchies if they materialize?
>
> This strikes me as premature and speculative, for what could be, some
> day.

It is, and I agree with all your reasoning above. My thinking was that
if we create the file now, and we end up only supporting "all" and
"zswap" values (equivalent to memory.zswap.writeback), that's fine,
but if we can find a way to extend it, that would be nice. I honestly
didn't think that "zswap" as a keyword may be a problem later on.

>
> We don't even do it for *internal API*. There is a review rule to
> introduce a function in the same patch as its first caller, to make
> sure it's the right abstraction and a good fit for the usecase. There
> is no way we can have a lower bar than that for permanent ABI.
>
> The patch here integrates with what zswap is NOW and always has been:
> a compressing writeback cache for swap.
>
> Should multiple swap tiers overcome all the above and actually become
> real, this knob here would be the least of our worries. It would be
> easy to just ignore, automatically override, or deprecate.

My main concern was having two knobs with overlapping functionalities
at that point, and we would need to handle invalid combinations, or
make one of the knobs superior to the other one, which is not ideal.

Having said that, you are right that we are probably a long ways from
supporting more swap types anyway, and having two knobs wouldn't be a
blocker then, only a nuisance that we need to take into consideration.

I am fine with memory.zswap.writeback being merged as-is (as I
mentioned earlier), it would have just been nice if we could agree on
a more generic interface, but you thoroughly proved that it is too
early to do that.

Thanks!
  
Chris Li Dec. 8, 2023, 11:55 p.m. UTC | #12
Hi Nhat,

On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Nhat,
> >
> >
> > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > During our experiment with zswap, we sometimes observe swap IOs due to
> > > occasional zswap store failures and writebacks-to-swap. These swapping
> > > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > > to save memory and improve performance where possible.
> > >
> > > This patch adds the option to disable this behavior entirely: do not
> > > writeback to backing swapping device when a zswap store attempt fail,
> > > and do not write pages in the zswap pool back to the backing swap
> > > device (both when the pool is full, and when the new zswap shrinker is
> > > called).
> > >
> > > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > > cgroup file. By default, writebacks to swap device is enabled, which is
> > > the previous behavior. Initially, writeback is enabled for the root
> > > cgroup, and a newly created cgroup will inherit the current setting of
> > > its parent.
> > >
> > > Note that this is subtly different from setting memory.swap.max to 0, as
> > > it still allows for pages to be stored in the zswap pool (which itself
> > > consumes swap space in its current form).
> > >
> > > This patch should be applied on top of the zswap shrinker series:
> > >
> > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> > >
> > > as it also disables the zswap shrinker, a major source of zswap
> > > writebacks.
> >
> > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > Are we still on board to have this two patch merge together somehow so
> > we can have
> > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> >
> > Thanks
> >
> > Chris
> >
>
> Hi Chris,
>
> I briefly summarized my recent discussion with Johannes here:
>
> https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/

Sorry, I am traveling in a different time zone, so I was not able to get
to that email sooner. That email was only sent out less than one day
before the v6 patch, right?

>
> TL;DR is we acknowledge the potential usefulness of swap.tiers
> interface, but the use case is not quite there yet, so it does not

I disagree about there being no use case. No use case for Meta != no use
case for the rest of the Linux kernel community. That mindset really needs
to shift to do Linux kernel development. Respect others' use cases.
It is not just Meta's Linux kernel. It is everybody's Linux kernel.

I can give you three use cases right now:
1) The Google production kernel uses SSD-only swap; it is currently being
piloted. This is not expressible with memory.zswap.writeback. You can
set memory.zswap.max = 0 and memory.zswap.writeback = 1 to get an
SSD-backed swapfile, but the whole thing feels very clunky - especially
when what you really want is SSD-only swap, you have to do all this zswap
config dance. Google has an internal memory.swapfile feature that
implements a per-cgroup swap file type: "zswap only", "real swap file
only", "both", "none" (the exact keywords might be different). It has
been running in production for almost 10 years. The need for more than a
zswap type of per-cgroup control is really there.

2) As indicated by this discussion, Tencent has a use case for SSD
and hard disk swap as overflow.
https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
+Kairui

3) Android has some fancy swap ideas, led by these patches:
https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
They got shot down due to the removal of frontswap, but the use case and
product requirement are there.
+Minchan

> make too much sense to build up that heavy machinery now.

Does my minimal memory.swap.tiers patch to support "zswap" and "all"
sound like heavy machinery to you?
It is the same implementation under the hood. I am only trying to
avoid introducing something that will foreseeably become obsolete.

> zswap.writeback is a more urgent need, and does not prevent swap.tiers
> if we do decide to implement it.

I respect that urgent need; that is why I Acked the V5 patch, under
the understanding that this zswap.writeback is not carved into stone.
When a better interface comes along, this one can be made obsolete.
Frankly speaking, I would much prefer not to introduce a cgroup API
that will be obsolete soon.

If you think zswap.writeback is not removable when a better
alternative is available, please voice it now.

If you squash in my minimal memory.swap.tiers patch, it will also address
your urgent need for merging "zswap.writeback", no?


Chris
  
Chris Li Dec. 9, 2023, 12:09 a.m. UTC | #13
Hi Yosry,

On Thu, Dec 7, 2023 at 5:12 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > I briefly summarized my recent discussion with Johannes here:
> >
> > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/
> >
> > TL;DR is we acknowledge the potential usefulness of swap.tiers
> > interface, but the use case is not quite there yet, so it does not
> > make too much sense to build up that heavy machinery now.
> > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > if we do decide to implement it.
>
> I am honestly not convinced by this. There is no heavy machinery here.
> The interface is more generic and extensible, but the implementation
> is roughly the same. Unless we have a reason to think a swap.tiers
> interface may make it difficult to extend this later or will not
> support some use cases, I think we should go ahead with it. If we are
> worried that "tiers" may not accurately describe future use cases, we
> can be more generic and call it swap.types or something.
>
+100.

Chris
  
Chris Li Dec. 9, 2023, 2:02 a.m. UTC | #14
Hi Johannes,

On Fri, Dec 8, 2023 at 8:35 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 07, 2023 at 05:12:13PM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote:
> > > > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > > > Are we still on board to have this two patch merge together somehow so
> > > > we can have
> > > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> > > >
> > > > Thanks
> > > >
> > > > Chris
> > > >
> > >
> > > Hi Chris,
> > >
> > > I briefly summarized my recent discussion with Johannes here:
> > >
> > > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/
> > >
> > > TL;DR is we acknowledge the potential usefulness of swap.tiers
> > > interface, but the use case is not quite there yet, so it does not
> > > make too much sense to build up that heavy machinery now.
> > > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > > if we do decide to implement it.
> >
> > I am honestly not convinced by this. There is no heavy machinery here.
> > The interface is more generic and extensible, but the implementation
> > is roughly the same. Unless we have a reason to think a swap.tiers
> > interface may make it difficult to extend this later or will not
> > support some use cases, I think we should go ahead with it. If we are
> > worried that "tiers" may not accurately describe future use cases, we
> > can be more generic and call it swap.types or something.
>
> I have to disagree. The generic swap types or tiers ideas actually
> look pretty far-fetched to me, and there is a lack of convincing
> explanation for why this is even a probable direction for swap.

It boils down to there being more than just "zswap + SSD" use cases
in other parts of the Linux community.
The need is real, and it is just a question of how to get there.

>
> For example,
>
> 1. What are the other backends? Where you seem to see a multitude of
>    backends and arbitrary hierarchies of them, I see compression and
>    flash, and really not much else. And there is only one reasonable
>    direction in which to combine those two.

I listed a few other use cases in an earlier email in the same thread:
https://lore.kernel.org/linux-mm/CAF8kJuNpnqTM5x1QmQ7h-FaRWVnHBdNGvGvB3txohSOmZhYA-Q@mail.gmail.com/T/#t
TL;DR:
1) Google has had an internal memory.swapfile in production for almost 10 years.
2) Tencent uses hard drives as SSD swap overflow. +Kairui
https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
3) Android has more fancy swap usage. +Minchan
https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/

Not being able to imagine such a usage yourself is not a reason to block
others from it. Please respect other use cases as well.

As for the other backends, the first minimal milestone we can
implement is "all" and "zswap", which is functionally equivalent to
memory.zswap.writeback.
Other common backends are SSD and hard drive. The exact
keywords and tiers are up for future discussion. I just want to
acknowledge that the need is there.

>
>    The IOPs and latencies of HDDs and network compared to modern
>    memory sizes and compute speeds make them for the most part
>    impractical as paging backends.

I don't see that being a problem for lower tiers holding very cold swap.
Again, it might not be a use case for Meta, but please respect the
use cases of others.

>
>    So I don't see a common third swap backend, let alone a fourth or a
>    fifth, or a multitude of meaningful ways of combining them...

That does not stop others from wanting to use them.

>
> 2. Even if the usecases were there, enabling this would be a ton of
>    work and open interface questions:

If we can agree this interface is more flexible and covers more use
cases than "zswap.writeback", we can start from the minimal
implementation of "zswap" and "all".
There is not a ton of work in the first minimal milestone. I sent out
the minimal patch here:
https://lore.kernel.org/linux-mm/ZVrHXJLxvs4_CUxc@google.com/

>
>   1) There is no generic code to transfer pages between arbitrary
>      backends.

True, but it does not have to be there to make "swap.tiers" useful. It
can start without transferring pages between backends. It is just a more
flexible way to specify which swap backends the cgroup wants to opt into
initially, and that is a solid need. Everything else can be done later.

>
>   2) There is no accepted indirection model where a swap pte can refer
>      to backends dynamically, in a way that makes migration feasible
>      at scale.

Same as above.

>
>   3) Arbitrary global strings are somewhat unlikely to be accepted as
>      a way to configure these hierarchies.

It does not need to be an arbitrary string. We have a string to configure
cgroup.subtree_control, for example. I am not saying we should just borrow
the subtree_control syntax, but we can do something similar to that.

>
>   4) Backend file paths in a global sysfs file don't work well with
>      namespacing. The swapfile could be in a container
>      namespace. Containers are not guaranteed to see /sys.

Can you clarify what usage problem you are trying to solve? Containers
are typically managed by the container manager's service, and the service
sees /sys for sure. Do you want a cgroup interface that allows the
container itself to self-service the swap file?

>
>   5) Fixed keywords like "zswap" might not be good enough - what about
>      compression and backend parameters?

I think that level of detail is zswap/zpool specific; it belongs in
/sys/kernel/mm/zswap/foo_bar, for example.
We can discuss the usage details; nothing is set in stone yet.

>
> None of these are insurmountable. My point is that this would be a
> huge amount of prerequisite code and effort for what seems would be a
> fringe usecase at best right now.

For the fringe usage, a minimal patch exists. It is not a huge amount of
prerequisite code.
What you describe is closer to the end goal of swap tiers.

>
> And there could be a lot of curve balls in both the software design as
> well as the hardware development between now and then that could make
> your proposals moot. Is a per-cgroup string file really going to be
> the right way to configure arbitrary hierarchies if they materialize?
>
> This strikes me as premature and speculative, for what could be, some
> day.

"swap.tiers" in my mind is strictly better than "zswap.writeback"
because writeback has only 0 or 1 value. It is not able to describe
other swap selections. Setting zswap.writeback = 0 will disable SSD
swap as well, that is not very intuitive. If we want SSD only swap, we
need to set "zswap.writeback" = 1 and "zswap.max" = 0. All this makes
me feel that we are limiting ourselves too much from the cubical of
zswap to look at the rest of the world.
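
For comparison, the two configurations being discussed, as a sketch using
the knobs from this patch (<cgroup> is a placeholder):

    # zswap only, no writeback: pages stay in the zswap pool and never
    # fall back to the backing swap device
    echo 0 > /sys/fs/cgroup/<cgroup>/memory.zswap.writeback

    # SSD-only swap: bypass zswap entirely, go straight to the swap device
    echo 0 > /sys/fs/cgroup/<cgroup>/memory.zswap.max
    echo 1 > /sys/fs/cgroup/<cgroup>/memory.zswap.writeback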

>
> We don't even do it for *internal API*. There is a review rule to
> introduce a function in the same patch as its first caller, to make
> sure it's the right abstraction and a good fit for the usecase. There
> is no way we can have a lower bar than that for permanent ABI.
>
> The patch here integrates with what zswap is NOW and always has been:
> a compressing writeback cache for swap.
>
> Should multiple swap tiers overcome all the above and actually become
> real, this knob here would be the least of our worries. It would be
> easy to just ignore, automatically override, or deprecate.
>
> So I don't think you made a reasonable proposal for an alternative, or
> gave convincing reasons to hold off this one.
>
Keep in mind that the minimal patch is just trying to avoid the detour
of introducing zswap.writeback only to obsolete it soon. There is a
solid need for swap backends other than zswap. Frankly speaking, I don't
see writing "all" to "swap.tiers" as that different from
writing "1" to "zswap.writeback". From the API point of view, one is
more flexible and future-proof.

I just don't want "zswap.writeback" to become something carved into
stone that we can't remove later, especially if we can have an
alternative API compatible with other use cases.

Chris
  
Johannes Weiner Dec. 9, 2023, 3:42 a.m. UTC | #15
On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> I can give you three usage cases right now:
> 1) Google producting kernel uses SSD only swap, it is currently on
> pilot. This is not expressible by the memory.zswap.writeback. You can
> set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> backed swapfile. But the whole thing feels very clunky, especially
> what you really want is SSD only swap, you need to do all this zswap
> config dance. Google has an internal memory.swapfile feature
> implemented per cgroup swap file type by "zswap only", "real swap file
> only", "both", "none" (the exact keyword might be different). running
> in the production for almost 10 years. The need for more than zswap
> type of per cgroup control is really there.

We use regular swap on SSD without zswap just fine. Of course it's
expressible.

On dedicated systems, zswap is disabled in sysfs. On shared hosts
where it's determined based on which workload is scheduled, zswap is
generally enabled through sysfs, and individual cgroup access is
controlled via memory.zswap.max - which is what this knob is for.

This is analogous to enabling swap globally, and then opting
individual cgroups in and out with memory.swap.max.

So this usecase is very much already supported, and it's expressed in
a way that's pretty natural for how cgroups express access and lack of
access to certain resources.
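
As a sketch of that existing setup (standard sysfs and cgroup v2 paths;
the cgroup names and swap device are illustrative):

    # enable zswap globally
    echo Y > /sys/module/zswap/parameters/enabled

    # opt an individual cgroup out of zswap entirely
    echo 0 > /sys/fs/cgroup/batch-job/memory.zswap.max

    # analogous: enable swap globally, opt a cgroup out of swap
    swapon /dev/sdb1
    echo 0 > /sys/fs/cgroup/latency-sensitive/memory.swap.max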

I don't see how memory.swap.type or memory.swap.tiers would improve
this in any way. On the contrary, it would overlap and conflict with
existing controls to manage swap and zswap on a per-cgroup basis.

> 2) As indicated by this discussion, Tencent has a usage case for SSD
> and hard disk swap as overflow.
> https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> +Kairui

Multiple swap devices for round robin or with different priorities
aren't new, they have been supported for a very, very long time. So
far nobody has proposed to control the exact behavior on a per-cgroup
basis, and I didn't see anybody in this thread asking for it either.

So I don't see how this counts as an obvious and automatic usecase for
memory.swap.tiers.

> 3) Android has some fancy swap ideas led by those patches.
> https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> It got shot down due to removal of frontswap. But the usage case and
> product requirement is there.
> +Minchan

This looks like an optimization for zram to bypass the block layer and
hook directly into the swap code. Correct me if I'm wrong, but this
doesn't appear to have anything to do with per-cgroup backend control.

> > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > if we do decide to implement it.
> 
> I respect that urgent need, that is why I Ack on the V5 path, under
> the understanding that this zswap.writeback is not carved into stones.
> When a better interface comes alone, that interface can be obsolete.
> Frankly speaking I would much prefer not introducing the cgroup API
> which will be obsolete soon.
> 
> If you think zswap.writeback is not removable when another better
> alternative is available, please voice it now.
> 
> If you squash my minimal memory.swap.tiers patch, it will also address
> your urgent need for merging the "zswap.writeback", no?

We can always deprecate ABI if something better comes along.

However, it's quite bold to claim that memory.swap.tiers is the best
way to implement backend control on a per-cgroup basis, and that it'll
definitely be needed in the future. You might see this as a foregone
conclusion, but I very much doubt this.

Even if such a file were to show up, I'm not convinced it should even
include zswap as one of the tiers. Zswap isn't a regular swap backend,
it doesn't show up in /proc/swaps, it can't be a second tier, the way
it interacts with its backend file is very different than how two
swapfiles of different priorities interact with each other, it's
already controllable with memory.zswap.max, etc.

I'm open to discussing usecases and proposals for more fine-grained
per-cgroup backend control. We've had discussions about per-cgroup
swapfiles in the past. Cgroup parameters for swapon are another
thought. There are several options and many considerations. The
memory.swap.tiers idea is the newest, has probably had the least
amount of discussion among them, and looks the least convincing to me.

Let's work out the requirements first.

The "conflict" with memory.zswap.writeback is a red herring - it's no
more of a conflict than setting memory.swap.tiers to "zswap" or "all"
and then setting memory.zswap.max or memory.swap.max to 0.

So the notion that we have to rush in a minimal version of a MUCH
bigger concept, just to support zswap writeback disabling is
misguided. And then hope that this format works as the concept evolves
and real usecases materialize... There is no reason to take that risk.
  
Chris Li Dec. 9, 2023, 5:39 p.m. UTC | #16
On Fri, Dec 8, 2023 at 7:42 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > I can give you three usage cases right now:
> > 1) Google producting kernel uses SSD only swap, it is currently on
> > pilot. This is not expressible by the memory.zswap.writeback. You can
> > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> > backed swapfile. But the whole thing feels very clunky, especially
> > what you really want is SSD only swap, you need to do all this zswap
> > config dance. Google has an internal memory.swapfile feature
> > implemented per cgroup swap file type by "zswap only", "real swap file
> > only", "both", "none" (the exact keyword might be different). running
> > in the production for almost 10 years. The need for more than zswap
> > type of per cgroup control is really there.
>
> We use regular swap on SSD without zswap just fine. Of course it's
> expressible.
>
> On dedicated systems, zswap is disabled in sysfs. On shared hosts
> where it's determined based on which workload is scheduled, zswap is
> generally enabled through sysfs, and individual cgroup access is
> controlled via memory.zswap.max - which is what this knob is for.

The sysfs API is not per cgroup, right?

>
> This is analogous to enabling swap globally, and then opting
> individual cgroups in and out with memory.swap.max.

That is good to know. Just comparing notes: Google is still using
cgroup v1. There is a separate per-cgroup zswap switch controlling
whether zswap is enabled, as well as a swap file type switch to
select a ghost swap file or a real swap file.

> So this usecase is very much already supported, and it's expressed in
> a way that's pretty natural for how cgroups express access and lack of
> access to certain resources.
>
> I don't see how memory.swap.type or memory.swap.tiers would improve
> this in any way. On the contrary, it would overlap and conflict with
> existing controls to manage swap and zswap on a per-cgroup basis.

One minor point is that, with a per-cgroup list of swap
tiers/devices, we would not need to enter the zswap code at all to
find out that zswap is disabled. But that is a very minor point.

>
> > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > and hard disk swap as overflow.
> > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > +Kairui
>
> Multiple swap devices for round robin or with different priorities
> aren't new, they have been supported for a very, very long time. So
> far nobody has proposed to control the exact behavior on a per-cgroup
> basis, and I didn't see anybody in this thread asking for it either.
>
> So I don't see how this counts as an obvious and automatic usecase for
> memory.swap.tiers.

I assume Tencent would want to use per-cgroup control, but you have a
point that it is not automatic. It would be best to have Kairui
clarify whether Tencent wants per-cgroup control over SSD vs. hard
disk swap files.

>
> > 3) Android has some fancy swap ideas led by those patches.
> > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > It got shot down due to removal of frontswap. But the usage case and
> > product requirement is there.
> > +Minchan
>
> This looks like an optimization for zram to bypass the block layer and
> hook directly into the swap code. Correct me if I'm wrong, but this
> doesn't appear to have anything to do with per-cgroup backend control.

Not in that series, no.

>
> > > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > > if we do decide to implement it.
> >
> > I respect that urgent need, that is why I Ack on the V5 path, under
> > the understanding that this zswap.writeback is not carved into stones.
> > When a better interface comes alone, that interface can be obsolete.
> > Frankly speaking I would much prefer not introducing the cgroup API
> > which will be obsolete soon.
> >
> > If you think zswap.writeback is not removable when another better
> > alternative is available, please voice it now.
> >
> > If you squash my minimal memory.swap.tiers patch, it will also address
> > your urgent need for merging the "zswap.writeback", no?
>
> We can always deprecate ABI if something better comes along.

Thanks for the confirmation. That is what I wanted to hear from you:
memory.zswap.writeback is not set in stone if something better comes
along.

>
> However, it's quite bold to claim that memory.swap.tiers is the best
> way to implement backend control on a per-cgroup basis, and that it'll
> definitely be needed in the future. You might see this as a foregone
> conclusion, but I very much doubt this.

Ack.

>
> Even if such a file were to show up, I'm not convinced it should even
> include zswap as one of the tiers. Zswap isn't a regular swap backend,
> it doesn't show up in /proc/swaps, it can't be a second tier, the way
> it interacts with its backend file is very different than how two
> swapfiles of different priorities interact with each other, it's
> already controllable with memory.zswap.max, etc.

The rationale for separating out the swap tiers is not based on the
device; rather it depends on the swapin latency distribution, access
pattern, etc. In that regard, memory-based zswap has much lower
swapin latency and should be considered a separate tier/class from
SSD. Even though current zswap allocates swap slots from an SSD or
ghost swap file, I don't consider it an issue to list zswap as a
separate tier/class. At the end of the day, this is tied to the SLO
of the expected swapping performance.

>
> I'm open to discussing usecases and proposals for more fine-grained
> per-cgroup backend control. We've had discussions about per-cgroup
> swapfiles in the past. Cgroup parameters for swapon are another
> thought. There are several options and many considerations. The
> memory.swap.tiers idea is the newest, has probably had the least
> amount of discussion among them, and looks the least convincing to me.
>
> Let's work out the requirements first.

I agree with that. Let's lay out the requirements and have an open
discussion to cover each other's usage cases.

> The "conflict" with memory.zswap.writeback is a red herring - it's no
> more of a conflict than setting memory.swap.tiers to "zswap" or "all"
> and then setting memory.zswap.max or memory.swap.max to 0.

I am not sure I fully understand your point here. Can you elaborate?
I am thinking that memory.zswap.writeback and memory.swap.tiers both
try to describe which set of swap tiers the cgroup can select;
"swap.tiers" can additionally specify the order.
Setting memory.zswap.max or memory.swap.max to 0 just disables
zswap/swap. It does not specify the ordering. I am not sure what
"conflict" you are referring to.

>
> So the notion that we have to rush in a minimal version of a MUCH
> bigger concept, just to support zswap writeback disabling is
> misguided. And then hope that this format works as the concept evolves
> and real usecases materialize... There is no reason to take that risk.
>

I am convinced memory.swap.tiers is more general and more
straightforward to reason about. You obviously need more convincing,
which is fair, given that I haven't got much more than the minimal
patch. As long as it is not carved in stone and better things can be
developed, I am fine with that.

Chris
  
Kairui Song Dec. 11, 2023, 9:31 a.m. UTC | #17
Chris Li <chrisl@kernel.org> 于2023年12月9日周六 07:56写道:
>
> Hi Nhat,
>
> On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Thu, Dec 7, 2023 at 4:19 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > Hi Nhat,
> > >
> > >
> > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > > During our experiment with zswap, we sometimes observe swap IOs due to
> > > > occasional zswap store failures and writebacks-to-swap. These swapping
> > > > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > > > to save memory and improve performance where possible.
> > > >
> > > > This patch adds the option to disable this behavior entirely: do not
> > > > writeback to backing swapping device when a zswap store attempt fail,
> > > > and do not write pages in the zswap pool back to the backing swap
> > > > device (both when the pool is full, and when the new zswap shrinker is
> > > > called).
> > > >
> > > > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > > > cgroup file. By default, writebacks to swap device is enabled, which is
> > > > the previous behavior. Initially, writeback is enabled for the root
> > > > cgroup, and a newly created cgroup will inherit the current setting of
> > > > its parent.
> > > >
> > > > Note that this is subtly different from setting memory.swap.max to 0, as
> > > > it still allows for pages to be stored in the zswap pool (which itself
> > > > consumes swap space in its current form).
> > > >
> > > > This patch should be applied on top of the zswap shrinker series:
> > > >
> > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> > > >
> > > > as it also disables the zswap shrinker, a major source of zswap
> > > > writebacks.
> > >
> > > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > > Are we still on board to have this two patch merge together somehow so
> > > we can have
> > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> > >
> > > Thanks
> > >
> > > Chris
> > >
> >
> > Hi Chris,
> >
> > I briefly summarized my recent discussion with Johannes here:
> >
> > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/
>
> Sorry I am traveling in a different time zone so not able to get to
> that email sooner. That email is only sent out less than one day
> before the V6 patch right?
>
> >
> > TL;DR is we acknowledge the potential usefulness of swap.tiers
> > interface, but the use case is not quite there yet, so it does not
>
> I disagree about no use case. No use case for Meta != no usage case
> for the rest of the linux kernel community. That mindset really needs
> to shift to do Linux kernel development. Respect other's usage cases.
> It is not just Meta's Linux kernel. It is everybody's Linux kernel.
>
> I can give you three usage cases right now:
> 1) Google producting kernel uses SSD only swap, it is currently on
> pilot. This is not expressible by the memory.zswap.writeback. You can
> set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> backed swapfile. But the whole thing feels very clunky, especially
> what you really want is SSD only swap, you need to do all this zswap
> config dance. Google has an internal memory.swapfile feature
> implemented per cgroup swap file type by "zswap only", "real swap file
> only", "both", "none" (the exact keyword might be different). running
> in the production for almost 10 years. The need for more than zswap
> type of per cgroup control is really there.
>
> 2) As indicated by this discussion, Tencent has a usage case for SSD
> and hard disk swap as overflow.
> https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> +Kairui

Yes, we are not using zswap. We are using ZRAM for swap since we have
many different varieties of workload instances with a very flexible
storage setup. Some of them don't have the ability to set up a
swapfile. So we built a set of kernel infrastructure based on ZRAM,
which has worked pretty well so far.

The concern from some teams is that ZRAM (or zswap) can't always free
up memory, so it may lead to a higher risk of OOM compared to a
physical swap device, and they do have suitable devices for swap on
some of their machines. So secondary swap support is very helpful in
case of memory usage peaks.

Besides this, another requirement is that different containers may
have different priorities: some containers can tolerate high swap
overhead while some cannot, so swap tiering is useful for us in many
ways.

And thanks to cloud infrastructure, the disk setup can change from
time to time depending on workload requirements, so our requirement
is to support ZRAM (always) + SSD (optional) + HDD (also optional) as
swap backends, while not making things too complex to maintain.
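
(What the kernel already offers for this kind of layering, sketched
here with example device names, is plain swap priorities; the
per-cgroup selection and cold-entry migration pieces discussed in
this thread are the parts that do not exist yet:)

    modprobe zram num_devices=1
    echo 4G > /sys/block/zram0/disksize
    mkswap /dev/zram0 && swapon -p 100 /dev/zram0   # fastest, tried first
    mkswap /dev/sdb1  && swapon -p 50  /dev/sdb1    # SSD, optional
    mkswap /dev/sdc1  && swapon -p 10  /dev/sdc1    # HDD, optional overflow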

Currently we have implemented cgroup-based ZRAM compression algorithm
control, per-cgroup ZRAM accounting and limits, and an experimental
kernel worker that migrates cold swap entries from the high-priority
device to the low-priority device at very small scale (we lack the
basic mechanics to do this at large scale; however, due to the low
IOPS of the slow device and the fact that cold pages are rarely
accessed, this hasn't been too much of a problem so far, just kind of
ugly). The rest of the swapping (e.g. secondary swap when ZRAM is
full) depends on the kernel's native ability.

So far it works, though not in the best form; we need more patches to
make it work better (e.g. the swapin/readahead patch I sent
previously). Some of our design may also need to change in the long
term, and we also want a well-built interface and kernel mechanics to
manage multi-tier swap. I'm very willing to talk and collaborate on
this.
  
Minchan Kim Dec. 11, 2023, 10:55 p.m. UTC | #18
On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote:
> On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > I can give you three usage cases right now:
> > 1) Google producting kernel uses SSD only swap, it is currently on
> > pilot. This is not expressible by the memory.zswap.writeback. You can
> > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> > backed swapfile. But the whole thing feels very clunky, especially
> > what you really want is SSD only swap, you need to do all this zswap
> > config dance. Google has an internal memory.swapfile feature
> > implemented per cgroup swap file type by "zswap only", "real swap file
> > only", "both", "none" (the exact keyword might be different). running
> > in the production for almost 10 years. The need for more than zswap
> > type of per cgroup control is really there.
> 
> We use regular swap on SSD without zswap just fine. Of course it's
> expressible.
> 
> On dedicated systems, zswap is disabled in sysfs. On shared hosts
> where it's determined based on which workload is scheduled, zswap is
> generally enabled through sysfs, and individual cgroup access is
> controlled via memory.zswap.max - which is what this knob is for.
> 
> This is analogous to enabling swap globally, and then opting
> individual cgroups in and out with memory.swap.max.
> 
> So this usecase is very much already supported, and it's expressed in
> a way that's pretty natural for how cgroups express access and lack of
> access to certain resources.
> 
> I don't see how memory.swap.type or memory.swap.tiers would improve
> this in any way. On the contrary, it would overlap and conflict with
> existing controls to manage swap and zswap on a per-cgroup basis.
> 
> > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > and hard disk swap as overflow.
> > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > +Kairui
> 
> Multiple swap devices for round robin or with different priorities
> aren't new, they have been supported for a very, very long time. So
> far nobody has proposed to control the exact behavior on a per-cgroup
> basis, and I didn't see anybody in this thread asking for it either.
> 
> So I don't see how this counts as an obvious and automatic usecase for
> memory.swap.tiers.
> 
> > 3) Android has some fancy swap ideas led by those patches.
> > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > It got shot down due to removal of frontswap. But the usage case and
> > product requirement is there.
> > +Minchan
> 
> This looks like an optimization for zram to bypass the block layer and
> hook directly into the swap code. Correct me if I'm wrong, but this
> doesn't appear to have anything to do with per-cgroup backend control.

Hi Johannes,

I haven't been following the thread closely, but I noticed the discussion
about potential use cases for zram with memcg.

One interesting idea I have is to implement a swap controller per cgroup.
This would allow us to tailor the zram swap behavior to the specific needs of
different groups.

For example, Group A, which is sensitive to swap latency, could use zram swap
with a fast compression setting, even if it sacrifices some compression ratio.
This would prioritize quick access to swapped data, even if it takes up more space.

On the other hand, Group B, which can tolerate higher swap latency, could benefit
from a slower compression setting that achieves a higher compression ratio.
This would maximize memory efficiency at the cost of slightly slower data access.

This approach could provide a more nuanced and flexible way to manage swap usage
within different cgroups.
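
(As a rough sketch of the backend side of that idea, two zram devices
can already be configured with different compressors via the existing
zram sysfs knobs, assuming those compressors are built into the
kernel; the missing piece is the per-cgroup control that routes Group
A to one device and Group B to the other:)

    modprobe zram num_devices=2

    # Fast, lower-ratio compressor for the latency-sensitive group.
    echo lz4  > /sys/block/zram0/comp_algorithm
    echo 4G   > /sys/block/zram0/disksize

    # Slower, higher-ratio compressor for the latency-tolerant group.
    echo zstd > /sys/block/zram1/comp_algorithm
    echo 4G   > /sys/block/zram1/disksize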
  
Zhongkun He Dec. 12, 2023, 2:43 a.m. UTC | #19
> Hi Johannes,
>
> I haven't been following the thread closely, but I noticed the discussion
> about potential use cases for zram with memcg.
>
> One interesting idea I have is to implement a swap controller per cgroup.
> This would allow us to tailor the zram swap behavior to the specific needs of
> different groups.

Hi Minchan,

It would be great if this feature could be in place. We are using zram
and are also trying to use zswap to offload memory to disk. As Kairui
mentioned above, there are many workloads of varying priority and
sensitivity on a machine.

>
> For example, Group A, which is sensitive to swap latency, could use zram swap
> with a fast compression setting, even if it sacrifices some compression ratio.
> This would prioritize quick access to swapped data, even if it takes up more space.
>
> On the other hand, Group B, which can tolerate higher swap latency, could benefit
> from a slower compression setting that achieves a higher compression ratio.
> This would maximize memory efficiency at the cost of slightly slower data access.
>
> This approach could provide a more nuanced and flexible way to manage swap usage
> within different cgroups.

Indeed.
  
Nhat Pham Dec. 12, 2023, 9:36 p.m. UTC | #20
On Fri, Dec 8, 2023 at 7:42 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > I can give you three usage cases right now:
> > 1) Google producting kernel uses SSD only swap, it is currently on
> > pilot. This is not expressible by the memory.zswap.writeback. You can
> > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> > backed swapfile. But the whole thing feels very clunky, especially
> > what you really want is SSD only swap, you need to do all this zswap
> > config dance. Google has an internal memory.swapfile feature
> > implemented per cgroup swap file type by "zswap only", "real swap file
> > only", "both", "none" (the exact keyword might be different). running
> > in the production for almost 10 years. The need for more than zswap
> > type of per cgroup control is really there.
>
> We use regular swap on SSD without zswap just fine. Of course it's
> expressible.
>
> On dedicated systems, zswap is disabled in sysfs. On shared hosts
> where it's determined based on which workload is scheduled, zswap is
> generally enabled through sysfs, and individual cgroup access is
> controlled via memory.zswap.max - which is what this knob is for.
>
> This is analogous to enabling swap globally, and then opting
> individual cgroups in and out with memory.swap.max.
>
> So this usecase is very much already supported, and it's expressed in
> a way that's pretty natural for how cgroups express access and lack of
> access to certain resources.
>
> I don't see how memory.swap.type or memory.swap.tiers would improve
> this in any way. On the contrary, it would overlap and conflict with
> existing controls to manage swap and zswap on a per-cgroup basis.
>
> > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > and hard disk swap as overflow.
> > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > +Kairui
>
> Multiple swap devices for round robin or with different priorities
> aren't new, they have been supported for a very, very long time. So
> far nobody has proposed to control the exact behavior on a per-cgroup
> basis, and I didn't see anybody in this thread asking for it either.
>
> So I don't see how this counts as an obvious and automatic usecase for
> memory.swap.tiers.
>
> > 3) Android has some fancy swap ideas led by those patches.
> > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > It got shot down due to removal of frontswap. But the usage case and
> > product requirement is there.
> > +Minchan
>
> This looks like an optimization for zram to bypass the block layer and
> hook directly into the swap code. Correct me if I'm wrong, but this
> doesn't appear to have anything to do with per-cgroup backend control.
>
> > > zswap.writeback is a more urgent need, and does not prevent swap.tiers
> > > if we do decide to implement it.
> >
> > I respect that urgent need, that is why I Ack on the V5 path, under
> > the understanding that this zswap.writeback is not carved into stones.
> > When a better interface comes alone, that interface can be obsolete.
> > Frankly speaking I would much prefer not introducing the cgroup API
> > which will be obsolete soon.
> >
> > If you think zswap.writeback is not removable when another better
> > alternative is available, please voice it now.
> >
> > If you squash my minimal memory.swap.tiers patch, it will also address
> > your urgent need for merging the "zswap.writeback", no?
>
> We can always deprecate ABI if something better comes along.
>
> However, it's quite bold to claim that memory.swap.tiers is the best
> way to implement backend control on a per-cgroup basis, and that it'll
> definitely be needed in the future. You might see this as a foregone
> conclusion, but I very much doubt this.
>
> Even if such a file were to show up, I'm not convinced it should even
> include zswap as one of the tiers. Zswap isn't a regular swap backend,
> it doesn't show up in /proc/swaps, it can't be a second tier, the way
> it interacts with its backend file is very different than how two
> swapfiles of different priorities interact with each other, it's
> already controllable with memory.zswap.max, etc.

This is honestly the thing I was originally most iffy about :) zswap
is architecturally and semantically separate from other swap options.
It gets really confusing to lump it as part of the swap tiers.

>
> I'm open to discussing usecases and proposals for more fine-grained
> per-cgroup backend control. We've had discussions about per-cgroup
> swapfiles in the past. Cgroup parameters for swapon are another
> thought. There are several options and many considerations. The
> memory.swap.tiers idea is the newest, has probably had the least
> amount of discussion among them, and looks the least convincing to me.

Definitely. zswap.writeback is a really concrete feature with an
immediate use case, whereas swap.tiers seems a bit nebulous to me now,
the more we discuss it. I'm not against the inclusion of something
along those lines though, and I'm definitely not trying to limit the
use cases of other folks - I'd be happy to contribute my engineering
hours towards the discussion of the multi-tier swapping design (both
the internal implementation and the public interface), as well as
actual code, when that design is fully fleshed out :)

>
> Let's work out the requirements first.
>
> The "conflict" with memory.zswap.writeback is a red herring - it's no
> more of a conflict than setting memory.swap.tiers to "zswap" or "all"
> and then setting memory.zswap.max or memory.swap.max to 0.

Yup.

>
> So the notion that we have to rush in a minimal version of a MUCH
> bigger concept, just to support zswap writeback disabling is
> misguided. And then hope that this format works as the concept evolves
> and real usecases materialize... There is no reason to take that risk.
  
Chris Li Dec. 12, 2023, 11:39 p.m. UTC | #21
Hi Kairui,

Thanks for sharing the information on how you use swap.

On Mon, Dec 11, 2023 at 1:31 AM Kairui Song <ryncsn@gmail.com> wrote:
> > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > and hard disk swap as overflow.
> > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > +Kairui
>
> Yes, we are not using zswap. We are using ZRAM for swap since we have
> many different varieties of workload instances, with a very flexible
> storage setup. Some of them don't have the ability to set up a
> swapfile. So we built a pack of kernel infrastructures based on ZRAM,
> which so far worked pretty well.

This is great. The usage cases are actually much broader than I
expected. For example, I never thought of zram as a swap tier. Now
that you mention it, I am considering whether it makes sense to add
zram to memory.swap.tiers as well as zswap.

>
> The concern from some teams is that ZRAM (or zswap) can't always free
> up memory so they may lead to higher risk of OOM compared to a
> physical swap device, and they do have suitable devices for doing swap
> on some of their machines. So a secondary swap support is very helpful
> in case of memory usage peak.
>
> Besides this, another requirement is that different containers may
> have different priority, some containers can tolerate high swap
> overhead while some cannot, so swap tiering is useful for us in many
> ways.
>
> And thanks to cloud infrastructure the disk setup could change from
> time to time depending on workload requirements, so our requirement is
> to support ZRAM (always) + SSD (optional) + HDD (also optional) as
> swap backends, while not making things too complex to maintain.

Just curious, do you use ZRAM + SSD + HDD all enabled? Do you ever
consider moving data from ZRAM to SSD, or from SSD to HDD? If you do,
I do see the possibility of having more general swap tier support and
sharing the shrinking code between tiers somehow. Granted, there are
many unanswered questions and a lot of infrastructure is lacking.
Gathering requirements and weighing the priority of each requirement
is the first step towards a possible solution.

> Currently we have implemented a cgroup based ZRAM compression
> algorithm control, per-cgroup ZRAM accounting and limit, and a
> experimental kernel worker to migrate cold swap entry from high
> priority device to low priority device at very small scale (lack of
> basic mechanics to do this at large scale, however due to the low IOPS
> of slow device and cold pages are rarely accessed, this wasn't too
> much of a problem so far but kind of ugly). The rest of swapping (eg.
> secondary swap when ZRAM if full) will depend on the kernel's native
> ability.

Thanks for confirming the usage needs of per-cgroup ZRAM enablement
and flushing between swap devices. I was hoping swap.tiers could
support something like that.

> So far it works, not in the best form, need more patches to make it
> work better (eg. the swapin/readahead patch I sent previously). Some
> of our design may also need to change in the long term, and we also
> want a well built interface and kernel mechanics to manage multi tier
> swaps, I'm very willing to talk and collaborate on this.
>

Great. Let's continue this discussion in a new thread and start
gathering requirements and priorities from everyone. The output of
this discussion should be a one-pager document listing the swap tier
requirements and rating the priorities between different
requirements.

Once we have that nailed down, we can then discuss what the
incremental milestones are to get there.

I am very interested in this topic and willing to spend time on it as well.

Chris
  
Chris Li Dec. 12, 2023, 11:57 p.m. UTC | #22
Hi Minchan,

On Mon, Dec 11, 2023 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote:

> > > 3) Android has some fancy swap ideas led by those patches.
> > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > > It got shot down due to removal of frontswap. But the usage case and
> > > product requirement is there.
> > > +Minchan
> >
> > This looks like an optimization for zram to bypass the block layer and
> > hook directly into the swap code. Correct me if I'm wrong, but this
> > doesn't appear to have anything to do with per-cgroup backend control.
>
> Hi Johannes,
>
> I haven't been following the thread closely, but I noticed the discussion
> about potential use cases for zram with memcg.
>
> One interesting idea I have is to implement a swap controller per cgroup.
> This would allow us to tailor the zram swap behavior to the specific needs of
> different groups.
>
> For example, Group A, which is sensitive to swap latency, could use zram swap
> with a fast compression setting, even if it sacrifices some compression ratio.
> This would prioritize quick access to swapped data, even if it takes up more space.
>
> On the other hand, Group B, which can tolerate higher swap latency, could benefit
> from a slower compression setting that achieves a higher compression ratio.
> This would maximize memory efficiency at the cost of slightly slower data access.

That is a very solid usage case. Thanks for sharing this swap backend
usage story. It goes beyond my original memory.swap.tiers idea as
well.

We can have some zram-specific knobs to control which compression
setting is used. Moving data between different compression settings
would be an interesting topic. It might fit the swap.tiers usage model
as well. I am just thinking out loud. Maybe define different
compression settings as different tiers and then allow the cgroup to
enroll in one of the tier lists.

>
> This approach could provide a more nuanced and flexible way to manage swap usage
> within different cgroups.
>

That is again a very wonderful insight.

Thanks

Chris
  
Chris Li Dec. 13, 2023, 12:29 a.m. UTC | #23
On Tue, Dec 12, 2023 at 1:36 PM Nhat Pham <nphamcs@gmail.com> wrote:

> > Even if such a file were to show up, I'm not convinced it should even
> > include zswap as one of the tiers. Zswap isn't a regular swap backend,
> > it doesn't show up in /proc/swaps, it can't be a second tier, the way
> > it interacts with its backend file is very different than how two
> > swapfiles of different priorities interact with each other, it's
> > already controllable with memory.zswap.max, etc.
>
> This is honestly the thing I was originally most iffy about :) zswap
> is architecturally and semantically separate from other swap options.
> It gets really confusing to lump it as part of the swap tiers.

The writeback option is about interacting with other swap backends,
so technically it is not zswap alone: writeback = 0 will disable SSD
swap as well.
I am not against merging the writeback knob. I just want to make sure
1) better alternatives can be developed, and 2) zswap.writeback can
be obsoleted if a better alternative becomes available.

>
> >
> > I'm open to discussing usecases and proposals for more fine-grained
> > per-cgroup backend control. We've had discussions about per-cgroup
> > swapfiles in the past. Cgroup parameters for swapon are another
> > thought. There are several options and many considerations. The
> > memory.swap.tiers idea is the newest, has probably had the least
> > amount of discussion among them, and looks the least convincing to me.
>
> Definitely. zswap.writeback is a really concrete feature, with
> immediate use-case, whereas swap.tiers seem a bit nebulous to me now,
> the more we discuss it. I'm not against the inclusion of something
> along its line though, and I'm definitely not trying to limit the use
> case of other folks - I'd be happy to contribute my engineering hours
> towards the discussion of the multi-tier swapping design (both
> internal implementation and and public interface), as well as actual
> code, when that design is fully fleshed out :)

Great to hear that. I think the discussion so far shows that the
alternative usage cases for swap backends/tiers are real.

>
> >
> > Let's work out the requirements first.
> >
> > The "conflict" with memory.zswap.writeback is a red herring - it's no
> > more of a conflict than setting memory.swap.tiers to "zswap" or "all"
> > and then setting memory.zswap.max or memory.swap.max to 0.
>
> Yup.

Care to elaborate more? I don't understand the conflict part. I did
ask Johannes for clarification in my previous email.
One is a superset of the other; I don't consider that a conflict.
If we have both to choose from, obviously I would pick the one that
is more general and flexible.

Chris
  
Johannes Weiner Dec. 14, 2023, 5:11 p.m. UTC | #24
On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote:
> On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote:
> > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > > I can give you three usage cases right now:
> > > 1) Google producting kernel uses SSD only swap, it is currently on
> > > pilot. This is not expressible by the memory.zswap.writeback. You can
> > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> > > backed swapfile. But the whole thing feels very clunky, especially
> > > what you really want is SSD only swap, you need to do all this zswap
> > > config dance. Google has an internal memory.swapfile feature
> > > implemented per cgroup swap file type by "zswap only", "real swap file
> > > only", "both", "none" (the exact keyword might be different). running
> > > in the production for almost 10 years. The need for more than zswap
> > > type of per cgroup control is really there.
> > 
> > We use regular swap on SSD without zswap just fine. Of course it's
> > expressible.
> > 
> > On dedicated systems, zswap is disabled in sysfs. On shared hosts
> > where it's determined based on which workload is scheduled, zswap is
> > generally enabled through sysfs, and individual cgroup access is
> > controlled via memory.zswap.max - which is what this knob is for.
> > 
> > This is analogous to enabling swap globally, and then opting
> > individual cgroups in and out with memory.swap.max.
> > 
> > So this usecase is very much already supported, and it's expressed in
> > a way that's pretty natural for how cgroups express access and lack of
> > access to certain resources.
> > 
> > I don't see how memory.swap.type or memory.swap.tiers would improve
> > this in any way. On the contrary, it would overlap and conflict with
> > existing controls to manage swap and zswap on a per-cgroup basis.
> > 
> > > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > > and hard disk swap as overflow.
> > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > > +Kairui
> > 
> > Multiple swap devices for round robin or with different priorities
> > aren't new, they have been supported for a very, very long time. So
> > far nobody has proposed to control the exact behavior on a per-cgroup
> > basis, and I didn't see anybody in this thread asking for it either.
> > 
> > So I don't see how this counts as an obvious and automatic usecase for
> > memory.swap.tiers.
> > 
> > > 3) Android has some fancy swap ideas led by those patches.
> > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > > It got shot down due to removal of frontswap. But the usage case and
> > > product requirement is there.
> > > +Minchan
> > 
> > This looks like an optimization for zram to bypass the block layer and
> > hook directly into the swap code. Correct me if I'm wrong, but this
> > doesn't appear to have anything to do with per-cgroup backend control.
> 
> Hi Johannes,
> 
> I haven't been following the thread closely, but I noticed the discussion
> about potential use cases for zram with memcg.
> 
> One interesting idea I have is to implement a swap controller per cgroup.
> This would allow us to tailor the zram swap behavior to the specific needs of
> different groups.
> 
> For example, Group A, which is sensitive to swap latency, could use zram swap
> with a fast compression setting, even if it sacrifices some compression ratio.
> This would prioritize quick access to swapped data, even if it takes up more space.
> 
> On the other hand, Group B, which can tolerate higher swap latency, could benefit
> from a slower compression setting that achieves a higher compression ratio.
> This would maximize memory efficiency at the cost of slightly slower data access.
> 
> This approach could provide a more nuanced and flexible way to manage swap usage
> within different cgroups.

That makes sense to me.

It sounds to me like per-cgroup swapfiles would be the easiest
solution to this. Then you can create zram devices with different
configurations and assign them to individual cgroups.

This would also apply to Kairui's usecase: assign zrams and hdd backups
as needed on a per-cgroup basis.

In addition, it would naturally solve scalability and isolation
problems when multiple containers would otherwise be hammering on the
same swap backends and locks.

It would also only require one, relatively simple new interface, such
as a cgroup parameter to swapon().

That's highly preferable over a complex configuration file like
memory.swap.tiers that needs to solve all sorts of visibility and
namespace issues and duplicate the full configuration interface of
every backend in some new, custom syntax.
  
Yu Zhao Dec. 14, 2023, 5:23 p.m. UTC | #25
On Thu, Dec 14, 2023 at 10:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote:
> > On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > > > I can give you three usage cases right now:
> > > > 1) Google producting kernel uses SSD only swap, it is currently on
> > > > pilot. This is not expressible by the memory.zswap.writeback. You can
> > > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> > > > backed swapfile. But the whole thing feels very clunky, especially
> > > > what you really want is SSD only swap, you need to do all this zswap
> > > > config dance. Google has an internal memory.swapfile feature
> > > > implemented per cgroup swap file type by "zswap only", "real swap file
> > > > only", "both", "none" (the exact keyword might be different). running
> > > > in the production for almost 10 years. The need for more than zswap
> > > > type of per cgroup control is really there.
> > >
> > > We use regular swap on SSD without zswap just fine. Of course it's
> > > expressible.
> > >
> > > On dedicated systems, zswap is disabled in sysfs. On shared hosts
> > > where it's determined based on which workload is scheduled, zswap is
> > > generally enabled through sysfs, and individual cgroup access is
> > > controlled via memory.zswap.max - which is what this knob is for.
> > >
> > > This is analogous to enabling swap globally, and then opting
> > > individual cgroups in and out with memory.swap.max.
> > >
> > > So this usecase is very much already supported, and it's expressed in
> > > a way that's pretty natural for how cgroups express access and lack of
> > > access to certain resources.
> > >
> > > I don't see how memory.swap.type or memory.swap.tiers would improve
> > > this in any way. On the contrary, it would overlap and conflict with
> > > existing controls to manage swap and zswap on a per-cgroup basis.
> > >
> > > > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > > > and hard disk swap as overflow.
> > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > > > +Kairui
> > >
> > > Multiple swap devices for round robin or with different priorities
> > > aren't new, they have been supported for a very, very long time. So
> > > far nobody has proposed to control the exact behavior on a per-cgroup
> > > basis, and I didn't see anybody in this thread asking for it either.
> > >
> > > So I don't see how this counts as an obvious and automatic usecase for
> > > memory.swap.tiers.
> > >
> > > > 3) Android has some fancy swap ideas led by those patches.
> > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > > > It got shot down due to removal of frontswap. But the usage case and
> > > > product requirement is there.
> > > > +Minchan
> > >
> > > This looks like an optimization for zram to bypass the block layer and
> > > hook directly into the swap code. Correct me if I'm wrong, but this
> > > doesn't appear to have anything to do with per-cgroup backend control.
> >
> > Hi Johannes,
> >
> > I haven't been following the thread closely, but I noticed the discussion
> > about potential use cases for zram with memcg.
> >
> > One interesting idea I have is to implement a swap controller per cgroup.
> > This would allow us to tailor the zram swap behavior to the specific needs of
> > different groups.
> >
> > For example, Group A, which is sensitive to swap latency, could use zram swap
> > with a fast compression setting, even if it sacrifices some compression ratio.
> > This would prioritize quick access to swapped data, even if it takes up more space.
> >
> > On the other hand, Group B, which can tolerate higher swap latency, could benefit
> > from a slower compression setting that achieves a higher compression ratio.
> > This would maximize memory efficiency at the cost of slightly slower data access.
> >
> > This approach could provide a more nuanced and flexible way to manage swap usage
> > within different cgroups.
>
> That makes sense to me.
>
> It sounds to me like per-cgroup swapfiles would be the easiest
> solution to this.

Someone posted it about 10 years ago :)
https://lwn.net/Articles/592923/

+fdeutsch@redhat.com
Fabian recently asked me about its status.
  
Chris Li Dec. 14, 2023, 5:34 p.m. UTC | #26
On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:

> > Hi Johannes,
> >
> > I haven't been following the thread closely, but I noticed the discussion
> > about potential use cases for zram with memcg.
> >
> > One interesting idea I have is to implement a swap controller per cgroup.
> > This would allow us to tailor the zram swap behavior to the specific needs of
> > different groups.
> >
> > For example, Group A, which is sensitive to swap latency, could use zram swap
> > with a fast compression setting, even if it sacrifices some compression ratio.
> > This would prioritize quick access to swapped data, even if it takes up more space.
> >
> > On the other hand, Group B, which can tolerate higher swap latency, could benefit
> > from a slower compression setting that achieves a higher compression ratio.
> > This would maximize memory efficiency at the cost of slightly slower data access.
> >
> > This approach could provide a more nuanced and flexible way to manage swap usage
> > within different cgroups.
>
> That makes sense to me.
>
> It sounds to me like per-cgroup swapfiles would be the easiest
> solution to this. Then you can create zram devices with different
> configurations and assign them to individual cgroups.

Ideally you need zram and then a swap file following the zram. That
would be a list of swap files rather than just one swapfile per
cgroup.

>
> This would also apply to Kairu's usecase: assign zrams and hdd backups
> as needed on a per-cgroup basis.

Same there: Kairui's request involves ZRAM and at least one extra swap
file. In other words, you really need a per-cgroup swap file list.

>
> In addition, it would naturally solve scalability and isolation
> problems when multiple containers would otherwise be hammering on the
> same swap backends and locks.
>
> It would also only require one, relatively simple new interface, such
> as a cgroup parameter to swapon().
>
> That's highly preferable over a complex configuration file like
> memory.swap.tiers that needs to solve all sorts of visibility and
> namespace issues and duplicate the full configuration interface of
> every backend in some new, custom syntax.

If you don't like the syntax of memory.swap.tiers, I am open to
suggestions for your preferred syntax as well. The essence of
swap.tiers is a per-cgroup list of swap backends; the name implies
that. I am not married to any given syntax for specifying the list.
Its goal matches the above requirement pretty well.

Chris
  
Fabian Deutsch Dec. 14, 2023, 6:03 p.m. UTC | #27
On Thu, Dec 14, 2023 at 6:24 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Dec 14, 2023 at 10:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote:
> > > On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > > > > I can give you three usage cases right now:
> > > > > 1) Google producting kernel uses SSD only swap, it is currently on
> > > > > pilot. This is not expressible by the memory.zswap.writeback. You can
> > > > > set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> > > > > backed swapfile. But the whole thing feels very clunky, especially
> > > > > what you really want is SSD only swap, you need to do all this zswap
> > > > > config dance. Google has an internal memory.swapfile feature
> > > > > implemented per cgroup swap file type by "zswap only", "real swap file
> > > > > only", "both", "none" (the exact keyword might be different). running
> > > > > in the production for almost 10 years. The need for more than zswap
> > > > > type of per cgroup control is really there.
> > > >
> > > > We use regular swap on SSD without zswap just fine. Of course it's
> > > > expressible.
> > > >
> > > > On dedicated systems, zswap is disabled in sysfs. On shared hosts
> > > > where it's determined based on which workload is scheduled, zswap is
> > > > generally enabled through sysfs, and individual cgroup access is
> > > > controlled via memory.zswap.max - which is what this knob is for.
> > > >
> > > > This is analogous to enabling swap globally, and then opting
> > > > individual cgroups in and out with memory.swap.max.
> > > >
> > > > So this usecase is very much already supported, and it's expressed in
> > > > a way that's pretty natural for how cgroups express access and lack of
> > > > access to certain resources.
> > > >
> > > > I don't see how memory.swap.type or memory.swap.tiers would improve
> > > > this in any way. On the contrary, it would overlap and conflict with
> > > > existing controls to manage swap and zswap on a per-cgroup basis.
> > > >
> > > > > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > > > > and hard disk swap as overflow.
> > > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > > > > +Kairui
> > > >
> > > > Multiple swap devices for round robin or with different priorities
> > > > aren't new, they have been supported for a very, very long time. So
> > > > far nobody has proposed to control the exact behavior on a per-cgroup
> > > > basis, and I didn't see anybody in this thread asking for it either.
> > > >
> > > > So I don't see how this counts as an obvious and automatic usecase for
> > > > memory.swap.tiers.
> > > >
> > > > > 3) Android has some fancy swap ideas led by those patches.
> > > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > > > > It got shot down due to removal of frontswap. But the usage case and
> > > > > product requirement is there.
> > > > > +Minchan
> > > >
> > > > This looks like an optimization for zram to bypass the block layer and
> > > > hook directly into the swap code. Correct me if I'm wrong, but this
> > > > doesn't appear to have anything to do with per-cgroup backend control.
> > >
> > > Hi Johannes,
> > >
> > > I haven't been following the thread closely, but I noticed the discussion
> > > about potential use cases for zram with memcg.
> > >
> > > One interesting idea I have is to implement a swap controller per cgroup.
> > > This would allow us to tailor the zram swap behavior to the specific needs of
> > > different groups.
> > >
> > > For example, Group A, which is sensitive to swap latency, could use zram swap
> > > with a fast compression setting, even if it sacrifices some compression ratio.
> > > This would prioritize quick access to swapped data, even if it takes up more space.
> > >
> > > On the other hand, Group B, which can tolerate higher swap latency, could benefit
> > > from a slower compression setting that achieves a higher compression ratio.
> > > This would maximize memory efficiency at the cost of slightly slower data access.
> > >
> > > This approach could provide a more nuanced and flexible way to manage swap usage
> > > within different cgroups.
> >
> > That makes sense to me.
> >
> > It sounds to me like per-cgroup swapfiles would be the easiest
> > solution to this.
>
> Someone posted it about 10 years ago :)
> https://lwn.net/Articles/592923/
>
> +fdeutsch@redhat.com
> Fabian recently asked me about its status.


Thanks Yu!

Yes, I was interested due to container use-cases.

Now a few thoughts in this direction:
- With swap per cgroup you lose the big "statistical" benefit of
having swap on a node level. Well, it depends on the size of the
cgroup (i.e. system.slice is quite large).
- With today's node-level swap, setting memory.swap.max=0 for all
cgroups allows you to achieve a similar behavior (only opt-in cgroups
will get swap); see the sketch below.
- The above approach however will still have a shared swap backend
for all cgroups.
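
(A minimal sketch of that opt-in pattern with today's knobs, using
example cgroup names; note memory.swap.max applies hierarchically, so
the cgroups have to be siblings for the opt-in to take effect:)

    # Disable swap for cgroups that must not swap...
    echo 0   > /sys/fs/cgroup/latency-critical/memory.swap.max

    # ...and leave (or set) the default for the opt-in cgroups.
    echo max > /sys/fs/cgroup/batch-job/memory.swap.max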
  
Johannes Weiner Dec. 14, 2023, 10:11 p.m. UTC | #28
On Thu, Dec 14, 2023 at 09:34:06AM -0800, Christopher Li wrote:
> On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > > Hi Johannes,
> > >
> > > I haven't been following the thread closely, but I noticed the discussion
> > > about potential use cases for zram with memcg.
> > >
> > > One interesting idea I have is to implement a swap controller per cgroup.
> > > This would allow us to tailor the zram swap behavior to the specific needs of
> > > different groups.
> > >
> > > For example, Group A, which is sensitive to swap latency, could use zram swap
> > > with a fast compression setting, even if it sacrifices some compression ratio.
> > > This would prioritize quick access to swapped data, even if it takes up more space.
> > >
> > > On the other hand, Group B, which can tolerate higher swap latency, could benefit
> > > from a slower compression setting that achieves a higher compression ratio.
> > > This would maximize memory efficiency at the cost of slightly slower data access.
> > >
> > > This approach could provide a more nuanced and flexible way to manage swap usage
> > > within different cgroups.
> >
> > That makes sense to me.
> >
> > It sounds to me like per-cgroup swapfiles would be the easiest
> > solution to this. Then you can create zram devices with different
> > configurations and assign them to individual cgroups.
> 
> Ideally you need zram then following swap file after the zram. That
> would be a list of the swap files rather than just one swapfile per
> cgroup.
>
> > This would also apply to Kairu's usecase: assign zrams and hdd backups
> > as needed on a per-cgroup basis.
> 
> Same there, Kairui's request involves ZRAM and at least one extra swap
> file. In other words, you really need a per cgroup swap file list.

Why is that a problem?

swapon(zram, cgroup=foo)
swapon(hdd, cgroup=foo)

> > In addition, it would naturally solve scalability and isolation
> > problems when multiple containers would otherwise be hammering on the
> > same swap backends and locks.
> >
> > It would also only require one, relatively simple new interface, such
> > as a cgroup parameter to swapon().
> >
> > That's highly preferable over a complex configuration file like
> > memory.swap.tiers that needs to solve all sorts of visibility and
> > namespace issues and duplicate the full configuration interface of
> > every backend in some new, custom syntax.
> 
> If you don't like the syntax of memory.swap.tiers, I am open to
> suggestions of your preferred syntax as well. The essicents of the
> swap.tiers is a per cgroup list of the swap back ends. The names imply
> that. I am not married to any given syntax of how to specify the list.
> Its goal matches the above requirement pretty well.

Except Minchan said that he would also like different zram parameters
depending on the cgroup.

There is no way we'll add a memory.swap.tiers with a new configuration
language for backend parameters.
  
Chris Li Dec. 14, 2023, 10:54 p.m. UTC | #29
On Thu, Dec 14, 2023 at 2:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 14, 2023 at 09:34:06AM -0800, Christopher Li wrote:
> > On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > > > Hi Johannes,
> > > >
> > > > I haven't been following the thread closely, but I noticed the discussion
> > > > about potential use cases for zram with memcg.
> > > >
> > > > One interesting idea I have is to implement a swap controller per cgroup.
> > > > This would allow us to tailor the zram swap behavior to the specific needs of
> > > > different groups.
> > > >
> > > > For example, Group A, which is sensitive to swap latency, could use zram swap
> > > > with a fast compression setting, even if it sacrifices some compression ratio.
> > > > This would prioritize quick access to swapped data, even if it takes up more space.
> > > >
> > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit
> > > > from a slower compression setting that achieves a higher compression ratio.
> > > > This would maximize memory efficiency at the cost of slightly slower data access.
> > > >
> > > > This approach could provide a more nuanced and flexible way to manage swap usage
> > > > within different cgroups.
> > >
> > > That makes sense to me.
> > >
> > > It sounds to me like per-cgroup swapfiles would be the easiest
> > > solution to this. Then you can create zram devices with different
> > > configurations and assign them to individual cgroups.
> >
> > Ideally you need zram then following swap file after the zram. That
> > would be a list of the swap files rather than just one swapfile per
> > cgroup.
> >
> > > This would also apply to Kairu's usecase: assign zrams and hdd backups
> > > as needed on a per-cgroup basis.
> >
> > Same there, Kairui's request involves ZRAM and at least one extra swap
> > file. In other words, you really need a per cgroup swap file list.
>
> Why is that a problem?

It is not a problem. It is the necessary infrastructure to support the
requirement. I am merely saying just having one swap file is not
enough.

>
> swapon(zram, cgroup=foo)
> swapon(hdd, cgroup=foo)

Interesting idea. I assume you want to use swapon/swapoff to turn a
device on or off for a specific cgroup.
That seems to imply each cgroup will have a private copy of the swap
device list.

I have considered memory.swap.tiers for the same thing, with one
minor optimization: the list is maintained system-wide under a name,
and each cgroup just holds a pointer to that named list. There shouldn't
be too many such combinations of swap backends on a system.

We are getting into the weeds. The bottom line is that we need a
per-cgroup swap file list. That is the necessary evil we can't get away
from.
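
Just to illustrate the shape of it, here is a rough sketch in C - none
of these types or fields exist anywhere, the names are made up:

struct swap_tiers {
	char name[32];			/* e.g. "zram-then-ssd" */
	struct list_head backends;	/* ordered list of swap devices */
	struct kref ref;		/* shared by every cgroup enrolled in it */
};

/* added to struct mem_cgroup: */
struct swap_tiers __rcu *swap_tiers;	/* NULL means "use the global order" */

Writing a name to the cgroup file would then just switch the pointer,
and tearing a list down is a kref_put().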

>
> > > In addition, it would naturally solve scalability and isolation
> > > problems when multiple containers would otherwise be hammering on the
> > > same swap backends and locks.
> > >
> > > It would also only require one, relatively simple new interface, such
> > > as a cgroup parameter to swapon().
> > >
> > > That's highly preferable over a complex configuration file like
> > > memory.swap.tiers that needs to solve all sorts of visibility and
> > > namespace issues and duplicate the full configuration interface of
> > > every backend in some new, custom syntax.
> >
> > If you don't like the syntax of memory.swap.tiers, I am open to
> > suggestions of your preferred syntax as well. The essicents of the
> > swap.tiers is a per cgroup list of the swap back ends. The names imply
> > that. I am not married to any given syntax of how to specify the list.
> > Its goal matches the above requirement pretty well.
>
> Except Minchan said that he would also like different zram parameters
> depending on the cgroup.

Minchan's requirement is new. We will need to expand the original
"memory.swap.tiers" to support such usage.

> There is no way we'll add a memory.swap.tiers with a new configuration
> language for backend parameters.
>

I agree that we don't want a complicated configuration language for
"memory.swap.tiers".

Those backend parameters should be configured on the backend side.
"memory.swap.tiers" just references the already configured objects.
Just brainstorming:
/dev/zram0 uses compression algo1 for fast speed and a lower compression ratio.
/dev/zram1 uses compression algo2 for slower speed and a higher compression ratio.

"memory.swap.tiers" then points to zram0, or zram1, or a custom list such as "zram0 + hdd".

Chris
  
Chris Li Dec. 14, 2023, 11:22 p.m. UTC | #30
Hi Fabian,

On Thu, Dec 14, 2023 at 10:00 AM Fabian Deutsch <fdeutsch@redhat.com> wrote:

> Yep - for container use-cases.
>
> Now a few thoughts in this direction:
> - With swap per cgroup you loose the big "statistical" benefit of having swap on a node level. well, it depends on the size of the cgroup (i.e. system.slice is quite large).

Just to clarify, the "node" you mean the "node" in kubernetes sense,
which is the whole machine. In the Linux kernel MM context, the node
often refers to the NUMA memory node, that is not what you mean here,
right?

> - With todays node level swap, and setting memory.swap.max=0 for all cgroups allows you toachieve a similar behavior (only opt-in cgroups will get swap).
> - the above approach however will still have a shared swap backend for all cgroups.

Yes, the "memory.swap.tires" idea is trying to allow cgroups to select
a subset of the swap backend in a specific order. It is still in the
early stage of discussion. If you have any suggestion or feedback in
that direction, I am looking forward to hearing that.

Chris
  
Nhat Pham Dec. 15, 2023, 2:19 a.m. UTC | #31
On Thu, Dec 14, 2023 at 2:55 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Dec 14, 2023 at 2:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Dec 14, 2023 at 09:34:06AM -0800, Christopher Li wrote:
> > > On Thu, Dec 14, 2023 at 9:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > > > Hi Johannes,
> > > > >
> > > > > I haven't been following the thread closely, but I noticed the discussion
> > > > > about potential use cases for zram with memcg.
> > > > >
> > > > > One interesting idea I have is to implement a swap controller per cgroup.
> > > > > This would allow us to tailor the zram swap behavior to the specific needs of
> > > > > different groups.
> > > > >
> > > > > For example, Group A, which is sensitive to swap latency, could use zram swap
> > > > > with a fast compression setting, even if it sacrifices some compression ratio.
> > > > > This would prioritize quick access to swapped data, even if it takes up more space.
> > > > >
> > > > > On the other hand, Group B, which can tolerate higher swap latency, could benefit
> > > > > from a slower compression setting that achieves a higher compression ratio.
> > > > > This would maximize memory efficiency at the cost of slightly slower data access.
> > > > >
> > > > > This approach could provide a more nuanced and flexible way to manage swap usage
> > > > > within different cgroups.
> > > >
> > > > That makes sense to me.
> > > >
> > > > It sounds to me like per-cgroup swapfiles would be the easiest
> > > > solution to this. Then you can create zram devices with different
> > > > configurations and assign them to individual cgroups.
> > >
> > > Ideally you need zram then following swap file after the zram. That
> > > would be a list of the swap files rather than just one swapfile per
> > > cgroup.
> > >
> > > > This would also apply to Kairu's usecase: assign zrams and hdd backups
> > > > as needed on a per-cgroup basis.
> > >
> > > Same there, Kairui's request involves ZRAM and at least one extra swap
> > > file. In other words, you really need a per cgroup swap file list.
> >
> > Why is that a problem?
>
> It is not a problem. It is the necessary infrastructure to support the
> requirement. I am merely saying just having one swap file is not
> enough.
>
> >
> > swapon(zram, cgroup=foo)
> > swapon(hdd, cgroup=foo)
>
> Interesting idea. I assume you want to use swapon/swapoff to turn on
> off a device for a specific cgroup.
> That seems to implite each cgroup will have a private copy of the swap
> device list.
>
> I have considered the memory.swap.tiers for the same thing, with one
> minor optimization. The list is system wide maintained with a name.
> The per cgroup just has a pointer to that named list. There shouldn't
> be too many such lists of swap back end combinations on the system.
>
> We are getting into the weeds. The bottom line is, we need to have per
> cgroup a swap file list. That is the necessary evil we can't get away
> with.

Highly agree. This is getting waaayyyy too deep into the weeds, and
the conversation has practically spiralled away from the original
intention of this patch - the problem it states and the solution it
proposes.

Not to say that none of this is useful, but I sense that we first need
to do the following:

a) List out the requirements that the new interface has to support:
which tiers are made available to the cgroup, hierarchical structure
(i.e. do we want a tier list to have more than one non-zswap level?
Maybe we won't need that after all, in which case the swapon solution
is perhaps sufficient).
b) Carefully evaluate the proposed candidates. It could be an altered
memory.swap.tiers, or an extended swapon/swapoff.

Perhaps we should organize a separate meeting or email thread to
discuss this in detail, and write out proposed solutions for everyone
to evaluate. In the meantime, I think that we should merge this new
knob as-is.

>
> >
> > > > In addition, it would naturally solve scalability and isolation
> > > > problems when multiple containers would otherwise be hammering on the
> > > > same swap backends and locks.
> > > >
> > > > It would also only require one, relatively simple new interface, such
> > > > as a cgroup parameter to swapon().
> > > >
> > > > That's highly preferable over a complex configuration file like
> > > > memory.swap.tiers that needs to solve all sorts of visibility and
> > > > namespace issues and duplicate the full configuration interface of
> > > > every backend in some new, custom syntax.
> > >
> > > If you don't like the syntax of memory.swap.tiers, I am open to
> > > suggestions of your preferred syntax as well. The essicents of the
> > > swap.tiers is a per cgroup list of the swap back ends. The names imply
> > > that. I am not married to any given syntax of how to specify the list.
> > > Its goal matches the above requirement pretty well.
> >
> > Except Minchan said that he would also like different zram parameters
> > depending on the cgroup.
>
> Minchan's requirement is new. We will need to expand the original
> "memory.swap.tiers" to support such usage.
>
> > There is no way we'll add a memory.swap.tiers with a new configuration
> > language for backend parameters.
> >
>
> I agree that we don't want a complicated configuration language for
> "memory.swap.tiers".
>
> Those backend parameters should be configured on the back end side.
> The "memory.swap.tiers" just reference the already configured object.
> Just brainstorming:
> /dev/zram0 has compression algo1 for fast speed low compression ratio.
> /dev/zram1 has compression algo2 for slow speed high compression ratio.
>
> "memory.swap.tiers" point to zram0 or zram1 or a custom list has "zram0 + hdd"
>
> Chris
  
Chris Li Dec. 15, 2023, 9:40 a.m. UTC | #32
On Thu, Dec 14, 2023 at 11:42 PM Fabian Deutsch <fdeutsch@redhat.com> wrote:.
> >
> > Just to clarify, the "node" you mean the "node" in kubernetes sense,
> > which is the whole machine. In the Linux kernel MM context, the node
> > often refers to the NUMA memory node, that is not what you mean here,
> > right?
>
> Correct, I was referring to a kubernetes node, not numa node.
>
> >
> >> - With todays node level swap, and setting memory.swap.max=0 for all cgroups allows you toachieve a similar behavior (only opt-in cgroups will get swap).
> >> - the above approach however will still have a shared swap backend for all cgroups.
> >
> > Yes, the "memory.swap.tires" idea is trying to allow cgroups to select
> > a subset of the swap backend in a specific order. It is still in the
> > early stage of discussion. If you have any suggestion or feedback in
> > that direction, I am looking forward to hearing that.
>
> Interesting. There have been concerns to leak confidential data accidentally when it's getting written to a swap device.

One common solution is to encrypt the data written to the device. If
someone gets hold of the swapped-out device without the key, they
can't get to the memory data.

> The other less discussed item was QoS for swap io traffic.
>
> At a first glance it seems like tires could help with the second use-case.

The idea is that you can select the swap tiers list for each cgroup.
That way you can assign a different swap QoS to different cgroups.

Chris
  
Fabian Deutsch Dec. 15, 2023, 9:50 a.m. UTC | #33
On Fri, Dec 15, 2023 at 10:40 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Dec 14, 2023 at 11:42 PM Fabian Deutsch <fdeutsch@redhat.com> wrote:.
> > >
> > > Just to clarify, the "node" you mean the "node" in kubernetes sense,
> > > which is the whole machine. In the Linux kernel MM context, the node
> > > often refers to the NUMA memory node, that is not what you mean here,
> > > right?
> >
> > Correct, I was referring to a kubernetes node, not numa node.
> >
> > >
> > >> - With todays node level swap, and setting memory.swap.max=0 for all cgroups allows you toachieve a similar behavior (only opt-in cgroups will get swap).
> > >> - the above approach however will still have a shared swap backend for all cgroups.
> > >
> > > Yes, the "memory.swap.tires" idea is trying to allow cgroups to select
> > > a subset of the swap backend in a specific order. It is still in the
> > > early stage of discussion. If you have any suggestion or feedback in
> > > that direction, I am looking forward to hearing that.
> >
> > Interesting. There have been concerns to leak confidential data accidentally when it's getting written to a swap device.
>
> One common solution is to encrypt the data written to the device. If
> someone gets hold of the swapped outed device without the key, they
> can't get to the memory data without the key.


Yes - I guess like writing it onto a dmcrypt device with some random key.
Nevertheless, this was one of the topics.

>
>
> > The other less discussed item was QoS for swap io traffic.
> >
> > At a first glance it seems like tires could help with the second use-case.
>
> The idea is that you can select the swap tiers list for each cgroup.
> That way  you can assign different swap QoS to different cgroup.


Yes, it sounds like a fit.
What use-cases did you have in mind for the tiers feature?
  
Yosry Ahmed Dec. 15, 2023, 9:21 p.m. UTC | #34
On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> During our experiment with zswap, we sometimes observe swap IOs due to
> occasional zswap store failures and writebacks-to-swap. These swapping
> IOs prevent many users who cannot tolerate swapping from adopting zswap
> to save memory and improve performance where possible.
>
> This patch adds the option to disable this behavior entirely: do not
> writeback to backing swapping device when a zswap store attempt fail,
> and do not write pages in the zswap pool back to the backing swap
> device (both when the pool is full, and when the new zswap shrinker is
> called).
>
> This new behavior can be opted-in/out on a per-cgroup basis via a new
> cgroup file. By default, writebacks to swap device is enabled, which is
> the previous behavior. Initially, writeback is enabled for the root
> cgroup, and a newly created cgroup will inherit the current setting of
> its parent.
>
> Note that this is subtly different from setting memory.swap.max to 0, as
> it still allows for pages to be stored in the zswap pool (which itself
> consumes swap space in its current form).
>
> This patch should be applied on top of the zswap shrinker series:
>
> https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
>
> as it also disables the zswap shrinker, a major source of zswap
> writebacks.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

Taking a step back from all the memory.swap.tiers vs.
memory.zswap.writeback discussions, I think there may be a more
fundamental problem here. If the zswap store failure is recurrent,
pages can keep going back to the LRUs and then be sent back to zswap
eventually, only to be rejected again. For example, this can happen if
zswap is above the acceptance threshold, but it could be even worse if
it's the allocator rejecting the page due to not compressing well
enough. In the latter case, the page can keep going back and forth
between zswap and the LRUs indefinitely.

You probably did not run into this as you're using zsmalloc, but it
can happen with zbud AFAICT. Even with zsmalloc, a less problematic
version can happen if zswap is above its acceptance threshold.

This can cause thrashing and ineffective reclaim. We have an internal
implementation where we mark incompressible pages and put them on the
unevictable LRU when we don't have a backing swapfile (i.e. ghost
swapfiles), and something similar may work if writeback is disabled.
We need to scan such incompressible pages periodically though to
remove them from the unevictable LRU if they have been dirtied.
  
Johannes Weiner Dec. 18, 2023, 2:44 p.m. UTC | #35
On Fri, Dec 15, 2023 at 01:21:57PM -0800, Yosry Ahmed wrote:
> On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > During our experiment with zswap, we sometimes observe swap IOs due to
> > occasional zswap store failures and writebacks-to-swap. These swapping
> > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > to save memory and improve performance where possible.
> >
> > This patch adds the option to disable this behavior entirely: do not
> > writeback to backing swapping device when a zswap store attempt fail,
> > and do not write pages in the zswap pool back to the backing swap
> > device (both when the pool is full, and when the new zswap shrinker is
> > called).
> >
> > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > cgroup file. By default, writebacks to swap device is enabled, which is
> > the previous behavior. Initially, writeback is enabled for the root
> > cgroup, and a newly created cgroup will inherit the current setting of
> > its parent.
> >
> > Note that this is subtly different from setting memory.swap.max to 0, as
> > it still allows for pages to be stored in the zswap pool (which itself
> > consumes swap space in its current form).
> >
> > This patch should be applied on top of the zswap shrinker series:
> >
> > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> >
> > as it also disables the zswap shrinker, a major source of zswap
> > writebacks.
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> 
> Taking a step back from all the memory.swap.tiers vs.
> memory.zswap.writeback discussions, I think there may be a more
> fundamental problem here. If the zswap store failure is recurrent,
> pages can keep going back to the LRUs and then sent back to zswap
> eventually, only to be rejected again. For example, this can if zswap
> is above the acceptance threshold, but could be even worse if it's the
> allocator rejecting the page due to not compressing well enough. In
> the latter case, the page can keep going back and forth between zswap
> and LRUs indefinitely.
> 
> You probably did not run into this as you're using zsmalloc, but it
> can happen with zbud AFAICT. Even with zsmalloc, a less problematic
> version can happen if zswap is above its acceptance threshold.
> 
> This can cause thrashing and ineffective reclaim. We have an internal
> implementation where we mark incompressible pages and put them on the
> unevictable LRU when we don't have a backing swapfile (i.e. ghost
> swapfiles), and something similar may work if writeback is disabled.
> We need to scan such incompressible pages periodically though to
> remove them from the unevictable LRU if they have been dirited.

I'm not sure this is an actual problem.

When pages get rejected, they rotate to the furthest point from the
reclaimer - the head of the active list. We only get to them again
after we scanned everything else.
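
(For reference, the mechanism is roughly the following - paraphrased
from my reading of the mm/page_io.c hunk, not a verbatim quote:

	if (zswap_store(folio)) {
		/* stored compressed, no IO needed */
		folio_start_writeback(folio);
		folio_unlock(folio);
		folio_end_writeback(folio);
		return 0;
	}
	/* rejected by zswap, and writeback is disabled for this memcg */
	if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
		folio_mark_dirty(folio);
		/* vmscan will re-activate the folio, i.e. rotate it */
		return AOP_WRITEPAGE_ACTIVATE;
	}
	__swap_writepage(&folio->page, wbc);

So a rejected folio is redirtied and handed back to the reclaimer as
activated, which is what produces the rotation described above.)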

If all that's left on the LRU is unzswappable, then you'd assume that
remainder isn't very large, and thus not a significant part of overall
scan work. Because if it is, then there is a serious problem with the
zswap configuration.

There might be possible optimizations to determine how permanent a
rejection is, but I'm not sure the effort is called for just
yet. Rejections are already failure cases that screw up the LRU
ordering, and healthy setups shouldn't have a lot of those. I don't
think this patch adds any sort of new complications to this picture.
  
Nhat Pham Dec. 18, 2023, 7:21 p.m. UTC | #36
On Mon, Dec 18, 2023 at 6:44 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Dec 15, 2023 at 01:21:57PM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > During our experiment with zswap, we sometimes observe swap IOs due to
> > > occasional zswap store failures and writebacks-to-swap. These swapping
> > > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > > to save memory and improve performance where possible.
> > >
> > > This patch adds the option to disable this behavior entirely: do not
> > > writeback to backing swapping device when a zswap store attempt fail,
> > > and do not write pages in the zswap pool back to the backing swap
> > > device (both when the pool is full, and when the new zswap shrinker is
> > > called).
> > >
> > > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > > cgroup file. By default, writebacks to swap device is enabled, which is
> > > the previous behavior. Initially, writeback is enabled for the root
> > > cgroup, and a newly created cgroup will inherit the current setting of
> > > its parent.
> > >
> > > Note that this is subtly different from setting memory.swap.max to 0, as
> > > it still allows for pages to be stored in the zswap pool (which itself
> > > consumes swap space in its current form).
> > >
> > > This patch should be applied on top of the zswap shrinker series:
> > >
> > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> > >
> > > as it also disables the zswap shrinker, a major source of zswap
> > > writebacks.
> > >
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > > Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > Taking a step back from all the memory.swap.tiers vs.
> > memory.zswap.writeback discussions, I think there may be a more
> > fundamental problem here. If the zswap store failure is recurrent,
> > pages can keep going back to the LRUs and then sent back to zswap
> > eventually, only to be rejected again. For example, this can if zswap
> > is above the acceptance threshold, but could be even worse if it's the
> > allocator rejecting the page due to not compressing well enough. In
> > the latter case, the page can keep going back and forth between zswap
> > and LRUs indefinitely.
> >
> > You probably did not run into this as you're using zsmalloc, but it

Which is why I recommend everyone to use zsmalloc, and change the
default allocator to it in Kconfig :)

But tongue-in-cheek aside, I think this is fine. As you noted below,
we probably want to try again on that page (for instance, in case its
content has changed and is now more compressible). And as Johannes has
explained, we'll only look at it again once we have scanned everything
else. This sounds acceptable to me.

Now if all of the intermediate pages are unstoreable as well, then we
have a problem, but that seems unlikely, and it is perhaps an
indication that we need to do something else entirely (if the workload
is *that* incompressible, it may be better to just disable zswap
here).

> > can happen with zbud AFAICT. Even with zsmalloc, a less problematic
> > version can happen if zswap is above its acceptance threshold.
> >
> > This can cause thrashing and ineffective reclaim. We have an internal
> > implementation where we mark incompressible pages and put them on the
> > unevictable LRU when we don't have a backing swapfile (i.e. ghost
> > swapfiles), and something similar may work if writeback is disabled.
> > We need to scan such incompressible pages periodically though to
> > remove them from the unevictable LRU if they have been dirited.
>
> I'm not sure this is an actual problem.
>
> When pages get rejected, they rotate to the furthest point from the
> reclaimer - the head of the active list. We only get to them again
> after we scanned everything else.

Agree.  That is the reason why we rotate the LRU here - to avoid
touching it again until we have tried other pages.

>
> If all that's left on the LRU is unzswappable, then you'd assume that
> remainder isn't very large, and thus not a significant part of overall
> scan work. Because if it is, then there is a serious problem with the
> zswap configuration.

Agree.

>
> There might be possible optimizations to determine how permanent a
> rejection is, but I'm not sure the effort is called for just
> yet. Rejections are already failure cases that screw up the LRU
> ordering, and healthy setups shouldn't have a lot of those. I don't
> think this patch adds any sort of new complications to this picture.

Yep. This is one of the reasons (among many) why we were toying around
with storing uncompressed pages in zswap - it's one of the failure
cases where trying again (if the page's content has not changed) isn't
likely to yield a different result, so we might as well retain the
overall LRU ordering and squeeze the page into zswap (but as discussed,
that has quite a few implications that we need to deal with).
  
Yosry Ahmed Dec. 18, 2023, 9:52 p.m. UTC | #37
> > Taking a step back from all the memory.swap.tiers vs.
> > memory.zswap.writeback discussions, I think there may be a more
> > fundamental problem here. If the zswap store failure is recurrent,
> > pages can keep going back to the LRUs and then sent back to zswap
> > eventually, only to be rejected again. For example, this can if zswap
> > is above the acceptance threshold, but could be even worse if it's the
> > allocator rejecting the page due to not compressing well enough. In
> > the latter case, the page can keep going back and forth between zswap
> > and LRUs indefinitely.
> >
> > You probably did not run into this as you're using zsmalloc, but it
> > can happen with zbud AFAICT. Even with zsmalloc, a less problematic
> > version can happen if zswap is above its acceptance threshold.
> >
> > This can cause thrashing and ineffective reclaim. We have an internal
> > implementation where we mark incompressible pages and put them on the
> > unevictable LRU when we don't have a backing swapfile (i.e. ghost
> > swapfiles), and something similar may work if writeback is disabled.
> > We need to scan such incompressible pages periodically though to
> > remove them from the unevictable LRU if they have been dirited.
>
> I'm not sure this is an actual problem.
>
> When pages get rejected, they rotate to the furthest point from the
> reclaimer - the head of the active list. We only get to them again
> after we scanned everything else.
>
> If all that's left on the LRU is unzswappable, then you'd assume that
> remainder isn't very large, and thus not a significant part of overall
> scan work. Because if it is, then there is a serious problem with the
> zswap configuration.
>
> There might be possible optimizations to determine how permanent a
> rejection is, but I'm not sure the effort is called for just
> yet. Rejections are already failure cases that screw up the LRU
> ordering, and healthy setups shouldn't have a lot of those. I don't
> think this patch adds any sort of new complications to this picture.

We have workloads where a significant amount (maybe 20%? 30% not sure
tbh) of the memory is incompressible. Zswap is still a very viable
option for those workloads once those pages are taken out of the
picture. If those pages remain on the LRUs, they will introduce a
regression in reclaim efficiency.

With the upstream code today, those pages go directly to the backing
store, which isn't ideal in terms of LRU ordering, but this patch
makes them stay on the LRUs, which can be harmful. I don't think we
can just assume it is okay. Whether we make those pages unevictable or
store them uncompressed in zswap, I think taking them out of the LRUs
(until they are redirtied), is the right thing to do.

Adding Wei and Yu for more data about incompressible memory in our
fleet. Keep in mind that we have internal patches to cap the
compression ratio (i.e. reject pages where the compressed size +
metadata is not worth it, or where zsmalloc will store it in a full
page anyway). But the same thing can happen upstream with zbud.
  
Yosry Ahmed Dec. 18, 2023, 9:54 p.m. UTC | #38
On Mon, Dec 18, 2023 at 11:21 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Dec 18, 2023 at 6:44 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Dec 15, 2023 at 01:21:57PM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > > During our experiment with zswap, we sometimes observe swap IOs due to
> > > > occasional zswap store failures and writebacks-to-swap. These swapping
> > > > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > > > to save memory and improve performance where possible.
> > > >
> > > > This patch adds the option to disable this behavior entirely: do not
> > > > writeback to backing swapping device when a zswap store attempt fail,
> > > > and do not write pages in the zswap pool back to the backing swap
> > > > device (both when the pool is full, and when the new zswap shrinker is
> > > > called).
> > > >
> > > > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > > > cgroup file. By default, writebacks to swap device is enabled, which is
> > > > the previous behavior. Initially, writeback is enabled for the root
> > > > cgroup, and a newly created cgroup will inherit the current setting of
> > > > its parent.
> > > >
> > > > Note that this is subtly different from setting memory.swap.max to 0, as
> > > > it still allows for pages to be stored in the zswap pool (which itself
> > > > consumes swap space in its current form).
> > > >
> > > > This patch should be applied on top of the zswap shrinker series:
> > > >
> > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> > > >
> > > > as it also disables the zswap shrinker, a major source of zswap
> > > > writebacks.
> > > >
> > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > > > Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> > >
> > > Taking a step back from all the memory.swap.tiers vs.
> > > memory.zswap.writeback discussions, I think there may be a more
> > > fundamental problem here. If the zswap store failure is recurrent,
> > > pages can keep going back to the LRUs and then sent back to zswap
> > > eventually, only to be rejected again. For example, this can if zswap
> > > is above the acceptance threshold, but could be even worse if it's the
> > > allocator rejecting the page due to not compressing well enough. In
> > > the latter case, the page can keep going back and forth between zswap
> > > and LRUs indefinitely.
> > >
> > > You probably did not run into this as you're using zsmalloc, but it
>
> Which is why I recommend everyone to use zsmalloc, and change the
> default allocator to it in Kconfig :)
>

Internally, we have a cap on the compression ratio, after which we
reject pages because it doesn't make sense to store them (e.g.
zsmalloc will store them in a full page anyway, or the compressed size
+ metadata isn't worth it). I think this is where we should head
upstream as well; you proposed something in the right direction with
storing uncompressed pages in zswap. IOW, I think such pages should be
taken out of the LRUs one way or another.
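
As a rough illustration of the kind of check I mean - this is not our
internal code and not anything that exists upstream, and the 10% margin
is an arbitrary example:

/*
 * Sketch only: decide whether a compressed copy is worth keeping.
 * "entry_overhead" stands in for per-entry metadata (zswap_entry,
 * zsmalloc handle, etc.).
 */
static bool worth_storing_compressed(size_t compressed_len,
				     size_t entry_overhead)
{
	/* zsmalloc would place this in a full size class anyway. */
	if (compressed_len + entry_overhead >= PAGE_SIZE)
		return false;

	/* Require at least ~10% savings over keeping the page as is. */
	return compressed_len + entry_overhead <= PAGE_SIZE - PAGE_SIZE / 10;
}

Pages failing such a check are exactly the ones I'd like to see taken
off the LRUs rather than bounced back to them.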
  
Johannes Weiner Dec. 20, 2023, 5:15 a.m. UTC | #39
On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote:
> > > Taking a step back from all the memory.swap.tiers vs.
> > > memory.zswap.writeback discussions, I think there may be a more
> > > fundamental problem here. If the zswap store failure is recurrent,
> > > pages can keep going back to the LRUs and then sent back to zswap
> > > eventually, only to be rejected again. For example, this can if zswap
> > > is above the acceptance threshold, but could be even worse if it's the
> > > allocator rejecting the page due to not compressing well enough. In
> > > the latter case, the page can keep going back and forth between zswap
> > > and LRUs indefinitely.
> > >
> > > You probably did not run into this as you're using zsmalloc, but it
> > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic
> > > version can happen if zswap is above its acceptance threshold.
> > >
> > > This can cause thrashing and ineffective reclaim. We have an internal
> > > implementation where we mark incompressible pages and put them on the
> > > unevictable LRU when we don't have a backing swapfile (i.e. ghost
> > > swapfiles), and something similar may work if writeback is disabled.
> > > We need to scan such incompressible pages periodically though to
> > > remove them from the unevictable LRU if they have been dirited.
> >
> > I'm not sure this is an actual problem.
> >
> > When pages get rejected, they rotate to the furthest point from the
> > reclaimer - the head of the active list. We only get to them again
> > after we scanned everything else.
> >
> > If all that's left on the LRU is unzswappable, then you'd assume that
> > remainder isn't very large, and thus not a significant part of overall
> > scan work. Because if it is, then there is a serious problem with the
> > zswap configuration.
> >
> > There might be possible optimizations to determine how permanent a
> > rejection is, but I'm not sure the effort is called for just
> > yet. Rejections are already failure cases that screw up the LRU
> > ordering, and healthy setups shouldn't have a lot of those. I don't
> > think this patch adds any sort of new complications to this picture.
> 
> We have workloads where a significant amount (maybe 20%? 30% not sure
> tbh) of the memory is incompressible. Zswap is still a very viable
> option for those workloads once those pages are taken out of the
> picture. If those pages remain on the LRUs, they will introduce a
> regression in reclaim efficiency.
> 
> With the upstream code today, those pages go directly to the backing
> store, which isn't ideal in terms of LRU ordering, but this patch
> makes them stay on the LRUs, which can be harmful. I don't think we
> can just assume it is okay. Whether we make those pages unevictable or
> store them uncompressed in zswap, I think taking them out of the LRUs
> (until they are redirtied), is the right thing to do.

This is how it works with zram as well, though, and it has plenty of
happy users. The fact that there are antagonistic workloads doesn't
mean the feature isn't useful. This flag is optional and not enabled
by default, so nobody is forced to use it where it hurts.

I'm not saying it's not worth optimizing those cases, but it doesn't
look like a requirement in order to be useful to a variety of loads.

> Adding Wei and Yu for more data about incompressible memory in our
> fleet. Keep in mind that we have internal patches to cap the
> compression ratio (i.e. reject pages where the compressed size +
> metadata is not worth it, or where zsmalloc will store it in a full
> page anyway). But the same thing can happen upstream with zbud.

I hate to bring this up, but there has been a bit of a disturbing
trend in the zswap discussions recently.

Please do not argue with private patches. Their behavior, the usecases
they enable, and their dependencies are entirely irrelevant to patches
submitted in this forum. They do not need to cater to them or consider
the consequences for them. The only thing that matters is the upstream
codebase and the usecases enabled by it.
  
Yosry Ahmed Dec. 20, 2023, 8:59 a.m. UTC | #40
On Tue, Dec 19, 2023 at 9:15 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote:
> > > > Taking a step back from all the memory.swap.tiers vs.
> > > > memory.zswap.writeback discussions, I think there may be a more
> > > > fundamental problem here. If the zswap store failure is recurrent,
> > > > pages can keep going back to the LRUs and then sent back to zswap
> > > > eventually, only to be rejected again. For example, this can if zswap
> > > > is above the acceptance threshold, but could be even worse if it's the
> > > > allocator rejecting the page due to not compressing well enough. In
> > > > the latter case, the page can keep going back and forth between zswap
> > > > and LRUs indefinitely.
> > > >
> > > > You probably did not run into this as you're using zsmalloc, but it
> > > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic
> > > > version can happen if zswap is above its acceptance threshold.
> > > >
> > > > This can cause thrashing and ineffective reclaim. We have an internal
> > > > implementation where we mark incompressible pages and put them on the
> > > > unevictable LRU when we don't have a backing swapfile (i.e. ghost
> > > > swapfiles), and something similar may work if writeback is disabled.
> > > > We need to scan such incompressible pages periodically though to
> > > > remove them from the unevictable LRU if they have been dirited.
> > >
> > > I'm not sure this is an actual problem.
> > >
> > > When pages get rejected, they rotate to the furthest point from the
> > > reclaimer - the head of the active list. We only get to them again
> > > after we scanned everything else.
> > >
> > > If all that's left on the LRU is unzswappable, then you'd assume that
> > > remainder isn't very large, and thus not a significant part of overall
> > > scan work. Because if it is, then there is a serious problem with the
> > > zswap configuration.
> > >
> > > There might be possible optimizations to determine how permanent a
> > > rejection is, but I'm not sure the effort is called for just
> > > yet. Rejections are already failure cases that screw up the LRU
> > > ordering, and healthy setups shouldn't have a lot of those. I don't
> > > think this patch adds any sort of new complications to this picture.
> >
> > We have workloads where a significant amount (maybe 20%? 30% not sure
> > tbh) of the memory is incompressible. Zswap is still a very viable
> > option for those workloads once those pages are taken out of the
> > picture. If those pages remain on the LRUs, they will introduce a
> > regression in reclaim efficiency.
> >
> > With the upstream code today, those pages go directly to the backing
> > store, which isn't ideal in terms of LRU ordering, but this patch
> > makes them stay on the LRUs, which can be harmful. I don't think we
> > can just assume it is okay. Whether we make those pages unevictable or
> > store them uncompressed in zswap, I think taking them out of the LRUs
> > (until they are redirtied), is the right thing to do.
>
> This is how it works with zram as well, though, and it has plenty of
> happy users.

I am not sure I understand. Zram does not reject pages that do not
compress well, right? IIUC it acts as a block device so it cannot
reject pages. I feel like I am missing something.

> The fact that there are antagonistic workloads doesn't
> mean the feature isn't useful. This flag is optional and not enabled
> by default, so nobody is forced to use it where it hurts.
>
> I'm not saying it's not worth optimizing those cases, but it doesn't
> look like a requirement in order to be useful to a variety of loads.

But we don't even understand the impact on those workloads today to
properly document it. What happens with a workload that uses zbud, for
example, and has quite a bit of memory that gets rejected? Is the
feature usable for such a setup or not? There have also been
discussions upstream about introducing a compression threshold for
zswap in general (see below); this seems to be going in the opposite
direction.

If we already want to support taking pages away from the LRUs when
rejected by zswap (e.g. Nhat's proposal earlier), doesn't it make
sense to do that first so that this patch can be useful for all
workloads?

>
> > Adding Wei and Yu for more data about incompressible memory in our
> > fleet. Keep in mind that we have internal patches to cap the
> > compression ratio (i.e. reject pages where the compressed size +
> > metadata is not worth it, or where zsmalloc will store it in a full
> > page anyway). But the same thing can happen upstream with zbud.
>
> I hate to bring this up, but there has been a bit of a disturbing
> trend in the zswap discussions recently.
>
> Please do not argue with private patches. Their behavior, the usecases
> they enable, and their dependencies are entirely irrelevant to patches
> submitted in this forum. They do not need to cater to them or consider
> the consequences for them. The only thing that matters is the upstream
> codebase and the usecases enabled by it.

Sorry if my intention wasn't clear. I am not arguing that this patch
affects our internal patches in any way. All I am saying is that we do
something similar internally, and we would like to move to an upstream
solution if possible -- so naturally we want the upstream solution to
work for us as well.

Besides, this can happen in the upstream codebase with zbud as I
mentioned earlier, and there have been discussions upstream about
introducing such a compression threshold as well (e.g. [1]). So it is
not something unique to Google. If this is where we think we are
headed upstream (and is already the case with zbud), I think it's not
unreasonable to bring it up.

[1]https://lore.kernel.org/lkml/7a0e3229-be63-4a24-a3fe-7e3ff517de10@bytedance.com/
  
Kairui Song Dec. 20, 2023, 10:21 a.m. UTC | #41
Chris Li <chrisl@kernel.org> wrote on Wed, Dec 13, 2023 at 07:39:
>
> Hi Kairui,
>
> Thanks for sharing the information on how you use swap.

Hi Chris,

>
> On Mon, Dec 11, 2023 at 1:31 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > 2) As indicated by this discussion, Tencent has a usage case for SSD
> > > and hard disk swap as overflow.
> > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > > +Kairui
> >
> > Yes, we are not using zswap. We are using ZRAM for swap since we have
> > many different varieties of workload instances, with a very flexible
> > storage setup. Some of them don't have the ability to set up a
> > swapfile. So we built a pack of kernel infrastructures based on ZRAM,
> > which so far worked pretty well.
>
> This is great. The usage case is actually much more than I expected.
> For example, I never thought of zram as a swap tier. Now you mention
> it. I am considering whether it makes sense to add zram to the
> memory.swap.tiers as well as zswap.
>
> >
> > The concern from some teams is that ZRAM (or zswap) can't always free
> > up memory so they may lead to higher risk of OOM compared to a
> > physical swap device, and they do have suitable devices for doing swap
> > on some of their machines. So a secondary swap support is very helpful
> > in case of memory usage peak.
> >
> > Besides this, another requirement is that different containers may
> > have different priority, some containers can tolerate high swap
> > overhead while some cannot, so swap tiering is useful for us in many
> > ways.
> >
> > And thanks to cloud infrastructure the disk setup could change from
> > time to time depending on workload requirements, so our requirement is
> > to support ZRAM (always) + SSD (optional) + HDD (also optional) as
> > swap backends, while not making things too complex to maintain.
>
> Just curious, do you use ZRAM + SSD + HDD all enabled? Do you ever
> consider moving data from ZRAM to SSD, or from SSD to HDD? If you do,
> I do see the possibility of having more general swap tiers support and
> sharing the shrinking code between tiers somehow. Granted there are
> many unanswered questions and a lot of infrastructure is lacking.
> Gathering requirements, weight in the priority of the quirement is the
> first step towards a possible solution.

Sorry for the late response. Yes, it's our plan to use ZRAM + SSD +
HDD all enabled when possible. Although currently only ZRAM + SSD is
expected.

I see this discussion is still going on, so I'll just add some info here...

We have some test environments which have a kernel worker enabled to
move data from ZRAM to SSD, and from SSD to HDD too, to free up space
for higher-tier swap devices. The kworker is simple: it maintains a
swap entry LRU for every swap device (maybe worth noting here, there
is currently no LRU-based writeback for ZRAM, ZRAM writeback requires
a fixed block device at init, and a swap-device-level LRU is also
helpful for migrating entries from SSD to HDD). It walks the page
table to swap in the coldest swap entry, then swaps it out immediately
to a lower tier, doing this page by page periodically. Overhead and
memory footprint are minimal with a limited moving rate, but the
efficiency of large-scale data moving is terrible, so it only has very
limited usage. I was trying to come up with a better design but am
currently not working on it.
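
In rough pseudo-C, the worker loop is something like this (purely
illustrative - none of these helpers or names exist upstream, and the
real thing is rate limited):

static void swap_demote_work(struct work_struct *work)
{
	int moved = 0;

	while (moved < DEMOTE_BATCH) {
		/* coldest entry on the per-device swap entry LRU */
		swp_entry_t entry = pick_coldest_entry(zram_swap_device);
		struct folio *folio;

		if (!entry.val)
			break;
		/* walk the page tables and fault the entry back in... */
		folio = swapin_cold_entry(entry);
		if (!folio)
			continue;
		/* ...then immediately push it out to the lower tier */
		swapout_to(folio, ssd_swap_device);
		moved++;
		cond_resched();
	}
	/* re-armed periodically, with a capped moving rate */
}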
  
Kairui Song Dec. 20, 2023, 10:22 a.m. UTC | #42
Chris Li <chrisl@kernel.org> wrote on Wed, Dec 13, 2023 at 07:58:
>
> Hi Minchan,
>
> On Mon, Dec 11, 2023 at 2:55 PM Minchan Kim <minchan@kernel.org> wrote:
>
> > > > 3) Android has some fancy swap ideas led by those patches.
> > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > > > It got shot down due to removal of frontswap. But the usage case and
> > > > product requirement is there.
> > > > +Minchan
> > >
> > > This looks like an optimization for zram to bypass the block layer and
> > > hook directly into the swap code. Correct me if I'm wrong, but this
> > > doesn't appear to have anything to do with per-cgroup backend control.
> >
> > Hi Johannes,
> >
> > I haven't been following the thread closely, but I noticed the discussion
> > about potential use cases for zram with memcg.
> >
> > One interesting idea I have is to implement a swap controller per cgroup.
> > This would allow us to tailor the zram swap behavior to the specific needs of
> > different groups.
> >
> > For example, Group A, which is sensitive to swap latency, could use zram swap
> > with a fast compression setting, even if it sacrifices some compression ratio.
> > This would prioritize quick access to swapped data, even if it takes up more space.
> >
> > On the other hand, Group B, which can tolerate higher swap latency, could benefit
> > from a slower compression setting that achieves a higher compression ratio.
> > This would maximize memory efficiency at the cost of slightly slower data access.
>
> That is a very solid usage case. Thanks for sharing this swap backend
> usage story. It goes beyond my original memory.swap.teires idea as
> well.
>
> We can have some zram specific knobs to control what compression
> setting is using. Moving data between different compression settings
> would be an interesting topic. It might fit the swap.tiers usage model
> as well. I am just thinking it out loud. Maybe define different
> compression settings as different tiers and then allow the cgroup to
> enroll into one of the tiers list.
>

This is very similar to our usage, and easy to implement too. Actually,
ZRAM already supports multiple compression streams, so if each
memcg just provides an extra knob to record the compression level
(1-4), ZRAM can decide which compression stream to use when the page
reaches ZRAM, just by checking the page's memcg. It's limited to 4
levels, but that's enough for most cases.
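
Very roughly, something like this - the per-memcg knob and the
mem_cgroup_zram_level() helper are hypothetical, while zram's
multi-stream support (zram->comps[]) is real:

static struct zcomp *zram_comp_for_page(struct zram *zram,
					struct page *page)
{
	/* hypothetical: read the compression level knob of the page's memcg */
	int level = mem_cgroup_zram_level(page_memcg(page));

	if (level < 0 || level >= ZRAM_MAX_COMPS || !zram->comps[level])
		level = ZRAM_PRIMARY_COMP;

	return zram->comps[level];
}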
  
Johannes Weiner Dec. 20, 2023, 2:50 p.m. UTC | #43
On Wed, Dec 20, 2023 at 12:59:15AM -0800, Yosry Ahmed wrote:
> On Tue, Dec 19, 2023 at 9:15 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote:
> > > > > Taking a step back from all the memory.swap.tiers vs.
> > > > > memory.zswap.writeback discussions, I think there may be a more
> > > > > fundamental problem here. If the zswap store failure is recurrent,
> > > > > pages can keep going back to the LRUs and then sent back to zswap
> > > > > eventually, only to be rejected again. For example, this can if zswap
> > > > > is above the acceptance threshold, but could be even worse if it's the
> > > > > allocator rejecting the page due to not compressing well enough. In
> > > > > the latter case, the page can keep going back and forth between zswap
> > > > > and LRUs indefinitely.
> > > > >
> > > > > You probably did not run into this as you're using zsmalloc, but it
> > > > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic
> > > > > version can happen if zswap is above its acceptance threshold.
> > > > >
> > > > > This can cause thrashing and ineffective reclaim. We have an internal
> > > > > implementation where we mark incompressible pages and put them on the
> > > > > unevictable LRU when we don't have a backing swapfile (i.e. ghost
> > > > > swapfiles), and something similar may work if writeback is disabled.
> > > > > We need to scan such incompressible pages periodically though to
> > > > > remove them from the unevictable LRU if they have been dirited.
> > > >
> > > > I'm not sure this is an actual problem.
> > > >
> > > > When pages get rejected, they rotate to the furthest point from the
> > > > reclaimer - the head of the active list. We only get to them again
> > > > after we scanned everything else.
> > > >
> > > > If all that's left on the LRU is unzswappable, then you'd assume that
> > > > remainder isn't very large, and thus not a significant part of overall
> > > > scan work. Because if it is, then there is a serious problem with the
> > > > zswap configuration.
> > > >
> > > > There might be possible optimizations to determine how permanent a
> > > > rejection is, but I'm not sure the effort is called for just
> > > > yet. Rejections are already failure cases that screw up the LRU
> > > > ordering, and healthy setups shouldn't have a lot of those. I don't
> > > > think this patch adds any sort of new complications to this picture.
> > >
> > > We have workloads where a significant amount (maybe 20%? 30% not sure
> > > tbh) of the memory is incompressible. Zswap is still a very viable
> > > option for those workloads once those pages are taken out of the
> > > picture. If those pages remain on the LRUs, they will introduce a
> > > regression in reclaim efficiency.
> > >
> > > With the upstream code today, those pages go directly to the backing
> > > store, which isn't ideal in terms of LRU ordering, but this patch
> > > makes them stay on the LRUs, which can be harmful. I don't think we
> > > can just assume it is okay. Whether we make those pages unevictable or
> > > store them uncompressed in zswap, I think taking them out of the LRUs
> > > (until they are redirtied), is the right thing to do.
> >
> > This is how it works with zram as well, though, and it has plenty of
> > happy users.
> 
> I am not sure I understand. Zram does not reject pages that do not
> compress well, right? IIUC it acts as a block device so it cannot
> reject pages. I feel like I am missing something.

zram_write_page() can fail for various reasons - compression failure,
zsmalloc failure, the memory limit. This results in !!bio->bi_status,
__end_swap_bio_write redirtying the page, and vmscan rotating it.
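
(The error path is roughly the following - paraphrased from memory,
not a verbatim quote of mm/page_io.c:

static void __end_swap_bio_write(struct bio *bio)
{
	struct folio *folio = bio_first_folio_all(bio);

	if (bio->bi_status) {
		/*
		 * The write to the swap device failed; for zram this is
		 * where a compression/zsmalloc/limit failure surfaces.
		 * Redirty the folio so it is not freed, and let vmscan
		 * rotate it instead of reclaiming it.
		 */
		folio_mark_dirty(folio);
		folio_clear_reclaim(folio);
	}
	folio_end_writeback(folio);
}

The real function also rate-limits a warning about the write error.)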

The effect is actually more pronounced with zram, because the pages
don't get activated and thus cycle faster.

What you're raising doesn't seem to be a dealbreaker in practice.

> If we already want to support taking pages away from the LRUs when
> rejected by zswap (e.g. Nhat's proposal earlier), doesn't it make
> sense to do that first so that this patch can be useful for all
> workloads?

No.

Why should users who can benefit now wait for a hypothetical future
optimization that isn't relevant to them? And by the looks of it, is
only relevant to a small set of specialized cases?

And the optimization - should anybody actually care to write it - can
be transparently done on top later, so that's no reason to change
merge order, either.
  
Yosry Ahmed Dec. 21, 2023, 12:24 a.m. UTC | #44
On Wed, Dec 20, 2023 at 6:50 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Dec 20, 2023 at 12:59:15AM -0800, Yosry Ahmed wrote:
> > On Tue, Dec 19, 2023 at 9:15 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Mon, Dec 18, 2023 at 01:52:23PM -0800, Yosry Ahmed wrote:
> > > > > > Taking a step back from all the memory.swap.tiers vs.
> > > > > > memory.zswap.writeback discussions, I think there may be a more
> > > > > > fundamental problem here. If the zswap store failure is recurrent,
> > > > > > pages can keep going back to the LRUs and then sent back to zswap
> > > > > > eventually, only to be rejected again. For example, this can if zswap
> > > > > > is above the acceptance threshold, but could be even worse if it's the
> > > > > > allocator rejecting the page due to not compressing well enough. In
> > > > > > the latter case, the page can keep going back and forth between zswap
> > > > > > and LRUs indefinitely.
> > > > > >
> > > > > > You probably did not run into this as you're using zsmalloc, but it
> > > > > > can happen with zbud AFAICT. Even with zsmalloc, a less problematic
> > > > > > version can happen if zswap is above its acceptance threshold.
> > > > > >
> > > > > > This can cause thrashing and ineffective reclaim. We have an internal
> > > > > > implementation where we mark incompressible pages and put them on the
> > > > > > unevictable LRU when we don't have a backing swapfile (i.e. ghost
> > > > > > swapfiles), and something similar may work if writeback is disabled.
> > > > > > We need to scan such incompressible pages periodically though to
> > > > > > remove them from the unevictable LRU if they have been dirited.
> > > > >
> > > > > I'm not sure this is an actual problem.
> > > > >
> > > > > When pages get rejected, they rotate to the furthest point from the
> > > > > reclaimer - the head of the active list. We only get to them again
> > > > > after we scanned everything else.
> > > > >
> > > > > If all that's left on the LRU is unzswappable, then you'd assume that
> > > > > remainder isn't very large, and thus not a significant part of overall
> > > > > scan work. Because if it is, then there is a serious problem with the
> > > > > zswap configuration.
> > > > >
> > > > > There might be possible optimizations to determine how permanent a
> > > > > rejection is, but I'm not sure the effort is called for just
> > > > > yet. Rejections are already failure cases that screw up the LRU
> > > > > ordering, and healthy setups shouldn't have a lot of those. I don't
> > > > > think this patch adds any sort of new complications to this picture.
> > > >
> > > > We have workloads where a significant amount (maybe 20%? 30% not sure
> > > > tbh) of the memory is incompressible. Zswap is still a very viable
> > > > option for those workloads once those pages are taken out of the
> > > > picture. If those pages remain on the LRUs, they will introduce a
> > > > regression in reclaim efficiency.
> > > >
> > > > With the upstream code today, those pages go directly to the backing
> > > > store, which isn't ideal in terms of LRU ordering, but this patch
> > > > makes them stay on the LRUs, which can be harmful. I don't think we
> > > > can just assume it is okay. Whether we make those pages unevictable or
> > > > store them uncompressed in zswap, I think taking them out of the LRUs
> > > > (until they are redirtied), is the right thing to do.
> > >
> > > This is how it works with zram as well, though, and it has plenty of
> > > happy users.
> >
> > I am not sure I understand. Zram does not reject pages that do not
> > compress well, right? IIUC it acts as a block device so it cannot
> > reject pages. I feel like I am missing something.
>
> zram_write_page() can fail for various reasons - compression failure,
> zsmalloc failure, the memory limit. This results in !!bio->bi_status,
> __end_swap_bio_write redirtying the page, and vmscan rotating it.
>
> The effect is actually more pronounced with zram, because the pages
> don't get activated and thus cycle faster.
>
> What you're raising doesn't seem to be a dealbreaker in practice.

For the workloads using zram, yes, they are exclusively using zsmalloc
which can store incompressible pages anyway.

>
> > If we already want to support taking pages away from the LRUs when
> > rejected by zswap (e.g. Nhat's proposal earlier), doesn't it make
> > sense to do that first so that this patch can be useful for all
> > workloads?
>
> No.
>
> Why should users who can benefit now wait for a hypothetical future
> optimization that isn't relevant to them? And by the looks of it, is
> only relevant to a small set of specialized cases?
>
> And the optimization - should anybody actually care to write it - can
> be transparently done on top later, so that's no reason to change
> merge order, either.

We can agree to disagree here, I am not trying to block this anyway.
But let's at least document this in the commit message/docs/code
(wherever it makes sense) -- that recurrent failures (e.g.
incompressible memory) may keep going back to zswap only to get
rejected, so workloads prone to this may observe some reclaim
inefficiency.
  
Nhat Pham Dec. 21, 2023, 12:50 a.m. UTC | #45
On Wed, Dec 20, 2023 at 4:24 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> We can agree to disagree here, I am not trying to block this anyway.
> But let's at least document this in the commit message/docs/code
> (wherever it makes sense) -- that recurrent failures (e.g.
> incompressible memory) may keep going back to zswap only to get
> rejected, so workloads prone to this may observe some reclaim
> inefficiency.

I'll add the following caveat:

Note that if the store failures are recurring (e.g. if the pages are
incompressible), users can observe reclaim inefficiency after disabling
writeback (because the same pages might be rejected again and again).

to the zswap documentation and the cgroup documentation then? I'll
repeat this caveat in both places for self-containment purposes.
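
For anyone who wants to poke at this from userspace, here is a minimal
sketch (assuming cgroup v2 is mounted at /sys/fs/cgroup and a kernel
carrying this patch; the program name and argument handling are purely
illustrative) that toggles the new knob programmatically -- it does the
same thing as the echo in the docs below:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[4096];
	FILE *f;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <cgroup-name> <0|1>\n", argv[0]);
		return 1;
	}

	/* The interface file added by this patch (see the cgroup-v2.rst hunk). */
	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/%s/memory.zswap.writeback", argv[1]);

	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}

	/* "0" disables zswap writeback and the swap fallback; "1" re-enables. */
	fprintf(f, "%s\n", argv[2]);

	if (fclose(f)) {
		perror(path);
		return 1;
	}
	return 0;
}

Running it as "toggle_zswap_wb <cgroup-name> 0" opts that cgroup out of
writeback; writing "1" restores the default behavior.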
  

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..2b4ac43efdc8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1679,6 +1679,18 @@  PAGE_SIZE multiple when read back.
 	limit, it will refuse to take any more stores before existing
 	entries fault back in or are written out to disk.
 
+  memory.zswap.writeback
+	A read-write single value file. The default value is "1". The
+	initial value of the root cgroup is 1, and when a new cgroup is
+	created, it inherits the current value of its parent.
+
+	When this is set to 0, all swapping attempts to backing swap
+	devices are disabled. This includes both zswap writebacks and
+	swapping due to zswap store failures.
+
+	Note that this is subtly different from setting memory.swap.max to
+	0, as it still allows for pages to be written to the zswap pool.
+
   memory.pressure
 	A read-only nested-keyed file.
 
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 62fc244ec702..cfa653130346 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -153,6 +153,12 @@  attribute, e. g.::
 
 Setting this parameter to 100 will disable the hysteresis.
 
+Some users cannot tolerate the swapping that comes with zswap store failures
+and zswap writebacks. Swapping can be disabled entirely (without disabling
+zswap itself) on a per-cgroup basis as follows:
+
+	echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
+
 When there is a sizable amount of cold memory residing in the zswap pool, it
 can be advantageous to proactively write these cold pages to swap and reclaim
 the memory for other use cases. By default, the zswap shrinker is disabled.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 43b77363ab8e..5de775e6cdd9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -219,6 +219,12 @@  struct mem_cgroup {
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	unsigned long zswap_max;
+
+	/*
+	 * When false, prevent pages from this memcg from being written
+	 * back from zswap to swap, or swapped out on zswap store failures.
+	 */
+	bool zswap_writeback;
 #endif
 
 	unsigned long soft_limit;
@@ -1941,6 +1947,7 @@  static inline void count_objcg_event(struct obj_cgroup *objcg,
 bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
 void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
 void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
+bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
 #else
 static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 {
@@ -1954,6 +1961,11 @@  static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
 					     size_t size)
 {
 }
+static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
+{
+	/* if zswap is disabled, do not block pages going to the swapping device */
+	return true;
+}
 #endif
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 08c240e16a01..a78ceaf3a65e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -35,6 +35,7 @@  void zswap_swapoff(int type);
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
 void zswap_lruvec_state_init(struct lruvec *lruvec);
 void zswap_page_swapin(struct page *page);
+bool is_zswap_enabled(void);
 #else
 
 struct zswap_lruvec_state {};
@@ -55,6 +56,11 @@  static inline void zswap_swapoff(int type) {}
 static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
 static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
 static inline void zswap_page_swapin(struct page *page) {}
+
+static inline bool is_zswap_enabled(void)
+{
+	return false;
+}
 #endif
 
 #endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7bc47316acb..ae8c62c7aa53 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5538,6 +5538,8 @@  mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	memcg->zswap_max = PAGE_COUNTER_MAX;
+	WRITE_ONCE(memcg->zswap_writeback,
+		!parent || READ_ONCE(parent->zswap_writeback));
 #endif
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
@@ -8174,6 +8176,12 @@  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 	rcu_read_unlock();
 }
 
+bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
+{
+	/* if zswap is disabled, do not block pages going to the swapping device */
+	return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback);
+}
+
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
@@ -8206,6 +8214,31 @@  static ssize_t zswap_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int zswap_writeback_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
+	return 0;
+}
+
+static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int zswap_writeback;
+	ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
+
+	if (parse_ret)
+		return parse_ret;
+
+	if (zswap_writeback != 0 && zswap_writeback != 1)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
+	return nbytes;
+}
+
 static struct cftype zswap_files[] = {
 	{
 		.name = "zswap.current",
@@ -8218,6 +8251,11 @@  static struct cftype zswap_files[] = {
 		.seq_show = zswap_max_show,
 		.write = zswap_max_write,
 	},
+	{
+		.name = "zswap.writeback",
+		.seq_show = zswap_writeback_show,
+		.write = zswap_writeback_write,
+	},
 	{ }	/* terminate */
 };
 #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
diff --git a/mm/page_io.c b/mm/page_io.c
index cb559ae324c6..5e606f1aa2f6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -201,6 +201,12 @@  int swap_writepage(struct page *page, struct writeback_control *wbc)
 		folio_end_writeback(folio);
 		return 0;
 	}
+
+	if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+		folio_mark_dirty(folio);
+		return AOP_WRITEPAGE_ACTIVATE;
+	}
+
 	__swap_writepage(&folio->page, wbc);
 	return 0;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index c62f904ba1ca..dd084fbafcf1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1514,8 +1514,7 @@  static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 
 		mutex_unlock(&shmem_swaplist_mutex);
 		BUG_ON(folio_mapped(folio));
-		swap_writepage(&folio->page, wbc);
-		return 0;
+		return swap_writepage(&folio->page, wbc);
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/zswap.c b/mm/zswap.c
index daaa949837f2..7ee54a3d8281 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -153,6 +153,11 @@  static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+bool is_zswap_enabled(void)
+{
+	return zswap_enabled;
+}
+
 /*********************************
 * data structures
 **********************************/
@@ -596,7 +601,8 @@  static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 	struct zswap_pool *pool = shrinker->private_data;
 	bool encountered_page_in_swapcache = false;
 
-	if (!zswap_shrinker_enabled) {
+	if (!zswap_shrinker_enabled ||
+			!mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
 		sc->nr_scanned = 0;
 		return SHRINK_STOP;
 	}
@@ -637,7 +643,7 @@  static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
 	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
 
-	if (!zswap_shrinker_enabled)
+	if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
 		return 0;
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -956,6 +962,9 @@  static int shrink_memcg(struct mem_cgroup *memcg)
 	struct zswap_pool *pool;
 	int nid, shrunk = 0;
 
+	if (!mem_cgroup_zswap_writeback_enabled(memcg))
+		return -EINVAL;
+
 	/*
 	 * Skip zombies because their LRUs are reparented and we would be
 	 * reclaiming from the parent instead of the dead memcg.