mm: zswap: shrink until can accept

Message ID 20230524065051.6328-1-cerasuolodomenico@gmail.com
State New
Series: mm: zswap: shrink until can accept

Commit Message

Domenico Cerasuolo May 24, 2023, 6:50 a.m. UTC
  This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.

The functioning of the zswap shrink worker was found to be inadequate,
as shown by a basic benchmark test. For the test, a kernel build was
used as a reference, with its memory confined to 1G via a cgroup and a
5G swap file provided. The results presented below are averages of
three runs without the use of zswap:

real 46m26s
user 35m4s
sys 7m37s

With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:

real 56m4s
user 35m13s
sys 8m43s

written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit: 1478

Besides the evident regression, one thing to notice from this data is
the extremely low number of written_back_pages and pool_limit_hit.

The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular
scenario: once zswap hits its limit, zswap_pool_reached_full is set to
true; with this flag on, zswap_frontswap_store rejects pages if zswap is
still above the acceptance threshold. Once we include the rejections due
to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
1478 to a significant 21578266.

Zswap is stuck in an undesirable state where it rejects pages because
it's above the acceptance threshold, yet fails to attempt memory
reclamation. This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.

This state results in hot pages getting written directly to disk,
while cold ones remain in memory, waiting only to be invalidated. The LRU
order is completely broken and zswap ends up being just an overhead
without providing any benefits.

This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.

Testing this suggested update showed much better numbers:

real 36m37s
user 35m8s
sys 9m32s

written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653

Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
---
 mm/zswap.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
  

Comments

Yosry Ahmed May 25, 2023, 12:58 a.m. UTC | #1
Hi Domenico,

On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo
<cerasuolodomenico@gmail.com> wrote:
>
> This update addresses an issue with the zswap reclaim mechanism, which
> hinders the efficient offloading of cold pages to disk, thereby
> compromising the preservation of the LRU order and consequently
> diminishing, if not inverting, its performance benefits.
>
> The functioning of the zswap shrink worker was found to be inadequate,
> as shown by basic benchmark test. For the test, a kernel build was
> utilized as a reference, with its memory confined to 1G via a cgroup and
> a 5G swap file provided. The results are presented below, these are
> averages of three runs without the use of zswap:
>
> real 46m26s
> user 35m4s
> sys 7m37s
>
> With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> system), the results changed to:
>
> real 56m4s
> user 35m13s
> sys 8m43s
>
> written_back_pages: 18
> reject_reclaim_fail: 0
> pool_limit_hit:1478
>
> Besides the evident regression, one thing to notice from this data is
> the extremely low number of written_back_pages and pool_limit_hit.
>
> The pool_limit_hit counter, which is increased in zswap_frontswap_store
> when zswap is completely full, doesn't account for a particular
> scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> still above the acceptance threshold. Once we include the rejections due
> to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> 1478 to a significant 21578266.
>
> Zswap is stuck in an undesirable state where it rejects pages because
> it's above the acceptance threshold, yet fails to attempt memory
> reclaimation. This happens because the shrink work is only queued when
> zswap_frontswap_store detects that it's full and the work itself only
> reclaims one page per run.
>
> This state results in hot pages getting written directly to disk,
> while cold ones remain memory, waiting only to be invalidated. The LRU
> order is completely broken and zswap ends up being just an overhead
> without providing any benefits.
>
> This commit applies 2 changes: a) the shrink worker is set to reclaim
> pages until the acceptance threshold is met and b) the task is also
> enqueued when zswap is not full but still above the threshold.
>
> Testing this suggested update showed much better numbers:
>
> real 36m37s
> user 35m8s
> sys 9m32s
>
> written_back_pages: 10459423
> reject_reclaim_fail: 12896
> pool_limit_hit: 75653

Impressive numbers, and great find in general!

I wonder how other workloads benefit/regress from this change.
Anything else that you happened to run? :)

>
> Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> ---
>  mm/zswap.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..2ee0775d8213 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
>  {
>         struct zswap_pool *pool = container_of(w, typeof(*pool),
>                                                 shrink_work);
> +       int ret;
>
> -       if (zpool_shrink(pool->zpool, 1, NULL))
> -               zswap_reject_reclaim_fail++;
> +       do {
> +               ret = zpool_shrink(pool->zpool, 1, NULL);
> +               if (ret)
> +                       zswap_reject_reclaim_fail++;
> +       } while (!zswap_can_accept() && ret != -EINVAL);

One question/comment here about the retry logic.

So I looked through the awfully convoluted writeback code, and there
are multiple layers, and some of them tend to overwrite the return
value of the layer underneath :(

For zsmalloc (as an example):
zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry().

First of all, that zpool_ops is an unnecessarily confusing
indirection, but anyway.

- zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error
- zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru
is empty, and -EAGAIN on other errors.
- zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the
return value of  zs_reclaim_page()/zbud_reclaim_page().
- zpool_shrink() will return -EINVAL if the driver does not support
shrinking, otherwise it will propagate the return value from the
driver.

So it looks like we will get -EINVAL only if the driver lru is empty
or the driver does not support writeback, so rightfully we should not
retry.

If zswap_writeback_entry() returns -EEXIST, it probably means that we
raced with another task decompressing the page, so rightfully we
should retry to writeback another page instead.

If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily
mean that we couldn't allocate memory (unfortunately), it looks like
we will return -ENOMEM in other cases as well. Arguably, we can retry
in all cases, because even if we were actually out of memory, we are
trying to make an allocation that will eventually free more memory.

In all cases, I think it would be nicer if we retry if ret == -EAGAIN
instead. It is semantically more sane. In this specific case it is
functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly
return -EAGAIN anyway, but in case someone tries to use saner errnos
in the future, this will age better.
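
As a rough sketch (not the posted patch), that suggestion would only
change the loop condition, something like:

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret)
			zswap_reject_reclaim_fail++;
		/* keep going on success or a transient -EAGAIN failure */
	} while (!zswap_can_accept() && (!ret || ret == -EAGAIN));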

Also, do we intentionally want to keep retrying forever on failure? Do
we want to add a max number of retries?

>         zswap_pool_put(pool);
>  }
>
> @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>         if (zswap_pool_reached_full) {
>                if (!zswap_can_accept()) {
>                         ret = -ENOMEM;
> -                       goto reject;
> +                       goto shrink;
>                 } else
>                         zswap_pool_reached_full = false;
>         }
> --
> 2.34.1
>
  
Domenico Cerasuolo May 25, 2023, 4:52 p.m. UTC | #2
On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Hi Domenico,
>
> On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo
> <cerasuolodomenico@gmail.com> wrote:
> >
> > This update addresses an issue with the zswap reclaim mechanism, which
> > hinders the efficient offloading of cold pages to disk, thereby
> > compromising the preservation of the LRU order and consequently
> > diminishing, if not inverting, its performance benefits.
> >
> > The functioning of the zswap shrink worker was found to be inadequate,
> > as shown by basic benchmark test. For the test, a kernel build was
> > utilized as a reference, with its memory confined to 1G via a cgroup and
> > a 5G swap file provided. The results are presented below, these are
> > averages of three runs without the use of zswap:
> >
> > real 46m26s
> > user 35m4s
> > sys 7m37s
> >
> > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > system), the results changed to:
> >
> > real 56m4s
> > user 35m13s
> > sys 8m43s
> >
> > written_back_pages: 18
> > reject_reclaim_fail: 0
> > pool_limit_hit:1478
> >
> > Besides the evident regression, one thing to notice from this data is
> > the extremely low number of written_back_pages and pool_limit_hit.
> >
> > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > when zswap is completely full, doesn't account for a particular
> > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > still above the acceptance threshold. Once we include the rejections due
> > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > 1478 to a significant 21578266.
> >
> > Zswap is stuck in an undesirable state where it rejects pages because
> > it's above the acceptance threshold, yet fails to attempt memory
> > reclaimation. This happens because the shrink work is only queued when
> > zswap_frontswap_store detects that it's full and the work itself only
> > reclaims one page per run.
> >
> > This state results in hot pages getting written directly to disk,
> > while cold ones remain memory, waiting only to be invalidated. The LRU
> > order is completely broken and zswap ends up being just an overhead
> > without providing any benefits.
> >
> > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > pages until the acceptance threshold is met and b) the task is also
> > enqueued when zswap is not full but still above the threshold.
> >
> > Testing this suggested update showed much better numbers:
> >
> > real 36m37s
> > user 35m8s
> > sys 9m32s
> >
> > written_back_pages: 10459423
> > reject_reclaim_fail: 12896
> > pool_limit_hit: 75653
>
> Impressive numbers, and great find in general!
>
> I wonder how other workloads benefit/regress from this change.
> Anything else that you happened to run? :)
>
Hi Yosry,

thanks for the quick feedback!

Besides stressers, I haven't tested any other actual workload; we don't have
writeback in production yet, so I can't provide any data from there. I was
wondering what kind of actual workload I could use to test this on my desktop,
but I couldn't think of anything relevant. I'm open to suggestions though :)
> >
> > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > ---
> >  mm/zswap.c | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 59da2a415fbb..2ee0775d8213 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> >  {
> >         struct zswap_pool *pool = container_of(w, typeof(*pool),
> >                                                 shrink_work);
> > +       int ret;
> >
> > -       if (zpool_shrink(pool->zpool, 1, NULL))
> > -               zswap_reject_reclaim_fail++;
> > +       do {
> > +               ret = zpool_shrink(pool->zpool, 1, NULL);
> > +               if (ret)
> > +                       zswap_reject_reclaim_fail++;
> > +       } while (!zswap_can_accept() && ret != -EINVAL);
>
> One question/comment here about the retry logic.
>
> So I looked through the awfully convoluted writeback code, and there
> are multiple layers, and some of them tend to overwrite the return
> value of the layer underneath :(
>
> For zsmalloc (as an example):
> zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry().
>
> First of all, that zpool_ops is an unnecessarily confusing
> indirection, but anyway.
>
> - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error
> - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru
> is empty, and -EAGAIN on other errors.
> - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the
> return value of  zs_reclaim_page()/zbud_reclaim_page().
> - zpool_shrink() will return -EINVAL if the driver does not support
> shrinking, otherwise it will propagate the return value from the
> driver.
>
> So it looks like we will get -EINVAL only if the driver lru is empty
> or the driver does not support writeback, so rightfully we should not
> retry.
>
> If zswap_writeback_entry() returns -EEXIST, it probably means that we
> raced with another task decompressing the page, so rightfully we
> should retry to writeback another page instead.
>
> If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily
> mean that we couldn't allocate memory (unfortunately), it looks like
> we will return -ENOMEM in other cases as well. Arguably, we can retry
> in all cases, because even if we were actually out of memory, we are
> trying to make an allocation that will eventually free more memory.
>
> In all cases, I think it would be nicer if we retry if ret == -EAGAIN
> instead. It is semantically more sane. In this specific case it is
> functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly
> return -EAGAIN anyway, but in case someone tries to use saner errnos
> in the future, this will age better.
Retrying if ret == -EAGAIN seems much nicer indeed, will change it.
>
> Also, do we intentionally want to keep retrying forever on failure? Do
> we want to add a max number of retries?
If the drivers guaranteed that zpool_shrink will remove at least an entry
from their LRU, the retry wouldn't be needed because the LRU will
eventually be emptied. But this is an assumption on the implementation of
the zpool, so yes, we could use a max number of retries. I think that,
being a sanity check, it should overshoot the required number of
iterations in order to avoid a premature break; what about retrying a
max of zswap_stored_pages times?
>
> >         zswap_pool_put(pool);
> >  }
> >
> > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> >         if (zswap_pool_reached_full) {
> >                if (!zswap_can_accept()) {
> >                         ret = -ENOMEM;
> > -                       goto reject;
> > +                       goto shrink;
> >                 } else
> >                         zswap_pool_reached_full = false;
> >         }
> > --
> > 2.34.1
> >
  
Yosry Ahmed May 25, 2023, 7:09 p.m. UTC | #3
On Thu, May 25, 2023 at 9:53 AM Domenico Cerasuolo
<cerasuolodomenico@gmail.com> wrote:
>
> On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Hi Domenico,
> >
> > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo
> > <cerasuolodomenico@gmail.com> wrote:
> > >
> > > This update addresses an issue with the zswap reclaim mechanism, which
> > > hinders the efficient offloading of cold pages to disk, thereby
> > > compromising the preservation of the LRU order and consequently
> > > diminishing, if not inverting, its performance benefits.
> > >
> > > The functioning of the zswap shrink worker was found to be inadequate,
> > > as shown by basic benchmark test. For the test, a kernel build was
> > > utilized as a reference, with its memory confined to 1G via a cgroup and
> > > a 5G swap file provided. The results are presented below, these are
> > > averages of three runs without the use of zswap:
> > >
> > > real 46m26s
> > > user 35m4s
> > > sys 7m37s
> > >
> > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > system), the results changed to:
> > >
> > > real 56m4s
> > > user 35m13s
> > > sys 8m43s
> > >
> > > written_back_pages: 18
> > > reject_reclaim_fail: 0
> > > pool_limit_hit:1478
> > >
> > > Besides the evident regression, one thing to notice from this data is
> > > the extremely low number of written_back_pages and pool_limit_hit.
> > >
> > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > when zswap is completely full, doesn't account for a particular
> > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > still above the acceptance threshold. Once we include the rejections due
> > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > 1478 to a significant 21578266.
> > >
> > > Zswap is stuck in an undesirable state where it rejects pages because
> > > it's above the acceptance threshold, yet fails to attempt memory
> > > reclaimation. This happens because the shrink work is only queued when
> > > zswap_frontswap_store detects that it's full and the work itself only
> > > reclaims one page per run.
> > >
> > > This state results in hot pages getting written directly to disk,
> > > while cold ones remain memory, waiting only to be invalidated. The LRU
> > > order is completely broken and zswap ends up being just an overhead
> > > without providing any benefits.
> > >
> > > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > > pages until the acceptance threshold is met and b) the task is also
> > > enqueued when zswap is not full but still above the threshold.
> > >
> > > Testing this suggested update showed much better numbers:
> > >
> > > real 36m37s
> > > user 35m8s
> > > sys 9m32s
> > >
> > > written_back_pages: 10459423
> > > reject_reclaim_fail: 12896
> > > pool_limit_hit: 75653
> >
> > Impressive numbers, and great find in general!
> >
> > I wonder how other workloads benefit/regress from this change.
> > Anything else that you happened to run? :)
> >
> Hi Yosry,
>
> thanks for the quick feedback!
>
> Besides stressers, I haven't tested any other actual workload, we don't have
> writeback in production yet so I can't provide any data from there. I was
> wondering what kind of actual workload I could use to test this on my desktop,
> but I couldn't think of anything relevant, I'm open to suggestions though :)

Nothing in mind in particular as well. Perhaps others have ideas.

> > >
> > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > > ---
> > >  mm/zswap.c | 10 +++++++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 59da2a415fbb..2ee0775d8213 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > >  {
> > >         struct zswap_pool *pool = container_of(w, typeof(*pool),
> > >                                                 shrink_work);
> > > +       int ret;
> > >
> > > -       if (zpool_shrink(pool->zpool, 1, NULL))
> > > -               zswap_reject_reclaim_fail++;
> > > +       do {
> > > +               ret = zpool_shrink(pool->zpool, 1, NULL);
> > > +               if (ret)
> > > +                       zswap_reject_reclaim_fail++;
> > > +       } while (!zswap_can_accept() && ret != -EINVAL);
> >
> > One question/comment here about the retry logic.
> >
> > So I looked through the awfully convoluted writeback code, and there
> > are multiple layers, and some of them tend to overwrite the return
> > value of the layer underneath :(
> >
> > For zsmalloc (as an example):
> > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry().
> >
> > First of all, that zpool_ops is an unnecessarily confusing
> > indirection, but anyway.
> >
> > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error
> > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru
> > is empty, and -EAGAIN on other errors.
> > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the
> > return value of  zs_reclaim_page()/zbud_reclaim_page().
> > - zpool_shrink() will return -EINVAL if the driver does not support
> > shrinking, otherwise it will propagate the return value from the
> > driver.
> >
> > So it looks like we will get -EINVAL only if the driver lru is empty
> > or the driver does not support writeback, so rightfully we should not
> > retry.
> >
> > If zswap_writeback_entry() returns -EEXIST, it probably means that we
> > raced with another task decompressing the page, so rightfully we
> > should retry to writeback another page instead.
> >
> > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily
> > mean that we couldn't allocate memory (unfortunately), it looks like
> > we will return -ENOMEM in other cases as well. Arguably, we can retry
> > in all cases, because even if we were actually out of memory, we are
> > trying to make an allocation that will eventually free more memory.
> >
> > In all cases, I think it would be nicer if we retry if ret == -EAGAIN
> > instead. It is semantically more sane. In this specific case it is
> > functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly
> > return -EAGAIN anyway, but in case someone tries to use saner errnos
> > in the future, this will age better.
> Retrying if ret == -EAGAIN seems much nicer indeed, will change it.
> >
> > Also, do we intentionally want to keep retrying forever on failure? Do
> > we want to add a max number of retries?
> If the drivers guaranteed that zpool_shrink will remove at least an entry
> from their LRU, the retry wouldn't be needed because the LRU will
> eventually be emptied. But this is an assumption on the implementation of

I don't think any zpool driver can guarantee to write back at least one
page. It can fail for reasons beyond its control (e.g. it cannot
allocate a page to decompress to).

> the zpool, so yes, we could use a max retries. I think that being a sanity
> check, it should overshoot the required number of iterations in order to
> avoid a premature break, what about retrying a max of
> zswap_stored_pages times?

Why is it just a sanity check? Consider a case where the system is
under such extreme memory pressure that the drivers cannot allocate
pages to decompress to. The drivers would be continuously competing with
all other allocations on the machine. I think we should give up at some
point. In any case, you changed zswap_frontswap_store() to goto shrink
if !zswap_can_accept(), so next time we try to compress a page to zswap
we will try again anyway -- perhaps under better circumstances.

I think zswap_stored_pages is too much, keep in mind that it also
includes same-filled pages which are not even stored in the zpool
drivers. Maybe we should allow a fixed number of failures. If
zpool_shrink() is successful, keep going until zswap_can_accept().
Otherwise allow a fixed number of failures before giving up.
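
A minimal sketch of that idea (MAX_SHRINK_FAILURES is a hypothetical
placeholder for whatever limit gets picked, not an existing constant):

	int failures = 0, ret;

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret) {
			zswap_reject_reclaim_fail++;
			/* bound the total number of failed attempts */
			if (++failures == MAX_SHRINK_FAILURES)
				break;
		}
	} while (!zswap_can_accept());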

Maybe we can improve the error codes propagated through the writeback
code as well to improve the retry logic. For example, if
zswap_writeback_entry() returns EEXIST then retrying should be
harmless, but ENOMEM might be harmful. Both are propagated as
EAGAIN from zs_zpool_shrink()/zbud_zpool_shrink().

> >
> > >         zswap_pool_put(pool);
> > >  }
> > >
> > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > >         if (zswap_pool_reached_full) {
> > >                if (!zswap_can_accept()) {
> > >                         ret = -ENOMEM;
> > > -                       goto reject;
> > > +                       goto shrink;
> > >                 } else
> > >                         zswap_pool_reached_full = false;
> > >         }
> > > --
> > > 2.34.1
> > >
  
Domenico Cerasuolo May 26, 2023, 8:52 a.m. UTC | #4
On Thu, May 25, 2023 at 9:10 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, May 25, 2023 at 9:53 AM Domenico Cerasuolo
> <cerasuolodomenico@gmail.com> wrote:
> >
> > On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > Hi Domenico,
> > >
> > > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo
> > > <cerasuolodomenico@gmail.com> wrote:
> > > >
> > > > This update addresses an issue with the zswap reclaim mechanism, which
> > > > hinders the efficient offloading of cold pages to disk, thereby
> > > > compromising the preservation of the LRU order and consequently
> > > > diminishing, if not inverting, its performance benefits.
> > > >
> > > > The functioning of the zswap shrink worker was found to be inadequate,
> > > > as shown by basic benchmark test. For the test, a kernel build was
> > > > utilized as a reference, with its memory confined to 1G via a cgroup and
> > > > a 5G swap file provided. The results are presented below, these are
> > > > averages of three runs without the use of zswap:
> > > >
> > > > real 46m26s
> > > > user 35m4s
> > > > sys 7m37s
> > > >
> > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > > system), the results changed to:
> > > >
> > > > real 56m4s
> > > > user 35m13s
> > > > sys 8m43s
> > > >
> > > > written_back_pages: 18
> > > > reject_reclaim_fail: 0
> > > > pool_limit_hit:1478
> > > >
> > > > Besides the evident regression, one thing to notice from this data is
> > > > the extremely low number of written_back_pages and pool_limit_hit.
> > > >
> > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > > when zswap is completely full, doesn't account for a particular
> > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > > still above the acceptance threshold. Once we include the rejections due
> > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > > 1478 to a significant 21578266.
> > > >
> > > > Zswap is stuck in an undesirable state where it rejects pages because
> > > > it's above the acceptance threshold, yet fails to attempt memory
> > > > reclaimation. This happens because the shrink work is only queued when
> > > > zswap_frontswap_store detects that it's full and the work itself only
> > > > reclaims one page per run.
> > > >
> > > > This state results in hot pages getting written directly to disk,
> > > > while cold ones remain memory, waiting only to be invalidated. The LRU
> > > > order is completely broken and zswap ends up being just an overhead
> > > > without providing any benefits.
> > > >
> > > > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > > > pages until the acceptance threshold is met and b) the task is also
> > > > enqueued when zswap is not full but still above the threshold.
> > > >
> > > > Testing this suggested update showed much better numbers:
> > > >
> > > > real 36m37s
> > > > user 35m8s
> > > > sys 9m32s
> > > >
> > > > written_back_pages: 10459423
> > > > reject_reclaim_fail: 12896
> > > > pool_limit_hit: 75653
> > >
> > > Impressive numbers, and great find in general!
> > >
> > > I wonder how other workloads benefit/regress from this change.
> > > Anything else that you happened to run? :)
> > >
> > Hi Yosry,
> >
> > thanks for the quick feedback!
> >
> > Besides stressers, I haven't tested any other actual workload, we don't have
> > writeback in production yet so I can't provide any data from there. I was
> > wondering what kind of actual workload I could use to test this on my desktop,
> > but I couldn't think of anything relevant, I'm open to suggestions though :)
>
> Nothing in mind in particular as well. Perhaps others have ideas.
>
> > > >
> > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > > > ---
> > > >  mm/zswap.c | 10 +++++++---
> > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 59da2a415fbb..2ee0775d8213 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > > >  {
> > > >         struct zswap_pool *pool = container_of(w, typeof(*pool),
> > > >                                                 shrink_work);
> > > > +       int ret;
> > > >
> > > > -       if (zpool_shrink(pool->zpool, 1, NULL))
> > > > -               zswap_reject_reclaim_fail++;
> > > > +       do {
> > > > +               ret = zpool_shrink(pool->zpool, 1, NULL);
> > > > +               if (ret)
> > > > +                       zswap_reject_reclaim_fail++;
> > > > +       } while (!zswap_can_accept() && ret != -EINVAL);
> > >
> > > One question/comment here about the retry logic.
> > >
> > > So I looked through the awfully convoluted writeback code, and there
> > > are multiple layers, and some of them tend to overwrite the return
> > > value of the layer underneath :(
> > >
> > > For zsmalloc (as an example):
> > > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry().
> > >
> > > First of all, that zpool_ops is an unnecessarily confusing
> > > indirection, but anyway.
> > >
> > > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error
> > > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru
> > > is empty, and -EAGAIN on other errors.
> > > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the
> > > return value of  zs_reclaim_page()/zbud_reclaim_page().
> > > - zpool_shrink() will return -EINVAL if the driver does not support
> > > shrinking, otherwise it will propagate the return value from the
> > > driver.
> > >
> > > So it looks like we will get -EINVAL only if the driver lru is empty
> > > or the driver does not support writeback, so rightfully we should not
> > > retry.
> > >
> > > If zswap_writeback_entry() returns -EEXIST, it probably means that we
> > > raced with another task decompressing the page, so rightfully we
> > > should retry to writeback another page instead.
> > >
> > > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily
> > > mean that we couldn't allocate memory (unfortunately), it looks like
> > > we will return -ENOMEM in other cases as well. Arguably, we can retry
> > > in all cases, because even if we were actually out of memory, we are
> > > trying to make an allocation that will eventually free more memory.
> > >
> > > In all cases, I think it would be nicer if we retry if ret == -EAGAIN
> > > instead. It is semantically more sane. In this specific case it is
> > > functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly
> > > return -EAGAIN anyway, but in case someone tries to use saner errnos
> > > in the future, this will age better.
> > Retrying if ret == -EAGAIN seems much nicer indeed, will change it.
> > >
> > > Also, do we intentionally want to keep retrying forever on failure? Do
> > > we want to add a max number of retries?
> > If the drivers guaranteed that zpool_shrink will remove at least an entry
> > from their LRU, the retry wouldn't be needed because the LRU will
> > eventually be emptied. But this is an assumption on the implementation of
>
> I don't think any zpool driver can guarantee to writeback at least one
> page. It can fail for reasons beyond their control (e.g. cannot
> allocate a page to decompress to).
>
> > the zpool, so yes, we could use a max retries. I think that being a sanity
> > check, it should overshoot the required number of iterations in order to
> > avoid a premature break, what about retrying a max of
> > zswap_stored_pages times?
>
> Why is it just a sanity check? Consider a case where the system is
> under extreme memory pressure that the drivers cannot allocate pages
> to decompress to. The drivers would be continuously competing with all
> other allocations on the machine. I think we should give up at some
> point. In any case, you changed the zswap_frontswap_store() to goto
> shrink if !zswap_can_accept(), so next time we try to compress a page
> to zswap we will invoke try again anyway -- perhaps under better
> circumstances.
>
> I think zswap_stored_pages is too much, keep in mind that it also
> includes same-filled pages which are not even stored in the zpool
> drivers. Maybe we should allow a fixed number of failures. If
> zpool_shrink() is successful, keep going until zswap_can_accept().
> Otherwise allow a fixed number of failures before giving up.
>

Yes, I think I got carried away by a writeback frenzy; while testing the LRU
refactor I had to fight to get a decent amount of writebacks, and ended up
being too afraid the shrink work would stop.
What about using MAX_RECLAIM_RETRIES? I like the idea of putting a limit, but
I wouldn't know how to pick a fixed number.

> Maybe we can improve the error codes propagated through the writeback
> code as well to improve the retry logic. For example, if
> zswap_writeback_entry() returns EEXIST then retrying should be
> harmless, but ENOMEM might be harmful. Both of which are propagated as
> EAGAIN from zs_zpool_shrink()/zbud_zpool_shrink().
>

That would be a nice improvement indeed; I'd queue it after the LRU refactor
though.

> > >
> > > >         zswap_pool_put(pool);
> > > >  }
> > > >
> > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > >         if (zswap_pool_reached_full) {
> > > >                if (!zswap_can_accept()) {
> > > >                         ret = -ENOMEM;
> > > > -                       goto reject;
> > > > +                       goto shrink;
> > > >                 } else
> > > >                         zswap_pool_reached_full = false;
> > > >         }
> > > > --
> > > > 2.34.1
> > > >
  
Vitaly Wool May 26, 2023, 10:18 a.m. UTC | #5
Hi Domenico,

On Wed, May 24, 2023 at 8:50 AM Domenico Cerasuolo
<cerasuolodomenico@gmail.com> wrote:
>
> This update addresses an issue with the zswap reclaim mechanism, which
> hinders the efficient offloading of cold pages to disk, thereby
> compromising the preservation of the LRU order and consequently
> diminishing, if not inverting, its performance benefits.
>
> The functioning of the zswap shrink worker was found to be inadequate,
> as shown by basic benchmark test. For the test, a kernel build was
> utilized as a reference, with its memory confined to 1G via a cgroup and
> a 5G swap file provided. The results are presented below, these are
> averages of three runs without the use of zswap:
>
> real 46m26s
> user 35m4s
> sys 7m37s
>
> With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> system), the results changed to:
>
> real 56m4s
> user 35m13s
> sys 8m43s
>
> written_back_pages: 18
> reject_reclaim_fail: 0
> pool_limit_hit:1478
>
> Besides the evident regression, one thing to notice from this data is
> the extremely low number of written_back_pages and pool_limit_hit.
>
> The pool_limit_hit counter, which is increased in zswap_frontswap_store
> when zswap is completely full, doesn't account for a particular
> scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> still above the acceptance threshold. Once we include the rejections due
> to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> 1478 to a significant 21578266.
>
> Zswap is stuck in an undesirable state where it rejects pages because
> it's above the acceptance threshold, yet fails to attempt memory
> reclaimation. This happens because the shrink work is only queued when
> zswap_frontswap_store detects that it's full and the work itself only
> reclaims one page per run.
>
> This state results in hot pages getting written directly to disk,
> while cold ones remain memory, waiting only to be invalidated. The LRU
> order is completely broken and zswap ends up being just an overhead
> without providing any benefits.
>
> This commit applies 2 changes: a) the shrink worker is set to reclaim
> pages until the acceptance threshold is met and b) the task is also
> enqueued when zswap is not full but still above the threshold.
>
> Testing this suggested update showed much better numbers:
>
> real 36m37s
> user 35m8s
> sys 9m32s
>
> written_back_pages: 10459423
> reject_reclaim_fail: 12896
> pool_limit_hit: 75653
>
> Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> ---
>  mm/zswap.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..2ee0775d8213 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
>  {
>         struct zswap_pool *pool = container_of(w, typeof(*pool),
>                                                 shrink_work);
> +       int ret;
>
> -       if (zpool_shrink(pool->zpool, 1, NULL))
> -               zswap_reject_reclaim_fail++;
> +       do {
> +               ret = zpool_shrink(pool->zpool, 1, NULL);
> +               if (ret)
> +                       zswap_reject_reclaim_fail++;
> +       } while (!zswap_can_accept() && ret != -EINVAL);
>         zswap_pool_put(pool);
>  }

while I do agree with your points, I have a concern about this
shrinker logic change. The reason for not doing it this way was a
possible degradation of real-time/responsiveness characteristics. Have
you checked that this is not the case?

Thanks,
Vitaly

> @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>         if (zswap_pool_reached_full) {
>                if (!zswap_can_accept()) {
>                         ret = -ENOMEM;
> -                       goto reject;
> +                       goto shrink;
>                 } else
>                         zswap_pool_reached_full = false;
>         }
> --
> 2.34.1
>
  
Johannes Weiner May 26, 2023, 1:56 p.m. UTC | #6
On Fri, May 26, 2023 at 12:18:21PM +0200, Vitaly Wool wrote:
> On Wed, May 24, 2023 at 8:50 AM Domenico Cerasuolo
> > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> >  {
> >         struct zswap_pool *pool = container_of(w, typeof(*pool),
> >                                                 shrink_work);
> > +       int ret;
> >
> > -       if (zpool_shrink(pool->zpool, 1, NULL))
> > -               zswap_reject_reclaim_fail++;
> > +       do {
> > +               ret = zpool_shrink(pool->zpool, 1, NULL);
> > +               if (ret)
> > +                       zswap_reject_reclaim_fail++;
> > +       } while (!zswap_can_accept() && ret != -EINVAL);
> >         zswap_pool_put(pool);
> >  }
> 
> while I do agree with your points, I have a concern about this
> shrinker logic change. The reason for not doing this as you do was
> possible real time/responsiveness characteristics degrade. Have you
> checked that it's not really the case?

Good point. Adding a cond_resched() to the loop would be a good idea.
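
Something along these lines (just a sketch on top of the posted loop,
not a tested change):

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret)
			zswap_reject_reclaim_fail++;
		/* give other work a chance to run between reclaim attempts */
		cond_resched();
	} while (!zswap_can_accept() && ret != -EINVAL);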
  
Yosry Ahmed May 26, 2023, 4:53 p.m. UTC | #7
On Fri, May 26, 2023 at 1:52 AM Domenico Cerasuolo
<cerasuolodomenico@gmail.com> wrote:
>
> On Thu, May 25, 2023 at 9:10 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, May 25, 2023 at 9:53 AM Domenico Cerasuolo
> > <cerasuolodomenico@gmail.com> wrote:
> > >
> > > On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > Hi Domenico,
> > > >
> > > > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo
> > > > <cerasuolodomenico@gmail.com> wrote:
> > > > >
> > > > > This update addresses an issue with the zswap reclaim mechanism, which
> > > > > hinders the efficient offloading of cold pages to disk, thereby
> > > > > compromising the preservation of the LRU order and consequently
> > > > > diminishing, if not inverting, its performance benefits.
> > > > >
> > > > > The functioning of the zswap shrink worker was found to be inadequate,
> > > > > as shown by basic benchmark test. For the test, a kernel build was
> > > > > utilized as a reference, with its memory confined to 1G via a cgroup and
> > > > > a 5G swap file provided. The results are presented below, these are
> > > > > averages of three runs without the use of zswap:
> > > > >
> > > > > real 46m26s
> > > > > user 35m4s
> > > > > sys 7m37s
> > > > >
> > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > > > system), the results changed to:
> > > > >
> > > > > real 56m4s
> > > > > user 35m13s
> > > > > sys 8m43s
> > > > >
> > > > > written_back_pages: 18
> > > > > reject_reclaim_fail: 0
> > > > > pool_limit_hit:1478
> > > > >
> > > > > Besides the evident regression, one thing to notice from this data is
> > > > > the extremely low number of written_back_pages and pool_limit_hit.
> > > > >
> > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > > > when zswap is completely full, doesn't account for a particular
> > > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > > > still above the acceptance threshold. Once we include the rejections due
> > > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > > > 1478 to a significant 21578266.
> > > > >
> > > > > Zswap is stuck in an undesirable state where it rejects pages because
> > > > > it's above the acceptance threshold, yet fails to attempt memory
> > > > > reclaimation. This happens because the shrink work is only queued when
> > > > > zswap_frontswap_store detects that it's full and the work itself only
> > > > > reclaims one page per run.
> > > > >
> > > > > This state results in hot pages getting written directly to disk,
> > > > > while cold ones remain memory, waiting only to be invalidated. The LRU
> > > > > order is completely broken and zswap ends up being just an overhead
> > > > > without providing any benefits.
> > > > >
> > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > > > > pages until the acceptance threshold is met and b) the task is also
> > > > > enqueued when zswap is not full but still above the threshold.
> > > > >
> > > > > Testing this suggested update showed much better numbers:
> > > > >
> > > > > real 36m37s
> > > > > user 35m8s
> > > > > sys 9m32s
> > > > >
> > > > > written_back_pages: 10459423
> > > > > reject_reclaim_fail: 12896
> > > > > pool_limit_hit: 75653
> > > >
> > > > Impressive numbers, and great find in general!
> > > >
> > > > I wonder how other workloads benefit/regress from this change.
> > > > Anything else that you happened to run? :)
> > > >
> > > Hi Yosry,
> > >
> > > thanks for the quick feedback!
> > >
> > > Besides stressers, I haven't tested any other actual workload, we don't have
> > > writeback in production yet so I can't provide any data from there. I was
> > > wondering what kind of actual workload I could use to test this on my desktop,
> > > but I couldn't think of anything relevant, I'm open to suggestions though :)
> >
> > Nothing in mind in particular as well. Perhaps others have ideas.
> >
> > > > >
> > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > > > > ---
> > > > >  mm/zswap.c | 10 +++++++---
> > > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index 59da2a415fbb..2ee0775d8213 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > > > >  {
> > > > >         struct zswap_pool *pool = container_of(w, typeof(*pool),
> > > > >                                                 shrink_work);
> > > > > +       int ret;
> > > > >
> > > > > -       if (zpool_shrink(pool->zpool, 1, NULL))
> > > > > -               zswap_reject_reclaim_fail++;
> > > > > +       do {
> > > > > +               ret = zpool_shrink(pool->zpool, 1, NULL);
> > > > > +               if (ret)
> > > > > +                       zswap_reject_reclaim_fail++;
> > > > > +       } while (!zswap_can_accept() && ret != -EINVAL);
> > > >
> > > > One question/comment here about the retry logic.
> > > >
> > > > So I looked through the awfully convoluted writeback code, and there
> > > > are multiple layers, and some of them tend to overwrite the return
> > > > value of the layer underneath :(
> > > >
> > > > For zsmalloc (as an example):
> > > > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry().
> > > >
> > > > First of all, that zpool_ops is an unnecessarily confusing
> > > > indirection, but anyway.
> > > >
> > > > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error
> > > > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru
> > > > is empty, and -EAGAIN on other errors.
> > > > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the
> > > > return value of  zs_reclaim_page()/zbud_reclaim_page().
> > > > - zpool_shrink() will return -EINVAL if the driver does not support
> > > > shrinking, otherwise it will propagate the return value from the
> > > > driver.
> > > >
> > > > So it looks like we will get -EINVAL only if the driver lru is empty
> > > > or the driver does not support writeback, so rightfully we should not
> > > > retry.
> > > >
> > > > If zswap_writeback_entry() returns -EEXIST, it probably means that we
> > > > raced with another task decompressing the page, so rightfully we
> > > > should retry to writeback another page instead.
> > > >
> > > > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily
> > > > mean that we couldn't allocate memory (unfortunately), it looks like
> > > > we will return -ENOMEM in other cases as well. Arguably, we can retry
> > > > in all cases, because even if we were actually out of memory, we are
> > > > trying to make an allocation that will eventually free more memory.
> > > >
> > > > In all cases, I think it would be nicer if we retry if ret == -EAGAIN
> > > > instead. It is semantically more sane. In this specific case it is
> > > > functionally NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly
> > > > return -EAGAIN anyway, but in case someone tries to use saner errnos
> > > > in the future, this will age better.
> > > Retrying if ret == -EAGAIN seems much nicer indeed, will change it.
> > > >
> > > > Also, do we intentionally want to keep retrying forever on failure? Do
> > > > we want to add a max number of retries?
> > > If the drivers guaranteed that zpool_shrink will remove at least an entry
> > > from their LRU, the retry wouldn't be needed because the LRU will
> > > eventually be emptied. But this is an assumption on the implementation of
> >
> > I don't think any zpool driver can guarantee to writeback at least one
> > page. It can fail for reasons beyond their control (e.g. cannot
> > allocate a page to decompress to).
> >
> > > the zpool, so yes, we could use a max retries. I think that being a sanity
> > > check, it should overshoot the required number of iterations in order to
> > > avoid a premature break, what about retrying a max of
> > > zswap_stored_pages times?
> >
> > Why is it just a sanity check? Consider a case where the system is
> > under extreme memory pressure that the drivers cannot allocate pages
> > to decompress to. The drivers would be continuously competing with all
> > other allocations on the machine. I think we should give up at some
> > point. In any case, you changed the zswap_frontswap_store() to goto
> > shrink if !zswap_can_accept(), so next time we try to compress a page
> > to zswap we will invoke try again anyway -- perhaps under better
> > circumstances.
> >
> > I think zswap_stored_pages is too much, keep in mind that it also
> > includes same-filled pages which are not even stored in the zpool
> > drivers. Maybe we should allow a fixed number of failures. If
> > zpool_shrink() is successful, keep going until zswap_can_accept().
> > Otherwise allow a fixed number of failures before giving up.
> >
>
> Yes, I think I got carried away by a writeback frenzy, while testing the LRU
> refactor I had to fight to get a decent amount of writebacks, and ended up
> being too afraid the shrink work would stop.
> What about using MAX_RECLAIM_RETRIES? I like the idea of putting a limit but
> I wouldn't know how to pick a fixed number.

In a sense, shrinking zswap is a form of reclaim, and is triggered by
reclaim, so MAX_RECLAIM_RETRIES sounds reasonable to me. We can always
revisit if needed.

>
> > Maybe we can improve the error codes propagated through the writeback
> > code as well to improve the retry logic. For example, if
> > zswap_writeback_entry() returns EEXIST then retrying should be
> > harmless, but ENOMEM might be harmful. Both of which are propagated as
> > EAGAIN from zs_zpool_shrink()/zbud_zpool_shrink().
> >
>
> That would be a nice improvement indeed, I'd queue it after the LRU refactor
> though.

Great!

>
> > > >
> > > > >         zswap_pool_put(pool);
> > > > >  }
> > > > >
> > > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > >         if (zswap_pool_reached_full) {
> > > > >                if (!zswap_can_accept()) {
> > > > >                         ret = -ENOMEM;
> > > > > -                       goto reject;
> > > > > +                       goto shrink;
> > > > >                 } else
> > > > >                         zswap_pool_reached_full = false;
> > > > >         }
> > > > > --
> > > > > 2.34.1
> > > > >
  
Chris Li May 26, 2023, 11:05 p.m. UTC | #8
On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote:
> This update addresses an issue with the zswap reclaim mechanism, which
> hinders the efficient offloading of cold pages to disk, thereby
> compromising the preservation of the LRU order and consequently
> diminishing, if not inverting, its performance benefits.
> 
> The functioning of the zswap shrink worker was found to be inadequate,
> as shown by basic benchmark test. For the test, a kernel build was
> utilized as a reference, with its memory confined to 1G via a cgroup and
> a 5G swap file provided. The results are presented below, these are
> averages of three runs without the use of zswap:
> 
> real 46m26s
> user 35m4s
> sys 7m37s
> 
> With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> system), the results changed to:
> 
> real 56m4s
> user 35m13s
> sys 8m43s
> 
> written_back_pages: 18
> reject_reclaim_fail: 0
> pool_limit_hit:1478
> 
> Besides the evident regression, one thing to notice from this data is
> the extremely low number of written_back_pages and pool_limit_hit.
> 
> The pool_limit_hit counter, which is increased in zswap_frontswap_store
> when zswap is completely full, doesn't account for a particular
> scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> still above the acceptance threshold. Once we include the rejections due
> to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> 1478 to a significant 21578266.
> 
> Zswap is stuck in an undesirable state where it rejects pages because
> it's above the acceptance threshold, yet fails to attempt memory
> reclaimation. This happens because the shrink work is only queued when
> zswap_frontswap_store detects that it's full and the work itself only
> reclaims one page per run.
> 
> This state results in hot pages getting written directly to disk,
> while cold ones remain memory, waiting only to be invalidated. The LRU
> order is completely broken and zswap ends up being just an overhead
> without providing any benefits.
> 
> This commit applies 2 changes: a) the shrink worker is set to reclaim
> pages until the acceptance threshold is met and b) the task is also
> enqueued when zswap is not full but still above the threshold.
> 
> Testing this suggested update showed much better numbers:
> 
> real 36m37s
> user 35m8s
> sys 9m32s
> 
> written_back_pages: 10459423
> reject_reclaim_fail: 12896
> pool_limit_hit: 75653
> 
> Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> ---
>  mm/zswap.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..2ee0775d8213 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
>  {
>  	struct zswap_pool *pool = container_of(w, typeof(*pool),
>  						shrink_work);
> +	int ret;
Very minor nitpick: you could move the declaration inside the do
block where it gets used.

>  
> -	if (zpool_shrink(pool->zpool, 1, NULL))
> -		zswap_reject_reclaim_fail++;
> +	do {
> +		ret = zpool_shrink(pool->zpool, 1, NULL);
> +		if (ret)
> +			zswap_reject_reclaim_fail++;
> +	} while (!zswap_can_accept() && ret != -EINVAL);

As others point out, this while loop can be problematic.

Have you found out what the common reason for the reclaim failures was?
Inside the shrink function there is a while loop that would be the place
to apply try-harder conditions. For example, if all the pages in the LRU
have already been tried once, there is no reason to keep on calling the
shrink function. The outer loop doesn't have this kind of visibility.


Chris

>  	zswap_pool_put(pool);
>  }
>  
> @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>  	if (zswap_pool_reached_full) {
>  	       if (!zswap_can_accept()) {
>  			ret = -ENOMEM;
> -			goto reject;
> +			goto shrink;
>  		} else
>  			zswap_pool_reached_full = false;
>  	}
> -- 
> 2.34.1
>
  
Domenico Cerasuolo May 28, 2023, 8:53 a.m. UTC | #9
On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote:
> > This update addresses an issue with the zswap reclaim mechanism, which
> > hinders the efficient offloading of cold pages to disk, thereby
> > compromising the preservation of the LRU order and consequently
> > diminishing, if not inverting, its performance benefits.
> >
> > The functioning of the zswap shrink worker was found to be inadequate,
> > as shown by basic benchmark test. For the test, a kernel build was
> > utilized as a reference, with its memory confined to 1G via a cgroup and
> > a 5G swap file provided. The results are presented below, these are
> > averages of three runs without the use of zswap:
> >
> > real 46m26s
> > user 35m4s
> > sys 7m37s
> >
> > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > system), the results changed to:
> >
> > real 56m4s
> > user 35m13s
> > sys 8m43s
> >
> > written_back_pages: 18
> > reject_reclaim_fail: 0
> > pool_limit_hit:1478
> >
> > Besides the evident regression, one thing to notice from this data is
> > the extremely low number of written_back_pages and pool_limit_hit.
> >
> > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > when zswap is completely full, doesn't account for a particular
> > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > still above the acceptance threshold. Once we include the rejections due
> > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > 1478 to a significant 21578266.
> >
> > Zswap is stuck in an undesirable state where it rejects pages because
> > it's above the acceptance threshold, yet fails to attempt memory
> > reclaimation. This happens because the shrink work is only queued when
> > zswap_frontswap_store detects that it's full and the work itself only
> > reclaims one page per run.
> >
> > This state results in hot pages getting written directly to disk,
> > while cold ones remain memory, waiting only to be invalidated. The LRU
> > order is completely broken and zswap ends up being just an overhead
> > without providing any benefits.
> >
> > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > pages until the acceptance threshold is met and b) the task is also
> > enqueued when zswap is not full but still above the threshold.
> >
> > Testing this suggested update showed much better numbers:
> >
> > real 36m37s
> > user 35m8s
> > sys 9m32s
> >
> > written_back_pages: 10459423
> > reject_reclaim_fail: 12896
> > pool_limit_hit: 75653
> >
> > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > ---
> >  mm/zswap.c | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 59da2a415fbb..2ee0775d8213 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> >  {
> >       struct zswap_pool *pool = container_of(w, typeof(*pool),
> >                                               shrink_work);
> > +     int ret;
> Very minor nit pick, you can move the declare inside the do
> statement where it get used.
>
> >
> > -     if (zpool_shrink(pool->zpool, 1, NULL))
> > -             zswap_reject_reclaim_fail++;
> > +     do {
> > +             ret = zpool_shrink(pool->zpool, 1, NULL);
> > +             if (ret)
> > +                     zswap_reject_reclaim_fail++;
> > +     } while (!zswap_can_accept() && ret != -EINVAL);
>
> As others point out, this while loop can be problematic.

Do you have a specific concern that hasn't already been addressed
following the other reviewers' suggestions?

>
> Have you find out what was the common reason causing the
> reclaim fail? Inside the shrink function there is a while
> loop that would be the place to perform try harder conditions.
> For example, if all the page in the LRU are already try once
> there's no reason to keep on calling the shrink function.
> The outer loop actually doesn't have this kind of visibilities.
>

The most common cause I saw during testing was concurrent operations
on the swap entry: if an entry is being loaded or invalidated at the
same time as the zswap writeback, then an error is returned. This
scenario doesn't seem harmful at all, because the failure doesn't
indicate that memory cannot be allocated, just that that particular
page should not be written back.
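
For reference, this is roughly where those races surface in
zswap_writeback_entry() (abridged and lightly paraphrased, not an
exact quote of the code):

	/* try to allocate swap cache page */
	switch (zswap_get_swap_cache_page(swpentry, &page)) {
	case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
		ret = -ENOMEM;
		goto fail;
	case ZSWAP_SWAPCACHE_EXIST:
		/* page is already in the swap cache, ignore for now */
		put_page(page);
		ret = -EEXIST;
		goto fail;
	case ZSWAP_SWAPCACHE_NEW:
		/* page is locked, proceed with decompression and writeback */
		break;
	}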

As far as I understood the voiced concerns, the problem could arise if
the writeback fails due to an inability to allocate memory; that could
indicate that the system is under extremely high memory pressure, and
this loop could aggravate the situation by adding more contention for
the already scarce available memory.

Since both of these cases are treated equally by the retry limit,
we're adopting a conservative approach in treating non-harmful errors
as if they were harmful. This could certainly be improved, but I don't
see it as an issue, because differentiating the errors would actually
make the loop run longer than it does without the differentiation.

As I was writing to Yosry, the differentiation would be a great
improvement here; I have a patch set in the queue that moves the inner
reclaim loop from the zpool driver up to zswap. With that, updating the
error handling would be more convenient, as it would be done in one
place instead of three.

>
> Chris
>
> >       zswap_pool_put(pool);
> >  }
> >
> > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> >       if (zswap_pool_reached_full) {
> >              if (!zswap_can_accept()) {
> >                       ret = -ENOMEM;
> > -                     goto reject;
> > +                     goto shrink;
> >               } else
> >                       zswap_pool_reached_full = false;
> >       }
> > --
> > 2.34.1
> >
  
Chris Li May 29, 2023, 9 p.m. UTC | #10
On Sun, May 28, 2023 at 10:53:49AM +0200, Domenico Cerasuolo wrote:
> On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote:
> > > This update addresses an issue with the zswap reclaim mechanism, which
> > > hinders the efficient offloading of cold pages to disk, thereby
> > > compromising the preservation of the LRU order and consequently
> > > diminishing, if not inverting, its performance benefits.
> > >
> > > The functioning of the zswap shrink worker was found to be inadequate,
> > > as shown by basic benchmark test. For the test, a kernel build was
> > > utilized as a reference, with its memory confined to 1G via a cgroup and
> > > a 5G swap file provided. The results are presented below, these are
> > > averages of three runs without the use of zswap:
> > >
> > > real 46m26s
> > > user 35m4s
> > > sys 7m37s
> > >
> > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > system), the results changed to:
> > >
> > > real 56m4s
> > > user 35m13s
> > > sys 8m43s
> > >
> > > written_back_pages: 18
> > > reject_reclaim_fail: 0
> > > pool_limit_hit:1478
> > >
> > > Besides the evident regression, one thing to notice from this data is
> > > the extremely low number of written_back_pages and pool_limit_hit.
> > >
> > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > when zswap is completely full, doesn't account for a particular
> > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > still above the acceptance threshold. Once we include the rejections due
> > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > 1478 to a significant 21578266.
> > >
> > > Zswap is stuck in an undesirable state where it rejects pages because
> > > it's above the acceptance threshold, yet fails to attempt memory
> > > reclaimation. This happens because the shrink work is only queued when
> > > zswap_frontswap_store detects that it's full and the work itself only
> > > reclaims one page per run.
> > >
> > > This state results in hot pages getting written directly to disk,
> > > while cold ones remain memory, waiting only to be invalidated. The LRU
> > > order is completely broken and zswap ends up being just an overhead
> > > without providing any benefits.
> > >
> > > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > > pages until the acceptance threshold is met and b) the task is also
> > > enqueued when zswap is not full but still above the threshold.
> > >
> > > Testing this suggested update showed much better numbers:
> > >
> > > real 36m37s
> > > user 35m8s
> > > sys 9m32s
> > >
> > > written_back_pages: 10459423
> > > reject_reclaim_fail: 12896
> > > pool_limit_hit: 75653
> > >
> > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > > ---
> > >  mm/zswap.c | 10 +++++++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 59da2a415fbb..2ee0775d8213 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > >  {
> > >       struct zswap_pool *pool = container_of(w, typeof(*pool),
> > >                                               shrink_work);
> > > +     int ret;
> > Very minor nit pick, you can move the declare inside the do
> > statement where it get used.
> >
> > >
> > > -     if (zpool_shrink(pool->zpool, 1, NULL))
> > > -             zswap_reject_reclaim_fail++;
> > > +     do {
> > > +             ret = zpool_shrink(pool->zpool, 1, NULL);
> > > +             if (ret)
> > > +                     zswap_reject_reclaim_fail++;
> > > +     } while (!zswap_can_accept() && ret != -EINVAL);
> >
> > As others point out, this while loop can be problematic.
> 
> Do you have some specific concern that's not been already addressed following
> other reviewers' suggestions?

Mostly that, as it is, the outer loop seems like the wrong place to do
the retry, because it lacks the error context and a way to properly
continue the iteration.


> 
> >
> > Have you find out what was the common reason causing the
> > reclaim fail? Inside the shrink function there is a while
> > loop that would be the place to perform try harder conditions.
> > For example, if all the page in the LRU are already try once
> > there's no reason to keep on calling the shrink function.
> > The outer loop actually doesn't have this kind of visibilities.
> >
> 
> The most common cause I saw during testing was concurrent operations
> on the swap entry, if an entry is being loaded/invalidated at the same time
> as the zswap writeback, then errors will be returned. This scenario doesn't
> seem harmful at all because the failure doesn't indicate that memory
> cannot be allocated, just that that particular page should not be written back.
> 
> As far as I understood the voiced concerns, the problem could arise if the
> writeback fails due to an impossibility to allocate memory, that could indicate
> that the system is in extremely high memory pressure and this loop could
> aggravate the situation by adding more contention on the already scarce
> available memory.

It seems to me the inner loop should continue if the page writeback
fails due to a load/invalidate race, and abort if it fails due to
memory pressure.

The key difference is that one is transient while the other might be
more permanent.

You definitely don't want to retry if it is already ENOMEM. 
> 
> Since both these cases are treated equally with the retries limit, we're
> adopting a conservative approach in considering non-harmful errors as if
> they were harmful. This could certainly be improved, but I don't see it as an
> issue because a differentiation of the errors would actually make the loop run
> longer than it would without the differentiation.

We don't want to run longer if the error is persistent.

> 
> As I was writing to Yosry, the differentiation would be a great improvement
> here, I just have a patch set in the queue that moves the inner reclaim loop
> from the zpool driver up to zswap. With that, updating the error handling
> would be more convenient as it would be done in one place instead of three.

This has tricky complications as well. The current shrink interface
doesn't support continuing from the previous error position. If you
want to avoid a repeat attempt on a page that had a writeback error,
you kind of need a way to skip that page.

Chris
 
> 
> >
> > Chris
> >
> > >       zswap_pool_put(pool);
> > >  }
> > >
> > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > >       if (zswap_pool_reached_full) {
> > >              if (!zswap_can_accept()) {
> > >                       ret = -ENOMEM;
> > > -                     goto reject;
> > > +                     goto shrink;
> > >               } else
> > >                       zswap_pool_reached_full = false;
> > >       }
> > > --
> > > 2.34.1
> > >
  
Johannes Weiner May 30, 2023, 4:13 a.m. UTC | #11
On Mon, May 29, 2023 at 02:00:46PM -0700, Chris Li wrote:
> On Sun, May 28, 2023 at 10:53:49AM +0200, Domenico Cerasuolo wrote:
> > On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote:
> > > > This update addresses an issue with the zswap reclaim mechanism, which
> > > > hinders the efficient offloading of cold pages to disk, thereby
> > > > compromising the preservation of the LRU order and consequently
> > > > diminishing, if not inverting, its performance benefits.
> > > >
> > > > The functioning of the zswap shrink worker was found to be inadequate,
> > > > as shown by basic benchmark test. For the test, a kernel build was
> > > > utilized as a reference, with its memory confined to 1G via a cgroup and
> > > > a 5G swap file provided. The results are presented below, these are
> > > > averages of three runs without the use of zswap:
> > > >
> > > > real 46m26s
> > > > user 35m4s
> > > > sys 7m37s
> > > >
> > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > > system), the results changed to:
> > > >
> > > > real 56m4s
> > > > user 35m13s
> > > > sys 8m43s
> > > >
> > > > written_back_pages: 18
> > > > reject_reclaim_fail: 0
> > > > pool_limit_hit:1478
> > > >
> > > > Besides the evident regression, one thing to notice from this data is
> > > > the extremely low number of written_back_pages and pool_limit_hit.
> > > >
> > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > > when zswap is completely full, doesn't account for a particular
> > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > > still above the acceptance threshold. Once we include the rejections due
> > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > > 1478 to a significant 21578266.
> > > >
> > > > Zswap is stuck in an undesirable state where it rejects pages because
> > > > it's above the acceptance threshold, yet fails to attempt memory
> > > > reclaimation. This happens because the shrink work is only queued when
> > > > zswap_frontswap_store detects that it's full and the work itself only
> > > > reclaims one page per run.
> > > >
> > > > This state results in hot pages getting written directly to disk,
> > > > while cold ones remain memory, waiting only to be invalidated. The LRU
> > > > order is completely broken and zswap ends up being just an overhead
> > > > without providing any benefits.
> > > >
> > > > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > > > pages until the acceptance threshold is met and b) the task is also
> > > > enqueued when zswap is not full but still above the threshold.
> > > >
> > > > Testing this suggested update showed much better numbers:
> > > >
> > > > real 36m37s
> > > > user 35m8s
> > > > sys 9m32s
> > > >
> > > > written_back_pages: 10459423
> > > > reject_reclaim_fail: 12896
> > > > pool_limit_hit: 75653
> > > >
> > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > > > ---
> > > >  mm/zswap.c | 10 +++++++---
> > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 59da2a415fbb..2ee0775d8213 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > > >  {
> > > >       struct zswap_pool *pool = container_of(w, typeof(*pool),
> > > >                                               shrink_work);
> > > > +     int ret;
> > > Very minor nit pick, you can move the declare inside the do
> > > statement where it get used.
> > >
> > > >
> > > > -     if (zpool_shrink(pool->zpool, 1, NULL))
> > > > -             zswap_reject_reclaim_fail++;
> > > > +     do {
> > > > +             ret = zpool_shrink(pool->zpool, 1, NULL);
> > > > +             if (ret)
> > > > +                     zswap_reject_reclaim_fail++;
> > > > +     } while (!zswap_can_accept() && ret != -EINVAL);
> > >
> > > As others point out, this while loop can be problematic.
> > 
> > Do you have some specific concern that's not been already addressed following
> > other reviewers' suggestions?
> 
> Mostly as it is, the outer loop seems the wrong
> place to do the retry because of lacking the error
> context and properly continuing the iteration.
> 
> 
> > 
> > >
> > > Have you find out what was the common reason causing the
> > > reclaim fail? Inside the shrink function there is a while
> > > loop that would be the place to perform try harder conditions.
> > > For example, if all the page in the LRU are already try once
> > > there's no reason to keep on calling the shrink function.
> > > The outer loop actually doesn't have this kind of visibilities.
> > >
> > 
> > The most common cause I saw during testing was concurrent operations
> > on the swap entry, if an entry is being loaded/invalidated at the same time
> > as the zswap writeback, then errors will be returned. This scenario doesn't
> > seem harmful at all because the failure doesn't indicate that memory
> > cannot be allocated, just that that particular page should not be written back.
> > 
> > As far as I understood the voiced concerns, the problem could arise if the
> > writeback fails due to an impossibility to allocate memory, that could indicate
> > that the system is in extremely high memory pressure and this loop could
> > aggravate the situation by adding more contention on the already scarce
> > available memory.
> 
> It seems to me the inner loop should continue if page writeback
> fails due to loaded/invalidate, abort due to memory pressure.
> 
> The key difference is that one is transient while others might
> be more permanent 
> 
> You definitely don't want to retry if it is already ENOMEM. 

This is actually a separate problem that Domenico was preparing a
patch for. Writeback is a reclaim operation, so it should have access
to the memory reserves (PF_MEMALLOC). Persistent -ENOMEM shouldn't
exist. It actually used to have reserve access when writeback happened
directly from inside the store (reclaim -> swapout path). When
writeback was moved out to the async worker, it lost reserve access as
an unintended side effect. This should be fixed, so we don't OOM while
we try to free memory.
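
For illustration, restoring reserve access could look something like
this sketch (using the existing memalloc_noreclaim_save/restore helpers
to set PF_MEMALLOC around the shrink; this is not the actual patch
Domenico has queued):

	static void shrink_worker(struct work_struct *w)
	{
		struct zswap_pool *pool = container_of(w, typeof(*pool),
							shrink_work);
		unsigned int noreclaim_flag;
		int ret;

		/* writeback is reclaim: allow dipping into reserves */
		noreclaim_flag = memalloc_noreclaim_save();
		do {
			ret = zpool_shrink(pool->zpool, 1, NULL);
			if (ret)
				zswap_reject_reclaim_fail++;
		} while (!zswap_can_accept() && ret != -EINVAL);
		memalloc_noreclaim_restore(noreclaim_flag);
		zswap_pool_put(pool);
	}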

Should it be fixed before merging this patch? I don't think the
ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't
really persistent either. Retrying a few times in that case certainly
doesn't seem to make things worse.

> > As I was writing to Yosry, the differentiation would be a great improvement
> > here, I just have a patch set in the queue that moves the inner reclaim loop
> > from the zpool driver up to zswap. With that, updating the error handling
> > would be more convenient as it would be done in one place instead of three.
> 
> This has tricky complications as well. The current shrink interface
> doesn't support continuing from the previous error position. If you want
> to avoid a repeat attempt if the page has a writeback error, you kinda
> of need a way to skip that page.

A page that fails to reclaim is put back to the tail of the LRU, so
for all intents and purposes it will be skipped. In the rare and
extreme case where it's the only page left on the list, I again don't
see how retrying a few times will make the situation worse.

In practice, IMO there is little upside in trying to be more
discerning about the error codes. Simple seems better here.
  
Chris Li May 30, 2023, 2:51 p.m. UTC | #12
On Tue, May 30, 2023 at 12:13:41AM -0400, Johannes Weiner wrote:
> On Mon, May 29, 2023 at 02:00:46PM -0700, Chris Li wrote:
> > On Sun, May 28, 2023 at 10:53:49AM +0200, Domenico Cerasuolo wrote:
> > > On Sat, May 27, 2023 at 1:05 AM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > On Wed, May 24, 2023 at 08:50:51AM +0200, Domenico Cerasuolo wrote:
> > > > > This update addresses an issue with the zswap reclaim mechanism, which
> > > > > hinders the efficient offloading of cold pages to disk, thereby
> > > > > compromising the preservation of the LRU order and consequently
> > > > > diminishing, if not inverting, its performance benefits.
> > > > >
> > > > > The functioning of the zswap shrink worker was found to be inadequate,
> > > > > as shown by basic benchmark test. For the test, a kernel build was
> > > > > utilized as a reference, with its memory confined to 1G via a cgroup and
> > > > > a 5G swap file provided. The results are presented below, these are
> > > > > averages of three runs without the use of zswap:
> > > > >
> > > > > real 46m26s
> > > > > user 35m4s
> > > > > sys 7m37s
> > > > >
> > > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > > > system), the results changed to:
> > > > >
> > > > > real 56m4s
> > > > > user 35m13s
> > > > > sys 8m43s
> > > > >
> > > > > written_back_pages: 18
> > > > > reject_reclaim_fail: 0
> > > > > pool_limit_hit:1478
> > > > >
> > > > > Besides the evident regression, one thing to notice from this data is
> > > > > the extremely low number of written_back_pages and pool_limit_hit.
> > > > >
> > > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > > > when zswap is completely full, doesn't account for a particular
> > > > > scenario: once zswap hits his limit, zswap_pool_reached_full is set to
> > > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > > > still above the acceptance threshold. Once we include the rejections due
> > > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > > > 1478 to a significant 21578266.
> > > > >
> > > > > Zswap is stuck in an undesirable state where it rejects pages because
> > > > > it's above the acceptance threshold, yet fails to attempt memory
> > > > > reclaimation. This happens because the shrink work is only queued when
> > > > > zswap_frontswap_store detects that it's full and the work itself only
> > > > > reclaims one page per run.
> > > > >
> > > > > This state results in hot pages getting written directly to disk,
> > > > > while cold ones remain memory, waiting only to be invalidated. The LRU
> > > > > order is completely broken and zswap ends up being just an overhead
> > > > > without providing any benefits.
> > > > >
> > > > > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > > > > pages until the acceptance threshold is met and b) the task is also
> > > > > enqueued when zswap is not full but still above the threshold.
> > > > >
> > > > > Testing this suggested update showed much better numbers:
> > > > >
> > > > > real 36m37s
> > > > > user 35m8s
> > > > > sys 9m32s
> > > > >
> > > > > written_back_pages: 10459423
> > > > > reject_reclaim_fail: 12896
> > > > > pool_limit_hit: 75653
> > > > >
> > > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> > > > > ---
> > > > >  mm/zswap.c | 10 +++++++---
> > > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index 59da2a415fbb..2ee0775d8213 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > > > >  {
> > > > >       struct zswap_pool *pool = container_of(w, typeof(*pool),
> > > > >                                               shrink_work);
> > > > > +     int ret;
> > > > Very minor nit pick, you can move the declare inside the do
> > > > statement where it get used.
> > > >
> > > > >
> > > > > -     if (zpool_shrink(pool->zpool, 1, NULL))
> > > > > -             zswap_reject_reclaim_fail++;
> > > > > +     do {
> > > > > +             ret = zpool_shrink(pool->zpool, 1, NULL);
> > > > > +             if (ret)
> > > > > +                     zswap_reject_reclaim_fail++;
> > > > > +     } while (!zswap_can_accept() && ret != -EINVAL);
> > > >
> > > > As others point out, this while loop can be problematic.
> > > 
> > > Do you have some specific concern that's not been already addressed following
> > > other reviewers' suggestions?
> > 
> > Mostly as it is, the outer loop seems the wrong
> > place to do the retry because of lacking the error
> > context and properly continuing the iteration.
> > 
> > 
> > > 
> > > >
> > > > Have you find out what was the common reason causing the
> > > > reclaim fail? Inside the shrink function there is a while
> > > > loop that would be the place to perform try harder conditions.
> > > > For example, if all the page in the LRU are already try once
> > > > there's no reason to keep on calling the shrink function.
> > > > The outer loop actually doesn't have this kind of visibilities.
> > > >
> > > 
> > > The most common cause I saw during testing was concurrent operations
> > > on the swap entry, if an entry is being loaded/invalidated at the same time
> > > as the zswap writeback, then errors will be returned. This scenario doesn't
> > > seem harmful at all because the failure doesn't indicate that memory
> > > cannot be allocated, just that that particular page should not be written back.
> > > 
> > > As far as I understood the voiced concerns, the problem could arise if the
> > > writeback fails due to an impossibility to allocate memory, that could indicate
> > > that the system is in extremely high memory pressure and this loop could
> > > aggravate the situation by adding more contention on the already scarce
> > > available memory.
> > 
> > It seems to me the inner loop should continue if page writeback
> > fails due to loaded/invalidate, abort due to memory pressure.
> > 
> > The key difference is that one is transient while others might
> > be more permanent 
> > 
> > You definitely don't want to retry if it is already ENOMEM. 
> 
> This is actually a separate problem that Domenico was preparing a
> patch for. Writeback is a reclaim operation, so it should have access
> to the memory reserves (PF_MEMALLOC). Persistent -ENOMEM shouldn't
> exist. It actually used have reserve access when writeback happened
> directly from inside the store (reclaim -> swapout path). When
> writeback was moved out to the async worker, it lost reserve access as
> an unintended side effect. This should be fixed, so we don't OOM while
> we try to free memory.
Thanks for pointing out -ENOMEM shouldn't be persistent.
Points taken.

The original point of not retrying on a persistent error
still holds.

> 
> Should it be fixed before merging this patch? I don't think the
> ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't
> really persistent either. Retrying a few times in that case certainly
> doesn't seem to make things worse.

If you already know the error is persistent, retrying is wasting
CPU. It can potentially hold locks during the retry, which can
slow someone else down.

> > > As I was writing to Yosry, the differentiation would be a great improvement
> > > here, I just have a patch set in the queue that moves the inner reclaim loop
> > > from the zpool driver up to zswap. With that, updating the error handling
> > > would be more convenient as it would be done in one place instead of three.
> > 
> > This has tricky complications as well. The current shrink interface
> > doesn't support continuing from the previous error position. If you want
> > to avoid a repeat attempt if the page has a writeback error, you kinda
> > of need a way to skip that page.
> 
> A page that fails to reclaim is put back to the tail of the LRU, so
> for all intents and purposes it will be skipped. In the rare and

Do you mean the page is treated as hot again?

Wouldn't that be undesirable from the app's point of view?
> extreme case where it's the only page left on the list, I again don't
> see how retrying a few times will make the situation worse.
> 
> In practice, IMO there is little upside in trying to be more
> discerning about the error codes. Simple seems better here.

Just trying to think about what should be the precise loop termination
condition here.

I still feel blindly trying a few times is a very imprecise condition.

Chris
  
Johannes Weiner May 30, 2023, 3:55 p.m. UTC | #13
On Tue, May 30, 2023 at 07:51:23AM -0700, Chris Li wrote:
> Thanks for pointing out -ENOMEM shouldn't be persistent.
> Points taken.
> 
> The original point of not retrying the persistent error
> still holds.

Okay, but what persistent errors are you referring to?

Aside from -ENOMEM, writeback_entry will fail on concurrent swap
invalidation or a racing swapin fault. In both cases we should
absolutely keep trying other entries until the goal is met.

> > Should it be fixed before merging this patch? I don't think the
> > ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't
> > really persistent either. Retrying a few times in that case certainly
> > doesn't seem to make things worse.
> 
> If you already know the error is persistent, retrying is wasting
> CPU. It can pertancial hold locks during the retry, which can
> slow someone else down.

That's a bit of a truism. How does this pertain to the zswap reclaim
situation?

> > > > As I was writing to Yosry, the differentiation would be a great improvement
> > > > here, I just have a patch set in the queue that moves the inner reclaim loop
> > > > from the zpool driver up to zswap. With that, updating the error handling
> > > > would be more convenient as it would be done in one place instead of three.
> > > 
> > > This has tricky complications as well. The current shrink interface
> > > doesn't support continuing from the previous error position. If you want
> > > to avoid a repeat attempt if the page has a writeback error, you kinda
> > > of need a way to skip that page.
> > 
> > A page that fails to reclaim is put back to the tail of the LRU, so
> > for all intents and purposes it will be skipped. In the rare and
> 
> Do you mean the page is treated as hot again?
> 
> Wouldn't that be undesirable from the app's point of view?

That's current backend LRU behavior. Is it optimal? That's certainly
debatable. But it's tangential to this patch. The point is that
capping retries to a fixed number of failures works correctly as a
safety precaution and introduces no (new) undesirable behavior.

It's entirely moot once we refactor the backend page LRU to the zswap
entry LRU. The only time we'll fail to reclaim an entry is if we race
with something already freeing it, so it doesn't really matter where
we put it.

> > extreme case where it's the only page left on the list, I again don't
> > see how retrying a few times will make the situation worse.
> > 
> > In practice, IMO there is little upside in trying to be more
> > discerning about the error codes. Simple seems better here.
> 
> Just trying to think about what should be the precise loop termination
> condition here.
> 
> I still feel blindly trying a few times is a very imprecise condition.

The precise termination condition is when can_accept() returns true
again. The safety cap is only added as a precaution to avoid infinite
loops if something goes wrong or unexpected, now or in the future.
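
Concretely, the cap I have in mind is just a bail-out counter on top
of the loop from this patch, something like this sketch
(MAX_RECLAIM_RETRIES being a placeholder constant, not the final
patch):

	int ret, failures = 0;

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret) {
			zswap_reject_reclaim_fail++;
			if (++failures == MAX_RECLAIM_RETRIES)
				break;
		}
		cond_resched();
	} while (!zswap_can_accept());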
  
Chris Li May 30, 2023, 6:18 p.m. UTC | #14
On Tue, May 30, 2023 at 11:55:19AM -0400, Johannes Weiner wrote:
> On Tue, May 30, 2023 at 07:51:23AM -0700, Chris Li wrote:
> > Thanks for pointing out -ENOMEM shouldn't be persistent.
> > Points taken.
> > 
> > The original point of not retrying the persistent error
> > still holds.
> 
> Okay, but what persistent errors are you referring to?

Maybe ENOMEM is a bad example. How about if the swap device
just went bad and can't complete new IO writes?

> Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> invalidation or a racing swapin fault. In both cases we should
> absolutely keep trying other entries until the goal is met.

How about a narrower fix, recognizing those error cases and making
the inner loop continue on those errors?

> > > Should it be fixed before merging this patch? I don't think the
> > > ordering matters. Right now the -ENOMEM case invokes OOM, so it isn't
> > > really persistent either. Retrying a few times in that case certainly
> > > doesn't seem to make things worse.
> > 
> > If you already know the error is persistent, retrying is wasting
> > CPU. It can pertancial hold locks during the retry, which can
> > slow someone else down.
> 
> That's a bit of a truism. How does this pertain to the zswap reclaim
> situation?

See the above narrower fix alternative.
> 
> > > > > As I was writing to Yosry, the differentiation would be a great improvement
> > > > > here, I just have a patch set in the queue that moves the inner reclaim loop
> > > > > from the zpool driver up to zswap. With that, updating the error handling
> > > > > would be more convenient as it would be done in one place instead of three.
> > > > 
> > > > This has tricky complications as well. The current shrink interface
> > > > doesn't support continuing from the previous error position. If you want
> > > > to avoid a repeat attempt if the page has a writeback error, you kinda
> > > > of need a way to skip that page.
> > > 
> > > A page that fails to reclaim is put back to the tail of the LRU, so
> > > for all intents and purposes it will be skipped. In the rare and
> > 
> > Do you mean the page is treated as hot again?
> > 
> > Wouldn't that be undesirable from the app's point of view?
> 
> That's current backend LRU behavior. Is it optimal? That's certainly
> debatable. But it's tangential to this patch. The point is that
> capping retries to a fixed number of failures works correctly as a
> safety precaution and introduces no (new) undesirable behavior.
> 
> It's entirely moot once we refactor the backend page LRU to the zswap
> entry LRU. The only time we'll fail to reclaim an entry is if we race
> with something already freeing it, so it doesn't really matter where
> we put it.

Agree with you there. A bit sidetracked.

> > > extreme case where it's the only page left on the list, I again don't
> > > see how retrying a few times will make the situation worse.
> > > 
> > > In practice, IMO there is little upside in trying to be more
> > > discerning about the error codes. Simple seems better here.
> > 
> > Just trying to think about what should be the precise loop termination
> > condition here.
> > 
> > I still feel blindly trying a few times is a very imprecise condition.
> 
> The precise termination condition is when can_accept() returns true
> again. The safety cap is only added as precaution to avoid infinite
> loops if something goes wrong or unexpected, now or in the future.

In my mind, that statement already suggests can_accept() is not
*precise*, considering the need to avoid an infinite loop.
E.g., do we know what the optimal cap value is and why that value
is optimal?

Putting the definition of precise aside, I do see the unconditional
retry can have unwanted effects.

Chris
  
Johannes Weiner May 30, 2023, 6:54 p.m. UTC | #15
On Tue, May 30, 2023 at 11:18:51AM -0700, Chris Li wrote:
> On Tue, May 30, 2023 at 11:55:19AM -0400, Johannes Weiner wrote:
> > On Tue, May 30, 2023 at 07:51:23AM -0700, Chris Li wrote:
> > > Thanks for pointing out -ENOMEM shouldn't be persistent.
> > > Points taken.
> > > 
> > > The original point of not retrying the persistent error
> > > still holds.
> > 
> > Okay, but what persistent errors are you referring to?
> 
> Maybe ENOMEM is a bad example. How about if the swap device
> just went bad and can't complete new IO writes?

This is actually outside the scope of zswap, and handled by the
swapcache (end_swap_bio_write).

Once the IO is submitted, zswap will ax its copy and leave the rest to
the swapcache. It behaves the same way as if zswap had never been
involved to begin with when the swap out fails on IO errors.

From a zswap perspective, there are no persistent errors in moving a
zswap entry back into the swapcache. Not just currently, but generally.

> > Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> > invalidation or a racing swapin fault. In both cases we should
> > absolutely keep trying other entries until the goal is met.
> 
> How about a narrower fix recognizing those error cases and making
> the inner loop continue in those errors?

Right, I just don't really see the value proposition of this
complication, and I see some downsides (see below). No single entry
error should ever cause us to stop the wider reclaim loop.

> > > > extreme case where it's the only page left on the list, I again don't
> > > > see how retrying a few times will make the situation worse.
> > > > 
> > > > In practice, IMO there is little upside in trying to be more
> > > > discerning about the error codes. Simple seems better here.
> > > 
> > > Just trying to think about what should be the precise loop termination
> > > condition here.
> > > 
> > > I still feel blindly trying a few times is a very imprecise condition.
> > 
> > The precise termination condition is when can_accept() returns true
> > again. The safety cap is only added as precaution to avoid infinite
> > loops if something goes wrong or unexpected, now or in the future.
> 
> In my mind, that statement already suggests can_accept() is not
> *precise*, considering the avoid infinite loop.
> e.g. Do we know what is the optimal cap value and why that value
> is optical?

Oh but it is precise. That's the goal we want to accomplish.

The cap is just so that in case something unexpectedly goes wrong (a
bug), we fail gracefully and don't lock up the machine. The same
reason we prefer WARN_ONs over BUG_ONs if we can, to avoid
crashes. That's really all there is to it, and it strikes me as a
reasonable and robust design choice. It's fine to limp along or be
suboptimal after such a bug happens; the bar is avoiding an infinite
loop, nothing else.

Your suggestion of whitelisting certain errors is more complicated,
but also less robust: in case an entry error does by some accident
become persistent for the whole LRU, we're locking up the host. We'd
rather catch a bug like this by seeing spikes in the reclaim failure
rate than by losing production machines.

> Putting the definition of precise aside, I do see the unconditional
> retry can have unwanted effects.

I hope I could address this above. But if not, please share your
concerns.

Thanks!
  
Chris Li May 31, 2023, 1:06 a.m. UTC | #16
On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote:
> > Maybe ENOMEM is a bad example. How about if the swap device
> > just went bad and can't complete new IO writes?
> 
> This is actually outside the scope of zswap, and handled by the
> swapcache (end_swap_bio_write).
> 
> Once the IO is submitted, zswap will ax its copy and leave the rest to
> the swapcache. It behaves the same way as if zswap had never been
> involved to begin with when the swap out fails on IO errors.
> 
> From a zswap perspective, there are no persistent errors in moving a
> zswap entry back into the swapcache. Not just currently, but generally.
Again, you are right that this zswap writeback is async.
So the writeback error is NOT going to propagate to the
shrink function.

With the current three pool backends that I looked at {zbud,
z3fold, zsmalloc}, they all retry internally 8 times.
Adding more retries does not fundamentally change the existing
behavior.

I looked at all the possible error codes generated inside
the reclaim code; the only noticeable errors are ENOMEM
and those from a concurrent swap invalidation or a racing swapin fault.

BTW, zswap reclaim consumes memory. Keeping on looping on ENOMEM
might cause more OOM. But that can happen in the current code
as well.

> > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> > > invalidation or a racing swapin fault. In both cases we should
> > > absolutely keep trying other entries until the goal is met.
> > 
> > How about a narrower fix recognizing those error cases and making
> > the inner loop continue in those errors?
> 
> Right, I just I don't really see the value proposition of this
> complication, and I see some downsides (see below). No single entry
> error should ever cause us to stop the wider reclaim loop.

That is, until the current LRU list has been gone through once.
I expect repeating the same list yields fewer reclaimed pages.

> 
> > > > > extreme case where it's the only page left on the list, I again don't
> > > > > see how retrying a few times will make the situation worse.
> > > > > 
> > > > > In practice, IMO there is little upside in trying to be more
> > > > > discerning about the error codes. Simple seems better here.
> > > > 
> > > > Just trying to think about what should be the precise loop termination
> > > > condition here.
> > > > 
> > > > I still feel blindly trying a few times is a very imprecise condition.
> > > 
> > > The precise termination condition is when can_accept() returns true
> > > again. The safety cap is only added as precaution to avoid infinite
> > > loops if something goes wrong or unexpected, now or in the future.
> > 
> > In my mind, that statement already suggests can_accept() is not
> > *precise*, considering the avoid infinite loop.
> > e.g. Do we know what is the optimal cap value and why that value
> > is optical?
> 
> Oh but it is precise. That's the goal we want to accomplish.

I understand it is the goal; the precise condition I am
talking about is the loop termination condition, and can_accept()
is not the only part of it. Anyway, let's move on.
> 
> The cap is just so that in case something unexpectedly goes wrong (a
> bug), we fail gracefully and don't lock up the machine. The same
> reason we prefer WARN_ONs over BUG_ONs if we can, to avoid
> crashes. That's really all there is too it, and it strikes me as
> reasonable and robust design choice. It's fine to limp along or be
> suboptimal after such a bug happens; the bar is avoiding an infinite
> loop, nothing else.
> 
> Your suggestion of whitelisting certain errors is more complicated,
> but also less robust: in case an entry error does by some accident
> become persistent for the whole LRU, we're locking up the host. We'd
> rather catch a bug like this by seeing spikes in the reclaim failure
> rate than by losing production machines.
> 
> > Putting the definition of precise aside, I do see the unconditional
> > retry can have unwanted effects.
> 
> I hope I could address this above. But if not, please share your
> concerns.
Thanks for the discussion. I am less concerned about the retry now.
Retry on EAGAIN might be the simplest way to proceed.
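
I.e., something like this sketch for the worker loop (assuming the
transient race cases reach it as -EAGAIN from zpool_shrink(); untested,
just to illustrate):

	int ret;

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret) {
			zswap_reject_reclaim_fail++;
			/* transient per-entry race: try the next entry */
			if (ret == -EAGAIN)
				continue;
			/* anything else (e.g. -EINVAL): give up */
			break;
		}
	} while (!zswap_can_accept());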

Outside of the scope of this patch, I am still surprised to see
such a high number of retries caused by race conditions. There are
already 8 inner-loop retries, so the actual number of pages retried
needs to be multiplied by 8.

If there is a reproducer script, I want to repro this locally
to understand it better. I wish there were ways to reduce the retries.

Another idea is that we could start the shrinking once the
pool max is reached, and try to reduce down to the threshold earlier.

Chris
  
Johannes Weiner May 31, 2023, 2:30 a.m. UTC | #17
On Tue, May 30, 2023 at 06:06:38PM -0700, Chris Li wrote:
> On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote:
> > > Maybe ENOMEM is a bad example. How about if the swap device
> > > just went bad and can't complete new IO writes?
> > 
> > This is actually outside the scope of zswap, and handled by the
> > swapcache (end_swap_bio_write).
> > 
> > Once the IO is submitted, zswap will ax its copy and leave the rest to
> > the swapcache. It behaves the same way as if zswap had never been
> > involved to begin with when the swap out fails on IO errors.
> > 
> > From a zswap perspective, there are no persistent errors in moving a
> > zswap entry back into the swapcache. Not just currently, but generally.
> Again, you are right that this zswap writeback is async.
> So the writeback error is NOT going to propagate to the
> shrink function.
> 
> With the current three pool backends that I looked at{zbud,
> z3fold,zsmalloc} they all have internal retry 8 times.
> Adding more retry did not fundamentally change the existing
> behavior.

Ah, but they're looping over different things.

The internal loop in the zs_reclaim_page() function is about walking
the LRU list until at least one backing page is freed.

Then there is zs_zpool_shrink() which calls zs_reclaim_page() until
the requested number of pages are freed.

Finally, there is shrink_worker(), which calls zs_zpool_shrink(). It
currently calls it for a single page when woken up during a store that
hits the zswap pool limit. This is the problematic one, because zswap
is very unlikely to go back to accepting stores after a single freed page.
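
To make the levels concrete, the call chain with this patch is roughly
(an outline of what's described above, not verbatim code):

	shrink_worker()                  /* zswap: loop until zswap_can_accept() */
	  zpool_shrink(pool->zpool, 1, NULL)
	    zs_zpool_shrink()            /* zpool glue: free the requested nr of pages */
	      zs_reclaim_page()          /* backend: walk the LRU, retrying entries
	                                    until one backing page is freed */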

Domenico's patch isn't adding more retries for error conditions. It
ensures the pool is shrunk back down to where it accepts stores again.

The reason that it now looks at errors as well isn't to retry over
them (that's zs_reclaim_page()'s job). It's to avoid an infinite loop
in case there is an unexpectedly high rate of errors across a whole
series of pages (suggesting there is a bug of some kind).

> I look at all the possible error codes generated inside
> the reclaim code, the only noticeable errors are ENOMEM
> and concurrent swap invalidation or a racing swapin fault.

Right.

> BTW, zswap reclaim consumes memory. Keep on looping ENOMEM
> might cause more OOM. But that can exist in current code
> as well.

Right.

And this is temporary. Zswap will allocate a page to decompress in,
add it to the swapcache and kick off the IO. Once the page is written
out, it'll be reclaimed again. So while the consumption increases
temporarily, the end result is a net reduction by the amount of
compressed data that was written back from zswap.

This is typical for other types of reclaim as well, e.g. allocating
entries in the swapcache tree, allocating bios and IO requests...

> > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> > > > invalidation or a racing swapin fault. In both cases we should
> > > > absolutely keep trying other entries until the goal is met.
> > > 
> > > How about a narrower fix recognizing those error cases and making
> > > the inner loop continue in those errors?
> > 
> > Right, I just I don't really see the value proposition of this
> > complication, and I see some downsides (see below). No single entry
> > error should ever cause us to stop the wider reclaim loop.
> 
> That is until the current LRU list has been through once.
> I expect repeating the same list yields less reclaimed pages.

If we see failure due to invalidation races, the entries will free and
shrink the pool, thus getting us closer to the can_accept() condition.
We'll stop looping in the shrinker once enough entries have been freed
- whether we reclaimed them ourselves, or somebody else invalidated them.
  
Domenico Cerasuolo June 1, 2023, 3:27 p.m. UTC | #18
On Wed, May 31, 2023 at 3:06 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote:
> > > Maybe ENOMEM is a bad example. How about if the swap device
> > > just went bad and can't complete new IO writes?
> >
> > This is actually outside the scope of zswap, and handled by the
> > swapcache (end_swap_bio_write).
> >
> > Once the IO is submitted, zswap will ax its copy and leave the rest to
> > the swapcache. It behaves the same way as if zswap had never been
> > involved to begin with when the swap out fails on IO errors.
> >
> > From a zswap perspective, there are no persistent errors in moving a
> > zswap entry back into the swapcache. Not just currently, but generally.
> Again, you are right that this zswap writeback is async.
> So the writeback error is NOT going to propagate to the
> shrink function.
>
> With the current three pool backends that I looked at{zbud,
> z3fold,zsmalloc} they all have internal retry 8 times.
> Adding more retry did not fundamentally change the existing
> behavior.
>
> I look at all the possible error codes generated inside
> the reclaim code, the only noticeable errors are ENOMEM
> and concurrent swap invalidation or a racing swapin fault.
>
> BTW, zswap reclaim consumes memory. Keep on looping ENOMEM
> might cause more OOM. But that can exist in current code
> as well.
>
> > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> > > > invalidation or a racing swapin fault. In both cases we should
> > > > absolutely keep trying other entries until the goal is met.
> > >
> > > How about a narrower fix recognizing those error cases and making
> > > the inner loop continue in those errors?
> >
> > Right, I just I don't really see the value proposition of this
> > complication, and I see some downsides (see below). No single entry
> > error should ever cause us to stop the wider reclaim loop.
>
> That is until the current LRU list has been through once.
> I expect repeating the same list yields less reclaimed pages.
>
> >
> > > > > > extreme case where it's the only page left on the list, I again don't
> > > > > > see how retrying a few times will make the situation worse.
> > > > > >
> > > > > > In practice, IMO there is little upside in trying to be more
> > > > > > discerning about the error codes. Simple seems better here.
> > > > >
> > > > > Just trying to think about what should be the precise loop termination
> > > > > condition here.
> > > > >
> > > > > I still feel blindly trying a few times is a very imprecise condition.
> > > >
> > > > The precise termination condition is when can_accept() returns true
> > > > again. The safety cap is only added as precaution to avoid infinite
> > > > loops if something goes wrong or unexpected, now or in the future.
> > >
> > > In my mind, that statement already suggests can_accept() is not
> > > *precise*, considering the avoid infinite loop.
> > > e.g. Do we know what is the optimal cap value and why that value
> > > is optical?
> >
> > Oh but it is precise. That's the goal we want to accomplish.
>
> I understand it is the goal, the precise condition I am
> talking about is the loop termination condition, can_accept()
> is not the only one. Anyway, let's move on.
> >
> > The cap is just so that in case something unexpectedly goes wrong (a
> > bug), we fail gracefully and don't lock up the machine. The same
> > reason we prefer WARN_ONs over BUG_ONs if we can, to avoid
> > crashes. That's really all there is too it, and it strikes me as
> > reasonable and robust design choice. It's fine to limp along or be
> > suboptimal after such a bug happens; the bar is avoiding an infinite
> > loop, nothing else.
> >
> > Your suggestion of whitelisting certain errors is more complicated,
> > but also less robust: in case an entry error does by some accident
> > become persistent for the whole LRU, we're locking up the host. We'd
> > rather catch a bug like this by seeing spikes in the reclaim failure
> > rate than by losing production machines.
> >
> > > Putting the definition of precise aside, I do see the unconditional
> > > retry can have unwanted effects.
> >
> > I hope I could address this above. But if not, please share your
> > concerns.
> Thanks for the discussion. I am less concerned about the retry now.
> Retry on EAGAIN might be the simplest way to proceed.
>
> Outside of the scope of this patch, I am still surprised to see
> such a high number of retries caused by race conditions. There are
> 8 inner loop retry already. The actual pages retried need to
> times 8.
>
> If there is a reproducer script, I want to local repo this
> to understand better. Wish there are ways to reduce the retry.

With `stress` you can reproduce the errors quite easily. Since you're
not going to see many (if any at all) writebacks _without_ this patch,
you can let the zpool fill while keeping the default max_pool_percent
value of 20, then once it's full switch max_pool_percent to 1.

>
> Another idea is that we can start the shrinking once the
> pool max was reached. Try to reduce to the threshold earlier.

The shrinking already starts when pool max is reached, with and without the
proposed patch. If you're referring to a new threshold that would kick off
the shrinking, that could be a nice improvement to the overall mechanism,
and it would fit nicely on top of this patch.
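
E.g. something along these lines on top of it (zswap_should_shrink()
being a hypothetical helper that compares the pool size against a new,
lower watermark; just a sketch of the idea):

	/* in zswap_frontswap_store(), after a successful store */
	if (zswap_should_shrink(pool))
		queue_work(shrink_wq, &pool->shrink_work);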

>
> Chris
  

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 59da2a415fbb..2ee0775d8213 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -587,9 +587,13 @@  static void shrink_worker(struct work_struct *w)
 {
 	struct zswap_pool *pool = container_of(w, typeof(*pool),
 						shrink_work);
+	int ret;
 
-	if (zpool_shrink(pool->zpool, 1, NULL))
-		zswap_reject_reclaim_fail++;
+	do {
+		ret = zpool_shrink(pool->zpool, 1, NULL);
+		if (ret)
+			zswap_reject_reclaim_fail++;
+	} while (!zswap_can_accept() && ret != -EINVAL);
 	zswap_pool_put(pool);
 }
 
@@ -1188,7 +1192,7 @@  static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	if (zswap_pool_reached_full) {
 	       if (!zswap_can_accept()) {
 			ret = -ENOMEM;
-			goto reject;
+			goto shrink;
 		} else
 			zswap_pool_reached_full = false;
 	}