mm:zswap: fix zswap entry reclamation failure in two scenarios

Message ID 20231113130601.3350915-1-hezhongkun.hzk@bytedance.com
State New
Headers
Series mm:zswap: fix zswap entry reclamation failure in two scenarios |

Commit Message

Zhongkun He Nov. 13, 2023, 1:06 p.m. UTC
  I recently found two scenarios where zswap entry could not be
released, which will cause shrink_worker and active recycling
to fail.
1)The swap entry has been freed, but cached in swap_slots_cache,
no swap cache and swapcount=0.
2)When the option zswap_exclusive_loads_enabled disabled and
zswap_load completed(page in swap_cache and swapcount = 0).

The above two cases need to be determined by swapcount=0,
fix it.

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 mm/zswap.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)
  

Comments

Nhat Pham Nov. 13, 2023, 3:11 p.m. UTC | #1
On Mon, Nov 13, 2023 at 8:06 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> I recently found two scenarios where zswap entry could not be
> released, which will cause shrink_worker and active recycling
> to fail.
> 1)The swap entry has been freed, but cached in swap_slots_cache,
> no swap cache and swapcount=0.
> 2)When the option zswap_exclusive_loads_enabled disabled and
> zswap_load completed(page in swap_cache and swapcount = 0).
>
> The above two cases need to be determined by swapcount=0,
> fix it.
>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
>  mm/zswap.c | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 74411dfdad92..db95491bcdd5 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1063,11 +1063,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         struct mempolicy *mpol;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> +       struct swap_info_struct *si;
>         struct zpool *pool = zswap_find_zpool(entry);
>         bool page_was_allocated;
>         u8 *src, *tmp = NULL;
>         unsigned int dlen;
> -       int ret;
> +       int ret = 0;
>         struct writeback_control wbc = {
>                 .sync_mode = WB_SYNC_NONE,
>         };
> @@ -1082,16 +1083,30 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         mpol = get_task_policy(current);
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                                 NO_INTERLEAVE_INDEX, &page_was_allocated);
> -       if (!page) {
> +       if (!page)
>                 ret = -ENOMEM;
> -               goto fail;
> -       }
> -
> -       /* Found an existing page, we raced with load/swapin */
> -       if (!page_was_allocated) {
> +       else if (!page_was_allocated) {
> +               /* Found an existing page, we raced with load/swapin */
>                 put_page(page);
>                 ret = -EEXIST;
> -               goto fail;
> +       }
> +
> +       if (ret) {
> +               si = get_swap_device(swpentry);
> +               if (!si)
> +                       goto out;
> +
> +               /* Two cases to directly release zswap_entry.
> +                * 1) -ENOMEM,if the swpentry has been freed, but cached in
> +                * swap_slots_cache(no page and swapcount = 0).
> +                * 2) -EEXIST, option zswap_exclusive_loads_enabled disabled and
> +                * zswap_load completed(page in swap_cache and swapcount = 0).
> +                */

These two cases should not count as "successful writeback" right?

I'm slightly biased of course, since my zswap shrinker depends on this
as one of the potential signals for over-shrinking - but that aside, I think
that this constitutes a failed writeback (i.e should not increment writeback
counter, and the limit-based reclaim should try again etc.). If anything,
it will make it incredibly confusing for users.

For instance, we were trying to estimate the number of zswap store
fails by subtracting the writeback count from the overall pswpout, and
this could throw us off by inflating the writeback count, and deflating
the zswap store failure count as a result.

Regarding the second case specifically, I thought that was the point of
having zswap_exclusive_loads_enabled disabled - i.e still keeps a copy
around in the zswap pool even after a completed zswap_load? Based
on the Kconfig documentation:

"This avoids having two copies of the same page in memory
(compressed and uncompressed) after faulting in a page from zswap.
The cost is that if the page was never dirtied and needs to be
swapped out again, it will be re-compressed."

> +               if (!swap_swapcount(si, swpentry))
> +                       ret = 0;
> +
> +               put_swap_device(si);
> +               goto out;
>         }
>
>         /*
> @@ -1106,7 +1121,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>                 spin_unlock(&tree->lock);
>                 delete_from_swap_cache(page_folio(page));
>                 ret = -ENOMEM;
> -               goto fail;
> +               goto out;
>         }
>         spin_unlock(&tree->lock);
>
> @@ -1151,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
>         return ret;
>
> -fail:
> +out:
>         if (!zpool_can_sleep_mapped(pool))
>                 kfree(tmp);
>
> --
> 2.25.1
>
  
Zhongkun He Nov. 14, 2023, 5:21 a.m. UTC | #2
Thanks for your time, Nhat.

>
> These two cases should not count as "successful writeback" right?
>

This is true from the perspective of the writeback itself, but should it
also be considered successful from the purpose of the writeback,
 i.e. whether the compressed memory and zswap_entry can be reclaimed?

> I'm slightly biased of course, since my zswap shrinker depends on this
> as one of the potential signals for over-shrinking - but that aside, I think
> that this constitutes a failed writeback (i.e should not increment writeback
> counter, and the limit-based reclaim should try again etc.). If anything,
> it will make it incredibly confusing for users.

This patch will skip the writeback step,so the writeback counter will not
be incremented. Currently MAX_RECLAIM_RETRIES is 14, shrink_worker
will often fail if writeback fails.

>
> For instance, we were trying to estimate the number of zswap store
> fails by subtracting the writeback count from the overall pswpout, and
> this could throw us off by inflating the writeback count, and deflating
> the zswap store failure count as a result.

As mentioned above, writeback counter will not be incremented.

>
> Regarding the second case specifically, I thought that was the point of
> having zswap_exclusive_loads_enabled disabled - i.e still keeps a copy
> around in the zswap pool even after a completed zswap_load? Based
> on the Kconfig documentation:
>
> "This avoids having two copies of the same page in memory
> (compressed and uncompressed) after faulting in a page from zswap.
> The cost is that if the page was never dirtied and needs to be
> swapped out again, it will be re-compressed."
>

Yes,i know the point,in the case of reading, there is no data update,
so the next swapout does not need to be compressed again.
Consider this scenario,there is a lot of data cached in memory and zswap,
hit the limit,and shrink_worker will fail. The new coming data be written
directly to swap due to zswap_store failure. Should we free the last one
to store the latest one in zswap.
  
Nhat Pham Nov. 14, 2023, 4:30 p.m. UTC | #3
On Tue, Nov 14, 2023 at 12:21 AM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
>
> Thanks for your time, Nhat.
>
> >
> > These two cases should not count as "successful writeback" right?
> >
>
> This is true from the perspective of the writeback itself, but should it
> also be considered successful from the purpose of the writeback,
>  i.e. whether the compressed memory and zswap_entry can be reclaimed?
>
> > I'm slightly biased of course, since my zswap shrinker depends on this
> > as one of the potential signals for over-shrinking - but that aside, I think
> > that this constitutes a failed writeback (i.e should not increment writeback
> > counter, and the limit-based reclaim should try again etc.). If anything,
> > it will make it incredibly confusing for users.
>
> This patch will skip the writeback step,so the writeback counter will not
> be incremented. Currently MAX_RECLAIM_RETRIES is 14, shrink_worker
> will often fail if writeback fails.

Ah my bad, I should have been clearer.

I was looking at the zswap shrinker patch series (specifically the
cgroup-aware LRU patch), which moves the counter update out of
zswap_writeback_entry. If we apply that patch on top of that series, we will
screw up the counter. Should be easily fixable anyway though.

>
> >
> > For instance, we were trying to estimate the number of zswap store
> > fails by subtracting the writeback count from the overall pswpout, and
> > this could throw us off by inflating the writeback count, and deflating
> > the zswap store failure count as a result.
>
> As mentioned above, writeback counter will not be incremented.
>
> >
> > Regarding the second case specifically, I thought that was the point of
> > having zswap_exclusive_loads_enabled disabled - i.e still keeps a copy
> > around in the zswap pool even after a completed zswap_load? Based
> > on the Kconfig documentation:
> >
> > "This avoids having two copies of the same page in memory
> > (compressed and uncompressed) after faulting in a page from zswap.
> > The cost is that if the page was never dirtied and needs to be
> > swapped out again, it will be re-compressed."
> >
>
> Yes,i know the point,in the case of reading, there is no data update,
> so the next swapout does not need to be compressed again.
> Consider this scenario,there is a lot of data cached in memory and zswap,
> hit the limit,and shrink_worker will fail. The new coming data be written
> directly to swap due to zswap_store failure. Should we free the last one
> to store the latest one in zswap.

Ah I think I understand the point of the patch a bit better now.

Essentially, we're invalidating these entries, which does reclaim the
memory used for these compressed objects, but there is no IO involved.
Writeback-less shrinking, if you will.

This will still screw up one of the heuristics I'm using for the zswap
shrinker a bit, but that should be easily fixable with some plumbing.
Same goes for the writeback counter - but depending on the order in
which Andrew apply the patches, you might have to resolve the conflicts
there :)

Other than this objection, I think this optimization makes sense to me:

In the first case, we already freed the swap entry. Might as well also
dropped the zswap entry.

In the second case, we already have another copy in memory, so
dropping the compressed copy to make space for warmer objects
coming into zswap makes sense to me. Might be worth doing a
micro-benchmark to verify this intuition, but I agree that it's more
important to maintain the LRU ordering than any CPU saving from
skipping re-compression.

I would suggest that you should expand on this on the commit log
to make clearer the motivation behind this optimization, if you were
to re-submit this patch for some reason (for e.g to resolve the
aforementioned conflicts with the zswap shrinker series).

But otherwise, LGTM!

Feel free to add the following tag:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
  
Yosry Ahmed Nov. 14, 2023, 5:16 p.m. UTC | #4
+Ying

On Mon, Nov 13, 2023 at 5:06 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> I recently found two scenarios where zswap entry could not be
> released, which will cause shrink_worker and active recycling
> to fail.
> 1)The swap entry has been freed, but cached in swap_slots_cache,
> no swap cache and swapcount=0.
> 2)When the option zswap_exclusive_loads_enabled disabled and
> zswap_load completed(page in swap_cache and swapcount = 0).

For case (1), I think a cleaner solution would be to move the
zswap_invalidate() call from swap_range_free() (which is called after
the cached slots are freed) to __swap_entry_free_locked() if the usage
goes to 0. I actually think conceptually this makes not just for
zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
Slots caching is a swapfile optimization that should be internal to
swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND
not in the swap cache), all the hooks should be called (memcg, zswap,
arch, ..) as the swap entry is effectively freed. The fact that
swapfile code internally batches and caches slots should be
transparent to other parts of MM. I am not sure if the calls can just
be moved or if there are underlying assumptions in the implementation
that would be broken, but it feels like the right thing to do.

For case (2), I don't think zswap can just decide to free the entry.

In that case, the page is in the swap cache pointing to a swp_entry
which has a corresponding zswap entry, and the page is clean because
it is already in swap/zswap, so we don't need to write it out again
unless it is redirtied. If zswap just drops the entry, and reclaim
tries to reclaim the page in the swap cache, it will drop the page
assuming that there is a copy in swap/zswap (because it is clean). Now
we lost all copies of the page.

Am I missing something?

>
> The above two cases need to be determined by swapcount=0,
> fix it.
>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
>  mm/zswap.c | 35 +++++++++++++++++++++++++----------
>  1 file changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 74411dfdad92..db95491bcdd5 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1063,11 +1063,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         struct mempolicy *mpol;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> +       struct swap_info_struct *si;
>         struct zpool *pool = zswap_find_zpool(entry);
>         bool page_was_allocated;
>         u8 *src, *tmp = NULL;
>         unsigned int dlen;
> -       int ret;
> +       int ret = 0;
>         struct writeback_control wbc = {
>                 .sync_mode = WB_SYNC_NONE,
>         };
> @@ -1082,16 +1083,30 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         mpol = get_task_policy(current);
>         page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
>                                 NO_INTERLEAVE_INDEX, &page_was_allocated);
> -       if (!page) {
> +       if (!page)
>                 ret = -ENOMEM;
> -               goto fail;
> -       }
> -
> -       /* Found an existing page, we raced with load/swapin */
> -       if (!page_was_allocated) {
> +       else if (!page_was_allocated) {
> +               /* Found an existing page, we raced with load/swapin */
>                 put_page(page);
>                 ret = -EEXIST;
> -               goto fail;
> +       }
> +
> +       if (ret) {
> +               si = get_swap_device(swpentry);
> +               if (!si)
> +                       goto out;
> +
> +               /* Two cases to directly release zswap_entry.
> +                * 1) -ENOMEM,if the swpentry has been freed, but cached in
> +                * swap_slots_cache(no page and swapcount = 0).
> +                * 2) -EEXIST, option zswap_exclusive_loads_enabled disabled and
> +                * zswap_load completed(page in swap_cache and swapcount = 0).
> +                */
> +               if (!swap_swapcount(si, swpentry))
> +                       ret = 0;
> +
> +               put_swap_device(si);
> +               goto out;
>         }
>
>         /*
> @@ -1106,7 +1121,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>                 spin_unlock(&tree->lock);
>                 delete_from_swap_cache(page_folio(page));
>                 ret = -ENOMEM;
> -               goto fail;
> +               goto out;
>         }
>         spin_unlock(&tree->lock);
>
> @@ -1151,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
>         return ret;
>
> -fail:
> +out:
>         if (!zpool_can_sleep_mapped(pool))
>                 kfree(tmp);
>
> --
> 2.25.1
>
  
Zhongkun He Nov. 15, 2023, 12:12 p.m. UTC | #5
>
> Ah my bad, I should have been clearer.
>
> I was looking at the zswap shrinker patch series (specifically the
> cgroup-aware LRU patch), which moves the counter update out of
> zswap_writeback_entry. If we apply that patch on top of that series, we will
> screw up the counter. Should be easily fixable anyway though.

Got it.

>
> Ah I think I understand the point of the patch a bit better now.
>
> Essentially, we're invalidating these entries, which does reclaim the
> memory used for these compressed objects, but there is no IO involved.
> Writeback-less shrinking, if you will.
>
> This will still screw up one of the heuristics I'm using for the zswap
> shrinker a bit, but that should be easily fixable with some plumbing.
> Same goes for the writeback counter - but depending on the order in
> which Andrew apply the patches, you might have to resolve the conflicts
> there :)

OK,  I will fix it.

>
> Other than this objection, I think this optimization makes sense to me:
>
> In the first case, we already freed the swap entry. Might as well also
> dropped the zswap entry.
>
> In the second case, we already have another copy in memory, so
> dropping the compressed copy to make space for warmer objects
> coming into zswap makes sense to me. Might be worth doing a
> micro-benchmark to verify this intuition, but I agree that it's more
> important to maintain the LRU ordering than any CPU saving from
> skipping re-compression.
>
> I would suggest that you should expand on this on the commit log
> to make clearer the motivation behind this optimization, if you were
> to re-submit this patch for some reason (for e.g to resolve the
> aforementioned conflicts with the zswap shrinker series).
>
> But otherwise, LGTM!
>
> Feel free to add the following tag:
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Thanks, there are still some commits from Yosry,
after that, I will send it again.
  
Zhongkun He Nov. 15, 2023, 12:53 p.m. UTC | #6
> For case (1), I think a cleaner solution would be to move the
> zswap_invalidate() call from swap_range_free() (which is called after
> the cached slots are freed) to __swap_entry_free_locked() if the usage
> goes to 0. I actually think conceptually this makes not just for
> zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
> Slots caching is a swapfile optimization that should be internal to
> swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND
> not in the swap cache), all the hooks should be called (memcg, zswap,
> arch, ..) as the swap entry is effectively freed. The fact that
> swapfile code internally batches and caches slots should be
> transparent to other parts of MM. I am not sure if the calls can just
> be moved or if there are underlying assumptions in the implementation
> that would be broken, but it feels like the right thing to do.

Good idea,  This is indeed a clear solution.  I'll try it in another
patch later.

>
> For case (2), I don't think zswap can just decide to free the entry.
>
> In that case, the page is in the swap cache pointing to a swp_entry
> which has a corresponding zswap entry, and the page is clean because
> it is already in swap/zswap, so we don't need to write it out again
> unless it is redirtied. If zswap just drops the entry, and reclaim
> tries to reclaim the page in the swap cache, it will drop the page
> assuming that there is a copy in swap/zswap (because it is clean). Now
> we lost all copies of the page.
>
> Am I missing something?
>

Ah, my bad.  Missed the step of marking the page as dirty.
Please have a look,  just like zswap_exclusive_loads_enabled,
set page dity so that it can be pageout again.
       if (!page_was_allocated) {
              if (!count) {
                       set_page_dirty(page);
                       ret = 0;
               } else
                       ret = -EEXIST;
               put_page(page);
}
Thanks  for your feedback, Yosry.
  
Yosry Ahmed Nov. 15, 2023, 8:12 p.m. UTC | #7
On Wed, Nov 15, 2023 at 4:53 AM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
>
> > For case (1), I think a cleaner solution would be to move the
> > zswap_invalidate() call from swap_range_free() (which is called after
> > the cached slots are freed) to __swap_entry_free_locked() if the usage
> > goes to 0. I actually think conceptually this makes not just for
> > zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
> > Slots caching is a swapfile optimization that should be internal to
> > swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND
> > not in the swap cache), all the hooks should be called (memcg, zswap,
> > arch, ..) as the swap entry is effectively freed. The fact that
> > swapfile code internally batches and caches slots should be
> > transparent to other parts of MM. I am not sure if the calls can just
> > be moved or if there are underlying assumptions in the implementation
> > that would be broken, but it feels like the right thing to do.
>
> Good idea,  This is indeed a clear solution.  I'll try it in another
> patch later.
>
> >
> > For case (2), I don't think zswap can just decide to free the entry.
> >
> > In that case, the page is in the swap cache pointing to a swp_entry
> > which has a corresponding zswap entry, and the page is clean because
> > it is already in swap/zswap, so we don't need to write it out again
> > unless it is redirtied. If zswap just drops the entry, and reclaim
> > tries to reclaim the page in the swap cache, it will drop the page
> > assuming that there is a copy in swap/zswap (because it is clean). Now
> > we lost all copies of the page.
> >
> > Am I missing something?
> >
>
> Ah, my bad.  Missed the step of marking the page as dirty.
> Please have a look,  just like zswap_exclusive_loads_enabled,
> set page dity so that it can be pageout again.
>        if (!page_was_allocated) {
>               if (!count) {
>                        set_page_dirty(page);
>                        ret = 0;
>                } else
>                        ret = -EEXIST;
>                put_page(page);
> }

I think we may need to try to lock the folio. Otherwise we may race
with reclaim reading the dirty bit before we set it.

Taking a step back, this seems like we are going behind exclusive
loads. We "should" keep the page in zswap as exclusive loads are
disabled and the page is not yet invalidated from zswap (the swap
entry is still in use). What you are trying to do here is sneakily
drop the page from zswap as if we wrote it back, but we didn't. We
just know that it was already loaded from zswap. We are essentially
making the previous load exclusive retroactively.

Is there a reason why exclusive loads cannot be enabled to achieve the
same result in the (arguably) correct way?

> Thanks  for your feedback, Yosry.
  
Zhongkun He Nov. 16, 2023, 3:33 a.m. UTC | #8
On Thu, Nov 16, 2023 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> I think we may need to try to lock the folio. Otherwise we may race
> with reclaim reading the dirty bit before we set it.
>
> Taking a step back, this seems like we are going behind exclusive
> loads. We "should" keep the page in zswap as exclusive loads are
> disabled and the page is not yet invalidated from zswap (the swap
> entry is still in use). What you are trying to do here is sneakily
> drop the page from zswap as if we wrote it back, but we didn't.

If  we want to reclaim the cached zswap_entry, Writing back might
be the easy way.

> We just know that it was already loaded from zswap. We are essentially
> making the previous load exclusive retroactively.
>
> Is there a reason why exclusive loads cannot be enabled to achieve the
> same result in the (arguably) correct way?
>

zswap_exclusive_loads is not enabled by default, so the shrink_worker
may fail if there are many cached zswap_entries on the zswap_pool->lru.

Is it possible to make zswap_exclusive_loads the only way in zswap_load?
It only makes sense when the page is read and no longer dirty.
If the page is read frequently, it should stay in cache rather than zswap.
The benefit of doing this is very small, two copies of the same page
in memory.

Thanks.
  
Yosry Ahmed Nov. 16, 2023, 4:09 a.m. UTC | #9
On Wed, Nov 15, 2023 at 7:33 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
>
> On Thu, Nov 16, 2023 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > I think we may need to try to lock the folio. Otherwise we may race
> > with reclaim reading the dirty bit before we set it.
> >
> > Taking a step back, this seems like we are going behind exclusive
> > loads. We "should" keep the page in zswap as exclusive loads are
> > disabled and the page is not yet invalidated from zswap (the swap
> > entry is still in use). What you are trying to do here is sneakily
> > drop the page from zswap as if we wrote it back, but we didn't.
>
> If  we want to reclaim the cached zswap_entry, Writing back might
> be the easy way.
>
> > We just know that it was already loaded from zswap. We are essentially
> > making the previous load exclusive retroactively.
> >
> > Is there a reason why exclusive loads cannot be enabled to achieve the
> > same result in the (arguably) correct way?
> >
>
> zswap_exclusive_loads is not enabled by default, so the shrink_worker
> may fail if there are many cached zswap_entries on the zswap_pool->lru.


It can be enabled at runtime, or enabled by default by using
CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON.

>
>
> Is it possible to make zswap_exclusive_loads the only way in zswap_load?
> It only makes sense when the page is read and no longer dirty.
> If the page is read frequently, it should stay in cache rather than zswap.
> The benefit of doing this is very small, two copies of the same page
> in memory.


The reason I added it behind runtime and config knobs is to preserve
the existing behavior in case someone depends on it. At Google, we
have been using exclusive loads for a long time. If other users of
zswap agree to make this the default behavior or make it the only way
to do zswap loads I don't have a problem with it.

>
>
> Thanks.
  
Zhongkun He Nov. 16, 2023, 4:23 a.m. UTC | #10
On Thu, Nov 16, 2023 at 12:10 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> It can be enabled at runtime, or enabled by default by using
> CONFIG_ZSWAP_EXCLUSIVE_LOADS_DEFAULT_ON.
>

Yes, I see it in the doc. Thanks.

>
>
> The reason I added it behind runtime and config knobs is to preserve
> the existing behavior in case someone depends on it. At Google, we
> have been using exclusive loads for a long time. If other users of
> zswap agree to make this the default behavior or make it the only way
> to do zswap loads I don't have a problem with it.
>

Got it.  Thanks for your feedback.
  
Huang, Ying Nov. 16, 2023, 8:31 a.m. UTC | #11
Yosry Ahmed <yosryahmed@google.com> writes:

> +Ying
>
> On Mon, Nov 13, 2023 at 5:06 AM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
>>
>> I recently found two scenarios where zswap entry could not be
>> released, which will cause shrink_worker and active recycling
>> to fail.
>> 1)The swap entry has been freed, but cached in swap_slots_cache,
>> no swap cache and swapcount=0.
>> 2)When the option zswap_exclusive_loads_enabled disabled and
>> zswap_load completed(page in swap_cache and swapcount = 0).
>
> For case (1), I think a cleaner solution would be to move the
> zswap_invalidate() call from swap_range_free() (which is called after
> the cached slots are freed) to __swap_entry_free_locked() if the usage
> goes to 0. I actually think conceptually this makes not just for
> zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
> Slots caching is a swapfile optimization that should be internal to
> swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND
> not in the swap cache), all the hooks should be called (memcg, zswap,
> arch, ..) as the swap entry is effectively freed. The fact that
> swapfile code internally batches and caches slots should be
> transparent to other parts of MM. I am not sure if the calls can just
> be moved or if there are underlying assumptions in the implementation
> that would be broken, but it feels like the right thing to do.

This sounds reasonable for me.  Just don't forget to change other
swap_range_free() callers too.

--
Best Regards,
Huang, Ying
  
Zhongkun He Nov. 16, 2023, 10:34 a.m. UTC | #12
>
> This sounds reasonable for me.  Just don't forget to change other
> swap_range_free() callers too.

Got it, thanks.

>
> --
> Best Regards,
> Huang, Ying
  
Chris Li Nov. 16, 2023, 8:11 p.m. UTC | #13
Hi Yosry,

On Tue, Nov 14, 2023 at 9:16 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > 1)The swap entry has been freed, but cached in swap_slots_cache,
> > no swap cache and swapcount=0.
> > 2)When the option zswap_exclusive_loads_enabled disabled and
> > zswap_load completed(page in swap_cache and swapcount = 0).
>
> For case (1), I think a cleaner solution would be to move the
> zswap_invalidate() call from swap_range_free() (which is called after
> the cached slots are freed) to __swap_entry_free_locked() if the usage
> goes to 0. I actually think conceptually this makes not just for
> zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
> Slots caching is a swapfile optimization that should be internal to
> swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND

Do you mean moving all swap slots free to bypass the swap slot cache, even it
is not from zswap? That might have unwanted side effects. The swap
slot cache is not just for swap files on disk. The batching has the
effect that on average lower cost of freeing per entry.

> not in the swap cache), all the hooks should be called (memcg, zswap,
> arch, ..) as the swap entry is effectively freed. The fact that
> swapfile code internally batches and caches slots should be
> transparent to other parts of MM. I am not sure if the calls can just
> be moved or if there are underlying assumptions in the implementation
> that would be broken, but it feels like the right thing to do.

There is also the behavior that if the page gets swapped in but hasn't
changed,  when swap out again, it is possible to avoid writing the
page again to the disk. For disk there is no overhead keeping the old
date on the disk not to touch it. For zpool it might have memory
overhead holding the compressed pool. The trade off might be
different.

Chris
  
Yosry Ahmed Nov. 16, 2023, 8:18 p.m. UTC | #14
On Thu, Nov 16, 2023 at 12:12 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> On Tue, Nov 14, 2023 at 9:16 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > 1)The swap entry has been freed, but cached in swap_slots_cache,
> > > no swap cache and swapcount=0.
> > > 2)When the option zswap_exclusive_loads_enabled disabled and
> > > zswap_load completed(page in swap_cache and swapcount = 0).
> >
> > For case (1), I think a cleaner solution would be to move the
> > zswap_invalidate() call from swap_range_free() (which is called after
> > the cached slots are freed) to __swap_entry_free_locked() if the usage
> > goes to 0. I actually think conceptually this makes not just for
> > zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
> > Slots caching is a swapfile optimization that should be internal to
> > swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND
>
> Do you mean moving all swap slots free to bypass the swap slot cache, even it
> is not from zswap? That might have unwanted side effects. The swap
> slot cache is not just for swap files on disk. The batching has the
> effect that on average lower cost of freeing per entry.

Not bypassing the swap slot cache, just make the callbacks to
invalidate the zswap entry, do memg uncharging, etc when the slot is
no longer used and is entering the swap slot cache (i.e. when
free_swap_slot() is called), instead of when draining the swap slot
cache (i.e. when swap_range_free() is called). For all parts of MM
outside of swap, the swap entry is freed when free_swap_slot() is
called. We don't free it immediately because of caching, but this
should be transparent to other parts of MM (e.g. zswap, memcg, etc).

>
> > not in the swap cache), all the hooks should be called (memcg, zswap,
> > arch, ..) as the swap entry is effectively freed. The fact that
> > swapfile code internally batches and caches slots should be
> > transparent to other parts of MM. I am not sure if the calls can just
> > be moved or if there are underlying assumptions in the implementation
> > that would be broken, but it feels like the right thing to do.
>
> There is also the behavior that if the page gets swapped in but hasn't
> changed,  when swap out again, it is possible to avoid writing the
> page again to the disk. For disk there is no overhead keeping the old
> date on the disk not to touch it. For zpool it might have memory
> overhead holding the compressed pool. The trade off might be
> different.
>
> Chris
  
Chris Li Nov. 16, 2023, 8:30 p.m. UTC | #15
On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Not bypassing the swap slot cache, just make the callbacks to
> invalidate the zswap entry, do memg uncharging, etc when the slot is
> no longer used and is entering the swap slot cache (i.e. when
> free_swap_slot() is called), instead of when draining the swap slot
> cache (i.e. when swap_range_free() is called). For all parts of MM
> outside of swap, the swap entry is freed when free_swap_slot() is
> called. We don't free it immediately because of caching, but this
> should be transparent to other parts of MM (e.g. zswap, memcg, etc).

That will cancel the batching effect on the swap slot free, making the
common case for  swapping  faults take longer to complete, righ?
If I recall correctly, the uncharge is the expensive part of the swap
slot free operation.
I just want to figure out what we are trading off against. This is not
one side wins all situations.


Chris
  
Yosry Ahmed Nov. 16, 2023, 8:45 p.m. UTC | #16
On Thu, Nov 16, 2023 at 12:30 PM Chris Li <chriscli@google.com> wrote:
>
> On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Not bypassing the swap slot cache, just make the callbacks to
> > invalidate the zswap entry, do memg uncharging, etc when the slot is
> > no longer used and is entering the swap slot cache (i.e. when
> > free_swap_slot() is called), instead of when draining the swap slot
> > cache (i.e. when swap_range_free() is called). For all parts of MM
> > outside of swap, the swap entry is freed when free_swap_slot() is
> > called. We don't free it immediately because of caching, but this
> > should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>
> That will cancel the batching effect on the swap slot free, making the
> common case for  swapping  faults take longer to complete, righ?
> If I recall correctly, the uncharge is the expensive part of the swap
> slot free operation.
> I just want to figure out what we are trading off against. This is not
> one side wins all situations.

Interesting. Maybe we can just move the zswap_invalidate() call to
save memory early, and leave the memcg uncharge call to be batched?
IIUC we already do some version of this internally at Google.

>
>
> Chris
  
Zhongkun He Nov. 17, 2023, 9:56 a.m. UTC | #17
>
> That will cancel the batching effect on the swap slot free, making the
> common case for  swapping  faults take longer to complete, righ?
> If I recall correctly, the uncharge is the expensive part of the swap
> slot free operation.
> I just want to figure out what we are trading off against. This is not
> one side wins all situations.
>

Hi Chris, thanks for your feedback.  I have the same concerns,
maybe we should just move the zswap_invalidate() out of batches,
as Yosry mentioned above.
  
Chris Li Nov. 17, 2023, 11:30 p.m. UTC | #18
On Thu, Nov 16, 2023 at 12:46 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Nov 16, 2023 at 12:30 PM Chris Li <chriscli@google.com> wrote:
> >
> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > Not bypassing the swap slot cache, just make the callbacks to
> > > invalidate the zswap entry, do memg uncharging, etc when the slot is
> > > no longer used and is entering the swap slot cache (i.e. when
> > > free_swap_slot() is called), instead of when draining the swap slot
> > > cache (i.e. when swap_range_free() is called). For all parts of MM
> > > outside of swap, the swap entry is freed when free_swap_slot() is
> > > called. We don't free it immediately because of caching, but this
> > > should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >
> > That will cancel the batching effect on the swap slot free, making the
> > common case for  swapping  faults take longer to complete, righ?
> > If I recall correctly, the uncharge is the expensive part of the swap
> > slot free operation.
> > I just want to figure out what we are trading off against. This is not
> > one side wins all situations.
>
> Interesting. Maybe we can just move the zswap_invalidate() call to
> save memory early, and leave the memcg uncharge call to be batched?
> IIUC we already do some version of this internally at Google.

I would like to see the patch so I can reason about it better.
Do you have the link for the patch you are talking about?
I can also look up the Google internal one if you have a commit hash.

One thing I would like to find out is whether the behavior of  reusing
swap slots without page writing has been changed.
e.g. Previously if the swap slot can be page out again without
writing/compression again, if the page is not dirty. If we change that
behavior, I would like to understand the trade off better.

Chris
  
Chris Li Nov. 17, 2023, 11:47 p.m. UTC | #19
On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
> Hi Chris, thanks for your feedback.  I have the same concerns,
> maybe we should just move the zswap_invalidate() out of batches,
> as Yosry mentioned above.

As I replied in the previous email, I just want to understand the
other side effects of the change better.

To me, this patching is actually freeing the memory that does not
require actual page IO write from zswap. Which means the memory is
from some kind of cache. It would be interesting if we can not
complicate the write back path further. Instead, we can drop those
memories from the different cache if needed. I assume those caches are
doing something useful in the common case. If not, we should have a
patch to remove these caches instead.  Not sure how big a mess it will
be to implement separate the write and drop caches.

While you are here, I have some questions for you.

Can you help me understand how much memory you can free from this
patch? For example, are we talking about a few pages or a few GB?

Where does the freed memory come from?
If the memory comes from zswap entry struct. Due to the slab allocator
fragmentation. It would take a lot of zswap entries to have meaningful
memory reclaimed from the slab allocator.

If the memory comes from the swap cached pages, that would be much
more meaningful. But that is not what this patch is doing, right?

Chris
  
Zhongkun He Nov. 18, 2023, 1:45 a.m. UTC | #20
Hi Chris, thanks for your time.

>
> On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> > Hi Chris, thanks for your feedback.  I have the same concerns,
> > maybe we should just move the zswap_invalidate() out of batches,
> > as Yosry mentioned above.
>
> As I replied in the previous email, I just want to understand the
> other side effects of the change better.
>
> To me, this patching is actually freeing the memory that does not
> require actual page IO write from zswap. Which means the memory is
> from some kind of cache. It would be interesting if we can not
> complicate the write back path further. Instead, we can drop those
> memories from the different cache if needed. I assume those caches are
> doing something useful in the common case. If not, we should have a
> patch to remove these caches instead.  Not sure how big a mess it will
> be to implement separate the write and drop caches.
>
> While you are here, I have some questions for you.
>
> Can you help me understand how much memory you can free from this
> patch? For example, are we talking about a few pages or a few GB?
>
> Where does the freed memory come from?
> If the memory comes from zswap entry struct. Due to the slab allocator
> fragmentation. It would take a lot of zswap entries to have meaningful
> memory reclaimed from the slab allocator.
>
> If the memory comes from the swap cached pages, that would be much
> more meaningful. But that is not what this patch is doing, right?
>
> Chris

It's my bad for putting two cases together. The memory released in both
cases comes from zswap entry struct and zswap compressed page.

The original intention of this patch is to solve the problem that
shrink_work() fails to reclaim memory in two situations.

For case (1),  the zswap_writeback_entry() will failed for the
__read_swap_cache_async return NULL because the swap has been
freed but cached in swap_slots_cache, so the memory come from
the zswap entry struct and compressed page.
Count = SWAP_BATCH * ncpu.
Solution: move the zswap_invalidate() out of batches, free it once the swap
count equal to 0.

For case (2),  the zswap_writeback_entry() will failed for !page_was_allocated
because zswap_load will have two copies of the same page in memory
  (compressed and uncompressed) after faulting in a page from zswap when
zswap_exclusive_loads disabled. The amount of memory is greater but depends
on the usage.

Why do we need  to release them?
Consider this scenario,there is a lot of data cached in memory and zswap,
hit the limit,and shrink_worker will fail. The new coming data will be written
directly to swap due to zswap_store failure. Should we free the last one
to store the latest one in zswap.

According to the previous discussion, the writeback is inevitable.
So I want to make zswap_exclusive_loads_enabled the default behavior
or make it the only way to do zswap loads. It only makes sense when
the page is read and no longer dirty. If the page is read frequently, it
should stay in cache rather than zswap. The benefit of doing this is
very small, i.e. two copies of the same page in memory.

Thanks again.
  
Nhat Pham Nov. 18, 2023, 6:43 p.m. UTC | #21
On Fri, Nov 17, 2023 at 8:46 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> Hi Chris, thanks for your time.
>
> >
> > On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He
> > <hezhongkun.hzk@bytedance.com> wrote:
> > > Hi Chris, thanks for your feedback.  I have the same concerns,
> > > maybe we should just move the zswap_invalidate() out of batches,
> > > as Yosry mentioned above.
> >
> > As I replied in the previous email, I just want to understand the
> > other side effects of the change better.
> >
> > To me, this patching is actually freeing the memory that does not
> > require actual page IO write from zswap. Which means the memory is
> > from some kind of cache. It would be interesting if we can not
> > complicate the write back path further. Instead, we can drop those
> > memories from the different cache if needed. I assume those caches are
> > doing something useful in the common case. If not, we should have a
> > patch to remove these caches instead.  Not sure how big a mess it will
> > be to implement separate the write and drop caches.
> >
> > While you are here, I have some questions for you.
> >
> > Can you help me understand how much memory you can free from this
> > patch? For example, are we talking about a few pages or a few GB?
> >
> > Where does the freed memory come from?
> > If the memory comes from zswap entry struct. Due to the slab allocator
> > fragmentation. It would take a lot of zswap entries to have meaningful
> > memory reclaimed from the slab allocator.
> >
> > If the memory comes from the swap cached pages, that would be much
> > more meaningful. But that is not what this patch is doing, right?
> >
> > Chris
>
> It's my bad for putting two cases together. The memory released in both
> cases comes from zswap entry struct and zswap compressed page.
>
> The original intention of this patch is to solve the problem that
> shrink_work() fails to reclaim memory in two situations.
>
> For case (1),  the zswap_writeback_entry() will failed for the
> __read_swap_cache_async return NULL because the swap has been
> freed but cached in swap_slots_cache, so the memory come from
> the zswap entry struct and compressed page.
> Count = SWAP_BATCH * ncpu.
> Solution: move the zswap_invalidate() out of batches, free it once the swap
> count equal to 0.
>
> For case (2),  the zswap_writeback_entry() will failed for !page_was_allocated
> because zswap_load will have two copies of the same page in memory
>   (compressed and uncompressed) after faulting in a page from zswap when
> zswap_exclusive_loads disabled. The amount of memory is greater but depends
> on the usage.
>
> Why do we need  to release them?
> Consider this scenario,there is a lot of data cached in memory and zswap,
> hit the limit,and shrink_worker will fail. The new coming data will be written
> directly to swap due to zswap_store failure. Should we free the last one
> to store the latest one in zswap.

Shameless plug: zswap will much less likely hit the limit (global or
cgroup) with the shrinker enabled ;) It will proactively reclaim the
objects way ahead of the limit.

It comes with its own can of worms, of course - it's unlikely to work
for all workloads in its current form, but perhaps worth experimenting
with/improved upon?


>
> According to the previous discussion, the writeback is inevitable.
> So I want to make zswap_exclusive_loads_enabled the default behavior
> or make it the only way to do zswap loads. It only makes sense when
> the page is read and no longer dirty. If the page is read frequently, it
> should stay in cache rather than zswap. The benefit of doing this is
> very small, i.e. two copies of the same page in memory.
>
> Thanks again.
  
Chris Li Nov. 19, 2023, 8:23 a.m. UTC | #22
Hi Zhongkun,


On Fri, Nov 17, 2023 at 5:46 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
> > Can you help me understand how much memory you can free from this
> > patch? For example, are we talking about a few pages or a few GB?
> >
> > Where does the freed memory come from?
> > If the memory comes from zswap entry struct. Due to the slab allocator
> > fragmentation. It would take a lot of zswap entries to have meaningful
> > memory reclaimed from the slab allocator.
> >
> > If the memory comes from the swap cached pages, that would be much
> > more meaningful. But that is not what this patch is doing, right?
> >
> > Chris
>
> It's my bad for putting two cases together. The memory released in both
> cases comes from zswap entry struct and zswap compressed page.

Thanks for the clarification. Keep in mind that memory freeing from
and zswap entry and zpool does not directly translate into page free.
If the page has other none freed zswap entry or zsmalloc usage, those
pages will not be free to the system. That is the fragmentation cost I
was talking about.

With this consideration, do you know many extra pages it can release
back to the system by this patch in your usage case? If the difference
is very small, it might not be worth the extra complexity to release
those.

> The original intention of this patch is to solve the problem that
> shrink_work() fails to reclaim memory in two situations.
>
> For case (1),  the zswap_writeback_entry() will failed for the
> __read_swap_cache_async return NULL because the swap has been
> freed but cached in swap_slots_cache, so the memory come from
> the zswap entry struct and compressed page.
In those cases, if we drop the swap_slots_cache, it will also free
those zswap entries and compressed pages (zpool), right?

> Count = SWAP_BATCH * ncpu.

That is the upper limit. Not all CPUs have swap batches fully loaded.

> Solution: move the zswap_invalidate() out of batches, free it once the swap
> count equal to 0.
Per previous discussion, this will have an impact on the
swap_slot_cache behavior.
We need some data points for cost benefit analysis.

> For case (2),  the zswap_writeback_entry() will failed for !page_was_allocated
> because zswap_load will have two copies of the same page in memory
>   (compressed and uncompressed) after faulting in a page from zswap when
> zswap_exclusive_loads disabled. The amount of memory is greater but depends
> on the usage.

That is basically disable the future swap out page IO write
optimization that skip the write if the page hasn't changed. If the
system are low on memory, that is justifiable. Again, it seems we can
have a pass to drop the compressed memory if the swap count is zero
(and mark page dirty).

>
> Why do we need  to release them?
> Consider this scenario,there is a lot of data cached in memory and zswap,
> hit the limit,and shrink_worker will fail. The new coming data will be written

Yes, the shrink_worker will need to allocate a page to store
uncompressed data for write back.

> directly to swap due to zswap_store failure. Should we free the last one
> to store the latest one in zswap.

The "last one" you mean the most recent zswap entry written into zswap?
Well, you need to allocate a page to write it out, that is an async process.
Shrink the zpool now is kind of too late already.

> According to the previous discussion, the writeback is inevitable.
> So I want to make zswap_exclusive_loads_enabled the default behavior
> or make it the only way to do zswap loads. It only makes sense when

We need some data point for how often we swap it out to zswap again,
where the zswap out write can be saved by using the existing compressed data.

It is totally possible this page IO write out optimization is not
worthwhile for zswap.
We need some data to support that claim.

> the page is read and no longer dirty. If the page is read frequently, it
> should stay in cache rather than zswap. The benefit of doing this is
> very small, i.e. two copies of the same page in memory.

If the benefit of doing this is very small, that seems to be the
argument against this patch?
Again we need some data points for cost and benefit analysis.

Chris
  
Chris Li Nov. 19, 2023, 8:29 a.m. UTC | #23
On Sat, Nov 18, 2023 at 10:44 AM Nhat Pham <nphamcs@gmail.com> wrote:

> > Why do we need  to release them?
> > Consider this scenario,there is a lot of data cached in memory and zswap,
> > hit the limit,and shrink_worker will fail. The new coming data will be written
> > directly to swap due to zswap_store failure. Should we free the last one
> > to store the latest one in zswap.
>
> Shameless plug: zswap will much less likely hit the limit (global or
> cgroup) with the shrinker enabled ;) It will proactively reclaim the
> objects way ahead of the limit.

I think that is actually the proper path, by the time it hits the
limit of zpool. That is already too late to shrink zpool to make room.
The shrinker should have guaranteed the amount of pages for write back
purposes to make forward progress.

> It comes with its own can of worms, of course - it's unlikely to work
> for all workloads in its current form, but perhaps worth experimenting
> with/improved upon?

Agree.

Chris
  
Zhongkun He Nov. 20, 2023, 2:42 a.m. UTC | #24
>
> Shameless plug: zswap will much less likely hit the limit (global or
> cgroup) with the shrinker enabled ;) It will proactively reclaim the
> objects way ahead of the limit.

Hi  Nhat,glad to hear from you.
Back to the beginning, the original intention of this patch is to solve
the problem that shrink_work() fails to reclaim memory in two situations.
The zswap_writeback_entry() will failed for !page_was_allocated
because zswap_load will have two copies of the same page in memory
  (compressed and uncompressed) after faulting in a page from zswap when
zswap_exclusive_loads disabled.

A simple test:
1): Turn off  zswap_exclusive_loads_enabled.
2): Run a read-only program and allocate more memory than the limit,
so the limit will be reached and shrinker will fail.

>
> It comes with its own can of worms, of course - it's unlikely to work
> for all workloads in its current form, but perhaps worth experimenting
> with/improved upon?
>
  
Zhongkun He Nov. 20, 2023, 3:16 a.m. UTC | #25
>
> Thanks for the clarification. Keep in mind that memory freeing from
> and zswap entry and zpool does not directly translate into page free.
> If the page has other none freed zswap entry or zsmalloc usage, those
> pages will not be free to the system. That is the fragmentation cost I
> was talking about.

Yes, it may need to be compacted.

>
> With this consideration, do you know many extra pages it can release
> back to the system by this patch in your usage case? If the difference
> is very small, it might not be worth the extra complexity to release
> those.
>

The original intention of this patch is to make shrink work properly,
not to release cache and related memory.

> > The original intention of this patch is to solve the problem that
> > shrink_work() fails to reclaim memory in two situations.
> >
> > For case (1),  the zswap_writeback_entry() will failed for the
> > __read_swap_cache_async return NULL because the swap has been
> > freed but cached in swap_slots_cache, so the memory come from
> > the zswap entry struct and compressed page.
> In those cases, if we drop the swap_slots_cache, it will also free
> those zswap entries and compressed pages (zpool), right?
>
> > Count = SWAP_BATCH * ncpu.
>
> That is the upper limit. Not all CPUs have swap batches fully loaded.

Yes.

>
> > Solution: move the zswap_invalidate() out of batches, free it once the swap
> > count equal to 0.
> Per previous discussion, this will have an impact on the
> swap_slot_cache behavior.
> We need some data points for cost benefit analysis.
>
> > For case (2),  the zswap_writeback_entry() will failed for !page_was_allocated
> > because zswap_load will have two copies of the same page in memory
> >   (compressed and uncompressed) after faulting in a page from zswap when
> > zswap_exclusive_loads disabled. The amount of memory is greater but depends
> > on the usage.
>
> That is basically disable the future swap out page IO write
> optimization that skip the write if the page hasn't changed. If the
> system are low on memory, that is justifiable. Again, it seems we can
> have a pass to drop the compressed memory if the swap count is zero
> (and mark page dirty).
>

OK.

> >
> > Why do we need  to release them?
> > Consider this scenario,there is a lot of data cached in memory and zswap,
> > hit the limit,and shrink_worker will fail. The new coming data will be written
>
> Yes, the shrink_worker will need to allocate a page to store
> uncompressed data for write back.
>
> > directly to swap due to zswap_store failure. Should we free the last one
> > to store the latest one in zswap.
>
> The "last one" you mean the most recent zswap entry written into zswap?
> Well, you need to allocate a page to write it out, that is an async process.
> Shrink the zpool now is kind of too late already.
>

The last zswap_entry in zswap_pool->lru.

> > According to the previous discussion, the writeback is inevitable.
> > So I want to make zswap_exclusive_loads_enabled the default behavior
> > or make it the only way to do zswap loads. It only makes sense when
>
> We need some data point for how often we swap it out to zswap again,
> where the zswap out write can be saved by using the existing compressed data.
>
> It is totally possible this page IO write out optimization is not
> worthwhile for zswap.
> We need some data to support that claim.
>

Got it. I will find it.

> > the page is read and no longer dirty. If the page is read frequently, it
> > should stay in cache rather than zswap. The benefit of doing this is
> > very small, i.e. two copies of the same page in memory.
>
> If the benefit of doing this is very small, that seems to be the
> argument against this patch?
> Again we need some data points for cost and benefit analysis.
>

Yes, it is the new idea to make zswap_exclusive_loads_enabled the
default behavior or make it the only way to do zswap loads.

> Chris
  
Huang, Ying Nov. 20, 2023, 3:18 a.m. UTC | #26
Chris Li <chriscli@google.com> writes:

> On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> Not bypassing the swap slot cache, just make the callbacks to
>> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> no longer used and is entering the swap slot cache (i.e. when
>> free_swap_slot() is called), instead of when draining the swap slot
>> cache (i.e. when swap_range_free() is called). For all parts of MM
>> outside of swap, the swap entry is freed when free_swap_slot() is
>> called. We don't free it immediately because of caching, but this
>> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>
> That will cancel the batching effect on the swap slot free, making the
> common case for  swapping  faults take longer to complete, righ?
> If I recall correctly, the uncharge is the expensive part of the swap
> slot free operation.
> I just want to figure out what we are trading off against. This is not
> one side wins all situations.

Per my understanding, we don't batch memcg uncharging in
swap_entry_free() now.  Although it's possible and may improve
performance.

--
Best Regards,
Huang, Ying
  
Chris Li Nov. 20, 2023, 5:31 a.m. UTC | #27
On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chriscli@google.com> writes:
>
> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > That will cancel the batching effect on the swap slot free, making the
> > common case for  swapping  faults take longer to complete, righ?
> > If I recall correctly, the uncharge is the expensive part of the swap
> > slot free operation.
> > I just want to figure out what we are trading off against. This is not
> > one side wins all situations.
>
> Per my understanding, we don't batch memcg uncharging in
> swap_entry_free() now.  Although it's possible and may improve
> performance.

swap_entry_free() does not do batching, it is at the caller level.

I just checked. The batching is done in free_swap_slot() is still
using swap slot cache and batching.
It uses swapcache_free_entries() to batch free the swap_slots. That is
where the uncharge happens per my understanding.

Chris
  
Huang, Ying Nov. 20, 2023, 5:39 a.m. UTC | #28
Chris Li <chrisl@kernel.org> writes:

> On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chriscli@google.com> writes:
>>
>> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>> > That will cancel the batching effect on the swap slot free, making the
>> > common case for  swapping  faults take longer to complete, righ?
>> > If I recall correctly, the uncharge is the expensive part of the swap
>> > slot free operation.
>> > I just want to figure out what we are trading off against. This is not
>> > one side wins all situations.
>>
>> Per my understanding, we don't batch memcg uncharging in
>> swap_entry_free() now.  Although it's possible and may improve
>> performance.
>
> swap_entry_free() does not do batching, it is at the caller level.
>
> I just checked. The batching is done in free_swap_slot() is still
> using swap slot cache and batching.
> It uses swapcache_free_entries() to batch free the swap_slots. That is
> where the uncharge happens per my understanding.

Per my understanding, memcg uncharging happens in

swapcache_free_entries()
  swap_entry_free()
    mem_cgroup_uncharge_swap()

The swap entries are uncharged one-by-one, not

acquire lock in memcg
uncharge all entries
release lock in memcg

--
Best Regards,
Huang, Ying
  
Chris Li Nov. 20, 2023, 5:51 a.m. UTC | #29
On Sun, Nov 19, 2023 at 9:41 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Per my understanding, memcg uncharging happens in
>
> swapcache_free_entries()
>   swap_entry_free()
>     mem_cgroup_uncharge_swap()
>
> The swap entries are uncharged one-by-one, not

Yes. That matches my understanding as well.
I think I am using the term "batching" very loosely. My bad and thanks
for the clarification.

I am referring to the fact that in most cases, the free_swap_slot()
does not perform uncharge.
It is grouped together with other entries to uncharge together inside
swapcache_free_entries(). Yes, the uncharge itself is done by a for
loop page by page. No batching in the for loop. BTW, Not all uncharges
can be batched, because they can come from different memcg.

Chris
  
Yosry Ahmed Nov. 20, 2023, 6:52 p.m. UTC | #30
On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chriscli@google.com> writes:
>
> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> Not bypassing the swap slot cache, just make the callbacks to
> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> no longer used and is entering the swap slot cache (i.e. when
> >> free_swap_slot() is called), instead of when draining the swap slot
> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> called. We don't free it immediately because of caching, but this
> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >
> > That will cancel the batching effect on the swap slot free, making the
> > common case for  swapping  faults take longer to complete, righ?
> > If I recall correctly, the uncharge is the expensive part of the swap
> > slot free operation.
> > I just want to figure out what we are trading off against. This is not
> > one side wins all situations.
>
> Per my understanding, we don't batch memcg uncharging in
> swap_entry_free() now.  Although it's possible and may improve
> performance.

Yes. It actually causes a long tail in swapin fault latency as Chris
discovered in our prod. I am wondering if doing the memcg uncharging
outside the slots cache will actually amortize the cost instead.

Regardless of memcg charging, which is more complicated, I think we
should at least move the call to zswap_invalidate() before the slots
cache. I would prefer that we move everything non-swapfile specific
outside the slots cache layer (zswap_invalidate(),
arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
mem_cgroup_uncharge_swap(), ..).  However, if some of those are
controversial, we can move some of them for now.

When draining free swap slots from the cache, swap_range_free() is
called with nr_entries == 1 anyway, so I can't see how any batching is
going on. If anything it should help amortize the cost.

>
> --
> Best Regards,
> Huang, Ying
  
Huang, Ying Nov. 21, 2023, 12:54 a.m. UTC | #31
Yosry Ahmed <yosryahmed@google.com> writes:

> On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chriscli@google.com> writes:
>>
>> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>> >>
>> >> Not bypassing the swap slot cache, just make the callbacks to
>> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> >> no longer used and is entering the swap slot cache (i.e. when
>> >> free_swap_slot() is called), instead of when draining the swap slot
>> >> cache (i.e. when swap_range_free() is called). For all parts of MM
>> >> outside of swap, the swap entry is freed when free_swap_slot() is
>> >> called. We don't free it immediately because of caching, but this
>> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>> >
>> > That will cancel the batching effect on the swap slot free, making the
>> > common case for  swapping  faults take longer to complete, righ?
>> > If I recall correctly, the uncharge is the expensive part of the swap
>> > slot free operation.
>> > I just want to figure out what we are trading off against. This is not
>> > one side wins all situations.
>>
>> Per my understanding, we don't batch memcg uncharging in
>> swap_entry_free() now.  Although it's possible and may improve
>> performance.
>
> Yes. It actually causes a long tail in swapin fault latency as Chris
> discovered in our prod. I am wondering if doing the memcg uncharging
> outside the slots cache will actually amortize the cost instead.
>
> Regardless of memcg charging, which is more complicated, I think we
> should at least move the call to zswap_invalidate() before the slots
> cache. I would prefer that we move everything non-swapfile specific
> outside the slots cache layer (zswap_invalidate(),
> arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
> mem_cgroup_uncharge_swap(), ..).  However, if some of those are
> controversial, we can move some of them for now.

That makes sense for me.

> When draining free swap slots from the cache, swap_range_free() is
> called with nr_entries == 1 anyway, so I can't see how any batching is
> going on. If anything it should help amortize the cost.

In swapcache_free_entries(), the sis->lock will be held to free multiple
swap slots via swap_info_get_cont() if possible.  This can reduce
sis->lock contention.

--
Best Regards,
Huang, Ying
  
Yosry Ahmed Nov. 21, 2023, 1:15 a.m. UTC | #32
On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chriscli@google.com> writes:
> >>
> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >> >>
> >> >> Not bypassing the swap slot cache, just make the callbacks to
> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> >> no longer used and is entering the swap slot cache (i.e. when
> >> >> free_swap_slot() is called), instead of when draining the swap slot
> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> >> called. We don't free it immediately because of caching, but this
> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >> >
> >> > That will cancel the batching effect on the swap slot free, making the
> >> > common case for  swapping  faults take longer to complete, righ?
> >> > If I recall correctly, the uncharge is the expensive part of the swap
> >> > slot free operation.
> >> > I just want to figure out what we are trading off against. This is not
> >> > one side wins all situations.
> >>
> >> Per my understanding, we don't batch memcg uncharging in
> >> swap_entry_free() now.  Although it's possible and may improve
> >> performance.
> >
> > Yes. It actually causes a long tail in swapin fault latency as Chris
> > discovered in our prod. I am wondering if doing the memcg uncharging
> > outside the slots cache will actually amortize the cost instead.
> >
> > Regardless of memcg charging, which is more complicated, I think we
> > should at least move the call to zswap_invalidate() before the slots
> > cache. I would prefer that we move everything non-swapfile specific
> > outside the slots cache layer (zswap_invalidate(),
> > arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
> > mem_cgroup_uncharge_swap(), ..).  However, if some of those are
> > controversial, we can move some of them for now.
>
> That makes sense for me.
>
> > When draining free swap slots from the cache, swap_range_free() is
> > called with nr_entries == 1 anyway, so I can't see how any batching is
> > going on. If anything it should help amortize the cost.
>
> In swapcache_free_entries(), the sis->lock will be held to free multiple
> swap slots via swap_info_get_cont() if possible.  This can reduce
> sis->lock contention.

Ah yes that's a good point. Since most of these callbacks don't
actually access sis, but use the swap entry value itself, I am
guessing the reason we need to hold the lock for all these callbacks
is to prevent swapoff and swapon reusing the same swap entry on a
different swap device, right?
  
Huang, Ying Nov. 21, 2023, 1:53 a.m. UTC | #33
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chriscli@google.com> writes:
>> >>
>> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>> >> >>
>> >> >> Not bypassing the swap slot cache, just make the callbacks to
>> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> >> >> no longer used and is entering the swap slot cache (i.e. when
>> >> >> free_swap_slot() is called), instead of when draining the swap slot
>> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
>> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
>> >> >> called. We don't free it immediately because of caching, but this
>> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>> >> >
>> >> > That will cancel the batching effect on the swap slot free, making the
>> >> > common case for  swapping  faults take longer to complete, righ?
>> >> > If I recall correctly, the uncharge is the expensive part of the swap
>> >> > slot free operation.
>> >> > I just want to figure out what we are trading off against. This is not
>> >> > one side wins all situations.
>> >>
>> >> Per my understanding, we don't batch memcg uncharging in
>> >> swap_entry_free() now.  Although it's possible and may improve
>> >> performance.
>> >
>> > Yes. It actually causes a long tail in swapin fault latency as Chris
>> > discovered in our prod. I am wondering if doing the memcg uncharging
>> > outside the slots cache will actually amortize the cost instead.
>> >
>> > Regardless of memcg charging, which is more complicated, I think we
>> > should at least move the call to zswap_invalidate() before the slots
>> > cache. I would prefer that we move everything non-swapfile specific
>> > outside the slots cache layer (zswap_invalidate(),
>> > arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
>> > mem_cgroup_uncharge_swap(), ..).  However, if some of those are
>> > controversial, we can move some of them for now.
>>
>> That makes sense for me.
>>
>> > When draining free swap slots from the cache, swap_range_free() is
>> > called with nr_entries == 1 anyway, so I can't see how any batching is
>> > going on. If anything it should help amortize the cost.
>>
>> In swapcache_free_entries(), the sis->lock will be held to free multiple
>> swap slots via swap_info_get_cont() if possible.  This can reduce
>> sis->lock contention.
>
> Ah yes that's a good point. Since most of these callbacks don't
> actually access sis, but use the swap entry value itself, I am
> guessing the reason we need to hold the lock for all these callbacks
> is to prevent swapoff and swapon reusing the same swap entry on a
> different swap device, right?

In,

swapcache_free_entries()
  swap_entry_free()
    swap_range_free()

Quite some sis fields will be accessed.

--
Best Regards,
Huang, Ying
  
Yosry Ahmed Nov. 21, 2023, 2:46 a.m. UTC | #34
On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Chris Li <chriscli@google.com> writes:
> >> >>
> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >> >> >>
> >> >> >> Not bypassing the swap slot cache, just make the callbacks to
> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> >> >> no longer used and is entering the swap slot cache (i.e. when
> >> >> >> free_swap_slot() is called), instead of when draining the swap slot
> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> >> >> called. We don't free it immediately because of caching, but this
> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >> >> >
> >> >> > That will cancel the batching effect on the swap slot free, making the
> >> >> > common case for  swapping  faults take longer to complete, righ?
> >> >> > If I recall correctly, the uncharge is the expensive part of the swap
> >> >> > slot free operation.
> >> >> > I just want to figure out what we are trading off against. This is not
> >> >> > one side wins all situations.
> >> >>
> >> >> Per my understanding, we don't batch memcg uncharging in
> >> >> swap_entry_free() now.  Although it's possible and may improve
> >> >> performance.
> >> >
> >> > Yes. It actually causes a long tail in swapin fault latency as Chris
> >> > discovered in our prod. I am wondering if doing the memcg uncharging
> >> > outside the slots cache will actually amortize the cost instead.
> >> >
> >> > Regardless of memcg charging, which is more complicated, I think we
> >> > should at least move the call to zswap_invalidate() before the slots
> >> > cache. I would prefer that we move everything non-swapfile specific
> >> > outside the slots cache layer (zswap_invalidate(),
> >> > arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
> >> > mem_cgroup_uncharge_swap(), ..).  However, if some of those are
> >> > controversial, we can move some of them for now.
> >>
> >> That makes sense for me.
> >>
> >> > When draining free swap slots from the cache, swap_range_free() is
> >> > called with nr_entries == 1 anyway, so I can't see how any batching is
> >> > going on. If anything it should help amortize the cost.
> >>
> >> In swapcache_free_entries(), the sis->lock will be held to free multiple
> >> swap slots via swap_info_get_cont() if possible.  This can reduce
> >> sis->lock contention.
> >
> > Ah yes that's a good point. Since most of these callbacks don't
> > actually access sis, but use the swap entry value itself, I am
> > guessing the reason we need to hold the lock for all these callbacks
> > is to prevent swapoff and swapon reusing the same swap entry on a
> > different swap device, right?
>
> In,
>
> swapcache_free_entries()
>   swap_entry_free()
>     swap_range_free()
>
> Quite some sis fields will be accessed.

I wasn't referring to this code. I was what's preventing us from
moving the callbacks I mentioned outside the lock (zswap_invalidate(),
arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
mem_cgroup_uncharge_swap(), ..). I think most or all of them don't
really access sis, but perhaps they need the lock to ensure the swap
entry value does not get reused?

>
> --
> Best Regards,
> Huang, Ying
  
Huang, Ying Nov. 21, 2023, 3:32 a.m. UTC | #35
Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Chris Li <chriscli@google.com> writes:
>> >> >>
>> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>> >> >> >>
>> >> >> >> Not bypassing the swap slot cache, just make the callbacks to
>> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> >> >> >> no longer used and is entering the swap slot cache (i.e. when
>> >> >> >> free_swap_slot() is called), instead of when draining the swap slot
>> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
>> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
>> >> >> >> called. We don't free it immediately because of caching, but this
>> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>> >> >> >
>> >> >> > That will cancel the batching effect on the swap slot free, making the
>> >> >> > common case for  swapping  faults take longer to complete, righ?
>> >> >> > If I recall correctly, the uncharge is the expensive part of the swap
>> >> >> > slot free operation.
>> >> >> > I just want to figure out what we are trading off against. This is not
>> >> >> > one side wins all situations.
>> >> >>
>> >> >> Per my understanding, we don't batch memcg uncharging in
>> >> >> swap_entry_free() now.  Although it's possible and may improve
>> >> >> performance.
>> >> >
>> >> > Yes. It actually causes a long tail in swapin fault latency as Chris
>> >> > discovered in our prod. I am wondering if doing the memcg uncharging
>> >> > outside the slots cache will actually amortize the cost instead.
>> >> >
>> >> > Regardless of memcg charging, which is more complicated, I think we
>> >> > should at least move the call to zswap_invalidate() before the slots
>> >> > cache. I would prefer that we move everything non-swapfile specific
>> >> > outside the slots cache layer (zswap_invalidate(),
>> >> > arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
>> >> > mem_cgroup_uncharge_swap(), ..).  However, if some of those are
>> >> > controversial, we can move some of them for now.
>> >>
>> >> That makes sense for me.
>> >>
>> >> > When draining free swap slots from the cache, swap_range_free() is
>> >> > called with nr_entries == 1 anyway, so I can't see how any batching is
>> >> > going on. If anything it should help amortize the cost.
>> >>
>> >> In swapcache_free_entries(), the sis->lock will be held to free multiple
>> >> swap slots via swap_info_get_cont() if possible.  This can reduce
>> >> sis->lock contention.
>> >
>> > Ah yes that's a good point. Since most of these callbacks don't
>> > actually access sis, but use the swap entry value itself, I am
>> > guessing the reason we need to hold the lock for all these callbacks
>> > is to prevent swapoff and swapon reusing the same swap entry on a
>> > different swap device, right?
>>
>> In,
>>
>> swapcache_free_entries()
>>   swap_entry_free()
>>     swap_range_free()
>>
>> Quite some sis fields will be accessed.
>
> I wasn't referring to this code. I was what's preventing us from
> moving the callbacks I mentioned outside the lock (zswap_invalidate(),
> arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
> mem_cgroup_uncharge_swap(), ..). I think most or all of them don't
> really access sis, but perhaps they need the lock to ensure the swap
> entry value does not get reused?

In fact, the swap entries to be freed by swapcache_free_entries() is in
a state that can not be freed by other path (including swapoff()).  It's
swap_map value is SWAP_HAS_CACHE, but we can not find folio in
swap_address_space().

To be honest, I don't know whether there are dependencies on sis->lock
in these callbacks.  You need to investigate them one by one.

--
Best Regards,
Huang, Ying
  
Yosry Ahmed Nov. 21, 2023, 3:37 a.m. UTC | #36
On Mon, Nov 20, 2023 at 7:35 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >>
> >> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Chris Li <chriscli@google.com> writes:
> >> >> >>
> >> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >> >> >> >>
> >> >> >> >> Not bypassing the swap slot cache, just make the callbacks to
> >> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> >> >> >> no longer used and is entering the swap slot cache (i.e. when
> >> >> >> >> free_swap_slot() is called), instead of when draining the swap slot
> >> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> >> >> >> called. We don't free it immediately because of caching, but this
> >> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >> >> >> >
> >> >> >> > That will cancel the batching effect on the swap slot free, making the
> >> >> >> > common case for  swapping  faults take longer to complete, righ?
> >> >> >> > If I recall correctly, the uncharge is the expensive part of the swap
> >> >> >> > slot free operation.
> >> >> >> > I just want to figure out what we are trading off against. This is not
> >> >> >> > one side wins all situations.
> >> >> >>
> >> >> >> Per my understanding, we don't batch memcg uncharging in
> >> >> >> swap_entry_free() now.  Although it's possible and may improve
> >> >> >> performance.
> >> >> >
> >> >> > Yes. It actually causes a long tail in swapin fault latency as Chris
> >> >> > discovered in our prod. I am wondering if doing the memcg uncharging
> >> >> > outside the slots cache will actually amortize the cost instead.
> >> >> >
> >> >> > Regardless of memcg charging, which is more complicated, I think we
> >> >> > should at least move the call to zswap_invalidate() before the slots
> >> >> > cache. I would prefer that we move everything non-swapfile specific
> >> >> > outside the slots cache layer (zswap_invalidate(),
> >> >> > arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
> >> >> > mem_cgroup_uncharge_swap(), ..).  However, if some of those are
> >> >> > controversial, we can move some of them for now.
> >> >>
> >> >> That makes sense for me.
> >> >>
> >> >> > When draining free swap slots from the cache, swap_range_free() is
> >> >> > called with nr_entries == 1 anyway, so I can't see how any batching is
> >> >> > going on. If anything it should help amortize the cost.
> >> >>
> >> >> In swapcache_free_entries(), the sis->lock will be held to free multiple
> >> >> swap slots via swap_info_get_cont() if possible.  This can reduce
> >> >> sis->lock contention.
> >> >
> >> > Ah yes that's a good point. Since most of these callbacks don't
> >> > actually access sis, but use the swap entry value itself, I am
> >> > guessing the reason we need to hold the lock for all these callbacks
> >> > is to prevent swapoff and swapon reusing the same swap entry on a
> >> > different swap device, right?
> >>
> >> In,
> >>
> >> swapcache_free_entries()
> >>   swap_entry_free()
> >>     swap_range_free()
> >>
> >> Quite some sis fields will be accessed.
> >
> > I wasn't referring to this code. I was what's preventing us from
> > moving the callbacks I mentioned outside the lock (zswap_invalidate(),
> > arch_swap_invalidate_page(),  clear_shadow_from_swap_cache(),
> > mem_cgroup_uncharge_swap(), ..). I think most or all of them don't
> > really access sis, but perhaps they need the lock to ensure the swap
> > entry value does not get reused?
>
> In fact, the swap entries to be freed by swapcache_free_entries() is in
> a state that can not be freed by other path (including swapoff()).  It's
> swap_map value is SWAP_HAS_CACHE, but we can not find folio in
> swap_address_space().

Interesting, it would be even nicer if we can move them outside the lock.

>
> To be honest, I don't know whether there are dependencies on sis->lock
> in these callbacks.  You need to investigate them one by one.

Yeah moving these callbacks outside batching and the lock is very
intriguing but needs to be done carefully. We don't need to do it all
at once, we can start with zswap_invalidate() and move them as we see
fit. It would be nice if the code is refactored such that it's clear
what callbacks are made immediately when the entry is no longer used
and what callbacks are made when the swap slot is being freed.

>
> --
> Best Regards,
> Huang, Ying
  

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 74411dfdad92..db95491bcdd5 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1063,11 +1063,12 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct mempolicy *mpol;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
+	struct swap_info_struct *si;
 	struct zpool *pool = zswap_find_zpool(entry);
 	bool page_was_allocated;
 	u8 *src, *tmp = NULL;
 	unsigned int dlen;
-	int ret;
+	int ret = 0;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
@@ -1082,16 +1083,30 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 	mpol = get_task_policy(current);
 	page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
 				NO_INTERLEAVE_INDEX, &page_was_allocated);
-	if (!page) {
+	if (!page)
 		ret = -ENOMEM;
-		goto fail;
-	}
-
-	/* Found an existing page, we raced with load/swapin */
-	if (!page_was_allocated) {
+	else if (!page_was_allocated) {
+		/* Found an existing page, we raced with load/swapin */
 		put_page(page);
 		ret = -EEXIST;
-		goto fail;
+	}
+
+	if (ret) {
+		si = get_swap_device(swpentry);
+		if (!si)
+			goto out;
+
+		/* Two cases to directly release zswap_entry.
+		 * 1) -ENOMEM,if the swpentry has been freed, but cached in
+		 * swap_slots_cache(no page and swapcount = 0).
+		 * 2) -EEXIST, option zswap_exclusive_loads_enabled disabled and
+		 * zswap_load completed(page in swap_cache and swapcount = 0).
+		 */
+		if (!swap_swapcount(si, swpentry))
+			ret = 0;
+
+		put_swap_device(si);
+		goto out;
 	}
 
 	/*
@@ -1106,7 +1121,7 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 		spin_unlock(&tree->lock);
 		delete_from_swap_cache(page_folio(page));
 		ret = -ENOMEM;
-		goto fail;
+		goto out;
 	}
 	spin_unlock(&tree->lock);
 
@@ -1151,7 +1166,7 @@  static int zswap_writeback_entry(struct zswap_entry *entry,
 
 	return ret;
 
-fail:
+out:
 	if (!zpool_can_sleep_mapped(pool))
 		kfree(tmp);