[RFC,0/1] mm/zswap: fix LRU reclaim for zswap writeback folios

Message ID 20240209115950.3885183-1-chengming.zhou@linux.dev
Headers
Series mm/zswap: fix LRU reclaim for zswap writeback folios |

Message

Chengming Zhou Feb. 9, 2024, 11:59 a.m. UTC
  From: Chengming Zhou <zhouchengming@bytedance.com>

The memory shrinker of zswap is special: it has to first allocate more
memory (folio) to free less memory (compressed copy), later that folio
can be freed after written back. So it's better to evict these folios
before trying to reclaim other folios.

Here may come some problems of LRU management:

1. zswap_writeback_entry() reuses the __read_swap_cache_async(), which
   is normally only used in the swapin context. This may cause refault
   accounting and active/workingset information of the folio to be wrong.

   For example, zswap shrinker writeback try to reclaim the folio, but
   workingset_refault() mark it active to put it at head of active list.

2. folio_end_writeback() will check PG_reclaim flag which we did set
   in zswap_writeback_entry(), to try to rotate the folio to the tail
   of inactive list, to speed up its reclaim.

   But folio_rotate_reclaimable() won't work as we thought, actually
   all LRU move interfaces may don't work when the folio is isolated
   from LRU. (per-cpu add batch is somewhat like isolated from LRU)

   So when folio_end_writeback() calls folio_rotate_reclaimable(),
   it won't do nothing but just clear PG_reclaim flag if that folio
   is isolated (in per-cpu add batch or isolated by vmscan shrinker)

3. so the final result is the folio that has been written back and is
   expected to be evicted, but now is not at tail of inactive list.

   Meanwhile vmscan shrinker may try to evict other folios to cause
   more refaults. There is a report [1] of this problem.

We should handle these cases better. First of all, we should consider
when and where to put these folios on LRU.

1. after alloc: now we use folio_add_lru() to put folio on local batch,
   so it will be put at head of inactive/active list when batch drain.

2. after writeback: clear PG_reclaim and folio_rotate_reclaimable().
   - after add batch drain: rotate successfully to tail of inactive list
   - before add batch drain: do nothing since folio is not on LRU list

So these are two main time points we care about: the first is somewhat
correct IMHO since the folio is under writeback and has PG_reclaim set,
it may confuse shrinker if we put those to the tail of inactive list
too early. If we really want to put it to tail, we can easily introduce
another folio_add_lru_tail() to put on a new local lru_add_tail batch.

The second is where we need to improve, we should rotate it to tail
even when folio is in local batch. Since we can't manipulate folio LRU
status when it's isolated in local batch, an obvious fix is to use
folio flag to tell later lru_add_fn() where the folio should be put:
active or inactive, head or tail.

But the problem is that PG_readahead is the alias of PG_reclaim, we
can't put readahead folio to the tail of inactive list obviously.

So this patch changes folio_rotate_reclaimable() to queue to local
rotate batch even when !PG_lru at first, hoping that:

 - folio_end_writeback() finish on the same cpu with lru_add batch,
   so it must be handled after the lru_add batch, after which it will
   see PG_lru and successfully rotate it to tail of inactive list.

 - even folio_end_writeback() is not finished on the same cpu, there
   maybe a big chance that rotate batch is handled after add batch.

Testing kernel build in tmpfs with memory.max = 2GB.
(zswap shrinker and writeback enabled with one 50GB swapfile)

                           mm-unstable-hot   zswap-lru-reclaim
real                       63.34             62.72            
user                       1063.20           1060.30          
sys                        272.04            256.14           
workingset_refault_anon    2103297.00        1788155.80       
workingset_refault_file    28638.20          39249.40         
workingset_activate_anon   746134.00         695435.40        
workingset_activate_file   4344.60           4255.80          
workingset_restore_anon    653163.80         605315.60        
workingset_restore_file    1079.00           883.00           
workingset_nodereclaim     0.00              0.00             
pgscan                     12971305.60       12730331.20      
pgscan_kswapd              0.00              0.00             
pgscan_direct              12971305.60       12730331.20      
pgscan_khugepaged          0.00              0.00             

We can see that refault and sys cpu have some improvements.

As for the workingset_refault() caused by zswap writeback, maybe we
should remove it in zswap writeback case, but there are more pgscan
and some regression. I don't know why, so just leave it as it is.

This is RFC, any comment or discussion is welcome! Thanks!

[1] https://lore.kernel.org/all/20231024142706.195517-1-hezhongkun.hzk@bytedance.com/

Chengming Zhou (1):
  mm/swap: queue reclaimable folio to local rotate batch when
    !folio_test_lru()

 mm/swap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)