[RFC,1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

Message ID 20240209115950.3885183-2-chengming.zhou@linux.dev
State New
Series mm/zswap: fix LRU reclaim for zswap writeback folios

Commit Message

Chengming Zhou Feb. 9, 2024, 11:59 a.m. UTC
  From: Chengming Zhou <zhouchengming@bytedance.com>

All LRU move interfaces share a problem: they have no effect if the
folio is isolated from the LRU (sitting in a per-CPU batch or isolated
by the shrinker). Since the folio's LRU status can't be changed while
it is isolated, these interfaces mostly just clear the folio flag and
do nothing in this case.

In our case, a written-back and reclaimable folio won't be rotated to
the tail of the inactive list, since it is still in the per-CPU
lru_add batch. This can delay the reclaim of that folio and cause
other folios to be evicted instead.

This patch queues the reclaimable folio to the per-CPU rotate batch
even when !folio_test_lru(), on the assumption that it will likely be
handled after the lru_add batch has put the folio on the LRU list, so
the folio will be rotated to the tail successfully when the rotate
batch is drained.
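
For readers unfamiliar with the batching involved, here is a minimal
excerpt of the relevant per-CPU structures in mm/swap.c (simplified,
fields trimmed; roughly as of v6.8). A freshly added folio sits in
cpu_fbatches.lru_add and is not yet folio_test_lru(), while rotation
requests go through lru_rotate.fbatch:

/* The per-CPU rotate batch that folio_rotate_reclaimable() feeds. */
struct lru_rotate {
	local_lock_t lock;
	struct folio_batch fbatch;
};
static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/*
 * The per-CPU batches for LRU additions and other deferred moves; a
 * folio queued in lru_add has not reached an LRU list yet.
 */
struct cpu_fbatches {
	local_lock_t lock;
	struct folio_batch lru_add;
	/* ... lru_deactivate_file, lru_deactivate, lru_lazyfree ... */
};
static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
	.lock = INIT_LOCAL_LOCK(lock),
};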

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/swap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
  

Comments

Yosry Ahmed Feb. 13, 2024, 8:49 a.m. UTC | #1
On Fri, Feb 9, 2024 at 4:00 AM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> All LRU move interfaces share a problem: they have no effect if the
> folio is isolated from the LRU (sitting in a per-CPU batch or isolated
> by the shrinker). Since the folio's LRU status can't be changed while
> it is isolated, these interfaces mostly just clear the folio flag and
> do nothing in this case.
>
> In our case, a written-back and reclaimable folio won't be rotated to
> the tail of the inactive list, since it is still in the per-CPU
> lru_add batch. This can delay the reclaim of that folio and cause
> other folios to be evicted instead.
>
> This patch queues the reclaimable folio to the per-CPU rotate batch
> even when !folio_test_lru(), on the assumption that it will likely be
> handled after the lru_add batch has put the folio on the LRU list, so
> the folio will be rotated to the tail successfully when the rotate
> batch is drained.

It seems to me that it is entirely up to chance whether the lru_add
batch is handled first, especially since there may be problems if it
isn't.
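
For reference, a per-CPU batch is drained only when it fills up (or
when a large folio is added, or batching is disabled, or someone calls
lru_add_drain()), so the relative drain order of the lru_add and
lru_rotate batches really is unpredictable. Simplified from mm/swap.c
of that era:

static void folio_batch_add_and_move(struct folio_batch *fbatch,
		struct folio *folio, move_fn_t move_fn)
{
	/* Drain when the batch is full, the folio is large, or
	 * batching is disabled; otherwise just queue the folio. */
	if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
	    !lru_cache_disabled())
		return;
	folio_batch_move_lru(fbatch, move_fn);
}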

>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/swap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index cd8f0150ba3a..d304731e47cf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -236,7 +236,8 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch,
>
>  static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>  {
> -       if (!folio_test_unevictable(folio)) {
> +       if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
> +           !folio_test_unevictable(folio) && !folio_test_active(folio)) {

What are these conditions based on? I assume we check whether the
folio is locked because we no longer check that it is on the LRUs, so
we want to make sure no one else is operating on it, but I am not sure
that's enough.

>                 lruvec_del_folio(lruvec, folio);
>                 folio_clear_active(folio);
>                 lruvec_add_folio_tail(lruvec, folio);
> @@ -254,7 +255,7 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>  void folio_rotate_reclaimable(struct folio *folio)
>  {
>         if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
> -           !folio_test_unevictable(folio) && folio_test_lru(folio)) {
> +           !folio_test_unevictable(folio) && !folio_test_active(folio)) {

I am not sure it is safe to continue with a folio that is not on the
LRUs. It could be isolated for other purposes, and we would end up
adding it to an LRU nonetheless. Also, folio_batch_move_lru() will do
folio_test_clear_lru() and skip such folios anyway. There may also be
messed-up accounting; for example, lru_move_tail_fn() calls
lruvec_del_folio(), which does some bookkeeping, at least in the
MGLRU case.
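
(For reference, a simplified sketch of folio_batch_move_lru() at the
time, with some details elided; note the folio_test_clear_lru() check
that skips folios that are not on an LRU:)

static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
	int i;
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;

	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		/* block memcg migration while the folio moves between lru */
		if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
			continue;

		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
		move_fn(lruvec, folio);

		folio_set_lru(folio);
	}

	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
	folios_put(fbatch->folios, folio_batch_count(fbatch));
	folio_batch_reinit(fbatch);
}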

>                 struct folio_batch *fbatch;
>                 unsigned long flags;
>
> --
> 2.40.1
>
  
Chengming Zhou Feb. 14, 2024, 9:18 a.m. UTC | #2
On 2024/2/14 15:13, Yu Zhao wrote:
> On Fri, Feb 9, 2024 at 6:00 AM <chengming.zhou@linux.dev> wrote:
>>
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> All LRU move interfaces have a problem that it has no effect if the
>> folio is isolated from LRU (in cpu batch or isolated by shrinker).
>> Since it can't move/change folio LRU status when it's isolated, mostly
>> just clear the folio flag and do nothing in this case.
>>
>> In our case, a written back and reclaimable folio won't be rotated to
>> the tail of inactive list, since it's still in cpu lru_add batch. It
>> may cause the delayed reclaim of this folio and evict other folios.
>>
>> This patch changes to queue the reclaimable folio to cpu rotate batch
>> even when !folio_test_lru(), hoping it will likely be handled after
>> the lru_add batch which will put folio on the LRU list first, so
>> will be rotated to the tail successfully when handle rotate batch.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> I don't think the analysis is correct. IIRC, writeback from
> non-reclaim paths doesn't require isolation, and the reclaim path
> doesn't use struct folio_batch lru_add.

Ah, my bad, I forgot to mention the important context in the message:

This is not from the normal reclaim context; it's from the zswap
writeback reclaim context, which first sets the PG_reclaim flag and
then submits the async writeback io.

If the writeback io completes fast enough, folio_rotate_reclaimable()
will be called before the folio is put on the LRU list (it is still in
the local lru_add batch, so it's somewhat like being isolated too).
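
To make the ordering concrete, the race looks roughly like this (an
illustrative sketch based on the description above; the exact zswap
call chain is abbreviated):

/*
 * zswap writeback path                 io completion (may run before
 *                                      the lru_add batch is drained)
 * ----------------------------         ------------------------------
 * __read_swap_cache_async()
 *   -> folio_add_lru(folio)
 *      queued in the per-CPU
 *      lru_add batch, so the
 *      folio is !folio_test_lru()
 * folio_set_reclaim(folio)
 * __swap_writepage(folio, ...)  --->   folio_end_writeback(folio)
 *   async io submitted                   -> clears PG_reclaim
 *                                        -> folio_rotate_reclaimable()
 *                                           sees !folio_test_lru()
 *                                           and does nothing
 */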

> 
> Did you see any performance improvements with this patch? In general,
> this kind of patch should come with performance numbers to show that
> it really helps (not just in theory).

Right, there are some improvements; the numbers are in the cover
letter. But this solution is not good enough, it's just an RFC for
discussion. :)

                           mm-unstable-hot   zswap-lru-reclaim
real                       63.34             62.72            
user                       1063.20           1060.30          
sys                        272.04            256.14           
workingset_refault_anon    2103297.00        1788155.80       
workingset_refault_file    28638.20          39249.40         
workingset_activate_anon   746134.00         695435.40        
workingset_activate_file   4344.60           4255.80          
workingset_restore_anon    653163.80         605315.60        
workingset_restore_file    1079.00           883.00           
workingset_nodereclaim     0.00              0.00             
pgscan                     12971305.60       12730331.20      
pgscan_kswapd              0.00              0.00             
pgscan_direct              12971305.60       12730331.20      
pgscan_khugepaged          0.00              0.00           

> 
> My guess is that you are hitting this problem [1].
> 
> [1] https://lore.kernel.org/linux-mm/20221116013808.3995280-1-yuzhao@google.com/

Right, I just saw it; it's the same problem. The only difference is
that in your case the folio is isolated by the shrinker, while in my
case the folio is in the per-CPU lru_add batch. Either way the result
is the same: the folio can't be rotated successfully when writeback
completes.

Thanks.
  
Chengming Zhou Feb. 18, 2024, 2:52 a.m. UTC | #3
On 2024/2/15 15:06, Yu Zhao wrote:
> On Wed, Feb 14, 2024 at 4:18 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>
>> On 2024/2/14 15:13, Yu Zhao wrote:
>>> On Fri, Feb 9, 2024 at 6:00 AM <chengming.zhou@linux.dev> wrote:
>>>>
>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>
>>>> All LRU move interfaces share a problem: they have no effect if the
>>>> folio is isolated from the LRU (sitting in a per-CPU batch or
>>>> isolated by the shrinker). Since the folio's LRU status can't be
>>>> changed while it is isolated, these interfaces mostly just clear the
>>>> folio flag and do nothing in this case.
>>>>
>>>> In our case, a written-back and reclaimable folio won't be rotated to
>>>> the tail of the inactive list, since it is still in the per-CPU
>>>> lru_add batch. This can delay the reclaim of that folio and cause
>>>> other folios to be evicted instead.
>>>>
>>>> This patch queues the reclaimable folio to the per-CPU rotate batch
>>>> even when !folio_test_lru(), on the assumption that it will likely be
>>>> handled after the lru_add batch has put the folio on the LRU list, so
>>>> the folio will be rotated to the tail successfully when the rotate
>>>> batch is drained.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>
>>> I don't think the analysis is correct. IIRC, writeback from
>>> non-reclaim paths doesn't require isolation, and the reclaim path
>>> doesn't use struct folio_batch lru_add.
>>
>> Ah, my bad, I forgot to mention the important context in the message:
>>
>> This is not from the normal reclaim context; it's from the zswap
>> writeback reclaim context, which first sets the PG_reclaim flag and
>> then submits the async writeback io.
>>
>> If the writeback io completes fast enough, folio_rotate_reclaimable()
>> will be called before the folio is put on the LRU list (it is still in
>> the local lru_add batch, so it's somewhat like being isolated too).
>>
>>>
>>> Did you see any performance improvements with this patch? In general,
>>> this kind of patch should come with performance numbers to show that
>>> it really helps (not just in theory).
>>
>> Right, there are some improvements; the numbers are in the cover
>> letter. But this solution is not good enough, it's just an RFC for
>> discussion. :)
>>
>>                            mm-unstable-hot   zswap-lru-reclaim
>> real                       63.34             62.72
>> user                       1063.20           1060.30
>> sys                        272.04            256.14
>> workingset_refault_anon    2103297.00        1788155.80
>> workingset_refault_file    28638.20          39249.40
>> workingset_activate_anon   746134.00         695435.40
>> workingset_activate_file   4344.60           4255.80
>> workingset_restore_anon    653163.80         605315.60
>> workingset_restore_file    1079.00           883.00
>> workingset_nodereclaim     0.00              0.00
>> pgscan                     12971305.60       12730331.20
>> pgscan_kswapd              0.00              0.00
>> pgscan_direct              12971305.60       12730331.20
>> pgscan_khugepaged          0.00              0.00
>>
>>>
>>> My guess is that you are hitting this problem [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/20221116013808.3995280-1-yuzhao@google.com/
>>
>> Right, I just saw it; it's the same problem. The only difference is
>> that in your case the folio is isolated by the shrinker, while in my
>> case the folio is in the per-CPU lru_add batch. Either way the result
>> is the same: the folio can't be rotated successfully when writeback
>> completes.
> 
> In that case, a better solution would be to make lru_add add
> (_reclaim() && !_dirty() && !_writeback()) folios at the tail.
> (_rotate() needs to leave _reclaim() set if it fails to rotate.)

Right, that is a solution. But PG_readahead is an alias of PG_reclaim,
so I'm afraid this would rotate readahead folios to the inactive tail.
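
As a sketch of what that suggestion might look like (untested and
illustrative only; the condition and placement are assumptions, not
code from this series):

static void lru_add_fn(struct lruvec *lruvec, struct folio *folio)
{
	/* ... existing unevictable/active handling elided ... */

	/*
	 * Writeback finished while the folio sat in the lru_add batch:
	 * add it at the inactive tail so it is reclaimed first. Caveat
	 * from above: PG_reclaim aliases PG_readahead, so readahead
	 * folios could wrongly end up at the tail as well.
	 */
	if (folio_test_clear_reclaim(folio) && !folio_test_dirty(folio) &&
	    !folio_test_writeback(folio) && !folio_test_active(folio))
		lruvec_add_folio_tail(lruvec, folio);
	else
		lruvec_add_folio(lruvec, folio);

	trace_mm_lru_insertion(folio);
}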
  

Patch

diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..d304731e47cf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -236,7 +236,8 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch,
 
 static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
 {
-	if (!folio_test_unevictable(folio)) {
+	if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
+	    !folio_test_unevictable(folio) && !folio_test_active(folio)) {
 		lruvec_del_folio(lruvec, folio);
 		folio_clear_active(folio);
 		lruvec_add_folio_tail(lruvec, folio);
@@ -254,7 +255,7 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
 void folio_rotate_reclaimable(struct folio *folio)
 {
 	if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
-	    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
+	    !folio_test_unevictable(folio) && !folio_test_active(folio)) {
 		struct folio_batch *fbatch;
 		unsigned long flags;