zsmalloc: move LRU update from zs_map_object() to zs_malloc()

Message ID 20230505185054.2417128-1-nphamcs@gmail.com
State New
Headers
Series zsmalloc: move LRU update from zs_map_object() to zs_malloc() |

Commit Message

Nhat Pham May 5, 2023, 6:50 p.m. UTC
  Under memory pressure, we sometimes observe the following crash:

[ 5694.832838] ------------[ cut here ]------------
[ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
[ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
[ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S                5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
[ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
[ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
[ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
[ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
[ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
[ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
[ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
[ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
[ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
[ 5695.136717] FS:  00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
[ 5695.152899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
[ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5695.207197] PKRU: 55555554
[ 5695.212602] Call Trace:
[ 5695.217486]  <TASK>
[ 5695.221674]  zs_map_object+0x91/0x270
[ 5695.229000]  zswap_frontswap_store+0x33d/0x870
[ 5695.237885]  ? do_raw_spin_lock+0x5d/0xa0
[ 5695.245899]  __frontswap_store+0x51/0xb0
[ 5695.253742]  swap_writepage+0x3c/0x60
[ 5695.261063]  shrink_page_list+0x738/0x1230
[ 5695.269255]  shrink_lruvec+0x5ec/0xcd0
[ 5695.276749]  ? shrink_slab+0x187/0x5f0
[ 5695.284240]  ? mem_cgroup_iter+0x6e/0x120
[ 5695.292255]  shrink_node+0x293/0x7b0
[ 5695.299402]  do_try_to_free_pages+0xea/0x550
[ 5695.307940]  try_to_free_pages+0x19a/0x490
[ 5695.316126]  __folio_alloc+0x19ff/0x3e40
[ 5695.323971]  ? __filemap_get_folio+0x8a/0x4e0
[ 5695.332681]  ? walk_component+0x2a8/0xb50
[ 5695.340697]  ? generic_permission+0xda/0x2a0
[ 5695.349231]  ? __filemap_get_folio+0x8a/0x4e0
[ 5695.357940]  ? walk_component+0x2a8/0xb50
[ 5695.365955]  vma_alloc_folio+0x10e/0x570
[ 5695.373796]  ? walk_component+0x52/0xb50
[ 5695.381634]  wp_page_copy+0x38c/0xc10
[ 5695.388953]  ? filename_lookup+0x378/0xbc0
[ 5695.397140]  handle_mm_fault+0x87f/0x1800
[ 5695.405157]  do_user_addr_fault+0x1bd/0x570
[ 5695.413520]  exc_page_fault+0x5d/0x110
[ 5695.421017]  asm_exc_page_fault+0x22/0x30

After some investigation, I have found the following issue: unlike other
zswap backends, zsmalloc performs the LRU list update at the object
mapping time, rather than when the slot for the object is allocated.
This deviation was discussed and agreed upon during the review process
of the zsmalloc writeback patch series:

https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/

Unfortunately, this introduces a subtle bug that occurs when there is a
concurrent store and reclaim, which interleave as follows:

zswap_frontswap_store()            shrink_worker()
  zs_malloc()                        zs_zpool_shrink()
    spin_lock(&pool->lock)             zs_reclaim_page()
    zspage = find_get_zspage()
    spin_unlock(&pool->lock)
                                         spin_lock(&pool->lock)
                                         zspage = list_first_entry(&pool->lru)
                                         list_del(&zspage->lru)
                                           zspage->lru.next = LIST_POISON1
                                           zspage->lru.prev = LIST_POISON2
                                         spin_unlock(&pool->lock)
  zs_map_object()
    spin_lock(&pool->lock)
    if (!list_empty(&zspage->lru))
      list_del(&zspage->lru)
        CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */

With the current upstream code, this issue rarely happens. zswap only
triggers writeback when the pool is already full, at which point all
further store attempts are short-circuited. This creates an implicit
pseudo-serialization between reclaim and store. I am working on a new
zswap shrinking mechanism, which makes interleaving reclaim and store
more likely, exposing this bug.

zbud and z3fold do not have this problem, because they perform the LRU
list update in the alloc function, while still holding the pool's lock.
This patch fixes the aforementioned bug by moving the LRU update back to
zs_malloc(), analogous to zbud and z3fold.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zsmalloc.c | 36 +++++++++---------------------------
 1 file changed, 9 insertions(+), 27 deletions(-)
  

Comments

Andrew Morton May 5, 2023, 7:14 p.m. UTC | #1
On Fri,  5 May 2023 11:50:54 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> Under memory pressure, we sometimes observe the following crash:
> 
> [ 5694.832838] ------------[ cut here ]------------
> [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
> 
> ...
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Am I correct in believing that we should backport this fix into earlier
kernels?

And that a suitable Fixes: target is 64f768c6b32e ("zsmalloc: add a LRU
to zs_pool to keep track of zspages in LRU order)"?
  
Nhat Pham May 5, 2023, 7:26 p.m. UTC | #2
On Fri, May 5, 2023 at 12:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri,  5 May 2023 11:50:54 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
>
> > Under memory pressure, we sometimes observe the following crash:
> >
> > [ 5694.832838] ------------[ cut here ]------------
> > [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
> >
> > ...
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
>
> Am I correct in believing that we should backport this fix into earlier
> kernels?
>
> And that a suitable Fixes: target is 64f768c6b32e ("zsmalloc: add a LRU
> to zs_pool to keep track of zspages in LRU order)"?
>

Hi Andrew,

For both questions - yes I believe so too!
  
Johannes Weiner May 5, 2023, 7:29 p.m. UTC | #3
On Fri, May 05, 2023 at 12:14:01PM -0700, Andrew Morton wrote:
> On Fri,  5 May 2023 11:50:54 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
> 
> > Under memory pressure, we sometimes observe the following crash:
> > 
> > [ 5694.832838] ------------[ cut here ]------------
> > [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
> > 
> > ...
> > 
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> 
> Am I correct in believing that we should backport this fix into earlier
> kernels?
>
> And that a suitable Fixes: target is 64f768c6b32e ("zsmalloc: add a LRU
> to zs_pool to keep track of zspages in LRU order)"?

Yes, I agree. The tags in the mm-unstable commit look right to me.

Thanks
  
Sergey Senozhatsky May 6, 2023, 3:01 a.m. UTC | #4
On (23/05/05 11:50), Nhat Pham wrote:
> Under memory pressure, we sometimes observe the following crash:
> 
> [ 5694.832838] ------------[ cut here ]------------
> [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
> [ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
> [ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S                5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
> [ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
> [ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
> [ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
> [ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
> [ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
> [ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
> [ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
> [ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
> [ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
> [ 5695.136717] FS:  00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
> [ 5695.152899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
> [ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 5695.207197] PKRU: 55555554
> [ 5695.212602] Call Trace:
> [ 5695.217486]  <TASK>
> [ 5695.221674]  zs_map_object+0x91/0x270
> [ 5695.229000]  zswap_frontswap_store+0x33d/0x870
> [ 5695.237885]  ? do_raw_spin_lock+0x5d/0xa0
> [ 5695.245899]  __frontswap_store+0x51/0xb0
> [ 5695.253742]  swap_writepage+0x3c/0x60
> [ 5695.261063]  shrink_page_list+0x738/0x1230
> [ 5695.269255]  shrink_lruvec+0x5ec/0xcd0
> [ 5695.276749]  ? shrink_slab+0x187/0x5f0
> [ 5695.284240]  ? mem_cgroup_iter+0x6e/0x120
> [ 5695.292255]  shrink_node+0x293/0x7b0
> [ 5695.299402]  do_try_to_free_pages+0xea/0x550
> [ 5695.307940]  try_to_free_pages+0x19a/0x490
> [ 5695.316126]  __folio_alloc+0x19ff/0x3e40
> [ 5695.323971]  ? __filemap_get_folio+0x8a/0x4e0
> [ 5695.332681]  ? walk_component+0x2a8/0xb50
> [ 5695.340697]  ? generic_permission+0xda/0x2a0
> [ 5695.349231]  ? __filemap_get_folio+0x8a/0x4e0
> [ 5695.357940]  ? walk_component+0x2a8/0xb50
> [ 5695.365955]  vma_alloc_folio+0x10e/0x570
> [ 5695.373796]  ? walk_component+0x52/0xb50
> [ 5695.381634]  wp_page_copy+0x38c/0xc10
> [ 5695.388953]  ? filename_lookup+0x378/0xbc0
> [ 5695.397140]  handle_mm_fault+0x87f/0x1800
> [ 5695.405157]  do_user_addr_fault+0x1bd/0x570
> [ 5695.413520]  exc_page_fault+0x5d/0x110
> [ 5695.421017]  asm_exc_page_fault+0x22/0x30
> 
> After some investigation, I have found the following issue: unlike other
> zswap backends, zsmalloc performs the LRU list update at the object
> mapping time, rather than when the slot for the object is allocated.
> This deviation was discussed and agreed upon during the review process
> of the zsmalloc writeback patch series:
> 
> https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/
> 
> Unfortunately, this introduces a subtle bug that occurs when there is a
> concurrent store and reclaim, which interleave as follows:
> 
> zswap_frontswap_store()            shrink_worker()
>   zs_malloc()                        zs_zpool_shrink()
>     spin_lock(&pool->lock)             zs_reclaim_page()
>     zspage = find_get_zspage()
>     spin_unlock(&pool->lock)
>                                          spin_lock(&pool->lock)
>                                          zspage = list_first_entry(&pool->lru)
>                                          list_del(&zspage->lru)
>                                            zspage->lru.next = LIST_POISON1
>                                            zspage->lru.prev = LIST_POISON2

					Will list_del_init() there do the trick?

>                                          spin_unlock(&pool->lock)
>   zs_map_object()
>     spin_lock(&pool->lock)
>     if (!list_empty(&zspage->lru))
>       list_del(&zspage->lru)

	list_del_init()

>         CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */
  
Johannes Weiner May 8, 2023, 2:06 p.m. UTC | #5
On Sat, May 06, 2023 at 12:01:40PM +0900, Sergey Senozhatsky wrote:
> On (23/05/05 11:50), Nhat Pham wrote:
> > Under memory pressure, we sometimes observe the following crash:
> > 
> > [ 5694.832838] ------------[ cut here ]------------
> > [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
> > [ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
> > [ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S                5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
> > [ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
> > [ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
> > [ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
> > [ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
> > [ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
> > [ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
> > [ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
> > [ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
> > [ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
> > [ 5695.136717] FS:  00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
> > [ 5695.152899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
> > [ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 5695.207197] PKRU: 55555554
> > [ 5695.212602] Call Trace:
> > [ 5695.217486]  <TASK>
> > [ 5695.221674]  zs_map_object+0x91/0x270
> > [ 5695.229000]  zswap_frontswap_store+0x33d/0x870
> > [ 5695.237885]  ? do_raw_spin_lock+0x5d/0xa0
> > [ 5695.245899]  __frontswap_store+0x51/0xb0
> > [ 5695.253742]  swap_writepage+0x3c/0x60
> > [ 5695.261063]  shrink_page_list+0x738/0x1230
> > [ 5695.269255]  shrink_lruvec+0x5ec/0xcd0
> > [ 5695.276749]  ? shrink_slab+0x187/0x5f0
> > [ 5695.284240]  ? mem_cgroup_iter+0x6e/0x120
> > [ 5695.292255]  shrink_node+0x293/0x7b0
> > [ 5695.299402]  do_try_to_free_pages+0xea/0x550
> > [ 5695.307940]  try_to_free_pages+0x19a/0x490
> > [ 5695.316126]  __folio_alloc+0x19ff/0x3e40
> > [ 5695.323971]  ? __filemap_get_folio+0x8a/0x4e0
> > [ 5695.332681]  ? walk_component+0x2a8/0xb50
> > [ 5695.340697]  ? generic_permission+0xda/0x2a0
> > [ 5695.349231]  ? __filemap_get_folio+0x8a/0x4e0
> > [ 5695.357940]  ? walk_component+0x2a8/0xb50
> > [ 5695.365955]  vma_alloc_folio+0x10e/0x570
> > [ 5695.373796]  ? walk_component+0x52/0xb50
> > [ 5695.381634]  wp_page_copy+0x38c/0xc10
> > [ 5695.388953]  ? filename_lookup+0x378/0xbc0
> > [ 5695.397140]  handle_mm_fault+0x87f/0x1800
> > [ 5695.405157]  do_user_addr_fault+0x1bd/0x570
> > [ 5695.413520]  exc_page_fault+0x5d/0x110
> > [ 5695.421017]  asm_exc_page_fault+0x22/0x30
> > 
> > After some investigation, I have found the following issue: unlike other
> > zswap backends, zsmalloc performs the LRU list update at the object
> > mapping time, rather than when the slot for the object is allocated.
> > This deviation was discussed and agreed upon during the review process
> > of the zsmalloc writeback patch series:
> > 
> > https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/
> > 
> > Unfortunately, this introduces a subtle bug that occurs when there is a
> > concurrent store and reclaim, which interleave as follows:
> > 
> > zswap_frontswap_store()            shrink_worker()
> >   zs_malloc()                        zs_zpool_shrink()
> >     spin_lock(&pool->lock)             zs_reclaim_page()
> >     zspage = find_get_zspage()
> >     spin_unlock(&pool->lock)
> >                                          spin_lock(&pool->lock)
> >                                          zspage = list_first_entry(&pool->lru)
> >                                          list_del(&zspage->lru)
> >                                            zspage->lru.next = LIST_POISON1
> >                                            zspage->lru.prev = LIST_POISON2
> 
> 					Will list_del_init() there do the trick?
> 
> >                                          spin_unlock(&pool->lock)
> >   zs_map_object()
> >     spin_lock(&pool->lock)
> >     if (!list_empty(&zspage->lru))
> >       list_del(&zspage->lru)
> 
> 	list_del_init()

The deeper bug here is that zs_map_object() tries to add the page to
the LRU list while the shrinker has it isolated for reclaim. This is
way too sutble and error prone. Even if it worked now, it'll cause
corruption issues down the line.

For example, Nhat is adding a secondary entry point to reclaim.
Reclaim expects that a page that's on the LRU is also on the fullness
list, so this would lead to a double remove_zspage() and BUG_ON().

This patch doesn't just fix the crash, it eliminates the deeper LRU
isolation issue and makes the code more robust and simple.
  
Nhat Pham May 8, 2023, 4 p.m. UTC | #6
On Mon, May 8, 2023 at 7:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sat, May 06, 2023 at 12:01:40PM +0900, Sergey Senozhatsky wrote:
> > On (23/05/05 11:50), Nhat Pham wrote:
> > > Under memory pressure, we sometimes observe the following crash:
> > >
> > > [ 5694.832838] ------------[ cut here ]------------
> > > [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
> > > [ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
> > > [ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S                5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
> > > [ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
> > > [ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
> > > [ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
> > > [ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
> > > [ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
> > > [ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
> > > [ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
> > > [ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
> > > [ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
> > > [ 5695.136717] FS:  00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
> > > [ 5695.152899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
> > > [ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [ 5695.207197] PKRU: 55555554
> > > [ 5695.212602] Call Trace:
> > > [ 5695.217486]  <TASK>
> > > [ 5695.221674]  zs_map_object+0x91/0x270
> > > [ 5695.229000]  zswap_frontswap_store+0x33d/0x870
> > > [ 5695.237885]  ? do_raw_spin_lock+0x5d/0xa0
> > > [ 5695.245899]  __frontswap_store+0x51/0xb0
> > > [ 5695.253742]  swap_writepage+0x3c/0x60
> > > [ 5695.261063]  shrink_page_list+0x738/0x1230
> > > [ 5695.269255]  shrink_lruvec+0x5ec/0xcd0
> > > [ 5695.276749]  ? shrink_slab+0x187/0x5f0
> > > [ 5695.284240]  ? mem_cgroup_iter+0x6e/0x120
> > > [ 5695.292255]  shrink_node+0x293/0x7b0
> > > [ 5695.299402]  do_try_to_free_pages+0xea/0x550
> > > [ 5695.307940]  try_to_free_pages+0x19a/0x490
> > > [ 5695.316126]  __folio_alloc+0x19ff/0x3e40
> > > [ 5695.323971]  ? __filemap_get_folio+0x8a/0x4e0
> > > [ 5695.332681]  ? walk_component+0x2a8/0xb50
> > > [ 5695.340697]  ? generic_permission+0xda/0x2a0
> > > [ 5695.349231]  ? __filemap_get_folio+0x8a/0x4e0
> > > [ 5695.357940]  ? walk_component+0x2a8/0xb50
> > > [ 5695.365955]  vma_alloc_folio+0x10e/0x570
> > > [ 5695.373796]  ? walk_component+0x52/0xb50
> > > [ 5695.381634]  wp_page_copy+0x38c/0xc10
> > > [ 5695.388953]  ? filename_lookup+0x378/0xbc0
> > > [ 5695.397140]  handle_mm_fault+0x87f/0x1800
> > > [ 5695.405157]  do_user_addr_fault+0x1bd/0x570
> > > [ 5695.413520]  exc_page_fault+0x5d/0x110
> > > [ 5695.421017]  asm_exc_page_fault+0x22/0x30
> > >
> > > After some investigation, I have found the following issue: unlike other
> > > zswap backends, zsmalloc performs the LRU list update at the object
> > > mapping time, rather than when the slot for the object is allocated.
> > > This deviation was discussed and agreed upon during the review process
> > > of the zsmalloc writeback patch series:
> > >
> > > https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/
> > >
> > > Unfortunately, this introduces a subtle bug that occurs when there is a
> > > concurrent store and reclaim, which interleave as follows:
> > >
> > > zswap_frontswap_store()            shrink_worker()
> > >   zs_malloc()                        zs_zpool_shrink()
> > >     spin_lock(&pool->lock)             zs_reclaim_page()
> > >     zspage = find_get_zspage()
> > >     spin_unlock(&pool->lock)
> > >                                          spin_lock(&pool->lock)
> > >                                          zspage = list_first_entry(&pool->lru)
> > >                                          list_del(&zspage->lru)
> > >                                            zspage->lru.next = LIST_POISON1
> > >                                            zspage->lru.prev = LIST_POISON2
> >
> >                                       Will list_del_init() there do the trick?
> >
> > >                                          spin_unlock(&pool->lock)
> > >   zs_map_object()
> > >     spin_lock(&pool->lock)
> > >     if (!list_empty(&zspage->lru))
> > >       list_del(&zspage->lru)
> >
> >       list_del_init()
>
> The deeper bug here is that zs_map_object() tries to add the page to
> the LRU list while the shrinker has it isolated for reclaim. This is
> way too sutble and error prone. Even if it worked now, it'll cause
> corruption issues down the line.
>
> For example, Nhat is adding a secondary entry point to reclaim.
> Reclaim expects that a page that's on the LRU is also on the fullness
> list, so this would lead to a double remove_zspage() and BUG_ON().
>
> This patch doesn't just fix the crash, it eliminates the deeper LRU
> isolation issue and makes the code more robust and simple.

I agree. IMO, less unnecessary concurrent interaction is always a
win for developers' and maintainers' cognitive load.

As a side benefit - this also gets rid of the inelegant check
(mm == ZS_MM_WO). The fact that we had to include a
a multi-paragraph explanation for a 3-line piece of code
should have been a red flag.
  
Sergey Senozhatsky May 9, 2023, 3 a.m. UTC | #7
On (23/05/08 09:00), Nhat Pham wrote:
> > The deeper bug here is that zs_map_object() tries to add the page to
> > the LRU list while the shrinker has it isolated for reclaim. This is
> > way too sutble and error prone. Even if it worked now, it'll cause
> > corruption issues down the line.
> >
> > For example, Nhat is adding a secondary entry point to reclaim.
> > Reclaim expects that a page that's on the LRU is also on the fullness
> > list, so this would lead to a double remove_zspage() and BUG_ON().
> >
> > This patch doesn't just fix the crash, it eliminates the deeper LRU
> > isolation issue and makes the code more robust and simple.
> 
> I agree. IMO, less unnecessary concurrent interaction is always a
> win for developers' and maintainers' cognitive load.

Thanks for all the explanations.

> As a side benefit - this also gets rid of the inelegant check
> (mm == ZS_MM_WO). The fact that we had to include a
> a multi-paragraph explanation for a 3-line piece of code
> should have been a red flag.

Minchan had some strong opinion on that, so we need to hear from him
before we decide how do we fix it.
  
Johannes Weiner May 9, 2023, 5:44 p.m. UTC | #8
On Tue, May 09, 2023 at 12:00:30PM +0900, Sergey Senozhatsky wrote:
> On (23/05/08 09:00), Nhat Pham wrote:
> > > The deeper bug here is that zs_map_object() tries to add the page to
> > > the LRU list while the shrinker has it isolated for reclaim. This is
> > > way too sutble and error prone. Even if it worked now, it'll cause
> > > corruption issues down the line.
> > >
> > > For example, Nhat is adding a secondary entry point to reclaim.
> > > Reclaim expects that a page that's on the LRU is also on the fullness
> > > list, so this would lead to a double remove_zspage() and BUG_ON().
> > >
> > > This patch doesn't just fix the crash, it eliminates the deeper LRU
> > > isolation issue and makes the code more robust and simple.
> > 
> > I agree. IMO, less unnecessary concurrent interaction is always a
> > win for developers' and maintainers' cognitive load.
> 
> Thanks for all the explanations.
> 
> > As a side benefit - this also gets rid of the inelegant check
> > (mm == ZS_MM_WO). The fact that we had to include a
> > a multi-paragraph explanation for a 3-line piece of code
> > should have been a red flag.
> 
> Minchan had some strong opinion on that, so we need to hear from him
> before we decide how do we fix it.

I'd be happy if he could validate the fix. But this fixes a crash, so
the clock is ticking.

I will also say, his was a design preference. One we agreed to only
very reluctantly: https://lore.kernel.org/lkml/Y3f6habiVuV9LMcu@google.com/

Now we have a crash that is a direct result of it, and which cost us
(and apparently is still costing us) time and energy to resolve.

Unless somebody surfaces a real technical problem with the fix, I'd
say let's do it our way this time.
  
Minchan Kim May 9, 2023, 6:20 p.m. UTC | #9
Hi Folks,

On Tue, May 09, 2023 at 01:44:01PM -0400, Johannes Weiner wrote:
> On Tue, May 09, 2023 at 12:00:30PM +0900, Sergey Senozhatsky wrote:
> > On (23/05/08 09:00), Nhat Pham wrote:
> > > > The deeper bug here is that zs_map_object() tries to add the page to
> > > > the LRU list while the shrinker has it isolated for reclaim. This is
> > > > way too sutble and error prone. Even if it worked now, it'll cause
> > > > corruption issues down the line.
> > > >
> > > > For example, Nhat is adding a secondary entry point to reclaim.
> > > > Reclaim expects that a page that's on the LRU is also on the fullness
> > > > list, so this would lead to a double remove_zspage() and BUG_ON().
> > > >
> > > > This patch doesn't just fix the crash, it eliminates the deeper LRU
> > > > isolation issue and makes the code more robust and simple.
> > > 
> > > I agree. IMO, less unnecessary concurrent interaction is always a
> > > win for developers' and maintainers' cognitive load.
> > 
> > Thanks for all the explanations.
> > 
> > > As a side benefit - this also gets rid of the inelegant check
> > > (mm == ZS_MM_WO). The fact that we had to include a
> > > a multi-paragraph explanation for a 3-line piece of code
> > > should have been a red flag.
> > 
> > Minchan had some strong opinion on that, so we need to hear from him
> > before we decide how do we fix it.
> 
> I'd be happy if he could validate the fix. But this fixes a crash, so
> the clock is ticking.
> 
> I will also say, his was a design preference. One we agreed to only
> very reluctantly: https://lore.kernel.org/lkml/Y3f6habiVuV9LMcu@google.com/
> 
> Now we have a crash that is a direct result of it, and which cost us
> (and apparently is still costing us) time and energy to resolve.
> 
> Unless somebody surfaces a real technical problem with the fix, I'd
> say let's do it our way this time.
> 

Sorry for being too late to review. The reason I insisted on it was
I overlookeded the bug and thought it was trivial change but better
semantic since zsmalloc provides separate API between allocation and
access unlike other allocators. Now, Nhat and Johannes provided it's
more error prone, I am totally fine with this fix and will live it
until the LRU writeback will move out of allocator.

Sorry for wasting your time to hunt this bug down and thank you for fix!

Acked-by: Minchan Kim <minchan@kernel.org>
  
Johannes Weiner May 9, 2023, 7:24 p.m. UTC | #10
On Tue, May 09, 2023 at 11:20:02AM -0700, Minchan Kim wrote:
> Hi Folks,
> 
> On Tue, May 09, 2023 at 01:44:01PM -0400, Johannes Weiner wrote:
> > On Tue, May 09, 2023 at 12:00:30PM +0900, Sergey Senozhatsky wrote:
> > > On (23/05/08 09:00), Nhat Pham wrote:
> > > > > The deeper bug here is that zs_map_object() tries to add the page to
> > > > > the LRU list while the shrinker has it isolated for reclaim. This is
> > > > > way too sutble and error prone. Even if it worked now, it'll cause
> > > > > corruption issues down the line.
> > > > >
> > > > > For example, Nhat is adding a secondary entry point to reclaim.
> > > > > Reclaim expects that a page that's on the LRU is also on the fullness
> > > > > list, so this would lead to a double remove_zspage() and BUG_ON().
> > > > >
> > > > > This patch doesn't just fix the crash, it eliminates the deeper LRU
> > > > > isolation issue and makes the code more robust and simple.
> > > > 
> > > > I agree. IMO, less unnecessary concurrent interaction is always a
> > > > win for developers' and maintainers' cognitive load.
> > > 
> > > Thanks for all the explanations.
> > > 
> > > > As a side benefit - this also gets rid of the inelegant check
> > > > (mm == ZS_MM_WO). The fact that we had to include a
> > > > a multi-paragraph explanation for a 3-line piece of code
> > > > should have been a red flag.
> > > 
> > > Minchan had some strong opinion on that, so we need to hear from him
> > > before we decide how do we fix it.
> > 
> > I'd be happy if he could validate the fix. But this fixes a crash, so
> > the clock is ticking.
> > 
> > I will also say, his was a design preference. One we agreed to only
> > very reluctantly: https://lore.kernel.org/lkml/Y3f6habiVuV9LMcu@google.com/
> > 
> > Now we have a crash that is a direct result of it, and which cost us
> > (and apparently is still costing us) time and energy to resolve.
> > 
> > Unless somebody surfaces a real technical problem with the fix, I'd
> > say let's do it our way this time.
> > 
> 
> Sorry for being too late to review. The reason I insisted on it was
> I overlookeded the bug and thought it was trivial change but better
> semantic since zsmalloc provides separate API between allocation and
> access unlike other allocators. Now, Nhat and Johannes provided it's
> more error prone, I am totally fine with this fix and will live it
> until the LRU writeback will move out of allocator.
> 
> Sorry for wasting your time to hunt this bug down and thank you for fix!
> 
> Acked-by: Minchan Kim <minchan@kernel.org>

Thanks Minchan!

Domenico is working on the LRU refactor right now, and should have
patches for review soon. This will indeed get rid of all the zsmalloc
warts and make our lives much easier going forward!
  
Nhat Pham May 9, 2023, 10:04 p.m. UTC | #11
On Tue, May 9, 2023 at 12:24 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, May 09, 2023 at 11:20:02AM -0700, Minchan Kim wrote:
> > Hi Folks,
> >
> > On Tue, May 09, 2023 at 01:44:01PM -0400, Johannes Weiner wrote:
> > > On Tue, May 09, 2023 at 12:00:30PM +0900, Sergey Senozhatsky wrote:
> > > > On (23/05/08 09:00), Nhat Pham wrote:
> > > > > > The deeper bug here is that zs_map_object() tries to add the page to
> > > > > > the LRU list while the shrinker has it isolated for reclaim. This is
> > > > > > way too sutble and error prone. Even if it worked now, it'll cause
> > > > > > corruption issues down the line.
> > > > > >
> > > > > > For example, Nhat is adding a secondary entry point to reclaim.
> > > > > > Reclaim expects that a page that's on the LRU is also on the fullness
> > > > > > list, so this would lead to a double remove_zspage() and BUG_ON().
> > > > > >
> > > > > > This patch doesn't just fix the crash, it eliminates the deeper LRU
> > > > > > isolation issue and makes the code more robust and simple.
> > > > >
> > > > > I agree. IMO, less unnecessary concurrent interaction is always a
> > > > > win for developers' and maintainers' cognitive load.
> > > >
> > > > Thanks for all the explanations.
> > > >
> > > > > As a side benefit - this also gets rid of the inelegant check
> > > > > (mm == ZS_MM_WO). The fact that we had to include a
> > > > > a multi-paragraph explanation for a 3-line piece of code
> > > > > should have been a red flag.
> > > >
> > > > Minchan had some strong opinion on that, so we need to hear from him
> > > > before we decide how do we fix it.
> > >
> > > I'd be happy if he could validate the fix. But this fixes a crash, so
> > > the clock is ticking.
> > >
> > > I will also say, his was a design preference. One we agreed to only
> > > very reluctantly: https://lore.kernel.org/lkml/Y3f6habiVuV9LMcu@google.com/
> > >
> > > Now we have a crash that is a direct result of it, and which cost us
> > > (and apparently is still costing us) time and energy to resolve.
> > >
> > > Unless somebody surfaces a real technical problem with the fix, I'd
> > > say let's do it our way this time.
> > >
> >
> > Sorry for being too late to review. The reason I insisted on it was
> > I overlookeded the bug and thought it was trivial change but better
> > semantic since zsmalloc provides separate API between allocation and
> > access unlike other allocators. Now, Nhat and Johannes provided it's
> > more error prone, I am totally fine with this fix and will live it
> > until the LRU writeback will move out of allocator.
> >
> > Sorry for wasting your time to hunt this bug down and thank you for fix!
> >
> > Acked-by: Minchan Kim <minchan@kernel.org>
>
> Thanks Minchan!
>
> Domenico is working on the LRU refactor right now, and should have
> patches for review soon. This will indeed get rid of all the zsmalloc
> warts and make our lives much easier going forward!

That's the dream! I look forward to Domenico's patches.
And thanks for the ack, Minchan!
  
Sergey Senozhatsky May 10, 2023, 12:39 a.m. UTC | #12
On (23/05/05 11:50), Nhat Pham wrote:
[..]
> zswap_frontswap_store()            shrink_worker()
>   zs_malloc()                        zs_zpool_shrink()
>     spin_lock(&pool->lock)             zs_reclaim_page()
>     zspage = find_get_zspage()
>     spin_unlock(&pool->lock)
>                                          spin_lock(&pool->lock)
>                                          zspage = list_first_entry(&pool->lru)
>                                          list_del(&zspage->lru)
>                                            zspage->lru.next = LIST_POISON1
>                                            zspage->lru.prev = LIST_POISON2
>                                          spin_unlock(&pool->lock)
>   zs_map_object()
>     spin_lock(&pool->lock)
>     if (!list_empty(&zspage->lru))
>       list_del(&zspage->lru)
>         CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */
> 
> With the current upstream code, this issue rarely happens. zswap only
> triggers writeback when the pool is already full, at which point all
> further store attempts are short-circuited. This creates an implicit
> pseudo-serialization between reclaim and store. I am working on a new
> zswap shrinking mechanism, which makes interleaving reclaim and store
> more likely, exposing this bug.
> 
> zbud and z3fold do not have this problem, because they perform the LRU
> list update in the alloc function, while still holding the pool's lock.
> This patch fixes the aforementioned bug by moving the LRU update back to
> zs_malloc(), analogous to zbud and z3fold.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
  

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 44ddaf5d601e..02f7f414aade 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1331,31 +1331,6 @@  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	obj_to_location(obj, &page, &obj_idx);
 	zspage = get_zspage(page);
 
-#ifdef CONFIG_ZPOOL
-	/*
-	 * Move the zspage to front of pool's LRU.
-	 *
-	 * Note that this is swap-specific, so by definition there are no ongoing
-	 * accesses to the memory while the page is swapped out that would make
-	 * it "hot". A new entry is hot, then ages to the tail until it gets either
-	 * written back or swaps back in.
-	 *
-	 * Furthermore, map is also called during writeback. We must not put an
-	 * isolated page on the LRU mid-reclaim.
-	 *
-	 * As a result, only update the LRU when the page is mapped for write
-	 * when it's first instantiated.
-	 *
-	 * This is a deviation from the other backends, which perform this update
-	 * in the allocation function (zbud_alloc, z3fold_alloc).
-	 */
-	if (mm == ZS_MM_WO) {
-		if (!list_empty(&zspage->lru))
-			list_del(&zspage->lru);
-		list_add(&zspage->lru, &pool->lru);
-	}
-#endif
-
 	/*
 	 * migration cannot move any zpages in this zspage. Here, pool->lock
 	 * is too heavy since callers would take some time until they calls
@@ -1525,9 +1500,8 @@  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 		fix_fullness_group(class, zspage);
 		record_obj(handle, obj);
 		class_stat_inc(class, ZS_OBJS_INUSE, 1);
-		spin_unlock(&pool->lock);
 
-		return handle;
+		goto out;
 	}
 
 	spin_unlock(&pool->lock);
@@ -1550,6 +1524,14 @@  unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
 	/* We completely set up zspage so mark them as movable */
 	SetZsPageMovable(pool, zspage);
+out:
+#ifdef CONFIG_ZPOOL
+	/* Add/move zspage to beginning of LRU */
+	if (!list_empty(&zspage->lru))
+		list_del(&zspage->lru);
+	list_add(&zspage->lru, &pool->lru);
+#endif
+
 	spin_unlock(&pool->lock);
 
 	return handle;