[v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()

Message ID 20230404154716.23058-1-rongwei.wang@linux.alibaba.com
State New
Headers
Series [v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages() |

Commit Message

Rongwei Wang April 4, 2023, 3:47 p.m. UTC
  The si->lock must be held when deleting the si from
the available list.  Otherwise, another thread can
re-add the si to the available list, which can lead
to memory corruption. The only place we have found
where this happens is in the swapoff path. This case
can be described as below:

core 0                       core 1
swapoff

del_from_avail_list(si)      waiting

try lock si->lock            acquire swap_avail_lock
                             and re-add si into
                             swap_avail_head

acquire si->lock but
missing si already be
added again, and continuing
to clear SWP_WRITEOK, etc.

It can be easily found a massive warning messages can
be triggered inside get_swap_pages() by some special
cases, for example, we call madvise(MADV_PAGEOUT) on
blocks of touched memory concurrently, meanwhile, run
much swapon-swapoff operations (e.g. stress-ng-swap).

However, in the worst case, panic can be caused by the
above scene. In swapoff(), the memory used by si could
be kept in swap_info[] after turning off a swap. This
means memory corruption will not be caused immediately
until allocated and reset for a new swap in the swapon
path. A panic message caused:
(with CONFIG_PLIST_DEBUG enabled)

------------[ cut here ]------------
top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
Modules linked in: rfkill(E) crct10dif_ce(E)...
CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : plist_check_prev_next_node+0x50/0x70
lr : plist_check_prev_next_node+0x50/0x70
sp : ffff0018009d3c30
x29: ffff0018009d3c40 x28: ffff800011b32a98
x27: 0000000000000000 x26: ffff001803908000
x25: ffff8000128ea088 x24: ffff800011b32a48
x23: 0000000000000028 x22: ffff001800875c00
x21: ffff800010f9e520 x20: ffff001800875c00
x19: ffff001800fdc6e0 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: 0736076307640766 x14: 0730073007380731
x13: 0736076307640766 x12: 0730073007380731
x11: 000000000004058d x10: 0000000085a85b76
x9 : ffff8000101436e4 x8 : ffff800011c8ce08
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0017df9ed338 x4 : 0000000000000001
x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 plist_check_prev_next_node+0x50/0x70
 plist_check_head+0x80/0xf0
 plist_add+0x28/0x140
 add_to_avail_list+0x9c/0xf0
 _enable_swap_info+0x78/0xb4
 __do_sys_swapon+0x918/0xa10
 __arm64_sys_swapon+0x20/0x30
 el0_svc_common+0x8c/0x220
 do_el0_svc+0x2c/0x90
 el0_svc+0x1c/0x30
 el0_sync_handler+0xa8/0xb0
 el0_sync+0x148/0x180
irq event stamp: 2082270

Now, si->lock locked before calling 'del_from_avail_list()'
to make sure other thread see the si had been deleted
and SWP_WRITEOK cleared together, will not reinsert again.

This problem exists in versions after stable 5.10.y.

Cc: stable@vger.kernel.org
Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
---
 mm/swapfile.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
  

Comments

Rongwei Wang April 4, 2023, 4:08 p.m. UTC | #1
Hello

I have fix up some stuff base on Patch v1. And in order to help all 
readers and reviewers to

reproduce this bug, share a reproducer here:

swap_bomb.sh

#!/usr/bin/env bash

stress-ng -a 1 --class vm -t 12h --metrics --times -x 
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap 
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x 
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap 
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x 
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap 
--verify -v &
stress-ng -a 1 --class vm -t 12h --metrics --times -x 
bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap 
--verify -v


madvise_shared.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MSIZE (1024 * 1024 * 2)

int main()
{
         char *shm_addr;
         unsigned long i;

         while (1) {
                 // Map shared memory segment
                 shm_addr =
                     mmap(NULL, MSIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                 if (shm_addr == MAP_FAILED) {
                         perror("Failed to map shared memory segment");
                         exit(EXIT_FAILURE);
                 }

                 for (i = 0; i < MSIZE; i++) {
                         shm_addr[i] = 1;
                 }

                 // Advise kernel on usage pattern of shared memory
                 if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
                         perror
                             ("Failed to advise kernel on shared memory 
usage");
                         exit(EXIT_FAILURE);
                 }

                 for (i = 0; i < MSIZE; i++) {
                         shm_addr[i] = 1;
                 }

                 // Advise kernel on usage pattern of shared memory
                 if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
                         perror
                             ("Failed to advise kernel on shared memory 
usage");
                         exit(EXIT_FAILURE);
                 }
                 // Use shared memory
                 printf("Hello, shared memory: 0x%lx\n", shm_addr);

                 // Unmap shared memory segment
                 if (munmap(shm_addr, MSIZE) == -1) {
                         perror("Failed to unmap shared memory segment");
                         exit(EXIT_FAILURE);
                 }
         }

         return 0;
}

The bug will reproduce more quickly (about 2~5 minutes) if concurrent 
more swap_bomb.sh and madvise_shared.

Thanks.


change log:

v1 -> v2

* fix up some commits and add assert_spin_locked(&p->lock) inside 
__delete_from_avail_list() (suggested by Matthew Wilcox and Bagas Sanjaya)


On 4/4/23 11:47 PM, Rongwei Wang wrote:
> The si->lock must be held when deleting the si from
> the available list.  Otherwise, another thread can
> re-add the si to the available list, which can lead
> to memory corruption. The only place we have found
> where this happens is in the swapoff path. This case
> can be described as below:
>
> core 0                       core 1
> swapoff
>
> del_from_avail_list(si)      waiting
>
> try lock si->lock            acquire swap_avail_lock
>                               and re-add si into
>                               swap_avail_head
>
> acquire si->lock but
> missing si already be
> added again, and continuing
> to clear SWP_WRITEOK, etc.
>
> It can be easily found a massive warning messages can
> be triggered inside get_swap_pages() by some special
> cases, for example, we call madvise(MADV_PAGEOUT) on
> blocks of touched memory concurrently, meanwhile, run
> much swapon-swapoff operations (e.g. stress-ng-swap).
>
> However, in the worst case, panic can be caused by the
> above scene. In swapoff(), the memory used by si could
> be kept in swap_info[] after turning off a swap. This
> means memory corruption will not be caused immediately
> until allocated and reset for a new swap in the swapon
> path. A panic message caused:
> (with CONFIG_PLIST_DEBUG enabled)
>
> ------------[ cut here ]------------
> top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
> prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
> next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
> WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
> Modules linked in: rfkill(E) crct10dif_ce(E)...
> CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
> Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
> pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> pc : plist_check_prev_next_node+0x50/0x70
> lr : plist_check_prev_next_node+0x50/0x70
> sp : ffff0018009d3c30
> x29: ffff0018009d3c40 x28: ffff800011b32a98
> x27: 0000000000000000 x26: ffff001803908000
> x25: ffff8000128ea088 x24: ffff800011b32a48
> x23: 0000000000000028 x22: ffff001800875c00
> x21: ffff800010f9e520 x20: ffff001800875c00
> x19: ffff001800fdc6e0 x18: 0000000000000030
> x17: 0000000000000000 x16: 0000000000000000
> x15: 0736076307640766 x14: 0730073007380731
> x13: 0736076307640766 x12: 0730073007380731
> x11: 000000000004058d x10: 0000000085a85b76
> x9 : ffff8000101436e4 x8 : ffff800011c8ce08
> x7 : 0000000000000000 x6 : 0000000000000001
> x5 : ffff0017df9ed338 x4 : 0000000000000001
> x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
> x1 : 0000000000000000 x0 : 0000000000000000
> Call trace:
>   plist_check_prev_next_node+0x50/0x70
>   plist_check_head+0x80/0xf0
>   plist_add+0x28/0x140
>   add_to_avail_list+0x9c/0xf0
>   _enable_swap_info+0x78/0xb4
>   __do_sys_swapon+0x918/0xa10
>   __arm64_sys_swapon+0x20/0x30
>   el0_svc_common+0x8c/0x220
>   do_el0_svc+0x2c/0x90
>   el0_svc+0x1c/0x30
>   el0_sync_handler+0xa8/0xb0
>   el0_sync+0x148/0x180
> irq event stamp: 2082270
>
> Now, si->lock locked before calling 'del_from_avail_list()'
> to make sure other thread see the si had been deleted
> and SWP_WRITEOK cleared together, will not reinsert again.
>
> This problem exists in versions after stable 5.10.y.
>
> Cc: stable@vger.kernel.org
> Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
> ---
>   mm/swapfile.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 62ba2bf577d7..2c718f45745f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
>   {
>   	int nid;
>   
> +	assert_spin_locked(&p->lock);
>   	for_each_node(nid)
>   		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
>   }
> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>   		spin_unlock(&swap_lock);
>   		goto out_dput;
>   	}
> -	del_from_avail_list(p);
>   	spin_lock(&p->lock);
> +	del_from_avail_list(p);
>   	if (p->prio < 0) {
>   		struct swap_info_struct *si = p;
>   		int nid;
  
Andrew Morton April 4, 2023, 7:26 p.m. UTC | #2
On Tue,  4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:

> The si->lock must be held when deleting the si from
> the available list.
>
> ...
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
>  {
>  	int nid;
>  
> +	assert_spin_locked(&p->lock);
>  	for_each_node(nid)
>  		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
>  }
> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  		spin_unlock(&swap_lock);
>  		goto out_dput;
>  	}
> -	del_from_avail_list(p);
>  	spin_lock(&p->lock);
> +	del_from_avail_list(p);
>  	if (p->prio < 0) {
>  		struct swap_info_struct *si = p;
>  		int nid;

So we have

swap_avail_lock
swap_info_struct.lock
swap_cluster_info.lock

Is the ranking of these three clearly documented somewhere?


Did you test this with lockdep fully enabled?


I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device
according to numa node") is the appropriate Fixes: target - do you
agree?


These functions use identifier `p' for the swap_info_struct*, whereas
most other code uses the much more sensible `si'.  That's just rude. 
But we shouldn't change that within this fix.
  
Rongwei Wang April 5, 2023, 6:49 a.m. UTC | #3
Hi Andrew

On 4/5/23 3:26 AM, Andrew Morton wrote:
> On Tue,  4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
>
>> The si->lock must be held when deleting the si from
>> the available list.
>>
>> ...
>>
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
>>   {
>>   	int nid;
>>   
>> +	assert_spin_locked(&p->lock);
>>   	for_each_node(nid)
>>   		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
>>   }
>> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>   		spin_unlock(&swap_lock);
>>   		goto out_dput;
>>   	}
>> -	del_from_avail_list(p);
>>   	spin_lock(&p->lock);
>> +	del_from_avail_list(p);
>>   	if (p->prio < 0) {
>>   		struct swap_info_struct *si = p;
>>   		int nid;
> So we have
>
> swap_avail_lock
> swap_info_struct.lock
> swap_cluster_info.lock
>
> Is the ranking of these three clearly documented somewhere?

It seems have

swap_lock

swap_info_struct.lock

swap_avail_lock

I just summary the ranking of these three locks by reading code, not 
find any documents (maybe have).

>
>
> Did you test this with lockdep fully enabled?
>
>
> I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device
> according to numa node") is the appropriate Fixes: target - do you
> agree?

Yes, I'm sure my latest test version has included Aaron's a2468cc9bfdff, 
and my test .config has enabled CONFIG

as below:

CONFIG_LOCK_DEBUGGING_SUPPORT=y CONFIG_PROVE_LOCKING=y 
CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_LOCKDEP=y 
CONFIG_DEBUG_LOCKDEP=y CONFIG_DEBUG_ATOMIC_SLEEP=y

>
>
> These functions use identifier `p' for the swap_info_struct*, whereas
> most other code uses the much more sensible `si'.  That's just rude.
> But we shouldn't change that within this fix.

Indeed, It's confusing more or less to use both 'si' and 'p'. I can 
ready for another patch to replace 'p' with 'si'.

Thanks.
  
Aaron Lu April 6, 2023, 6:58 a.m. UTC | #4
Hi Andrew,

Sorry for replying a little late, it's holiday here yesterday.

On Tue, Apr 04, 2023 at 12:26:00PM -0700, Andrew Morton wrote:
> On Tue,  4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote:
> 
> > The si->lock must be held when deleting the si from
> > the available list.
> >
> > ...
> >
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> >  {
> >  	int nid;
> >  
> > +	assert_spin_locked(&p->lock);
> >  	for_each_node(nid)
> >  		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> >  }
> > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >  		spin_unlock(&swap_lock);
> >  		goto out_dput;
> >  	}
> > -	del_from_avail_list(p);
> >  	spin_lock(&p->lock);
> > +	del_from_avail_list(p);
> >  	if (p->prio < 0) {
> >  		struct swap_info_struct *si = p;
> >  		int nid;
> 
> So we have
> 
> swap_avail_lock
> swap_info_struct.lock
> swap_cluster_info.lock
> 
> Is the ranking of these three clearly documented somewhere?
>

I see some comments in swapfile.c mentioned something related, e.g.
above the definition of swap_avail_heads, the comment mentioned
swap_lock has to be taken before si->lock and swap_avail_lock can be
taken after si->lock is held, but I'm not aware of a place documenting
these things. Documenting these things is useful information I think,
let me see if I can come up with something later.

> 
> Did you test this with lockdep fully enabled?
> 
> 
> I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device
> according to numa node") is the appropriate Fixes: target - do you
> agree?

It doesn't appear to be the case. For one thing, the problematic code
that removes the swap device from the avail list without acquiring
si->lock was there before my commit and my commit didn't change that
behaviour. For another, I wanted to see if the problem is still there
without my commit(just to make sure).

I followed Rongwei's description and used stress-ng/swap test together
with some test progs that does memory allocation then MADVISE(pageout)
in a loop to reproduce this problem and I can also see the warning like
below using Linus' master branch as of today, I believe this is the
problem Rongwei described:

[ 1914.518786] ------------[ cut here ]------------
[ 1914.519049] swap_info 9 in list but !SWP_WRITEOK
[ 1914.519274] WARNING: CPU: 14 PID: 14307 at mm/swapfile.c:1085 get_swap_pages+0x3b3/0x440
[ 1914.519660] Modules linked in:
[ 1914.519811] CPU: 14 PID: 14307 Comm: swap Tainted: G        W          6.3.0-rc5-00032-g99ddf2254feb #5
[ 1914.520238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014
[ 1914.520641] RIP: 0010:get_swap_pages+0x3b3/0x440
[ 1914.520860] Code: 48 8b 4c 24 30 48 c1 e0 3a 4c 09 e0 48 89 01 e8 43 79 96 00 e9 b2 fd ff ff 41 0f be 77 48 48 c7 c78
[ 1914.521709] RSP: 0018:ffffc9000ba0f838 EFLAGS: 00010282
[ 1914.521950] RAX: 0000000000000000 RBX: ffff888154411400 RCX: 0000000000000000
[ 1914.522273] RDX: 0000000000000004 RSI: ffffffff824035cb RDI: 0000000000000001
[ 1914.522601] RBP: ffff888100d95f68 R08: 0000000000000001 R09: 0000000000000003
[ 1914.522926] R10: ffffffff82a7a420 R11: ffffffff82a7a420 R12: 0000000000000350
[ 1914.523249] R13: ffff888100d95da8 R14: ffff888100d95f50 R15: ffff888100d95c00
[ 1914.523576] FS:  00007f23abea2600(0000) GS:ffff88823b600000(0000) knlGS:0000000000000000
[ 1914.523942] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1914.524206] CR2: 00007f23abbff000 CR3: 0000000104b86004 CR4: 0000000000770ee0
[ 1914.524534] PKRU: 55555554
[ 1914.524661] Call Trace:
[ 1914.524782]  <TASK>
[ 1914.524889]  folio_alloc_swap+0xde/0x230
[ 1914.525076]  add_to_swap+0x36/0xb0
[ 1914.525242]  shrink_folio_list+0x9ab/0xef0
[ 1914.525445]  reclaim_folio_list+0x70/0x130
[ 1914.525644]  reclaim_pages+0x9c/0x1c0
[ 1914.525819]  madvise_cold_or_pageout_pte_range+0x79f/0xc80
[ 1914.526073]  walk_pgd_range+0x4d8/0x940
[ 1914.526255]  ? mt_find+0x15b/0x490
[ 1914.526426]  __walk_page_range+0x211/0x230
[ 1914.526619]  walk_page_range+0x17a/0x1e0
[ 1914.526807]  madvise_pageout+0xef/0x250

And when I reverted my commit on the same branch(needs some manual edits),
the problem is still there.

Another thing is, I noticed Rongwei mentioned "This problem exists in
versions after stable 5.10.y." in the changelog while my commit entered
mainline in v4.14.

So either this problem is always there, i.e. earlier than my commit; or
this problem is indeed only there after v5.10, then it should be something
else that triggered it. My qemu refuses to boot v4.14 kernel so I can
not verify the former yet.
  
Aaron Lu April 6, 2023, 12:12 p.m. UTC | #5
On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote:
> Hello
> 
> I have fix up some stuff base on Patch v1. And in order to help all readers
> and reviewers to
> 
> reproduce this bug, share a reproducer here:

I reproduced this problem under a VM this way:

$ sudo ./stress-ng --swap 1
// on another terminal
$ for i in `seq 8`; do ./swap & done
Looks simpler than yours :-)
(Didn't realize you have posted your reproducer here since I'm not CCed
and just found it after invented mine)
Then the warning message normally appear within a few seconds.

Here is the code for the above swap prog:

#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>

#define SIZE 0x100000

int main(void)
{
        int i, ret;
        void *p;

        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
        }

        ret = 0;
        while (1) {
                for (i = 0; i < SIZE; i += 0x1000)
                        ((char *)p)[i] = 1;
                ret = madvise(p, SIZE, MADV_PAGEOUT);
                if (ret != 0) {
                        perror("madvise");
                        break;
                }
        }

        return ret;
}

Unfortunately, this test prog did not work on kernels before v5.4 because
MADV_PAGEOUT is introduced in v5.4. I tested on v5.4 and the problem is
also there.

Haven't found a way to trigger swap with swap device come and go on
kernels before v5.4; tried putting the test prog in a memcg with memory
limit but then the prog is easily killed due to nowhere to swap out.

> 
> swap_bomb.sh
> 
> #!/usr/bin/env bash
> 
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v &
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v &
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v &
> stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap
> --verify -v
> 
> 
> madvise_shared.c
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> #define MSIZE (1024 * 1024 * 2)
> 
> int main()
> {
>         char *shm_addr;
>         unsigned long i;
> 
>         while (1) {
>                 // Map shared memory segment
>                 shm_addr =
>                     mmap(NULL, MSIZE, PROT_READ | PROT_WRITE,
>                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>                 if (shm_addr == MAP_FAILED) {
>                         perror("Failed to map shared memory segment");
>                         exit(EXIT_FAILURE);
>                 }
> 
>                 for (i = 0; i < MSIZE; i++) {
>                         shm_addr[i] = 1;
>                 }
> 
>                 // Advise kernel on usage pattern of shared memory
>                 if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
>                         perror
>                             ("Failed to advise kernel on shared memory
> usage");
>                         exit(EXIT_FAILURE);
>                 }
> 
>                 for (i = 0; i < MSIZE; i++) {
>                         shm_addr[i] = 1;
>                 }
> 
>                 // Advise kernel on usage pattern of shared memory
>                 if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) {
>                         perror
>                             ("Failed to advise kernel on shared memory
> usage");
>                         exit(EXIT_FAILURE);
>                 }
>                 // Use shared memory
>                 printf("Hello, shared memory: 0x%lx\n", shm_addr);
> 
>                 // Unmap shared memory segment
>                 if (munmap(shm_addr, MSIZE) == -1) {
>                         perror("Failed to unmap shared memory segment");
>                         exit(EXIT_FAILURE);
>                 }
>         }
> 
>         return 0;
> }
> 
> The bug will reproduce more quickly (about 2~5 minutes) if concurrent more
> swap_bomb.sh and madvise_shared.
> 
> Thanks.
  
Rongwei Wang April 6, 2023, 12:20 p.m. UTC | #6
> It doesn't appear to be the case. For one thing, the problematic code
> that removes the swap device from the avail list without acquiring
> si->lock was there before my commit and my commit didn't change that
> behaviour. For another, I wanted to see if the problem is still there
> without my commit(just to make sure).
>
> I followed Rongwei's description and used stress-ng/swap test together
> with some test progs that does memory allocation then MADVISE(pageout)
> in a loop to reproduce this problem and I can also see the warning like
> below using Linus' master branch as of today, I believe this is the
> problem Rongwei described:
Hi, Aaron, I can sure this is that bug, and the panic will happen when 
CONFIG_PLIST_DEBUG enabled (I'm not sure whether you have enabled it).
>
> [ 1914.518786] ------------[ cut here ]------------
> [ 1914.519049] swap_info 9 in list but !SWP_WRITEOK
> [ 1914.519274] WARNING: CPU: 14 PID: 14307 at mm/swapfile.c:1085 get_swap_pages+0x3b3/0x440
> [ 1914.519660] Modules linked in:
> [ 1914.519811] CPU: 14 PID: 14307 Comm: swap Tainted: G        W          6.3.0-rc5-00032-g99ddf2254feb #5
> [ 1914.520238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014
> [ 1914.520641] RIP: 0010:get_swap_pages+0x3b3/0x440
> [ 1914.520860] Code: 48 8b 4c 24 30 48 c1 e0 3a 4c 09 e0 48 89 01 e8 43 79 96 00 e9 b2 fd ff ff 41 0f be 77 48 48 c7 c78
> [ 1914.521709] RSP: 0018:ffffc9000ba0f838 EFLAGS: 00010282
> [ 1914.521950] RAX: 0000000000000000 RBX: ffff888154411400 RCX: 0000000000000000
> [ 1914.522273] RDX: 0000000000000004 RSI: ffffffff824035cb RDI: 0000000000000001
> [ 1914.522601] RBP: ffff888100d95f68 R08: 0000000000000001 R09: 0000000000000003
> [ 1914.522926] R10: ffffffff82a7a420 R11: ffffffff82a7a420 R12: 0000000000000350
> [ 1914.523249] R13: ffff888100d95da8 R14: ffff888100d95f50 R15: ffff888100d95c00
> [ 1914.523576] FS:  00007f23abea2600(0000) GS:ffff88823b600000(0000) knlGS:0000000000000000
> [ 1914.523942] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1914.524206] CR2: 00007f23abbff000 CR3: 0000000104b86004 CR4: 0000000000770ee0
> [ 1914.524534] PKRU: 55555554
> [ 1914.524661] Call Trace:
> [ 1914.524782]  <TASK>
> [ 1914.524889]  folio_alloc_swap+0xde/0x230
> [ 1914.525076]  add_to_swap+0x36/0xb0
> [ 1914.525242]  shrink_folio_list+0x9ab/0xef0
> [ 1914.525445]  reclaim_folio_list+0x70/0x130
> [ 1914.525644]  reclaim_pages+0x9c/0x1c0
> [ 1914.525819]  madvise_cold_or_pageout_pte_range+0x79f/0xc80
> [ 1914.526073]  walk_pgd_range+0x4d8/0x940
> [ 1914.526255]  ? mt_find+0x15b/0x490
> [ 1914.526426]  __walk_page_range+0x211/0x230
> [ 1914.526619]  walk_page_range+0x17a/0x1e0
> [ 1914.526807]  madvise_pageout+0xef/0x250
>
> And when I reverted my commit on the same branch(needs some manual edits),
> the problem is still there.
>
> Another thing is, I noticed Rongwei mentioned "This problem exists in
> versions after stable 5.10.y." in the changelog while my commit entered
> mainline in v4.14.
>
> So either this problem is always there, i.e. earlier than my commit; or
> this problem is indeed only there after v5.10, then it should be something
> else that triggered it. My qemu refuses to boot v4.14 kernel so I can
> not verify the former yet.

Me too. The oldest kernel that my qemu can run is 4.19.

BTW, I try to replace 'p' with 'si' today, and find there are many areas 
need to be modified, especially inside swapoff() and swapon(). So many 
modifications maybe affect future tracking of code modifications and 
will cost some time to test. So I wanna to ensure whether need I to do 
this. If need, I can continue to do this.


Thanks.
  
Rongwei Wang April 6, 2023, 12:55 p.m. UTC | #7
Oh, sorry, I miss this email just now, that because of I'm also replying 
your previous email.

On 2023/4/6 20:12, Aaron Lu wrote:
> On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote:
>> Hello
>>
>> I have fix up some stuff base on Patch v1. And in order to help all readers
>> and reviewers to
>>
>> reproduce this bug, share a reproducer here:
> I reproduced this problem under a VM this way:
>
> $ sudo ./stress-ng --swap 1
> // on another terminal
> $ for i in `seq 8`; do ./swap & done
> Looks simpler than yours :-)
Cool, indeed become simpler.
> (Didn't realize you have posted your reproducer here since I'm not CCed
> and just found it after invented mine)
> Then the warning message normally appear within a few seconds.
>
> Here is the code for the above swap prog:
>
> #include <stdio.h>
> #include <stddef.h>
> #include <sys/mman.h>
>
> #define SIZE 0x100000
>
> int main(void)
> {
>          int i, ret;
>          void *p;
>
>          p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
>          if (p == MAP_FAILED) {
>                  perror("mmap");
>                  return -1;
>          }
>
>          ret = 0;
>          while (1) {
>                  for (i = 0; i < SIZE; i += 0x1000)
>                          ((char *)p)[i] = 1;
>                  ret = madvise(p, SIZE, MADV_PAGEOUT);
>                  if (ret != 0) {
>                          perror("madvise");
>                          break;
>                  }
>          }
>
>          return ret;
> }
>
> Unfortunately, this test prog did not work on kernels before v5.4 because
> MADV_PAGEOUT is introduced in v5.4. I tested on v5.4 and the problem is
> also there.

Maybe that is this bug can not be found since now. And we found this is 
triggered by stress-ng-swap and stress-ng-madvise (PAGEOUT) firstly. It 
seems this is that reason.

It seems MADV_COLD is also introduced together with MADV_PAGEOUT. I have 
no idea and have to depend on you.:-)

>
> Haven't found a way to trigger swap with swap device come and go on
> kernels before v5.4; tried putting the test prog in a memcg with memory
> limit but then the prog is easily killed due to nowhere to swap out.
>
Personally, I do not intend to continuing searching for the method to 
reproduce before v5.4. Of course, if you have idea, I can try.


Thanks:-)
  
Aaron Lu April 6, 2023, 2:04 p.m. UTC | #8
On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
> The si->lock must be held when deleting the si from
> the available list.  Otherwise, another thread can
> re-add the si to the available list, which can lead
> to memory corruption. The only place we have found
> where this happens is in the swapoff path. This case
> can be described as below:
> 
> core 0                       core 1
> swapoff
> 
> del_from_avail_list(si)      waiting
> 
> try lock si->lock            acquire swap_avail_lock
>                              and re-add si into
>                              swap_avail_head

                               confused here.

If del_from_avail_list(si) finished in swaoff path, then this si should
not exist in any of the per-node avail list and core 1 should not be
able to re-add it.

I stared at the code for a while and couldn't figure out how this
happened, will continue to look at this tomorrow.

Thanks,
Aaron

> 
> acquire si->lock but
> missing si already be
> added again, and continuing
> to clear SWP_WRITEOK, etc.
> 
> It can be easily found a massive warning messages can
> be triggered inside get_swap_pages() by some special
> cases, for example, we call madvise(MADV_PAGEOUT) on
> blocks of touched memory concurrently, meanwhile, run
> much swapon-swapoff operations (e.g. stress-ng-swap).
> 
> However, in the worst case, panic can be caused by the
> above scene. In swapoff(), the memory used by si could
> be kept in swap_info[] after turning off a swap. This
> means memory corruption will not be caused immediately
> until allocated and reset for a new swap in the swapon
> path. A panic message caused:
> (with CONFIG_PLIST_DEBUG enabled)
> 
> ------------[ cut here ]------------
> top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
> prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
> next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
> WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
> Modules linked in: rfkill(E) crct10dif_ce(E)...
> CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
> Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
> pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> pc : plist_check_prev_next_node+0x50/0x70
> lr : plist_check_prev_next_node+0x50/0x70
> sp : ffff0018009d3c30
> x29: ffff0018009d3c40 x28: ffff800011b32a98
> x27: 0000000000000000 x26: ffff001803908000
> x25: ffff8000128ea088 x24: ffff800011b32a48
> x23: 0000000000000028 x22: ffff001800875c00
> x21: ffff800010f9e520 x20: ffff001800875c00
> x19: ffff001800fdc6e0 x18: 0000000000000030
> x17: 0000000000000000 x16: 0000000000000000
> x15: 0736076307640766 x14: 0730073007380731
> x13: 0736076307640766 x12: 0730073007380731
> x11: 000000000004058d x10: 0000000085a85b76
> x9 : ffff8000101436e4 x8 : ffff800011c8ce08
> x7 : 0000000000000000 x6 : 0000000000000001
> x5 : ffff0017df9ed338 x4 : 0000000000000001
> x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
> x1 : 0000000000000000 x0 : 0000000000000000
> Call trace:
>  plist_check_prev_next_node+0x50/0x70
>  plist_check_head+0x80/0xf0
>  plist_add+0x28/0x140
>  add_to_avail_list+0x9c/0xf0
>  _enable_swap_info+0x78/0xb4
>  __do_sys_swapon+0x918/0xa10
>  __arm64_sys_swapon+0x20/0x30
>  el0_svc_common+0x8c/0x220
>  do_el0_svc+0x2c/0x90
>  el0_svc+0x1c/0x30
>  el0_sync_handler+0xa8/0xb0
>  el0_sync+0x148/0x180
> irq event stamp: 2082270
> 
> Now, si->lock locked before calling 'del_from_avail_list()'
> to make sure other thread see the si had been deleted
> and SWP_WRITEOK cleared together, will not reinsert again.
> 
> This problem exists in versions after stable 5.10.y.
> 
> Cc: stable@vger.kernel.org
> Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
> ---
>  mm/swapfile.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 62ba2bf577d7..2c718f45745f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
>  {
>  	int nid;
>  
> +	assert_spin_locked(&p->lock);
>  	for_each_node(nid)
>  		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
>  }
> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  		spin_unlock(&swap_lock);
>  		goto out_dput;
>  	}
> -	del_from_avail_list(p);
>  	spin_lock(&p->lock);
> +	del_from_avail_list(p);
>  	if (p->prio < 0) {
>  		struct swap_info_struct *si = p;
>  		int nid;
> -- 
> 2.27.0
>
  
Aaron Lu April 6, 2023, 2:57 p.m. UTC | #9
On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote:
> On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
> > The si->lock must be held when deleting the si from
> > the available list.  Otherwise, another thread can
> > re-add the si to the available list, which can lead
> > to memory corruption. The only place we have found
> > where this happens is in the swapoff path. This case
> > can be described as below:
> > 
> > core 0                       core 1
> > swapoff
> > 
> > del_from_avail_list(si)      waiting
> > 
> > try lock si->lock            acquire swap_avail_lock
> >                              and re-add si into
> >                              swap_avail_head
> 
>                                confused here.
> 
> If del_from_avail_list(si) finished in swaoff path, then this si should
> not exist in any of the per-node avail list and core 1 should not be
> able to re-add it.

I think a possible sequence could be like this:

cpuX                             cpuY
swapoff                          put_swap_folio()

del_from_avail_list(si)
                                 taken si->lock
spin_lock(&si->lock); 

				 swap_range_free()
				 was_full && SWP_WRITEOK -> re-add!
				 drop si->lock

taken si->lock
proceed removing si

End result: si left on avail_list after being swapped off.

The problem is, in add_to_avail_list(), it has no idea this si is being
swapped off and taking si->lock then del_from_avail_list() could avoid
this problem, so I think this patch did the right thing but the
changelog about how this happened needs updating and after that:

Reviewed-by: Aaron Lu <aaron.lu@intel.com>

Thanks,
Aaron

> 
> I stared at the code for a while and couldn't figure out how this
> happened, will continue to look at this tomorrow.
> > 
> > acquire si->lock but
> > missing si already be
> > added again, and continuing
> > to clear SWP_WRITEOK, etc.
> > 
> > It can be easily found a massive warning messages can
> > be triggered inside get_swap_pages() by some special
> > cases, for example, we call madvise(MADV_PAGEOUT) on
> > blocks of touched memory concurrently, meanwhile, run
> > much swapon-swapoff operations (e.g. stress-ng-swap).
> > 
> > However, in the worst case, panic can be caused by the
> > above scene. In swapoff(), the memory used by si could
> > be kept in swap_info[] after turning off a swap. This
> > means memory corruption will not be caused immediately
> > until allocated and reset for a new swap in the swapon
> > path. A panic message caused:
> > (with CONFIG_PLIST_DEBUG enabled)
> > 
> > ------------[ cut here ]------------
> > top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
> > prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
> > next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
> > WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
> > Modules linked in: rfkill(E) crct10dif_ce(E)...
> > CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
> > Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
> > pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> > pc : plist_check_prev_next_node+0x50/0x70
> > lr : plist_check_prev_next_node+0x50/0x70
> > sp : ffff0018009d3c30
> > x29: ffff0018009d3c40 x28: ffff800011b32a98
> > x27: 0000000000000000 x26: ffff001803908000
> > x25: ffff8000128ea088 x24: ffff800011b32a48
> > x23: 0000000000000028 x22: ffff001800875c00
> > x21: ffff800010f9e520 x20: ffff001800875c00
> > x19: ffff001800fdc6e0 x18: 0000000000000030
> > x17: 0000000000000000 x16: 0000000000000000
> > x15: 0736076307640766 x14: 0730073007380731
> > x13: 0736076307640766 x12: 0730073007380731
> > x11: 000000000004058d x10: 0000000085a85b76
> > x9 : ffff8000101436e4 x8 : ffff800011c8ce08
> > x7 : 0000000000000000 x6 : 0000000000000001
> > x5 : ffff0017df9ed338 x4 : 0000000000000001
> > x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
> > x1 : 0000000000000000 x0 : 0000000000000000
> > Call trace:
> >  plist_check_prev_next_node+0x50/0x70
> >  plist_check_head+0x80/0xf0
> >  plist_add+0x28/0x140
> >  add_to_avail_list+0x9c/0xf0
> >  _enable_swap_info+0x78/0xb4
> >  __do_sys_swapon+0x918/0xa10
> >  __arm64_sys_swapon+0x20/0x30
> >  el0_svc_common+0x8c/0x220
> >  do_el0_svc+0x2c/0x90
> >  el0_svc+0x1c/0x30
> >  el0_sync_handler+0xa8/0xb0
> >  el0_sync+0x148/0x180
> > irq event stamp: 2082270
> > 
> > Now, si->lock locked before calling 'del_from_avail_list()'
> > to make sure other thread see the si had been deleted
> > and SWP_WRITEOK cleared together, will not reinsert again.
> > 
> > This problem exists in versions after stable 5.10.y.
> > 
> > Cc: stable@vger.kernel.org
> > Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
> > Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
> > ---
> >  mm/swapfile.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 62ba2bf577d7..2c718f45745f 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p)
> >  {
> >  	int nid;
> >  
> > +	assert_spin_locked(&p->lock);
> >  	for_each_node(nid)
> >  		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
> >  }
> > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >  		spin_unlock(&swap_lock);
> >  		goto out_dput;
> >  	}
> > -	del_from_avail_list(p);
> >  	spin_lock(&p->lock);
> > +	del_from_avail_list(p);
> >  	if (p->prio < 0) {
> >  		struct swap_info_struct *si = p;
> >  		int nid;
> > -- 
> > 2.27.0
> >
  
Rongwei Wang April 7, 2023, 2:20 a.m. UTC | #10
On 2023/4/6 22:57, Aaron Lu wrote:
> On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote:
>> On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote:
>>> The si->lock must be held when deleting the si from
>>> the available list.  Otherwise, another thread can
>>> re-add the si to the available list, which can lead
>>> to memory corruption. The only place we have found
>>> where this happens is in the swapoff path. This case
>>> can be described as below:
>>>
>>> core 0                       core 1
>>> swapoff
>>>
>>> del_from_avail_list(si)      waiting
>>>
>>> try lock si->lock            acquire swap_avail_lock
>>>                               and re-add si into
>>>                               swap_avail_head
>>                                 confused here.
>>
>> If del_from_avail_list(si) finished in swaoff path, then this si should
>> not exist in any of the per-node avail list and core 1 should not be
>> able to re-add it.
> I think a possible sequence could be like this:
>
> cpuX                             cpuY
> swapoff                          put_swap_folio()
>
> del_from_avail_list(si)
>                                   taken si->lock
> spin_lock(&si->lock);
>
> 				 swap_range_free()
> 				 was_full && SWP_WRITEOK -> re-add!
> 				 drop si->lock
>
> taken si->lock
> proceed removing si
>
> End result: si left on avail_list after being swapped off.
>
> The problem is, in add_to_avail_list(), it has no idea this si is being
> swapped off and taking si->lock then del_from_avail_list() could avoid
> this problem, so I think this patch did the right thing but the
> changelog about how this happened needs updating and after that:

Hi Aaron

That's my fault. Actually, I don't refers specifically to 
swap_range_free() path in commit, mainly because cpuY can stand all 
threads which is waiting swap_avail_lock, then to call 
add_to_avail_list(), not only swap_range_free(), e.g. maybe swapon().

>
> Reviewed-by: Aaron Lu <aaron.lu@intel.com>

Thanks for your time.

-wrw

>
> Thanks,
> Aaron
>
  

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 62ba2bf577d7..2c718f45745f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -679,6 +679,7 @@  static void __del_from_avail_list(struct swap_info_struct *p)
 {
 	int nid;
 
+	assert_spin_locked(&p->lock);
 	for_each_node(nid)
 		plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]);
 }
@@ -2434,8 +2435,8 @@  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_unlock(&swap_lock);
 		goto out_dput;
 	}
-	del_from_avail_list(p);
 	spin_lock(&p->lock);
+	del_from_avail_list(p);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
 		int nid;