xfs: fix possible assert failed in xfs_fs_put_super() when do cpu offline

Message ID 20230314090649.326642-1-yebin@huaweicloud.com
State New
Headers
Series xfs: fix possible assert failed in xfs_fs_put_super() when do cpu offline |

Commit Message

Ye Bin March 14, 2023, 9:06 a.m. UTC
  From: Ye Bin <yebin10@huawei.com>

There's a issue when do cpu offline test:
CPU: 48 PID: 1168152 Comm: umount Kdump: loaded Tainted: G L
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : assfail+0x8c/0xb4
lr : assfail+0x38/0xb4
sp : ffffa00033ce7c40
x29: ffffa00033ce7c40 x28: ffffa00014794f30
x27: ffffa00014f6ca20 x26: 1fffe0120b2e2030
x25: ffff009059710188 x24: ffff00886c0a4650
x23: 1fffe0110d8148ca x22: ffff009059710180
x21: ffffa00015155680 x20: ffff00886c0a4000
x19: 0000000000000001 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000007 x14: 1fffe00304cef265
x13: ffff00182642b200 x12: ffff8012d37757bf
x11: 1fffe012d37757be x10: ffff8012d37757be
x9 : ffffa00010603a0c x8 : 0000000041b58ab3
x7 : ffff94000679cf44 x6 : 00000000ffffffc0
x5 : 0000000000000021 x4 : 00000000ffffffca
x3 : 1ffff40002a27ee1 x2 : 0000000000000004
x1 : 0000000000000000 x0 : ffffa0001513f000
Call trace:
 assfail+0x8c/0xb4
 xfs_destroy_percpu_counters+0x98/0xa4
 xfs_fs_put_super+0x1a0/0x2a4
 generic_shutdown_super+0x104/0x2c0
 kill_block_super+0x8c/0xf4
 deactivate_locked_super+0xa4/0x164
 deactivate_super+0xb0/0xdc
 cleanup_mnt+0x29c/0x3ec
 __cleanup_mnt+0x1c/0x30
 task_work_run+0xe0/0x200
 do_notify_resume+0x244/0x320
 work_pending+0xc/0xa0

We analyzed the data in vmcore is correct. But triggered above issue.
As f689054aace2 ("percpu_counter: add percpu_counter_sum_all interface")
commit describes there is a small race window between the online CPUs traversal
of percpu_counter_sum and the CPU offline callback. This means percpu_counter_sum()
may return incorrect result during cpu offline.
To solve above issue use percpu_counter_sum_all() interface to make sure
result is correct to prevent false triggering of assertions.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/xfs/xfs_super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Darrick J. Wong March 14, 2023, 4:31 p.m. UTC | #1
On Tue, Mar 14, 2023 at 05:06:49PM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
> 
> There's a issue when do cpu offline test:
> CPU: 48 PID: 1168152 Comm: umount Kdump: loaded Tainted: G L
> pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> pc : assfail+0x8c/0xb4
> lr : assfail+0x38/0xb4
> sp : ffffa00033ce7c40
> x29: ffffa00033ce7c40 x28: ffffa00014794f30
> x27: ffffa00014f6ca20 x26: 1fffe0120b2e2030
> x25: ffff009059710188 x24: ffff00886c0a4650
> x23: 1fffe0110d8148ca x22: ffff009059710180
> x21: ffffa00015155680 x20: ffff00886c0a4000
> x19: 0000000000000001 x18: 0000000000000000
> x17: 0000000000000000 x16: 0000000000000000
> x15: 0000000000000007 x14: 1fffe00304cef265
> x13: ffff00182642b200 x12: ffff8012d37757bf
> x11: 1fffe012d37757be x10: ffff8012d37757be
> x9 : ffffa00010603a0c x8 : 0000000041b58ab3
> x7 : ffff94000679cf44 x6 : 00000000ffffffc0
> x5 : 0000000000000021 x4 : 00000000ffffffca
> x3 : 1ffff40002a27ee1 x2 : 0000000000000004
> x1 : 0000000000000000 x0 : ffffa0001513f000
> Call trace:
>  assfail+0x8c/0xb4
>  xfs_destroy_percpu_counters+0x98/0xa4
>  xfs_fs_put_super+0x1a0/0x2a4
>  generic_shutdown_super+0x104/0x2c0
>  kill_block_super+0x8c/0xf4
>  deactivate_locked_super+0xa4/0x164
>  deactivate_super+0xb0/0xdc
>  cleanup_mnt+0x29c/0x3ec
>  __cleanup_mnt+0x1c/0x30
>  task_work_run+0xe0/0x200
>  do_notify_resume+0x244/0x320
>  work_pending+0xc/0xa0
> 
> We analyzed the data in vmcore is correct. But triggered above issue.
> As f689054aace2 ("percpu_counter: add percpu_counter_sum_all interface")
> commit describes there is a small race window between the online CPUs traversal
> of percpu_counter_sum and the CPU offline callback. This means percpu_counter_sum()
> may return incorrect result during cpu offline.
> To solve above issue use percpu_counter_sum_all() interface to make sure
> result is correct to prevent false triggering of assertions.

How about the other percpu_counter_sum callsites inside XFS?  Some of
them are involved in writing ondisk metadata (xfs_log_sb) or doing
correctness checks (fs/xfs/scrub/*); shouldn't those also be using the
_all variant?

--D

> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
>  fs/xfs/xfs_super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 2479b5cbd75e..c0ce66f966ee 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1076,7 +1076,7 @@ xfs_destroy_percpu_counters(
>  	percpu_counter_destroy(&mp->m_ifree);
>  	percpu_counter_destroy(&mp->m_fdblocks);
>  	ASSERT(xfs_is_shutdown(mp) ||
> -	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
> +	       percpu_counter_sum_all(&mp->m_delalloc_blks) == 0);
>  	percpu_counter_destroy(&mp->m_delalloc_blks);
>  	percpu_counter_destroy(&mp->m_frextents);
>  }
> -- 
> 2.31.1
>
  
Dave Chinner March 14, 2023, 10:13 p.m. UTC | #2
On Tue, Mar 14, 2023 at 09:31:00AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 14, 2023 at 05:06:49PM +0800, Ye Bin wrote:
> > From: Ye Bin <yebin10@huawei.com>
> > 
> > There's a issue when do cpu offline test:
> > CPU: 48 PID: 1168152 Comm: umount Kdump: loaded Tainted: G L
> > pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> > pc : assfail+0x8c/0xb4
> > lr : assfail+0x38/0xb4
> > sp : ffffa00033ce7c40
> > x29: ffffa00033ce7c40 x28: ffffa00014794f30
> > x27: ffffa00014f6ca20 x26: 1fffe0120b2e2030
> > x25: ffff009059710188 x24: ffff00886c0a4650
> > x23: 1fffe0110d8148ca x22: ffff009059710180
> > x21: ffffa00015155680 x20: ffff00886c0a4000
> > x19: 0000000000000001 x18: 0000000000000000
> > x17: 0000000000000000 x16: 0000000000000000
> > x15: 0000000000000007 x14: 1fffe00304cef265
> > x13: ffff00182642b200 x12: ffff8012d37757bf
> > x11: 1fffe012d37757be x10: ffff8012d37757be
> > x9 : ffffa00010603a0c x8 : 0000000041b58ab3
> > x7 : ffff94000679cf44 x6 : 00000000ffffffc0
> > x5 : 0000000000000021 x4 : 00000000ffffffca
> > x3 : 1ffff40002a27ee1 x2 : 0000000000000004
> > x1 : 0000000000000000 x0 : ffffa0001513f000
> > Call trace:
> >  assfail+0x8c/0xb4
> >  xfs_destroy_percpu_counters+0x98/0xa4
> >  xfs_fs_put_super+0x1a0/0x2a4
> >  generic_shutdown_super+0x104/0x2c0
> >  kill_block_super+0x8c/0xf4
> >  deactivate_locked_super+0xa4/0x164
> >  deactivate_super+0xb0/0xdc
> >  cleanup_mnt+0x29c/0x3ec
> >  __cleanup_mnt+0x1c/0x30
> >  task_work_run+0xe0/0x200
> >  do_notify_resume+0x244/0x320
> >  work_pending+0xc/0xa0
> > 
> > We analyzed the data in vmcore is correct. But triggered above issue.
> > As f689054aace2 ("percpu_counter: add percpu_counter_sum_all interface")
> > commit describes there is a small race window between the online CPUs traversal
> > of percpu_counter_sum and the CPU offline callback. This means percpu_counter_sum()
> > may return incorrect result during cpu offline.
> > To solve above issue use percpu_counter_sum_all() interface to make sure
> > result is correct to prevent false triggering of assertions.
> 
> How about the other percpu_counter_sum callsites inside XFS?  Some of
> them are involved in writing ondisk metadata (xfs_log_sb) or doing
> correctness checks (fs/xfs/scrub/*); shouldn't those also be using the
> _all variant?

Ugh. I kinda wish that the percpu_counter_sum_all() patch had been
cc'd to lists for subsystems that use percpu_counter_sum()
extensively, or just to people who have modified that code in the
past.

The problem is that it uses cpu_possible_mask, which means it
walks all possible CPUs that can be added to the system even if the
CPUs aren't physically present. That can be a lot in the case of
systems that can have cpu-capable nodes hotplugged into them, and
that makes percpu_counter_sum_all() excitingly expensive for no good
reason.

AFAICT, if we are trying to close a race condition between iterating
online CPUs not summing dying CPUs and the cpu dead notification
updating the sum, then shouldn't we be using
cpu_mask_or(cpu_online_mask, cpu_dying_mask) for summing iteration
rather than just cpu_online_mask?

i.e. when a CPU is being taken down, it gets added to the
cpu_dying_mask, then removed from the cpu_online_mask, then the
offline notifications are run (i.e. the percpu counter dead
callback), and when the CPU reaches the CPUHP_TEARDOWN_CPU state,
it is removed from the cpu_dying_mask because it is now dead and all
the "cpu dying" callbacks have been run.

Except when a hotplug event is being processed, cpu_dying_mask will
be empty, hence there is little change in summing overhead. But it
will close the summing race condition when a CPU is being
offlined...

That, I think, is the solution we want for XFS. Having the percpu
counters just do the right thing is far better than always having to
wonder if summation interface we are using is correct in the face of
CPU hotplug. I'll put a patchset together to do:

1. fix percpu_counter_sum() to include the dying mask in it's
iteration. This should fix the XFS issue.
2. change the only user of percpu_counter_sum_all() to only use
percpu_counter_sum() because percpu_counter_sum_all() is now
redundant.
3. remove percpu_counter_sum_all() because it is unused.

Cheers,

Dave.
  
Darrick J. Wong March 15, 2023, 3:06 a.m. UTC | #3
[add lfsdevel to cc to spread the, um, love]

TLDR: percpu_counter_sum doesn't add in the values from CPUs in the
dying mask.  As a result, the summation can race with cpu hotunplug
and return the wrong values.

On Wed, Mar 15, 2023 at 09:13:05AM +1100, Dave Chinner wrote:
> On Tue, Mar 14, 2023 at 09:31:00AM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 14, 2023 at 05:06:49PM +0800, Ye Bin wrote:
> > > From: Ye Bin <yebin10@huawei.com>
> > > 
> > > There's a issue when do cpu offline test:
> > > CPU: 48 PID: 1168152 Comm: umount Kdump: loaded Tainted: G L
> > > pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> > > pc : assfail+0x8c/0xb4
> > > lr : assfail+0x38/0xb4
> > > sp : ffffa00033ce7c40
> > > x29: ffffa00033ce7c40 x28: ffffa00014794f30
> > > x27: ffffa00014f6ca20 x26: 1fffe0120b2e2030
> > > x25: ffff009059710188 x24: ffff00886c0a4650
> > > x23: 1fffe0110d8148ca x22: ffff009059710180
> > > x21: ffffa00015155680 x20: ffff00886c0a4000
> > > x19: 0000000000000001 x18: 0000000000000000
> > > x17: 0000000000000000 x16: 0000000000000000
> > > x15: 0000000000000007 x14: 1fffe00304cef265
> > > x13: ffff00182642b200 x12: ffff8012d37757bf
> > > x11: 1fffe012d37757be x10: ffff8012d37757be
> > > x9 : ffffa00010603a0c x8 : 0000000041b58ab3
> > > x7 : ffff94000679cf44 x6 : 00000000ffffffc0
> > > x5 : 0000000000000021 x4 : 00000000ffffffca
> > > x3 : 1ffff40002a27ee1 x2 : 0000000000000004
> > > x1 : 0000000000000000 x0 : ffffa0001513f000
> > > Call trace:
> > >  assfail+0x8c/0xb4
> > >  xfs_destroy_percpu_counters+0x98/0xa4
> > >  xfs_fs_put_super+0x1a0/0x2a4
> > >  generic_shutdown_super+0x104/0x2c0
> > >  kill_block_super+0x8c/0xf4
> > >  deactivate_locked_super+0xa4/0x164
> > >  deactivate_super+0xb0/0xdc
> > >  cleanup_mnt+0x29c/0x3ec
> > >  __cleanup_mnt+0x1c/0x30
> > >  task_work_run+0xe0/0x200
> > >  do_notify_resume+0x244/0x320
> > >  work_pending+0xc/0xa0
> > > 
> > > We analyzed the data in vmcore is correct. But triggered above issue.
> > > As f689054aace2 ("percpu_counter: add percpu_counter_sum_all interface")
> > > commit describes there is a small race window between the online CPUs traversal
> > > of percpu_counter_sum and the CPU offline callback. This means percpu_counter_sum()
> > > may return incorrect result during cpu offline.
> > > To solve above issue use percpu_counter_sum_all() interface to make sure
> > > result is correct to prevent false triggering of assertions.
> > 
> > How about the other percpu_counter_sum callsites inside XFS?  Some of
> > them are involved in writing ondisk metadata (xfs_log_sb) or doing
> > correctness checks (fs/xfs/scrub/*); shouldn't those also be using the
> > _all variant?
> 
> Ugh. I kinda wish that the percpu_counter_sum_all() patch had been
> cc'd to lists for subsystems that use percpu_counter_sum()
> extensively, or just to people who have modified that code in the
> past.
> 
> The problem is that it uses cpu_possible_mask, which means it
> walks all possible CPUs that can be added to the system even if the
> CPUs aren't physically present. That can be a lot in the case of
> systems that can have cpu-capable nodes hotplugged into them, and
> that makes percpu_counter_sum_all() excitingly expensive for no good
> reason.
> 
> AFAICT, if we are trying to close a race condition between iterating
> online CPUs not summing dying CPUs and the cpu dead notification
> updating the sum, then shouldn't we be using
> cpu_mask_or(cpu_online_mask, cpu_dying_mask) for summing iteration
> rather than just cpu_online_mask?
> 
> i.e. when a CPU is being taken down, it gets added to the
> cpu_dying_mask, then removed from the cpu_online_mask, then the
> offline notifications are run (i.e. the percpu counter dead
> callback), and when the CPU reaches the CPUHP_TEARDOWN_CPU state,
> it is removed from the cpu_dying_mask because it is now dead and all
> the "cpu dying" callbacks have been run.
> 
> Except when a hotplug event is being processed, cpu_dying_mask will
> be empty, hence there is little change in summing overhead. But it
> will close the summing race condition when a CPU is being
> offlined...
> 
> That, I think, is the solution we want for XFS. Having the percpu
> counters just do the right thing is far better than always having to
> wonder if summation interface we are using is correct in the face of
> CPU hotplug. I'll put a patchset together to do:
> 
> 1. fix percpu_counter_sum() to include the dying mask in it's
> iteration. This should fix the XFS issue.

I took a quick look at ext4 and btrfs usage of percpu_counter_sum.  I
/think/ they're less impacted because most of the usage seems to be for
things like statfs which are inherently racy.  That said, mixing in the
dying mask sounds like a cheap fix.

> 2. change the only user of percpu_counter_sum_all() to only use
> percpu_counter_sum() because percpu_counter_sum_all() is now
> redundant.
> 3. remove percpu_counter_sum_all() because it is unused.

Sounds reasonable to /me. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
  

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2479b5cbd75e..c0ce66f966ee 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1076,7 +1076,7 @@  xfs_destroy_percpu_counters(
 	percpu_counter_destroy(&mp->m_ifree);
 	percpu_counter_destroy(&mp->m_fdblocks);
 	ASSERT(xfs_is_shutdown(mp) ||
-	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
+	       percpu_counter_sum_all(&mp->m_delalloc_blks) == 0);
 	percpu_counter_destroy(&mp->m_delalloc_blks);
 	percpu_counter_destroy(&mp->m_frextents);
 }