[RFC,RESEND,1/1] fs/namespace: defer free_mount from namespace_unlock

Message ID 20230119211455.498968-2-echanude@redhat.com
State New
Headers
Series fs/namespace: defer free_mount from namespace_unlock |

Commit Message

Eric Chanudet Jan. 19, 2023, 9:14 p.m. UTC
  From: Alexander Larsson <alexl@redhat.com>

Use call_rcu to defer releasing the umount'ed or detached filesystem
when calling namepsace_unlock().

Calling synchronize_rcu_expedited() has a significant cost on RT kernel
that default to rcupdate.rcu_normal_after_boot=1.

For example, on a 6.2-rt1 kernel:
perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt
           0.07464 +- 0.00396 seconds time elapsed  ( +-  5.31% )

With this change applied:
perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt
        0.00162604 +- 0.00000637 seconds time elapsed  ( +-  0.39% )

Waiting for the grace period before completing the syscall does not seem
mandatory. The struct mount umount'ed are queued up for release in a
separate list and no longer accessible to following syscalls.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Eric Chanudet <echanude@redhat.com>
---
 fs/namespace.c | 42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)
  

Comments

Al Viro Jan. 19, 2023, 10:09 p.m. UTC | #1
On Thu, Jan 19, 2023 at 04:14:55PM -0500, Eric Chanudet wrote:
> From: Alexander Larsson <alexl@redhat.com>
> 
> Use call_rcu to defer releasing the umount'ed or detached filesystem
> when calling namepsace_unlock().
> 
> Calling synchronize_rcu_expedited() has a significant cost on RT kernel
> that default to rcupdate.rcu_normal_after_boot=1.
> 
> For example, on a 6.2-rt1 kernel:
> perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt
>            0.07464 +- 0.00396 seconds time elapsed  ( +-  5.31% )
> 
> With this change applied:
> perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt
>         0.00162604 +- 0.00000637 seconds time elapsed  ( +-  0.39% )
> 
> Waiting for the grace period before completing the syscall does not seem
> mandatory. The struct mount umount'ed are queued up for release in a
> separate list and no longer accessible to following syscalls.

Again, NAK.  If a filesystem is expected to be shut down by umount(2),
userland expects it to have been already shut down by the time the
syscall returns.

It's not just visibility in namespace; it's "can I pull the disk out?".
Or "can the shutdown get to taking the network down?", for that matter.
  
Alexander Larsson Jan. 20, 2023, 8:43 a.m. UTC | #2
On Thu, Jan 19, 2023 at 11:09 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Jan 19, 2023 at 04:14:55PM -0500, Eric Chanudet wrote:
> > From: Alexander Larsson <alexl@redhat.com>
> >
> > Use call_rcu to defer releasing the umount'ed or detached filesystem
> > when calling namepsace_unlock().
> >
> > Calling synchronize_rcu_expedited() has a significant cost on RT kernel
> > that default to rcupdate.rcu_normal_after_boot=1.
> >
> > For example, on a 6.2-rt1 kernel:
> > perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt
> >            0.07464 +- 0.00396 seconds time elapsed  ( +-  5.31% )
> >
> > With this change applied:
> > perf stat -r 10 --null --pre 'mount -t tmpfs tmpfs mnt' -- umount mnt
> >         0.00162604 +- 0.00000637 seconds time elapsed  ( +-  0.39% )
> >
> > Waiting for the grace period before completing the syscall does not seem
> > mandatory. The struct mount umount'ed are queued up for release in a
> > separate list and no longer accessible to following syscalls.
>
> Again, NAK.  If a filesystem is expected to be shut down by umount(2),
> userland expects it to have been already shut down by the time the
> syscall returns.
>
> It's not just visibility in namespace; it's "can I pull the disk out?".
> Or "can the shutdown get to taking the network down?", for that matter.

In the usecase we're worrying about, all the unmounts are lazy (i.e.
MNT_DETACH). What about delaying the destroy in that case? That seems
in line with the expected behaviour of lazy shutdown. I.e. you can't
rely on it to be settled anyway.
  
Liu, Yujie Jan. 30, 2023, 2:57 a.m. UTC | #3
Greeting,

FYI, we noticed WARNING:inconsistent_lock_state due to commit (built with gcc-11):

commit: fcd28ffda89717f5d68fff9d3260b57f02d93eda ("[RFC PATCH RESEND 1/1] fs/namespace: defer free_mount from namespace_unlock")
url: https://github.com/intel-lab-lkp/linux/commits/Eric-Chanudet/fs-namespace-defer-free_mount-from-namespace_unlock/20230120-052658
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git d368967cb1039b5c4cccb62b5a4b9468c50cd143
patch link: https://lore.kernel.org/all/20230119211455.498968-2-echanude@redhat.com/
patch subject: [RFC PATCH RESEND 1/1] fs/namespace: defer free_mount from namespace_unlock

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


[   13.499613][    C0] WARNING: inconsistent lock state
[   13.500185][    C0] 6.2.0-rc4-00078-gfcd28ffda897 #1 Not tainted
[   13.500852][    C0] --------------------------------
[   13.501426][    C0] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   13.502160][    C0] systemd/1 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 13.502810][ C0] ffffffff84414790 (mount_lock){+.?.}-{2:2}, at: mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) 
[   13.503774][    C0] {SOFTIRQ-ON-W} state was registered at:
[ 13.504400][ C0] __lock_acquire (kernel/locking/lockdep.c:5009) 
[ 13.504961][ C0] lock_acquire (kernel/locking/lockdep.c:466 kernel/locking/lockdep.c:5670 kernel/locking/lockdep.c:5633) 
[ 13.505496][ C0] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) 
[ 13.506033][ C0] vfs_create_mount (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1145) 
[ 13.506594][ C0] vfs_kern_mount (fs/namespace.c:1201) 
[ 13.507214][ C0] kern_mount (fs/namespace.c:4590) 
[ 13.507716][ C0] shmem_init (mm/shmem.c:4060) 
[ 13.508225][ C0] mnt_init (fs/namespace.c:4574) 
[ 13.508723][ C0] vfs_caches_init (fs/dcache.c:3354) 
[ 13.509273][ C0] start_kernel (init/main.c:1131) 
[ 13.509812][ C0] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358) 
[   13.510468][    C0] irq event stamp: 1059406
[ 13.510999][ C0] hardirqs last enabled at (1059406): kasan_quarantine_put (arch/x86/include/asm/irqflags.h:45 (discriminator 1) arch/x86/include/asm/irqflags.h:80 (discriminator 1) arch/x86/include/asm/irqflags.h:138 (discriminator 1) mm/kasan/quarantine.c:242 (discriminator 1)) 
[ 13.512073][ C0] hardirqs last disabled at (1059405): kasan_quarantine_put (mm/kasan/quarantine.c:217 (discriminator 1)) 
[ 13.513123][ C0] softirqs last enabled at (1058838): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:415 kernel/softirq.c:600) 
[ 13.514123][ C0] softirqs last disabled at (1059387): __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650) 
[   13.515161][    C0]
[   13.515161][    C0] other info that might help us debug this:
[   13.516033][    C0]  Possible unsafe locking scenario:
[   13.516033][    C0]
[   13.516857][    C0]        CPU0
[   13.517282][    C0]        ----
[   13.517700][    C0]   lock(mount_lock);
[   13.518184][    C0]   <Interrupt>
[   13.518612][    C0]     lock(mount_lock);
[   13.519118][    C0]
[   13.519118][    C0]  *** DEADLOCK ***
[   13.519118][    C0]
[   13.520048][    C0] 3 locks held by systemd/1:
[ 13.520581][ C0] #0: ffffffff84925200 (rcu_read_lock){....}-{1:2}, at: __is_insn_slot_addr (kernel/kprobes.c:297) 
[ 13.521514][ C0] #1: ffffffff849250e0 (rcu_callback){....}-{0:0}, at: rcu_do_batch (include/linux/rcupdate.h:325 kernel/rcu/tree.c:2241) 
[ 13.522372][ C0] #2: ffffffff84925200 (rcu_read_lock){....}-{1:2}, at: mntput_no_expire (fs/namespace.c:1318) 
[   13.523384][    C0]
[   13.523384][    C0] stack backtrace:
[   13.524062][    C0] CPU: 0 PID: 1 Comm: systemd Not tainted 6.2.0-rc4-00078-gfcd28ffda897 #1
[   13.524985][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[   13.526078][    C0] Call Trace:
[   13.526492][    C0]  <IRQ>
[ 13.526882][ C0] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) 
[ 13.527408][ C0] mark_lock_irq.cold (kernel/locking/lockdep.c:4206 kernel/locking/lockdep.c:3975 kernel/locking/lockdep.c:4178) 
[ 13.527961][ C0] ? lock_chain_count (kernel/locking/lockdep.c:4169) 
[ 13.528512][ C0] ? filter_irq_stacks (kernel/stacktrace.c:114) 
[ 13.529078][ C0] ? save_trace (kernel/locking/lockdep.c:585) 
[ 13.529597][ C0] ? stack_trace_save (kernel/stacktrace.c:123) 
[ 13.530151][ C0] mark_lock+0x4c9/0xac0 
[ 13.530704][ C0] ? mark_lock_irq (kernel/locking/lockdep.c:4593) 
[ 13.531246][ C0] ? kasan_save_stack (mm/kasan/common.c:47) 
[ 13.531804][ C0] ? kasan_save_stack (mm/kasan/common.c:46) 
[ 13.532359][ C0] ? kasan_set_track (mm/kasan/common.c:52) 
[ 13.532901][ C0] ? kasan_save_free_info (mm/kasan/generic.c:520) 
[ 13.533485][ C0] mark_usage (kernel/locking/lockdep.c:4529) 
[ 13.533981][ C0] __lock_acquire (kernel/locking/lockdep.c:5009) 
[ 13.534531][ C0] lock_acquire (kernel/locking/lockdep.c:466 kernel/locking/lockdep.c:5670 kernel/locking/lockdep.c:5633) 
[ 13.535060][ C0] ? mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) 
[ 13.535631][ C0] ? rcu_read_unlock (include/linux/rcupdate.h:793 (discriminator 5)) 
[ 13.536176][ C0] ? rcu_read_unlock (include/linux/rcupdate.h:793 (discriminator 5)) 
[ 13.536718][ C0] ? lock_downgrade (kernel/locking/lockdep.c:5320) 
[ 13.537261][ C0] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154) 
[ 13.537780][ C0] ? mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) 
[ 13.538339][ C0] mntput_no_expire (include/linux/seqlock.h:469 include/linux/seqlock.h:515 include/linux/seqlock.h:541 include/linux/seqlock.h:887 fs/namespace.c:125 fs/namespace.c:1337) 
[ 13.538901][ C0] ? rcu_read_unlock (include/linux/rcupdate.h:793 (discriminator 5)) 
[ 13.539451][ C0] ? lock_downgrade (kernel/locking/lockdep.c:5320) 
[ 13.539999][ C0] ? slab_free_freelist_hook (mm/slub.c:1807) 
[ 13.540609][ C0] ? mnt_get_count (fs/namespace.c:1318) 
[ 13.541158][ C0] ? lock_is_held_type (kernel/locking/lockdep.c:5409 kernel/locking/lockdep.c:5711) 
[ 13.541731][ C0] delayed_mount_release (fs/namespace.c:1607) 
[ 13.542305][ C0] rcu_do_batch (include/linux/rcupdate.h:330 kernel/rcu/tree.c:2248) 
[ 13.543383][ C0] ? rcu_implicit_dynticks_qs (kernel/rcu/tree.c:2185) 
[ 13.544036][ C0] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) 
[ 13.544610][ C0] ? _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) 
[ 13.545254][ C0] ? rcu_report_qs_rdp (kernel/rcu/tree.c:2050) 
[ 13.545826][ C0] rcu_core (kernel/rcu/tree.c:2508) 
[ 13.546315][ C0] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:572) 
[ 13.547696][ C0] __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650) 
[ 13.548241][ C0] irq_exit_rcu (kernel/softirq.c:664) 
[ 13.548741][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1107 (discriminator 14)) 
[   13.549370][    C0]  </IRQ>
[   13.549757][    C0]  <TASK>
[ 13.550132][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649) 
[ 13.550793][ C0] RIP: 0010:lock_acquire (kernel/locking/lockdep.c:5636) 
[ 13.551387][ C0] Code: ff ff 48 83 c4 20 65 0f c1 05 af 26 d0 7e 83 f8 01 0f 85 ae 02 00 00 48 83 7c 24 08 00 74 01 fb 48 b8 00 00 00 00 00 fc ff df <48> 01 c3 48 c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 48 8b 84 24
All code
========
   0:	ff                   	(bad)  
   1:	ff 48 83             	decl   -0x7d(%rax)
   4:	c4                   	(bad)  
   5:	20 65 0f             	and    %ah,0xf(%rbp)
   8:	c1 05 af 26 d0 7e 83 	roll   $0x83,0x7ed026af(%rip)        # 0x7ed026be
   f:	f8                   	clc    
  10:	01 0f                	add    %ecx,(%rdi)
  12:	85 ae 02 00 00 48    	test   %ebp,0x48000002(%rsi)
  18:	83 7c 24 08 00       	cmpl   $0x0,0x8(%rsp)
  1d:	74 01                	je     0x20
  1f:	fb                   	sti    
  20:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  27:	fc ff df 
  2a:*	48 01 c3             	add    %rax,%rbx		<-- trapping instruction
  2d:	48 c7 03 00 00 00 00 	movq   $0x0,(%rbx)
  34:	48 c7 43 08 00 00 00 	movq   $0x0,0x8(%rbx)
  3b:	00 
  3c:	48                   	rex.W
  3d:	8b                   	.byte 0x8b
  3e:	84                   	.byte 0x84
  3f:	24                   	.byte 0x24

Code starting with the faulting instruction
===========================================
   0:	48 01 c3             	add    %rax,%rbx
   3:	48 c7 03 00 00 00 00 	movq   $0x0,(%rbx)
   a:	48 c7 43 08 00 00 00 	movq   $0x0,0x8(%rbx)
  11:	00 
  12:	48                   	rex.W
  13:	8b                   	.byte 0x8b
  14:	84                   	.byte 0x84
  15:	24                   	.byte 0x24


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <yujie.liu@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202301301039.8bdb3db5-yujie.liu@intel.com


To reproduce:

        # build kernel
	cd linux
	cp config-6.2.0-rc4-00078-gfcd28ffda897 .config
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
	cd <mod-install-dir>
	find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
  

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index ab467ee58341..11d219a6e83c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -44,6 +44,11 @@  static unsigned int m_hash_shift __read_mostly;
 static unsigned int mp_hash_mask __read_mostly;
 static unsigned int mp_hash_shift __read_mostly;
 
+struct mount_delayed_release {
+	struct rcu_head rcu;
+	struct hlist_head release_list;
+};
+
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
 {
@@ -1582,11 +1587,31 @@  int may_umount(struct vfsmount *mnt)
 
 EXPORT_SYMBOL(may_umount);
 
-static void namespace_unlock(void)
+static void free_mounts(struct hlist_head *mount_list)
 {
-	struct hlist_head head;
 	struct hlist_node *p;
 	struct mount *m;
+
+	hlist_for_each_entry_safe(m, p, mount_list, mnt_umount) {
+		hlist_del(&m->mnt_umount);
+		mntput(&m->mnt);
+	}
+}
+
+static void delayed_mount_release(struct rcu_head *head)
+{
+	struct mount_delayed_release *drelease =
+		container_of(head, struct mount_delayed_release, rcu);
+
+	free_mounts(&drelease->release_list);
+	kfree(drelease);
+}
+
+static void namespace_unlock(void)
+{
+	struct hlist_head head;
+	struct mount_delayed_release *drelease;
+
 	LIST_HEAD(list);
 
 	hlist_move_list(&unmounted, &head);
@@ -1599,12 +1624,15 @@  static void namespace_unlock(void)
 	if (likely(hlist_empty(&head)))
 		return;
 
-	synchronize_rcu_expedited();
-
-	hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
-		hlist_del(&m->mnt_umount);
-		mntput(&m->mnt);
+	drelease = kmalloc(sizeof(*drelease), GFP_KERNEL);
+	if (unlikely(!drelease)) {
+		synchronize_rcu_expedited();
+		free_mounts(&head);
+		return;
 	}
+
+	hlist_move_list(&head, &drelease->release_list);
+	call_rcu(&drelease->rcu, delayed_mount_release);
 }
 
 static inline void namespace_lock(void)