[v4,4/5] fsnotify: allow sleepable child flag update

Message ID 20221111220614.991928-5-stephen.s.brennan@oracle.com
State New
Headers
Series fsnotify: fix softlockups iterating over d_subdirs |

Commit Message

Stephen Brennan Nov. 11, 2022, 10:06 p.m. UTC
  With very large d_subdirs lists, iteration can take a long time. Since
iteration needs to hold alias->d_lock, this can trigger soft lockups.
It would be best to make this iteration sleepable. We can drop
alias->d_lock and sleep, by taking a reference to the current child.
However, we need to be careful, since it's possible for the child's
list pointers to be modified once we drop alias->d_lock. The following
are the only cases where the list pointers are modified:

1. dentry_unlist() in fs/dcache.c

   This is a helper of dentry_kill(). This function is quite careful to
   check the reference count of the dentry once it has taken the
   requisite locks, and bail out if a new reference was taken. So we
   can be safe that, assuming we found the dentry and took a reference
   before dropping alias->d_lock, any concurrent dentry_kill() should
   bail out and leave our list pointers untouched.

2. __d_move() in fs/dcache.c

   If the child was moved to a new parent, then we can detect this by
   testing d_parent and retrying the iteration.

3. Initialization code in d_alloc() family

   We are safe from this code, since we cannot encounter a dentry until
   it has been initialized.

4. Cursor code in fs/libfs.c for dcache_readdir()

   Dentries with DCACHE_DENTRY_CURSOR should be skipped before sleeping,
   since we could awaken to find they have skipped over a portion of the
   child list.

Given these considerations, we can carefully write a loop that iterates
over d_subdirs and is capable of going to sleep periodically.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v4:
    - I've updated this patch so it should be safe even without the
      inode locked, by handling cursors and d_move() races.
    - Moved comments to lower indentation
    - I didn't keep Amir's R-b since this was a fair bit of change.
    v3:
    - removed if statements around dput()
    v2:
    - added a check for child->d_parent != alias and retry logic

 fs/notify/fsnotify.c | 46 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)
  

Comments

Amir Goldstein Nov. 12, 2022, 10 a.m. UTC | #1
On Sat, Nov 12, 2022 at 12:06 AM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> With very large d_subdirs lists, iteration can take a long time. Since
> iteration needs to hold alias->d_lock, this can trigger soft lockups.
> It would be best to make this iteration sleepable. We can drop
> alias->d_lock and sleep, by taking a reference to the current child.
> However, we need to be careful, since it's possible for the child's
> list pointers to be modified once we drop alias->d_lock. The following
> are the only cases where the list pointers are modified:
>
> 1. dentry_unlist() in fs/dcache.c
>
>    This is a helper of dentry_kill(). This function is quite careful to
>    check the reference count of the dentry once it has taken the
>    requisite locks, and bail out if a new reference was taken. So we
>    can be safe that, assuming we found the dentry and took a reference
>    before dropping alias->d_lock, any concurrent dentry_kill() should
>    bail out and leave our list pointers untouched.
>
> 2. __d_move() in fs/dcache.c
>
>    If the child was moved to a new parent, then we can detect this by
>    testing d_parent and retrying the iteration.
>
> 3. Initialization code in d_alloc() family
>
>    We are safe from this code, since we cannot encounter a dentry until
>    it has been initialized.
>
> 4. Cursor code in fs/libfs.c for dcache_readdir()
>
>    Dentries with DCACHE_DENTRY_CURSOR should be skipped before sleeping,
>    since we could awaken to find they have skipped over a portion of the
>    child list.
>
> Given these considerations, we can carefully write a loop that iterates
> over d_subdirs and is capable of going to sleep periodically.
>
> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> ---
>
> Notes:
>     v4:
>     - I've updated this patch so it should be safe even without the
>       inode locked, by handling cursors and d_move() races.
>     - Moved comments to lower indentation
>     - I didn't keep Amir's R-b since this was a fair bit of change.
>     v3:
>     - removed if statements around dput()
>     v2:
>     - added a check for child->d_parent != alias and retry logic
>
>  fs/notify/fsnotify.c | 46 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 409d479cbbc6..0ba61211456c 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -105,7 +105,7 @@ void fsnotify_sb_delete(struct super_block *sb)
>   */
>  void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
>  {
> -       struct dentry *alias, *child;
> +       struct dentry *alias, *child, *last_ref = NULL;
>
>         if (!S_ISDIR(inode->i_mode))
>                 return;
> @@ -116,12 +116,49 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
>                 return;
>
>         /*
> -        * run all of the children of the original inode and fix their
> -        * d_flags to indicate parental interest (their parent is the
> -        * original inode)
> +        * These lists can get very long, so we may need to sleep during
> +        * iteration. Normally this would be impossible without a cursor,
> +        * but since we have the inode locked exclusive, we're guaranteed

Not exactly true for v4 patchset order, but I do prefer that we make it true.

> +        * that the directory won't be modified, so whichever dentry we
> +        * pick to sleep on won't get moved. So, start a manual iteration
> +        * over d_subdirs which will allow us to sleep.
>          */
>         spin_lock(&alias->d_lock);
> +retry:

Better if we can avoid this retry by inode lock.
Note that it is enough to take inode_lock_shared() to protect
this code.

It means that tasks doing remove+add parent watch may
iterate d_subdirs in parallel, but maybe that's even better
then having them iterate d_subdirs sequentially?

>         list_for_each_entry(child, &alias->d_subdirs, d_child) {
> +               /*
> +                * We need to hold a reference while we sleep. But we cannot
> +                * sleep holding a reference to a cursor, or we risk skipping
> +                * through the list.
> +                *
> +                * When we wake, dput() could free the dentry, invalidating the
> +                * list pointers.  We can't look at the list pointers until we
> +                * re-lock the parent, and we can't dput() once we have the
> +                * parent locked.  So the solution is to hold onto our reference
> +                * and free it the *next* time we drop alias->d_lock: either at
> +                * the end of the function, or at the time of the next sleep.
> +                */

My usual nit picking: you could concat this explanation to the
comment outside the loop. It won't make it any less clear, maybe
even more clear in the wider context of how the children are safely
iterated.

Thanks,
Amir.
  
kernel test robot Nov. 15, 2022, 7:10 a.m. UTC | #2
Hi, Stephen Brennan, we made report upon v1 patch
https://lore.kernel.org/all/202210271500.731e3808-yujie.liu@intel.com/
now we noticed similar issue still on v4. FYI.


Greeting,

FYI, we noticed WARNING:possible_recursive_locking_detected due to commit (built with gcc-11):

commit: 5482581c7c96a8fe60bb29c181e86cdc9889b068 ("[PATCH v4 4/5] fsnotify: allow sleepable child flag update")
url: https://github.com/intel-lab-lkp/linux/commits/Stephen-Brennan/fsnotify-clear-PARENT_WATCHED-flags-lazily/20221112-070656
base: https://git.kernel.org/cgit/linux/kernel/git/jack/linux-fs.git fsnotify
patch subject: [PATCH v4 4/5] fsnotify: allow sleepable child flag update

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202211151309.1fdf8fd0-oliver.sang@intel.com


[   16.419535][  T369] WARNING: possible recursive locking detected
[   16.420080][  T369] 6.0.0-rc4-00068-g5482581c7c96 #1 Not tainted
[   16.420595][  T369] --------------------------------------------
[   16.421105][  T369] dnsmasq/369 is trying to acquire lock:
[ 16.421578][ T369] bfaa3858 (&dentry->d_lock){+.+.}-{2:2}, at: lockref_get (??:?) 
[   16.422251][  T369]
[   16.422251][  T369] but task is already holding lock:
[ 16.422868][ T369] bf804d60 (&dentry->d_lock){+.+.}-{2:2}, at: fsnotify_update_children_dentry_flags (??:?) 
[   16.423733][  T369]
[   16.423733][  T369] other info that might help us debug this:
[   16.424438][  T369]  Possible unsafe locking scenario:
[   16.424438][  T369]
[   16.425082][  T369]        CPU0
[   16.425376][  T369]        ----
[   16.425660][  T369]   lock(&dentry->d_lock);
[   16.426052][  T369]   lock(&dentry->d_lock);
[   16.426439][  T369]
[   16.426439][  T369]  *** DEADLOCK ***
[   16.426439][  T369]
[   16.427109][  T369]  May be due to missing lock nesting notation
[   16.427109][  T369]
[   16.427808][  T369] 2 locks held by dnsmasq/369:
[ 16.428219][ T369] #0: 409474b0 (&group->mark_mutex){+.+.}-{3:3}, at: inotify_update_watch (inotify_user.c:?) 
[ 16.429047][ T369] #1: bf804d60 (&dentry->d_lock){+.+.}-{2:2}, at: fsnotify_update_children_dentry_flags (??:?) 
[   16.430016][  T369]
[   16.430016][  T369] stack backtrace:
[   16.430538][  T369] CPU: 1 PID: 369 Comm: dnsmasq Not tainted 6.0.0-rc4-00068-g5482581c7c96 #1
[   16.431276][  T369] Call Trace:
[ 16.431560][ T369] ? dump_stack_lvl (??:?) 
[ 16.431990][ T369] ? dump_stack (??:?) 
[ 16.432347][ T369] ? validate_chain.cold (lockdep.c:?) 
[ 16.432791][ T369] ? __lock_acquire (lockdep.c:?) 
[ 16.433233][ T369] ? lock_acquire (??:?) 
[ 16.433651][ T369] ? lockref_get (??:?) 
[ 16.434034][ T369] ? __lock_release (lockdep.c:?) 
[ 16.434454][ T369] ? fsnotify_update_children_dentry_flags (??:?) 
[ 16.435069][ T369] ? _raw_spin_lock (??:?) 
[ 16.435465][ T369] ? lockref_get (??:?) 
[ 16.435832][ T369] ? lockref_get (??:?) 
[ 16.436190][ T369] ? fsnotify_update_children_dentry_flags (??:?) 
[ 16.436753][ T369] ? fsnotify_recalc_mask (??:?) 
[ 16.437229][ T369] ? fsnotify_add_mark_locked (??:?) 
[ 16.437716][ T369] ? inotify_new_watch (inotify_user.c:?) 
[ 16.438166][ T369] ? inotify_update_watch (inotify_user.c:?) 
[ 16.438627][ T369] ? __ia32_sys_inotify_add_watch (??:?) 
[ 16.439183][ T369] ? do_int80_syscall_32 (??:?) 
[ 16.439640][ T369] ? entry_INT80_32 (entry_32.o:?) 
LKP: ttyS0: 641: Kernel tests: Boot OK!
LKP: ttyS0: 641: HOSTNAME vm-snb, MAC 52:54:00:12:34:56, kernel 6.0.0-rc4-00068-g5482581c7c96 1
LKP: ttyS0: 641:  /lkp/lkp/src/bin/run-lkp /lkp/jobs/scheduled/vm-meta-42/boot-1-openwrt-i386-generic-20190428.cgz-5482581c7c96a8fe60bb29c181e86cdc9889b068-20221115-34610-11px8cv-40.yaml
[   18.633802][  T655] mkdir: can't create directory '/sys/kernel/debug': Operation not permitted
[   18.633802][  T655] mount: mounting debug on /sys/kernel/debug failed: No such file or directory
[   21.139501][  T856] process 856 (wait) attempted a POSIX timer syscall while CONFIG_POSIX_TIMERS is not set
[   32.651080][  T957] rsync (957) used greatest stack depth: 6208 bytes left
LKP: ttyS0: 641: LKP: rebooting forcely
[   42.464678][  T641] sysrq: Emergency Sync
[   42.465117][  T641] sysrq: Resetting

Kboot worker: lkp-worker53
Elapsed time: 60

kvm=(
qemu-system-x86_64
-enable-kvm
-cpu SandyBridge
-kernel $kernel
-initrd initrd-vm-meta-42.cgz
-m 16384
-smp 2
-device e1000,netdev=net0
-netdev user,id=net0,hostfwd=tcp::32032-:22
-boot order=nc
-no-reboot
-device i6300esb
-watchdog-action debug
-rtc base=localtime
-serial stdio
-display none
-monitor null
)

append=(
ip=::::vm-meta-42::dhcp
root=/dev/ram0
RESULT_ROOT=/result/boot/1/vm-snb/openwrt-i386-generic-20190428.cgz/i386-randconfig-a011-20210930/gcc-11/5482581c7c96a8fe60bb29c181e86cdc9889b068/73
BOOT_IMAGE=/pkg/linux/i386-randconfig-a011-20210930/gcc-11/5482581c7c96a8fe60bb29c181e86cdc9889b068/vmlinuz-6.0.0-rc4-00068-g5482581c7c96
branch=linux-review/Stephen-Brennan/fsnotify-clear-PARENT_WATCHED-flags-lazily/20221112-070656
job=/job-script
user=lkp
ARCH=i386
kconfig=i386-randconfig-a011-20210930
commit=5482581c7c96a8fe60bb29c181e86cdc9889b068
vmalloc=256M
initramfs_async=0
page_owner=on
initcall_debug
max_uptime=600
result_service=tmpfs
selinux=0
debug
apic=debug


To reproduce:

        # build kernel
	cd linux
	cp config-6.0.0-rc4-00068-g5482581c7c96 .config
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
	cd <mod-install-dir>
	find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
  

Patch

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 409d479cbbc6..0ba61211456c 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -105,7 +105,7 @@  void fsnotify_sb_delete(struct super_block *sb)
  */
 void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
 {
-	struct dentry *alias, *child;
+	struct dentry *alias, *child, *last_ref = NULL;
 
 	if (!S_ISDIR(inode->i_mode))
 		return;
@@ -116,12 +116,49 @@  void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
 		return;
 
 	/*
-	 * run all of the children of the original inode and fix their
-	 * d_flags to indicate parental interest (their parent is the
-	 * original inode)
+	 * These lists can get very long, so we may need to sleep during
+	 * iteration. Normally this would be impossible without a cursor,
+	 * but since we have the inode locked exclusive, we're guaranteed
+	 * that the directory won't be modified, so whichever dentry we
+	 * pick to sleep on won't get moved. So, start a manual iteration
+	 * over d_subdirs which will allow us to sleep.
 	 */
 	spin_lock(&alias->d_lock);
+retry:
 	list_for_each_entry(child, &alias->d_subdirs, d_child) {
+		/*
+		 * We need to hold a reference while we sleep. But we cannot
+		 * sleep holding a reference to a cursor, or we risk skipping
+		 * through the list.
+		 *
+		 * When we wake, dput() could free the dentry, invalidating the
+		 * list pointers.  We can't look at the list pointers until we
+		 * re-lock the parent, and we can't dput() once we have the
+		 * parent locked.  So the solution is to hold onto our reference
+		 * and free it the *next* time we drop alias->d_lock: either at
+		 * the end of the function, or at the time of the next sleep.
+		 */
+		if (child->d_flags & DCACHE_DENTRY_CURSOR)
+			continue;
+		if (need_resched()) {
+			dget(child);
+			spin_unlock(&alias->d_lock);
+			dput(last_ref);
+			last_ref = child;
+			cond_resched();
+			spin_lock(&alias->d_lock);
+			/* Check for races with __d_move() */
+			if (child->d_parent != alias)
+				goto retry;
+		}
+
+		/*
+		 * This check is after the cond_resched() for a reason: it is
+		 * safe to sleep holding a negative dentry reference. If the
+		 * directory contained many negative dentries, and we skipped
+		 * them checking need_resched(), we may never end up sleeping
+		 * where necessary, and could trigger a soft lockup.
+		 */
 		if (!child->d_inode)
 			continue;
 
@@ -133,6 +170,7 @@  void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
 		spin_unlock(&child->d_lock);
 	}
 	spin_unlock(&alias->d_lock);
+	dput(last_ref);
 	dput(alias);
 }