[RFC] workqueue: allow system workqueue be used in memory reclaim

Message ID 20231108012821.56104-1-junxiao.bi@oracle.com
State New
Series [RFC] workqueue: allow system workqueue be used in memory reclaim

Commit Message

Junxiao Bi Nov. 8, 2023, 1:28 a.m. UTC
The following deadlock was triggered on Intel IMSM raid1 volumes.

The sequence of events is this:

1. Memory reclaim was waiting for the xfs journal to be flushed and got
stuck behind the md flush work.

2. The md flush work was queued onto the "md" workqueue but never got
executed: no kworker thread could be created, and the rescuer thread was
busy executing the md flush work of another md disk, stuck because the
"MD_SB_CHANGE_PENDING" flag was set.

3. That flag had been set by an md write process that was asking to
update the md superblock to change the in_sync status to 0. It used
kernfs_notify() to ask the "mdmon" process to update the superblock and
then waited for the flag to be cleared.

4. But "mdmon" was never woken up, because kernfs_notify() relies on the
system-wide workqueue "system_wq" to deliver the notification, and since
that workqueue has no rescuer thread, the notification never happened.
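
For reference, a simplified sketch of that notification path (modeled on
the pattern in fs/kernfs/file.c, with list management and locking elided)
shows where system_wq enters the picture:

#include <linux/kernfs.h>
#include <linux/workqueue.h>

/*
 * Simplified sketch, modeled on fs/kernfs/file.c.  The point is that
 * the notification is delivered by a work item queued onto system_wq,
 * which has no rescuer thread.
 */
static void kernfs_notify_workfn(struct work_struct *work)
{
        /* walk the pending nodes, wake poll() waiters, kick fsnotify */
}

static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);

void kernfs_notify(struct kernfs_node *kn)
{
        /* record kn on the pending list (elided), then kick the worker */
        schedule_work(&kernfs_notify_work);     /* queues onto system_wq */
}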

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Tejun Heo Nov. 9, 2023, 6:58 p.m. UTC | #1
Hello,

On Tue, Nov 07, 2023 at 05:28:21PM -0800, Junxiao Bi wrote:
> The following deadlock was triggered on Intel IMSM raid1 volumes.
>
> The sequence of events is this:
>
> 1. Memory reclaim was waiting for the xfs journal to be flushed and got
> stuck behind the md flush work.
>
> 2. The md flush work was queued onto the "md" workqueue but never got
> executed: no kworker thread could be created, and the rescuer thread was
> busy executing the md flush work of another md disk, stuck because the
> "MD_SB_CHANGE_PENDING" flag was set.
>
> 3. That flag had been set by an md write process that was asking to
> update the md superblock to change the in_sync status to 0. It used
> kernfs_notify() to ask the "mdmon" process to update the superblock and
> then waited for the flag to be cleared.
>
> 4. But "mdmon" was never woken up, because kernfs_notify() relies on the
> system-wide workqueue "system_wq" to deliver the notification, and since
> that workqueue has no rescuer thread, the notification never happened.

Things like this can't be fixed by adding RECLAIM to system_wq because
system_wq is shared and someone else might occupy that rescuer thread. The
flag doesn't guarantee unlimited forward progress. It only guarantees
forward progress of one work item.

That seems to be where the problem is in #2 in the first place. If a work
item is required during memory reclaim, it must have guaranteed forward
progress but it looks like that's waiting for someone else who can end up
waiting for userspace?

You'll need to untangle the dependencies earlier.

Thanks.
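
For illustration, the rescuer guarantee described above is per workqueue:
a path that must make progress during memory reclaim would normally
allocate its own WQ_MEM_RECLAIM workqueue instead of relying on the
shared system_wq. A minimal sketch (all names below are hypothetical):

#include <linux/init.h>
#include <linux/workqueue.h>

/*
 * Hypothetical example of a subsystem that needs forward progress
 * during memory reclaim: it allocates a dedicated workqueue, which
 * gives it its own rescuer thread.
 */
static struct workqueue_struct *my_reclaim_wq;  /* hypothetical */
static struct work_struct my_work;              /* hypothetical */

static void my_workfn(struct work_struct *work)
{
        /*
         * Must not block on anything that itself depends on memory
         * reclaim (or, as in steps 3/4 above, on userspace).
         */
}

static int __init my_init(void)
{
        my_reclaim_wq = alloc_workqueue("my_reclaim", WQ_MEM_RECLAIM, 0);
        if (!my_reclaim_wq)
                return -ENOMEM;

        INIT_WORK(&my_work, my_workfn);
        queue_work(my_reclaim_wq, &my_work);
        return 0;
}

Even then, the rescuer only guarantees that one work item at a time makes
forward progress, so the work function itself must not end up waiting on
something outside the workqueue's control.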
  
Junxiao Bi Nov. 9, 2023, 11:36 p.m. UTC | #2
On 11/9/23 10:58 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 07, 2023 at 05:28:21PM -0800, Junxiao Bi wrote:
>> The following deadlock was triggered on Intel IMSM raid1 volumes.
>>
>> The sequence of events is this:
>>
>> 1. Memory reclaim was waiting for the xfs journal to be flushed and got
>> stuck behind the md flush work.
>>
>> 2. The md flush work was queued onto the "md" workqueue but never got
>> executed: no kworker thread could be created, and the rescuer thread was
>> busy executing the md flush work of another md disk, stuck because the
>> "MD_SB_CHANGE_PENDING" flag was set.
>>
>> 3. That flag had been set by an md write process that was asking to
>> update the md superblock to change the in_sync status to 0. It used
>> kernfs_notify() to ask the "mdmon" process to update the superblock and
>> then waited for the flag to be cleared.
>>
>> 4. But "mdmon" was never woken up, because kernfs_notify() relies on the
>> system-wide workqueue "system_wq" to deliver the notification, and since
>> that workqueue has no rescuer thread, the notification never happened.
> Things like this can't be fixed by adding RECLAIM to system_wq because
> system_wq is shared and someone else might occupy that rescuer thread. The
> flag doesn't guarantee unlimited forward progress. It only guarantees
> forward progress of one work item.
>
> That seems to be where the problem is in #2 in the first place. If a work
> item is required during memory reclaim, it must have guaranteed forward
> progress but it looks like that's waiting for someone else who can end up
> waiting for userspace?
>
> You'll need to untangle the dependencies earlier.
Makes sense. Thanks a lot for the comments.
>
> Thanks.
>
  
kernel test robot Nov. 16, 2023, 7:39 a.m. UTC | #3
Hello,

kernel test robot noticed "WARNING:possible_circular_locking_dependency_detected" on:

commit: c8c183493c1dcc874a9d903cb6ba685c98f6c12a ("[RFC] workqueue: allow system workqueue be used in memory reclaim")
url: https://github.com/intel-lab-lkp/linux/commits/Junxiao-Bi/workqueue-allow-system-workqueue-be-used-in-memory-reclaim/20231108-093107
base: https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git for-next
patch link: https://lore.kernel.org/all/20231108012821.56104-1-junxiao.bi@oracle.com/
patch subject: [RFC] workqueue: allow system workqueue be used in memory reclaim

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311161556.59af3ec9-oliver.sang@intel.com


[    6.524239][    T9] WARNING: possible circular locking dependency detected
[    6.524787][    T9] 6.6.0-rc6-00056-gc8c183493c1d #1 Not tainted
[    6.525271][    T9] ------------------------------------------------------
[    6.525606][    T9] kworker/0:1/9 is trying to acquire lock:
[ 6.525606][ T9] ffffffff88f6f480 (cpu_hotplug_lock){++++}-{0:0}, at: vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025) 
[    6.525606][    T9]
[    6.525606][    T9] but task is already holding lock:
[ 6.525606][ T9] ffff888110aa7d88 ((shepherd).work){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:2606) 
[    6.525606][    T9]
[    6.525606][    T9] which lock already depends on the new lock.
[    6.525606][    T9]
[    6.525606][    T9] the existing dependency chain (in reverse order) is:
[    6.525606][    T9]
[    6.525606][    T9] -> #2 ((shepherd).work){+.+.}-{0:0}:
[ 6.525606][ T9] __lock_acquire (kernel/locking/lockdep.c:5136) 
[ 6.525606][ T9] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) 
[ 6.525606][ T9] process_one_work (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:444 include/linux/jump_label.h:260 include/linux/jump_label.h:270 include/trace/events/workqueue.h:82 kernel/workqueue.c:2629) 
[ 6.525606][ T9] worker_thread (kernel/workqueue.c:2697 kernel/workqueue.c:2784) 
[ 6.525606][ T9] kthread (kernel/kthread.c:388) 
[ 6.525606][ T9] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 6.525606][ T9] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 
[    6.525606][    T9]
[    6.525606][    T9] -> #1 ((wq_completion)events){+.+.}-{0:0}:
[ 6.525606][ T9] __lock_acquire (kernel/locking/lockdep.c:5136) 
[ 6.525606][ T9] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) 
[ 6.525606][ T9] start_flush_work (kernel/workqueue.c:3383) 
[ 6.525606][ T9] __flush_work (kernel/workqueue.c:3406) 
[ 6.525606][ T9] schedule_on_each_cpu (kernel/workqueue.c:3668 (discriminator 3)) 
[ 6.525606][ T9] rcu_tasks_one_gp (kernel/rcu/rcu.h:109 kernel/rcu/tasks.h:587) 
[ 6.525606][ T9] rcu_tasks_kthread (kernel/rcu/tasks.h:625 (discriminator 1)) 
[ 6.525606][ T9] kthread (kernel/kthread.c:388) 
[ 6.525606][ T9] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 6.525606][ T9] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 
[    6.525606][    T9]
[    6.525606][    T9] -> #0 (cpu_hotplug_lock){++++}-{0:0}:
[ 6.525606][ T9] check_prev_add (kernel/locking/lockdep.c:3135) 
[ 6.525606][ T9] validate_chain (kernel/locking/lockdep.c:3254 kernel/locking/lockdep.c:3868) 
[ 6.525606][ T9] __lock_acquire (kernel/locking/lockdep.c:5136) 
[ 6.525606][ T9] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) 
[ 6.525606][ T9] cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:489) 
[ 6.525606][ T9] vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025) 
[ 6.525606][ T9] process_one_work (kernel/workqueue.c:2635) 
[ 6.525606][ T9] worker_thread (kernel/workqueue.c:2697 kernel/workqueue.c:2784) 
[ 6.525606][ T9] kthread (kernel/kthread.c:388) 
[ 6.525606][ T9] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 6.525606][ T9] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 
[    6.525606][    T9]
[    6.525606][    T9] other info that might help us debug this:
[    6.525606][    T9]
[    6.525606][    T9] Chain exists of:
[    6.525606][    T9]   cpu_hotplug_lock --> (wq_completion)events --> (shepherd).work
[    6.525606][    T9]
[    6.525606][    T9]  Possible unsafe locking scenario:
[    6.525606][    T9]
[    6.525606][    T9]        CPU0                    CPU1
[    6.525606][    T9]        ----                    ----
[    6.525606][    T9]   lock((shepherd).work);
[    6.525606][    T9]                                lock((wq_completion)events);
[    6.525606][    T9]                                lock((shepherd).work);
[    6.525606][    T9]   rlock(cpu_hotplug_lock);
[    6.525606][    T9]
[    6.525606][    T9]  *** DEADLOCK ***
[    6.525606][    T9]
[    6.525606][    T9] 2 locks held by kworker/0:1/9:
[ 6.525606][ T9] #0: ffff88810007cd48 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:2603) 
[ 6.525606][ T9] #1: ffff888110aa7d88 ((shepherd).work){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:2606) 
[    6.525606][    T9]
[    6.525606][    T9] stack backtrace:
[    6.525606][    T9] CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc6-00056-gc8c183493c1d #1
[    6.525606][    T9] Workqueue: events vmstat_shepherd
[    6.525606][    T9] Call Trace:
[    6.525606][    T9]  <TASK>
[ 6.525606][ T9] dump_stack_lvl (lib/dump_stack.c:107) 
[ 6.525606][ T9] check_noncircular (kernel/locking/lockdep.c:2187) 
[ 6.525606][ T9] ? print_circular_bug (kernel/locking/lockdep.c:2163) 
[ 6.525606][ T9] ? stack_trace_save (kernel/stacktrace.c:123) 
[ 6.525606][ T9] ? stack_trace_snprint (kernel/stacktrace.c:114) 
[ 6.525606][ T9] check_prev_add (kernel/locking/lockdep.c:3135) 
[ 6.525606][ T9] validate_chain (kernel/locking/lockdep.c:3254 kernel/locking/lockdep.c:3868) 
[ 6.525606][ T9] ? check_prev_add (kernel/locking/lockdep.c:3824) 
[ 6.525606][ T9] ? hlock_class (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228) 
[ 6.525606][ T9] ? mark_lock (kernel/locking/lockdep.c:4655 (discriminator 3)) 
[ 6.525606][ T9] __lock_acquire (kernel/locking/lockdep.c:5136) 
[ 6.525606][ T9] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755) 
[ 6.525606][ T9] ? vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025) 
[ 6.525606][ T9] ? lock_sync (kernel/locking/lockdep.c:5721) 
[ 6.525606][ T9] ? debug_object_active_state (lib/debugobjects.c:772) 
[ 6.525606][ T9] ? __cant_migrate (kernel/sched/core.c:10142) 
[ 6.525606][ T9] cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:489) 
[ 6.525606][ T9] ? vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025) 
[ 6.525606][ T9] vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025) 
[ 6.525606][ T9] process_one_work (kernel/workqueue.c:2635) 
[ 6.525606][ T9] ? worker_thread (kernel/workqueue.c:2740) 
[ 6.525606][ T9] ? show_pwq (kernel/workqueue.c:2539) 
[ 6.525606][ T9] ? assign_work (kernel/workqueue.c:1096) 
[ 6.525606][ T9] worker_thread (kernel/workqueue.c:2697 kernel/workqueue.c:2784) 
[ 6.525606][ T9] ? __kthread_parkme (kernel/kthread.c:293 (discriminator 3)) 
[ 6.525606][ T9] ? schedule (arch/x86/include/asm/bitops.h:207 (discriminator 1) arch/x86/include/asm/bitops.h:239 (discriminator 1) include/linux/thread_info.h:184 (discriminator 1) include/linux/sched.h:2255 (discriminator 1) kernel/sched/core.c:6773 (discriminator 1)) 
[ 6.525606][ T9] ? process_one_work (kernel/workqueue.c:2730) 
[ 6.525606][ T9] kthread (kernel/kthread.c:388) 
[ 6.525606][ T9] ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202) 


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231116/202311161556.59af3ec9-oliver.sang@intel.com
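
The cycle reported above appears to follow from giving system_wq a
rescuer: when a workqueue has a rescuer, flushing any single work item
also records a lockdep dependency on the workqueue itself (see
start_flush_work() in chain #1). schedule_on_each_cpu() flushes system_wq
work while holding cpu_hotplug_lock, and vmstat_shepherd(), itself a
system_wq work item, takes cpus_read_lock(), which closes the loop. A
reduced sketch of the two paths (the names below are hypothetical
stand-ins):

#include <linux/cpu.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct work_struct shepherd_like_work;   /* hypothetical */

/*
 * Path B (cf. vmstat_shepherd): a system_wq work item that takes the
 * CPU hotplug lock; running it records
 * (wq_completion)events -> (this work) -> cpu_hotplug_lock.
 */
static void shepherd_like_fn(struct work_struct *work)
{
        cpus_read_lock();
        cpus_read_unlock();
}

/*
 * Path A (cf. schedule_on_each_cpu): flush a system_wq work item under
 * the CPU hotplug lock.  With system_wq now owning a rescuer, the flush
 * records cpu_hotplug_lock -> (wq_completion)events, completing the
 * cycle lockdep complains about.
 */
static void flush_under_hotplug_lock(struct work_struct *w)
{
        cpus_read_lock();
        flush_work(w);
        cpus_read_unlock();
}

static int __init demo_init(void)
{
        INIT_WORK(&shepherd_like_work, shepherd_like_fn);
        queue_work(system_wq, &shepherd_like_work);     /* path B */
        flush_under_hotplug_lock(&shepherd_like_work);  /* path A */
        return 0;
}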
  

Patch

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6e578f576a6f..e3338e3be700 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6597,7 +6597,7 @@  void __init workqueue_init_early(void)
 		ordered_wq_attrs[i] = attrs;
 	}
 
-	system_wq = alloc_workqueue("events", 0, 0);
+	system_wq = alloc_workqueue("events", WQ_MEM_RECLAIM, 0);
 	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
 	system_long_wq = alloc_workqueue("events_long", 0, 0);
 	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,