kernfs: support kernfs notify in memory reclaim context

Message ID 20231114185947.42829-1-junxiao.bi@oracle.com
State New
Series kernfs: support kernfs notify in memory reclaim context

Commit Message

Junxiao Bi Nov. 14, 2023, 6:59 p.m. UTC
kernfs notify is used in the write path of md (md_write_start) to wake
up a userspace daemon, such as "mdmon", to update the md superblock of
an imsm raid; md writes wait for that update before being issued.
If such a write is issued as part of memory reclaim, the system may
hang because the kernfs notify cannot run: it is executed from
"system_wq", which has no rescuer thread, and a new kworker thread may
not be created under memory pressure. The userspace daemon is then
never woken up and the md write hangs.

According to Tejun, this cannot be fixed by adding WQ_MEM_RECLAIM to
"system_wq", because that workqueue is shared and someone else might
occupy its rescuer thread. Fixing it from the md side would require
replacing kernfs notify with some other way to communicate with the
userspace daemon, which would break the userspace interface. Instead,
use a separate workqueue for kernfs notify so that it can be used in
memory reclaim context.

Link: https://lore.kernel.org/all/a131af22-0a5b-4be1-b77e-8716c63e8883@oracle.com/T/
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
---
 fs/kernfs/file.c            | 2 +-
 fs/kernfs/kernfs-internal.h | 1 +
 fs/kernfs/mount.c           | 3 +++
 3 files changed, 5 insertions(+), 1 deletion(-)
  

Comments

Tejun Heo Nov. 14, 2023, 7:06 p.m. UTC | #1
Hello,

On Tue, Nov 14, 2023 at 10:59:47AM -0800, Junxiao Bi wrote:
> kernfs notify is used in the write path of md (md_write_start) to wake
> up a userspace daemon, such as "mdmon", to update the md superblock of
> an imsm raid; md writes wait for that update before being issued.

How is forward progress guaranteed for that userspace daemon? This sounds
like a really fragile setup.

> If such a write is issued as part of memory reclaim, the system may
> hang because the kernfs notify cannot run: it is executed from
> "system_wq", which has no rescuer thread, and a new kworker thread may
> not be created under memory pressure. The userspace daemon is then
> never woken up and the md write hangs.
> 
> According to Tejun, this cannot be fixed by adding WQ_MEM_RECLAIM to
> "system_wq", because that workqueue is shared and someone else might
> occupy its rescuer thread. Fixing it from the md side would require
> replacing kernfs notify with some other way to communicate with the
> userspace daemon, which would break the userspace interface. Instead,
> use a separate workqueue for kernfs notify so that it can be used in
> memory reclaim context.

I'm not necessarily against the change, but please go into a bit more detail
on how and why it's structured this way, and add a comment explaining who's
depending on kernfs notify for reclaim forward progress.

Thanks.
  
Junxiao Bi Nov. 14, 2023, 8:09 p.m. UTC | #2
On 11/14/23 11:06 AM, Tejun Heo wrote:

> Hello,
>
> On Tue, Nov 14, 2023 at 10:59:47AM -0800, Junxiao Bi wrote:
>> kernfs notify is used in the write path of md (md_write_start) to wake
>> up a userspace daemon, such as "mdmon", to update the md superblock of
>> an imsm raid; md writes wait for that update before being issued.
> How is forward progress guaranteed for that userspace daemon? This sounds
> like a really fragile setup.

For imsm raid, the userspace daemon "mdmon" is responsible for updating
the raid metadata; the kernel uses kernfs_notify to wake up the daemon
whenever a metadata update is required. If the daemon can't make forward
progress, writes may hang, but wouldn't that be a bug in the daemon?
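To illustrate the interface: kernfs_notify() wakes any process polling the
sysfs attribute with POLLERR | POLLPRI, so the daemon side is basically a
poll loop. A minimal userspace sketch (the device path is just an example,
this is not mdmon's actual code):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[64];
	ssize_t n;
	int fd = open("/sys/block/md127/md/array_state", O_RDONLY);

	if (fd < 0)
		return 1;

	/* An initial read arms the notification for this open file. */
	read(fd, buf, sizeof(buf));

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLPRI };

		/* kernfs_notify() wakes pollers with POLLERR | POLLPRI. */
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (!(pfd.revents & (POLLPRI | POLLERR)))
			continue;

		/* Re-read the attribute from the start to see the new state. */
		lseek(fd, 0, SEEK_SET);
		n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			printf("array_state changed: %s", buf);
		}
		/* ... a real daemon would update the metadata here ... */
	}

	close(fd);
	return 0;
}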

>
>> If such a write is issued as part of memory reclaim, the system may
>> hang because the kernfs notify cannot run: it is executed from
>> "system_wq", which has no rescuer thread, and a new kworker thread may
>> not be created under memory pressure. The userspace daemon is then
>> never woken up and the md write hangs.
>>
>> According to Tejun, this cannot be fixed by adding WQ_MEM_RECLAIM to
>> "system_wq", because that workqueue is shared and someone else might
>> occupy its rescuer thread. Fixing it from the md side would require
>> replacing kernfs notify with some other way to communicate with the
>> userspace daemon, which would break the userspace interface. Instead,
>> use a separate workqueue for kernfs notify so that it can be used in
>> memory reclaim context.
> I'm not necessarily against the change, but please go into a bit more detail
> on how and why it's structured this way, and add a comment explaining who's
> depending on kernfs notify for reclaim forward progress.

"kthreadd" was doing memory reclaim and stuck by md flush work, md flush 
work was stuck by md_write_start, where it was waiting 
"MD_SB_CHANGE_PENDING" flag to be cleared, before waiting, it invoked 
kernefs_notify to wake up userspace daemon which should update the 
meatadata and clear the flag.



PID: 2        TASK: ffff8df829539e40  CPU: 103  COMMAND: "kthreadd"
  #0 [ffffaf14800f3220] __schedule at ffffffff9488cbac
  #1 [ffffaf14800f32c0] schedule at ffffffff9488d1c6
  #2 [ffffaf14800f32d8] schedule_timeout at ffffffff948916e6
  #3 [ffffaf14800f3360] wait_for_completion at ffffffff9488ddeb
  #4 [ffffaf14800f33c8] flush_work at ffffffff940b5103
  #5 [ffffaf14800f3448] xlog_cil_force_lsn at ffffffffc0571791 [xfs]
  #6 [ffffaf14800f34e8] _xfs_log_force_lsn at ffffffffc056f79f [xfs]
  #7 [ffffaf14800f3570] xfs_log_force_lsn at ffffffffc056fa8c [xfs]
  #8 [ffffaf14800f35a8] __dta___xfs_iunpin_wait_3444 at ffffffffc05595c4 [xfs]
  #9 [ffffaf14800f3620] xfs_iunpin_wait at ffffffffc055c229 [xfs]
#10 [ffffaf14800f3630] __dta_xfs_reclaim_inode_3358 at ffffffffc054f8cc [xfs]
#11 [ffffaf14800f3680] xfs_reclaim_inodes_ag at ffffffffc054fd56 [xfs]
#12 [ffffaf14800f3818] xfs_reclaim_inodes_nr at ffffffffc0551013 [xfs]
#13 [ffffaf14800f3838] xfs_fs_free_cached_objects at ffffffffc0565469 [xfs]
#14 [ffffaf14800f3848] super_cache_scan at ffffffff942959a7
#15 [ffffaf14800f38a0] shrink_slab at ffffffff941fa935
#16 [ffffaf14800f3988] shrink_node at ffffffff942005d8
#17 [ffffaf14800f3a10] do_try_to_free_pages at ffffffff94200ae2
#18 [ffffaf14800f3a78] try_to_free_pages at ffffffff94200e89
#19 [ffffaf14800f3b00] __alloc_pages_slowpath at ffffffff941ed82c
#20 [ffffaf14800f3c20] __alloc_pages_nodemask at ffffffff941ee191
#21 [ffffaf14800f3c90] __vmalloc_node_range at ffffffff9423a8e7
#22 [ffffaf14800f3d00] copy_process at ffffffff94096670
#23 [ffffaf14800f3de8] _do_fork at ffffffff94097f30
#24 [ffffaf14800f3e68] kernel_thread at ffffffff94098219
#25 [ffffaf14800f3e78] kthreadd at ffffffff940bd4e5
#26 [ffffaf14800f3f50] ret_from_fork at ffffffff94a00354


PID: 852      TASK: ffff8e351fc51e40  CPU: 77   COMMAND: "md"
  #0 [ffffaf148e983c68] __schedule at ffffffff9488cbac
  #1 [ffffaf148e983d08] schedule at ffffffff9488d1c6
  #2 [ffffaf148e983d20] md_write_start at ffffffff9469fc75
  #3 [ffffaf148e983d80] raid1_make_request at ffffffffc038d8bd [raid1]
  #4 [ffffaf148e983da8] md_handle_request at ffffffff9469cc24
  #5 [ffffaf148e983e18] md_submit_flush_data at ffffffff9469cce1
  #6 [ffffaf148e983e38] process_one_work at ffffffff940b5bd9
  #7 [ffffaf148e983e80] rescuer_thread at ffffffff940b6334
  #8 [ffffaf148e983f08] kthread at ffffffff940bc245
  #9 [ffffaf148e983f50] ret_from_fork at ffffffff94a00354


bool md_write_start(struct mddev *mddev, struct bio *bi)
{
     ...

    >>> process 852 goes into the "if" branch and sets "MD_SB_CHANGE_PENDING"

     if (mddev->in_sync || mddev->sync_checkers) {
         spin_lock(&mddev->lock);
         if (mddev->in_sync) {
             mddev->in_sync = 0;
             set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
             set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
             md_wakeup_thread(mddev->thread);
             did_change = 1;
         }
         spin_unlock(&mddev->lock);
     }
     rcu_read_unlock();

     >>> invokes kernfs_notify to wake up the userspace daemon

     if (did_change)
         sysfs_notify_dirent_safe(mddev->sysfs_state);
     if (!mddev->has_superblocks)
         return true;

    >>> hangs here waiting for the userspace daemon to clear the flag

     wait_event(mddev->sb_wait,
            !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
            is_md_suspended(mddev));
     if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
         percpu_ref_put(&mddev->writes_pending);
         return false;
     }
     return true;
}
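For completeness, my understanding of the other half of the handshake, based
on reading array_state_store() in md.c and mdmon (treat the sysfs path and
the exact write as assumptions, this is only a sketch): after the daemon has
recorded the dirty state in the external metadata, it writes "active" back
to the array_state attribute, which clears MD_SB_CHANGE_PENDING and wakes
mddev->sb_wait so the wait_event() above can complete.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Acknowledge the metadata update by writing "active" to array_state
 * (example path; a real daemon derives it from the array it manages). */
static int md_mark_active(const char *array_state_path)
{
	ssize_t n;
	int fd = open(array_state_path, O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, "active", strlen("active"));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	return md_mark_active("/sys/block/md127/md/array_state");
}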

Thanks,

Junxiao.

>
> Thanks.
>
  
Tejun Heo Nov. 14, 2023, 8:24 p.m. UTC | #3
Hello,

On Tue, Nov 14, 2023 at 12:09:19PM -0800, junxiao.bi@oracle.com wrote:
> On 11/14/23 11:06 AM, Tejun Heo wrote:
> > On Tue, Nov 14, 2023 at 10:59:47AM -0800, Junxiao Bi wrote:
> > > kernfs notify is used in the write path of md (md_write_start) to wake
> > > up a userspace daemon, such as "mdmon", to update the md superblock of
> > > an imsm raid; md writes wait for that update before being issued.
> > How is forward progress guaranteed for that userspace daemon? This sounds
> > like a really fragile setup.
> 
> For imsm raid, the userspace daemon "mdmon" is responsible for updating
> the raid metadata; the kernel uses kernfs_notify to wake up the daemon
> whenever a metadata update is required. If the daemon can't make forward
> progress, writes may hang, but wouldn't that be a bug in the daemon?

I see. That sounds very fragile and I'm not quite sure that can ever be made
to work reliably. While memlocking everything needed and being really
judicious about which syscalls to make will probably get you pretty far,
there are things like task_work which gets scheduled and executed when a
task is about to return to userspace. Those things are allowed to grab e.g.
mutexes in the kernel and allocate memory and so on, and can deadlock under
memory pressure.

Even just looking at kernfs_notify, it down_reads()
root->kernfs_supers_rwsem which might already be write-locked by somebody
who's waiting on memory allocation.

The patch you're proposing removes one link in the dependency chain but
there are many on there. I'm not sure this is fixable. Nobody writes kernel
code thinking that userspace code can be on the memory reclaim dependency
chain.

Thanks.
  
Junxiao Bi Nov. 14, 2023, 11:53 p.m. UTC | #4
On 11/14/23 12:24 PM, Tejun Heo wrote:

> Hello,
>
> On Tue, Nov 14, 2023 at 12:09:19PM -0800, junxiao.bi@oracle.com wrote:
>> On 11/14/23 11:06 AM, Tejun Heo wrote:
>>> On Tue, Nov 14, 2023 at 10:59:47AM -0800, Junxiao Bi wrote:
>>>> kernfs notify is used in the write path of md (md_write_start) to wake
>>>> up a userspace daemon, such as "mdmon", to update the md superblock of
>>>> an imsm raid; md writes wait for that update before being issued.
>>> How is forward progress guaranteed for that userspace daemon? This sounds
>>> like a really fragile setup.
>> For imsm raid, the userspace daemon "mdmon" is responsible for updating
>> the raid metadata; the kernel uses kernfs_notify to wake up the daemon
>> whenever a metadata update is required. If the daemon can't make forward
>> progress, writes may hang, but wouldn't that be a bug in the daemon?
> I see. That sounds very fragile and I'm not quite sure that can ever be made
> to work reliably. While memlocking everything needed and being really
> judicious about which syscalls to make will probably get you pretty far,
> there are things like task_work which gets scheduled and executed when a
> task is about to return to userspace. Those things are allowed to grab e.g.
> mutexes in the kernel and allocate memory and so on, and can deadlock under
> memory pressure.
>
> Even just looking at kernfs_notify, it down_reads()
> root->kernfs_supers_rwsem which might already be write-locked by somebody
> who's waiting on memory allocation.
>
> The patch you're proposing removes one link in the dependency chain but
> there are many on there. I'm not sure this is fixable. Nobody writes kernel
> code thinking that userspace code can be on the memory reclaim dependency
> chain.

Understood, thanks. Sounds like depending on userspace on the memory
reclaim path is a really bad idea and the only option for fixing it is
to remove that dependency, but I am not sure that is possible without
breaking the consistency of the metadata.

Thanks,

Junxiao.

>
> Thanks.
>
  
Mariusz Tkaczyk Nov. 15, 2023, 3:30 p.m. UTC | #5
On Tue, 14 Nov 2023 15:53:47 -0800
junxiao.bi@oracle.com wrote:

> Understood, thanks. Sounds like depending on userspace on the memory
> reclaim path is a really bad idea and the only option for fixing it is
> to remove that dependency, but I am not sure that is possible without
> breaking the consistency of the metadata.
> 
> Thanks,
> 
> Junxiao.

Indeed, the external metadata management design is fragile. You care about
IMSM here (same as me), so ideally we should implement metadata management
in the kernel - I think IMSM deserves that after 10 years on the market.
There is no better option; the other options are just "workarounds" for the
lack of metadata management in the kernel.

Song, any comments here?

On the other hand, there is native raid, which should just work, so maybe
you can switch to native raid?

Thanks,
Mariusz
  
Junxiao Bi Nov. 16, 2023, 5:04 p.m. UTC | #6
On 11/15/23 7:30 AM, Mariusz Tkaczyk wrote:

> On Tue, 14 Nov 2023 15:53:47 -0800
> junxiao.bi@oracle.com wrote:
>
>> Understood, thanks. Sounds like depending on userspace on the memory
>> reclaim path is a really bad idea and the only option for fixing it is
>> to remove that dependency, but I am not sure that is possible without
>> breaking the consistency of the metadata.
>>
>> Thanks,
>>
>> Junxiao.
> Indeed, the external metadata management design is fragile. You care about
> IMSM here (same as me), so ideally we should implement metadata management
> in the kernel - I think IMSM deserves that after 10 years on the market.
> There is no better option; the other options are just "workarounds" for the
> lack of metadata management in the kernel.
Agree, sounds like that's the way to proceed.
>
> Song, any comments here?
>
> On the other hand, there is native raid, which should just work, so maybe
> you can switch to native raid?

Unfortunately that is not possible; it's a production setup using imsm
raid.

Thanks,

Junxiao.

>
> Thanks,
> Mariusz
  
Mariusz Tkaczyk Nov. 17, 2023, 8:36 a.m. UTC | #7
On Thu, 16 Nov 2023 09:04:34 -0800
junxiao.bi@oracle.com wrote:

> On 11/15/23 7:30 AM, Mariusz Tkaczyk wrote:
> 
> > On Tue, 14 Nov 2023 15:53:47 -0800
> > junxiao.bi@oracle.com wrote:
> >  
> >> Understood, thanks. Sounds like depending on userspace on the memory
> >> reclaim path is a really bad idea and the only option for fixing it is
> >> to remove that dependency, but I am not sure that is possible without
> >> breaking the consistency of the metadata.
> >>
> >> Thanks,
> >>
> >> Junxiao.  
> > Indeed, the external metadata management design is fragile. You care about
> > IMSM here (same as me), so ideally we should implement metadata management
> > in the kernel - I think IMSM deserves that after 10 years on the market.
> > There is no better option; the other options are just "workarounds" for
> > the lack of metadata management in the kernel.
> Agree, sounds like that's the way to proceed.
> >
> > Song, any comments here?
> >
> > On the other hand, there is native raid, which should just work, so
> > maybe you can switch to native raid?
> 
> Unfortunately that is not possible; it's a production setup using imsm
> raid.

Implementing IMSM in the kernel is a goal that will take months or years,
so I don't see it landing in the timeframe you would need to fix your
production setup. It will be a big feature, but let's wait for Song's
input first.

Thanks,
Mariusz
  

Patch

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index f0cb729e9a97..726bfd40a912 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -974,7 +974,7 @@  void kernfs_notify(struct kernfs_node *kn)
 		kernfs_get(kn);
 		kn->attr.notify_next = kernfs_notify_list;
 		kernfs_notify_list = kn;
-		schedule_work(&kernfs_notify_work);
+		queue_work(kernfs_wq, &kernfs_notify_work);
 	}
 	spin_unlock_irqrestore(&kernfs_notify_lock, flags);
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 237f2764b941..beae5d328342 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -123,6 +123,7 @@  static inline bool kernfs_dir_changed(struct kernfs_node *parent,
 
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
+extern struct workqueue_struct *kernfs_wq;
 
 /*
  * inode.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 4628edde2e7e..7346ec49a621 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -24,6 +24,7 @@ 
 struct kmem_cache *kernfs_node_cache __ro_after_init;
 struct kmem_cache *kernfs_iattrs_cache __ro_after_init;
 struct kernfs_global_locks *kernfs_locks __ro_after_init;
+struct workqueue_struct *kernfs_wq __ro_after_init;
 
 static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
 {
@@ -432,4 +433,6 @@  void __init kernfs_init(void)
 					      0, SLAB_PANIC, NULL);
 
 	kernfs_lock_init();
+
+	kernfs_wq = alloc_workqueue("kernfs", WQ_MEM_RECLAIM, 0);
 }