[1/2] fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem

Message ID 20221018041233.376977-2-stephen.s.brennan@oracle.com
State New
Series fsnotify: fix softlockups iterating over d_subdirs

Commit Message

Stephen Brennan Oct. 18, 2022, 4:12 a.m. UTC
  When an inode is interested in events on its children, it must set
DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
lazily allow __fsnotify_parent() to do this the next time we see an
event on a child.

However, if the list of children is very long (e.g., in the millions),
and lots of activity is occurring on the directory, then it's possible
for many CPUs to end up blocked on the inode spinlock in
__fsnotify_update_child_dentry_flags(). Each CPU will then redundantly iterate
over the very long list of children. This situation can cause soft
lockups.

To avoid this, stop lazily updating child flags in __fsnotify_parent().
Protect the child flag update with i_rwsem held exclusive, to ensure
that we only iterate over the child list when it's absolutely necessary,
and even then, only once.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:

It seems that there are two implementation options for this, regarding
what i_rwsem protects:

1. Both updates to i_fsnotify_mask, and the child dentry flags, or
2. Only updates to the child dentry flags

I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
don't want to hold the inode lock whenever we decrement the refcount,
but if we don't, then we're stuck holding a spinlock when the refcount
goes to zero, and we need to grab the inode rwsem to synchronize the
update to the child flags. I'm sure there's a way around this, but I
didn't keep going with it.

With #2, as currently implemented, we have the unfortunate effect that a
mark can be added, see that no flag update is required, and return, even
though the flag update is still in progress on a different CPU/thread.
From our discussion, that seems to be the status quo today, but I wanted
to point it out explicitly. If we want to move to #1, it should be
possible with some work.

 fs/notify/fsnotify.c | 12 ++++++++--
 fs/notify/mark.c     | 55 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 53 insertions(+), 14 deletions(-)
  

Comments

Amir Goldstein Oct. 18, 2022, 7:39 a.m. UTC | #1
On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> When an inode is interested in events on its children, it must set
> DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
> the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
> lazily allow __fsnotify_parent() to do this the next time we see an
> event on a child.
>
> However, if the list of children is very long (e.g., in the millions),
> and lots of activity is occurring on the directory, then it's possible
> for many CPUs to end up blocked on the inode spinlock in
> __fsnotify_update_child_flags(). Each CPU will then redundantly iterate
> over the very long list of children. This situation can cause soft
> lockups.
>
> To avoid this, stop lazily updating child flags in __fsnotify_parent().
> Protect the child flag update with i_rwsem held exclusive, to ensure
> that we only iterate over the child list when it's absolutely necessary,
> and even then, only once.
>
> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> ---
>
> Notes:
>
> It seems that there are two implementation options for this, regarding
> what i_rwsem protects:
>
> 1. Both updates to i_fsnotify_mask, and the child dentry flags, or
> 2. Only updates to the child dentry flags
>
> I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
> don't want to hold the inode lock whenever we decrement the refcount,
> but if we don't, then we're stuck holding a spinlock when the refcount
> goes to zero, and we need to grab the inode rwsem to synchronize the
> update to the child flags. I'm sure there's a way around this, but I
> didn't keep going with it.
>
> With #1, as currently implemented, we have the unfortunate effect of
> that a mark can be added, can see that no update is required, and
> return, despite the fact that the flag update is still in progress on a
> different CPU/thread. From our discussion, that seems to be the current
> status quo, but I wanted to explicitly point that out. If we want to
> move to #1, it should be possible with some work.

I think the solution may be to store the state of children in conn
like you suggested.

See fsnotify_update_iref() and conn flag
FSNOTIFY_CONN_FLAG_HAS_IREF.

You can add a conn flag
FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
that caches the result of the last invocation of update children flags.

For example, fsnotify_update_iref() becomes
fsnotify_update_inode_conn_flags() and returns the inode if either the
inode ref should be dropped or the children's flags need to be updated
(or both); maybe use some out argument to differentiate the cases.
Same for fsnotify_detach_connector_from_object().

Then, where fsnotify_drop_object() is called, for the case that the
inode's children need to be updated: take inode_lock(), then take the
connector spin lock to check whether another thread has already done the
update. If not, release the spin lock, perform the update under the
inode lock, and at the end take the spin lock again and set the
FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN connector flag.

Not sure if it all works out... maybe
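
Concretely, the double-checked dance described above might look
something like the sketch below. This is hypothetical code, not
existing kernel source: FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN, the
helper name, and the renamed __fsnotify_update_children_dentry_flags()
are all this suggestion's proposals.

```c
/*
 * Sketch only: bring child dentry flags up to date exactly once,
 * with the conn flag acting as a cached "children are marked" state.
 */
static void fsnotify_update_children_dentry_flags(
			struct fsnotify_mark_connector *conn,
			struct inode *inode)
{
	bool watched;

	inode_lock(inode);
	spin_lock(&conn->lock);
	watched = fsnotify_inode_watches_children(inode);
	if (watched == !!(conn->flags & FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN)) {
		/* Another thread already updated the child flags. */
		spin_unlock(&conn->lock);
		goto out;
	}
	spin_unlock(&conn->lock);

	/* The long walk over d_subdirs happens under i_rwsem only. */
	__fsnotify_update_children_dentry_flags(inode);

	spin_lock(&conn->lock);
	if (watched)
		conn->flags |= FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN;
	else
		conn->flags &= ~FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN;
	spin_unlock(&conn->lock);
out:
	inode_unlock(inode);
}
```

With this shape, racing threads serialize on i_rwsem and all but the
first see the conn flag already matching, so the child list is walked
once per state change.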

>
>  fs/notify/fsnotify.c | 12 ++++++++--
>  fs/notify/mark.c     | 55 ++++++++++++++++++++++++++++++++++----------
>  2 files changed, 53 insertions(+), 14 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 7974e91ffe13..e887a195983b 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>         parent = dget_parent(dentry);
>         p_inode = parent->d_inode;
>         p_mask = fsnotify_inode_watches_children(p_inode);
> -       if (unlikely(parent_watched && !p_mask))
> -               __fsnotify_update_child_dentry_flags(p_inode);
> +       if (unlikely(parent_watched && !p_mask)) {
> +               /*
> +                * Flag would be cleared soon by
> +                * __fsnotify_update_child_dentry_flags(), but as an
> +                * optimization, clear it now.
> +                */

I think we also need to take the p_inode spin_lock here and check
fsnotify_inode_watches_children() under the lock; otherwise, we could
be clearing the WATCHED flag *after*
__fsnotify_update_child_dentry_flags() has already set it, because we
may not observe the change to the p_inode mask.

I would consider renaming __fsnotify_update_child_dentry_flags()
to __fsnotify_update_children_dentry_flags(struct inode *dir)

and creating another inline helper for this call site called:
fsnotify_update_child_dentry_flags(struct inode *dir, struct dentry *child)
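
The suggested per-child helper might be sketched as below
(hypothetical, following the naming proposed here; it rechecks the
parent's interest under the child's d_lock before clearing the flag,
so it cannot race with the full update setting it):

```c
/*
 * Sketch, not existing kernel code: lazily clear the WATCHED flag on
 * a single child, only if the parent is confirmed uninterested while
 * we hold the child's d_lock.
 */
static inline void fsnotify_update_child_dentry_flags(struct inode *dir,
						      struct dentry *child)
{
	spin_lock(&child->d_lock);
	if (!fsnotify_inode_watches_children(dir))
		child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
	spin_unlock(&child->d_lock);
}
```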


> +               spin_lock(&dentry->d_lock);
> +               dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> +               spin_unlock(&dentry->d_lock);
> +       }
>
>         /*
>          * Include parent/name in notification either if some notification
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index c74ef947447d..da9f944fcbbb 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>   */
>  void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>  {
> +       struct inode *inode = NULL;
> +       int watched_before, watched_after;
> +
>         if (!conn)
>                 return;
>
> -       spin_lock(&conn->lock);
> -       __fsnotify_recalc_mask(conn);
> -       spin_unlock(&conn->lock);
> -       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
> -               __fsnotify_update_child_dentry_flags(
> -                                       fsnotify_conn_inode(conn));
> +       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> +               /*
> +                * For inodes, we may need to update flags on the child
> +                * dentries. To ensure these updates occur exactly once,
> +                * synchronize the recalculation with the inode mutex.
> +                */
> +               inode = fsnotify_conn_inode(conn);
> +               spin_lock(&conn->lock);
> +               watched_before = fsnotify_inode_watches_children(inode);
> +               __fsnotify_recalc_mask(conn);
> +               watched_after = fsnotify_inode_watches_children(inode);
> +               spin_unlock(&conn->lock);
> +
> +               inode_lock(inode);

With the pattern that I suggested above, this if / else would
be unified to code that looks something like this:

spin_lock(&conn->lock);
inode =  __fsnotify_recalc_mask(conn);
spin_unlock(&conn->lock);

if (inode)
    fsnotify_update_children_dentry_flags(conn, inode);

Where fsnotify_update_children_dentry_flags()
takes inode lock around entire update and conn spin lock
only around check and update of conn flags.

FYI, at this time in the code, adding a mark or updating an existing
mark's mask cannot result in the need to drop the iref. That is the
reason the return value of __fsnotify_recalc_mask() is not checked
here.

> +               if ((watched_before && !watched_after) ||
> +                   (!watched_before && watched_after)) {
> +                       __fsnotify_update_child_dentry_flags(inode);
> +               }
> +               inode_unlock(inode);
> +       } else {
> +               spin_lock(&conn->lock);
> +               __fsnotify_recalc_mask(conn);
> +               spin_unlock(&conn->lock);
> +       }
>  }
>
>  /* Free all connectors queued for freeing once SRCU period ends */
> @@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
>         struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
>         void *objp = NULL;
>         unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
> +       struct inode *inode = NULL;
> +       int watched_before, watched_after;
>         bool free_conn = false;
>
>         /* Catch marks that were actually never attached to object */
> @@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
>         if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
>                 return;
>
> +       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> +               inode = fsnotify_conn_inode(conn);
> +               watched_before = fsnotify_inode_watches_children(inode);
> +       }
> +
>         hlist_del_init_rcu(&mark->obj_list);
>         if (hlist_empty(&conn->list)) {
>                 objp = fsnotify_detach_connector_from_object(conn, &type);
>                 free_conn = true;
> +               watched_after = 0;
>         } else {
>                 objp = __fsnotify_recalc_mask(conn);
>                 type = conn->type;
> +               watched_after = fsnotify_inode_watches_children(inode);
>         }
>         WRITE_ONCE(mark->connector, NULL);
>         spin_unlock(&conn->lock);
>
> +       if (inode) {
> +               inode_lock(inode);
> +               if (watched_before && !watched_after)
> +                       __fsnotify_update_child_dentry_flags(inode);
> +               inode_unlock(inode);
> +       }
> +
>         fsnotify_drop_object(type, objp);
>

Here as well, something like:
if (objp)
    fsnotify_update_children_dentry_flags(conn, objp);

But we need to distinguish whether the inode ref needs to be dropped,
the children's flags need updating, or both.

Hope that this suggestion direction turns out to be useful and not
a complete waste of time...

Thanks,
Amir.
  
Stephen Brennan Oct. 21, 2022, 12:33 a.m. UTC | #2
Amir Goldstein <amir73il@gmail.com> writes:

> On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> <stephen.s.brennan@oracle.com> wrote:
>>
>> When an inode is interested in events on its children, it must set
>> DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
>> the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
>> lazily allow __fsnotify_parent() to do this the next time we see an
>> event on a child.
>>
>> However, if the list of children is very long (e.g., in the millions),
>> and lots of activity is occurring on the directory, then it's possible
>> for many CPUs to end up blocked on the inode spinlock in
>> __fsnotify_update_child_flags(). Each CPU will then redundantly iterate
>> over the very long list of children. This situation can cause soft
>> lockups.
>>
>> To avoid this, stop lazily updating child flags in __fsnotify_parent().
>> Protect the child flag update with i_rwsem held exclusive, to ensure
>> that we only iterate over the child list when it's absolutely necessary,
>> and even then, only once.
>>
>> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
>> ---
>>
>> Notes:
>>
>> It seems that there are two implementation options for this, regarding
>> what i_rwsem protects:
>>
>> 1. Both updates to i_fsnotify_mask, and the child dentry flags, or
>> 2. Only updates to the child dentry flags
>>
>> I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
>> don't want to hold the inode lock whenever we decrement the refcount,
>> but if we don't, then we're stuck holding a spinlock when the refcount
>> goes to zero, and we need to grab the inode rwsem to synchronize the
>> update to the child flags. I'm sure there's a way around this, but I
>> didn't keep going with it.
>>
>> With #1, as currently implemented, we have the unfortunate effect of
>> that a mark can be added, can see that no update is required, and
>> return, despite the fact that the flag update is still in progress on a
>> different CPU/thread. From our discussion, that seems to be the current
>> status quo, but I wanted to explicitly point that out. If we want to
>> move to #1, it should be possible with some work.
>
> I think the solution may be to store the state of children in conn
> like you suggested.
>
> See fsnotify_update_iref() and conn flag
> FSNOTIFY_CONN_FLAG_HAS_IREF.
>
> You can add a conn flag
> FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> that caches the result of the last invocation of update children flags.
>
> For example, fsnotify_update_iref() becomes
> fsnotify_update_inode_conn_flags() and
> returns inode if either inode ref should be dropped
> or if children flags need to be updated (or both)
> maybe use some out argument to differentiate the cases.
> Same for fsnotify_detach_connector_from_object().
>
> Then, where fsnotify_drop_object() is called, for the
> case that inode children need to be updated,
> take inode_lock(), take connector spin lock
> to check if another thread has already done the update
> if not release spin lock, perform the update under inode lock
> and at the end, take spin lock again and set the
> FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> connector flag.
>
> Not sure if it all works out... maybe

I did this for v2 and I think it has worked well, all threads seem to
block until the flags are updated on all dentries.

>>
>>  fs/notify/fsnotify.c | 12 ++++++++--
>>  fs/notify/mark.c     | 55 ++++++++++++++++++++++++++++++++++----------
>>  2 files changed, 53 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> index 7974e91ffe13..e887a195983b 100644
>> --- a/fs/notify/fsnotify.c
>> +++ b/fs/notify/fsnotify.c
>> @@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>>         parent = dget_parent(dentry);
>>         p_inode = parent->d_inode;
>>         p_mask = fsnotify_inode_watches_children(p_inode);
>> -       if (unlikely(parent_watched && !p_mask))
>> -               __fsnotify_update_child_dentry_flags(p_inode);
>> +       if (unlikely(parent_watched && !p_mask)) {
>> +               /*
>> +                * Flag would be cleared soon by
>> +                * __fsnotify_update_child_dentry_flags(), but as an
>> +                * optimization, clear it now.
>> +                */
>
> I think that we need to also take p_inode spin_lock here and
> check  fsnotify_inode_watches_children() under lock
> otherwise, we could be clearing the WATCHED flag
> *after* __fsnotify_update_child_dentry_flags() had
> already set it, because we may not observe the change to
> p_inode mask.

I'm not sure I follow. The i_fsnotify_mask field isn't protected by the
p_inode spinlock. It isn't really protected at all, though it mainly
gets modified with the conn->lock held.

Wouldn't it be sufficient to do in a new helper:

spin_lock(&dentry->d_lock);
if (!fsnotify_inode_watches_children(p_inode))
    dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
spin_unlock(&dentry->d_lock);

I'm sure I'm missing something about your comment. For the moment I left
it as is in the second version of the patch, we can discuss it more and
I can update it for a v3.

>
> I would consider renaming __fsnotify_update_child_dentry_flags()
> to __fsnotify_update_children_dentry_flags(struct inode *dir)
>
> and creating another inline helper for this call site called:
> fsnotify_update_child_dentry_flags(struct inode *dir, struct dentry *child)
>
>
>> +               spin_lock(&dentry->d_lock);
>> +               dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
>> +               spin_unlock(&dentry->d_lock);
>> +       }
>>
>>         /*
>>          * Include parent/name in notification either if some notification
>> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
>> index c74ef947447d..da9f944fcbbb 100644
>> --- a/fs/notify/mark.c
>> +++ b/fs/notify/mark.c
>> @@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>>   */
>>  void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>>  {
>> +       struct inode *inode = NULL;
>> +       int watched_before, watched_after;
>> +
>>         if (!conn)
>>                 return;
>>
>> -       spin_lock(&conn->lock);
>> -       __fsnotify_recalc_mask(conn);
>> -       spin_unlock(&conn->lock);
>> -       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
>> -               __fsnotify_update_child_dentry_flags(
>> -                                       fsnotify_conn_inode(conn));
>> +       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
>> +               /*
>> +                * For inodes, we may need to update flags on the child
>> +                * dentries. To ensure these updates occur exactly once,
>> +                * synchronize the recalculation with the inode mutex.
>> +                */
>> +               inode = fsnotify_conn_inode(conn);
>> +               spin_lock(&conn->lock);
>> +               watched_before = fsnotify_inode_watches_children(inode);
>> +               __fsnotify_recalc_mask(conn);
>> +               watched_after = fsnotify_inode_watches_children(inode);
>> +               spin_unlock(&conn->lock);
>> +
>> +               inode_lock(inode);
>
> With the pattern that I suggested above, this if / else would
> be unified to code that looks something like this:
>
> spin_lock(&conn->lock);
> inode =  __fsnotify_recalc_mask(conn);
> spin_unlock(&conn->lock);
>
> if (inode)
>     fsnotify_update_children_dentry_flags(conn, inode);
>
> Where fsnotify_update_children_dentry_flags()
> takes inode lock around entire update and conn spin lock
> only around check and update of conn flags.
>
> FYI, at this time in the code, adding  a mark or updating
> existing mark mask cannot result in the need to drop iref.
> That is the reason that return value of __fsnotify_recalc_mask()
> is not checked here.

For v3 I tried this with a new "flags" out variable and two flags - one
for requiring an iput(), and one for calling
fsnotify_update_children_dentry_flags(). As a result, I did stick a
WARN_ON_ONCE here, but it more or less looks just like this code :)

>> +               if ((watched_before && !watched_after) ||
>> +                   (!watched_before && watched_after)) {
>> +                       __fsnotify_update_child_dentry_flags(inode);
>> +               }
>> +               inode_unlock(inode);
>> +       } else {
>> +               spin_lock(&conn->lock);
>> +               __fsnotify_recalc_mask(conn);
>> +               spin_unlock(&conn->lock);
>> +       }
>>  }
>>
>>  /* Free all connectors queued for freeing once SRCU period ends */
>> @@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
>>         struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
>>         void *objp = NULL;
>>         unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
>> +       struct inode *inode = NULL;
>> +       int watched_before, watched_after;
>>         bool free_conn = false;
>>
>>         /* Catch marks that were actually never attached to object */
>> @@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
>>         if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
>>                 return;
>>
>> +       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
>> +               inode = fsnotify_conn_inode(conn);
>> +               watched_before = fsnotify_inode_watches_children(inode);
>> +       }
>> +
>>         hlist_del_init_rcu(&mark->obj_list);
>>         if (hlist_empty(&conn->list)) {
>>                 objp = fsnotify_detach_connector_from_object(conn, &type);
>>                 free_conn = true;
>> +               watched_after = 0;
>>         } else {
>>                 objp = __fsnotify_recalc_mask(conn);
>>                 type = conn->type;
>> +               watched_after = fsnotify_inode_watches_children(inode);
>>         }
>>         WRITE_ONCE(mark->connector, NULL);
>>         spin_unlock(&conn->lock);
>>
>> +       if (inode) {
>> +               inode_lock(inode);
>> +               if (watched_before && !watched_after)
>> +                       __fsnotify_update_child_dentry_flags(inode);
>> +               inode_unlock(inode);
>> +       }
>> +
>>         fsnotify_drop_object(type, objp);
>>
>
> Here as well something like:
> if (objp)
>     fsnotify_update_children_dentry_flags(conn, obj);
>
> But need to distinguish when inode ref needs to be dropped
> children flags updates or both.

With a flags out-param, it works well. I actually was able to stuff this
into fsnotify_drop_object, which was good because I had missed a whole
other function that can detach a connector from an inode
(fsnotify_destroy_marks()).

> Hope that this suggestion direction turns out to be useful and not
> a complete waste of time...
>
> Thanks,
> Amir.
  
Amir Goldstein Oct. 21, 2022, 7:22 a.m. UTC | #3

On Fri, Oct 21, 2022 at 3:33 AM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
>
> > On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> > <stephen.s.brennan@oracle.com> wrote:
> >>
> >> When an inode is interested in events on its children, it must set
> >> DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
> >> the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
> >> lazily allow __fsnotify_parent() to do this the next time we see an
> >> event on a child.
> >>
> >> However, if the list of children is very long (e.g., in the millions),
> >> and lots of activity is occurring on the directory, then it's possible
> >> for many CPUs to end up blocked on the inode spinlock in
> >> __fsnotify_update_child_flags(). Each CPU will then redundantly iterate
> >> over the very long list of children. This situation can cause soft
> >> lockups.
> >>
> >> To avoid this, stop lazily updating child flags in __fsnotify_parent().
> >> Protect the child flag update with i_rwsem held exclusive, to ensure
> >> that we only iterate over the child list when it's absolutely necessary,
> >> and even then, only once.
> >>
> >> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> >> ---
> >>
> >> Notes:
> >>
> >> It seems that there are two implementation options for this, regarding
> >> what i_rwsem protects:
> >>
> >> 1. Both updates to i_fsnotify_mask, and the child dentry flags, or
> >> 2. Only updates to the child dentry flags
> >>
> >> I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
> >> don't want to hold the inode lock whenever we decrement the refcount,
> >> but if we don't, then we're stuck holding a spinlock when the refcount
> >> goes to zero, and we need to grab the inode rwsem to synchronize the
> >> update to the child flags. I'm sure there's a way around this, but I
> >> didn't keep going with it.
> >>
> >> With #1, as currently implemented, we have the unfortunate effect of
> >> that a mark can be added, can see that no update is required, and
> >> return, despite the fact that the flag update is still in progress on a
> >> different CPU/thread. From our discussion, that seems to be the current
> >> status quo, but I wanted to explicitly point that out. If we want to
> >> move to #1, it should be possible with some work.
> >
> > I think the solution may be to store the state of children in conn
> > like you suggested.
> >
> > See fsnotify_update_iref() and conn flag
> > FSNOTIFY_CONN_FLAG_HAS_IREF.
> >
> > You can add a conn flag
> > FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> > that caches the result of the last invocation of update children flags.
> >
> > For example, fsnotify_update_iref() becomes
> > fsnotify_update_inode_conn_flags() and
> > returns inode if either inode ref should be dropped
> > or if children flags need to be updated (or both)
> > maybe use some out argument to differentiate the cases.
> > Same for fsnotify_detach_connector_from_object().
> >
> > Then, where fsnotify_drop_object() is called, for the
> > case that inode children need to be updated,
> > take inode_lock(), take connector spin lock
> > to check if another thread has already done the update
> > if not release spin lock, perform the update under inode lock
> > and at the end, take spin lock again and set the
> > FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> > connector flag.
> >
> > Not sure if it all works out... maybe
>
> I did this for v2 and I think it has worked well, all threads seem to
> block until the flags are updated on all dentries.
>

It did work out pretty well. v2 looks nice :)

> >>
> >>  fs/notify/fsnotify.c | 12 ++++++++--
> >>  fs/notify/mark.c     | 55 ++++++++++++++++++++++++++++++++++----------
> >>  2 files changed, 53 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> >> index 7974e91ffe13..e887a195983b 100644
> >> --- a/fs/notify/fsnotify.c
> >> +++ b/fs/notify/fsnotify.c
> >> @@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> >>         parent = dget_parent(dentry);
> >>         p_inode = parent->d_inode;
> >>         p_mask = fsnotify_inode_watches_children(p_inode);
> >> -       if (unlikely(parent_watched && !p_mask))
> >> -               __fsnotify_update_child_dentry_flags(p_inode);
> >> +       if (unlikely(parent_watched && !p_mask)) {
> >> +               /*
> >> +                * Flag would be cleared soon by
> >> +                * __fsnotify_update_child_dentry_flags(), but as an
> >> +                * optimization, clear it now.
> >> +                */
> >
> > I think that we need to also take p_inode spin_lock here and
> > check  fsnotify_inode_watches_children() under lock
> > otherwise, we could be clearing the WATCHED flag
> > *after* __fsnotify_update_child_dentry_flags() had
> > already set it, because we may not observe the change to
> > p_inode mask.
>
> I'm not sure I follow. The i_fsnotify_mask field isn't protected by the
> p_inode spinlock. It isn't really protected at all, though it mainly
> gets modified with the conn->lock held.
>

Right.

> Wouldn't it be sufficient to do in a new helper:
>
> spin_lock(&dentry->d_lock);
> if (!fsnotify_inode_watches_children(p_inode))
>     dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> spin_unlock(&dentry->d_lock);
>

Yes, I think this is better.

I am not sure if that data race can lead to the wrong state of dentry
flags forever on upstream, but I'm pretty sure that it can on v1.

> I'm sure I'm missing something about your comment. For the moment I left
> it as is in the second version of the patch, we can discuss it more and
> I can update it for a v3.
>

v1 doesn't have a memory barrier between loading the value into
parent_watched and into p_mask, so the check wasn't safe.

In v2, __fsnotify_update_child_dentry_flags() rechecks
fsnotify_inode_watches_children() after the barrier provided
by the dentry spin lock, so the "lockless" check is safe:

Writer:
store i_fsnotify_mask
smp_store_release(d_lock) // spin unlock
store d_flags

Reader:
load d_flags
smp_load_acquire(d_lock) // spin_lock
load i_fsnotify_mask

Best resource I like for refreshing my "lockless" skills:
https://lwn.net/Articles/844224/

> >
> > I would consider renaming __fsnotify_update_child_dentry_flags()
> > to __fsnotify_update_children_dentry_flags(struct inode *dir)
> >
> > and creating another inline helper for this call site called:
> > fsnotify_update_child_dentry_flags(struct inode *dir, struct dentry *child)
> >
> >
> >> +               spin_lock(&dentry->d_lock);
> >> +               dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> >> +               spin_unlock(&dentry->d_lock);
> >> +       }
> >>
> >>         /*
> >>          * Include parent/name in notification either if some notification
> >> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> >> index c74ef947447d..da9f944fcbbb 100644
> >> --- a/fs/notify/mark.c
> >> +++ b/fs/notify/mark.c
> >> @@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> >>   */
> >>  void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> >>  {
> >> +       struct inode *inode = NULL;
> >> +       int watched_before, watched_after;
> >> +
> >>         if (!conn)
> >>                 return;
> >>
> >> -       spin_lock(&conn->lock);
> >> -       __fsnotify_recalc_mask(conn);
> >> -       spin_unlock(&conn->lock);
> >> -       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
> >> -               __fsnotify_update_child_dentry_flags(
> >> -                                       fsnotify_conn_inode(conn));
> >> +       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> >> +               /*
> >> +                * For inodes, we may need to update flags on the child
> >> +                * dentries. To ensure these updates occur exactly once,
> >> +                * synchronize the recalculation with the inode mutex.
> >> +                */
> >> +               inode = fsnotify_conn_inode(conn);
> >> +               spin_lock(&conn->lock);
> >> +               watched_before = fsnotify_inode_watches_children(inode);
> >> +               __fsnotify_recalc_mask(conn);
> >> +               watched_after = fsnotify_inode_watches_children(inode);
> >> +               spin_unlock(&conn->lock);
> >> +
> >> +               inode_lock(inode);
> >
> > With the pattern that I suggested above, this if / else would
> > be unified to code that looks something like this:
> >
> > spin_lock(&conn->lock);
> > inode =  __fsnotify_recalc_mask(conn);
> > spin_unlock(&conn->lock);
> >
> > if (inode)
> >     fsnotify_update_children_dentry_flags(conn, inode);
> >
> > Where fsnotify_update_children_dentry_flags()
> > takes inode lock around entire update and conn spin lock
> > only around check and update of conn flags.
> >
> > FYI, at this time in the code, adding  a mark or updating
> > existing mark mask cannot result in the need to drop iref.
> > That is the reason that return value of __fsnotify_recalc_mask()
> > is not checked here.
>
> For v3 I tried this with a new "flags" out-parameter and two flags: one
> for requiring an iput(), and one for calling
> fsnotify_update_children_dentry_flags(). As a result, I did stick a
> WARN_ON_ONCE here, but it more or less looks just like this code :)
>

Cool.

> >> +               if ((watched_before && !watched_after) ||
> >> +                   (!watched_before && watched_after)) {
> >> +                       __fsnotify_update_child_dentry_flags(inode);
> >> +               }
> >> +               inode_unlock(inode);
> >> +       } else {
> >> +               spin_lock(&conn->lock);
> >> +               __fsnotify_recalc_mask(conn);
> >> +               spin_unlock(&conn->lock);
> >> +       }
> >>  }
> >>
> >>  /* Free all connectors queued for freeing once SRCU period ends */
> >> @@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
> >>         struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
> >>         void *objp = NULL;
> >>         unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
> >> +       struct inode *inode = NULL;
> >> +       int watched_before, watched_after;
> >>         bool free_conn = false;
> >>
> >>         /* Catch marks that were actually never attached to object */
> >> @@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
> >>         if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
> >>                 return;
> >>
> >> +       if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> >> +               inode = fsnotify_conn_inode(conn);
> >> +               watched_before = fsnotify_inode_watches_children(inode);
> >> +       }
> >> +
> >>         hlist_del_init_rcu(&mark->obj_list);
> >>         if (hlist_empty(&conn->list)) {
> >>                 objp = fsnotify_detach_connector_from_object(conn, &type);
> >>                 free_conn = true;
> >> +               watched_after = 0;
> >>         } else {
> >>                 objp = __fsnotify_recalc_mask(conn);
> >>                 type = conn->type;
> >> +               watched_after = fsnotify_inode_watches_children(inode);
> >>         }
> >>         WRITE_ONCE(mark->connector, NULL);
> >>         spin_unlock(&conn->lock);
> >>
> >> +       if (inode) {
> >> +               inode_lock(inode);
> >> +               if (watched_before && !watched_after)
> >> +                       __fsnotify_update_child_dentry_flags(inode);
> >> +               inode_unlock(inode);
> >> +       }
> >> +
> >>         fsnotify_drop_object(type, objp);
> >>
> >
> > Here as well something like:
> > if (objp)
> >     fsnotify_update_children_dentry_flags(conn, objp);
> >
> > But we need to distinguish whether the inode ref needs to be
> > dropped, the children's flags updated, or both.
>
> With a flags out-param, it works well. I was actually able to stuff this
> into fsnotify_drop_object(), which was good because I had missed a whole
> other function that can detach a connector from an inode
> (fsnotify_destroy_marks()).
>

Haha, this call is coming from __fsnotify_inode_delete();
the evicted dir inode cannot have any children in the dcache,
but it's good to have robust code ;)

Thanks,
Amir.
  

Patch

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..e887a195983b 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -207,8 +207,16 @@  int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 	parent = dget_parent(dentry);
 	p_inode = parent->d_inode;
 	p_mask = fsnotify_inode_watches_children(p_inode);
-	if (unlikely(parent_watched && !p_mask))
-		__fsnotify_update_child_dentry_flags(p_inode);
+	if (unlikely(parent_watched && !p_mask)) {
+		/*
+		 * Flag would be cleared soon by
+		 * __fsnotify_update_child_dentry_flags(), but as an
+		 * optimization, clear it now.
+		 */
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+		spin_unlock(&dentry->d_lock);
+	}
 
 	/*
 	 * Include parent/name in notification either if some notification
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index c74ef947447d..da9f944fcbbb 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -184,15 +184,36 @@  static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
  */
 void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 {
+	struct inode *inode = NULL;
+	int watched_before, watched_after;
+
 	if (!conn)
 		return;
 
-	spin_lock(&conn->lock);
-	__fsnotify_recalc_mask(conn);
-	spin_unlock(&conn->lock);
-	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
-		__fsnotify_update_child_dentry_flags(
-					fsnotify_conn_inode(conn));
+	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
+		/*
+		 * For inodes, we may need to update flags on the child
+		 * dentries. To ensure these updates occur exactly once,
+		 * synchronize the recalculation with the inode mutex.
+		 */
+		inode = fsnotify_conn_inode(conn);
+		spin_lock(&conn->lock);
+		watched_before = fsnotify_inode_watches_children(inode);
+		__fsnotify_recalc_mask(conn);
+		watched_after = fsnotify_inode_watches_children(inode);
+		spin_unlock(&conn->lock);
+
+		inode_lock(inode);
+		if ((watched_before && !watched_after) ||
+		    (!watched_before && watched_after)) {
+			__fsnotify_update_child_dentry_flags(inode);
+		}
+		inode_unlock(inode);
+	} else {
+		spin_lock(&conn->lock);
+		__fsnotify_recalc_mask(conn);
+		spin_unlock(&conn->lock);
+	}
 }
 
 /* Free all connectors queued for freeing once SRCU period ends */
@@ -295,6 +316,8 @@  void fsnotify_put_mark(struct fsnotify_mark *mark)
 	struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
 	void *objp = NULL;
 	unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
+	struct inode *inode = NULL;
+	int watched_before, watched_after;
 	bool free_conn = false;
 
 	/* Catch marks that were actually never attached to object */
@@ -311,17 +334,31 @@  void fsnotify_put_mark(struct fsnotify_mark *mark)
 	if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
 		return;
 
+	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
+		inode = fsnotify_conn_inode(conn);
+		watched_before = fsnotify_inode_watches_children(inode);
+	}
+
 	hlist_del_init_rcu(&mark->obj_list);
 	if (hlist_empty(&conn->list)) {
 		objp = fsnotify_detach_connector_from_object(conn, &type);
 		free_conn = true;
+		watched_after = 0;
 	} else {
 		objp = __fsnotify_recalc_mask(conn);
 		type = conn->type;
+		watched_after = fsnotify_inode_watches_children(inode);
 	}
 	WRITE_ONCE(mark->connector, NULL);
 	spin_unlock(&conn->lock);
 
+	if (inode) {
+		inode_lock(inode);
+		if (watched_before && !watched_after)
+			__fsnotify_update_child_dentry_flags(inode);
+		inode_unlock(inode);
+	}
+
 	fsnotify_drop_object(type, objp);
 
 	if (free_conn) {
@@ -331,12 +368,6 @@  void fsnotify_put_mark(struct fsnotify_mark *mark)
 		spin_unlock(&destroy_lock);
 		queue_work(system_unbound_wq, &connector_reaper_work);
 	}
-	/*
-	 * Note that we didn't update flags telling whether inode cares about
-	 * what's happening with children. We update these flags from
-	 * __fsnotify_parent() lazily when next event happens on one of our
-	 * children.
-	 */
 	spin_lock(&destroy_lock);
 	list_add(&mark->g_list, &destroy_list);
 	spin_unlock(&destroy_lock);