[v3,0/3] fsnotify: fix softlockups iterating over d_subdirs

Message ID 20221028001016.332663-1-stephen.s.brennan@oracle.com
Headers
Series fsnotify: fix softlockups iterating over d_subdirs |

Message

Stephen Brennan Oct. 28, 2022, 12:10 a.m. UTC
  Hi all,

Here is v3 of the patch series. I've taken all of the feedback,
thanks Amir, Christian, Hilf, et al. Differences are noted in each
patch.

I caught an obvious and silly dentry reference leak: d_find_any_alias()
returns a reference, which I never called dput() on. With that change, I
no longer see the rpc_pipefs issue, but I do think I need more testing
and thinking through the third patch. Al, I'd love your feedback on that
one especially.

Thanks,
Stephen

Stephen Brennan (3):
  fsnotify: Use d_find_any_alias to get dentry associated with inode
  fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
  fsnotify: allow sleepable child flag update

 fs/notify/fsnotify.c             | 115 +++++++++++++++++++++-------
 fs/notify/fsnotify.h             |  13 +++-
 fs/notify/mark.c                 | 124 ++++++++++++++++++++-----------
 include/linux/fsnotify_backend.h |   1 +
 4 files changed, 185 insertions(+), 68 deletions(-)
  

Comments

Jan Kara Nov. 1, 2022, 5:51 p.m. UTC | #1
Hi Stephen!

On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> Here is v3 of the patch series. I've taken all of the feedback,
> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> patch.
> 
> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> returns a reference, which I never called dput() on. With that change, I
> no longer see the rpc_pipefs issue, but I do think I need more testing
> and thinking through the third patch. Al, I'd love your feedback on that
> one especially.
> 
> Thanks,
> Stephen
> 
> Stephen Brennan (3):
>   fsnotify: Use d_find_any_alias to get dentry associated with inode
>   fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
>   fsnotify: allow sleepable child flag update

Thanks for the patches Stephen and I'm sorry for replying somewhat late.
The first patch is a nobrainer. The other two patches ... complicate things
somewhat more complicated than I'd like. I guess I can live with them if we
don't find a better solution but I'd like to discuss a bit more about
alternatives.

So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
__fsnotify_parent() for the dentry which triggered the event and does not
have watched parent anymore and never bother with full children walk? I
suppose your contention problems will be gone, we'll just pay the price of
dget_parent() + fsnotify_inode_watches_children() for each child that
falsely triggers instead of for only one. Maybe that's not too bad? After
all any event upto this moment triggered this overhead as well...

Am I missing something? AFAIU this would allow us to avoid the games with
the new connector flag etc... We would probably still need to avoid
softlockups when setting the flag DCACHE_FSNOTIFY_PARENT_WATCHED but that
should be much simpler (we could use i_rwsem trick like you do).

								Honza
  
Stephen Brennan Nov. 1, 2022, 8:48 p.m. UTC | #2
Hi Jan,

Jan Kara <jack@suse.cz> writes:
> Hi Stephen!
>
> On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
>> Here is v3 of the patch series. I've taken all of the feedback,
>> thanks Amir, Christian, Hilf, et al. Differences are noted in each
>> patch.
>> 
>> I caught an obvious and silly dentry reference leak: d_find_any_alias()
>> returns a reference, which I never called dput() on. With that change, I
>> no longer see the rpc_pipefs issue, but I do think I need more testing
>> and thinking through the third patch. Al, I'd love your feedback on that
>> one especially.
>> 
>> Thanks,
>> Stephen
>> 
>> Stephen Brennan (3):
>>   fsnotify: Use d_find_any_alias to get dentry associated with inode
>>   fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
>>   fsnotify: allow sleepable child flag update
>
> Thanks for the patches Stephen and I'm sorry for replying somewhat late.

Absolutely no worries, these things take time. Thanks for taking a look!

> The first patch is a nobrainer. The other two patches ... complicate things
> somewhat more complicated than I'd like. I guess I can live with them if we
> don't find a better solution but I'd like to discuss a bit more about
> alternatives.

Understood!

> So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> __fsnotify_parent() for the dentry which triggered the event and does not
> have watched parent anymore and never bother with full children walk? I
> suppose your contention problems will be gone, we'll just pay the price of
> dget_parent() + fsnotify_inode_watches_children() for each child that
> falsely triggers instead of for only one. Maybe that's not too bad? After
> all any event upto this moment triggered this overhead as well...

This is an interesting idea. It came across my mind but I don't think I
considered it seriously because I assumed that it was too big a change.
But I suppose in the process I created an even bigger change :P

The false positive dget_parent() + fsnotify_inode_watches_children()
shouldn't be too bad. I could see a situation where there's a lot of
random accesses within a directory, where the dget_parent() could cause
some contention over the parent dentry. But to be fair, the performance
would have been the same or worse while fsnotify was active in that
case, and the contention would go away as most of the dentries get their
flags cleared. So I don't think this is a problem.

> Am I missing something?

I think there's one thing missed here. I understand you'd like to get
rid of the extra flag in the connector. But the advantage of the flag is
avoiding duplicate work by saving a bit of state. Suppose that a mark is
added to a connector, which causes fsnotify_inode_watches_children() to
become true. Then, any subsequent call to fsnotify_recalc_mask() must
call __fsnotify_update_child_dentry_flags(), even though the child
dentry flags don't need to be updated: they're already set. For (very)
large directories, this can take a few seconds, which means that we're
doing a few extra seconds of work each time a new mark is added to or
removed from a connector in that case. I can't imagine that's a super
common workload though, and I don't know if my customers do that (my
guess would be no).

For an example of a test workload where this would make a difference:
one of my test cases is to create a directory with millions of negative
dentries, and then run "for i in {1..20}; do inotifywait $DIR & done".
With the series as-is, only the first task needs to do the child flag
update. With your proposed alternative, each task would re-do the
update.

If we were to manage to get __fsnotify_update_child_dentry_flags() to
work correctly, and safely, with the ability to drop the d_lock and
sleep, then I think that wouldn't be too much of a problem, because then
other spinlock users of the directory will get a chance to grab it, and
so we don't risk softlockups. Without the sleepable iterations, it would
be marginally worse than the current solution, but I can't really
comment _how_ much worse because like I said, it doesn't sound like a
frequent usage pattern.

I think I have a _slight_ preference for the current design, but I see
the appeal of simpler code, and your design would still improve things a
lot! Maybe Amir has an opinion too, but of course I'll defer to what you
want.

Thanks,
Stephen
  
Amir Goldstein Nov. 2, 2022, 8:55 a.m. UTC | #3
On Tue, Nov 1, 2022 at 10:49 PM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> Hi Jan,
>
> Jan Kara <jack@suse.cz> writes:
> > Hi Stephen!
> >
> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> >> patch.
> >>
> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> returns a reference, which I never called dput() on. With that change, I
> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> one especially.
> >>
> >> Thanks,
> >> Stephen
> >>
> >> Stephen Brennan (3):
> >>   fsnotify: Use d_find_any_alias to get dentry associated with inode
> >>   fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >>   fsnotify: allow sleepable child flag update
> >
> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>
> Absolutely no worries, these things take time. Thanks for taking a look!
>
> > The first patch is a nobrainer. The other two patches ... complicate things
> > somewhat more complicated than I'd like. I guess I can live with them if we
> > don't find a better solution but I'd like to discuss a bit more about
> > alternatives.
>
> Understood!
>
> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> > __fsnotify_parent() for the dentry which triggered the event and does not
> > have watched parent anymore and never bother with full children walk? I
> > suppose your contention problems will be gone, we'll just pay the price of
> > dget_parent() + fsnotify_inode_watches_children() for each child that
> > falsely triggers instead of for only one. Maybe that's not too bad? After
> > all any event upto this moment triggered this overhead as well...
>
> This is an interesting idea. It came across my mind but I don't think I
> considered it seriously because I assumed that it was too big a change.
> But I suppose in the process I created an even bigger change :P
>
> The false positive dget_parent() + fsnotify_inode_watches_children()
> shouldn't be too bad. I could see a situation where there's a lot of
> random accesses within a directory, where the dget_parent() could cause
> some contention over the parent dentry. But to be fair, the performance
> would have been the same or worse while fsnotify was active in that
> case, and the contention would go away as most of the dentries get their
> flags cleared. So I don't think this is a problem.
>

I also don't think that is a problem.

> > Am I missing something?
>
> I think there's one thing missed here. I understand you'd like to get
> rid of the extra flag in the connector. But the advantage of the flag is
> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> added to a connector, which causes fsnotify_inode_watches_children() to
> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> call __fsnotify_update_child_dentry_flags(), even though the child
> dentry flags don't need to be updated: they're already set. For (very)
> large directories, this can take a few seconds, which means that we're
> doing a few extra seconds of work each time a new mark is added to or
> removed from a connector in that case. I can't imagine that's a super
> common workload though, and I don't know if my customers do that (my
> guess would be no).
>
> For an example of a test workload where this would make a difference:
> one of my test cases is to create a directory with millions of negative
> dentries, and then run "for i in {1..20}; do inotifywait $DIR & done".
> With the series as-is, only the first task needs to do the child flag
> update. With your proposed alternative, each task would re-do the
> update.
>
> If we were to manage to get __fsnotify_update_child_dentry_flags() to
> work correctly, and safely, with the ability to drop the d_lock and
> sleep, then I think that wouldn't be too much of a problem, because then
> other spinlock users of the directory will get a chance to grab it, and
> so we don't risk softlockups. Without the sleepable iterations, it would
> be marginally worse than the current solution, but I can't really
> comment _how_ much worse because like I said, it doesn't sound like a
> frequent usage pattern.
>
> I think I have a _slight_ preference for the current design, but I see
> the appeal of simpler code, and your design would still improve things a
> lot! Maybe Amir has an opinion too, but of course I'll defer to what you
> want.
>

IIUC, patches #1+#3 fix a reproducible softlock, so if Al approves them,
I see no reason to withhold.

With patches #1+#3 applied, the penalty is restricted to adding or
removing/destroying multiple marks on very large dirs or dirs with
many negative dentries.

I think that fixing the synthetic test case of multiple added marks
is rather easy even without DCACHE_FSNOTIFY_PARENT_WATCHED.
Something like the attached patch is what Jan had suggested in the
first place.

The synthetic test case of concurrent add/remove watch on the same
dir may still result in multiple children iterations, but that will usually
not be avoided even with DCACHE_FSNOTIFY_PARENT_WATCHED
and probably not worth optimizing for.

Thoughts?

Thanks,
Amir.
  
Jan Kara Nov. 2, 2022, 5:52 p.m. UTC | #4
On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
> Jan Kara <jack@suse.cz> writes:
> > Hi Stephen!
> >
> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> >> patch.
> >> 
> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> returns a reference, which I never called dput() on. With that change, I
> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> one especially.
> >> 
> >> Thanks,
> >> Stephen
> >> 
> >> Stephen Brennan (3):
> >>   fsnotify: Use d_find_any_alias to get dentry associated with inode
> >>   fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >>   fsnotify: allow sleepable child flag update
> >
> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
> 
> Absolutely no worries, these things take time. Thanks for taking a look!
> 
> > The first patch is a nobrainer. The other two patches ... complicate things
> > somewhat more complicated than I'd like. I guess I can live with them if we
> > don't find a better solution but I'd like to discuss a bit more about
> > alternatives.
> 
> Understood!
> 
> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> > __fsnotify_parent() for the dentry which triggered the event and does not
> > have watched parent anymore and never bother with full children walk? I
> > suppose your contention problems will be gone, we'll just pay the price of
> > dget_parent() + fsnotify_inode_watches_children() for each child that
> > falsely triggers instead of for only one. Maybe that's not too bad? After
> > all any event upto this moment triggered this overhead as well...
> 
> This is an interesting idea. It came across my mind but I don't think I
> considered it seriously because I assumed that it was too big a change.
> But I suppose in the process I created an even bigger change :P
> 
> The false positive dget_parent() + fsnotify_inode_watches_children()
> shouldn't be too bad. I could see a situation where there's a lot of
> random accesses within a directory, where the dget_parent() could cause
> some contention over the parent dentry. But to be fair, the performance
> would have been the same or worse while fsnotify was active in that
> case, and the contention would go away as most of the dentries get their
> flags cleared. So I don't think this is a problem.
> 
> > Am I missing something?
> 
> I think there's one thing missed here. I understand you'd like to get
> rid of the extra flag in the connector. But the advantage of the flag is
> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> added to a connector, which causes fsnotify_inode_watches_children() to
> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> call __fsnotify_update_child_dentry_flags(), even though the child
> dentry flags don't need to be updated: they're already set. For (very)
> large directories, this can take a few seconds, which means that we're
> doing a few extra seconds of work each time a new mark is added to or
> removed from a connector in that case. I can't imagine that's a super
> common workload though, and I don't know if my customers do that (my
> guess would be no).

I understand. This basically matters for fsnotify_recalc_mask(). As a side
note I've realized that your changes to fsnotify_recalc_mask() acquiring
inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
dnotify because that calls fsnotify_recalc_mask() under a spinlock.
Furthermore it is somewhat worrying also for inotify & fanotify because it
nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
something doesn't force the ordering the other way around (e.g. the removal
of oneshot mark during modify event generation). Did you run tests with
lockdep enabled?

Anyway, if the lock ordering issues can be solved, I suppose we can
optimize fsnotify_recalc_mask() like:

	inode_lock(inode);
	spin_lock(&conn->lock);
	oldmask = inode->i_fsnotify_mask;
	__fsnotify_recalc_mask(conn);
	newmask = inode->i_fsnotify_mask;
	spin_unlock(&conn->lock);
	if (watching children changed(oldmask, newmask))
		__fsnotify_update_child_dentry_flags(...)
	inode_unlock(inode);

And because everything is serialized by inode_lock, we don't have to worry
about inode->i_fsnotify_mask and dentry flags getting out of sync or some
mark addition returning before all children are marked for reporting
events. No need for the connector flag AFAICT.

But the locking issue needs to be resolved first in any case. I need to
think some more...

								Honza
  
Stephen Brennan Nov. 4, 2022, 11:33 p.m. UTC | #5
Jan Kara <jack@suse.cz> writes:
> On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
>> Jan Kara <jack@suse.cz> writes:
>> > Hi Stephen!
>> >
>> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
>> >> Here is v3 of the patch series. I've taken all of the feedback,
>> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
>> >> patch.
>> >> 
>> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
>> >> returns a reference, which I never called dput() on. With that change, I
>> >> no longer see the rpc_pipefs issue, but I do think I need more testing
>> >> and thinking through the third patch. Al, I'd love your feedback on that
>> >> one especially.
>> >> 
>> >> Thanks,
>> >> Stephen
>> >> 
>> >> Stephen Brennan (3):
>> >>   fsnotify: Use d_find_any_alias to get dentry associated with inode
>> >>   fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
>> >>   fsnotify: allow sleepable child flag update
>> >
>> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>> 
>> Absolutely no worries, these things take time. Thanks for taking a look!
>> 
>> > The first patch is a nobrainer. The other two patches ... complicate things
>> > somewhat more complicated than I'd like. I guess I can live with them if we
>> > don't find a better solution but I'd like to discuss a bit more about
>> > alternatives.
>> 
>> Understood!
>> 
>> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
>> > __fsnotify_parent() for the dentry which triggered the event and does not
>> > have watched parent anymore and never bother with full children walk? I
>> > suppose your contention problems will be gone, we'll just pay the price of
>> > dget_parent() + fsnotify_inode_watches_children() for each child that
>> > falsely triggers instead of for only one. Maybe that's not too bad? After
>> > all any event upto this moment triggered this overhead as well...
>> 
>> This is an interesting idea. It came across my mind but I don't think I
>> considered it seriously because I assumed that it was too big a change.
>> But I suppose in the process I created an even bigger change :P
>> 
>> The false positive dget_parent() + fsnotify_inode_watches_children()
>> shouldn't be too bad. I could see a situation where there's a lot of
>> random accesses within a directory, where the dget_parent() could cause
>> some contention over the parent dentry. But to be fair, the performance
>> would have been the same or worse while fsnotify was active in that
>> case, and the contention would go away as most of the dentries get their
>> flags cleared. So I don't think this is a problem.
>> 
>> > Am I missing something?
>> 
>> I think there's one thing missed here. I understand you'd like to get
>> rid of the extra flag in the connector. But the advantage of the flag is
>> avoiding duplicate work by saving a bit of state. Suppose that a mark is
>> added to a connector, which causes fsnotify_inode_watches_children() to
>> become true. Then, any subsequent call to fsnotify_recalc_mask() must
>> call __fsnotify_update_child_dentry_flags(), even though the child
>> dentry flags don't need to be updated: they're already set. For (very)
>> large directories, this can take a few seconds, which means that we're
>> doing a few extra seconds of work each time a new mark is added to or
>> removed from a connector in that case. I can't imagine that's a super
>> common workload though, and I don't know if my customers do that (my
>> guess would be no).
>
> I understand. This basically matters for fsnotify_recalc_mask(). As a side
> note I've realized that your changes to fsnotify_recalc_mask() acquiring
> inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
> dnotify because that calls fsnotify_recalc_mask() under a spinlock.
> Furthermore it is somewhat worrying also for inotify & fanotify because it
> nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
> something doesn't force the ordering the other way around (e.g. the removal
> of oneshot mark during modify event generation). Did you run tests with
> lockdep enabled?

No I didn't. I'll be sure to get the LTP tests running with lockdep
early next week and try this series out, we'll probably get an error
like you say.

I'll also take a look at the dnotify use case and see if there's
anything to do there. Hopefully there's something we can do to salvage
it :D

Thanks,
Stephen

> Anyway, if the lock ordering issues can be solved, I suppose we can
> optimize fsnotify_recalc_mask() like:
>
> 	inode_lock(inode);
> 	spin_lock(&conn->lock);
> 	oldmask = inode->i_fsnotify_mask;
> 	__fsnotify_recalc_mask(conn);
> 	newmask = inode->i_fsnotify_mask;
> 	spin_unlock(&conn->lock);
> 	if (watching children changed(oldmask, newmask))
> 		__fsnotify_update_child_dentry_flags(...)
> 	inode_unlock(inode);
>
> And because everything is serialized by inode_lock, we don't have to worry
> about inode->i_fsnotify_mask and dentry flags getting out of sync or some
> mark addition returning before all children are marked for reporting
> events. No need for the connector flag AFAICT.
>
> But the locking issue needs to be resolved first in any case. I need to
> think some more...
>
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
  
Jan Kara Nov. 7, 2022, 11:56 a.m. UTC | #6
On Fri 04-11-22 16:33:18, Stephen Brennan wrote:
> Jan Kara <jack@suse.cz> writes:
> > On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >> > Hi Stephen!
> >> >
> >> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> >> >> patch.
> >> >> 
> >> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> >> returns a reference, which I never called dput() on. With that change, I
> >> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> >> one especially.
> >> >> 
> >> >> Thanks,
> >> >> Stephen
> >> >> 
> >> >> Stephen Brennan (3):
> >> >>   fsnotify: Use d_find_any_alias to get dentry associated with inode
> >> >>   fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >> >>   fsnotify: allow sleepable child flag update
> >> >
> >> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
> >> 
> >> Absolutely no worries, these things take time. Thanks for taking a look!
> >> 
> >> > The first patch is a nobrainer. The other two patches ... complicate things
> >> > somewhat more complicated than I'd like. I guess I can live with them if we
> >> > don't find a better solution but I'd like to discuss a bit more about
> >> > alternatives.
> >> 
> >> Understood!
> >> 
> >> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> >> > __fsnotify_parent() for the dentry which triggered the event and does not
> >> > have watched parent anymore and never bother with full children walk? I
> >> > suppose your contention problems will be gone, we'll just pay the price of
> >> > dget_parent() + fsnotify_inode_watches_children() for each child that
> >> > falsely triggers instead of for only one. Maybe that's not too bad? After
> >> > all any event upto this moment triggered this overhead as well...
> >> 
> >> This is an interesting idea. It came across my mind but I don't think I
> >> considered it seriously because I assumed that it was too big a change.
> >> But I suppose in the process I created an even bigger change :P
> >> 
> >> The false positive dget_parent() + fsnotify_inode_watches_children()
> >> shouldn't be too bad. I could see a situation where there's a lot of
> >> random accesses within a directory, where the dget_parent() could cause
> >> some contention over the parent dentry. But to be fair, the performance
> >> would have been the same or worse while fsnotify was active in that
> >> case, and the contention would go away as most of the dentries get their
> >> flags cleared. So I don't think this is a problem.
> >> 
> >> > Am I missing something?
> >> 
> >> I think there's one thing missed here. I understand you'd like to get
> >> rid of the extra flag in the connector. But the advantage of the flag is
> >> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> >> added to a connector, which causes fsnotify_inode_watches_children() to
> >> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> >> call __fsnotify_update_child_dentry_flags(), even though the child
> >> dentry flags don't need to be updated: they're already set. For (very)
> >> large directories, this can take a few seconds, which means that we're
> >> doing a few extra seconds of work each time a new mark is added to or
> >> removed from a connector in that case. I can't imagine that's a super
> >> common workload though, and I don't know if my customers do that (my
> >> guess would be no).
> >
> > I understand. This basically matters for fsnotify_recalc_mask(). As a side
> > note I've realized that your changes to fsnotify_recalc_mask() acquiring
> > inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
> > dnotify because that calls fsnotify_recalc_mask() under a spinlock.
> > Furthermore it is somewhat worrying also for inotify & fanotify because it
> > nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
> > something doesn't force the ordering the other way around (e.g. the removal
> > of oneshot mark during modify event generation). Did you run tests with
> > lockdep enabled?
> 
> No I didn't. I'll be sure to get the LTP tests running with lockdep
> early next week and try this series out, we'll probably get an error
> like you say.
> 
> I'll also take a look at the dnotify use case and see if there's
> anything to do there. Hopefully there's something we can do to salvage
> it :D

Yeah, dnotify should be solvable. I'm even willing to accept somewhat
unusual methods for dnotify ;). It is ancient and rarely used API these
days so I don't want it to block fixes for newer APIs. From a quick look it
should be enough to move fsnotify_recalc_mask() call in
dnotify_recalc_inode_mask() from under the mark->lock out into callers and
perhaps rename dnotify_recalc_inode_mask() to dnotify_recalc_mark_mask()
to make the name somewhat less confusing.

								Honza
  
Stephen Brennan Nov. 10, 2022, 8:04 p.m. UTC | #7
Amir Goldstein <amir73il@gmail.com> writes:
[...]
> IIUC, patches #1+#3 fix a reproducible softlock, so if Al approves them,
> I see no reason to withhold.
>
> With patches #1+#3 applied, the penalty is restricted to adding or
> removing/destroying multiple marks on very large dirs or dirs with
> many negative dentries.
>
> I think that fixing the synthetic test case of multiple added marks
> is rather easy even without DCACHE_FSNOTIFY_PARENT_WATCHED.
> Something like the attached patch is what Jan had suggested in the
> first place.
>
> The synthetic test case of concurrent add/remove watch on the same
> dir may still result in multiple children iterations, but that will usually
> not be avoided even with DCACHE_FSNOTIFY_PARENT_WATCHED
> and probably not worth optimizing for.
>
> Thoughts?

If I understand your train of thought, your below patch would take the
place of my patch #2, since #3 requires we not hold a spinlock during
fsnotify_update_children_dentry_flags().

And since you fully rely on dentry accessors to lazily clear flags, you
don't need to have the mutual exclusion provided by the inode lock.

I like that a lot.

However, the one thing I see here is that if we no longer hold the inode
lock, patch #3 needs to be re-examined: it assumes that dentries can't
be removed or moved around, and that assumption is only true with the
inode lock held.

I'll test things out with this patch replacing #2 and see how things
shake out. Maybe I can improve #3.

Thanks,
Stephen

>
> Thanks,
> Amir.
> From c8ea1d84397c26ce334bff9d9f721400cebb88dd Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@gmail.com>
> Date: Wed, 2 Nov 2022 10:28:01 +0200
> Subject: [PATCH] fsnotify: clear PARENT_WATCHED flags lazily
>
> Call fsnotify_update_children_dentry_flags() to set PARENT_WATCHED flags
> only when parent starts watching children.
>
> When parent stops watching children, clear false positive PARENT_WATCHED
> flags lazily in __fsnotify_parent() for each accessed child.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fsnotify.c             | 26 ++++++++++++++++++++------
>  fs/notify/fsnotify.h             |  3 ++-
>  fs/notify/mark.c                 | 32 +++++++++++++++++++++++++++++---
>  include/linux/fsnotify_backend.h |  8 +++++---
>  4 files changed, 56 insertions(+), 13 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 1e541a9bd12b..f60078d6bb97 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -103,17 +103,13 @@ void fsnotify_sb_delete(struct super_block *sb)
>   * parent cares.  Thus when an event happens on a child it can quickly tell
>   * if there is a need to find a parent and send the event to the parent.
>   */
> -void __fsnotify_update_child_dentry_flags(struct inode *inode)
> +void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
>  {
>  	struct dentry *alias;
> -	int watched;
>  
>  	if (!S_ISDIR(inode->i_mode))
>  		return;
>  
> -	/* determine if the children should tell inode about their events */
> -	watched = fsnotify_inode_watches_children(inode);
> -
>  	spin_lock(&inode->i_lock);
>  	/* run all of the dentries associated with this inode.  Since this is a
>  	 * directory, there damn well better only be one item on this list */
> @@ -140,6 +136,24 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
>  	spin_unlock(&inode->i_lock);
>  }
>  
> +/*
> + * Lazily clear false positive PARENT_WATCHED flag for child whose parent had
> + * stopped wacthing children.
> + */
> +static void fsnotify_update_child_dentry_flags(struct inode *inode,
> +					       struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	/*
> +	 * d_lock is a sufficient barrier to prevent observing a non-watched
> +	 * parent state from before the fsnotify_update_children_dentry_flags()
> +	 * or fsnotify_update_flags() call that had set PARENT_WATCHED.
> +	 */
> +	if (!fsnotify_inode_watches_children(inode))
> +		dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
>  /* Are inode/sb/mount interested in parent and name info with this event? */
>  static bool fsnotify_event_needs_parent(struct inode *inode, struct mount *mnt,
>  					__u32 mask)
> @@ -208,7 +222,7 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>  	p_inode = parent->d_inode;
>  	p_mask = fsnotify_inode_watches_children(p_inode);
>  	if (unlikely(parent_watched && !p_mask))
> -		__fsnotify_update_child_dentry_flags(p_inode);
> +		fsnotify_update_child_dentry_flags(p_inode, dentry);
>  
>  	/*
>  	 * Include parent/name in notification either if some notification
> diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
> index fde74eb333cc..bce9be36d06b 100644
> --- a/fs/notify/fsnotify.h
> +++ b/fs/notify/fsnotify.h
> @@ -74,7 +74,8 @@ static inline void fsnotify_clear_marks_by_sb(struct super_block *sb)
>   * update the dentry->d_flags of all of inode's children to indicate if inode cares
>   * about events that happen to its children.
>   */
> -extern void __fsnotify_update_child_dentry_flags(struct inode *inode);
> +extern void fsnotify_update_children_dentry_flags(struct inode *inode,
> +						  bool watched);
>  
>  extern struct kmem_cache *fsnotify_mark_connector_cachep;
>  
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index fcc68b8a40fd..614bce0e7261 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -176,6 +176,24 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>  	return fsnotify_update_iref(conn, want_iref);
>  }
>  
> +static bool fsnotify_conn_watches_children(
> +					struct fsnotify_mark_connector *conn)
> +{
> +	if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
> +		return false;
> +
> +	return fsnotify_inode_watches_children(fsnotify_conn_inode(conn));
> +}
> +
> +static void fsnotify_conn_set_children_dentry_flags(
> +					struct fsnotify_mark_connector *conn)
> +{
> +	if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
> +		return;
> +
> +	fsnotify_update_children_dentry_flags(fsnotify_conn_inode(conn), true);
> +}
> +
>  /*
>   * Calculate mask of events for a list of marks. The caller must make sure
>   * connector and connector->obj cannot disappear under us.  Callers achieve
> @@ -184,15 +202,23 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>   */
>  void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>  {
> +	bool update_children;
> +
>  	if (!conn)
>  		return;
>  
>  	spin_lock(&conn->lock);
> +	update_children = !fsnotify_conn_watches_children(conn);
>  	__fsnotify_recalc_mask(conn);
> +	update_children &= fsnotify_conn_watches_children(conn);
>  	spin_unlock(&conn->lock);
> -	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
> -		__fsnotify_update_child_dentry_flags(
> -					fsnotify_conn_inode(conn));
> +	/*
> +	 * Set children's PARENT_WATCHED flags only if parent started watching.
> +	 * When parent stops watching, we clear false positive PARENT_WATCHED
> +	 * flags lazily in __fsnotify_parent().
> +	 */
> +	if (update_children)
> +		fsnotify_conn_set_children_dentry_flags(conn);
>  }
>  
>  /* Free all connectors queued for freeing once SRCU period ends */
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index a31423c376a7..bd90bcf6c3b0 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -586,12 +586,14 @@ static inline __u32 fsnotify_parent_needed_mask(__u32 mask)
>  
>  static inline int fsnotify_inode_watches_children(struct inode *inode)
>  {
> +	__u32 parent_mask = READ_ONCE(inode->i_fsnotify_mask);
> +
>  	/* FS_EVENT_ON_CHILD is set if the inode may care */
> -	if (!(inode->i_fsnotify_mask & FS_EVENT_ON_CHILD))
> +	if (!(parent_mask & FS_EVENT_ON_CHILD))
>  		return 0;
>  	/* this inode might care about child events, does it care about the
>  	 * specific set of events that can happen on a child? */
> -	return inode->i_fsnotify_mask & FS_EVENTS_POSS_ON_CHILD;
> +	return parent_mask & FS_EVENTS_POSS_ON_CHILD;
>  }
>  
>  /*
> @@ -605,7 +607,7 @@ static inline void fsnotify_update_flags(struct dentry *dentry)
>  	/*
>  	 * Serialisation of setting PARENT_WATCHED on the dentries is provided
>  	 * by d_lock. If inotify_inode_watched changes after we have taken
> -	 * d_lock, the following __fsnotify_update_child_dentry_flags call will
> +	 * d_lock, the following fsnotify_update_children_dentry_flags call will
>  	 * find our entry, so it will spin until we complete here, and update
>  	 * us with the new state.
>  	 */
> -- 
> 2.25.1