[1/3] nfsd: use __fput_sync() to avoid delayed closing of files.

Message ID 20231208033006.5546-2-neilb@suse.de
State New
Series nfsd: fully close all files in the nfsd threads

Commit Message

NeilBrown Dec. 8, 2023, 3:27 a.m. UTC
Calling fput() directly or through filp_close() from a kernel thread like
nfsd causes the final __fput() (if necessary) to be called from a
workqueue.  This means that nfsd is not forced to wait for any work to
complete.  If the ->release or ->destroy_inode function is slow for any
reason, this can result in nfsd closing files more quickly than the
workqueue can complete the close and the queue of pending closes can
grow without bound (30 million has been seen at one customer site,
though this was in part due to a slowness in xfs which has since been
fixed).

nfsd does not need this.  It is quite appropriate and safe for nfsd to do
its own close work.  There is no reason that close should ever wait for
nfsd, so no deadlock can occur.

So change all fput() calls to __fput_sync(), and convert filp_close() to
the sequence get_file();filp_close();__fput_sync().

This ensures that no fput work is queued to the workqueue.

Note that this removes the only in-module use of flush_delayed_fput().

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/filecache.c   |  3 ++-
 fs/nfsd/lockd.c       |  2 +-
 fs/nfsd/nfs4proc.c    |  4 ++--
 fs/nfsd/nfs4recover.c |  2 +-
 fs/nfsd/vfs.c         | 12 ++++++------
 5 files changed, 12 insertions(+), 11 deletions(-)
  

Comments

Chuck Lever Dec. 8, 2023, 3:01 p.m. UTC | #1
On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> Calling fput() directly or though filp_close() from a kernel thread like
> nfsd causes the final __fput() (if necessary) to be called from a
> workqueue.  This means that nfsd is not forced to wait for any work to
> complete.  If the ->release of ->destroy_inode function is slow for any
> reason, this can result in nfsd closing files more quickly than the
> workqueue can complete the close and the queue of pending closes can
> grow without bounces (30 million has been seen at one customer site,
> though this was in part due to a slowness in xfs which has since been
> fixed).
> 
> nfsd does not need this.

That is technically true, but IIUC, there is only one case where a
synchronous close matters for the backlog problem, and that's when
nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
call sites (except rename) are error paths, so there aren't negative
consequences for the lack of synchronous wait there...

Consider at least breaking this patch into pieces so that there is
one call site changed per patch:

 - If a performance regression occurs, a bisect can point out the
   specific call site replacement that triggered it

 - There is an opportunity to provide rationale for each call site,
   because it seems to me there are two or three distinct
   motivations across this patch, not all of which apply to each
   replacement. More specific (perhaps naive) questions below.

 - It's more surgical to drop one or two of these smaller patches
   if we find a problem or that the particular change is unneeded.

Also, it would be convenient if this patch (series) introduced an
nfsd_close() that then called __fput_sync():

 - The kdoc for that function could explain why __fput_sync() is
   preferred and safe for nfsd

 - When troubleshooting, function boundary tracing could use this
   utility function to easily filter (sometimes costly) calls to
   __fput_sync() from nfsd and ignore other calls to __fput_sync()

But if it turns out only one or two fput() call sites need to be 
replaced, an nfsd_close() utility doesn't make sense.
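
A rough sketch of what I have in mind, where the name, placement and
kdoc wording are only illustrative:

  /**
   * nfsd_close - synchronously release a file opened on a client's behalf
   * @file: the file to close
   *
   * nfsd runs in kernel threads, so the deadlock concerns that make
   * plain fput() defer the final __fput() do not apply here.  Closing
   * synchronously also stops nfsd racing ahead of a backlogged
   * delayed-fput queue.
   */
  static void nfsd_close(struct file *file)
  {
          __fput_sync(file);
  }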


I'm trying to think of a benchmark workload that I can use to
exercise this patch series. We've used fstests generic/531 in the
past for similar issues. Any others to consider?


> This quite appropriate and safe for nfsd to do
> its own close work.  There is now reason that close should ever wait for
> nfsd, so no deadlock can occur.
> 
> So change all fput() calls to __fput_sync(), and convert filp_close() to
> the sequence get_file();filp_close();__fput_sync().
> 
> This ensure that no fput work is queued to the workqueue.
> 
> Note that this removes the only in-module use of flush_fput_queue().
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/filecache.c   |  3 ++-
>  fs/nfsd/lockd.c       |  2 +-
>  fs/nfsd/nfs4proc.c    |  4 ++--
>  fs/nfsd/nfs4recover.c |  2 +-
>  fs/nfsd/vfs.c         | 12 ++++++------
>  5 files changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index ef063f93fde9..e9734c7451b5 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -283,7 +283,9 @@ nfsd_file_free(struct nfsd_file *nf)
>  		nfsd_file_mark_put(nf->nf_mark);
>  	if (nf->nf_file) {
>  		nfsd_file_check_write_error(nf);
> +		get_file(nf->nf_file);
>  		filp_close(nf->nf_file, NULL);
> +		__fput_sync(nf->nf_file);
>  	}

This effectively reverts 4c475eee0237 ("nfsd: don't fsync nfsd_files
on last close")... Won't this cause a regression? Jeff?

And this is really the only place where a flushing close can
produce back pressure on clients, isn't it?

>  
>  	/*
> @@ -631,7 +633,6 @@ nfsd_file_close_inode_sync(struct inode *inode)
>  		list_del_init(&nf->nf_lru);
>  		nfsd_file_free(nf);
>  	}
> -	flush_delayed_fput();
>  }
>  
>  /**
> diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
> index 46a7f9b813e5..f9d1059096a4 100644
> --- a/fs/nfsd/lockd.c
> +++ b/fs/nfsd/lockd.c
> @@ -60,7 +60,7 @@ nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp,
>  static void
>  nlm_fclose(struct file *filp)
>  {
> -	fput(filp);
> +	__fput_sync(filp);
>  }

Will lock file descriptors have dirty data? This function is called
from nlm_traverse_files(), which is looping over all files in a
table, and it's called while a global mutex is held. Any data
showing this won't have a scalability impact?


>  static const struct nlmsvc_binding nfsd_nlm_ops = {
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 6f2d4aa4970d..20d60823d530 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -629,7 +629,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		nn->somebody_reclaimed = true;
>  out:
>  	if (open->op_filp) {
> -		fput(open->op_filp);
> +		__fput_sync(open->op_filp);
>  		open->op_filp = NULL;
>  	}

Isn't this just discarding a file descriptor that wasn't needed?
What's the need for a flushing close here?


>  	if (resfh && resfh != &cstate->current_fh) {
> @@ -1546,7 +1546,7 @@ nfsd4_cleanup_inter_ssc(struct nfsd4_ssc_umount_item *nsui, struct file *filp,
>  	long timeout = msecs_to_jiffies(nfsd4_ssc_umount_timeout);
>  
>  	nfs42_ssc_close(filp);
> -	fput(filp);
> +	__fput_sync(filp);
>  
>  	spin_lock(&nn->nfsd_ssc_lock);
>  	list_del(&nsui->nsui_list);
> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index 3509e73abe1f..f8f0112fd9f5 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -561,7 +561,7 @@ nfsd4_shutdown_recdir(struct net *net)
>  
>  	if (!nn->rec_file)
>  		return;
> -	fput(nn->rec_file);
> +	__fput_sync(nn->rec_file);
>  	nn->rec_file = NULL;
>  }

What's the justification for a flushing close in this path?


>  
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index fbbea7498f02..15a811229211 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -879,7 +879,7 @@ __nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
>  
>  	host_err = ima_file_check(file, may_flags);
>  	if (host_err) {
> -		fput(file);
> +		__fput_sync(file);
>  		goto out;
>  	}

AFAICT __nfsd_open is used only for creating a file descriptor for
either an NLM request or for handling a readdir. In fact, I'm not
even sure why there is an ima_file_check() call site here. IMO
there doesn't need to be a flushing close in this error flow.


> @@ -1884,10 +1884,10 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
>  	fh_drop_write(ffhp);
>  
>  	/*
> -	 * If the target dentry has cached open files, then we need to try to
> -	 * close them prior to doing the rename. Flushing delayed fput
> -	 * shouldn't be done with locks held however, so we delay it until this
> -	 * point and then reattempt the whole shebang.
> +	 * If the target dentry has cached open files, then we need to
> +	 * try to close them prior to doing the rename.  Final fput
> +	 * shouldn't be done with locks held however, so we delay it
> +	 * until this point and then reattempt the whole shebang.
>  	 */
>  	if (close_cached) {
>  		close_cached = false;
> @@ -2141,7 +2141,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
>  	if (err == nfserr_eof || err == nfserr_toosmall)
>  		err = nfs_ok; /* can still be found in ->err */
>  out_close:
> -	fput(file);
> +	__fput_sync(file);

Do we expect a directory file descriptor to need a flushing close?


>  out:
>  	return err;
>  }
> -- 
> 2.43.0
> 
>
  
NeilBrown Dec. 10, 2023, 10:47 p.m. UTC | #2
On Sat, 09 Dec 2023, Chuck Lever wrote:
> On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > Calling fput() directly or though filp_close() from a kernel thread like
> > nfsd causes the final __fput() (if necessary) to be called from a
> > workqueue.  This means that nfsd is not forced to wait for any work to
> > complete.  If the ->release of ->destroy_inode function is slow for any
> > reason, this can result in nfsd closing files more quickly than the
> > workqueue can complete the close and the queue of pending closes can
> > grow without bounces (30 million has been seen at one customer site,
> > though this was in part due to a slowness in xfs which has since been
> > fixed).
> > 
> > nfsd does not need this.
> 
> That is technically true, but IIUC, there is only one case where a
> synchronous close matters for the backlog problem, and that's when
> nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> call sites (except rename) are error paths, so there aren't negative
> consequences for the lack of synchronous wait there...

What you say is technically true but it isn't the way I see it.

Firstly I should clarify that __fput_sync() is *not* a flushing close as
you describe it below.
All it does, apart from some trivial book-keeping, is to call ->release
and possibly ->destroy_inode immediately rather than shunting them off
to another thread.
Apparently ->release sometimes does something that can deadlock with
some kernel threads or if some awkward locks are held, so the whole
final __fput is delayed by default.  But this does not apply to nfsd.
Standard fput() is really the wrong interface for nfsd to use.  
It should use __fput_sync() (which shouldn't have such a scary name).
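
For anyone who hasn't looked recently, the difference in outline (this
is a simplification, not the actual fs/file_table.c code):

  void fput(struct file *file)
  {
          if (atomic_long_dec_and_test(&file->f_count)) {
                  /* A user task has the final __fput() queued as
                   * task_work, to run before returning to userspace.
                   * A kernel thread such as nfsd has the file added
                   * to a global list which a workqueue drains later,
                   * and nothing makes nfsd wait for that work.
                   */
          }
  }

  void __fput_sync(struct file *file)
  {
          if (atomic_long_dec_and_test(&file->f_count))
                  __fput(file);   /* ->release() etc. run right here */
  }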

The comment above flush_delayed_fput() seems to suggest that unmounting
is a core issue.  Maybe the fact that __fput() can call
dissolve_on_fput() is a reason why it is sometimes safer to leave the
work to later.  But I don't see that applying to nfsd.

Of course a ->release function *could* do synchronous writes just like
the XFS ->destroy_inode function used to do synchronous reads.
I don't think we should ever try to hide that by putting it in
a workqueue.  It's probably a bug and it is best if bugs are visible.

Note that the XFS ->release function does call filemap_flush() in some
cases, but that is an async flush, so __fput_sync doesn't wait for the
flush to complete.

The way I see this patch is that fput() is the wrong interface for nfsd
to use, __fput_sync is the right interface.  So we should change.  1
patch.
The details about exhausting memory explain a particular symptom that
motivated the examination which revealed that nfsd was using the wrong
interface.

If we have nfsd sometimes using fput() and sometimes __fput_sync, then
we need to have clear rules for when to use which.  It is much easier to
have a simple rule: always use __fput_sync().

I'm certainly happy to revise function documentation and provide
wrapper functions if needed.

It might be good to have

  void filp_close_sync(struct file *f)
  {
       get_file(f);
       filp_close(f, NULL);
       __fput_sync(f);
  }

but as that would only be called once, it was hard to motivate.
Having it in linux/fs.h would be nice.

Similarly we could wrap __fput_sync() in a more friendly name, but
that would be better if we actually renamed the function.

  void fput_now(struct file *f)
  {
      __fput_sync(f);
  }

??

Thanks,
NeilBrown
  
Chuck Lever Dec. 11, 2023, 7:01 p.m. UTC | #3
On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> On Sat, 09 Dec 2023, Chuck Lever wrote:
> > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > > Calling fput() directly or though filp_close() from a kernel thread like
> > > nfsd causes the final __fput() (if necessary) to be called from a
> > > workqueue.  This means that nfsd is not forced to wait for any work to
> > > complete.  If the ->release of ->destroy_inode function is slow for any
> > > reason, this can result in nfsd closing files more quickly than the
> > > workqueue can complete the close and the queue of pending closes can
> > > grow without bounces (30 million has been seen at one customer site,
> > > though this was in part due to a slowness in xfs which has since been
> > > fixed).
> > > 
> > > nfsd does not need this.
> > 
> > That is technically true, but IIUC, there is only one case where a
> > synchronous close matters for the backlog problem, and that's when
> > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> > call sites (except rename) are error paths, so there aren't negative
> > consequences for the lack of synchronous wait there...
> 
> What you say is technically true but it isn't the way I see it.
> 
> Firstly I should clarify that __fput_sync() is *not* a flushing close as
> you describe it below.
> All it does, apart for some trivial book-keeping, is to call ->release
> and possibly ->destroy_inode immediately rather than shunting them off
> to another thread.
> Apparently ->release sometimes does something that can deadlock with
> some kernel threads or if some awkward locks are held, so the whole
> final __fput is delay by default.  But this does not apply to nfsd.
> Standard fput() is really the wrong interface for nfsd to use.  
> It should use __fput_sync() (which shouldn't have such a scary name).
> 
> The comment above flush_delayed_fput() seems to suggest that unmounting
> is a core issue.  Maybe the fact that __fput() can call
> dissolve_on_fput() is a reason why it is sometimes safer to leave the
> work to later.  But I don't see that applying to nfsd.
> 
> Of course a ->release function *could* do synchronous writes just like
> the XFS ->destroy_inode function used to do synchronous reads.

I had assumed ->release for NFS re-export would flush due to close-
to-open semantics. There seem to be numerous corner cases that
might result in pile-ups which would change the situation in your
problem statement but might not result in an overall improvement.


> I don't think we should ever try to hide that by putting it in
> a workqueue.  It's probably a bug and it is best if bugs are visible.


I'm not objecting, per se, to this change. I would simply like to
see a little more due diligence before moving forward until it is
clear how frequently ->release or ->destroy_inode will do I/O (or
"is slow for any reason" as you say above).


> Note that the XFS ->release function does call filemap_flush() in some
> cases, but that is an async flush, so __fput_sync doesn't wait for the
> flush to complete.

When Jeff was working on the file cache a year ago, I did some
performance analysis that shows even an async flush is costly when
there is a lot of dirty data in the file being closed. The VFS walks
through the whole file and starts I/O on every dirty page. This is
quite CPU intensive, and can take on the order of a millisecond
before the async flush request returns to its caller.

IME async flushes are not free.
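
For reference, the "async" in filemap_flush() refers only to the
writeback mode; the caller still walks the dirty pages and submits all
the bios itself.  Simplified from mm/filemap.c:

  int filemap_flush(struct address_space *mapping)
  {
          return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
  }

i.e. it starts writeback on every dirty page in the mapping before
returning; it just doesn't wait for that IO to complete.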


> The way I see this patch is that fput() is the wrong interface for nfsd
> to use, __fput_sync is the right interface.  So we should change.  1
> patch.

The practical matter is I see this as a change with a greater than
zero risk, and we need to mitigate that risk. Or rather, as a
maintainer of NFSD, /I/ need to see that the risk is as minimal as
is practical.


> The details about exhausting memory explain a particular symptom that
> motivated the examination which revealed that nfsd was using the wrong
> interface.
> 
> If we have nfsd sometimes using fput() and sometimes __fput_sync, then
> we need to have clear rules for when to use which.  It is much easier to
> have a simple rule: always use __fput_sync().

I don't agree that we should just flop all these over and hope for
the best. In particular:

 - the changes in fs/nfsd/filecache.c appear to revert a bug
   fix, so I need to see data that shows that change doesn't
   cause a re-regression

 - the changes in fs/lockd/ can result in long waits while a
   global mutex is held (global as in all namespaces and all
   locked files on the server), so I need to see data that
   demonstrates there won't be a regression

 - the other changes don't appear to have motivation in terms
   of performance or behavior, and carry similar (if lesser)
   risks as the other two changes. My preferred solution to
   potential auditor confusion about the use of __fput_sync()
   in some places and fput() in others is to document, and
   leave call sites alone if there's no technical reason to
   change them at this time.

There is enough of a risk of regression that I need to see a clear
rationale for each hunk /and/ I need to see data that there is
no regression. I know that won't be perfect coverage, but it's
better than not having any data at all.


> I'm certainly happy to revise function documentation and provide
> wrapper functions if needed.
> 
> It might be good to have
> 
>   void filp_close_sync(struct file *f)
>   {
>        get_file(f);
>        filp_close(f);
>        __fput_sync(f);
>   }
> 
> but as that would only be called once, it was hard to motivate.
> Having it in linux/fs.h would be nice.
> 
> Similarly we could wrap __fput_sync() in a more friendly name, but
> that would be better if we actually renamed the function.
> 
>   void fput_now(struct file *f)
>   {
>       __fput_sync(f);
>   }
> 
> ??

Since this is an issue strictly for nfsd, the place for this
utility function is in fs/nfsd/vfs.c, IMO, along with a documenting
comment that provides a rationale for why nfsd does not want plain
fput() in specific cases.

When other subsystems need a similar capability, then let's
consider a common helper.
  
Al Viro Dec. 11, 2023, 7:11 p.m. UTC | #4
On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:

> Similarly would could wrap __fput_sync() is a more friendly name, but
> that would be better if we actually renamed the function.
> 
>   void fput_now(struct file *f)
>   {
>       __fput_sync(f);
>   }

It is unfriendly *precisely* because it should not be used without
a very good reason.

It may be the last opened file keeping a lazy-umounted mount alive.
It may be taking pretty much any locks, or eating a lot of stack
space.

It really isn't a general-purpose API; any "more friendly name"
is going to be NAKed for that reason alone.

Al, very much tempted to send a patch renaming that sucker to
__fput_dont_use_that_unless_you_really_know_what_you_are_doing().
  
NeilBrown Dec. 11, 2023, 10:04 p.m. UTC | #5
On Tue, 12 Dec 2023, Chuck Lever wrote:
> On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> > On Sat, 09 Dec 2023, Chuck Lever wrote:
> > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > > > Calling fput() directly or though filp_close() from a kernel thread like
> > > > nfsd causes the final __fput() (if necessary) to be called from a
> > > > workqueue.  This means that nfsd is not forced to wait for any work to
> > > > complete.  If the ->release of ->destroy_inode function is slow for any
> > > > reason, this can result in nfsd closing files more quickly than the
> > > > workqueue can complete the close and the queue of pending closes can
> > > > grow without bounces (30 million has been seen at one customer site,
> > > > though this was in part due to a slowness in xfs which has since been
> > > > fixed).
> > > > 
> > > > nfsd does not need this.
> > > 
> > > That is technically true, but IIUC, there is only one case where a
> > > synchronous close matters for the backlog problem, and that's when
> > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> > > call sites (except rename) are error paths, so there aren't negative
> > > consequences for the lack of synchronous wait there...
> > 
> > What you say is technically true but it isn't the way I see it.
> > 
> > Firstly I should clarify that __fput_sync() is *not* a flushing close as
> > you describe it below.
> > All it does, apart for some trivial book-keeping, is to call ->release
> > and possibly ->destroy_inode immediately rather than shunting them off
> > to another thread.
> > Apparently ->release sometimes does something that can deadlock with
> > some kernel threads or if some awkward locks are held, so the whole
> > final __fput is delay by default.  But this does not apply to nfsd.
> > Standard fput() is really the wrong interface for nfsd to use.  
> > It should use __fput_sync() (which shouldn't have such a scary name).
> > 
> > The comment above flush_delayed_fput() seems to suggest that unmounting
> > is a core issue.  Maybe the fact that __fput() can call
> > dissolve_on_fput() is a reason why it is sometimes safer to leave the
> > work to later.  But I don't see that applying to nfsd.
> > 
> > Of course a ->release function *could* do synchronous writes just like
> > the XFS ->destroy_inode function used to do synchronous reads.
> 
> I had assumed ->release for NFS re-export would flush due to close-
> to-open semantics. There seem to be numerous corner cases that
> might result in pile-ups which would change the situation in your
> problem statement but might not result in an overall improvement.

That's the ->flush call in filp_close().
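
To spell that out, filp_close() looks roughly like this (a simplified
sketch, not the verbatim fs/open.c code):

  int filp_close(struct file *filp, fl_owner_t id)
  {
          int retval = 0;

          if (filp->f_op->flush)
                  retval = filp->f_op->flush(filp, id);

          dnotify_flush(filp, id);
          locks_remove_posix(filp, id);
          fput(filp);     /* the final __fput() may still be deferred */
          return retval;
  }

so the close-to-open flush for an NFS re-export happens in ->flush()
above, whether or not the final fput is synchronous.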

> 
> 
> > I don't think we should ever try to hide that by putting it in
> > a workqueue.  It's probably a bug and it is best if bugs are visible.
> 
> 
> I'm not objecting, per se, to this change. I would simply like to
> see a little more due diligence before moving forward until it is
> clear how frequently ->release or ->destroy_inode will do I/O (or
> "is slow for any reason" as you say above).
> 
> 
> > Note that the XFS ->release function does call filemap_flush() in some
> > cases, but that is an async flush, so __fput_sync doesn't wait for the
> > flush to complete.
> 
> When Jeff was working on the file cache a year ago, I did some
> performance analysis that shows even an async flush is costly when
> there is a lot of dirty data in the file being closed. The VFS walks
> through the whole file and starts I/O on every dirty page. This is
> quite CPU intensive, and can take on the order of a millisecond
> before the async flush request returns to its caller.
> 
> IME async flushes are not free.

True, they aren't free.  But some thread has to pay that price.
I think nfsd should.

You might argue that nfsd should wait to pay the price until after it
has sent a reply to the client.  My patches already effectively do that
for garbage-collected files.  Doing it for all files would probably be
easy. But is it really worth the (small) complexity?  I don't know.

> 
> 
> > The way I see this patch is that fput() is the wrong interface for nfsd
> > to use, __fput_sync is the right interface.  So we should change.  1
> > patch.
> 
> The practical matter is I see this as a change with a greater than
> zero risk, and we need to mitigate that risk. Or rather, as a
> maintainer of NFSD, /I/ need to see that the risk is as minimal as
> is practical.
> 
> 
> > The details about exhausting memory explain a particular symptom that
> > motivated the examination which revealed that nfsd was using the wrong
> > interface.
> > 
> > If we have nfsd sometimes using fput() and sometimes __fput_sync, then
> > we need to have clear rules for when to use which.  It is much easier to
> > have a simple rule: always use __fput_sync().
> 
> I don't agree that we should just flop all these over and hope for
> the best. In particular:
> 
>  - the changes in fs/nfsd/filecache.c appear to revert a bug
>    fix, so I need to see data that shows that change doesn't
>    cause a re-regression

The bug fix you refer to is
  "nfsd: don't fsync nfsd_files on last close"
The patch doesn't change when fsync (or ->flush) is called, so
it doesn't revert this bugfix.

> 
>  - the changes in fs/lockd/ can result in long waits while a
>    global mutex is held (global as in all namespaces and all
>    locked files on the server), so I need to see data that
>    demonstrates there won't be a regression

It's probably impossible to provide any such data.
The patch certainly moves work inside that mutex and so would increase
the hold time, if only slightly.  Is that lock hot enough to notice?
Conventional wisdom is that locking is only a tiny fraction of NFS
traffic.  It might be possible to construct a workload that saturates
lockd, but I doubt it would be relevant to the real world.

Maybe we should just break up that lock so that the problem becomes moot.

> 
>  - the other changes don't appear to have motivation in terms
>    of performance or behavior, and carry similar (if lesser)
>    risks as the other two changes. My preferred solution to
>    potential auditor confusion about the use of __fput_sync()
>    in some places and fput() in others is to document, and
>    leave call sites alone if there's no technical reason to
>    change them at this time.

Sounds to me like a good way to grow technical debt, but I'll do it like
that if you prefer.

> 
> There is enough of a risk of regression that I need to see a clear
> rationale for each hunk /and/ I need to see data that there is
> no regression. I know that won't be perfect coverage, but it's
> better than not having any data at all.
> 
> 
> > I'm certainly happy to revise function documentation and provide
> > wrapper functions if needed.
> > 
> > It might be good to have
> > 
> >   void filp_close_sync(struct file *f)
> >   {
> >        get_file(f);
> >        filp_close(f);
> >        __fput_sync(f);
> >   }
> > 
> > but as that would only be called once, it was hard to motivate.
> > Having it in linux/fs.h would be nice.
> > 
> > Similarly we could wrap __fput_sync() in a more friendly name, but
> > that would be better if we actually renamed the function.
> > 
> >   void fput_now(struct file *f)
> >   {
> >       __fput_sync(f);
> >   }
> > 
> > ??
> 
> Since this is an issue strictly for nfsd, the place for this
> utility function is in fs/nfsd/vfs.c, IMO, along with a documenting
> comment that provides a rationale for why nfsd does not want plain
> fput() in specific cases.
> 
> When other subsystems need a similar capability, then let's
> consider a common helper.

fs/smb/server/ probably would benefit from the same helper today.

Thanks,
NeilBrown


> 
> 
> -- 
> Chuck Lever
>
  
NeilBrown Dec. 11, 2023, 10:23 p.m. UTC | #6
On Tue, 12 Dec 2023, Al Viro wrote:
> On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> 
> > Similarly would could wrap __fput_sync() is a more friendly name, but
> > that would be better if we actually renamed the function.
> > 
> >   void fput_now(struct file *f)
> >   {
> >       __fput_sync(f);
> >   }
> 
> It is unfriendly *precisely* because it should not be used without
> a very good reason.
> 
> It may be the last opened file keeping a lazy-umounted mount alive.
> It may be taking pretty much any locks, or eating a lot of stack
> space.

Previously you've suggested problems with ->release blocking.
Now you refer to lazy-umount, which is what the comment above
__fput_sync() mentions.

"pretty much an locks" seems like hyperbole.  I don't see it taking
nfsd_mutex or nlmsvc_mutex.  Maybe you mean any filesystem lock?

My understanding is that the advent of vmalloc allocated stacks means
that kernel stack space is not an important consideration.

It would really help if we could have a clear, documented explanation of
what problems can occur.  Maybe an example of contexts where it isn't
safe to call __fput_sync().

I can easily see that lazy-unmount is an interesting case which could
easily catch people unawares.  Punting the tail end of mntput_no_expire
(i.e.  if count reaches zero) to a workqueue/task_work makes sense and
would be much less impact than punting every __fput to a workqueue.

Would that make an fput_now() call safe to use in most contexts, or is
there something about ->release or dentry_kill() that can still cause
problems?

Thanks,
NeilBrown


> 
> It really isn't a general-purpose API; any "more friendly name"
> is going to be NAKed for that reason alone.
> 
> Al, very much tempted to send a patch renaming that sucker to
> __fput_dont_use_that_unless_you_really_know_what_you_are_doing().
>
  
Al Viro Dec. 11, 2023, 11:13 p.m. UTC | #7
On Tue, Dec 12, 2023 at 09:23:51AM +1100, NeilBrown wrote:

> Previously you've suggested problems with ->release blocking.
> Now you refer to lazy-umount, which is what the comment above
> __fput_sync() mentions.

Yes?  What I'm saying is that the set of locks involved is
too large for any sane analysis.  And lest you discard ->release(),
that brings ->i_rwsem, and thus anything that might be grabbed
under that.  Someone's ->mmap_lock, for example.

> "pretty much an locks" seems like hyperbole.  I don't see it taking
> nfsd_mutex or nlmsvc_mutex.

I don't know - and I can't tell without serious search.  What I can
tell is that before making fput() delayed we used to find deadlocks
on a regular basis; that was a massive source of headaches.

> Maybe you mean any filesystem lock?

Don't forget VM.  And drivers.  And there was quite a bit of fun
happening in net/unix, etc.  Sure, in case of nfsd the last two
_probably_ won't occur - not directly, anyway.

But making it a general nuisan^Wfacility is asking for trouble.

> My understanding is that the advent of vmalloc allocated stacks means
> that kernel stack space is not an important consideration.
> 
> It would really help if we could have clear documented explanation of
> what problems can occur.  Maybe an example of contexts where it isn't
> safe to call __fput_sync().
> 
> I can easily see that lazy-unmount is an interesting case which could
> easily catch people unawares.  Punting the tail end of mntput_no_expire
> (i.e.  if count reaches zero) to a workqueue/task_work makes sense and
> would be much less impact than punting every __fput to a workqueue.
> 
> Would that make an fput_now() call safe to use in most contexts, or is
> there something about ->release or dentry_kill() that can still cause
> problems?

dentry_kill() means ->d_release(), ->d_iput() and anything final iput()
could do.  Including e.g. anything that might be done by afs_silly_iput(),
with its "send REMOVE to server, wait for completion".  No, that's not
a deadlock per se, but it can stall you a bit more than you would
probably consider tolerable...  Sure, you could argue that AFS ought to
make that thing asynchronous, but...

Anyway, it won't be "safe to use in most contexts".  ->mmap_lock alone
is enough for that, and that's just the one I remember to have given
us a lot of headache.  And that's without bringing the "nfsd won't
touch those files" cases - make it generally accessible and you get
to audit all locks that might be taken when we close a socket, etc.
  
Al Viro Dec. 11, 2023, 11:21 p.m. UTC | #8
On Mon, Dec 11, 2023 at 11:13:30PM +0000, Al Viro wrote:

> dentry_kill() means ->d_release(), ->d_iput() and anything final iput()
> could do.  Including e.g. anything that might be done by afs_silly_iput(),
> with its "send REMOVE to server, wait for completion".  No, that's not
> a deadlock per se, but it can stall you a bit more than you would
> probably consider tolerable...  Sure, you could argue that AFS ought to
> make that thing asynchronous, but...
> 
> Anyway, it won't be "safe to use in most contexts".  ->mmap_lock alone
> is enough for that, and that's just the one I remember to have given
> us a lot of headache.  And that's without bringing the "nfsd won't
> touch those files" cases - make it generally accessible and you get
> to audit all locks that might be taken when we close a socket, etc.

PS: put it that way - I can buy "nfsd is doing that only to regular
files and not on an arbitrary filesystem, at that; having the thread
wait on that sucker is not going to cause too much trouble"; I do *not*
buy turning it into a thing usable outside of a very narrow set of
circumstances.
  
Chuck Lever Dec. 12, 2023, 4:17 p.m. UTC | #9
On Tue, Dec 12, 2023 at 09:04:58AM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Chuck Lever wrote:
> > On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> > > On Sat, 09 Dec 2023, Chuck Lever wrote:
> > > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > > > > Calling fput() directly or though filp_close() from a kernel thread like
> > > > > nfsd causes the final __fput() (if necessary) to be called from a
> > > > > workqueue.  This means that nfsd is not forced to wait for any work to
> > > > > complete.  If the ->release of ->destroy_inode function is slow for any
> > > > > reason, this can result in nfsd closing files more quickly than the
> > > > > workqueue can complete the close and the queue of pending closes can
> > > > > grow without bounces (30 million has been seen at one customer site,
> > > > > though this was in part due to a slowness in xfs which has since been
> > > > > fixed).
> > > > > 
> > > > > nfsd does not need this.
> > > > 
> > > > That is technically true, but IIUC, there is only one case where a
> > > > synchronous close matters for the backlog problem, and that's when
> > > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> > > > call sites (except rename) are error paths, so there aren't negative
> > > > consequences for the lack of synchronous wait there...
> > > 
> > > What you say is technically true but it isn't the way I see it.
> > > 
> > > Firstly I should clarify that __fput_sync() is *not* a flushing close as
> > > you describe it below.
> > > All it does, apart for some trivial book-keeping, is to call ->release
> > > and possibly ->destroy_inode immediately rather than shunting them off
> > > to another thread.
> > > Apparently ->release sometimes does something that can deadlock with
> > > some kernel threads or if some awkward locks are held, so the whole
> > > final __fput is delay by default.  But this does not apply to nfsd.
> > > Standard fput() is really the wrong interface for nfsd to use.  
> > > It should use __fput_sync() (which shouldn't have such a scary name).
> > > 
> > > The comment above flush_delayed_fput() seems to suggest that unmounting
> > > is a core issue.  Maybe the fact that __fput() can call
> > > dissolve_on_fput() is a reason why it is sometimes safer to leave the
> > > work to later.  But I don't see that applying to nfsd.
> > > 
> > > Of course a ->release function *could* do synchronous writes just like
> > > the XFS ->destroy_inode function used to do synchronous reads.
> > 
> > I had assumed ->release for NFS re-export would flush due to close-
> > to-open semantics. There seem to be numerous corner cases that
> > might result in pile-ups which would change the situation in your
> > problem statement but might not result in an overall improvement.
> 
> That's the ->flush call in filp_close().
> 
> > > I don't think we should ever try to hide that by putting it in
> > > a workqueue.  It's probably a bug and it is best if bugs are visible.
> > 
> > 
> > I'm not objecting, per se, to this change. I would simply like to
> > see a little more due diligence before moving forward until it is
> > clear how frequently ->release or ->destroy_inode will do I/O (or
> > "is slow for any reason" as you say above).
> > 
> > 
> > > Note that the XFS ->release function does call filemap_flush() in some
> > > cases, but that is an async flush, so __fput_sync doesn't wait for the
> > > flush to complete.
> > 
> > When Jeff was working on the file cache a year ago, I did some
> > performance analysis that shows even an async flush is costly when
> > there is a lot of dirty data in the file being closed. The VFS walks
> > through the whole file and starts I/O on every dirty page. This is
> > quite CPU intensive, and can take on the order of a millisecond
> > before the async flush request returns to its caller.
> > 
> > IME async flushes are not free.
> 
> True, they aren't free.  But some thread has to pay that price.
> I think nfsd should.

An async flush can be as expensive as a synchronous flush in some
cases. I'm not convinced that simply because a flush happens to be
asynchronous, that makes it harmless to do at scale in the
foreground.

Our original desire way back when was to insert tactical async
flushes during UNSTABLE WRITEs to get the server writing dirty data
to storage sooner. It had an immediate negative throughput and
latency impact.

I have no philosophical disagreement about nfsd threads doing more
work during a file close. I'm just saying that moving even an async
flush from a worker to an nfsd thread might be visible to a client
workload.


> You might argue that nfsd should wait to pay the price until after it
> has sent a reply to the client.  My patches already effectively do that
> for garbage-collected files.  Doing it for all files would probably be
> easy. But is it really worth the (small) complexity?  I don't know.
>
> > > The way I see this patch is that fput() is the wrong interface for nfsd
> > > to use, __fput_sync is the right interface.  So we should change.  1
> > > patch.
> > 
> > The practical matter is I see this as a change with a greater than
> > zero risk, and we need to mitigate that risk. Or rather, as a
> > maintainer of NFSD, /I/ need to see that the risk is as minimal as
> > is practical.
> > 
> > 
> > > The details about exhausting memory explain a particular symptom that
> > > motivated the examination which revealed that nfsd was using the wrong
> > > interface.
> > > 
> > > If we have nfsd sometimes using fput() and sometimes __fput_sync, then
> > > we need to have clear rules for when to use which.  It is much easier to
> > > have a simple rule: always use __fput_sync().
> > 
> > I don't agree that we should just flop all these over and hope for
> > the best. In particular:
> > 
> >  - the changes in fs/nfsd/filecache.c appear to revert a bug
> >    fix, so I need to see data that shows that change doesn't
> >    cause a re-regression
> 
> The bug fix you refer to is
>   "nfsd: don't fsync nfsd_files on last close"
> The patch doesn't change when fsync (or ->flush) is called, so
> it doesn't revert this bugfix.

If the async flush is being done directly by nfsd_file_free instead
of deferred to a work queue, that will slow down file closing, IMO,
in particular in the single-threaded GC case.

Again, not an objection, but we need to be aware of the impact of
this change, and if it is negative, try to mitigate it. If that
mitigation takes the form of patch 2/3, then maybe that patch needs
to be applied before this one.


> >  - the changes in fs/lockd/ can result in long waits while a
> >    global mutex is held (global as in all namespaces and all
> >    locked files on the server), so I need to see data that
> >    demonstrates there won't be a regression
> 
> It's probably impossible to provide any such data.
> The patch certainly moves work inside that mutex and so would increase
> the hold time, if only slightly.  Is that lock hot enough to notice?
> Conventional wisdom is that locking is only a tiny fraction of NFS
> traffic.  It might be possible to construct a workload that saturates
> lockd, but I doubt it would be relevant to the real world.

I have seen, in the past, customer workloads that are nothing but a
stream of NLM lock and unlock requests, and the business requirement
was that those need to go as fast as possible. Inserting I/O (even
occasional asynchronous I/O) would result in an unwanted regression
for that kind of workload.


> Maybe we should just break up that lock so that the problem becomes moot.

Let's drop this hunk for now until we (I) have a way to assess this
change properly. I'm just not convinced this hunk is going to be a
harmless change in some cases, and overall it doesn't appear to be
necessary to address the close back-pressure concern.

Replacing fput() here might be done after the file table data
structure becomes more parallel. Maybe an rhashtable would be
suitable.


> >  - the other changes don't appear to have motivation in terms
> >    of performance or behavior, and carry similar (if lesser)
> >    risks as the other two changes. My preferred solution to
> >    potential auditor confusion about the use of __fput_sync()
> >    in some places and fput() in others is to document, and
> >    leave call sites alone if there's no technical reason to
> >    change them at this time.
> 
> Sounds to me like a good way to grow technical debt, but I'll do it like
> that if you prefer.

Yes, I prefer a helper with an explanation of why nfsd uses
__fput_sync() for certain cases. I'm going to take the admonition
in __fput_sync()'s kdoc comment seriously.
  
NeilBrown Dec. 13, 2023, 12:28 a.m. UTC | #10
On Tue, 12 Dec 2023, Al Viro wrote:
> On Mon, Dec 11, 2023 at 11:13:30PM +0000, Al Viro wrote:
> 
> > dentry_kill() means ->d_release(), ->d_iput() and anything final iput()
> > could do.  Including e.g. anything that might be done by afs_silly_iput(),
> > with its "send REMOVE to server, wait for completion".  No, that's not
> > a deadlock per se, but it can stall you a bit more than you would
> > probably consider tolerable...  Sure, you could argue that AFS ought to
> > make that thing asynchronous, but...
> > 
> > Anyway, it won't be "safe to use in most contexts".  ->mmap_lock alone
> > is enough for that, and that's just the one I remember to have given
> > us a lot of headache.  And that's without bringing the "nfsd won't
> > touch those files" cases - make it generally accessible and you get
> > to audit all locks that might be taken when we close a socket, etc.
> 
> PS: put it that way - I can buy "nfsd is doing that only to regular
> files and not on an arbitrary filesystem, at that; having the thread
> wait on that sucker is not going to cause too much trouble"; I do *not*
> buy turning it into a thing usable outside of a very narrow set of
> circumstances.
> 

Can you say more about "not on an arbitrary filesystem" ?
I guess you mean that procfs and/or sysfs might be problematic, as may
similar virtual filesystems (nfsd maybe).

Could we encode some of this in the comment for __fput_sync ??

/**
 * __fput_sync : drop reference to a file synchronously
 * @f: file to drop
 *
 * Drop a reference on a file and do most cleanup work before returning.
 *
 * Due to the wide use of files in the design of Linux, dropping the
 * final reference to a file can result in dropping the final reference
 * to any of a large variety of other objects.  Dropping those final
 * references can result in nearly arbitrary work.  It should be assumed
 * that, unless prior checks or actions confirm otherwise, calling
 * __fput_sync() might:
 * - allocate memory
 * - perform synchronous IO
 * - wait for a remote service (for networked filesystems)
 * - take ->i_rwsem and other related VFS and filesystem locks
 * - take ->s_umount (if file is on a MNT_INTERNAL filesystem)
 * - take locks in a device driver if the file is CHR, BLK or SOCK
 *
 * If the caller cannot be confident that none of these will cause a
 * problem, it should use fput() instead.
 *
 * Note that the final unmount of a lazy-unmounted non-MNT_INTERNAL
 * filesystem will always be handled asynchronously.  Individual drivers
 * might also leave some clean up to asynchronous threads.
 */

Thanks,
NeilBrown
  
David Laight Dec. 15, 2023, 6:27 p.m. UTC | #11
...
> > PS: put it that way - I can buy "nfsd is doing that only to regular
> > files and not on an arbitrary filesystem, at that; having the thread
> > wait on that sucker is not going to cause too much trouble"; I do *not*
> > buy turning it into a thing usable outside of a very narrow set of
> > circumstances.
> >
> 
> Can you say more about "not on an arbitrary filesystem" ?
> I guess you means that procfs and/or sysfs might be problematic as may
> similar virtual filesystems (nfsd maybe).

Can nfs export an ext4 fs that is on a loopback mount on a file
that is remotely nfs (or other) mounted?

As soon as you get loops like that you might find that fput() starts
being problematic.

I'm also sure I remember that nfs wasn't supposed to respond to a write
until it had issued the actual disk write - but maybe no one does that
any more because it really is too slow.
(Especially if the 'disk' is a USB stick.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
  
David Laight Dec. 15, 2023, 6:28 p.m. UTC | #12
..
> My understanding is that the advent of vmalloc allocated stacks means
> that kernel stack space is not an important consideration.

They are still the same (small) size - just allocated differently.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
  
Chuck Lever Dec. 15, 2023, 7:35 p.m. UTC | #13
> On Dec 15, 2023, at 1:27 PM, David Laight <David.Laight@ACULAB.COM> wrote:
> 
> ...
> 
> I'm also sure I remember that nfs wasn't supposed to respond to a write
> until it had issued the actual disk write - but maybe no one do that
> any more because it really is too slow.
> (Especially if the 'disk' is a USB stick.)

That rule applies only to NFSv2, which no-one should use any more.

NFSv3 and later use a two-phase write. Clients send a COMMIT after
an UNSTABLE WRITE to get the persistence guarantee.

--
Chuck Lever
  
NeilBrown Dec. 15, 2023, 10:36 p.m. UTC | #14
On Sat, 16 Dec 2023, David Laight wrote:
> ...
> > > PS: put it that way - I can buy "nfsd is doing that only to regular
> > > files and not on an arbitrary filesystem, at that; having the thread
> > > wait on that sucker is not going to cause too much trouble"; I do *not*
> > > buy turning it into a thing usable outside of a very narrow set of
> > > circumstances.
> > >
> > 
> > Can you say more about "not on an arbitrary filesystem" ?
> > I guess you means that procfs and/or sysfs might be problematic as may
> > similar virtual filesystems (nfsd maybe).
> 
> Can nfs export an ext4 fs that is on a loopback mount on a file
> that is remotely nfs (or other) mounted?

Sure.  There is no reason this would cause a problem.
If the nfs mount were also a loopback mount, that might be interesting.
i.e.  You have a local filesystem, containing a file with a filesystem
image.
You nfs-export that local filesystem, and nfs mount it on the same host.
Then you loop-mount that file in the nfs-mounted filesystem.  Now you
are getting into uncharted waters.  I've tested loop-back NFS mounting
and can assure you that it works.  I haven't tried the double-loop.
But if that caused a problem, I doubt it would be fput.  It would be
fsync or writeback which causes the problem.

> 
> As soon as you get loops like that you might find that fput() starts
> being problematic.

When calling fput on a regular file there are, I think, only two problem
areas.  One is that the fput might lead to a lazy-filesystem unmount
completing.  That only applies to MNT_INTERNAL filesystems, and they are
unlikely to be exported (probably it's impossible, but I haven't
checked).
The other is synchronous (or even async) IO in the filesystem code,
maybe completing an unlink or a truncate.  This is no worse than any
other synchronous IO that nfsd does.

Thanks,
NeilBrown


> 
> I'm also sure I remember that nfs wasn't supposed to respond to a write
> until it had issued the actual disk write - but maybe no one do that
> any more because it really is too slow.
> (Especially if the 'disk' is a USB stick.)
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
  
Dave Chinner Dec. 16, 2023, 1:50 a.m. UTC | #15
On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> On Sat, 09 Dec 2023, Chuck Lever wrote:
> > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > > Calling fput() directly or though filp_close() from a kernel thread like
> > > nfsd causes the final __fput() (if necessary) to be called from a
> > > workqueue.  This means that nfsd is not forced to wait for any work to
> > > complete.  If the ->release of ->destroy_inode function is slow for any
> > > reason, this can result in nfsd closing files more quickly than the
> > > workqueue can complete the close and the queue of pending closes can
> > > grow without bounces (30 million has been seen at one customer site,
> > > though this was in part due to a slowness in xfs which has since been
> > > fixed).
> > > 
> > > nfsd does not need this.
> > 
> > That is technically true, but IIUC, there is only one case where a
> > synchronous close matters for the backlog problem, and that's when
> > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> > call sites (except rename) are error paths, so there aren't negative
> > consequences for the lack of synchronous wait there...
> 
> What you say is technically true but it isn't the way I see it.
> 
> Firstly I should clarify that __fput_sync() is *not* a flushing close as
> you describe it below.
> All it does, apart for some trivial book-keeping, is to call ->release
> and possibly ->destroy_inode immediately rather than shunting them off
> to another thread.
> Apparently ->release sometimes does something that can deadlock with
> some kernel threads or if some awkward locks are held, so the whole
> final __fput is delay by default.  But this does not apply to nfsd.
> Standard fput() is really the wrong interface for nfsd to use.  
> It should use __fput_sync() (which shouldn't have such a scary name).
> 
> The comment above flush_delayed_fput() seems to suggest that unmounting
> is a core issue.  Maybe the fact that __fput() can call
> dissolve_on_fput() is a reason why it is sometimes safer to leave the
> work to later.  But I don't see that applying to nfsd.
> 
> Of course a ->release function *could* do synchronous writes just like
> the XFS ->destroy_inode function used to do synchronous reads.

What do you mean "could"? The correct word is "does".

> I don't think we should ever try to hide that by putting it in
> a workqueue.  It's probably a bug and it is best if bugs are visible.

Most definitely *not* a bug.

XFS, ext4 and btrfs all call filemap_flush() from their respective
->release methods. This is required to protect user data against
loss caused by poorly written applications that overwrite user data
in an unsafe manner (i.e. the open-truncate-write-close overwrite
anti-pattern).

The btrfs flush trigger is very similar to XFS:

	/*
         * Set by setattr when we are about to truncate a file from a non-zero
         * size to a zero size.  This tries to flush down new bytes that may
         * have been written if the application were using truncate to replace
         * a file in place.
         */
        if (test_and_clear_bit(BTRFS_INODE_FLUSH_ON_CLOSE,
                               &BTRFS_I(inode)->runtime_flags))
                        filemap_flush(inode->i_mapping);

XFS started doing this in 2006, ext4 in 2008, and git will tell you
when btrfs picked this up, too. IOWs, we've been doing writeback
from ->release for a *very long time*.

> Note that the XFS ->release function does call filemap_flush() in some
> cases, but that is an async flush, so __fput_sync doesn't wait for the
> flush to complete.

"async flush" does not mean it will not block for long periods of
time, it just means it won't wait for *all* the IO to complete.
i.e. if the async flush saturates the device, bio submission will
wait for the IO that the flush has already submitted to complete
before it can continue flushing the data.

But wait, it gets better!

XFS, btrfs and ext4 all implement delayed allocation, which means
writeback often needs to run extent allocation transactions. In
these cases, transaction reservation can block on metadata writeback
to free up journal space. In the case of XFS, this could be tens of
thousands of metadata IOs needing to be submitted and completed!

Then consider that extent allocation needs to search for free space
which may need to read in metadata. i.e. extent allocation will end
up submitting and waiting on synchronous read IO. Also, reading that
metadata requires memory allocation for the buffers that will store
it - memory allocation can also block on IO and other subsystems to
free up memory.

Even less obvious are the stack usage issues that calling ->release from
arbitrary code entails. The filesystem writeback stack is -deep-.

Remember all the problems we used to have with ->writepage() being
called from direct memory reclaim and so putting the writeback path
at arbitrary depths in the stack and then running out of stack
space?  We really don't want to go back to the bad old days where
filesystem write paths can be entered from code that has already
consumed most of the stack space....

Hence, IMO, __fput_sync() is something that needs to be very
carefully controlled and should have big scary warnings on it. We
really don't want it to be called from just anywhere...

Cheers,

Dave.
  

Patch

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index ef063f93fde9..e9734c7451b5 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -283,7 +283,9 @@  nfsd_file_free(struct nfsd_file *nf)
 		nfsd_file_mark_put(nf->nf_mark);
 	if (nf->nf_file) {
 		nfsd_file_check_write_error(nf);
+		get_file(nf->nf_file);
 		filp_close(nf->nf_file, NULL);
+		__fput_sync(nf->nf_file);
 	}
 
 	/*
@@ -631,7 +633,6 @@  nfsd_file_close_inode_sync(struct inode *inode)
 		list_del_init(&nf->nf_lru);
 		nfsd_file_free(nf);
 	}
-	flush_delayed_fput();
 }
 
 /**
diff --git a/fs/nfsd/lockd.c b/fs/nfsd/lockd.c
index 46a7f9b813e5..f9d1059096a4 100644
--- a/fs/nfsd/lockd.c
+++ b/fs/nfsd/lockd.c
@@ -60,7 +60,7 @@  nlm_fopen(struct svc_rqst *rqstp, struct nfs_fh *f, struct file **filp,
 static void
 nlm_fclose(struct file *filp)
 {
-	fput(filp);
+	__fput_sync(filp);
 }
 
 static const struct nlmsvc_binding nfsd_nlm_ops = {
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 6f2d4aa4970d..20d60823d530 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -629,7 +629,7 @@  nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		nn->somebody_reclaimed = true;
 out:
 	if (open->op_filp) {
-		fput(open->op_filp);
+		__fput_sync(open->op_filp);
 		open->op_filp = NULL;
 	}
 	if (resfh && resfh != &cstate->current_fh) {
@@ -1546,7 +1546,7 @@  nfsd4_cleanup_inter_ssc(struct nfsd4_ssc_umount_item *nsui, struct file *filp,
 	long timeout = msecs_to_jiffies(nfsd4_ssc_umount_timeout);
 
 	nfs42_ssc_close(filp);
-	fput(filp);
+	__fput_sync(filp);
 
 	spin_lock(&nn->nfsd_ssc_lock);
 	list_del(&nsui->nsui_list);
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 3509e73abe1f..f8f0112fd9f5 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -561,7 +561,7 @@  nfsd4_shutdown_recdir(struct net *net)
 
 	if (!nn->rec_file)
 		return;
-	fput(nn->rec_file);
+	__fput_sync(nn->rec_file);
 	nn->rec_file = NULL;
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index fbbea7498f02..15a811229211 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -879,7 +879,7 @@  __nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
 
 	host_err = ima_file_check(file, may_flags);
 	if (host_err) {
-		fput(file);
+		__fput_sync(file);
 		goto out;
 	}
 
@@ -1884,10 +1884,10 @@  nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	fh_drop_write(ffhp);
 
 	/*
-	 * If the target dentry has cached open files, then we need to try to
-	 * close them prior to doing the rename. Flushing delayed fput
-	 * shouldn't be done with locks held however, so we delay it until this
-	 * point and then reattempt the whole shebang.
+	 * If the target dentry has cached open files, then we need to
+	 * try to close them prior to doing the rename.  Final fput
+	 * shouldn't be done with locks held however, so we delay it
+	 * until this point and then reattempt the whole shebang.
 	 */
 	if (close_cached) {
 		close_cached = false;
@@ -2141,7 +2141,7 @@  nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
 	if (err == nfserr_eof || err == nfserr_toosmall)
 		err = nfs_ok; /* can still be found in ->err */
 out_close:
-	fput(file);
+	__fput_sync(file);
 out:
 	return err;
 }