[2/2] nfsd: Don't leave work of closing files to a work queue.

Message ID 20231204014042.6754-3-neilb@suse.de
State New
Headers
Series Move all file-close work for nfsd into nfsd threads |

Commit Message

NeilBrown Dec. 4, 2023, 1:36 a.m. UTC
  The work of closing a file can have non-trivial cost.  Doing it in a
separate work queue thread means that cost isn't imposed on the nfsd
threads and an imbalance can be created.

I have evidence from a customer site when nfsd is being asked to modify
many millions of files which causes sufficient memory pressure that some
cache (in XFS I think) gets cleaned earlier than would be ideal.  When
__dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
needs to synchronously read back previously cached info from storage.
This slows down the single thread that is making all the final __dput()
calls for all the nfsd threads with the net result that files are added
to the delayed_fput_list faster than they are removed, and the system
eventually runs out of memory.

To avoid this work imbalance that exhausts memory, this patch moves all
work for closing files into the nfsd threads.  This means that when the
work imposes a cost, that cost appears where it would be expected - in
the work of the nfsd thread.

There are two changes to achieve this.

1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
   final __dput() for any file closed by a given nfsd thread is handled
   by that thread.  This ensures that the number of files that are
   queued for a final close is limited by the number of threads and
   cannot grow without bound.

2/ Files opened for NFSv3 are never explicitly closed by the client and are
  kept open by the server in the "filecache", which responds to memory
  pressure, is garbage collected even when there is no pressure, and
  sometimes closes files when there is particular need such as for
  rename.  These files currently have filp_close() called in a dedicated
  work queue, so their __dput() can have no effect on nfsd threads.

  This patch discards the work queue and instead has each nfsd thread
  call flip_close() on as many as 8 files from the filecache each time
  it acts on a client request (or finds there are no pending client
  requests).  If there are more to be closed, more threads are woken.
  This spreads the work of __dput() over multiple threads and imposes
  any cost on those threads.

  The number 8 is somewhat arbitrary.  It needs to be greater than 1 to
  ensure that files are closed more quickly than they can be added to
  the cache.  It needs to be small enough to limit the per-request
  delays that will be imposed on clients when all threads are busy
  closing files.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
 fs/nfsd/filecache.h |  1 +
 fs/nfsd/nfssvc.c    |  6 +++++
 3 files changed, 32 insertions(+), 37 deletions(-)
  

Comments

Chuck Lever Dec. 4, 2023, 4:58 p.m. UTC | #1
On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> The work of closing a file can have non-trivial cost.  Doing it in a
> separate work queue thread means that cost isn't imposed on the nfsd
> threads and an imbalance can be created.
> 
> I have evidence from a customer site when nfsd is being asked to modify
> many millions of files which causes sufficient memory pressure that some
> cache (in XFS I think) gets cleaned earlier than would be ideal.  When
> __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> needs to synchronously read back previously cached info from storage.
> This slows down the single thread that is making all the final __dput()
> calls for all the nfsd threads with the net result that files are added
> to the delayed_fput_list faster than they are removed, and the system
> eventually runs out of memory.
> 
> To avoid this work imbalance that exhausts memory, this patch moves all
> work for closing files into the nfsd threads.  This means that when the
> work imposes a cost, that cost appears where it would be expected - in
> the work of the nfsd thread.

Thanks for pursuing this next step in the evolution of the NFSD
file cache.

Your problem statement should mention whether you have observed the
issue with an NFSv3 or an NFSv4 workload or if you see this issue
with both, since those two use cases are handled very differently
within the file cache implementation.


> There are two changes to achieve this.
> 
> 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
>    final __dput() for any file closed by a given nfsd thread is handled
>    by that thread.  This ensures that the number of files that are
>    queued for a final close is limited by the number of threads and
>    cannot grow without bound.
> 
> 2/ Files opened for NFSv3 are never explicitly closed by the client and are
>   kept open by the server in the "filecache", which responds to memory
>   pressure, is garbage collected even when there is no pressure, and
>   sometimes closes files when there is particular need such as for
>   rename.

There is a good reason for close-on-rename: IIRC we want to avoid
triggering a silly-rename on NFS re-exports.

Also, I think we do want to close cached garbage-collected files
quickly, even without memory pressure. Files left open in this way
can conflict with subsequent NFSv4 OPENs that might hand out a
delegation as long as no other clients are using them. Files held
open by the file cache will interfere with that.


>   These files currently have filp_close() called in a dedicated
>   work queue, so their __dput() can have no effect on nfsd threads.
> 
>   This patch discards the work queue and instead has each nfsd thread
>   call flip_close() on as many as 8 files from the filecache each time
>   it acts on a client request (or finds there are no pending client
>   requests).  If there are more to be closed, more threads are woken.
>   This spreads the work of __dput() over multiple threads and imposes
>   any cost on those threads.
> 
>   The number 8 is somewhat arbitrary.  It needs to be greater than 1 to
>   ensure that files are closed more quickly than they can be added to
>   the cache.  It needs to be small enough to limit the per-request
>   delays that will be imposed on clients when all threads are busy
>   closing files.

IMO we want to explicitly separate the mechanisms of handling
garbage-collected files and non-garbage-collected files.

In the non-garbage-collected (NFSv4) case, the kthread can wait
for everything it has opened to be closed. task_work seems
appropriate for that IIUC.

The problem with handling a limited number of garbage-collected
items is that once the RPC workload stops, any remaining open
files will remain open because garbage collection has effectively
stopped. We really need those files closed out within a couple of
seconds.

We used to have watermarks in the nfsd_file_put() path to kick
garbage-collection if there were too many open files. Instead,
waiting for the GC thread to make progress before recycling the
kthread might be beneficial.

And, as we discussed in a previous thread, replacing the per-
namespace worker with a parallel mechanism would help GC proceed
more quickly to reduce the flush/close backlog for NFSv3.


> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
>  fs/nfsd/filecache.h |  1 +
>  fs/nfsd/nfssvc.c    |  6 +++++
>  3 files changed, 32 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index ee9c923192e0..55268b7362d4 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -39,6 +39,7 @@
>  #include <linux/fsnotify.h>
>  #include <linux/seq_file.h>
>  #include <linux/rhashtable.h>
> +#include <linux/task_work.h>
>  
>  #include "vfs.h"
>  #include "nfsd.h"
> @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
>  static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
>  
>  struct nfsd_fcache_disposal {
> -	struct work_struct work;
>  	spinlock_t lock;
>  	struct list_head freeme;
>  };
>  
> -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> -
>  static struct kmem_cache		*nfsd_file_slab;
>  static struct kmem_cache		*nfsd_file_mark_slab;
>  static struct list_lru			nfsd_file_lru;
> @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
>  		spin_lock(&l->lock);
>  		list_move_tail(&nf->nf_lru, &l->freeme);
>  		spin_unlock(&l->lock);
> -		queue_work(nfsd_filecache_wq, &l->work);
> +		svc_wake_up(nn->nfsd_serv);
>  	}
>  }
>  
> +/**
> + * nfsd_file_dispose_some

This needs a short description and:

 * @nn: namespace to check

Or something more enlightening than that.

Also, the function name exposes mechanism; I think I'd prefer a name
that is more abstract, such as nfsd_file_net_release() ?


> + *
> + */
> +void nfsd_file_dispose_some(struct nfsd_net *nn)
> +{
> +	struct nfsd_fcache_disposal *l = nn->fcache_disposal;
> +	LIST_HEAD(dispose);
> +	int i;
> +
> +	if (list_empty(&l->freeme))
> +		return;
> +	spin_lock(&l->lock);
> +	for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
> +		list_move(l->freeme.next, &dispose);
> +	spin_unlock(&l->lock);
> +	if (!list_empty(&l->freeme))
> +		svc_wake_up(nn->nfsd_serv);
> +	nfsd_file_dispose_list(&dispose);
> +}
> +
>  /**
>   * nfsd_file_lru_cb - Examine an entry on the LRU list
>   * @item: LRU entry to examine
> @@ -635,28 +654,8 @@ nfsd_file_close_inode_sync(struct inode *inode)
>  		list_del_init(&nf->nf_lru);
>  		nfsd_file_free(nf);
>  	}
> -	flush_delayed_fput();
> -}
> -
> -/**
> - * nfsd_file_delayed_close - close unused nfsd_files
> - * @work: dummy
> - *
> - * Scrape the freeme list for this nfsd_net, and then dispose of them
> - * all.
> - */
> -static void
> -nfsd_file_delayed_close(struct work_struct *work)
> -{
> -	LIST_HEAD(head);
> -	struct nfsd_fcache_disposal *l = container_of(work,
> -			struct nfsd_fcache_disposal, work);
> -
> -	spin_lock(&l->lock);
> -	list_splice_init(&l->freeme, &head);
> -	spin_unlock(&l->lock);
> -
> -	nfsd_file_dispose_list(&head);
> +	/* Flush any delayed fput */
> +	task_work_run();
>  }
>  
>  static int
> @@ -721,10 +720,6 @@ nfsd_file_cache_init(void)
>  		return ret;
>  
>  	ret = -ENOMEM;
> -	nfsd_filecache_wq = alloc_workqueue("nfsd_filecache", 0, 0);
> -	if (!nfsd_filecache_wq)
> -		goto out;
> -
>  	nfsd_file_slab = kmem_cache_create("nfsd_file",
>  				sizeof(struct nfsd_file), 0, 0, NULL);
>  	if (!nfsd_file_slab) {
> @@ -739,7 +734,6 @@ nfsd_file_cache_init(void)
>  		goto out_err;
>  	}
>  
> -
>  	ret = list_lru_init(&nfsd_file_lru);
>  	if (ret) {
>  		pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret);
> @@ -782,8 +776,6 @@ nfsd_file_cache_init(void)
>  	nfsd_file_slab = NULL;
>  	kmem_cache_destroy(nfsd_file_mark_slab);
>  	nfsd_file_mark_slab = NULL;
> -	destroy_workqueue(nfsd_filecache_wq);
> -	nfsd_filecache_wq = NULL;
>  	rhltable_destroy(&nfsd_file_rhltable);
>  	goto out;
>  }
> @@ -829,7 +821,6 @@ nfsd_alloc_fcache_disposal(void)
>  	l = kmalloc(sizeof(*l), GFP_KERNEL);
>  	if (!l)
>  		return NULL;
> -	INIT_WORK(&l->work, nfsd_file_delayed_close);
>  	spin_lock_init(&l->lock);
>  	INIT_LIST_HEAD(&l->freeme);
>  	return l;
> @@ -838,7 +829,6 @@ nfsd_alloc_fcache_disposal(void)
>  static void
>  nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
>  {
> -	cancel_work_sync(&l->work);
>  	nfsd_file_dispose_list(&l->freeme);
>  	kfree(l);
>  }
> @@ -907,8 +897,6 @@ nfsd_file_cache_shutdown(void)
>  	fsnotify_wait_marks_destroyed();
>  	kmem_cache_destroy(nfsd_file_mark_slab);
>  	nfsd_file_mark_slab = NULL;
> -	destroy_workqueue(nfsd_filecache_wq);
> -	nfsd_filecache_wq = NULL;
>  	rhltable_destroy(&nfsd_file_rhltable);
>  
>  	for_each_possible_cpu(i) {
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index e54165a3224f..bc8c3363bbdf 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -56,6 +56,7 @@ void nfsd_file_cache_shutdown_net(struct net *net);
>  void nfsd_file_put(struct nfsd_file *nf);
>  struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
>  void nfsd_file_close_inode_sync(struct inode *inode);
> +void nfsd_file_dispose_some(struct nfsd_net *nn);
>  bool nfsd_file_is_cached(struct inode *inode);
>  __be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  		  unsigned int may_flags, struct nfsd_file **nfp);
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index c7af1095f6b5..02ea16636b54 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -13,6 +13,7 @@
>  #include <linux/fs_struct.h>
>  #include <linux/swap.h>
>  #include <linux/siphash.h>
> +#include <linux/task_work.h>
>  
>  #include <linux/sunrpc/stats.h>
>  #include <linux/sunrpc/svcsock.h>
> @@ -949,6 +950,7 @@ nfsd(void *vrqstp)
>  	}
>  
>  	current->fs->umask = 0;
> +	current->flags |= PF_RUNS_TASK_WORK;
>  
>  	atomic_inc(&nfsdstats.th_cnt);
>  
> @@ -963,6 +965,10 @@ nfsd(void *vrqstp)
>  
>  		svc_recv(rqstp);
>  		validate_process_creds();
> +
> +		nfsd_file_dispose_some(nn);
> +		if (task_work_pending(current))
> +			task_work_run();

I'd prefer that these task_work details reside inside
nfsd_file_dispose_some(), or whatever we want to call to call it ...


>  	}
>  
>  	atomic_dec(&nfsdstats.th_cnt);
> -- 
> 2.43.0
>
  
NeilBrown Dec. 4, 2023, 10:21 p.m. UTC | #2
On Tue, 05 Dec 2023, Chuck Lever wrote:
> On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> > The work of closing a file can have non-trivial cost.  Doing it in a
> > separate work queue thread means that cost isn't imposed on the nfsd
> > threads and an imbalance can be created.
> > 
> > I have evidence from a customer site when nfsd is being asked to modify
> > many millions of files which causes sufficient memory pressure that some
> > cache (in XFS I think) gets cleaned earlier than would be ideal.  When
> > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> > needs to synchronously read back previously cached info from storage.
> > This slows down the single thread that is making all the final __dput()
> > calls for all the nfsd threads with the net result that files are added
> > to the delayed_fput_list faster than they are removed, and the system
> > eventually runs out of memory.
> > 
> > To avoid this work imbalance that exhausts memory, this patch moves all
> > work for closing files into the nfsd threads.  This means that when the
> > work imposes a cost, that cost appears where it would be expected - in
> > the work of the nfsd thread.
> 
> Thanks for pursuing this next step in the evolution of the NFSD
> file cache.
> 
> Your problem statement should mention whether you have observed the
> issue with an NFSv3 or an NFSv4 workload or if you see this issue
> with both, since those two use cases are handled very differently
> within the file cache implementation.

I have added:

=============
The customer was using NFSv4.  I can demonstrate the same problem using
NFSv3 or NFSv4 (which close files in different ways) by adding
msleep(25) to for FMODE_WRITE files in __fput().  This simulates
slowness in the final close and when writing through nfsd it causes
/proc/sys/fs/file-nr to grow without bound.
==============

> 
> 
> > There are two changes to achieve this.
> > 
> > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
> >    final __dput() for any file closed by a given nfsd thread is handled
> >    by that thread.  This ensures that the number of files that are
> >    queued for a final close is limited by the number of threads and
> >    cannot grow without bound.
> > 
> > 2/ Files opened for NFSv3 are never explicitly closed by the client and are
> >   kept open by the server in the "filecache", which responds to memory
> >   pressure, is garbage collected even when there is no pressure, and
> >   sometimes closes files when there is particular need such as for
> >   rename.
> 
> There is a good reason for close-on-rename: IIRC we want to avoid
> triggering a silly-rename on NFS re-exports.
> 
> Also, I think we do want to close cached garbage-collected files
> quickly, even without memory pressure. Files left open in this way
> can conflict with subsequent NFSv4 OPENs that might hand out a
> delegation as long as no other clients are using them. Files held
> open by the file cache will interfere with that.

Yes - I agree all this behaviour is appropriate.  I was just setting out
the current behaviour of the filecache so that effect of the proposed
changes would be easier to understand.

> 
> 
> >   These files currently have filp_close() called in a dedicated
> >   work queue, so their __dput() can have no effect on nfsd threads.
> > 
> >   This patch discards the work queue and instead has each nfsd thread
> >   call flip_close() on as many as 8 files from the filecache each time
> >   it acts on a client request (or finds there are no pending client
> >   requests).  If there are more to be closed, more threads are woken.
> >   This spreads the work of __dput() over multiple threads and imposes
> >   any cost on those threads.
> > 
> >   The number 8 is somewhat arbitrary.  It needs to be greater than 1 to
> >   ensure that files are closed more quickly than they can be added to
> >   the cache.  It needs to be small enough to limit the per-request
> >   delays that will be imposed on clients when all threads are busy
> >   closing files.
> 
> IMO we want to explicitly separate the mechanisms of handling
> garbage-collected files and non-garbage-collected files.

I think we already have explicit separation.
garbage-collected files are handled to nfsd_file_display_list_delayed(),
either when they fall off the lru or through nfsd_file_close_inode() -
which is used by lease and fsnotify callbacks.

non-garbage collected files are closed directly by nfsd_file_put().

> 
> In the non-garbage-collected (NFSv4) case, the kthread can wait
> for everything it has opened to be closed. task_work seems
> appropriate for that IIUC.

Agreed.  The task_work change is all that we need for NFSv4.

> 
> The problem with handling a limited number of garbage-collected
> items is that once the RPC workload stops, any remaining open
> files will remain open because garbage collection has effectively
> stopped. We really need those files closed out within a couple of
> seconds.

Why would garbage collection stop?
nfsd_filecache_laundrette is still running on the system_wq.  It will
continue to garbage collect and queue files using
nfsd_file_display_list_delayed().
That will wake up an nfsd thread if none is running.  The thread will
close a few, but will first wake another thread if there was more than
it was willing to manage.  So the closing of files should proceed
promptly, and if any close operation takes a non-trivial amount of time,
more threads will be woken and work will proceed in parallel.

> 
> We used to have watermarks in the nfsd_file_put() path to kick
> garbage-collection if there were too many open files. Instead,
> waiting for the GC thread to make progress before recycling the
> kthread might be beneficial.

"too many" is only meaningful in the context of memory usage.  Having
the shrinker callback is exactly the right way to address this - nothing
else is needed.

The GC thread is expected to be CPU intensive.  The main cause of delay
is skipping over lots of files that cannot be closed yet - looking for
files that can.  This could delay the closing of files, but not nearly
as much as the delays I saw caused by synchronous IO.

We might be able to improve the situation a bit by queuing files as soon
as list_lru_walk finds them, rather than gathering them all into a list
and the queuing them one by one from that list.

It isn't clear to me that there is an issue here that needs fixing.

> 
> And, as we discussed in a previous thread, replacing the per-
> namespace worker with a parallel mechanism would help GC proceed
> more quickly to reduce the flush/close backlog for NFSv3.

This patch discards the per-namespace worker.

The GC step (searching the LRU list for "garbage") is still
single-threaded.  The filecache is shared by all net-namespaces and
there is a single GC thread for the filecache.

Files that are found *were* filp_close()ed by per-net-fs work-items.
With this patch the filp_close() is called by the nfsd threads.

The file __fput of those files *was* handled by a single system-wide
work-item.  With this patch they are called by the nfsd thread which
called the filp_close().

> 
> 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
> >  fs/nfsd/filecache.h |  1 +
> >  fs/nfsd/nfssvc.c    |  6 +++++
> >  3 files changed, 32 insertions(+), 37 deletions(-)
> > 
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index ee9c923192e0..55268b7362d4 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -39,6 +39,7 @@
> >  #include <linux/fsnotify.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/rhashtable.h>
> > +#include <linux/task_work.h>
> >  
> >  #include "vfs.h"
> >  #include "nfsd.h"
> > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> >  static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
> >  
> >  struct nfsd_fcache_disposal {
> > -	struct work_struct work;
> >  	spinlock_t lock;
> >  	struct list_head freeme;
> >  };
> >  
> > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> > -
> >  static struct kmem_cache		*nfsd_file_slab;
> >  static struct kmem_cache		*nfsd_file_mark_slab;
> >  static struct list_lru			nfsd_file_lru;
> > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> >  		spin_lock(&l->lock);
> >  		list_move_tail(&nf->nf_lru, &l->freeme);
> >  		spin_unlock(&l->lock);
> > -		queue_work(nfsd_filecache_wq, &l->work);
> > +		svc_wake_up(nn->nfsd_serv);
> >  	}
> >  }
> >  
> > +/**
> > + * nfsd_file_dispose_some
> 
> This needs a short description and:
> 
>  * @nn: namespace to check
> 
> Or something more enlightening than that.
> 
> Also, the function name exposes mechanism; I think I'd prefer a name
> that is more abstract, such as nfsd_file_net_release() ?

Sometimes exposing mechanism is a good thing.  It means the casual reader
can get a sense of what the function does without having to look at the
function.
So I still prefer my name, but I changed to nfsd_file_net_dispose() so
as suit your preference, but follow the established pattern of using the
word "dispose".  "release" usually just drops a reference.  "dispose"
makes it clear that the thing is going away now.

/**
 * nfsd_file_net_dispose - deal with nfsd_files wait to be disposed.
 * @nn: nfsd_net in which to find files to be disposed.
 *
 * When files held open for nfsv3 are removed from the filecache, whether
 * due to memory pressure or garbage collection, they are queued to
 * a per-net-ns queue.  This function completes the disposal, either
 * directly or by waking another nfsd thread to help with the work.
 */
 
> 
> > + *
> > + */
> > +void nfsd_file_dispose_some(struct nfsd_net *nn)
> > +{
> > +	struct nfsd_fcache_disposal *l = nn->fcache_disposal;
> > +	LIST_HEAD(dispose);
> > +	int i;
> > +
> > +	if (list_empty(&l->freeme))
> > +		return;
> > +	spin_lock(&l->lock);
> > +	for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
> > +		list_move(l->freeme.next, &dispose);
> > +	spin_unlock(&l->lock);
> > +	if (!list_empty(&l->freeme))
> > +		svc_wake_up(nn->nfsd_serv);
> > +	nfsd_file_dispose_list(&dispose);
..
> > @@ -949,6 +950,7 @@ nfsd(void *vrqstp)
> >  	}
> >  
> >  	current->fs->umask = 0;
> > +	current->flags |= PF_RUNS_TASK_WORK;
> >  
> >  	atomic_inc(&nfsdstats.th_cnt);
> >  
> > @@ -963,6 +965,10 @@ nfsd(void *vrqstp)
> >  
> >  		svc_recv(rqstp);
> >  		validate_process_creds();
> > +
> > +		nfsd_file_dispose_some(nn);
> > +		if (task_work_pending(current))
> > +			task_work_run();
> 
> I'd prefer that these task_work details reside inside
> nfsd_file_dispose_some(), or whatever we want to call to call it ...

I don't agree.  They are performing quite separate tasks.
nfsd_file_net_dispose() is disposing files queued for this net.
task_run_work() is finalising the close of any file closed by this
thread, including those used for NFSv4 that are not touched by
nfsd_file_dispose_some().
I don't think they belong in the same function.

Thanks,
NeilBrown
  
Chuck Lever Dec. 4, 2023, 11:48 p.m. UTC | #3
On Tue, Dec 05, 2023 at 09:21:08AM +1100, NeilBrown wrote:
> On Tue, 05 Dec 2023, Chuck Lever wrote:
> > On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> > > The work of closing a file can have non-trivial cost.  Doing it in a
> > > separate work queue thread means that cost isn't imposed on the nfsd
> > > threads and an imbalance can be created.
> > > 
> > > I have evidence from a customer site when nfsd is being asked to modify
> > > many millions of files which causes sufficient memory pressure that some
> > > cache (in XFS I think) gets cleaned earlier than would be ideal.  When
> > > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> > > needs to synchronously read back previously cached info from storage.
> > > This slows down the single thread that is making all the final __dput()
> > > calls for all the nfsd threads with the net result that files are added
> > > to the delayed_fput_list faster than they are removed, and the system
> > > eventually runs out of memory.
> > > 
> > > To avoid this work imbalance that exhausts memory, this patch moves all
> > > work for closing files into the nfsd threads.  This means that when the
> > > work imposes a cost, that cost appears where it would be expected - in
> > > the work of the nfsd thread.
> > 
> > Thanks for pursuing this next step in the evolution of the NFSD
> > file cache.
> > 
> > Your problem statement should mention whether you have observed the
> > issue with an NFSv3 or an NFSv4 workload or if you see this issue
> > with both, since those two use cases are handled very differently
> > within the file cache implementation.
> 
> I have added:
> 
> =============
> The customer was using NFSv4.  I can demonstrate the same problem using
> NFSv3 or NFSv4 (which close files in different ways) by adding
> msleep(25) to for FMODE_WRITE files in __fput().  This simulates
> slowness in the final close and when writing through nfsd it causes
> /proc/sys/fs/file-nr to grow without bound.
> ==============
> 
> > 
> > 
> > > There are two changes to achieve this.
> > > 
> > > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
> > >    final __dput() for any file closed by a given nfsd thread is handled
> > >    by that thread.  This ensures that the number of files that are
> > >    queued for a final close is limited by the number of threads and
> > >    cannot grow without bound.
> > > 
> > > 2/ Files opened for NFSv3 are never explicitly closed by the client and are
> > >   kept open by the server in the "filecache", which responds to memory
> > >   pressure, is garbage collected even when there is no pressure, and
> > >   sometimes closes files when there is particular need such as for
> > >   rename.
> > 
> > There is a good reason for close-on-rename: IIRC we want to avoid
> > triggering a silly-rename on NFS re-exports.
> > 
> > Also, I think we do want to close cached garbage-collected files
> > quickly, even without memory pressure. Files left open in this way
> > can conflict with subsequent NFSv4 OPENs that might hand out a
> > delegation as long as no other clients are using them. Files held
> > open by the file cache will interfere with that.
> 
> Yes - I agree all this behaviour is appropriate.  I was just setting out
> the current behaviour of the filecache so that effect of the proposed
> changes would be easier to understand.

Ok, I misread "when there is particular need" as "when there is no
particular need." My bad.


> > >   These files currently have filp_close() called in a dedicated
> > >   work queue, so their __dput() can have no effect on nfsd threads.
> > > 
> > >   This patch discards the work queue and instead has each nfsd thread
> > >   call flip_close() on as many as 8 files from the filecache each time
> > >   it acts on a client request (or finds there are no pending client
> > >   requests).  If there are more to be closed, more threads are woken.
> > >   This spreads the work of __dput() over multiple threads and imposes
> > >   any cost on those threads.
> > > 
> > >   The number 8 is somewhat arbitrary.  It needs to be greater than 1 to
> > >   ensure that files are closed more quickly than they can be added to
> > >   the cache.  It needs to be small enough to limit the per-request
> > >   delays that will be imposed on clients when all threads are busy
> > >   closing files.
> > 
> > IMO we want to explicitly separate the mechanisms of handling
> > garbage-collected files and non-garbage-collected files.
> 
> I think we already have explicit separation.
> garbage-collected files are handled to nfsd_file_display_list_delayed(),
> either when they fall off the lru or through nfsd_file_close_inode() -
> which is used by lease and fsnotify callbacks.
> 
> non-garbage collected files are closed directly by nfsd_file_put().

The separation is more clear to me now. Building this all into a
single patch kind of blurred the edges between the two.


> > In the non-garbage-collected (NFSv4) case, the kthread can wait
> > for everything it has opened to be closed. task_work seems
> > appropriate for that IIUC.
> 
> Agreed.  The task_work change is all that we need for NFSv4.
> 
> > The problem with handling a limited number of garbage-collected
> > items is that once the RPC workload stops, any remaining open
> > files will remain open because garbage collection has effectively
> > stopped. We really need those files closed out within a couple of
> > seconds.
> 
> Why would garbage collection stop?

Because with your patch GC now appears to be driven through
nfsd_file_dispose_some(). I see now that there is a hidden
recursion that wakes more nfsd threads if there's more GC to
be done. So file disposal is indeed not dependent on more
ingress RPC traffic.

The "If there are more to be closed" remark above in the patch
description was ambiguous to me, but I think I get it now.


> nfsd_filecache_laundrette is still running on the system_wq.  It will
> continue to garbage collect and queue files using
> nfsd_file_display_list_delayed().
> That will wake up an nfsd thread if none is running.  The thread will
> close a few, but will first wake another thread if there was more than
> it was willing to manage.  So the closing of files should proceed
> promptly, and if any close operation takes a non-trivial amount of time,
> more threads will be woken and work will proceed in parallel.

OK, that is what the svc_wake_up()s are doing.


> > And, as we discussed in a previous thread, replacing the per-
> > namespace worker with a parallel mechanism would help GC proceed
> > more quickly to reduce the flush/close backlog for NFSv3.
> 
> This patch discards the per-namespace worker.
> 
> The GC step (searching the LRU list for "garbage") is still
> single-threaded. The filecache is shared by all net-namespaces and
> there is a single GC thread for the filecache.

Agreed.


> Files that are found *were* filp_close()ed by per-net-fs work-items.
> With this patch the filp_close() is called by the nfsd threads.
> 
> The file __fput of those files *was* handled by a single system-wide
> work-item.  With this patch they are called by the nfsd thread which
> called the filp_close().

Fwiw, none of that is obvious to me when looking at the diff.


> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
> > >  fs/nfsd/filecache.h |  1 +
> > >  fs/nfsd/nfssvc.c    |  6 +++++
> > >  3 files changed, 32 insertions(+), 37 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index ee9c923192e0..55268b7362d4 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -39,6 +39,7 @@
> > >  #include <linux/fsnotify.h>
> > >  #include <linux/seq_file.h>
> > >  #include <linux/rhashtable.h>
> > > +#include <linux/task_work.h>
> > >  
> > >  #include "vfs.h"
> > >  #include "nfsd.h"
> > > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> > >  static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
> > >  
> > >  struct nfsd_fcache_disposal {
> > > -	struct work_struct work;
> > >  	spinlock_t lock;
> > >  	struct list_head freeme;
> > >  };
> > >  
> > > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> > > -
> > >  static struct kmem_cache		*nfsd_file_slab;
> > >  static struct kmem_cache		*nfsd_file_mark_slab;
> > >  static struct list_lru			nfsd_file_lru;
> > > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> > >  		spin_lock(&l->lock);
> > >  		list_move_tail(&nf->nf_lru, &l->freeme);
> > >  		spin_unlock(&l->lock);
> > > -		queue_work(nfsd_filecache_wq, &l->work);
> > > +		svc_wake_up(nn->nfsd_serv);
> > >  	}
> > >  }
> > >  
> > > +/**
> > > + * nfsd_file_dispose_some
> > 
> > This needs a short description and:
> > 
> >  * @nn: namespace to check
> > 
> > Or something more enlightening than that.
> > 
> > Also, the function name exposes mechanism; I think I'd prefer a name
> > that is more abstract, such as nfsd_file_net_release() ?
> 
> Sometimes exposing mechanism is a good thing.  It means the casual reader
> can get a sense of what the function does without having to look at the
> function.
> So I still prefer my name, but I changed to nfsd_file_net_dispose() so
> as suit your preference, but follow the established pattern of using the
> word "dispose".  "release" usually just drops a reference.  "dispose"
> makes it clear that the thing is going away now.
> 
> /**
>  * nfsd_file_net_dispose - deal with nfsd_files wait to be disposed.
>  * @nn: nfsd_net in which to find files to be disposed.
>  *
>  * When files held open for nfsv3 are removed from the filecache, whether

This comment is helpful. But note that we quite purposely do not
refer to NFS versions in filecache.c -- it's either garbage-
collected or not garbage-collected files. IIRC on occasion NFSv3
wants to use a non-garbage-collected file, and NFSv4 might sometimes
use a GC-d file. I've forgotten the details.


>  * due to memory pressure or garbage collection, they are queued to
>  * a per-net-ns queue.  This function completes the disposal, either
>  * directly or by waking another nfsd thread to help with the work.
>  */

I understand why you want to keep this name: this function handles
only garbage-collected files.

I would still like nfsd() to call a wrapper function to handle
the details of closing both types of files rather than open-coding
calling nfsd_file_net_dispose() and task_run_work(), especially
because there is no code comment explaining why the task_run_work()
call is needed. That level of filecache implementation detail
doesn't belong in nfsd().
  
kernel test robot Dec. 5, 2023, 6:36 a.m. UTC | #4
Hi NeilBrown,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on linus/master v6.7-rc4]
[cannot apply to brauner-vfs/vfs.all next-20231205]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/NeilBrown/Allow-a-kthread-to-declare-that-it-calls-task_work_run/20231204-094248
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20231204014042.6754-3-neilb%40suse.de
patch subject: [PATCH 2/2] nfsd: Don't leave work of closing files to a work queue.
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20231205/202312051443.02Ddb5pd-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231205/202312051443.02Ddb5pd-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312051443.02Ddb5pd-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/nfsd/filecache.c:431: warning: Function parameter or member 'nn' not described in 'nfsd_file_dispose_some'


vim +431 fs/nfsd/filecache.c

   425	
   426	/**
   427	 * nfsd_file_dispose_some
   428	 *
   429	 */
   430	void nfsd_file_dispose_some(struct nfsd_net *nn)
 > 431	{
   432		struct nfsd_fcache_disposal *l = nn->fcache_disposal;
   433		LIST_HEAD(dispose);
   434		int i;
   435	
   436		if (list_empty(&l->freeme))
   437			return;
   438		spin_lock(&l->lock);
   439		for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
   440			list_move(l->freeme.next, &dispose);
   441		spin_unlock(&l->lock);
   442		if (!list_empty(&l->freeme))
   443			svc_wake_up(nn->nfsd_serv);
   444		nfsd_file_dispose_list(&dispose);
   445	}
   446
  

Patch

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index ee9c923192e0..55268b7362d4 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -39,6 +39,7 @@ 
 #include <linux/fsnotify.h>
 #include <linux/seq_file.h>
 #include <linux/rhashtable.h>
+#include <linux/task_work.h>
 
 #include "vfs.h"
 #include "nfsd.h"
@@ -61,13 +62,10 @@  static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
 static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
 
 struct nfsd_fcache_disposal {
-	struct work_struct work;
 	spinlock_t lock;
 	struct list_head freeme;
 };
 
-static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
-
 static struct kmem_cache		*nfsd_file_slab;
 static struct kmem_cache		*nfsd_file_mark_slab;
 static struct list_lru			nfsd_file_lru;
@@ -421,10 +419,31 @@  nfsd_file_dispose_list_delayed(struct list_head *dispose)
 		spin_lock(&l->lock);
 		list_move_tail(&nf->nf_lru, &l->freeme);
 		spin_unlock(&l->lock);
-		queue_work(nfsd_filecache_wq, &l->work);
+		svc_wake_up(nn->nfsd_serv);
 	}
 }
 
+/**
+ * nfsd_file_dispose_some
+ *
+ */
+void nfsd_file_dispose_some(struct nfsd_net *nn)
+{
+	struct nfsd_fcache_disposal *l = nn->fcache_disposal;
+	LIST_HEAD(dispose);
+	int i;
+
+	if (list_empty(&l->freeme))
+		return;
+	spin_lock(&l->lock);
+	for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
+		list_move(l->freeme.next, &dispose);
+	spin_unlock(&l->lock);
+	if (!list_empty(&l->freeme))
+		svc_wake_up(nn->nfsd_serv);
+	nfsd_file_dispose_list(&dispose);
+}
+
 /**
  * nfsd_file_lru_cb - Examine an entry on the LRU list
  * @item: LRU entry to examine
@@ -635,28 +654,8 @@  nfsd_file_close_inode_sync(struct inode *inode)
 		list_del_init(&nf->nf_lru);
 		nfsd_file_free(nf);
 	}
-	flush_delayed_fput();
-}
-
-/**
- * nfsd_file_delayed_close - close unused nfsd_files
- * @work: dummy
- *
- * Scrape the freeme list for this nfsd_net, and then dispose of them
- * all.
- */
-static void
-nfsd_file_delayed_close(struct work_struct *work)
-{
-	LIST_HEAD(head);
-	struct nfsd_fcache_disposal *l = container_of(work,
-			struct nfsd_fcache_disposal, work);
-
-	spin_lock(&l->lock);
-	list_splice_init(&l->freeme, &head);
-	spin_unlock(&l->lock);
-
-	nfsd_file_dispose_list(&head);
+	/* Flush any delayed fput */
+	task_work_run();
 }
 
 static int
@@ -721,10 +720,6 @@  nfsd_file_cache_init(void)
 		return ret;
 
 	ret = -ENOMEM;
-	nfsd_filecache_wq = alloc_workqueue("nfsd_filecache", 0, 0);
-	if (!nfsd_filecache_wq)
-		goto out;
-
 	nfsd_file_slab = kmem_cache_create("nfsd_file",
 				sizeof(struct nfsd_file), 0, 0, NULL);
 	if (!nfsd_file_slab) {
@@ -739,7 +734,6 @@  nfsd_file_cache_init(void)
 		goto out_err;
 	}
 
-
 	ret = list_lru_init(&nfsd_file_lru);
 	if (ret) {
 		pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret);
@@ -782,8 +776,6 @@  nfsd_file_cache_init(void)
 	nfsd_file_slab = NULL;
 	kmem_cache_destroy(nfsd_file_mark_slab);
 	nfsd_file_mark_slab = NULL;
-	destroy_workqueue(nfsd_filecache_wq);
-	nfsd_filecache_wq = NULL;
 	rhltable_destroy(&nfsd_file_rhltable);
 	goto out;
 }
@@ -829,7 +821,6 @@  nfsd_alloc_fcache_disposal(void)
 	l = kmalloc(sizeof(*l), GFP_KERNEL);
 	if (!l)
 		return NULL;
-	INIT_WORK(&l->work, nfsd_file_delayed_close);
 	spin_lock_init(&l->lock);
 	INIT_LIST_HEAD(&l->freeme);
 	return l;
@@ -838,7 +829,6 @@  nfsd_alloc_fcache_disposal(void)
 static void
 nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
 {
-	cancel_work_sync(&l->work);
 	nfsd_file_dispose_list(&l->freeme);
 	kfree(l);
 }
@@ -907,8 +897,6 @@  nfsd_file_cache_shutdown(void)
 	fsnotify_wait_marks_destroyed();
 	kmem_cache_destroy(nfsd_file_mark_slab);
 	nfsd_file_mark_slab = NULL;
-	destroy_workqueue(nfsd_filecache_wq);
-	nfsd_filecache_wq = NULL;
 	rhltable_destroy(&nfsd_file_rhltable);
 
 	for_each_possible_cpu(i) {
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index e54165a3224f..bc8c3363bbdf 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -56,6 +56,7 @@  void nfsd_file_cache_shutdown_net(struct net *net);
 void nfsd_file_put(struct nfsd_file *nf);
 struct nfsd_file *nfsd_file_get(struct nfsd_file *nf);
 void nfsd_file_close_inode_sync(struct inode *inode);
+void nfsd_file_dispose_some(struct nfsd_net *nn);
 bool nfsd_file_is_cached(struct inode *inode);
 __be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		  unsigned int may_flags, struct nfsd_file **nfp);
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index c7af1095f6b5..02ea16636b54 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -13,6 +13,7 @@ 
 #include <linux/fs_struct.h>
 #include <linux/swap.h>
 #include <linux/siphash.h>
+#include <linux/task_work.h>
 
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/svcsock.h>
@@ -949,6 +950,7 @@  nfsd(void *vrqstp)
 	}
 
 	current->fs->umask = 0;
+	current->flags |= PF_RUNS_TASK_WORK;
 
 	atomic_inc(&nfsdstats.th_cnt);
 
@@ -963,6 +965,10 @@  nfsd(void *vrqstp)
 
 		svc_recv(rqstp);
 		validate_process_creds();
+
+		nfsd_file_dispose_some(nn);
+		if (task_work_pending(current))
+			task_work_run();
 	}
 
 	atomic_dec(&nfsdstats.th_cnt);